Industry
Financial Services
Project Type
Product
Size
Startup
Increasing document parsing accuracy by 2x through LLM augmentation
2x
Increase in parsing accuracy
80%
Reduction in time spent on parsing issues
10+
Languages supported
How we partnered with a fintech startup to cut document parsing errors by 50%, leading to fewer manual interventions and a long-term, scalable solution.
Background
Our client, a FinTech startup, had been in operation for three years and was looking to raise another round of funding within the next 12 months. Core to their product was the parsing of invoices; however, their existing in-house solution had been built with little machine learning or AI tooling, relying instead on template- and regular-expression-based data extraction from an early build.
Whilst the parsing engine had served them well at the start, too many assumptions about invoice structure had been baked in. Manual interventions and workarounds grew to the point where a full offshore team was spun up to handle them, a setup which simply couldn't scale and which blocked expansion into other regions, as the engine had been designed solely for the UK market.
In addition, these manual interventions didn't follow a linear process backed by an in-house tool suite. They relied on back-and-forth email contact with customers, which raised data privacy concerns, invited repeat failures, and was prone to human error.
With no in-house expertise in AI or ML development, they turned to us to see if we could help them unlock the future of their product.
Deep diving the problem
First, we assessed the previous six months of parsing engine throughput, identifying which documents were parsed with 100% accuracy and bucketing those with reported failures into accuracy tranches at ten-percentile intervals.
We found that documents in the lowest tranche had the highest variance among themselves, as well as the largest delta in data formatting and structure when compared with those parsed at high accuracy.
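As an illustration of that first analysis step, here is a minimal sketch of the decile bucketing in pandas. The field names and the use of pandas are our assumptions for illustration, not the client's actual tooling:

```python
# Sketch of the tranche analysis, assuming each document from the
# previous six months carries a reported accuracy score in [0, 1].
import pandas as pd

# Illustrative rows; in practice, one row per parsed document
docs = pd.DataFrame({
    "doc_id": [1, 2, 3, 4],
    "accuracy": [1.0, 0.42, 0.87, 0.13],
})

# Bucket into ten-percentile tranches: [0-10%), [10-20%), ... [90-100%]
tranches = pd.cut(
    docs["accuracy"],
    bins=[i / 10 for i in range(11)],
    labels=[f"{i * 10}-{(i + 1) * 10}%" for i in range(10)],
    include_lowest=True,
)

# Document count per tranche; the lowest tranche drove the deep dive
print(docs.groupby(tranches, observed=False).size())
```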
Some of the common patterns we saw across this tranche were:
- Embedded images in documents: A sizeable number of the invoices uploaded contained embedded images rather than text. The existing engine did not support these at all, resulting in full failure to parse.
- Non-standard currency formatting: Many invoices formatted values idiosyncratically. For example, we found 20 different representations of the value “£1,000.00”, varying in the inclusion of decimal places and comma separators, the positioning of the currency symbol, and so on (see the normalisation sketch after this list).
- Non-standard invoice layouts: Whilst most invoices in the high-accuracy tranches followed a common table layout, many in this tranche did not, making it harder to ascertain where information sat.
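To make the currency problem concrete, here is a minimal normalisation sketch. The regex and the handful of variants covered are illustrative, not the production rule set:

```python
# Illustrative normaliser for GBP value variants; the exact rules the
# rebuilt pipeline applies are broader than this sketch.
import re
from decimal import Decimal

def normalise_gbp(raw: str) -> Decimal:
    """Map strings like '£1,000.00', '1000 GBP', 'GBP 1.000,00' to Decimal."""
    cleaned = re.sub(r"(?i)[£\s]|GBP", "", raw)
    # A trailing comma-and-two-digits suggests a continental decimal mark
    if re.search(r",\d{2}$", cleaned):
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)

for variant in ["£1,000.00", "1000 GBP", "GBP 1.000,00", "£ 1000"]:
    assert normalise_gbp(variant) == Decimal("1000"), variant
```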
Building for accuracy and scale
With these learnings, we set about building a prototype that would address the issues above, whilst scaling in both throughput and speed of resolution where human intervention was needed, and supporting the client's cross-country expansion goals.
For this, we started with a data set of 5,000 documents, weighted towards those from the lower-accuracy tranches. We rebuilt the parsing pipeline from scratch, using the 5,000 documents as a baseline against which to compare accuracy with the previous setup.
The pipeline would first extract the contents of the PDF. Where large images that could contain text were detected, they were run through a well-tested cloud OCR (optical character recognition) service to establish a pure-text baseline.
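The case study does not name the cloud OCR provider, so the sketch below uses a local pytesseract call as a stand-in; the libraries, the 50-character heuristic, and the overall shape are our assumptions for illustration:

```python
# Sketch of the text-extraction step with an OCR fallback for pages
# that are mostly image. pytesseract stands in for the cloud service.
import pdfplumber
import pytesseract
from pdf2image import convert_from_path

def extract_text(pdf_path: str) -> str:
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            # Heuristic: little extractable text plus embedded images
            # suggests a scanned invoice, so fall back to OCR.
            if len(text.strip()) < 50 and page.images:
                image = convert_from_path(
                    pdf_path, first_page=i + 1, last_page=i + 1
                )[0]
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return "\n".join(pages)
```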
If the text was detected as non-English, we would next run it through DeepL to translate it into English (keeping both the original and translated copies), as most LLMs are far more proficient in English given the weighting of their training corpora.
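A minimal sketch of that step, assuming the langdetect and deepl Python packages; the key handling and return shape are illustrative:

```python
# Detect language and translate to English via DeepL, keeping both
# copies as the pipeline does for later re-matching.
import deepl
from langdetect import detect

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # placeholder key

def to_english(text: str) -> dict:
    if detect(text) == "en":
        return {"original": text, "english": text, "translated": False}
    result = translator.translate_text(text, target_lang="EN-GB")
    return {"original": text, "english": result.text, "translated": True}
```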
We would then pass the text, wrapped in a prompt, to a fine-tuned Mistral model to extract the various elements making up the invoice as machine-readable JSON. Each element was validated against a wide set of type and boundary checks to ensure the data had been parsed correctly. For non-English invoices, the English extraction was then matched against the corresponding portion of the translated text, so the end user's experience remained entirely in their own language whilst allowing oversight in English, with multilingual staff pulled in only where needed.
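As a minimal sketch of that validation layer, here is what the type and boundary checks might look like using pydantic (our choice for illustration); the field set and bounds are examples, not the production schema:

```python
# Validate the JSON returned by the fine-tuned model against typed,
# bounded fields; any ValidationError is surfaced to the portal below.
from datetime import date
from decimal import Decimal
from pydantic import BaseModel, Field, ValidationError

class LineItem(BaseModel):
    description: str = Field(min_length=1)
    quantity: int = Field(gt=0)
    unit_price: Decimal = Field(ge=0)

class Invoice(BaseModel):
    invoice_number: str = Field(min_length=1)
    issue_date: date
    currency: str = Field(pattern=r"^[A-Z]{3}$")  # ISO 4217 code
    total: Decimal = Field(ge=0)
    line_items: list[LineItem]

def validate_extraction(llm_json: str) -> Invoice | None:
    try:
        return Invoice.model_validate_json(llm_json)
    except ValidationError:
        return None  # reported with full context, per the next section
```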
Any failures at each stage of the pipeline were reported, with full context, to a custom-built management portal integrated with the company's internal chat tool (Slack), allowing most issues to be raised and resolved before ever surfacing to the customer.
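A sketch of how such a failure might be pushed into Slack, assuming the slack_sdk package; the channel name and message shape are illustrative:

```python
# Post a pipeline failure, with identifying context, to a Slack channel.
from slack_sdk import WebClient

client = WebClient(token="xoxb-...")  # bot token placeholder

def report_failure(doc_id: str, stage: str, detail: str) -> None:
    client.chat_postMessage(
        channel="#parsing-failures",
        text=f"Parsing failure for document {doc_id} at stage '{stage}': {detail}",
    )
```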
Aligning with governance concerns
Given the regulated nature of the financial services space, it was imperative that every component of the parsing engine was auditable and deployable on cloud infrastructure over which the client had full governance.
This is specifically why Mistral was selected: as an easy-to-fine-tune, open-source LLM, it let us ensure that all data injected into the context window remained ring-fenced. Equally, a full audit trail of each action taken by the pipeline was available to staff, accessible only through the company's single sign-on system.
Going into production
As large language models are non-deterministic by nature, it was important to use a wide array of observability tools to ensure that specific inputs would give the expected outputs. We therefore built a programmatically generated synthetic data set, based on the original 5,000 documents, to continually monitor the performance of the model and prompts.
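As a minimal sketch of the kind of regression check that can run against such a synthetic set; the pass-rate threshold and the parse() callable are illustrative, not the client's monitoring stack:

```python
# Run the pipeline over synthetic cases and compare against expected
# JSON, giving a pass rate to track across model and prompt changes.
import json

def regression_pass_rate(cases: list[dict], parse) -> float:
    """cases: [{'document': str, 'expected': dict}, ...]"""
    passed = 0
    for case in cases:
        output = parse(case["document"])  # pipeline under test
        if json.loads(output) == case["expected"]:
            passed += 1
    return passed / len(cases)

# Alert if a change pushes the pass rate below a floor, e.g.:
# assert regression_pass_rate(synthetic_cases, parse) >= 0.95
```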
The parsing engine was then deployed as a standalone application with a REST API interface, allowing for an easy migration from the previous engine to the rebuilt one.
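A hypothetical call against such a REST interface might look like the following; the URL, route, and response shape are placeholders, not the client's real API:

```python
# Upload an invoice PDF to the standalone engine and read back the
# structured extraction (endpoint and payload shape are illustrative).
import requests

with open("invoice.pdf", "rb") as f:
    response = requests.post(
        "https://parsing-engine.internal/v1/parse",
        files={"document": f},
        timeout=30,
    )
response.raise_for_status()
print(response.json())  # structured invoice JSON, per the schema above
```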
After three months in production, the team measured a 2x increase in parsing accuracy, which, together with the new suite of management and audit tools, translated to an 80% reduction in time spent resolving parsing issues and, finally, the ability to scale out to other regions.