21 May 2022

How to compare data extraction tools?

Data extraction, the process of pulling specific information from documents, applies to a wide range of use cases. But which software for document extraction fits your needs best? In this comparison, we provide an insight into how Alphamoon solves the pains for enterprises and SMBs in the field of information extraction and showcase a comparison of our platform with products from Google, Microsoft, ABBYY, and Kofax.

Every company, organization, or individual generates documents.

From letters sent in bottles that drifted across the seas to invoices provided to sellers or contracts signed between employers and employees – documents are as old as civilization.

It wouldn’t be an exaggeration to say that individuals and companies around the world generate thousands of documents every second.

And the more documents are created in a company, the more complex processes are established to classify them. Handling paperwork manually can be quite a nightmare; hence, more companies turn to solutions that help automate some of the tasks related to document workflows.

And with solutions like Alphamoon, this will no longer be relatable:

In this article, we will focus on the following:

  • how companies evaluate data extraction software and how to compare different tools
  • what makes fair measurement of such tools challenging
  • how Alphamoon improves document management with data extraction

If you’re already sure that data capture and extraction technology is for you, click below and talk to our team.

And if not, read on.

Best data extraction tools – comparison assumptions and details

An algorithm that requires constant work and adjustments is at the core of every AI technology. Each update or alteration of the existing code improves a specific metric. The same happens in business conduct.

All new products, services, or bundles, as well as additional headcount or innovative tools, contribute to better results.

However, only some of these changes are easy to measure. How can you compare document automation tools without jeopardizing the legitimacy of the results?

Let’s start from the IDP supplier’s perspective.

Metrics used in data extraction

Data extraction companies evaluate the performance of different extraction tools by testing their accuracy on document samples.

The document sample has to be manually labeled (annotated) first. It means a person must provide a complete set of correct answers/fields for each document, also called ground truth.

The accuracy is then calculated by comparing the ground truth with each software’s analysis results. AI-based platforms that implement Intelligent Document Processing use this ground truth to learn and improve their performance – contrary to the so-called legacy OCR software, which typically relies on pre-defined templates.

Consequently, the main criterion to look at while comparing different data extraction tools is accuracy.

Accuracy is the percentage of data fields extracted correctly across all test examples. If the software reads the data accurately and the person verifying the extraction (you or your employee) recognizes the field type and marks it as correctly extracted, the extraction is considered successful.

Here’s an example of how invoice processing looks in practice.

[Image: example of data extraction from an invoice]

The second metric often used when comparing data extraction software is Straight-Through Processing.

Straight-Through Processing (STP) is a metric used to determine the level of automation. It is the percentage of all documents in the task where all fields were extracted correctly.

While data extraction accuracy tells you the percentage of correctly extracted pieces of information in the set, Straight-Through Processing takes the macro-perspective. It measures the proportion of documents in the dataset that were processed entirely correctly.
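To make the difference between the two metrics concrete, here is a minimal sketch in Python. It assumes that each document is represented as a dictionary of field names and values and that a field counts as correct only on an exact string match with the ground truth. The function names, field names, and sample values are hypothetical and serve only as an illustration; real evaluations may normalize values or use different matching rules.

```python
from typing import Dict, List

# Assumption: a document is a mapping of field name -> extracted value.
Document = Dict[str, str]


def field_accuracy(ground_truth: List[Document], predictions: List[Document]) -> float:
    """Share of ground-truth fields whose extracted value matches exactly."""
    total = 0
    correct = 0
    for truth, pred in zip(ground_truth, predictions):
        for field, expected in truth.items():
            total += 1
            if pred.get(field) == expected:
                correct += 1
    return correct / total if total else 0.0


def straight_through_rate(ground_truth: List[Document], predictions: List[Document]) -> float:
    """Share of documents in which every ground-truth field was extracted correctly."""
    if not ground_truth:
        return 0.0
    perfect = sum(
        all(pred.get(field) == expected for field, expected in truth.items())
        for truth, pred in zip(ground_truth, predictions)
    )
    return perfect / len(ground_truth)


# Hypothetical sample: two invoices, three labeled fields each.
truth = [
    {"invoice_number": "INV-001", "total": "100.00", "date": "2022-05-01"},
    {"invoice_number": "INV-002", "total": "250.00", "date": "2022-05-02"},
]
extracted = [
    {"invoice_number": "INV-001", "total": "100.00", "date": "2022-05-01"},
    {"invoice_number": "INV-002", "total": "260.00", "date": "2022-05-02"},  # wrong total
]

accuracy = field_accuracy(truth, extracted)    # 5 of 6 fields correct
stp = straight_through_rate(truth, extracted)  # 1 of 2 documents fully correct
print(f"accuracy={accuracy:.3f}, stp={stp:.3f}")
```

In this sample a single wrong field lowers accuracy only slightly (5 of 6 fields), while STP drops to one half, because one of the two documents would still need a manual correction.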

We could consider other technical criteria in such a comparison.

These may include:

  • the number and types of data fields that are available for extraction
  • time of processing
  • scalability to huge document volumes and robustness to demand peaks
  • deployment options and integration capabilities
  • extraction capabilities beyond pure text (e.g., extracting tables, handwriting, graphical objects)
  • additional validation and post-processing capabilities
  • intuitive UI

Challenges in comparing data extraction tools

Fair and transparent comparison of information extraction engines is problematic for several reasons.

The list of extracted fields usually differs from provider to provider. Therefore, the accuracy comparison must be limited to the subset of fields that all compared tools support, which shows only a partial picture of each provider’s performance.
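As a rough illustration of this restriction, the sketch below computes the set of fields that every compared tool supports. The tool names and field lists are made up for the example and do not describe any real product.

```python
from typing import Dict, Set


def common_fields(fields_by_provider: Dict[str, Set[str]]) -> Set[str]:
    """Return the fields supported by every provider in the comparison."""
    if not fields_by_provider:
        return set()
    return set.intersection(*fields_by_provider.values())


# Hypothetical field lists for three anonymous tools.
supported = {
    "tool_a": {"invoice_number", "total", "date", "vendor_name"},
    "tool_b": {"invoice_number", "total", "date"},
    "tool_c": {"invoice_number", "total", "iban"},
}

shared = common_fields(supported)
print(sorted(shared))
```

Only the shared fields would enter the accuracy calculation, so a tool that extracts many additional fields gets no credit for them in such a comparison.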

Furthermore, most vendors do not provide public access to their platforms, so data is hard to obtain. Since they’re not cloud-based, open solutions, testing each tool’s capability isn’t simple.

The biggest challenge, however, is the lack of publicly available and well-crafted datasets for extracting information from standard document types such as invoices or receipts. Since invoices and receipts contain sensitive data concerning transactions, only a few datasets can be used to test accuracy.

This prompts some providers to use their own datasets to improve the capabilities of their tools, which leads to highly biased results in any accuracy test run on that same data. To keep our own comparison fair, we have used two publicly available datasets – one for receipts and one for invoices.

Alphamoon’s platform: Invoices & receipts automation

The Alphamoon platform employs Intelligent Document Processing (IDP) technology, which we consider the best method for document automation. IDP combines NLP, AI, and ML to automate document processing.

To better depict the capabilities of our platform, we have compared Alphamoon with leading cloud providers of document automation:

  • ABBYY FlexiCapture for Invoices Cloud
  • Google Document AI Invoice Parser
  • Microsoft Azure Form Recognizer invoice model
  • Kofax AP Essentials for Invoice Automation

We used publicly available datasets for this experiment – the RVL-CDIP dataset for invoices and the SROIE dataset for receipts. Both datasets contain challenging scans of documents – blurry, low-quality, and with varying templates.

Note: We have described both comparisons in separate articles. Read our OCR for receipts and OCR for invoices deep-dives.

Here are the results:

  • Invoices: in both metrics – accuracy and straight-through processing – Alphamoon achieves higher results than the competitors. On a challenging set of outdated invoices, Alphamoon reached 82.5% accuracy. The second-best score was 75.4%, achieved by Microsoft.

Below you can find an in-depth look at all the fields and how well each tool performed in terms of information extraction.

Comparison of specific data fields extraction capability between Alphamoon, Microsoft, Google, ABBYY and Kofax. Source: own materials.

The above shows that Alphamoon extracts the broadest range of data fields. Thanks to the IDP technology, our tool learns with each set of invoices you need to process.

In other words, the more documents the AI OCR from Alphamoon processes, the better the fit.

  • Receipts: Alphamoon topped the competition once again. Alphamoon has achieved 89.5% accuracy, followed by Microsoft’s 87.8%.

The high scores achieved by Alphamoon have direct business benefits that you can also test in your business case.

The business meaning of information extraction metrics

Accuracy is usually proportional to the time savings that the user gets. By calculating the average time spent on a single invoice – manual data extraction and further processing – you can easily estimate the time saved when the same invoice might – depending on the learning process of the platform – take less than a minute to process. All it requires is the verification step.
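The estimate described above is simple arithmetic, sketched below. All numbers are hypothetical placeholders rather than measured results: 1,000 invoices per month, 5 minutes of manual work per invoice, and 1 minute for the verification step (used here as an upper bound for "less than a minute").

```python
def estimated_hours_saved(
    invoices_per_month: int,
    manual_minutes: float,
    automated_minutes: float,
) -> float:
    """Estimate monthly hours saved when verification replaces manual extraction."""
    saved_minutes = invoices_per_month * (manual_minutes - automated_minutes)
    return saved_minutes / 60


# Hypothetical inputs; replace them with your own measurements.
hours = estimated_hours_saved(
    invoices_per_month=1000,
    manual_minutes=5.0,
    automated_minutes=1.0,
)
print(f"Estimated time saved: {hours:.1f} hours per month")
```

To get a realistic figure, measure the average manual time per invoice on your own documents and the average verification time after the platform has processed a representative sample.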

More time translates to freedom in tasks and focus areas. Teams that use document automation tend to become more productive, improve their skills in more challenging projects, and create more business value due to fewer tedious tasks that pile up.

If that sounds like the team you’d like to be a part of, get in touch with our sales.

Complementary reading

– What is Intelligent Document Processing, and how can your company benefit from it
– How to implement automation (and in which areas) in accounting depts
– What are the business benefits of automated invoice processing

If you’re looking for top-quality automated invoice processing, get in touch with us now.
