Automating data extraction for a fintech startup
Our client is a software company that specializes in artificial intelligence applied to compliance management, data processing, and automation of complex business processes. Counting major international financial institutions in France, the UK, and the USA among its customers, our client develops disruptive tools that help businesses achieve their strategic objectives. The company has been experiencing triple-digit growth for the last three years.
As part of our long-term collaboration, we helped our client solve a specific problem in the field of data extraction. As a result, we developed a solution with a field detection efficiency of over 85%.
Automating the process of extracting data
Our client had a massive database of investment fund prospectuses that were unstructured and stored in PDF format. As a result, processing and analyzing the data enclosed in these documents was impossible. The company didn't want to miss out on the value this information offered to its customers.
That's why our client wanted to build a solution that would extract information from these prospectuses. In particular, the company wanted to capture data such as the name of the investment fund, its duration, and its start date. The idea was that the tool would return all this data in a structured JSON format to enable further processing by our client's platform.
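To give a sense of the target, the structured output might look like the following sketch. The field names and values here are illustrative assumptions, not the client's actual schema:

```python
import json

def to_structured_json(fields: dict) -> str:
    """Serialize extracted prospectus fields into a JSON string.

    The keys below are hypothetical examples of the kind of fields
    the tool was meant to extract (name, duration, start date).
    """
    return json.dumps(fields, indent=2, sort_keys=True)

extracted = {
    "fund_name": "Example Global Equity Fund",  # illustrative value
    "start_date": "2017-03-01",
    "duration_years": 5,
}

print(to_structured_json(extracted))
```

A downstream platform can then load this output with a standard JSON parser instead of re-reading the original PDF.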
The extracted data was also to be kept in raw form so that the company could reuse it elsewhere if required. Based on this data, the client could search for key information to use later in its operations.
Outsourcing digital transformation
We had already been supporting our client in developing key product features when the company decided to build a new tool for data extraction.
This project was one of several we implemented for the client as part of our long-term cooperation. We worked with the company from November 2017 to November 2018, completing a total of four projects focused on searching for and parsing tables in prospectuses, text classification, object classification, and image type classification (logos, signatures, bullets, etc.).
We quickly set up a one-person team and initiated the project right after the completion of another one. The client's team consisted of three people supervised by one project manager.
Here’s how our team built the solution
To deliver a data extraction tool for processing investment fund prospectuses, we used the Python programming language in combination with TensorFlow, an open-source library that supports data science projects (especially machine learning). The project also used GPU computing to accelerate the extraction of data from the client's PDF documentation.
Our team created a library that extended the capabilities of the company’s system using the following technologies:
1. BPE (Byte Pair Encoding)
We used BPE to build a token dictionary for our client's solution. Our team created a corpus by combining the PDF documents provided by the client. This encoding technique allowed us to make the solution robust against potential errors in the client's documentation, ensuring a smooth data extraction flow.
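To illustrate the general idea, here is a minimal textbook-style sketch of byte pair encoding (not the client's actual implementation): the algorithm repeatedly merges the most frequent adjacent symbol pair into a new token, building up a subword vocabulary.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged_words = {}
    joined = pair[0] + pair[1]
    for symbols, freq in words.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(joined)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged_words[tuple(new_symbols)] = freq
    return merged_words

# Toy corpus: word -> frequency, each word split into characters.
words = {tuple("fund"): 5, tuple("funds"): 3, tuple("found"): 2}
for _ in range(3):  # learn three merges
    words = merge_pair(words, most_frequent_pair(words))
```

After three merges on this toy corpus, frequent character sequences such as "fund" collapse into single tokens, which is what makes BPE tolerant of rare or slightly malformed words.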
2. Deep learning models
Our team used the following deep learning models:
- Bi-LSTM (Bidirectional Long Short-Term Memory) model
- CRF (Conditional Random Field)
- CNN (Convolutional Neural Network)
All these models were trained by our team to predict the probability that a particular word contained the information users were looking for. Our team defined eight classes covering data such as fund name, start date, and duration. Moreover, the CRF model allowed our team to improve the quality of results by taking contextual knowledge into account.
Results of our cooperation
Following 12 weeks of work, we provided our client with a solution achieving a field detection efficiency of over 85% (F-score). A state-of-the-art, off-the-shelf model used as the solution's baseline achieved only 75%. Our team also substantially improved the tool's throughput, enabling it to process 15,000 pages per hour on a single GPU.
We developed a solution that balanced a massive workload against available computing power, providing our client with a tool built on efficient, innovative technologies.
Are you looking for a team of experts that specialize in building tools for data extraction? Get in touch with us; we’re experienced in developing solutions that help businesses unlock the potential of their data.