Automating document classification with 90%+ efficiency for a finance company
Our client is a leading company in the Polish debt management sector. To help our client reduce costs and increase operational efficiency, we created a solution that automated the processing and classification of incoming documents. Our tool can identify a document’s type and indicate which case it relates to.
Classifying a massive set of documents
Since our client is a debt collection company, its employees deal with hundreds, if not thousands, of incoming documents every day. Specifically, these are various types of court and bailiff documents.
Manually sorting and categorizing these materials took the company’s employees a lot of time and effort. That’s why our client wanted to automate the process and reduce the cost of this operation.
The idea was simple. The tool would automatically tag the incoming materials by type and assign them to different departments for further processing. With our system in place, the company spends far less time on pre-processing correspondence, and employees can focus on the cases instead. Thanks to our implementation, the staff can also easily find similar cases.
Teaming up with AI experts
Our team, consisting of one backend developer, started working on the project within a month of the requirements being set. The developer was supervised by a skilled project manager who ensured the successful delivery of the solution.
The software engineer was able to develop a working tool within 8 weeks.
Here’s how our team built the solution
To build the solution our client needed, we used the Python programming language and PyTorch, an open-source machine learning library developed by Facebook’s AI unit and used for applications such as computer vision and natural language processing.
Our software engineer created a REST-based microservice. The company wanted to integrate the tool with its system without causing too much interference. Because the tool runs as a separate service, our team could use a different programming language than the client’s existing system, and because classification is stateless, the tool didn’t have to access the client’s database systems. Since the team wasn’t sure about the volume of documents the tool would have to process, they proposed a solution that would be scalable and could potentially run on a cloud service.
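To illustrate what a stateless classification endpoint can look like, here is a minimal sketch that uses only the Python standard library. The `/classify` path, the JSON field names, and the `classify_text` placeholder are assumptions made for this example, not details of the delivered service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple


def classify_text(text: str) -> str:
    """Hypothetical stand-in for the real model; returns a made-up label."""
    return "court_order" if "court" in text.lower() else "other"


def handle_classify(body: bytes) -> Tuple[int, Dict[str, str]]:
    """Stateless handler: everything it needs arrives in the request body."""
    try:
        text = json.loads(body)["text"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'text' field"}
    if not isinstance(text, str):
        return 400, {"error": "'text' must be a string"}
    return 200, {"document_type": classify_text(text)}


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/classify":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, result = handle_classify(self.rfile.read(length))
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Start the service; blocks until interrupted."""
    HTTPServer((host, port), ClassifyHandler).serve_forever()
```

Because `handle_classify` keeps no state between calls and never touches a database, several copies of such a service could run side by side behind a load balancer, which is what makes this style easy to scale.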
The company received the documents in the form of scans (PDF and TXT formats). The first step in building the tool was converting these scans into text that our solution could use to tag documents. We used Tesseract OCR for this conversion.
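As a rough sketch of this step, the snippet below calls the Tesseract command-line tool from Python. It assumes that each page has already been rasterized to an image file, that the `tesseract` binary is installed, and that the Polish language pack (`pol`) is the right choice for these documents. The function names are hypothetical.

```python
import subprocess
from typing import List


def tesseract_command(image_path: str, lang: str = "pol") -> List[str]:
    # Passing "stdout" as the output base makes Tesseract print the text
    # instead of writing it to a file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_page(image_path: str, lang: str = "pol") -> str:
    """Run Tesseract on one page image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


def ocr_document(page_images: List[str], lang: str = "pol") -> str:
    """Concatenate the text of all pages, in order."""
    return "\n".join(ocr_page(path, lang) for path in page_images)
```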
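Next, the documents were tokenized. Our tool then calculated word embeddings using FastText. Since the documents were written in natural language and then scanned, the text included spelling errors as well as OCR inaccuracies. FastText deals well with both of these problems because it builds word vectors from character n-grams, so a distorted word still shares many n-grams with its correct form.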
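The snippet below is a simplified illustration of why subword n-grams help with noisy text. It only extracts the character n-grams (3 to 6 characters, the defaults of the fastText library) and measures how many of them two spellings share. Real FastText additionally hashes the n-grams and sums their learned vectors, which is left out here. The example words are our own.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Character n-grams of a word wrapped in boundary markers < and >."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def ngram_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the n-gram sets of two words."""
    grams_a, grams_b = set(char_ngrams(a)), set(char_ngrams(b))
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


# "komornik" (bailiff) misread by OCR as "kornornik" ("m" became "rn"):
# the two spellings still share n-grams such as "<ko", "orn" and "nik".
shared = set(char_ngrams("komornik")) & set(char_ngrams("kornornik"))
```

A purely word-level vocabulary would treat the misread form as an unknown token, whereas the shared n-grams keep its vector close to the correct word.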
We also calculated the latent representation for the entire document using a hierarchical Transformer model. To understand the sense and context of these documents, our team needed a model capable of learning not only the meaning of individual words but also their meaning in a broader context. We proposed an architecture able to understand words in the context of the sentence structure and the meaning of sentences in the context of the document structure. Finally, we used a softmax layer to classify documents into one of the fourteen document types defined by our client.
The model we delivered is able to process long, multi-page documents in a single session. To deliver such a solution, our team had to face a significant challenge: fitting entire documents, together with the gradients computed for them, in memory while training the model.
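The general idea of a two-level encoder can be sketched in PyTorch as follows. This is not the delivered architecture: the layer sizes, the mean pooling, and the single encoder layer per level are assumptions, and positional encodings and padding masks are omitted for brevity. Only the fourteen output classes come from the project itself.

```python
import torch
from torch import nn


class HierarchicalClassifier(nn.Module):
    """Words -> sentence vectors -> document vector -> class logits."""

    def __init__(self, embed_dim: int = 64, num_heads: int = 4,
                 num_classes: int = 14) -> None:
        super().__init__()

        def encoder() -> nn.TransformerEncoder:
            layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=num_heads,
                dim_feedforward=128, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=1)

        self.word_encoder = encoder()      # words in the context of a sentence
        self.sentence_encoder = encoder()  # sentences in the context of the document
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, doc: torch.Tensor) -> torch.Tensor:
        # doc: (num_sentences, num_words, embed_dim) word embeddings
        words = self.word_encoder(doc)          # each sentence is one batch item
        sentence_vecs = words.mean(dim=1)       # (num_sentences, embed_dim)
        sentences = self.sentence_encoder(sentence_vecs.unsqueeze(0))
        doc_vec = sentences.mean(dim=1)         # (1, embed_dim)
        return self.classifier(doc_vec)         # (1, num_classes) logits


# Random tensors stand in for FastText embeddings of a 5-sentence document.
model = HierarchicalClassifier().eval()
with torch.no_grad():
    probabilities = torch.softmax(model(torch.randn(5, 12, 64)), dim=-1)
```

The softmax turns the fourteen logits into probabilities that sum to one, and the predicted document type is the class with the highest probability.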
Results of our cooperation
We developed a tool that successfully automated the process with over 90% classification efficiency (F-score). As a result, our client no longer has to allocate substantial resources to tag and distribute documents.
The company was able to significantly reduce the costs involved in handling incoming documents. In 93% of cases, processing time was reduced from 3-5 minutes to 10 seconds. In the remaining 7% of cases, the documents were returned for manual analysis.
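For readers unfamiliar with the metric, the sketch below computes a per-class F-score (F1) and its unweighted average across classes. Whether the reported figure was macro-averaged, micro-averaged, or weighted is not stated here, so the macro average is an assumption, and the labels in the example are made up.

```python
from typing import Dict, List


def f1_per_class(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each label."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Toy example with two hypothetical document types.
truth = ["summons", "summons", "writ", "writ"]
predicted = ["summons", "writ", "writ", "writ"]
per_class = f1_per_class(truth, predicted)
```

In this toy example, "summons" has one true positive and one false negative (F1 of 2/3), while "writ" has two true positives and one false positive (F1 of 4/5).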
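One common way to decide which documents go back to a human is a confidence threshold on the classifier’s output. The sketch below shows that idea. It is an assumption about how such routing can work rather than a description of the delivered tool, and the threshold value and label names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits: List[float], labels: List[str],
          threshold: float = 0.9) -> Tuple[str, str]:
    """Return (queue, suggested label) for one document.

    Confident predictions are tagged automatically; everything else is
    sent to manual review together with the model's best guess.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    queue = "auto" if probs[best] >= threshold else "manual_review"
    return queue, labels[best]


labels = ["summons", "writ", "other"]
confident = route([5.0, 0.0, 0.0], labels)
uncertain = route([1.0, 0.9, 0.0], labels)
```

Raising the threshold sends more documents to manual review but lowers the risk of a wrong automatic tag, so the value would need to be tuned against the cost of each kind of mistake.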