AI Intelligent Document Processing
We extract meaningful data from unstructured documents with AI.
Intelligent document processing (IDP) is a technique used to process paper-based documents or document images, extract information from them, and turn it into structured data. IDP uses machine learning algorithms to classify data and natural language processing (NLP) to interpret its context. Our IDP uses human-in-the-loop (HITL) to enhance the extraction models, which continuously learn and improve based on human corrections.
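To make these building blocks concrete, the following minimal Python sketch shows how such a pipeline can be wired together; the class and function names are illustrative assumptions, not our actual implementation.

```python
# Minimal sketch of an IDP-style pipeline: classify -> extract -> interpret -> HITL.
# All class and function names here are illustrative, not a real implementation.
from dataclasses import dataclass, field


@dataclass
class ExtractedField:
    name: str          # e.g. "total_revenue"
    value: str         # raw value pulled from the document
    confidence: float  # model confidence, 0.0 - 1.0
    corrected: bool = False


@dataclass
class Document:
    pages: list
    doc_type: str = "unknown"
    fields: list = field(default_factory=list)


def classify(doc: Document) -> Document:
    """Machine-learning step: decide whether this is an annual report, KYC file, etc."""
    doc.doc_type = "annual_report"  # placeholder for a trained classifier
    return doc


def extract_and_interpret(doc: Document) -> Document:
    """Extraction + NLP step: pull values and interpret their context."""
    doc.fields.append(ExtractedField("total_revenue", "12 345", confidence=0.62))
    return doc


def human_in_the_loop(doc: Document, threshold: float = 0.8) -> Document:
    """HITL step: route low-confidence fields to a reviewer; corrections feed retraining."""
    for f in doc.fields:
        if f.confidence < threshold:
            f.corrected = True  # a human verifies or fixes the value here
    return doc


doc = human_in_the_loop(extract_and_interpret(classify(Document(pages=["page 1"]))))
print(doc.doc_type, doc.fields)
```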
We built our IDP engine tailored for financial reports, annual reports, KYC and other documents within our area of expertise.
Differences between IDP and OCR
Optical character recognition (OCR) should not be confused with IDP: OCR only recognizes characters and converts them into text. It has no understanding of what the text means or how it should be extracted.
| | IDP | OCR |
|---|---|---|
| Used for | Extract information from paper-based documents or document images | Convert scanned documents into searchable documents |
| Technology | Classification, entity recognition, extraction, HITL, NLP | Pure character recognition |
| Use case | Extraction of data from financial reports, annual reports and KYC documents | Archiving, search |
Market Growth
The IDP market is expected to grow from 1.7 - 1.9 billion USD in 2023 to 17.2 - 19.3 billion USD in 2032 (CAGR 28.9% - 30.5%). That is more than a tenfold increase over the coming decade.
References: Market.us and Fortune Business Insights
Our IDP Flow
The flow of the IDP engine is shown in the image below, highlighting an annual report being processed.
Computer Vision
To provide superior recognition, we use computer vision together with algorithms that identify the locations of relevant areas in the documents.
Typical algorithms and filters are used to binarize the pages and locate candidate regions, as sketched below.
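As an illustration, here is a minimal sketch of how binarization and contour detection can locate candidate areas on a page, assuming OpenCV; the filters and parameters shown are examples, not our production configuration.

```python
# Sketch: locate candidate content areas on a scanned page with OpenCV.
# The specific filters and thresholds are illustrative, not production settings.
import cv2

page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Binarize with Otsu's threshold (dark text becomes white foreground).
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilate so characters in the same block merge into one connected region.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 5))
merged = cv2.dilate(binary, kernel, iterations=1)

# Each external contour is a candidate area for further analysis.
contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours]   # (x, y, width, height)
boxes = [b for b in boxes if b[2] * b[3] > 500]   # drop tiny speckles

print(f"{len(boxes)} candidate areas found")
```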
Extraction and OCR
One of the techniques we use for extraction from images and documents is agglomerative hierarchical clustering (AHC). The algorithm groups data points based on their dissimilarities and outputs a hierarchical representation.
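As a minimal sketch, assuming SciPy, AHC can for example group word bounding boxes into text blocks based on their spatial dissimilarity; the coordinates below are made up for illustration.

```python
# Sketch: group word bounding boxes into text blocks with agglomerative
# hierarchical clustering (AHC). The coordinates below are made-up examples.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Centre points (x, y) of word boxes, e.g. as returned by an OCR pass.
centres = np.array([
    [105, 82], [160, 84], [220, 83],   # words on one line
    [110, 412], [170, 410],            # words in a table further down
    [620, 415],                        # a value in a right-hand column
])

# Build the hierarchy: pairwise Euclidean dissimilarities, average linkage.
hierarchy = linkage(centres, method="average", metric="euclidean")

# Cut the hierarchy: points closer than 150 px end up in the same block.
labels = fcluster(hierarchy, t=150, criterion="distance")
print(labels)  # e.g. three distinct block labels for the points above
```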
Our OCR engine is based on Tesseract with a kind of recurrent neural network called long short-term memory (LSTM). We train our models continuously on the kind of data we typically process.
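The sketch below shows how Tesseract's LSTM engine can be invoked through pytesseract; the language and configuration values are illustrative defaults, and custom-trained models are not shown.

```python
# Sketch: run Tesseract's LSTM engine (--oem 1) on a cleaned page image.
# Custom-trained models would be referenced via lang/tessdata settings;
# the values here are illustrative defaults.
import pytesseract
from PIL import Image

page = Image.open("page_cleaned.png")

# --oem 1  -> LSTM neural network engine only
# --psm 6  -> assume a single uniform block of text
config = "--oem 1 --psm 6"

text = pytesseract.image_to_string(page, lang="eng", config=config)
data = pytesseract.image_to_data(page, config=config,
                                 output_type=pytesseract.Output.DICT)

# image_to_data also returns word-level boxes and confidences,
# which downstream clustering and extraction steps can use.
print(text[:200])
print(list(zip(data["text"][:5], data["conf"][:5])))
```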
Example
Step 1 - raw page
The following image shows a raw page highlighted by the IDP engine. The areas marked in blue, such as punch holes, stamps and page artifacts, will be removed by the preprocessing system. Pages are automatically rotated, deskewed and cleaned, and text blocks and characters are enhanced.
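As an illustration of the cleaning part of this step, here is a minimal sketch assuming OpenCV; deskewing and the masking of larger artifacts such as punch holes and stamps are separate steps not shown here, and the parameters are examples only.

```python
# Sketch: clean up a raw page before recognition (denoising, binarization,
# removal of small specks). Filter choices and parameters are illustrative.
import cv2
import numpy as np

page = cv2.imread("raw_page.png", cv2.IMREAD_GRAYSCALE)

# Remove scanner noise, then binarize so characters stand out sharply.
denoised = cv2.medianBlur(page, 3)
binary = cv2.adaptiveThreshold(denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 15)

# Morphological opening on the inverted image removes small blobs
# (specks, dust) while keeping text strokes intact. Larger artifacts such
# as punch holes or stamps would be detected and masked separately.
inverted = cv2.bitwise_not(binary)
kernel = np.ones((2, 2), np.uint8)
cleaned = cv2.bitwise_not(cv2.morphologyEx(inverted, cv2.MORPH_OPEN, kernel))

cv2.imwrite("cleaned_page.png", cleaned)
```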
Step 2 - post processed page
The following image shows the same page after it has been post-processed by the set of instructions that determine the locations of meaningful data.
Step 3 - determine data locations
The following image shows the IDP engine locating the areas of data subject to extraction.
Step 4 - extract to entities and run NLP
The final step is to extract the data by interpreting its context through the use of natural language processing (NLP).
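As a minimal sketch, assuming spaCy with a general-purpose English model, entity recognition over the extracted text could look like this; a production pipeline would use models trained on financial and KYC documents.

```python
# Sketch: interpret extracted text with NLP entity recognition.
# spaCy with a general-purpose English model is assumed here.
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("Total revenue for the fiscal year ended 31 December 2023 "
        "amounted to EUR 12.4 million, up from EUR 10.1 million in 2022.")

doc = nlp(text)
for ent in doc.ents:
    # Each entity carries a label such as MONEY, DATE or ORG,
    # which downstream logic maps to structured report fields.
    print(f"{ent.label_:>8}  {ent.text}")
```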
The data is then exposed in the API through a simple request, and the result is returned as structured data (responses are shortened in examples because of their size).
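Purely as an illustration of the request/response pattern, the sketch below shows how a client could fetch extraction results; the endpoint, parameters and field handling are hypothetical placeholders, not the actual API.

```python
# Purely illustrative sketch of fetching extraction results over HTTP.
# The endpoint, parameters and field names are hypothetical placeholders.
import requests

BASE_URL = "https://api.example.com/idp/v1"   # hypothetical endpoint
headers = {"Authorization": "Bearer <api-token>"}

response = requests.get(f"{BASE_URL}/documents/<document-id>/entities",
                        headers=headers, timeout=30)
response.raise_for_status()

entities = response.json()
# Print only a handful of entries; real responses are considerably larger.
for entity in entities[:5]:
    print(entity)
```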