Intelligent document processing (IDP) is a technique for processing paper-based documents or document images, extracting information from them and turning it into structured data. IDP uses machine learning algorithms to classify data and natural language processing (NLP) to interpret the context of the data. Our IDP uses human-in-the-loop (HITL) to enhance the extraction models, which continuously learn and improve based on human corrections.

We built our IDP engine specifically for financial and annual reports, KYC documents and other documents within our area of expertise.

Differences IDP vs OCR

Optical character recognition (OCR) should not be mistaken for IDP. OCR only recognizes characters and converts them into text. It has no understanding of what the text means or how it should be extracted.

| | IDP | OCR |
| --- | --- | --- |
| Used for | Extract information from paper-based documents or document images | Convert scanned documents to searchable documents |
| Technology | Classification, entity recognition, extraction, HITL, NLP | Pure character recognition |
| Use case | Extraction of data in financial reports, annual reports, KYC | Archiving, search |

Market Growth

The IDP market is expected to grow from USD 1.7-1.9 billion in 2023 to USD 17.2-19.3 billion in 2032 (CAGR 28.9%-30.5%). That is more than a tenfold increase in under ten years.

References: Market.us and Fortune Business Insights
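As a quick sanity check on these figures, the growth rate implied by the start and end values can be recomputed. The sketch below assumes that the low ends of the two ranges belong together, that the high ends belong together, and that the period is the nine years from 2023 to 2032. The sources may use different base years, so the result is only expected to land near the quoted range, not to reproduce it exactly.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.29 means 29%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Market size in billion USD; pairing low with low and high with high is an assumption.
YEARS = 2032 - 2023
low = cagr(1.7, 17.2, YEARS)
high = cagr(1.9, 19.3, YEARS)

print(f"Implied CAGR: {low:.1%} to {high:.1%}")
```

Both values come out at roughly 29%, which sits inside the quoted CAGR range.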

Our IDP Flow

The flow of the IDP engine is shown in the image below, which highlights an annual report being processed.

Computer Vision

To improve recognition, we use computer vision together with other algorithms to identify the location of relevant areas in the documents.

Typical algorithms and filters used:

Extraction and OCR

One of the techniques we use for extraction from images and documents is agglomerative hierarchical clustering (AHC). The algorithm measures dissimilarities between data points, merges the most similar ones step by step, and outputs a hierarchical representation.
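To make the idea concrete, here is a minimal sketch of agglomerative clustering in plain Python. It is not the engine's implementation: the choice of single linkage, the `max_distance` stopping rule and the sample y-coordinates are all assumptions made for illustration. The example groups word boxes into text lines by their vertical position.

```python
from itertools import combinations


def single_linkage(a, b, points):
    """Dissimilarity of two clusters: the smallest distance between their members."""
    return min(abs(points[i] - points[j]) for i in a for j in b)


def agglomerate(points, max_distance):
    """Repeatedly merge the two closest clusters.

    Stops when the closest pair is further apart than max_distance.
    Returns the remaining clusters (tuples of point indices) and the merge
    history, which records the hierarchy from the bottom up.
    """
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        distance = single_linkage(a, b, points)
        if distance > max_distance:
            break
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        history.append((a, b, distance))
    return clusters, history


# Hypothetical y-coordinates (in pixels) of word boxes on a page.
word_tops = [100, 102, 101, 140, 143, 300]
clusters, history = agglomerate(word_tops, max_distance=10)
print(sorted(clusters))
```

Each entry in `history` is one merge step, so reading it from first to last gives the hierarchy that a dendrogram would show.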

Our OCR engine is based on Tesseract, which uses a type of recurrent neural network called long short-term memory (LSTM). We continuously train our models on the kind of data we typically process.
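For readers who want to try Tesseract's LSTM engine themselves, the sketch below builds a command line for the standard `tesseract` CLI, where `--oem 1` selects the LSTM engine and `-l` selects the language model. The helper names, file names and language are placeholders, and this says nothing about how our own engine is configured or trained.

```python
import subprocess


def tesseract_command(image_path, output_base, language="eng"):
    """Build a Tesseract command that uses the LSTM engine only (--oem 1)."""
    return ["tesseract", image_path, output_base, "--oem", "1", "-l", language]


def run_ocr(image_path, output_base, language="eng"):
    """Run Tesseract and return the recognized text.

    Requires the tesseract binary and the language data to be installed.
    Tesseract writes its result to <output_base>.txt.
    """
    subprocess.run(tesseract_command(image_path, output_base, language), check=True)
    with open(output_base + ".txt", encoding="utf-8") as result:
        return result.read()


# Placeholder file names; "swe" is Tesseract's code for Swedish.
command = tesseract_command("page.png", "page", language="swe")
print(" ".join(command))
```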

Example

Step 1 - raw page

The following image shows a raw page that has been highlighted by the IDP engine. The areas marked in blue, such as punch holes, stamps and page artifacts, will be removed by the preprocessing system. Pages are automatically rotated, deskewed and cleaned, and text blocks and characters are enhanced.
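As a simplified illustration of the cleaning step, the sketch below removes isolated specks from a binarized page. The filter itself (clearing ink pixels that have no ink among their eight neighbours) and the tiny sample page are assumptions chosen for illustration, and real preprocessing involves considerably more than this.

```python
def remove_specks(page):
    """Return a copy of a binary page (1 = ink, 0 = background) in which
    ink pixels with no ink among their 8 neighbours are cleared."""
    height, width = len(page), len(page[0])
    cleaned = [row[:] for row in page]
    for y in range(height):
        for x in range(width):
            if not page[y][x]:
                continue
            has_neighbour = any(
                page[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned


# Hypothetical page: a 2x2 block of ink plus one stray pixel in the top-right corner.
page = [
    [0, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = remove_specks(page)
for row in cleaned:
    print(row)
```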

Step 2 - post-processed page

The following image shows the same page after it has been post-processed by the set of instructions used to determine the locations of meaningful data.

Step 3 - determine data locations

The following image shows the IDP engine locating the areas of data that are subject to extraction.
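One common way to locate candidate areas on a cleaned, binarized page is connected-component labeling followed by bounding boxes. The sketch below is an illustration under that assumption only. The 4-connectivity rule, the `find_regions` helper and the sample page are hypothetical and not taken from the engine.

```python
from collections import deque


def find_regions(page):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink regions
    in a binary page (1 = ink, 0 = background), in reading order of discovery."""
    height, width = len(page), len(page[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not page[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from the first unseen ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and page[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical page with two separate ink regions.
page = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
boxes = find_regions(page)
print(boxes)
```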

Step 4 - extract to entities and run NLP

The final step is to extract the data by interpreting its context with natural language processing (NLP).

The data is then exposed in the API by a simple request:

cURL
curl "https://api.tic.io/datasets/companies/2692844/financial-reports?annualReportPeriod=202112&iso639_2_LanguageCode=sv&key=your_api_key"
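The same request could be made from Python using only the standard library. The sketch below reuses the endpoint and query parameters from the cURL example. The function names are hypothetical, and the assumption that the endpoint returns a JSON body follows from the response shown in this article rather than from any API reference.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.tic.io/datasets/companies"


def financial_reports_url(company_id, period, language, api_key):
    """Build the financial-reports URL with properly encoded query parameters."""
    query = urlencode({
        "annualReportPeriod": period,
        "iso639_2_LanguageCode": language,
        "key": api_key,
    })
    return f"{BASE_URL}/{company_id}/financial-reports?{query}"


def fetch_financial_reports(company_id, period, language, api_key):
    """Send the request and decode the JSON response (needs network access and a valid key)."""
    with urlopen(financial_reports_url(company_id, period, language, api_key)) as response:
        return json.load(response)


url = financial_reports_url(2692844, 202112, "sv", "your_api_key")
print(url)
```

Building the query with `urlencode` avoids the quoting problems that `&` causes in a shell.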

The result looks like this (shortened because the full response is large):

Response
{
    "resultSheets": [
        {
            "order": 1,
            "localizedFieldName": null,
            "enFieldName": "Period",
            "displayFormat": null,
            "type": "Int",
            "value": 202112,
            "notes": null,
            "highlight": null
        },
        {
            "order": 2,
            "localizedFieldName": null,
            "enFieldName": "Duration",
            "displayFormat": null,
            "type": "Int",
            "value": "12",
            "notes": null,
            "highlight": null
        },
        {
            "order": 10,
            "localizedFieldName": null,
            "enFieldName": "Net sales",
            "displayFormat": "N0",
            "type": "Int",
            "value": 26004,
            "notes": null,
            "highlight": null
        },...
        "footnotes": [
        {
            "order": 1,
            "localizedFieldName": null,
            "enFieldName": "Period",
            "displayFormat": null,
            "type": "Int",
            "value": 202112,
            "notes": null,
            "highlight": null
        },
        {
            "order": 2,
            "localizedFieldName": null,
            "enFieldName": "Duration",
            "displayFormat": null,
            "type": "Int",
            "value": "12",
            "notes": null,
            "highlight": null
        },
        {
            "order": 10,
            "localizedFieldName": null,
            "enFieldName": "Floating charges",
            "displayFormat": "N0",
            "type": "Int",
            "value": 500,
            "notes": null,
            "highlight": null
        },...
         "keyMetrics": [
        {
            "order": 1,
            "localizedFieldName": null,
            "enFieldName": "Period",
            "displayFormat": null,
            "type": "Int",
            "value": 202112,
            "notes": null,
            "highlight": null
        },
        {
            "order": 2,
            "localizedFieldName": null,
            "enFieldName": "Duration",
            "displayFormat": null,
            "type": "Int",
            "value": "12",
            "notes": null,
            "highlight": null
        },
        {
            "order": 10,
            "localizedFieldName": null,
            "enFieldName": "Gross margin",
            "displayFormat": "{0:#0.##\\%}",
            "type": "Decimal",
            "value": 21.98,
            "notes": null,
            "highlight": null
        }