How to use the AI IDP intelligent document processing API
Our AI-based intelligent document processing (IDP) API processes documents and extracts
meaningful data that you can use to create your own datasets. We also use it ourselves to process
the financial and annual reports that companies submit to registration authorities.
You can read about the technology we use and how it works here.
Folders group documents together and can optionally be linked to a company. Our API
contains public folders that are accessible to all users and private folders that belong
to one or more users.
All folders that you create are private and cannot be reached by “everyone”. You can still share
folders within your team by giving team members access to the account that owns the folder.
Our API preprocesses the documents and their images while maintaining the source resolution. Images
contained in PDF files are extracted in their original format and compression. If 95% or more of the
pixels in the source file are grayscale, the destination file is always created in grayscale.
Summary of processing events:
Extract images
Remove metadata
Rotate if needed
Deskew pages
Remove punch holes
Remove border lines
Adjust for page margins
Remove page artifacts (caused by scanners and smudges)
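
The 95% grayscale rule described above can be illustrated with a minimal sketch. The function name, the representation of pixels as (R, G, B) tuples, and the definition of a grayscale pixel as one whose three channels are equal are all assumptions made for illustration; the API's actual implementation is not documented here.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # assumed (R, G, B) representation


def should_store_as_grayscale(pixels: Iterable[Pixel], threshold: float = 0.95) -> bool:
    """Return True when at least `threshold` of the pixels are grayscale.

    A pixel is treated as grayscale when R, G and B are equal
    (an assumption; a real pipeline may allow a small tolerance).
    """
    total = 0
    gray = 0
    for r, g, b in pixels:
        total += 1
        if r == g == b:
            gray += 1
    if total == 0:
        return False  # no pixels, so nothing to decide
    return gray / total >= threshold
```

A page with 96 gray pixels and 4 colored pixels would pass the check, while a page with 90 gray pixels and 10 colored pixels would keep its color.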
You can list, delete, upload, and download documents. Each document can be associated with a company. When you upload
a document, you can choose from four different transformations.
transformToEntity
This transforms each page in the file into a set of structured
entities in JSON format. Think of it as an Excel sheet: you get
the data in columns and rows. It handles the complexity of both
borderless and bordered tables.
transformToSearchablePDF
This transforms the source file into a searchable, visually pleasing, linearized PDF
that adheres to PDF version 1.4.
transformToOCR
An hOCR result is provided that can be used to feed a search database.
It contains the location and the contents of each text box.
A full string representing the content of the page is also included.
transformToMetadata
A JSON file is provided containing the original metadata of the file.
For example, it contains the format and resolution of images in PDF files.
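
To show how an hOCR result could feed a search database, here is a minimal sketch that pulls the text and bounding box of each word out of an hOCR string using only the Python standard library. hOCR is HTML in which the `title` attribute carries a `bbox x0 y0 x1 y1` property. The sketch assumes that words are `span` elements with the class `ocrx_word`, which is common in hOCR producers but is not confirmed for this API. The names `HocrWordParser` and `extract_words` are hypothetical.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 10 10 90 40; x_wconf 96"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every span with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1))
        self._bbox = None  # bbox of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "ocrx_word" in classes:
            match = BBOX_RE.search(attr_map.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr: str):
    """Return a list of (text, bbox) tuples found in an hOCR document."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Each returned tuple can be stored as one searchable record, and joining the word texts with spaces gives a rough page string when the full string is not at hand.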
When you upload a PDF, or when a PDF is created from images, all metadata is removed when the new searchable
PDF is created. The new PDF is linearized,
which means it is optimized for viewing in mobile and desktop apps by letting the viewer download the pages incrementally.
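
If you want to sanity-check a downloaded file, the following sketch applies two simple heuristics based on the PDF file format: a PDF starts with a `%PDF-x.y` header, and a linearized PDF carries a linearization parameter dictionary with a `/Linearized` key near the start of the file (within the first 1024 bytes). This is a rough check, not a full validation, and the function names are hypothetical.

```python
import re
from typing import Optional


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version from the %PDF-x.y header, or None if it is missing."""
    match = re.match(rb"%PDF-(\d\.\d)", data)
    return match.group(1).decode("ascii") if match else None


def looks_linearized(data: bytes) -> bool:
    """Heuristic: the /Linearized key appears within the first 1024 bytes."""
    return b"/Linearized" in data[:1024]
```

Reading the first kilobyte of the file is enough for both checks, for example `data = open(path, "rb").read(1024)`, where `path` points to the downloaded PDF.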
You will receive the stored document ID and the status of the transformation. Currently you need to check back
manually for the result. A webhook feature that posts the status back is planned.
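
Until the webhook is available, checking back can be automated with a polling loop. The sketch below is only one possible approach. `fetch_status` is a hypothetical callable that you supply; it would wrap whatever request returns the transformation status for your document ID. The status values `"done"` and `"failed"` are assumptions, not values documented by the API.

```python
import time
from typing import Callable


def wait_for_result(
    fetch_status: Callable[[], str],
    interval: float = 5.0,
    max_interval: float = 60.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_status() until it reports a final status or the timeout passes.

    The wait between polls doubles each time, capped at max_interval,
    so that long-running transformations do not cause excessive requests.
    """
    deadline = clock() + timeout
    delay = interval
    while True:
        status = fetch_status()
        if status in ("done", "failed"):  # assumed final status values
            return status
        if clock() + delay > deadline:
            raise TimeoutError("transformation did not finish before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_interval)
```

The `sleep` and `clock` parameters exist so the loop can be exercised in tests without real waiting; in normal use the defaults are sufficient.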