
Beyond OCR: Key Considerations in Data Extraction

By Greg Council posted 05-07-2015 11:12

  

Blog Post Authored By Greg Council, VP of Marketing and Product Management at Parascript

 

When it comes to extracting data from documents, there are three things to consider: the file type, the data types, and the type of document. Each one affects the flow-through of document data extraction.

 

When most people think of “data extraction” or forms processing, they think “OCR” or “recognition”. The terms appear synonymous, but nothing could be further from the truth.

File Type: When to use OCR or Not

Let’s start with the file type. If the file requiring data extraction is a PDF, the text may already be present, in which case there’s no need for OCR. This is certainly the case for documents that are the product of word processing or email software. Some document capture software actually rasterizes these files so that they can be processed using the same techniques used for scanned documents. Doing so introduces recognition errors and throughput problems that the original file never had. Clearly, if you don’t need OCR to process certain file types, don’t use it.
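To make the routing decision concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function name `needs_ocr`, the extension lists, and the 20-character threshold are hypothetical, and the embedded text is assumed to have been pulled from the file by whatever PDF or document library you already use.

```python
from pathlib import Path

# Hypothetical groupings for illustration; a real system would also inspect file contents.
BORN_DIGITAL = {".docx", ".eml", ".txt", ".html"}
IMAGE_ONLY = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}

# Assumed threshold: fewer characters than this means "no usable text layer".
MIN_TEXT_CHARS = 20


def needs_ocr(filename: str, embedded_text: str = "") -> bool:
    """Return True only when the file has no usable text of its own."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_ONLY:
        return True   # scanned images carry no text at all
    if suffix in BORN_DIGITAL:
        return False  # text is already there; rasterizing would only add errors
    # PDFs (and unknown types) may or may not carry a text layer, so check it.
    return len(embedded_text.strip()) < MIN_TEXT_CHARS


print(needs_ocr("scan.tif"))
print(needs_ocr("invoice.pdf", "Invoice No. 1234 Total 99.00 due on receipt"))
```

The point of the design is simply that OCR is the fallback, not the default: a file is only sent to recognition when nothing better is available.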

Data Types: Advancing Past OCR

For data types, OCR is applicable to machine-printed information only. If other types of data are required, other technologies will be needed. If you have looked at the raw output of OCR, you know that any data that is not machine printed comes out as unusable characters. For instance, if logos need to be matched, or if there is handwriting involved, data extraction technologies designed to deal with that information are required.

Document Type: Structured vs. Unstructured

The third and probably the most important factor is the type of document. Is it a structured document such as a claims form? Or is it a variably structured document such as an invoice, remittance, or even a retail receipt? For claims, it is fairly straightforward to extract data using a combination of OCR and static coordinates that define where the OCR is to be applied. Using these “templates”, the OCR simply processes specific regions of the document and outputs the results, which can be highly accurate. For variably structured documents, it is an entirely different story.
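A template of static coordinates can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any product's implementation: the `Word` and `Zone` types, the field names, and the coordinates in `CLAIM_TEMPLATE` are all hypothetical, and the words are assumed to arrive from an OCR engine with a position attached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word and the position of its top-left corner."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Zone:
    """A fixed rectangle on the page where a field is expected."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


# Hypothetical template for a claims form: field name -> page region.
CLAIM_TEMPLATE = {
    "member_id": Zone(100, 50, 300, 80),
    "claim_date": Zone(400, 50, 550, 80),
}


def extract_fields(words: list[Word], template: dict[str, Zone]) -> dict[str, str]:
    """Collect the words that fall inside each zone, in reading order."""
    result = {}
    for name, zone in template.items():
        inside = sorted((w for w in words if zone.contains(w)), key=lambda w: (w.y, w.x))
        result[name] = " ".join(w.text for w in inside)
    return result


page = [Word("A12345", 120, 60), Word("05/07/2015", 410, 65), Word("footer", 10, 500)]
print(extract_fields(page, CLAIM_TEMPLATE))
```

The sketch also shows why this approach is brittle: if a field moves outside its rectangle, the template silently returns an empty string.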

 

Invoices, remittances, and receipts all contain similar data, but the location of each data element can vary widely. Applying OCR to the entire document will not provide a solution; it yields only the entire text of the document. Getting the exact data so that businesses can use it takes more. Certainly, an organization can entertain creating “templates” for each variant in order to extract only the needed data, but the process gets unwieldy, from both an implementation and a performance perspective, once the number of variants exceeds 10 or 20. The reason is that when a specific document variant changes (and they always do), the business needs to adapt the template, and performance suffers until it does.

Advanced Data Extraction

The industry is full of terms like “keyword search”, “pattern matching”, and “database lookups”. These techniques are a good basis for semi-structured document data extraction, but they provide only the foundation. They also assume that the data is always in text or numeric form. Additional techniques should pre-parse the document to quickly locate “regions of interest” and so narrow down the search for data. Otherwise, organizations must bear the burden of full-text OCR, which carries a significant throughput penalty. Extraction can also make use of structural analysis of the document to identify different forms of data (for instance, field-based vs. table-based data) or use the presence of other non-text data as “clues” to where the data to extract is located.
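As a rough illustration of the “keyword search” and “pattern matching” foundation, the Python sketch below anchors on a keyword and then matches the value that follows it, wherever it appears in the text. The field names and regular expressions are assumptions chosen for this example; real invoices would need many more label variants plus validation.

```python
import re

# Assumed amount format for this sketch: optional "$", thousands separators, two decimals.
AMOUNT = r"\$?\s*([0-9]{1,3}(?:,[0-9]{3})*\.[0-9]{2})"

# Hypothetical field definitions: a keyword anchor followed by a value pattern.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    # \b keeps "Subtotal" from being mistaken for "Total".
    "total": re.compile(
        r"\b(?:total\s+due|amount\s+due|total)\b\s*[:\-]?\s*" + AMOUNT, re.IGNORECASE
    ),
}


def extract_by_keywords(text: str) -> dict[str, str]:
    """Return the first value found after each keyword, regardless of position."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found


layout_a = "ACME Corp\nInvoice No: INV-2041\nSubtotal: $1,150.00\nTotal Due: $1,234.50\n"
layout_b = "Amount due 87.00\nINVOICE # 77812\n"
print(extract_by_keywords(layout_a))
print(extract_by_keywords(layout_b))
```

The same two patterns handle both layouts because they follow the label rather than a fixed position. That is also the limit of the approach: it works only on text, which is why it is a foundation rather than a complete solution.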

 

Data extraction technologies go far beyond OCR at this point. The techniques described above can, and should, be applied not only to scanned documents but also to documents born digitally, such as emails and Word documents.

 

Lastly, consider the unstructured document: things like correspondence or other business documents that are not form-based. OCR can only supply the full text of the document. This text might be useful for a full-text search engine. However, as tens of thousands of organizations have already realized, full-text search is insufficient and fails to provide a basis for knowledge management and information governance. Using advanced technologies on top of OCR is the only way to properly identify the document type and then accurately locate, extract, and validate the data.
