HOW TO INDEX DOCUMENTS

By Mark Mandel posted 05-13-2010 11:32

Building on my last post about how to pick a scanner, this post covers indexing methods.

Indexing is required so that your users can find the documents and records that are stored in the ECM system. There is a tradeoff between too few and too many index fields: with too few, you may not be able to locate documents easily; with too many, capturing the index data becomes difficult and expensive. Typically a balance is struck so that somewhere between 5 and 10 index fields are captured.

Ideally, in order to capture a paper or electronic document into your ECM system, you won’t have to index at all. This ideal is achieved when the business process that captures the document indexes it automatically. For example, a user fills out an application online and submits it electronically. The system knows what form it is and processes it automatically. Another example is at a doctor’s office when you are asked for an insurance card. The receptionist scans your card and the information is read automatically.

As we design our business processes using our ECM products and associated tools, this type of automatic recognition and indexing should be the end goal. However, in many cases this ideal cannot be achieved, either because you are receiving paper documents that are not under your control, or perhaps because that new system is not fully implemented yet.

When your organization is receiving paper documents that have to be scanned and indexed, there are a number of options to choose from. The advantages and disadvantages of each option depend on your specific environment. Let’s explore how to determine which approach works best for you. The labor required to index documents is often one of the most expensive ongoing operational costs for an ECM system, so optimizing this process is very important.

Manual Index From Image

The traditional method of indexing incoming paper documents is to use an “Index from Image” paradigm. In this paradigm documents are scanned, often using patch pages or barcodes to delineate the start of a new document, and then indexed from a “heads up” imaging workstation.

This process may be part of a fairly high volume capture workflow, where batches of documents are routed to indexing stations. Data entry operators enter index data while viewing the image. Often the image viewer is optimized to zoom into the appropriate section of the image to help the operator see the information.

Sometimes this approach is used in low volume situations where the operator also designates, through a mouse click or keyboard entry, the start of a new document, and then enters the appropriate information.

Each index field is configured to constrain and validate the information being entered, such as required field or not, alpha, numeric or date, data mask such as phone number or SSN, drop down list, database lookup, and so on. These constraints help the data entry operator enter valid data and reduce errors.
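To make these constraints concrete, here is a minimal sketch of per-field validation. The field names, the date format and the phone mask are assumptions made up for illustration; a real capture product would define them in its own configuration screens rather than in code like this.

```python
import re
from datetime import datetime

# Hypothetical index field definitions (illustrative names and rules only).
FIELD_RULES = {
    "claim_number": {"required": True, "type": "numeric"},
    "last_name": {"required": True, "type": "alpha"},
    "received_date": {"required": True, "type": "date"},
    "phone": {"required": False, "mask": r"\d{3}-\d{3}-\d{4}"},
    "doc_type": {"required": True, "choices": {"Application", "Invoice", "Claim"}},
}


def validate_field(name, value):
    """Return a list of error messages for one index field (empty if valid)."""
    rule = FIELD_RULES[name]
    value = value.strip()

    # Required / optional check comes first; an empty optional field is fine.
    if not value:
        return [f"{name} is required"] if rule.get("required") else []

    errors = []
    kind = rule.get("type")
    if kind == "numeric" and not value.isdigit():
        errors.append(f"{name} must be numeric")
    elif kind == "alpha" and not value.isalpha():
        errors.append(f"{name} must be alphabetic")
    elif kind == "date":
        try:
            datetime.strptime(value, "%m/%d/%Y")  # assumed entry format
        except ValueError:
            errors.append(f"{name} must be a valid MM/DD/YYYY date")

    # Data mask, such as a phone number pattern.
    mask = rule.get("mask")
    if mask and not re.fullmatch(mask, value):
        errors.append(f"{name} does not match the expected format")

    # Drop-down list: only the listed values are accepted.
    choices = rule.get("choices")
    if choices and value not in choices:
        errors.append(f"{name} must be one of the listed values")

    return errors
```

An indexing screen could call `validate_field` as the operator leaves each field and refuse to release the document until every field returns an empty list.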

Drag and Drop OCR

The Index from Image paradigm can be supplemented by using “drag and drop” OCR. This technique allows the data entry operator to “rubber band” the portion of the image needed and the system does an on the fly OCR and places the data into the Windows clipboard, where it can then be pasted into the appropriate index field. This helps make the indexing quicker and reduces the number of keystrokes required. This approach is also useful when the person doing the indexing is not a full time data entry person.

Index From Paper

Indexing from paper instead of the scanned image is often done when the capture paradigm reflects a “back end” scanning approach. This approach is useful as an interim step during an ECM implementation before the full optimized business process is developed. In this case the scanning is performed after the paper is processed. As the paper is processed (for example, information from the paper document is entered into a primary business application) a barcode cover sheet or label is printed.

This cover sheet or label is placed on the document and when it is scanned, the index information is extracted automatically. The barcode can include all the data (such as Employee ID or SSN, or hundreds of characters in a 2D barcode), or just a unique ID so that the data can be extracted from the business application in a database lookup.
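The "unique ID plus database lookup" variant can be sketched as follows. This assumes the barcode has already been decoded to a string by the scan software, and it uses an in-memory SQLite table as a stand-in for the business application; the table name, columns and sample values are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create a stand-in for the business application's cover sheet table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cover_sheets ("
        "barcode_id TEXT PRIMARY KEY, employee_id TEXT, doc_type TEXT)"
    )
    # This row would be written when the cover sheet is printed.
    conn.execute(
        "INSERT INTO cover_sheets VALUES (?, ?, ?)",
        ("CS-000123", "E4471", "Benefits Enrollment"),
    )
    conn.commit()
    return conn


def index_from_barcode(conn, barcode_id):
    """Look up index values for a decoded barcode ID; None if unknown."""
    row = conn.execute(
        "SELECT employee_id, doc_type FROM cover_sheets WHERE barcode_id = ?",
        (barcode_id,),
    ).fetchone()
    if row is None:
        return None  # route the document to manual indexing instead
    return {"employee_id": row[0], "doc_type": row[1]}
```

Keeping only an ID in the barcode means a simple 1D barcode is enough, at the cost of depending on the lookup being available at scan time.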

Zone OCR

Zone OCR is used to extract data from fixed fields on paper forms such as applications. This is used when you receive a lot of the same types of forms. Software is used to design a forms template for the form so that it can find the zone you plan to extract data from. In its simplest form Zone OCR extracts machine print data from one or more zones on the document, validates it using simple rules such as format, length, data mask, etc., and then populates a data entry form. This data can then be routed to error correction stations before upload to the destination system.
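As a rough illustration of a forms template, the sketch below defines rectangular zones in pixel coordinates and collects the recognized words that fall inside each one. It assumes some OCR engine has already returned words with their center coordinates; commercial Zone OCR products typically recognize only the cropped zone instead, so treat this as a simplification. The zone names and coordinates are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """One rectangular zone on a form template, in pixel coordinates."""
    name: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def extract_zones(words, zones):
    """Map each zone name to the text found inside it.

    words: iterable of (text, x_center, y_center) tuples from an OCR engine.
    """
    result = {}
    for zone in zones:
        hits = [(x, text) for text, x, y in words if zone.contains(x, y)]
        # Sort left to right so multi-word values read in order.
        result[zone.name] = " ".join(text for _, text in sorted(hits))
    return result
```

The extracted values would then go through the format, length and data mask checks before populating the data entry form.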

Forms Processing

Forms processing takes Zone OCR to another level. It is used in high volume forms capture environments to extract data from fixed field forms. The data may include machine print, handprint, or mark sense (like the circles on the SAT test for choosing the correct answer). This technology uses forms recognition to determine which form is being processed. It uses anchor points to match the stored form template to the image, often deskewing and stretching the image to match the anchor points and align the template properly.

Forms processing, in its most advanced state, uses different OCR (machine print) and ICR (handprint) engines to extract data from each field, often mixing them based on the type of field. Thus machine print, handprint and mark sense can be extracted from the same form. Advanced validation rules, including lookups to existing databases, calculations of dollar amounts, and line item detail vs. total checks, are applied to each field to ensure that the data is extracted correctly. This approach is used on health claim forms, tax forms, census forms and more. There are a number of specialized products on the market for this type of processing, including Kofax, Captiva, AnyDoc Software, ReadSoft, and many more.
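One of the validation rules mentioned above, the line item detail vs. total check, can be sketched in a few lines. The function name and the input shape (quantity and unit price as strings, as they might come out of OCR) are assumptions for illustration, not any vendor's API.

```python
from decimal import Decimal


def line_items_match_total(line_items, stated_total, tolerance=Decimal("0.00")):
    """Check that the sum of quantity * unit price equals the stated total.

    line_items: list of (quantity, unit_price) pairs given as strings.
    stated_total: the total field extracted from the form, as a string.
    """
    computed = sum(
        (Decimal(qty) * Decimal(price) for qty, price in line_items),
        Decimal("0"),
    )
    # Decimal avoids the rounding surprises of binary floats with money.
    return abs(computed - Decimal(stated_total)) <= tolerance
```

A failed check does not say which field was misread, only that at least one of them was, so the whole group of fields would typically be flagged for error correction.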

A key component of this approach is Error Correction. After automated forms recognition and data extraction, forms are routed to manual error correction stations. The efficiency of the error correction processing determines how effective the overall capture process is. Automated forms processing using OCR, ICR and mark sense can be very accurate, especially when combined with advanced validation rules. However it is not perfect, especially when image quality is not the best. Therefore the error correction process is needed to obtain the accuracy required.

Error correction software flags suspect characters from the OCR process, where the OCR engine reports low confidence in its recognition. It also flags characters or fields that fail any of the validation rules. The user interface for the data entry operator is designed to move quickly through the errors, document by document, field by field, or even character by character. The ergonomics of this interface is what differentiates products: the fastest, most ergonomic interface produces the best results. Image “snippets” are used to isolate fields or characters that are suspect.
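The flagging step can be sketched as below, assuming the OCR engine returns a confidence score between 0 and 1 for each character. The 0.80 threshold and the bracket notation are arbitrary choices for illustration; real engines report confidence in their own scales.

```python
def flag_suspects(chars, threshold=0.80):
    """Return the positions of characters recognized below the threshold.

    chars: list of (character, confidence) pairs for one field.
    """
    return [i for i, (_, conf) in enumerate(chars) if conf < threshold]


def render_for_review(chars, threshold=0.80):
    """Show the field with suspect characters wrapped in brackets."""
    return "".join(
        ch if conf >= threshold else f"[{ch}]" for ch, conf in chars
    )
```

An error correction station would jump the operator from one flagged position to the next, showing the matching image snippet beside each.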

The net effective throughput of a forms processing solution is determined by the automated processes together with the manual error correction process. Typically this will result in a 50% or more reduction in the data entry labor required to capture the data. In a large data entry project that may have 40 operators, reducing that number by 20 is a huge cost savings.

It is not realistic to expect a 90% or more savings. 50% is a more realistic target, and if you can achieve higher, that is great.
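The staffing arithmetic is simple enough to check with a small helper. The function name is hypothetical, and it assumes labor scales directly with operator headcount, which is a simplification.

```python
def operators_needed(current_operators, labor_reduction_pct):
    """Operators still needed after a given percentage labor reduction.

    Uses integer arithmetic and rounds up, since you cannot staff a
    fraction of an operator.
    """
    remaining_pct = 100 - labor_reduction_pct
    return (current_operators * remaining_pct + 99) // 100


# The example from the text: 40 operators and a 50% reduction.
remaining = operators_needed(40, 50)
saved = 40 - remaining
```

With the 40 operator example, `remaining` and `saved` both come out to 20, matching the figure above.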

Unstructured Forms Processing

Unstructured forms processing deals with forms that are not structured uniformly. The best example is invoices. Invoices contain common information such as PO number, invoice number, line item information, total information, dates, vendor information and so on. However this information is located differently depending on the vendor. There is no accepted standard for invoice format.

Unstructured forms processing performs a full text OCR on the document and locates information on the form using keywords and database lookups. For example, it looks for PO, P.O. or Purchase Order, and then looks to the right, left, up or down from that point until it finds data that fits the desired format.
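Here is a minimal sketch of the keyword approach for a PO number. It only handles the simplest case, where the value sits to the right of the label on the same line, and it assumes PO numbers are 4 to 10 digits; searching left, up or down requires word coordinates from the OCR engine and is not shown.

```python
import re

# Label variants (PO, P.O., Purchase Order), an optional "Number", "No."
# or "#", an optional separator, then the value. The 4 to 10 digit format
# is an assumption for illustration.
PO_PATTERN = re.compile(
    r"\b(?:P\.?\s?O\.?|Purchase\s+Order)"
    r"(?:\s*(?:Number|No\.?|#))?"
    r"\s*[:\-]?\s*(\d{4,10})\b",
    re.IGNORECASE,
)


def find_po_number(full_text):
    """Return the first PO number found in full text OCR output, or None."""
    match = PO_PATTERN.search(full_text)
    return match.group(1) if match else None
```

In practice the candidate would also be checked against open purchase orders in the business system, which is the database lookup part of the technique.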

This process requires a lot of setup to obtain high accuracy. It has been done very successfully for invoices and Explanation of Benefit forms. Most vendors have sample templates and rules for these forms to provide a rapid startup for their customers. The technology can be used for other types of documents that fit this paradigm. However, if they do not already have a commonly used rule set, expect that process to require a lot of development in the form of scripting and training.

Auto Classification

In high volume capture environments, one of the high cost components is the labor to prepare the documents for scanning. This often includes placing patch pages (a specialized barcoded page) in between folders and/or documents. When recognized by the scanner or scan software, these patch pages automatically delineate the start of a new folder or document, thus making it easier to index the documents downstream. This extra step adds to the cost of the document preparation process and over time can add up to significant increased cost.
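The way patch pages delineate documents in a scanned batch can be sketched as a simple split over the page stream. How a patch page is detected is left to a caller-supplied function here, since that is done by the scanner or scan software; the function names are hypothetical.

```python
def split_on_patch_pages(pages, is_patch):
    """Group a batch of scanned pages into documents.

    pages: the pages in scan order.
    is_patch: function returning True when a page is a patch page.
    Patch pages start a new document and are dropped from the output.
    """
    documents, current = [], []
    for page in pages:
        if is_patch(page):
            if current:  # ignore consecutive or leading patch pages
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents
```

Every document boundary in this scheme costs one manually inserted sheet during document preparation, which is the labor that auto classification aims to remove.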

Auto classification software is used to identify document types in a capture process. It can be used in all kinds of data streams, but here we are focusing on auto classification of scanned paper documents. The software requires setup and lots of training to identify documents. It can determine the start of a new document, the type of document, and the end of a multipage document. Once that occurs, documents are routed to indexing stations or data extraction processes that operate based on the document type.

This software is found in capture products such as Kofax KTM, Captiva, and others, as well as stand alone products that can be integrated into your solution.

Summary

As you can see there are many alternatives to choose from for capturing index fields or data from paper documents or forms. The desired end game is to do this all automatically, avoiding paper processing if possible. But for those situations where paper processing is unavoidable, choosing the paradigm that best fits your situation is critical to your success.
