A Document by any other type

By Chris Riley, ECMp, IOAp posted 09-21-2010 13:05


Would be recognized incorrectly.  How much effort have you put into your document classification processes?  Whether manual or automated, fast and accurate document classification is the first step to successful recognition.

What is your Document's Class?

Before Data Capture technologies can do the magic of field location and extraction using OCR or ICR, they must first determine the type of each page, and sometimes of an entire document.  Types might be obvious to the world, or specific to a single organization.  Types can be determined by layout (lines, barcodes, graphics) or by context (words, codes, dictionaries).  All Data Capture solutions have this built in as part of the template matching or document identification process.  When vendors first added classification to these packages, its purpose was to feed the Data Capture process; it was not necessarily meant to stand on its own.  Interestingly enough, however, many organizations have bought Data Capture applications just for the purpose of classification.  They have done so with a success rate that seems to dwarf that of the overall data capture process.  Let’s look at why.

It appears this trend is only going to become more popular as companies see that they can first tackle one major problem: the human labor associated with putting documents into groups.  Once this is a win, they can take on the more laborious step of Data Capture, now with a better chance of success.
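To make classification "by context" a little more concrete, here is a minimal sketch in Python.  It assumes the page text has already come out of OCR, and the keyword dictionaries, the type names, and the `min_hits` threshold are all hypothetical; a real Data Capture package would use far richer rules than this.

```python
import re

# Hypothetical keyword dictionaries, one per page type. A real deployment
# would build these from its own documents.
KEYWORDS = {
    "purchase_order": {"purchase", "order", "ship", "buyer"},
    "vendor_invoice": {"invoice", "remit", "due", "total"},
}


def classify_page(text, keywords=KEYWORDS, min_hits=2):
    """Return the type whose dictionary matches the most distinct words.

    Pages with fewer than `min_hits` matches are reported as "unknown"
    so that a person can review them. Ties go to the first type listed.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {doc_type: len(words & vocab) for doc_type, vocab in keywords.items()}
    best_type, best_score = max(scores.items(), key=lambda item: item[1])
    return best_type if best_score >= min_hits else "unknown"
```

The "unknown" bucket is the important design choice here: a page the rules cannot place goes to a person instead of being forced into the nearest type.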

It seems so simple…

Classification can be a dream to set up or a true nightmare.  It all depends on the documents and the organization's perception of them.  Here I'm using the term document to mean a single record, which could be a single page or multiple pages, with each page somehow relating to all the others.  If you are a little confused, you should be.  This is the biggest stumbling block to classification: understanding your documents.  Documents are sometimes very clear.  Let’s take accounts payable processing for example.  A document could be a purchase order that connects to a received invoice; together they are the entire document.  Within this document are the types purchase order and vendor invoice.  That was not so bad.  Now what happens if you scan in duplex and the back of the invoice has payment instructions or disclaimers?  What do you do with this page?  Still probably not too complicated, as you may just decide to omit the page from the document if it does not have pertinent payable data.  The point is to illustrate how quickly the definition of a document gets complicated for an organization.

The desired approach would be a study of what your objective types (your page-level understanding) are.  This could go as deep as disclaimers, waivers, and descriptor pages.  Once this is done, determine the rules that combine the pages together.  In most environments the rules are flexible.  For example, an invoice from a vendor can be 1 to 10 pages: the first page will have a header, the last page will have a total, and everything in between is a detail page.  When you do this you gain the ability to use all the cool tools automated document classification has to offer.  Your only problem with this approach is the possibility of never-ending objective page-level types.
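The invoice rule above can be sketched in a few lines of Python.  This is only an illustration under assumptions: the `Page` class, its `has_header` and `has_total` flags, and the `group_invoices` function are hypothetical names, and it assumes page-level classification has already decided which pages carry a header or a total.

```python
from dataclasses import dataclass

MAX_PAGES = 10  # upper bound taken from the invoice example in the text


@dataclass
class Page:
    number: int
    has_header: bool = False  # page-level classification found an invoice header
    has_total: bool = False   # page-level classification found an invoice total


def group_invoices(pages, max_pages=MAX_PAGES):
    """Split a stream of pages into invoices.

    Returns (documents, exceptions). Each entry is a list of pages.
    Anything that breaks the rule goes to exceptions for manual review.
    """
    documents, exceptions, current = [], [], []
    for page in pages:
        if page.has_header:
            # A new header while an invoice is still open means its total never arrived.
            if current:
                exceptions.append(current)
            current = [page]
        elif current:
            current.append(page)  # detail page or total page of the open invoice
        else:
            exceptions.append([page])  # page with no header before it
            continue
        if len(current) > max_pages:
            exceptions.append(current)  # longer than the rule allows
            current = []
        elif page.has_total:
            documents.append(current)  # header through total: a complete invoice
            current = []
    if current:
        exceptions.append(current)  # stream ended before a total page
    return documents, exceptions
```

A single-page invoice is just a page with both flags set.  Pages that do not fit the rule land in the exceptions list, which is where the human labor goes once the easy cases are automated.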

Why is class important?

What is so cool about classification is that there is even tighter control over the quality of the automatic result, because it's much easier to judge what is right or wrong.  Once an organization has a clear understanding of its documents, and then of their complexity relative to automated classification, this lets it determine the ROI more accurately.  Also, because classification is just one component of the whole Data Capture process, it allows the organization to deploy exception handling faster and to perform initial setup faster with less expertise.  Document classification, whether realized or not, is a mandatory step in any data capture process and cannot be avoided, so why not excel at it?

As I mentioned before, the trend of tackling Data Capture's pieces vs. the whole is becoming more and more popular as market education on this type of technology increases.  Companies are seeking a path to success in document automation and taking it step-by-step vs. facing the sometimes overwhelming whole.  When an organization decides to do this and truly understands its documents, it is taking the accuracy of an automated system into its own hands and really giving the technology the best chance to deliver.

#classification #ScanningandCapture #Document