Document classification occurs at some level in all advanced capture solutions, yet it’s a process that is often ignored. In fixed-forms processing, every page has to be classified (often referred to as registration, or IDR) in order to fit a template. In semi-structured forms processing, keywords are used to identify forms before any field location takes place. Sometimes, however, image classification stands out as its own process, in areas such as mailrooms, EMR, and batch auto-filing.
Whatever the scenario, attention should be given to how scanned pages are distinguished from the rest of the batch. Without classification, the chances of getting even a single accurate OCR result are close to zero. There are two core types of classification.
Image-based classification:
In image-based classification, an engine looks at the pixels of a document to determine its type. A sort of fingerprinting process is used. Sometimes the classification is very specific, such as looking for a vertical line of a certain length on the left side of the page. More advanced image-based classification solutions, however, look at a page holistically and compare its variations against other pages rather than checking specific attributes on a single page. The benefit of image-based classification is that it’s very fast: documents can be classified in fractions of a second. The downside is that for documents that are visually the same but contextually very different, the technology does not work so well. The other cool aspect of image-based classification is that it can be used not only to classify images but also to analyze the differences between images, as would be needed for some business intelligence applications.
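To make the fingerprinting idea concrete, here is a minimal sketch in plain Python. It is not how any particular capture product works: the grid size, the "darker than average" rule, and the names `fingerprint`, `hamming`, and `classify_by_image` are all assumptions made for illustration. A page is reduced to a handful of bits, and a new scan is assigned to the reference page whose bits differ the least.

```python
from typing import Dict, List

Page = List[List[int]]  # grayscale pixels: 0 is black, 255 is white


def fingerprint(page: Page, grid: int = 4) -> List[int]:
    """Reduce a page to grid*grid bits: 1 where a cell is darker than average."""
    rows, cols = len(page), len(page[0])
    cell_means = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * rows // grid, (gy + 1) * rows // grid
            x0, x1 = gx * cols // grid, (gx + 1) * cols // grid
            cell = [page[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cell_means.append(sum(cell) / len(cell))
    overall = sum(cell_means) / len(cell_means)
    return [1 if mean < overall else 0 for mean in cell_means]


def hamming(a: List[int], b: List[int]) -> int:
    """Count the positions where two fingerprints disagree."""
    return sum(x != y for x, y in zip(a, b))


def classify_by_image(page: Page, references: Dict[str, List[int]]) -> str:
    """Return the reference type whose fingerprint is closest to the page."""
    fp = fingerprint(page)
    return min(references, key=lambda name: hamming(fp, references[name]))


def blank_page(size: int = 16) -> Page:
    return [[255] * size for _ in range(size)]


# Toy reference pages (hypothetical): form_a has a vertical rule on the left,
# form_b has a dark header bar across the top.
form_a = blank_page()
for row in form_a:
    row[0] = row[1] = 0

form_b = blank_page()
form_b[0] = [0] * 16
form_b[1] = [0] * 16

references = {"form_a": fingerprint(form_a), "form_b": fingerprint(form_b)}

# A new scan: same vertical rule as form_a, plus one speck of noise.
scan = blank_page()
for row in scan:
    row[0] = row[1] = 0
scan[10][10] = 0

print(classify_by_image(scan, references))
```

Because only a few bits per page are compared, this kind of check is cheap, which is in line with the speed advantage described above. It also shows the weakness: two forms with the same layout but different wording would produce the same fingerprint.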
Contextual classification:
Unlike image-based classification, contextual classification runs full-page OCR before classification occurs. It uses the text results, as well as their location on the page (x, y, height, width), to tell one document from another. Some people confuse this process with the actual extraction of results, which it is not. Surprisingly, in data capture applications contextual classification is run first, and then, after field location, OCR is run again. So OCR is actually performed on the document more than once. Contextual classification can distinguish even similar-looking documents. The problem is that it tends to be more expensive and substantially slower than image-based classification.
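As a rough illustration of using text plus position, the sketch below assumes the full-page OCR step has already produced a list of words with x, y, width, and height expressed as fractions of the page. The `Word` class, the rule format, and `classify_by_context` are hypothetical names, not any vendor’s API. Each document type is described by keywords expected in certain regions, and the type with the largest share of matched rules wins.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Word:
    text: str
    x: float       # left edge, as a fraction of page width
    y: float       # top edge, as a fraction of page height
    width: float
    height: float


# A rule is a lowercase keyword plus the region (x_min, y_min, x_max, y_max)
# in which the center of that word is expected to fall.
Region = Tuple[float, float, float, float]
Rule = Tuple[str, Region]


def rule_matches(rule: Rule, words: List[Word]) -> bool:
    keyword, (x0, y0, x1, y1) = rule
    for w in words:
        cx, cy = w.x + w.width / 2, w.y + w.height / 2
        if w.text.lower() == keyword and x0 <= cx <= x1 and y0 <= cy <= y1:
            return True
    return False


def classify_by_context(
    words: List[Word], rules_by_type: Dict[str, List[Rule]]
) -> Tuple[str, float]:
    """Return the best document type and the fraction of its rules that matched."""
    scores = {
        doc_type: sum(rule_matches(r, words) for r in rules) / len(rules)
        for doc_type, rules in rules_by_type.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical rule set: both types may mention "invoice", but only a real
# invoice has it as a title near the top of the page.
rules_by_type = {
    "invoice": [("invoice", (0.0, 0.0, 1.0, 0.2)), ("total", (0.5, 0.8, 1.0, 1.0))],
    "letter": [("dear", (0.0, 0.1, 0.5, 0.4)), ("sincerely", (0.0, 0.7, 1.0, 1.0))],
}

ocr_words = [
    Word("Invoice", 0.40, 0.05, 0.20, 0.03),
    Word("Total", 0.70, 0.90, 0.10, 0.02),
]
print(classify_by_context(ocr_words, rules_by_type))
```

Note that the rules only identify the document. Reading the actual value next to "Total" would be a separate extraction step, which is why OCR ends up running again after field location.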
The best classification solutions out there will likely use some combination of image-based and contextual classification. They also let you toggle which approach is favored, so organizations can strike the right balance between accuracy and performance. Some solutions require up-front training of the classification engine; others allow training to occur naturally during operation.
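One simple way to picture the toggle is a single weight that blends the two engines’ scores. The sketch below assumes each engine already returns a score between 0 and 1 per document type; the function name `blend_scores`, the `context_weight` parameter, and the sample scores are made up for illustration.

```python
from typing import Dict


def blend_scores(
    image_scores: Dict[str, float],
    context_scores: Dict[str, float],
    context_weight: float = 0.5,
) -> str:
    """Pick a document type from a weighted mix of image and contextual scores.

    context_weight = 0.0 trusts only the image engine (fastest to rely on);
    context_weight = 1.0 trusts only the contextual engine.
    """
    if not 0.0 <= context_weight <= 1.0:
        raise ValueError("context_weight must be between 0 and 1")
    doc_types = sorted(set(image_scores) | set(context_scores))
    blended = {
        t: (1 - context_weight) * image_scores.get(t, 0.0)
        + context_weight * context_scores.get(t, 0.0)
        for t in doc_types
    }
    return max(blended, key=blended.get)


# Hypothetical scores for two look-alike forms: the image engine can barely
# separate them, while the contextual engine clearly prefers "statement".
image_scores = {"invoice": 0.90, "statement": 0.85}
context_scores = {"invoice": 0.20, "statement": 0.95}

print(blend_scores(image_scores, context_scores, context_weight=0.0))
print(blend_scores(image_scores, context_scores, context_weight=0.5))
```

In practice a solution might also skip the contextual pass entirely when the image engine is already confident, which saves the cost of full-page OCR. That variation is not shown here.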
Although classification is very powerful, there is a use case that it cannot fully satisfy today. Imaging technology can only really be used in objective scenarios. In the world of EMR, however, subjective classification is very important. For instance, say I have a patient file, Chris Riley, that contains all of my current and historical medical records. For the Chris Riley file, I scan a lab report. Classification technology will very accurately identify the document as a lab report, but for patient files it’s also important to know whether the lab report is current or historical, and whether it’s tied to any physician requests or upcoming procedures. At some level, rules can be built for this subjective classification, but in a broad sense it’s very difficult to scale. In the future this will be improved by more standards, but not necessarily by the core technology.
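To show what such a rule might look like, and why it is hard to scale, here is a small sketch. Everything in it is an assumption: the 90-day cutoff for "current", the `LabReport` and `PhysicianRequest` records, and the matching on test name are invented for the example and would differ from one practice to the next.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass
class LabReport:
    patient: str
    test_name: str
    collected_on: date


@dataclass
class PhysicianRequest:
    patient: str
    test_name: str
    requested_on: date


def file_lab_report(
    report: LabReport,
    open_requests: List[PhysicianRequest],
    today: date,
    current_window_days: int = 90,  # assumed cutoff, not an industry standard
) -> Tuple[str, Optional[PhysicianRequest]]:
    """Label a lab report as current or historical and link it to a request if one fits."""
    age = today - report.collected_on
    status = "current" if age <= timedelta(days=current_window_days) else "historical"
    linked = next(
        (
            req
            for req in open_requests
            if req.patient == report.patient
            and req.test_name == report.test_name
            and req.requested_on <= report.collected_on
        ),
        None,
    )
    return status, linked


requests = [PhysicianRequest("Chris Riley", "lipid panel", date(2024, 2, 20))]
report = LabReport("Chris Riley", "lipid panel", date(2024, 3, 1))

status, linked = file_lab_report(report, requests, today=date(2024, 3, 15))
print(status, linked is not None)
```

Even this tiny rule depends on data that is not on the scanned page (the open requests) and on a cutoff that someone has to choose, which hints at why hand-built rules of this kind are difficult to maintain across many document types.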
The future of classification is, to me, very interesting: first, because you can’t perform data capture without it, and second, because there are evolving applications in the area of business intelligence that are extremely fascinating. It’s important for organizations to take some time to see how they use document classification in their imaging environment and how it can be optimized.
#classification #ScanningandCapture