They're scanned images, not flowers

By Chris Riley, ECMp, IOAp posted 08-03-2010 10:28


Of course you want your scanned images to look as pretty as possible on the screen, but who’s to say the OCR engine agrees with your standards of beauty? This blog post, perhaps a slap in the face, is about how you can over-clean scanned images to the point where your recognition accuracy decreases.

There is software that cleans up images so that document recognition technology has the best fighting chance at accuracy, and there is software that makes scanned images look as if they originated digitally. Very often these two technologies come bundled together.

This sounds good, but it has gotten many companies in trouble, especially those doing large-volume document scanning. Why? Because the technology that makes an image look good on the screen can hurt that same image for recognition technologies such as OCR and ICR. The logic is simple: the algorithms that do this image manipulation were created with two very different purposes in mind.

Image cleanup for viewing was created by looking at before and after images. Much like your eye doctor asking you very softly, “One or two? Two or three?”, the developers of this technology opened the original image on one monitor and the image produced by the proposed new algorithm on another, and judged which looked better. The assumption was that if it looks better it will recognize better, which, as we will see, is not always the case.

Image cleanup for OCR was a similar before and after scenario. The developers took images, often at the character level, and tested the OCR engine on them before and after applying the proposed new image cleanup algorithm. If the accuracy was better (i.e. recognition was correct and the percentage of uncertainty decreased), the algorithm was implemented.
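The before and after check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `OcrResult`, `char_accuracy` and `keep_cleanup` are hypothetical names, the confidence value stands in for whatever uncertainty figure your engine reports, and the acceptance rule (accuracy must improve and confidence must not drop) is one reasonable reading of "accuracy was better", not a vendor's actual test procedure.

```python
from dataclasses import dataclass


@dataclass
class OcrResult:
    text: str          # what the engine recognized (hypothetical structure)
    confidence: float  # engine-reported confidence, 0.0 to 1.0 (assumed scale)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, recognized: str) -> float:
    """Character accuracy against known-correct text, clamped to 0..1."""
    if not truth:
        return 1.0 if not recognized else 0.0
    return max(0.0, 1.0 - edit_distance(truth, recognized) / len(truth))


def keep_cleanup(truth: str, before: OcrResult, after: OcrResult) -> bool:
    """Assumed rule: keep the cleanup step only if accuracy improved
    and the engine did not become less confident."""
    improved = char_accuracy(truth, after.text) > char_accuracy(truth, before.text)
    return improved and after.confidence >= before.confidence


# Example: the cleanup turned a misread "e" back into a "c".
before = OcrResult(text="invoiee", confidence=0.70)
after = OcrResult(text="invoice", confidence=0.90)
print(keep_cleanup("invoice", before, after))
```

The point of the sketch is that the decision is driven by recognition results measured against known-correct text, never by how the image looks.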

To confuse you a little, I’ve yet to find a case where image cleanup for OCR was not also good for viewing, but I have found many cases where image cleanup for viewing was bad for OCR. The reality is, you can clean your images too much. Here is how you know.

If you clean up typographic text too much, it looks like a graphic to the OCR engine. The OCR engine then skips it, which results in what is called a “high confidence blank”. If you are tempted to clean up handprint too much, well, just don’t do it. Aggressive image cleanup on hand-printed text removes portions of the hand stroke, the very information Intelligent Character Recognition (ICR, the technology for reading handprint) uses to figure out what the letters are.
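One way to watch for this symptom is to count high confidence blanks on zones you know contain text, before and after a cleanup setting is changed. The sketch below assumes each zone result is a hypothetical `(text, confidence)` pair and that a "high confidence blank" means empty output reported with confidence at or above a threshold you choose; real engines expose this differently.

```python
def count_high_confidence_blanks(zone_results, threshold=0.9):
    """Count zones where the engine confidently returned nothing.

    zone_results: list of (text, confidence) pairs for zones that are
    known to contain text (hypothetical structure, assumed 0..1 scale).
    """
    return sum(
        1
        for text, confidence in zone_results
        if text.strip() == "" and confidence >= threshold
    )


# Same three zones, recognized before and after aggressive cleanup.
before_cleanup = [("ACME Corp", 0.82), ("Invoice 1042", 0.88), ("Total", 0.91)]
after_cleanup = [("ACME Corp", 0.93), ("", 0.97), ("", 0.95)]

print(count_high_confidence_blanks(before_cleanup))
print(count_high_confidence_blanks(after_cleanup))
```

If the count rises after you turn on a cleanup option, that option is a strong suspect for over-cleaning.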

Here are some tips. Stick to the image cleanup suited to the purpose of the scan. If the image is simply for viewing, clean it up to perfection. If it’s for OCR, stick to the settings most conducive to OCR. If it’s both, that is not a problem, as many scanners and software packages support what is called “dual stream”: one image going down two paths. The version enhanced for OCR goes to the recognition software, and the version enhanced for viewing goes to an ECM system. Cleanup that is good for OCR and ICR is:

  1. Despeckle (unless dot-matrix font)
  2. Line Straightening
  3. Basic Thresholding
  4. Background removal
  5. Correction of Linear Distortion
  6. Dropout
  7. Line Removal (sometimes)
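To see why despeckle carries the dot-matrix caveat, here is a deliberately simple sketch, not any product's actual filter. It assumes a bitonal image stored as a list of rows with 1 for black and 0 for white, and it removes any black pixel that has no black neighbour. Commercial despeckle filters are more sophisticated, but the failure mode is the same.

```python
def despeckle(image):
    """Remove black pixels (1) that have no black 8-connected neighbour."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0  # isolated speck: erase it
    return cleaned


# A solid stroke plus one stray speck in the bottom-right corner.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
print(despeckle(page))

# A dot-matrix style stroke: separate dots, each one looks like a speck.
dot_matrix_stroke = [[1, 0, 1, 0, 1]]
print(despeckle(dot_matrix_stroke))
```

The solid stroke survives and the speck disappears, but the dot-matrix stroke is wiped out entirely, because every dot is "noise" by this filter's definition.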

Bad for OCR and ICR is:

  1. Adaptive Thresholding: Often causes a condition called “Fuzzy Characters”, where, for example, a “c” is read as an “e”. For handprint, it often removes portions of characters.
  2. Character Regeneration: Removes information that is critical to OCR and ICR processes. If you use it for OCR (machine print) you will notice more “high confidence blanks”: the characters are so perfect that they look like images to the OCR engine and are ignored. For ICR (hand print) you will damage the hand stroke of the characters, which confuses the ICR algorithms, weakens training’s ability to understand the subject, and ultimately reduces accuracy.
  3. Line Removal: Bad line removal makes for bad OCR. The line fragments it leaves behind really interfere with OCR and ICR processes.
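The difference between basic and adaptive thresholding is easier to see on a toy example. The sketch below is a simplification under stated assumptions: it works on a single scanline of grey values (0 is black, 255 is white), the adaptive version uses a plain local mean with no offset, and the function names are made up for illustration. It shows one way a local threshold can add stray pixels around clean paper; it is not a model of the "Fuzzy Characters" effect in any particular product.

```python
def global_threshold(row, cutoff=128):
    """Basic thresholding: one fixed cutoff for the whole image."""
    return [1 if value < cutoff else 0 for value in row]


def adaptive_threshold(row, radius=1, offset=0):
    """Local-mean thresholding: each pixel is compared with the mean
    of its own neighbourhood instead of a single global cutoff."""
    result = []
    for i, value in enumerate(row):
        window = row[max(0, i - radius): i + radius + 1]
        local_mean = sum(window) / len(window)
        result.append(1 if value < local_mean - offset else 0)
    return result


# Slightly uneven white paper with one dark two-pixel stroke in it.
scanline = [200, 198, 202, 199, 201, 40, 35, 200, 199]

print(global_threshold(scanline))    # only the stroke turns black
print(adaptive_threshold(scanline))  # the stroke plus stray pixels in the paper
```

With the fixed cutoff only the stroke is black. With the local mean, tiny variations in the paper are also pushed to black, which is exactly the kind of extra ink that makes characters harder for the engine to tell apart.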

So there you have it. It is not as clear-cut as “the prettier the image, the better the recognition”. Although this can be used as a general guide, it is not a fact; it is an assumption that has limited the success of recognition projects. The simple answer to all of this is, drum roll, test! Test just as the developers did when creating the technology. Alternatively, you can become like me and slowly develop a built-in OCR result predictor. I do not recommend the latter, as it does not promote a social life.

#Image #ScanningandCapture #imagecleanup #OCR.ICR #quality