Occasionally, my OCR nerdiness is satisfied with something other than just explaining how the technology works. The first line of the email that kicks it off usually reads, “I have the nastiest images you have ever seen” or something similar. No, not what you are thinking! I mean scanned images with text that are difficult to OCR using out-of-the-box OCR software and settings.
I’m quickly filled with the hope that I will, in fact, soon be seeing something new. It’s an OCR challenge: how to get the technology to work not just on low-resolution, small-font images, but on images with truly unique problems.
When I get these emails, the first thing I ask for, of course, is samples. Once I have the samples, below are the tricks I consider for getting better OCR results. And no, it’s not a fancy off-the-shelf image clean-up application.
1.) The first question I ask is, can you rescan them? If it’s clear the problem was the original scan, and the documents still exist, the easy solution is to rescan. Most of the time this is impossible because the documents have been destroyed, or the document itself is of very poor quality due to fading, paper type, or blemishes.
2.) Second, if the document comes to me in a format other than TIFF, I ask, “Was this scanned as TIFF Group 4? If not, can it be?” Another easy fix: scan the proper way, more specifically as TIFF Group 4 at 300 DPI. After I convert the thing I can output it to any imaginable format, but this is where to start.
3.) If I get this far without an answer, the fun begins. The first trick I try is opening the image in a decent image viewer that has good contrast and brightness controls. Sometimes you can play with the contrast and brightness enough to get the characters to “pop” more. Often this is a two-step process where you actually have to increase the contrast so much that even the background starts looking dark. Once you can see the characters, a second pass of binarization to drop out the background does the trick and gives you more legible letters.
4.) If that does not work, I invert the image. It’s amazing how often a simple inversion, where the background is now black and the letters white, makes the letters stand out a lot better. The top 4 commercial OCR engines support reading inverted text. Sometimes you have to select this as a separate option because detection is not automatic, but it’s always worth a try. Often, in higher-volume processes, I will OCR the original image, OCR the inverted image, and compare the results.
5.) That didn’t work!? Well, now it’s time to really dig in and approach pattern training. I will use the ability of the top 3 commercial OCR engines to pattern train to specific fonts. This technique can also be very damaging if not done correctly. You first do a pass on a few pages where each character is presented to you, and you tell the system the proper character value. Once you have trained the system, you re-run the images with just this training file to produce results. What you are basically doing is teaching the engine a new font.
Well, if that did not work, you have a few scenarios. Either what you are trying to convert is not even typographic, so OCR is the wrong technology and IWR or ICR is the technology to turn to, or the document is so bad that even the human eye can’t read the text. If so, I will probably just laugh at you and walk away.
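For anyone who would rather script tricks 3 and 4 than do them by hand in a viewer, here is a minimal Python sketch. Treat it as an illustration under assumptions, not as the workflow of any particular OCR product: the image is assumed to be a plain list of rows of 8-bit grayscale values, the contrast step is a simple percentile stretch, the binarization uses Otsu's threshold (my choice for the sketch, other methods work too), and `ocr` is a hypothetical callback you would wire to your own engine, assumed to return a `(text, confidence)` pair.

```python
from typing import Callable, List, Tuple

# Assumption: an image is a list of rows of 8-bit grayscale pixels
# (0 = black, 255 = white). Real code would load this from a scan.
Image = List[List[int]]


def stretch_contrast(img: Image, low_pct: float = 0.02, high_pct: float = 0.98) -> Image:
    """Linearly stretch the range between two percentiles to 0..255."""
    pixels = sorted(p for row in img for p in row)
    lo = pixels[int(low_pct * (len(pixels) - 1))]
    hi = pixels[int(high_pct * (len(pixels) - 1))]
    if hi <= lo:  # flat image, nothing to stretch
        return [row[:] for row in img]
    scale = 255.0 / (hi - lo)
    return [[min(255, max(0, round((p - lo) * scale))) for p in row] for row in img]


def otsu_threshold(img: Image) -> int:
    """Return the threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(img: Image, threshold: int) -> Image:
    """Pixels at or below the threshold become black, the rest white."""
    return [[0 if p <= threshold else 255 for p in row] for row in img]


def invert(img: Image) -> Image:
    """Swap dark and light so black-on-white becomes white-on-black."""
    return [[255 - p for p in row] for row in img]


def prepare_for_ocr(img: Image) -> Image:
    """Trick 3: boost contrast first, then binarize to drop the background."""
    stretched = stretch_contrast(img)
    return binarize(stretched, otsu_threshold(stretched))


def ocr_both_polarities(
    img: Image, ocr: Callable[[Image], Tuple[str, float]]
) -> Tuple[str, float]:
    """Trick 4: OCR the original and the inverted image, keep the more
    confident result. `ocr` is a hypothetical engine wrapper."""
    candidates = [ocr(img), ocr(invert(img))]
    return max(candidates, key=lambda result: result[1])
```

Comparing by engine confidence is only one way to "compare the results"; in a real high-volume process you might instead compare against a dictionary or vote character by character.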
I always have fun with these experiments. There are a few other, even trickier techniques that I’ve used to solve the problem, but they require some development experience. Earlier I said, no, it’s not a fancy off-the-shelf image enhancement program. The reason is that these applications, while they almost always produce a result that is good for viewing, almost always also ruin the fonts for recognition. Sad but true. Where they can be beneficial is in binarization, line straightening, watermark removal, and despeckle/dust removal. If you happen to have a challenging image, send ‘em over!