OCR engine voting is one of those ideas that just seems to make sense. “Seems” is the key word. Engine voting is touted as the process of running two separate OCR engines over the same document and comparing their results in order to choose the best one. The thought is that engines tend to excel in different areas, so by combining them you get the best chance at accuracy. Just like real campaigning and elections, there is more to the story.
Although it sounds promising, voting two separate engines can be quite problematic, and can actually hurt OCR accuracy. The problem with voting is that engines don’t all speak the same language. Votes are decided by what is called character confidence: you go character by character and pick the result from the engine that reports the highest confidence. The trouble is that no two engines report confidence on the same scale. Some engines are more conservative, others less so. Therefore, when you vote, you automatically trend towards the less conservative engine. Let me give you an example. Take the letter “c”. Engine A might report 98% confidence that it is an “e”, while Engine B might report 78% confidence that it is a “c”. When you vote these two, Engine A wins even though it is wrong. One argument is that you could measure the difference in reporting between the two engines and correct for it. This too seems possible until you try it and realize that each engine’s confidence is influenced by hundreds of other factors, such as contrast and font size. Normalizing two engines against each other would therefore be a significant burden and simply not worth it. Odd things happen when separate engines are voted, such as sudden accuracy drops on seemingly high-quality documents, or some portions of a document being recognized accurately while others are not.
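To make the “c” versus “e” example concrete, here is a minimal sketch of naive confidence voting. The `CharResult` type and the `vote` function are hypothetical names made up for illustration, not any real engine's API, and the numbers are the ones from the example above.

```python
from dataclasses import dataclass


@dataclass
class CharResult:
    engine: str        # which engine produced this guess
    char: str          # the character the engine thinks it saw
    confidence: float  # self-reported confidence, 0.0 to 1.0


def vote(candidates):
    """Naive vote: keep the guess with the highest self-reported confidence."""
    return max(candidates, key=lambda c: c.confidence)


# The glyph on the page is really a "c".
engine_a = CharResult("A", "e", 0.98)  # less conservative, and wrong
engine_b = CharResult("B", "c", 0.78)  # more conservative, and right

winner = vote([engine_a, engine_b])
print(winner.engine, winner.char)
```

The vote goes to Engine A's “e”. Nothing in the comparison knows that the two confidence values come from different scales, so the less conservative engine wins by default.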
I’m not actually attacking the act of voting. I’m attacking the perception that a product with several engines is automatically better. Voting itself is not a novel idea. In fact, the top 4 OCR engines vote internally, using what are called “experts” that take several different approaches to each character and compare the results. Because it all happens within the same engine, using the same algorithms for computing confidence, it works very well. This is great news, because it means you can, with great confidence, vote the same engine against itself. Take Engine B and run it once with setting C, tailored to small fonts, and once with setting D, tailored to fax images; voting the two passes can increase your accuracy on varied documents.
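Voting one engine against itself can be sketched the same way. This is only an illustration under stated assumptions: the per-character results are invented, `vote_passes` is a hypothetical helper, and the two passes are assumed to be already aligned to the same character positions, which a real pipeline would have to arrange.

```python
def vote_passes(pass_c, pass_d):
    """Merge two passes of the SAME engine, character by character.

    Each pass is a list of (char, confidence) pairs. Because both passes
    come from one engine, their confidence values share a scale and can
    be compared directly.
    """
    if len(pass_c) != len(pass_d):
        raise ValueError("passes must be aligned to the same character positions")
    merged = []
    for (ch_c, conf_c), (ch_d, conf_d) in zip(pass_c, pass_d):
        # Ties go to the first pass.
        merged.append(ch_c if conf_c >= conf_d else ch_d)
    return "".join(merged)


# Invented Engine B output for the word "scan" under two settings.
setting_c = [("s", 0.95), ("c", 0.91), ("o", 0.55), ("n", 0.93)]  # small-font setting
setting_d = [("s", 0.90), ("e", 0.60), ("a", 0.88), ("n", 0.94)]  # fax setting

text = vote_passes(setting_c, setting_d)
print(text)
```

In this made-up case each setting misreads one character with low confidence, and the merged result takes the better guess at every position.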
The reality is that it’s not always about throwing more technology at the problem. Doing so often means additional cost and possibly a reduction in quality. Achieving higher OCR accuracy is something of an art, but the gains are also clearly demonstrable via testing. Modern OCR engines have thousands of levers and pulleys for tuning accuracy. If it’s a boxed product, the vendor has chosen specific settings, which you may or may not be able to change. We all want more accuracy, but most of the time the tools already in your possession are enough to make great strides.
#accuracy #ScanningandCapture #Voting #OCR