I classify OCR into two generations: generation one, which I call old OCR, and the current generation. Generation-one engines did not think much about what they were doing. Give one an image, and it would try to read anything and everything as text. Modern OCR is quite different.
Many modern engines have changed nothing in their core character-level recognition. In fact, the core of the primary OCR engines has not been touched for many years. So how are they improving accuracy? It might not be what you think. The greatest improvements in OCR over the last 10 years have not been in character-level recognition. They have been in how engines understand the structure of a document, and then the meaning of the words it contains. Before jumping the gun and reading the text, OCR engines now spend time understanding the document and building a strategy. Once that strategy is built and the characters are read, they spend time checking their results.
The strategy is called document analysis (DA). Theoretically, if you compared two engines with identical character recognition, where engine A had document analysis and engine B did not, engine A would almost always win (the exceptions: EOBs and student transcripts).
Document analysis identifies all the components of a document (lines, text, paragraphs, columns, and graphics) before reading a thing. The purpose is threefold: to increase accuracy by fine-tuning the engine to what it is reading (small fonts, for example); to increase efficiency by not trying to read blades of grass in a photo as “I”s; and, for full-page conversion only, to increase output quality by delivering formatted documents at export.
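To make the idea concrete, here is a minimal sketch of the "plan before you read" step. It is not how any particular engine does it. The `Region` type, the `"text"`/`"graphic"` labels, the profile names, and the 12-pixel threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    kind: str            # hypothetical label: "text" or "graphic"
    font_height_px: int  # estimated glyph height; 0 for graphics


def plan_reading(regions, small_font_px=12):
    """Decide, before any recognition happens, what to read and how."""
    plan = []
    for region in regions:
        if region.kind != "text":
            # Skip photos and drawings so blades of grass are never read as "I"s.
            continue
        # Small print gets its own (hypothetical) recognition profile.
        profile = "small_font" if region.font_height_px < small_font_px else "default"
        plan.append((region, profile))
    return plan


page = [Region("text", 30), Region("graphic", 0), Region("text", 8)]
profiles = [profile for _, profile in plan_reading(page)]
print(profiles)
```

The graphic never reaches the recognizer, and the small-font block is flagged for different handling. That is the whole point of analyzing first and reading second.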
Why should you care? First, for testing: most products have tools to toggle document analysis settings. Find the right setting and you win the prize: amazing accuracy. Second, it will guide your scan settings to improve accuracy. Third, when the argument comes up, as it so often does for me, about how OCR is improving, you have one of the answers.
Most important of all, you should care so that you understand it’s not magic. The second-greatest improvement in accuracy has come from spending time understanding what the text is actually saying. In data capture this is called data types and dictionaries; in full-page recognition it is called dictionaries and morphology. The result is that, unlike me, OCR engines check their work! They don’t just send an email without validating its correctness.
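Here is a rough sketch of what a data type check can look like in data capture, assuming a field declared as an MM/DD/YYYY date. The look-alike table and the `check_date_field` helper are hypothetical, not taken from any product.

```python
import re

# Hypothetical confusion table: letters an engine might mistake for digits.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# The field's declared data type, expressed as a pattern: MM/DD/YYYY.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")


def check_date_field(raw):
    """Return (value, is_valid) for a field that is supposed to hold a date."""
    candidate = raw.strip().translate(DIGIT_LOOKALIKES)
    return candidate, DATE_PATTERN.fullmatch(candidate) is not None


print(check_date_field("O3/l5/2O21"))
print(check_date_field("March 2021"))
```

Because the engine knows the field should contain only digits and slashes, it can repair "O3/l5/2O21" and flag "March 2021" for review instead of passing either along unchecked.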
OCR dictionaries and morphologies are where much of the R&D for OCR engines is going. By building a more robust understanding of sentence structure, engines can tell that “I cat” should probably be “I eat”. Most engines today even let you create custom dictionaries, further increasing your read accuracy.
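The “I cat” example is worth a quick sketch, because “cat” is a perfectly good dictionary word and a plain lookup would let it through. One simple way to catch it, and I am assuming this technique for illustration rather than describing any engine's internals, is to score look-alike candidates by how often they follow the previous word. The dictionary, the word-pair counts, and the confusion table below are toy data invented for the example.

```python
# Toy data, invented for illustration only.
DICTIONARY = {"i", "cat", "eat", "the", "fish"}
BIGRAM_COUNTS = {("i", "eat"): 50, ("i", "cat"): 1, ("the", "cat"): 40, ("eat", "fish"): 20}
LOOKALIKES = {"c": "e", "e": "c"}  # characters the engine tends to confuse


def candidates(word):
    """The word itself plus dictionary words one look-alike swap away."""
    found = {word}
    for i, ch in enumerate(word):
        if ch in LOOKALIKES:
            swapped = word[:i] + LOOKALIKES[ch] + word[i + 1:]
            if swapped in DICTIONARY:
                found.add(swapped)
    return found


def correct(sentence):
    """Re-pick each word after the first using counts of word pairs."""
    words = sentence.lower().split()
    result = words[:1]
    for word in words[1:]:
        previous = result[-1]
        # Prefer the candidate seen most often after the previous word;
        # on a tie, keep what the engine originally read.
        best = max(
            sorted(candidates(word)),
            key=lambda w: (BIGRAM_COUNTS.get((previous, w), 0), w == word),
        )
        result.append(best)
    return " ".join(result)


print(correct("I cat"))
print(correct("the cat"))
```

With this toy data, “I cat” becomes “i eat” while “the cat” is left alone (the sketch lowercases everything to keep the lookup simple). A custom dictionary would amount to adding your own words and counts to those tables.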
And you thought it was all magic. Well, it’s not. A high-level understanding of the guts of OCR helps companies plan for scan quality, fine-tune results, and ultimately pick the correct capture tool.
#ScanningandCapture #accuracy #documentanalysis #OCR