What IT should know when preparing for OCR

By Chris Riley, ECMp, IOAp posted 11-16-2010 12:46


Every now and then, you have to roll up your sleeves and get down to business.  This post does just that, and it is aimed at IT managers tasked with implementing high-performance OCR environments. A properly architected OCR environment will maximize license value, shorten turnaround, and improve accuracy.

One challenge organizations face when performing OCR at volume is that scaling usually means purchasing additional licenses.  However, you can get far more out of a single license by choosing the right hardware.  The goal is to process as many pages as possible with each OCR license before you add new ones.
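To make that goal concrete, here is a rough back-of-the-envelope sketch. It assumes one license corresponds to one OCR engine instance processing pages one after another. Real licensing terms vary, so treat that as an assumption. The function name `licenses_needed` and the page volumes and per-page timings are purely illustrative, not measurements.

```python
import math


def licenses_needed(pages_per_day: int, seconds_per_page: float,
                    hours_per_day: float = 24.0) -> int:
    """Estimate how many OCR engine instances are needed for a daily volume.

    Assumption: one license == one engine instance working through pages
    sequentially for `hours_per_day` hours.
    """
    pages_per_license = (hours_per_day * 3600) / seconds_per_page
    return math.ceil(pages_per_day / pages_per_license)


if __name__ == "__main__":
    # Illustrative numbers only: same daily volume, two hypothetical machines.
    print(licenses_needed(100_000, 2.0))  # slower hardware -> 3
    print(licenses_needed(100_000, 0.8))  # faster hardware -> 1
```

In this made-up case, cutting the time per page from 2.0 to 0.8 seconds is the difference between three licenses and one.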

Along the same lines, when you increase the efficiency of the machines doing OCR, you can enable advanced recognition features that increase the overall accuracy of your OCR.  The more accurate an OCR engine, the more computing power it requires.  Many organizations are faced with the decision of turning off accuracy-improving features to save processing time. Better hardware lets you get more from existing licenses, keep every accuracy-enhancing feature turned on, and reduce the time it takes an image to enter the OCR process and return a result.

The bottom line is that IT has a lot of control over the effectiveness of a production-level OCR environment.  OCR will use every resource a machine has to offer.  Many organizations simply throw the latest and greatest technology at the problem and are surprised when it does not deliver the expected gains.  Below is a list of the top things to consider when enhancing your OCR environment.

  1. Bus speed. OCR processes move images in and out of memory, and serialize to the hard drive, more times than you can imagine.  This alone can really slow down a machine. Let’s try an analogy. San Francisco and New York are two very large cities. They have quite an amazing capacity for people and things.  Let's say San Francisco is the best computer memory, and New York the best and largest hard drive. If 200 of my friends and I want to move from San Francisco to New York with all our stuff, driving 100 or so VW Beetles cross country, it would take a LONG time.  This is a poor bus. But if we were all to load onto a jumbo jet, we would be there in a matter of hours. The slower the bus speed between memory, hard drive, and CPU, the greater the delay in moving image files from one location to another. Server-grade hardware often has fast bus speeds but carries a tremendous amount of overhead that gets in the way.  Bus speed is a very important consideration when looking at hardware components and how they benefit OCR.  You might maximize memory size, but if it takes too long to write the images to memory, that memory is never utilized.
  2. OCR is a CPU HOG. It will take 99% of any single thread when it is running, so investing in a more powerful CPU with more threads is not a bad idea. However, assuming that a server-grade CPU such as the Xeon is better than a desktop CPU such as the i7 might be a mistake. The reason is twofold. First, servers again have more overhead, which can get in the way of processes that move a lot of data from one place to another. Second, and more important, the chipset of the older, established CPUs is just that: older.  Because OCR is so math intensive, chips optimized for math operations outperform the rest.  It’s not surprising, then, that the chips that run the latest video games amazingly well also tend to do very well with OCR. Two chips may have the same MHz rating, but the older one lacks some of the faster math processing found in the newer chipsets, which is very good for OCR.  It’s like the difference between a diesel engine and a Ferrari engine.  The diesel engine is a powerhouse once it gets going, but out of the gate it is just not as fast.
  3. Hard-drive speed is the same story as bus speed. You want your hard drives to write quickly, because OCR serializes images very often. Not only do you want the drive itself to be fast, you also want its connection to the motherboard to be fast. So far, Serial ATA has proven to be the fastest option. Servers tend to implement SCSI, which is great for redundancy but, because of its overhead, does not promote speed.  On the flip side, the promise of solid state drives is great.  In tests, solid state drives do magic for OCR performance. However, the reliability is not there yet.
  4. Memory is important, but the amount of memory matters less than its speed. 4 GB should be sufficient for most workloads a single machine can handle. The difference in speed between original DDR and DDR3 is huge.
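Before buying anything, it can help to check whether a given machine spends more time moving page images around or crunching them. The sketch below is one minimal way to get a feel for that, using only the Python standard library. The function names, the 1 MB buffer standing in for a page image, and the byte-summing loop standing in for recognition math are all assumptions for illustration. A real comparison should time your actual OCR engine on your actual images.

```python
import os
import tempfile
import time
from typing import Optional


def time_disk_roundtrip(size_bytes: int, repeats: int = 5,
                        directory: Optional[str] = None) -> float:
    """Average seconds to write a buffer to disk (with fsync) and read it back.

    Pass `directory` to test a specific drive; the default temp location
    may live on a different disk (or in RAM) than your OCR working folder.
    The read usually hits the OS cache, so this mostly reflects write cost.
    """
    payload = os.urandom(size_bytes)
    total = 0.0
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "page.bin")
        for _ in range(repeats):
            start = time.perf_counter()
            with open(path, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # force the data to the device
            with open(path, "rb") as f:
                data = f.read()
            total += time.perf_counter() - start
            assert len(data) == size_bytes
    return total / repeats


def time_cpu_work(size_bytes: int, repeats: int = 5) -> float:
    """Average seconds for a CPU-bound pass over a same-sized buffer.

    This is only a stand-in for recognition math, not real OCR.
    """
    payload = os.urandom(size_bytes)
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        sum(payload)  # touches every byte on a single thread
        total += time.perf_counter() - start
    return total / repeats


if __name__ == "__main__":
    page = 1024 * 1024  # assumed 1 MB "page image"
    print(f"disk round trip: {time_disk_roundtrip(page):.4f} s")
    print(f"cpu pass:        {time_cpu_work(page):.4f} s")
```

If the disk round trip dominates, items 1 and 3 deserve attention first. If the CPU pass dominates, item 2 does.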

If you keep it simple and focus on the components that REALLY increase OCR performance, you may be surprised to find that you pay less to get more.  Often a desktop machine chosen with the right considerations will outperform a server, because OCR uses and abuses a system in quick spurts rather than with a steady draw of resources.  The above four items, in order, are the top considerations when architecting your OCR production environment to provide the greatest efficiency and quality.

#architecture #ScanningandCapture #OCR #memory #cpu #hardware