Avoiding the document imaging Mulligan

By Chris Riley, ECMp, IOAp posted 12-14-2010 14:33


One of the worst things that can happen in an imaging environment is the need to re-image documents.  But it happens.  It can happen because of dramatic advances in the technology being used, because of poor initial planning, or, very commonly, because images are being lost.  The latter is partly a result of poor planning, yes, but also of the mistaken belief that imaging lives in isolation.  To achieve a successful document imaging environment, organizations must not think of it as just a content-gathering process; they must also think about how the data will be retrieved.

If there is an opposite of computer rage, I have it.  Call it computer mania.  Get a new technology, use it exhaustively, and ignore the obvious consequences.  I have learned a lot of great lessons doing this.  One of those lessons is related to my paperless office.  When I first started imaging ALL my documents about five years ago, I did so blindly.  I assumed that, via OCR results, I would be able to find any document.  What I neglected to investigate was how I would search.  A blind reliance on desktop search clients resulted in my losing documents for a period of time.

Initially, I approached the problem by trying every desktop search client I could find.  I waited for large indexes to build, then tried searching for documents I had identified as existing but could not find.  Some of the missing documents I could find manually, and I could also verify that the OCR results were there to back up the search.  What I found is that the core functionality of each search client was more or less the same and could not combat the problem I had created.  So I was stuck.  I did not re-image the documents because, in my brilliance, I had shredded every piece of paper.  My options were to either re-design the system and put in some serious manual effort to bring current documents into compliance, or re-OCR a large volume of already OCRed PDFs.  Because I knew how badly the second option would compromise the final result, I rolled up my sleeves, did a re-design, and began the slow process of bringing my existing documents into compliance.

That was 300K documents (no, I did not look at every one); what if I had a million?  This little story illustrates the problem with not thinking about how you will retrieve documents at the same time you are thinking about how you input them.  I ended up refining my system, and it works very well.  There are now some additional input steps on my part, but the assurance those additional steps provide will save me from ever facing the issue again.  Here are some cool things you should consider bringing to the document imaging table.

  1. Taxonomy.  Build a high-level classification for your documents so that you can, at the very least, reduce the burden of the search to some subset of images.  A taxonomy is also useful in refining search results.  A well-designed taxonomy will reduce your reliance on search.
  2. Meaningful file names.  When you get your scanner, your images may be produced with names generated from some prefix, a date stamp, and perhaps an incremented number.  If at all possible, name your documents with metadata or some more relevant piece of information.  This could even be done at the batch level.  I'm not suggesting you get rid of dates; they are always useful.  When I had to use brute-force search on a large collection of documents, good naming would have saved me time.
  3. Facets / Keywords.  Incorporate into the metadata of a document keywords or facets that clearly signify that document's topic.  These will help in search filtering as well as in getting the right documents to the right place.
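To make the file-naming idea concrete, here is a minimal sketch in Python.  The metadata fields (document type, vendor, sequence number) are hypothetical examples, not part of any particular scanner's output; the point is simply to compose a name you could eyeball in a folder listing instead of accepting a generated ID.

```python
from datetime import date

def build_filename(doc_type: str, vendor: str, doc_date: date, seq: int) -> str:
    """Compose a searchable file name from document metadata.

    Leads with an ISO date (still sortable, as the article recommends
    keeping dates), then adds human-meaningful metadata fields.
    """
    return f"{doc_date:%Y-%m-%d}_{doc_type}_{vendor}_{seq:04d}.pdf"

name = build_filename("invoice", "acme", date(2010, 11, 3), 17)
# name == "2010-11-03_invoice_acme_0017.pdf"
```

Even if your capture software only lets you set a batch-level prefix, pushing the document type and source into that prefix buys you most of the same benefit.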

You will notice a pattern in these three tools: they all provide a quick way to take a large population of documents and create more manageable subsets.  This improves search and, in the event of brute-force search, reduces the effort required.  Proper implementation of these techniques also gives you the ability to create an endless number of virtual folders on the fly, which improves not only search but also your ability to perform more advanced analysis, such as business intelligence.
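The "virtual folder" idea above can be sketched in a few lines: a virtual folder is just a facet query evaluated on demand, rather than a fixed location on disk.  The document records and facet names below are illustrative assumptions, not a prescribed schema.

```python
# Hypothetical document records: each carries a set of facet keywords.
docs = [
    {"name": "2010-11-03_invoice_acme.pdf",   "facets": {"invoice", "acme"}},
    {"name": "2009-06-12_contract_acme.pdf",  "facets": {"contract", "acme"}},
    {"name": "2010-01-20_invoice_globex.pdf", "facets": {"invoice", "globex"}},
]

def virtual_folder(documents, required):
    """Return the names of documents whose facets include all required ones.

    set <= set tests subset membership, so a query is any combination of
    facets -- each combination is, in effect, a folder created on the fly.
    """
    return [d["name"] for d in documents if required <= d["facets"]]

acme_invoices = virtual_folder(docs, {"invoice", "acme"})
# acme_invoices == ["2010-11-03_invoice_acme.pdf"]
```

Because no physical folder hierarchy is involved, the same document can appear in as many of these views as its facets allow, which is exactly what rigid directory trees cannot do.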

If you do have to start over and take a document imaging Mulligan, consider re-imaging the original documents.  Unless you have also saved a TIFF Group 4 version of each image, OCRing already OCRed documents such as PDFs dramatically reduces the quality of the output.  If you have the storage space and plan ahead, you can keep a TIFF Group 4 copy of every document, which will give you the greatest opportunity should the need to re-OCR ever arise.

I will probably repeat this mistake with some other new technology until I find a cure for computer mania.  But you can learn from my mistake: do it right the first time, and plan.

#Scanning #ScanningandCapture #OCR #paperlessoffice