(Note: This is a republication of a blog post I have written here. After reading Chris Riley's post about OCR image clean-up, I thought of sharing it here.)
Two main concerns in any document imaging exercise are image quality and file size. Everyone wants the best possible image quality while keeping the file size to a minimum. Image enhancement has therefore become an essential step in a well-defined capture workflow. The purpose of image enhancement (also called image cleanup or image processing) is to make the images more readable and to remove unwanted noise, which in turn reduces storage requirements. This is especially important for forms processing and OCR applications, where cleaner images improve character recognition. There are a number of image enhancement techniques available today. Described below are 8 such image processing techniques.
1. De-skew
In a production scanning setup, document pre-processing is the most time-consuming step. One objective of this step is to arrange the documents correctly by rotating incorrectly filed documents and aligning them together. The de-skew facility in production capture applications helps to reduce this effort by automatically straightening pages that were misaligned during document feeding, within a specified range of degrees.
A more advanced feature, called content-based rotation, is available with Kofax VRS. VRS can analyze the content of the image and correct the orientation accordingly.
Here is a nice illustration called “The Effects of Deskewing a Document” on ScanHelp.com.
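To make the idea concrete, here is a minimal sketch of one common way to estimate a skew angle (a projection profile search). This is an illustration only, not how any particular capture product does it. The function name, the 0/1 pixel convention and the default angle range are all assumptions made for the example.

```python
import math


def estimate_skew_degrees(image, max_angle=5.0, step=0.5):
    """Estimate page skew by trying candidate angles.

    `image` is a list of rows where 1 = black pixel and 0 = white
    (an assumed convention). For each candidate angle every black
    pixel is projected onto a row as if the skew had been undone.
    Straight text lines pile up in few rows, so the angle with the
    "sharpest" row profile (largest sum of squared counts) wins.
    A positive result means lines drop towards the right.
    """
    black = [(x, y) for y, row in enumerate(image)
             for x, v in enumerate(row) if v]
    best_angle, best_score = 0.0, 0
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        slope = math.tan(math.radians(angle))
        profile = {}
        for x, y in black:
            r = round(y - x * slope)  # row the pixel would land on
            profile[r] = profile.get(r, 0) + 1
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

The search range plays the role of the "specified range of degrees" mentioned above. Once the angle is known, the page would be rotated by the opposite amount with an imaging library, which is left out here.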
2. Black border cropping & removing
Cropping refers to the removal of the outer parts of an image. In document scanning, black border cropping is a technique used to remove unnecessary black borders from an image. Border cropping removes the black borders completely, which also reduces the image height and width. However, this does not reduce the resolution of the image. (This is an illustration of border cropping.)
The other technique, called black border removal, replaces the black pixels in the borders with white pixels. Unlike cropping, this does not reduce the image dimensions.
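The difference between the two techniques can be sketched in a few lines. This is a simplified example that assumes a grayscale image stored as rows of values from 0 (black) to 255 (white) and a clean, straight border. The function names and the `black_max` cut-off are made up for the illustration; real scans usually have ragged borders and need something more tolerant.

```python
def find_content_box(image, black_max=40):
    """Return (top, bottom, left, right) of the area that remains after
    trimming edge rows and columns that are entirely dark.
    `bottom` and `right` are exclusive."""
    height, width = len(image), len(image[0])

    def row_is_dark(y):
        return all(v <= black_max for v in image[y])

    def col_is_dark(x):
        return all(image[y][x] <= black_max for y in range(height))

    top, bottom, left, right = 0, height, 0, width
    while top < bottom and row_is_dark(top):
        top += 1
    while bottom > top and row_is_dark(bottom - 1):
        bottom -= 1
    while left < right and col_is_dark(left):
        left += 1
    while right > left and col_is_dark(right - 1):
        right -= 1
    return top, bottom, left, right


def crop_border(image, black_max=40):
    """Border cropping: the output is smaller than the input."""
    top, bottom, left, right = find_content_box(image, black_max)
    return [row[left:right] for row in image[top:bottom]]


def whiten_border(image, black_max=40):
    """Border removal: same dimensions, border pixels become white."""
    top, bottom, left, right = find_content_box(image, black_max)
    return [[v if top <= y < bottom and left <= x < right else 255
             for x, v in enumerate(row)]
            for y, row in enumerate(image)]
```

Both functions share the same border detection; only the last step differs, which is why cropping changes the height and width while removal does not.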
3. De-speckling / Noise reduction
When scanning old documents we usually get unwanted dots (speckles) in the background. These come in two forms: black speckles on a white background and white speckles on a black background. This is also known as salt-and-pepper noise. (This is an example of an image with salt-and-pepper noise.)
Whatever the form, the noise makes the image harder to compress and increases the file size. De-speckling (also known as noise reduction) is the process of removing such unwanted speckles from the image background. (Illustration: noise removal)
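As a rough illustration, the simplest possible de-speckle pass flips any pixel whose neighbours all have the opposite colour. This is a sketch under assumptions (a 1-bit image stored as rows of 0 = white and 1 = black, and a hypothetical function name); production tools typically let you set a speckle size and use more careful filters.

```python
def despeckle(image):
    """Remove isolated single-pixel speckles from a 1-bit image.

    A pixel is treated as noise when every neighbour in the surrounding
    3x3 window has the opposite value. This handles both black specks on
    white ("pepper") and white specks on black ("salt").
    """
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if neighbours and all(n != image[y][x] for n in neighbours):
                cleaned[y][x] = 1 - image[y][x]
    return cleaned
```

Note that a filter this aggressive would also remove legitimate one-pixel marks, such as a tiny full stop in a low-resolution scan, so the speckle size always needs to be tuned against the documents being scanned.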
4. Colour drop out
Colour dropout is a proven technique for forms processing applications such as census projects. The idea is to discard the text boxes and lines of a scanned form so that only the filled-in data remains, which increases the OCR recognition rate. Earlier scanners used specific coloured lamps to achieve this (e.g. Blue Imaging Color Drop-Out Element for Kodak 9520/9500). This has since been improved and is now achieved in software.
Colour dropout accuracy depends directly on the printing quality of the forms. Only selected colours (shades of red, blue and green) can be dropped, and the exact set varies from scanner to scanner. It is therefore essential to print the forms in the recommended PANTONE colours (e.g. Fujitsu PANTONE Dropout Confirmation Listing).
This is a very informative article on color drop-out by the Document Doctor.
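A software colour dropout can be pictured roughly like this. The sketch assumes an RGB image stored as rows of (r, g, b) tuples and drops pixels in which one channel clearly dominates the other two. The function name and the `margin` value are assumptions for the example; real scanners work against calibrated dropout colours rather than a simple difference test.

```python
def drop_colour(image, channel=0, margin=60):
    """Convert an RGB image to grayscale, turning pixels of the dropout
    colour into white.

    `channel` is 0 for red, 1 for green, 2 for blue. A pixel is dropped
    when that channel exceeds both other channels by at least `margin`.
    """
    result = []
    for row in image:
        out_row = []
        for pixel in row:
            others = [v for i, v in enumerate(pixel) if i != channel]
            if pixel[channel] - max(others) >= margin:
                out_row.append(255)  # form line or box: becomes background
            else:
                r, g, b = pixel
                # standard luminance weights for RGB to gray conversion
                out_row.append(round(0.299 * r + 0.587 * g + 0.114 * b))
        result.append(out_row)
    return result
```

With `channel=0`, a red form box such as (220, 30, 30) becomes white, while black or blue pen strokes keep their gray value. This is also why badly printed forms cause trouble: if the printed red drifts towards brown, it no longer passes the test and the boxes stay in the image.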
5. Thresholding
Thresholding is a technique used when scanning grayscale images and saving them as black & white. A grayscale image typically has 8 bits per pixel (representing 256 shades of gray), while a black & white image has 1 bit per pixel (representing either black or white). When converting from grayscale to black & white (for example, scanning a photograph in black & white mode), each pixel with its own shade of gray has to be converted into either black or white. The point of separation is called the threshold. Changing the threshold value changes the output image quality.
This is an illustration of thresholding by imagebeat.com
The above illustration shows fixed thresholding, which is ideal for separating solid colours (e.g. text) from the background. However, for images with various shades of gray, a more advanced version called adaptive thresholding is used. In adaptive thresholding the threshold value is calculated separately for each pixel, based on the local contrast around it. Different scanner manufacturers and capture applications have come up with many different technologies and algorithms for this, such as Kodak iThresholding, which is built on Adaptive Threshold Processing (ATP).
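The difference between the two approaches is easier to see in code. The following is a minimal sketch, not any vendor's algorithm: it assumes a grayscale image stored as rows of values from 0 (black) to 255 (white), and the adaptive version simply compares each pixel with the mean of its neighbourhood. The function names, `radius` and `offset` are assumptions chosen for the example.

```python
def fixed_threshold(image, threshold=128):
    """One global cut-off: darker than `threshold` becomes black (1)."""
    return [[1 if v < threshold else 0 for v in row] for row in image]


def adaptive_threshold(image, radius=1, offset=10):
    """Per-pixel cut-off: a pixel becomes black (1) only when it is
    clearly darker than the average of its local window."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            out_row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(out_row)
    return result
```

On a page whose background gradually darkens from one side to the other, the fixed version turns the darker part of the background black, while the adaptive version still picks out only the pixels that are darker than their surroundings. A larger `radius` is needed in practice so that the window is bigger than the stroke width of the text.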
6. Line Removal
Line removal is a very useful feature, especially for OCR applications. It is used to remove unwanted lines from scanned images. These lines can be either actual content or noise. Most application forms, such as credit card and account opening forms, contain text boxes. Although such lines are actual content of the document, they interfere with the character recognition process and are therefore unwanted. Also, when scanning documents that have been folded, or when scanning fax copies, there is a high possibility of getting unwanted lines in the scanned image. Such lines, especially vertical ones, can interfere with the OCR process. In addition, any text that intersects these lines appears broken in the scanned image, resulting in incorrect text recognition.
When line removal is used, these unwanted lines are not included in the scanned image, resulting in a clean image optimized for character recognition. Characters that were broken by horizontal lines are also corrected. Furthermore, line removal reduces the file size.
Here is an illustration of line removal by Oracle.
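A simplified version of horizontal line removal can be sketched as follows. It assumes a 1-bit image stored as rows of 0 = white and 1 = black, treats any sufficiently long horizontal run of black pixels as a line, and keeps the pixels where a character stroke crosses the line so that the character is not broken. The function name and `min_length` are assumptions; this is not how Oracle or any other product implements it.

```python
def remove_horizontal_lines(image, min_length=20):
    """Erase long horizontal runs of black pixels, keeping crossings."""
    height, width = len(image), len(image[0])

    # Pass 1: mark every pixel that belongs to a long horizontal run.
    is_line = [[False] * width for _ in range(height)]
    for y, row in enumerate(image):
        x = 0
        while x < width:
            if row[x]:
                start = x
                while x < width and row[x]:
                    x += 1
                if x - start >= min_length:
                    for i in range(start, x):
                        is_line[y][i] = True
            else:
                x += 1

    def ink_beyond(y, x, step):
        """Is there black ink just past the line, above or below?"""
        y += step
        while 0 <= y < height and is_line[y][x]:
            y += step
        return 0 <= y < height and image[y][x] == 1

    # Pass 2: erase line pixels unless a stroke continues on both sides.
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if is_line[y][x] and not (ink_beyond(y, x, -1) and ink_beyond(y, x, 1)):
                cleaned[y][x] = 0
    return cleaned
```

Vertical lines could be handled the same way by transposing the image first. The `min_length` value has to be longer than the widest horizontal stroke in the text, otherwise parts of characters would be erased as well.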
7. Punch Hole filling
When filed documents with punched holes are scanned, most of the images show these holes as black spots. Besides spoiling the appearance of the image, this causes two main problems. First, if the file contains a large number of documents and the left margin is not adequate, these black spots can interfere with the actual content of the document. Second, black spots on blank pages can interfere with automatic blank page deletion, since they may be recognized as actual content. Earlier, these black marks were removed manually, which required a lot of time and effort. With the advancement of image processing applications such as Kofax VRS, this can now be automated. The feature replaces the colour of such black spots with the surrounding image colour. Most such applications take into consideration the dimensions and locations of the black spots and compare them with different manufacturer specifications and standards.
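The "dimensions and locations" check can be pictured with a small sketch. It assumes a 1-bit image stored as rows of 0 = white and 1 = black, looks only at blobs that lie entirely inside the left margin, and fills them with white when they are hole-sized and roughly as wide as they are tall. The function name and all the size values are assumptions for illustration, and filling with white assumes white paper rather than sampling the surrounding colour.

```python
from collections import deque


def fill_punch_holes(image, margin=12, min_size=4, max_size=10):
    """Whiten hole-shaped black blobs found in the left margin."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y0 in range(height):
        for x0 in range(min(margin, width)):
            if not image[y0][x0] or seen[y0][x0]:
                continue
            # Collect the connected blob of black pixels (4-connectivity).
            blob, queue = [], deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in blob]
            xs = [p[1] for p in blob]
            blob_h = max(ys) - min(ys) + 1
            blob_w = max(xs) - min(xs) + 1
            inside_margin = max(xs) < margin
            hole_sized = (min_size <= blob_h <= max_size
                          and min_size <= blob_w <= max_size)
            roughly_round = abs(blob_h - blob_w) <= 2
            if inside_margin and hole_sized and roughly_round:
                for y, x in blob:
                    cleaned[y][x] = 0
    return cleaned
```

Text that starts in the margin but runs on into the page is left alone, because its blob extends beyond the margin and is far wider than a punch hole.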
8. Blank Page Deletion
Blank page deletion is useful when scanning in duplex mode, where only some documents contain information on both sides; without it, the scanner operator has to delete the blank pages manually. Automatic blank page deletion deletes pages based on a specified threshold value (in bytes). When the file size of a page is less than the threshold, the page is considered blank and is deleted automatically. The right value depends on the document type and the scanner being used, and is usually chosen after testing a few experimental values. For blank page removal to be effective, it is essential to use some of the features described above, such as black border removal, de-speckling, line removal and punch hole filling.
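The size-based rule is simple enough to sketch directly. In this example zlib compression stands in for whatever compression the scanner or capture application really uses (an assumption made only so the example is self-contained), pages are 1-bit images stored as rows of 0/1, and the function names are hypothetical.

```python
import zlib


def compressed_size(page):
    """Size in bytes of a 1-bit page after compression."""
    raw = bytes(v for row in page for v in row)
    return len(zlib.compress(raw))


def drop_blank_pages(pages, threshold_bytes):
    """Keep only the pages whose compressed size reaches the threshold."""
    return [page for page in pages
            if compressed_size(page) >= threshold_bytes]
```

A truly blank page compresses to almost nothing, while a page with content does not. Anything left on a blank page (borders, speckles, lines, punch holes) adds to its compressed size and pushes it towards the threshold, which is why the clean-up features described above matter here.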
A common issue with blank page deletion is the bleed-through effect, where content on one side of the paper shows through on the other side, especially with very thin paper. Because of this, the blank page is mistakenly recognized as having actual content. Advanced capture applications such as Kofax VRS try to address this by differentiating between actual content and bleed-through.