Current solutions to the problem of optical character recognition (OCR) have advanced to the point where recognition rates of 99% are common for clean, uniformly formatted text. Unfortunately, the performance of most OCR algorithms degrades very rapidly when even small amounts of noise are introduced into the original document or during the scanning process. In many situations, this increased error rate quickly decreases the return on investment to the point where it is not cost-effective to integrate automated recognition technology solutions. To push this critical point lower and deal robustly with noise, OCR systems often perform some type of image enhancement as a preprocessing step.
Traditional enhancement techniques are applied at the pixel or local level and include, for example, the use of morphological operators to reduce speckle, fill small holes and smooth edges. Enhancement at the symbol level has received much less attention, but may include the identification, normalization and precise segmentation of glyphs, for example. Enhancement at the page level ideally includes the elimination of copier noise and streaks and the identification of higher-level structure.
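As a minimal illustration of this kind of pixel-level enhancement, the sketch below applies morphological opening and closing with scipy.ndimage; the 3x3 structuring element and the assumption that ink pixels are stored as True are illustrative choices, not a prescription of this work.

```python
import numpy as np
from scipy import ndimage

def enhance_pixels(binary_img: np.ndarray) -> np.ndarray:
    """Pixel-level cleanup of a binary document image (True = ink)."""
    selem = ndimage.generate_binary_structure(2, 1)  # 3x3 cross-shaped element
    # Opening removes isolated speckle smaller than the structuring element.
    opened = ndimage.binary_opening(binary_img, structure=selem)
    # Closing fills small holes inside strokes and smooths ragged edges.
    closed = ndimage.binary_closing(opened, structure=selem)
    return closed
```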
In this research we consider the problem of structure-based document image enhancement. We will build on previous work to develop the ability to learn the symbol classes that appear in a given document. These classes can then be used to enhance segmentation and symbol appearance, providing an improved version of the original document for OCR. If knowledge about the OCR system is available, the resulting document can be formatted to optimize the system's performance, and statistical information about the learned classes can be provided. The general methodology is alphabet-independent and depends only on the ability to segment the text into characters.
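To make the idea concrete, the following is a hypothetical sketch of learning symbol classes from segmented glyphs: size-normalized glyph bitmaps are clustered with k-means and each cluster is averaged into a template. The class count, the choice of clustering algorithm, and the function name are assumptions for illustration, not the method developed in this work.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_symbol_classes(glyphs: np.ndarray, n_classes: int) -> np.ndarray:
    """glyphs: array (num_glyphs, H, W) of size-normalized binary glyphs.
    Returns one re-binarized average template per learned class.
    Illustrative sketch only; the actual class-learning method is not fixed here."""
    flat = glyphs.reshape(len(glyphs), -1).astype(float)
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(flat)
    templates = np.stack([flat[labels == k].mean(axis=0) for k in range(n_classes)])
    # Degraded glyph instances could be replaced by (or blended with) their
    # class template to improve symbol appearance before OCR.
    return (templates > 0.5).reshape(n_classes, *glyphs.shape[1:])
```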
Noise in binary document images can be viewed as either coherent or incoherent with respect to the underlying document content. Ink blobs, salt-and-pepper noise, stray marks, and marginal noise are, in general, independent of the location, size, or other properties of the text in the document image; a recorded image containing this type of noise can be expressed as the sum of the true image and the noise, and the noise is therefore termed incoherent with respect to the content. When the noise instead lies within the spatial-frequency band of the image and cannot be suppressed without a priori knowledge of the content, it is termed coherent noise. Blur, pixel shift, and bleed-through, for example, manifest themselves differently depending on the content. Such coherent noise is considerably more difficult to model; it is mathematically non-linear and often multiplicative.
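The distinction can be made concrete with a small simulation, assuming a boolean ink image: incoherent salt-and-pepper noise flips pixels independently of the content, whereas a coherent degradation such as blur depends on the local glyph shapes and cannot be undone by simple subtraction. The flip probability and blur width below are illustrative values only.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def add_incoherent_noise(img: np.ndarray, p: float = 0.02) -> np.ndarray:
    """Salt-and-pepper: each pixel is flipped independently of the content."""
    flips = rng.random(img.shape) < p
    return np.where(flips, ~img, img)

def add_coherent_noise(img: np.ndarray, sigma: float = 1.5) -> np.ndarray:
    """Blur: the degradation depends on the surrounding glyph shapes, so it
    cannot be removed without a priori knowledge of the content."""
    blurred = ndimage.gaussian_filter(img.astype(float), sigma=sigma)
    return blurred > 0.5
```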
Detection of noise is generally based on properties such as its shape, position, frequency, gray values, or periodicity of occurrence in the document. Unwanted punch holes exhibit regularity in their shapes, marginal noise shows regularity in its position, and ruled lines show periodicity in their positions and consistency in their direction. On the other hand, noise such as ink blobs and binarized complex backgrounds is denser than text, whereas salt-and-pepper noise is impulsive and sparser than the content pixels. If noise exhibits consistent behavior in terms of these properties, it is easier to detect and separate from the content. However, little work has been reported on the removal of noise that does not adhere to a consistent shape, position, or size and that tends to interact with the foreground text in irregular ways.
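As a simple example of such property-based detection, the sketch below labels connected components and discards those whose area falls outside a plausible glyph range, a common heuristic for impulsive speckle and large blobs; the thresholds are illustrative, and, as noted above, rules of this kind fail for noise that interacts irregularly with the text.

```python
import numpy as np
from scipy import ndimage

def filter_components(binary_img: np.ndarray,
                      min_area: int = 4,
                      max_area: int = 5000) -> np.ndarray:
    """Keep connected components whose area lies in a plausible glyph range.
    Very small components are treated as salt-and-pepper noise; very large,
    dense ones as ink blobs or binarized background. Thresholds are illustrative."""
    labels, n = ndimage.label(binary_img)
    areas = ndimage.sum(binary_img, labels, index=np.arange(1, n + 1))
    keep = [i + 1 for i, a in enumerate(areas) if min_area <= a <= max_area]
    return np.isin(labels, keep)
```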