Language Identification for Handwritten Document Images Using a Shape Codebook
Guangyu Zhu, Xiaodong Yu, Yi Li and David Doermann
Language identification for handwritten document images is an open document analysis problem. In this paper, we propose a novel approach to language identification for documents containing a mixture of handwritten and machine-printed text, using image descriptors constructed from a codebook of shape features. We encode local text structures using scale- and rotation-invariant codewords, each representing a segmentation-free shape feature that is generic enough to be detected repeatably. We learn a concise, structurally indexed shape codebook from training data by clustering and partitioning similar feature types through graph cuts. Our approach is easily extensible and does not require skew correction, scale normalization, or segmentation. We quantitatively evaluate our approach on a large real-world document image collection comprising 1,512 documents in eight languages (Arabic, Chinese, English, Hindi, Japanese, Korean, Russian, and Thai) with a complex mixture of handwritten and machine-printed content. Experiments demonstrate the robustness and flexibility of our approach and show language identification performance that exceeds the state of the art.
Reference: Pattern Recognition, 42, pp. 3184-3191, December 2009.