Unconstrained Language Identification Using A Shape Codebook

Guangyu Zhu, Xiaodong Yu, Yi Li and David Doermann


We propose a novel approach to language identification in document images containing handwriting and machine-printed text, using image descriptors constructed from a codebook of shape features. We encode local text structures using scale- and rotation-invariant codewords, each representing a characteristic shape feature that is generic enough to appear repeatably. We learn a concise, structurally indexed shape codebook from training data by clustering similar features and partitioning the feature space by graph cuts. Our approach is segmentation-free and easily extensible. We quantitatively evaluate our approach on a large real-world document image collection, which consists of more than 1,500 documents in 8 languages (Arabic, Chinese, English, Hindi, Japanese, Korean, Russian, and Thai) and contains a complex mixture of handwritten and machine-printed content. Experimental results demonstrate the robustness and flexibility of our approach and show exceptional language identification performance that exceeds the state of the art.
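
The codebook idea in the abstract can be illustrated with a minimal sketch. It assumes that local shape features have already been extracted as fixed-length numeric vectors. Plain k-means stands in here for the clustering and graph-cut partitioning, and nearest-histogram matching is an illustrative assumption rather than the classifier used in the paper. All function names (`learn_codebook`, `describe`, `identify`) are hypothetical.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, feature):
    """Index of the codeword closest to the given feature vector."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], feature))


def learn_codebook(features, k, iterations=20):
    """Cluster training features into k codewords.

    Assumption: plain k-means with evenly spaced initial picks, used as a
    stand-in for the clustering and graph-cut partitioning in the abstract.
    """
    codebook = [list(features[i * len(features) // k]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for f in features:
            groups[nearest(codebook, f)].append(f)
        for i, members in enumerate(groups):
            if members:  # keep the old codeword if its cluster is empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def describe(codebook, features):
    """Image descriptor: normalized histogram of codeword occurrences."""
    counts = Counter(nearest(codebook, f) for f in features)
    total = len(features) or 1  # avoid division by zero on empty input
    return [counts[i] / total for i in range(len(codebook))]


def identify(codebook, labeled_histograms, features):
    """Return the language whose reference histogram is closest to the query.

    Assumption: nearest-histogram matching, chosen only for illustration.
    """
    query = describe(codebook, features)
    return min(labeled_histograms,
               key=lambda lang: sq_dist(labeled_histograms[lang], query))
```

In this sketch a document is never segmented into lines, words, or characters: its descriptor depends only on how often each codeword occurs. Supporting a new language then amounts to adding one more labeled histogram.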

Reference: The 11th International Conference on Frontiers in Handwriting Recognition (ICFHR 2008), pp. 13-18, Montreal, Canada, 2008.

© Copyright 2001, Language and Media Processing Laboratory, University of Maryland, All rights reserved.