Information Retrieval from Indic Script OCRed Text
This project evaluates information retrieval (IR) effectiveness on OCRed text in Indic scripts. It uses a relevance-judged collection of 62,825 articles from a leading Bangla newspaper, Anandabazar Patrika (2004-2006). For each article, both the original digital text and the corresponding OCR output are provided, and relevance judgments are available for 92 topics. The OCR output was obtained by rendering each digital document as a document image and processing that image with a Bangla OCR system. The document images vary in font face, character style, and size. The character-level (more specifically, Unicode-level) accuracy of the OCR engine is about 92%. We aim to develop IR techniques for retrieving documents from these collections and to report mean average precision (MAP) and Precision@10 separately for the digital text collection and the OCR collection. Retrieval from the OCR collection is expected to be less effective, so the search algorithms under development make use of additional techniques (e.g., OCR error correction and modeling of OCR errors for IR purposes) to improve retrieval from OCRed text.
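As an illustration of the two reported measures, the following is a minimal sketch of how MAP and Precision@10 could be computed from ranked results and relevance judgments. The function names, the dictionary layout, and the toy topic and document identifiers are hypothetical and are not taken from the collection or from any project tooling; the scores it produces are for the toy data only.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of average precision over all judged topics."""
    if not qrels:
        return 0.0
    scores = [average_precision(runs.get(topic, []), rel)
              for topic, rel in qrels.items()]
    return sum(scores) / len(scores)


# Hypothetical toy data: two topics, with one run per collection.
qrels = {"t1": {"d1", "d3"}, "t2": {"d2"}}
run_digital = {"t1": ["d1", "d3", "d4"], "t2": ["d2", "d5"]}
run_ocr = {"t1": ["d4", "d1", "d3"], "t2": ["d5", "d2"]}

map_digital = mean_average_precision(run_digital, qrels)
map_ocr = mean_average_precision(run_ocr, qrels)
p10_digital_t1 = precision_at_k(run_digital["t1"], qrels["t1"], k=10)
```

In this toy run, relevant documents are ranked lower in the OCR run than in the digital run, so its MAP is lower. This is the kind of degradation the comparison between the two collections is meant to quantify.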
Last Updated: Monday 3 October, 2011