Keyword based Information Retrieval System for Urdu Document Images

Abstract: Various dynasties ruled the Indian subcontinent and left behind an enormous and rich cultural heritage, including intellectually rich research in the form of documents written in Urdu. To provide efficient access to this knowledge, digitizing the existing work is the need of the hour. In addition to digitization, efficient search mechanisms need to be implemented to give users rapid access to the queried information. In most cases, digitized documents are complemented by manually assigned tags, which is not only time-consuming but also provides a very limited search facility. Automating the transcription of these documents using Optical Character Recognition (OCR) systems is also challenging because of the highly complex cursive nature of Urdu text. To overcome these limitations, this study introduces a keyword-spotting-based information retrieval system for document images. The proposed technique relies on two major modules: document indexing and retrieval. Images of documents are segmented into partial words (ligatures), and identical partial words (PWs) are grouped into clusters. We introduce the concept of considering each (partial) word as a unique shape, and a set of shape descriptors is extracted to characterize the PWs. The clusters of PWs are used to index a given set of documents. During retrieval, the query word presented to the system is matched with the clusters in the database, and all documents containing instances of the query word are retrieved and presented to the user. Evaluated on a set of printed Urdu documents in the Nastaliq font, the system achieved promising precision and recall rates.
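The indexing and retrieval modules described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the shape descriptors, the clustering method, or the matching rule, so the sketch assumes that each partial word is already reduced to a fixed-length numeric descriptor, uses a simple greedy threshold clustering with Euclidean distance, and assumes that a query word matches a document when all of its partial words fall into clusters present in that document. The names `build_index`, `nearest_cluster`, `retrieve`, and `threshold` are hypothetical.

```python
from math import dist


def nearest_cluster(centroids, descriptor, threshold):
    """Return the id of the closest cluster within `threshold`, or None."""
    best_id, best_dist = None, threshold
    for cluster_id, centroid in enumerate(centroids):
        d = dist(centroid, descriptor)  # Euclidean distance (an assumption)
        if d <= best_dist:
            best_id, best_dist = cluster_id, d
    return best_id


def build_index(documents, threshold):
    """Cluster partial-word descriptors and index documents by cluster.

    `documents` maps a document id to a list of descriptor tuples, one per
    partial word. Segmentation and descriptor extraction are assumed to have
    happened already. Clustering is greedy: a descriptor joins the nearest
    existing cluster within `threshold`, otherwise it starts a new cluster.
    """
    centroids = []  # one representative descriptor per cluster
    postings = {}   # cluster id -> set of document ids
    for doc_id, descriptors in documents.items():
        for descriptor in descriptors:
            cluster_id = nearest_cluster(centroids, descriptor, threshold)
            if cluster_id is None:
                centroids.append(descriptor)
                cluster_id = len(centroids) - 1
            postings.setdefault(cluster_id, set()).add(doc_id)
    return centroids, postings


def retrieve(query_descriptors, centroids, postings, threshold):
    """Return ids of documents containing every partial word of the query."""
    result = None
    for descriptor in query_descriptors:
        cluster_id = nearest_cluster(centroids, descriptor, threshold)
        docs = postings.get(cluster_id, set()) if cluster_id is not None else set()
        result = set(docs) if result is None else result & docs
    return result if result is not None else set()


if __name__ == "__main__":
    # Toy two-dimensional descriptors standing in for real shape features.
    docs = {
        "doc1": [(0.0, 0.0), (5.0, 5.0)],
        "doc2": [(0.1, 0.0)],
        "doc3": [(9.0, 9.0)],
    }
    centroids, postings = build_index(docs, threshold=0.5)
    print(sorted(retrieve([(0.05, 0.0)], centroids, postings, threshold=0.5)))
```

Because documents are indexed by cluster rather than by individual partial word, a query is compared against one representative per cluster instead of against every partial word in the collection.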
Document type:
Conference paper

https://hal-univ-bourgogne.archives-ouvertes.fr/hal-01435327
Contributor: Le2i - Université de Bourgogne
Submitted on: Friday, January 13, 2017 - 18:00:02
Last modified on: Wednesday, September 12, 2018 - 01:27:54

Citation

Raashid Hussain, Haris Ahmad Khan, Imran Siddiqi, Khurram Khurshid, Asif Masood. Keyword based Information Retrieval System for Urdu Document Images. 2015 11th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), University of Bourgogne; University of Milan, May 2015, Bangkok, Thailand. pp.27-33, ⟨10.1109/SITIS.2015.16⟩. ⟨hal-01435327⟩
