Conference paper, Year: 2015

Keyword based Information Retrieval System for Urdu Document Images

Abstract

Various dynasties ruled the Indian subcontinent and left behind an enormous and rich cultural heritage, including intellectually rich research in the form of documents written in Urdu. Providing efficient access to this knowledge requires digitizing the existing work. Beyond digitization, efficient search mechanisms are also needed to give users rapid access to the information they query. In most cases, digitized documents are complemented by manually assigned tags, which are time-consuming to produce and offer only a very limited search facility. Automating the transcription of these documents using Optical Character Recognition (OCR) systems is also challenging because of the highly complex cursive nature of Urdu text. To overcome these limitations, this study introduces a keyword-spotting-based information retrieval system for document images. The proposed technique relies on two major modules: document indexing and retrieval. Images of documents are segmented into partial words (PWs, or ligatures), and identical PWs are grouped into clusters. We introduce the concept of treating each (partial) word as a unique shape, and a set of shape descriptors is extracted to characterize the PWs. The clusters of PWs are used to index a given set of documents. During retrieval, the query word presented to the system is matched against the clusters in the database, and all documents containing instances of the query word are retrieved and presented to the user. Evaluated on a set of printed Urdu documents in the Nastaliq font, the system achieved promising precision and recall rates.
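
The two modules described in the abstract, indexing and retrieval, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the shape descriptors, the clustering method, or the matching rule, so the sketch assumes that each partial word has already been segmented and reduced to a numeric descriptor tuple, groups descriptors with a simple threshold-based scheme (the first member of a cluster acts as its representative), and compares them by Euclidean distance. The names `MATCH_THRESHOLD`, `nearest_cluster`, `build_index`, and `retrieve` are hypothetical.

```python
from collections import defaultdict
from math import dist

# Hypothetical threshold: descriptors closer than this are treated as the same shape.
MATCH_THRESHOLD = 0.5


def nearest_cluster(descriptor, representatives):
    """Return (index, distance) of the closest representative, or (None, inf) if none exist."""
    best_idx, best_d = None, float("inf")
    for idx, rep in enumerate(representatives):
        d = dist(descriptor, rep)  # Euclidean distance (an assumption, not from the paper)
        if d < best_d:
            best_idx, best_d = idx, d
    return best_idx, best_d


def build_index(documents):
    """Indexing: group similar partial words into clusters and record which
    documents each cluster occurs in.

    `documents` maps a document id to a list of descriptor tuples, one per
    partial word (segmentation and feature extraction are assumed done).
    """
    representatives = []          # first descriptor seen for each cluster
    postings = defaultdict(set)   # cluster index -> ids of documents containing it
    for doc_id, descriptors in documents.items():
        for descriptor in descriptors:
            idx, d = nearest_cluster(descriptor, representatives)
            if idx is None or d > MATCH_THRESHOLD:
                representatives.append(tuple(descriptor))  # start a new cluster
                idx = len(representatives) - 1
            postings[idx].add(doc_id)
    return representatives, postings


def retrieve(query_descriptors, representatives, postings):
    """Retrieval: return ids of documents containing every partial word of the query."""
    result = None
    for descriptor in query_descriptors:
        idx, d = nearest_cluster(descriptor, representatives)
        if idx is None or d > MATCH_THRESHOLD:
            return set()          # this partial word matches no indexed cluster
        docs = postings[idx]
        result = set(docs) if result is None else result & docs
    return result or set()


# Toy data: 2-D descriptors stand in for real shape features.
documents = {
    "doc1": [(0.0, 0.0), (1.0, 1.0)],
    "doc2": [(0.1, 0.0), (5.0, 5.0)],
    "doc3": [(5.1, 5.0)],
}
representatives, postings = build_index(documents)
hits = retrieve([(0.0, 0.0), (5.0, 5.0)], representatives, postings)
```

In this toy setup a query made of two partial words is answered by intersecting the document sets of the two matched clusters. The intersection ignores the order and adjacency of partial words within a page, which a real keyword spotting system would likely need to check.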
No file deposited

Dates and versions

hal-01435327, version 1 (13-01-2017)

Cite

Raashid Hussain, Haris Ahmad Khan, Imran Siddiqi, Khurram Khurshid, Asif Masood. Keyword based Information Retrieval System for Urdu Document Images. 2015 11th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), University of Bourgogne; University of Milan, May 2015, Bangkok, Thailand. pp.27-33, ⟨10.1109/SITIS.2015.16⟩. ⟨hal-01435327⟩