Rule Based Approach for Word Normalization in Transliterated Search Queries

Authors
Varsha M. Pathak and Manish R. Joshi
Keywords
Information Retrieval; SMS Based Information System; Vector Space Model; Minimum Edit Distance; Noisy Query; Transliterated Search
Abstract
SMS based information systems are a need of the age. Most present SMS based information systems send one-way informative text messages generated from their respective knowledge systems. By applying information retrieval methodology using models like the Vector Space Model, such systems can allow their users to send queries according to their information needs. This makes the system more useful from the user's point of view. This paper is about one such initiative for accessing relevant literature like poems, phrases, rhymes, stories, abhang and much more. The mobile based quick library access system MQuickLib allows users to access such literature by formulating transliterated queries. The Vector Space Model is used to create the system's knowledge base by processing the document terms, which are matched with the query terms while allowing for variation in spelling due to the transliteration style of the users. The matching score is assigned by devising a set of rules that identify the distance between two terms: dk, the term from the document, and qj, the query term. The original Levenshtein minimum edit distance algorithm is modified by applying this rule based approach. These rules are identified by collecting SMS queries from users for a given set of known queries in Marathi (Devanagari). Experiments were carried out on a collection of Marathi and Hindi literature that mainly includes songs, gazals, powadas, bharud and other types. These documents are available in a standard transliteration form like ITRANS (an Indic Transliteration System). This paper elaborates the rule based approach and analyses the results to select an appropriate rule based model that is further applied in the development of the MQuickLib system.
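The idea of modifying the minimum edit distance with rules can be illustrated by a minimal sketch. It is not the rule set or the algorithm of the paper: the character pairs in CHEAP_SUBSTITUTIONS, the letters in CHEAP_INDEL, the 0.5 cost and the function names are all hypothetical, chosen only to show how transliteration-specific rules could lower the cost of selected edits between a document term dk and a query term qj.

```python
# Hypothetical rules (assumptions, not taken from the paper):
# character pairs that transliterating users might interchange.
CHEAP_SUBSTITUTIONS = {frozenset("vw"), frozenset("iy"), frozenset("sz")}
# Hypothetical: adding or dropping these letters is treated as a style variation.
CHEAP_INDEL = set("ah")
RULE_COST = 0.5   # assumed reduced cost for an edit covered by a rule
PLAIN_COST = 1.0  # unit cost used by the standard Levenshtein distance


def sub_cost(a: str, b: str) -> float:
    """Cost of substituting character a with character b."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CHEAP_SUBSTITUTIONS:
        return RULE_COST
    return PLAIN_COST


def indel_cost(c: str) -> float:
    """Cost of inserting or deleting character c."""
    return RULE_COST if c in CHEAP_INDEL else PLAIN_COST


def rule_based_distance(doc_term: str, query_term: str) -> float:
    """Weighted edit distance between a document term and a query term."""
    m, n = len(doc_term), len(query_term)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    # First column: delete every character of the document term prefix.
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + indel_cost(doc_term[i - 1])
    # First row: insert every character of the query term prefix.
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + indel_cost(query_term[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost(doc_term[i - 1]),      # deletion
                dist[i][j - 1] + indel_cost(query_term[j - 1]),    # insertion
                dist[i - 1][j - 1]
                + sub_cost(doc_term[i - 1], query_term[j - 1]),    # substitution
            )
    return dist[m][n]


if __name__ == "__main__":
    print(rule_based_distance("aavaaj", "awaj"))
```

Under these assumed rules, the pair "aavaaj" and "awaj" gets a distance of 1.5 (two cheap deletions of "a" and one cheap v/w substitution), whereas the unit-cost Levenshtein distance for the same pair is 3. A lower distance lets a spelling variant still match the intended document term.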

Received : 28 January 2020  

Accepted : 17 May 2020

Published : 04 June 2020  

DOI: 10.30726/ijlca/v7.i2.2020.72002
