International Journal of Electrical and Computer Engineering (IJECE)
Vol. No. February 2017, pp. 432-450
ISSN: 2088-8708, DOI: 10.11591/ijece.

Context Sensitive Search String Composition Algorithm using User Intention to Handle Ambiguous Keywords

Uma Gajendragadkar1, Sarang Joshi2
1 COEP, Savitribai Phule Pune University, Pune, Maharashtra, India
2 PICT, Savitribai Phule Pune University, Pune, Maharashtra, India

ABSTRACT
Finding the required URL among the first few result pages of a search engine is still a challenging task. It may require a number of reformulations of the search string, which adversely affects the user's search time. Query ambiguity and polysemy are major reasons for not obtaining relevant results in the top few result pages. Efficient query composition and data organization are necessary for getting effective results. The context of the information need and the user intent may improve the autocomplete feature of existing search engines. This research proposes a Funnel Mesh-5 (FM5) algorithm to construct a search string, taking into account the context of the information need and the user intention, with three main steps: 1) predict user intention from user profiles and past searches via a weighted mesh structure; 2) resolve ambiguity and polysemy of search strings using context and user intention; 3) generate a personalized, disambiguated search string by query expansion encompassing the user intention and the predicted query. Experimental results for the proposed approach and a comparison with direct use of a search engine are presented. A comparison of the FM5 algorithm with the K Nearest Neighbor algorithm for user intention identification is also presented. The proposed system provides better precision of search results for ambiguous search strings, with improved identification of the user intention. Results are presented for an English language dataset as well as a Marathi (an Indian language) dataset of ambiguous search strings.

Article Info
Article history:
Received Aug 5, 2016
Revised Nov 12, 2016
Accepted Nov 26, 2016

Keyword: Autocompletion, Context, Data mining, Search, User intention

Copyright © 2017 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Uma Gajendragadkar, COEP.
Phone 919822479128.
G7/9 Omkar Garden, Manikbaug, Pune, Maharashtra, India.
Email: umagadkar@gmail.

Journal homepage: http://iaesjournal.com/online/index.php/IJECE

INTRODUCTION

Current search engines churn a large volume of data to obtain meaningful information; however, the main challenge is to get relevant results in the top few result pages. Search engines check for the presence of keywords in documents, but the mere presence of keywords in a document may not match the user's search intention and need. User satisfaction increases when more relevant and exact information is presented in the top few results. An appropriately composed query is the starting point for handling this challenge. The performance of search engines can be improved with the use of appropriate keywords or the prediction of such keywords. Search engines use search logs and the most popular queries; however, these are not sufficient to predict the user's interests or intention.

Users are of three types: Internet-skilled users, Internet-aware users, and Internet-unskilled users. Many times, users do not know the proper keywords for searching information and cannot express their information need or search intent. As a result, search results often do not satisfy the user's information need. This problem can be addressed by query expansion and reformulation. Search engines provide autocompletions of queries based on popularity; however, they are inadequate. Although different users may use the same query keyword, their intent and context may be different. Current search engines provide the same results to all users using the same keywords at a given point in time.
Personalization is desirable to better satisfy the needs of the user. The following experiment illustrates this further. If a user searches for 'Michael Jackson', search engines return results for the famous singer Michael Jackson in the majority of result pages. These results would be treated as irrelevant and incorrect if the user's intent was to search for professor Michael Jackson.

Table 1. Example search query done on Google on 29th May 2015 (result rows)

| Query String | Total Results | Search Results as Singer | Search Results as Professor | Search Results as Software Development | Search Results as VP |
| Michael Jackson | About 39,00,00,000 | First 13 pages and after | Page 17, 8th result | Page 13, last result | Page 16, 4th result |
| Michael Jackson professor | About 7,89,00,000 results | Page 3, 5th result | First page | Second page, 2nd result | Not present in the first 20 pages |

As shown in Table 1, when one searches for the query string 'Michael Jackson', results for the singer 'Michael Jackson' are returned in the first 13 pages, whereas no result for the professor 'Michael Jackson' appears in those pages. With each page containing 10 results, the relevant results start appearing only after 130 result rows. However, when the word 'professor' is added to the query string 'Michael Jackson', the results for professor Michael Jackson are seen on the first result page itself. This demonstrates that if keywords based on user intention are used, better hits can be obtained in the first few search result pages.

Query expansion based on user intention has been shown to give better search results over large data sets such as the Web. Thus, user intention can be used to disambiguate a query. User context can include parameters such as 'gender', 'age', 'topic', 'location', etc. It can be short-term or long-term. In the proposed method, user intention is identified with the help of a user profile containing parameters such as 'gender', 'profession', 'interests', 'location' and past searches.
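As a minimal sketch of the reformulation illustrated by the 'Michael Jackson' example, the snippet below appends an intention keyword to a query and converts a (page, position) pair into a result row. It assumes the intention keyword has already been identified by some other means; the function names `expand_query` and `result_row` are hypothetical and are not part of the FM5 algorithm itself.

```python
def expand_query(query: str, intention_keyword: str) -> str:
    """Append an intention keyword to a query unless it is already present."""
    terms = query.split()
    if intention_keyword.lower() in (t.lower() for t in terms):
        return query
    return " ".join(terms + [intention_keyword])


def result_row(page: int, position: int, page_size: int = 10) -> int:
    """Convert a (page, position-on-page) pair to a 1-based result row."""
    return (page - 1) * page_size + position


# Assumed intention keyword: 'professor' (taken from the example in the text).
print(expand_query("Michael Jackson", "professor"))  # Michael Jackson professor

# Page 17, 8th result with 10 results per page.
print(result_row(17, 8))  # 168
```

With 10 results per page, the last result of page 13 is row 130, which matches the count quoted in the text.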
User intention identified with the FM5 algorithm is used to reformulate the query. This paper brings together different IR (Information Retrieval) areas such as QAC (query autocompletion), query personalization and automatic query expansion. Our contributions are:
(1) A novel user intention identification algorithm is proposed to predict user intention.
(2) Query expansion is done using the identified user intention to get improved precision for ambiguous search strings.
(3) Experimental evaluation of the method is conducted with a dataset collected from users. The results reflect improvement in user intention identification and in the precision of search results.

Results of query expansion using the identified user intention are compared with the results of using the Google search engine directly as the first baseline, and with the results obtained for ambiguous queries by Chirita et al. as the second baseline.

In this paper, Section 2 describes the related work. Section 3 explains the data description and how the data is used by the proposed system, while Section 4 describes the FM5 user intention identification algorithm. Results and discussion are described in Section 5. The conclusion is presented in Section 6.

RELATED WORK

Autocompletions and Personalization

Bhatia et al. present work in which phrases and n-grams are mined from text collections and used for generating autocompletions. Most popular completion, i.e., autocompletion based on the past popularity of queries in query logs, is modeled in Bar-Yossef and Kraus's work. Commercial search engines use MPC (most popular completion) for query autocompletion. Other query autocompletion methods include personalized autocompletion, context-based autocompletion using previous queries by the user, time-based autocompletion, and time- and context-based autocompletion. Homologous queries and semantically related terms are used to generate autocompletions by Cai et al. Personalization of query results using the interests of users has been done by many researchers.
User preferences are collected by either implicit or explicit methods. Gender and age are used for personalizing the results by Kharitonov and Serdyukov. User context based on recent queries is generated and used to rank the query results in a session by Xiang et al. Most of the research conducted personalizes the query results by reranking them using the user profile rather than personalizing query autocompletion. This paper proposes an algorithm that uses personalization for query completion or autocompletion. An improvement in autocompletion ranking through personalization is claimed in Shokouhi's work. Shokouhi et al. also presented a time-sensitive approach that ranks autocompletions according to their expected popularity. Ambiguous queries are handled by Shokouhi et al. by providing user context in terms of session context. Query suggestion is achieved by using click information along with previous queries in a session as context and then mining query log sessions for query reformulations. This work is similar to ours, but it does not consider long-term user context; instead, it focuses on session-based user context in terms of click information and previous queries.

User Intention

Many studies have tried to identify user intention in different ways. Most of them try to categorize queries as informational, navigational or transactional, as proposed by Jansen et al. Given a query suggestion, efforts have been made to understand the user intention using different means such as web search logs, previous users' search logs for the same query, clicked pages, the user's search session history, Wikipedia, WordNet and Google n-gram. Using the search query logs of existing users to identify intention cannot guarantee the correctness of search results. Search intent prediction along with query autocompletion is a less explored area. According to Cheng et al., many searches are triggered by browsed web pages.
Kong et al. tried to predict search intent using news articles browsed recently before the search. A large number of queries are triggered by news articles daily. Predicting search intent using browsed pages alone is inadequate. Our proposed method uses a live RSS newsfeed for query prediction and makes use of user profiles to predict the search intent.

Query Expansion

Query expansion is used to reformulate the original user query so as to improve the retrieval of search results and better satisfy user needs. One method is relevance feedback, which uses the returned results and adds new terms related to the original query and the selected documents. Other methods include adding relevant terms based on term frequency and document frequency in the top-ranked documents, co-occurrence-based techniques, thesaurus-based techniques, desktop-specific techniques, and the probability of terms over search logs. Our approach uses user-intention-based keyword addition to expand the original query and handle ambiguous query terms.

DATA DESCRIPTION

Data collection Methodology and Data Resources

The system uses different types of data sources. For the temporal contextual corpus, two elements are used. One is static contextual data based on the current month, and the second is dynamic contextual data based on daily current events. For the first, based on the parameter 'period', a month-wise list of occasions from the Hindu and Christian calendars is taken and a list of their associated keywords is built. For the second, the RSS news feed from Reuters is processed and a dataset of keywords is built. The temporal data is refreshed every day and also at restart of the server. This contextual data is generated for both English and Marathi, an Indian language used in the state of Maharashtra by more than 70 million people. A Marathi n-gram dataset was also created by crawling Marathi websites for about four months and processing the web pages, and it is available.
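The two-part temporal context described above (month-based static keywords plus daily news-based dynamic keywords) can be sketched as follows. This is only an illustration under stated assumptions: the month-to-keyword table, the function name `temporal_context`, and the sample keywords are hypothetical placeholders, and the keywords from the news feed are assumed to have been extracted already; the paper does not specify these details here.

```python
from datetime import date

# Hypothetical static context: month number -> occasion keywords.
# The entries are illustrative placeholders, not the authors' lists.
STATIC_KEYWORDS_BY_MONTH = {
    1: {"makar sankranti"},
    12: {"christmas"},
}


def temporal_context(today, daily_keywords, static_by_month=STATIC_KEYWORDS_BY_MONTH):
    """Merge month-based (static) and news-based (dynamic) keywords."""
    static = static_by_month.get(today.month, set())
    dynamic = {kw.strip().lower() for kw in daily_keywords if kw.strip()}
    return set(static) | dynamic


# Keywords assumed to have been extracted from the day's news feed already.
context = temporal_context(date(2016, 12, 25), ["Election", "budget "])
print(sorted(context))  # ['budget', 'christmas', 'election']
```

Rebuilding this set once per day and at server restart would correspond to the refresh policy mentioned in the text.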
The proposed algorithm also uses data from various sources such as Google n-gram and WordNet for English, and Marathi WordNet data for Marathi. How to use the contextual data described above to mine possible query autocompletions is discussed by Uma Gajendragadkar et al. Autocompletions for all sample test queries are collected from popular search engines for comparison. This is done for each character key press of all the test queries.

User Intention Based Query Expansion

The user profiles returned by the KNN (K Nearest Neighbor) algorithm are used as input to the FM5 algorithm. Let Z be a set of profiles such that Z = { ... }, where P(Z) is the probability of the known user profiles and ... is the probability of unknown user profiles. Let ... be the set of n query words and let { ... | ... } be the set of m intentions identified. A trial is conducted by collecting random samples