Topic‐sensitive search engine evaluation

DOI: https://doi.org/10.1108/14684521111193184
Pages: 893-908
Date: 29 November 2011
Published date: 29 November 2011
Author: Na Dai, Brian D. Davison
Subject Matter: Information & knowledge management, Library & information science
Na Dai and Brian D. Davison
Department of Computer Science and Engineering, Lehigh University,
Bethlehem, Pennsylvania, USA
Abstract
Purpose – This work aims to investigate the sensitivity of ranking performance with respect to the
topic distribution of queries selected for ranking evaluation.
Design/methodology/approach – The authors reweight queries used in two TREC tasks to make
them match three real background topic distributions, and show that the performance rankings of
retrieval systems are quite different.
Findings – It is found that search engines tend to perform similarly on queries about the same topic;
and search engine performance is sensitive to the topic distribution of queries used in evaluation.
Originality/value – Using experiments with multiple real-world query logs, the paper demonstrates
weaknesses in the current evaluation model of retrieval systems.
Keywords Search engines, Query stream, Query classification, Topic distribution, Ranking evaluation,
Function evaluation, Information retrieval, Functional analysis
Paper type Research paper
Introduction
As the world wide web has grown in size and popularity, so has the use of search
services to find (and re-find) information on the web. Logs of queries submitted to
search engines provide significant information for search engine maintainers,
designers, and researchers in information retrieval. The activities recorded help
provide feedback on feature usage (Spink et al., 1998, 1999; Spink and Jansen, 2004),
estimates of searcher satisfaction and prediction of future click activity (Piwowarski
and Zaragoza, 2007), training for ranking improvement (Joachims, 2002; Joachims et al.,
2005; Agichtein et al., 2006), and patterns of query reformulation (Bruza and Dennis,
1997; Spink et al., 2000; Joachims et al., 2007).
The contents of such queries also provide significant information about user
interests, express users’ information needs, and represent what users hope or expect to
find on the web. Therefore, understanding search query properties implicitly helps
build an objective ranking evaluation system that can help direct what a search service
should improve in order to enrich users’ search experience. For example the popularity
of queries related to “pumpkins” or “trick or treat”, etc., increases significantly around
Halloween, and therefore ranking improvements for those queries have a greater
influence on search engine performance evaluation during that time.
The current issue and full text archive of this journal is available at
www.emeraldinsight.com/1468-4527.htm
This material is based on work supported in part by the National Science Foundation under
grant numbers IIS-032885 and IIS-0545875, and by Microsoft (through its “Accelerating search”
programme). The authors particularly thank Microsoft for providing access to the query logs
and corresponding result sets. In addition they thank Xiaoguang Qi for his code and assistance in
snippet classification. The authors also thank the anonymous reviewers for their useful
comments.
Received 15 September 2010
Accepted 20 June 2011
Online Information Review
Vol. 35 No. 6, 2011
pp. 893-908
© Emerald Group Publishing Limited
ISSN 1468-4527
DOI 10.1108/14684521111193184
This paper is motivated by the idea that queries selected for ranking evaluation
should maximally represent the characteristics of query logs so that the overall
performance of prospective systems can be evaluated objectively. In this work we
focus on the topic distribution of the query sample used for ranking evaluation, where
the topic distribution denotes the topics of queries belonging to that sample in
aggregate. We argue that this characteristic has an important influence on the
objectivity of ranking evaluation: query logs record the history of users’ behaviour, so
the distribution of topics they contain portrays users’ interests.
Here we investigate the sensitivity of ranking performance with respect to the topic
distribution of queries selected for ranking evaluation. Specifically we demonstrate
that topical representativeness can be a significant factor influencing the objectivity of
search engine performance evaluation by showing evidence that search engines tend to
demonstrate more similar ranking performance on queries within the same topic.
Perhaps more importantly we demonstrate that the query sets selected for standard
retrieval evaluation in TREC (NIST, 2011) fail to match several real-world search logs
in terms of their topic distribution and thus rank retrieval systems differently from
how the systems will likely perform under a real-world query stream.
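To make the idea of topic sensitivity concrete, the following is a minimal sketch of how reweighting an evaluation sample towards a background topic distribution can change which system looks better. It is not the authors' procedure, which this excerpt does not specify; it assumes each query carries a single topic label and that a system's score is the mean of its per-topic averages, weighted by the target topic proportions. The function name, the topic labels, the scores and the "log" distribution are all invented for illustration.

```python
from collections import defaultdict


def topic_weighted_score(scores, topics, target_dist):
    """Reweight per-query scores so topics count as in target_dist.

    scores:      query id -> per-query metric (e.g. average precision)
    topics:      query id -> single topic label (an assumption of this sketch)
    target_dist: topic -> proportion of that topic in the target query log
    """
    by_topic = defaultdict(list)
    for qid, score in scores.items():
        by_topic[topics[qid]].append(score)

    # Only topics present in both the sample and the target can be weighted,
    # so the covered part of the target distribution is renormalised.
    covered = {t: p for t, p in target_dist.items() if p > 0 and t in by_topic}
    total = sum(covered.values())
    if total == 0:
        raise ValueError("no overlap between sample topics and target distribution")

    return sum(
        (p / total) * (sum(by_topic[t]) / len(by_topic[t]))
        for t, p in covered.items()
    )


# Toy data (invented for illustration, not taken from the paper).
topics = {"q1": "health", "q2": "health", "q3": "sports"}
system_a = {"q1": 0.8, "q2": 0.6, "q3": 0.2}
system_b = {"q1": 0.4, "q2": 0.4, "q3": 0.7}

# Topic mix of the evaluation sample itself: 2/3 health, 1/3 sports.
sample_dist = {"health": 2 / 3, "sports": 1 / 3}
# Hypothetical topic mix of a real query log.
log_dist = {"health": 0.2, "sports": 0.8}

for name, dist in [("sample", sample_dist), ("log", log_dist)]:
    a = topic_weighted_score(system_a, topics, dist)
    b = topic_weighted_score(system_b, topics, dist)
    print(f"{name}: A={a:.3f} B={b:.3f}")
```

With the sample's own topic mix, system A scores higher than system B; under the hypothetical log mix, which is dominated by the topic on which A is weak, the order reverses. Renormalising over covered topics is one design choice among several; a real study would also need to decide how to treat topics that the evaluation sample does not contain at all.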
In the remainder of this paper we provide additional background and related work,
introduce our dataset, and present our experimental results. We conclude with a
discussion of the value and limitations of our findings and a summary of our results.
Related work
In this section we review background material and prior work.
Automated query classification
To provide the right kind of search results, it is often important to know (or estimate)
the intent of the user. For example whether the user has a navigational interest or an
informational need (Broder, 2002; Rose and Levinson, 2004) can affect which
algorithms are most useful. As a result there is significant interest in automatic intent
classification (Kang and Kim, 2003; Lee et al., 2005; Jansen et al., 2007).
In general query classification of almost any kind is known to be difficult, primarily
because of the short and often ambiguous queries generated by searchers. However
some methods have been successful for query topic classification, e.g. utilising
additional unlabeled data (Taksa et al., 2007; Beitzel, Jensen, Lewis, Chowdhury and
Frieder, 2007) and bridging topic hierarchies to enable training on larger datasets (Li
et al., 2005; Vogel et al., 2005; Shen et al., 2006a). As a result query topic classification
can be useful in many tasks, including:
• phrase suggestion based on query topic (Jensen et al., 2006);
• web search personalisation (Liu et al., 2002);
• recognition of search multitasking, i.e. to watch for transitions in topics within
sessions, as in Ozmutlu et al. (2006);
• monetisation of search through relevant advertising (Broder et al., 2007); and
• the understanding and analysis of searcher topics of interest (Jansen and Spink,
2006).