Comparative evaluation of web search engines in health information retrieval

Carla Teixeira Lopes and Cristina Ribeiro
Department of Informatics Engineering, University of Porto, Porto, Portugal

Online Information Review, Vol. 35 No. 6, 2011, pp. 869-892. © Emerald Group Publishing Limited, ISSN 1468-4527.
Received 20 September 2010; accepted 27 March 2011; published 29 November 2011.
DOI: https://doi.org/10.1108/14684521111193175
Subject matter: Information & knowledge management; Library & information science
Funding: This work is funded by the Foundation for Science and Technology under grant SFRH/BD/40982/2007.
Abstract
Purpose – The intent of this work is to evaluate several generalist and health-specific search engines
for retrieval of health information by consumers: to compare the retrieval effectiveness of these
engines for different types of clinical queries, medical specialties and condition severity; and to
compare the use of evaluation metrics for binary relevance scales and for graded ones.
Design/methodology/approach – The authors conducted a study in which users evaluated the
relevance of documents retrieved by four search engines for two different health information needs.
Users could choose between generalist (Bing, Google, Sapo and Yahoo!) and health-specific
(MedlinePlus, SapoSaúde and WebMD) search engines. The authors then analysed the differences
between search engines and groups of information needs with six different measures: graded average
precision (gap), average precision (ap), gap@5, gap@10, ap@5 and ap@10.
Findings – The results show that generalist web search engines surpass the precision of
health-specific engines. Google has the best performance, mainly in the top ten results. It was found
that information needs associated with severe conditions are associated with higher precision, as are
overview and psychiatry questions.
Originality/value – The study is one of the first to use a recently proposed measure to evaluate the
effectiveness of retrieval systems with graded relevance scales. It includes tasks from several medical
specialties, types of clinical questions and different levels of severity which, to the best of the authors’
knowledge, has not been done before. Moreover, users have considerable involvement in the
experiment. The results help in understanding how search engines differ in their responses to health
information needs, what types of online health information are more common on the web and how to
improve this type of search.
Keywords Evaluation, Health information retrieval, User studies, Graded relevance, Web search engines, Information retrieval, Medical informatics
Paper type Research paper
Introduction
Patients and their families and friends, commonly designated as health consumers, are
increasingly using the web to search for health information. The most recent Pew Internet
report on health information (Fox and Jones, 2009) reveals that 61 percent of US adults
look online for health information. Among internet users this proportion rises to 83
percent. A previous study reported that 66 percent of health information sessions start
at generalist search engines and 27 percent start at health-specific web sites (Fox,
2006). Large companies in the information retrieval sector have been developing
health-specific services, e.g. Google Health and Bing Health.
According to Hersh (2008), the amount and quality of evaluation research have not
addressed the changes in information retrieval caused by the ubiquity of the web. In
his opinion the number of studies that evaluate the performance of web search systems
in health is surprisingly small. We focus on consumers because, in health information
retrieval, they receive less attention than professionals (Lopes and Ribeiro, 2010a).
This study evaluates the performance of four generalist search engines (Google, Bing,
Yahoo! and Sapo) and three health-specific search engines (MedlinePlus, WebMD and
SapoSaúde). The evaluation is based on the data collected in a user study with
undergraduate students and work tasks defined according to the framework proposed by
Borlund (2003). Besides an overall comparison the search engines are differentiated by
their performance on different clinical questions, medical specialties and levels of severity.
We start by reviewing previous work on the evaluation of web search engines and,
more specifically, on their evaluation in the health domain. Next we describe our
methodology, present the study and discuss our results. We conclude with some final
remarks.
Literature review
Evaluation in information retrieval
Information retrieval (IR) is a highly empirical field in which evaluation is essential to
demonstrate the performance of new techniques (Manning et al., 2008). Test collections,
together with their associated evaluation measures, have been the dominant evaluation
standard since the early 1950s (Sanderson, 2010). Since 1992 TRECs (Text REtrieval
Conferences) have been a major forum in which research evaluated using this model is
discussed. The use of test collections is particularly well suited to system-oriented
performance evaluations that focus on specific aspects of systems.
There are also experimental methods involving the user which have been promoted
by Ingwersen and Järvelin (2005) and Borlund (2003). Ingwersen (2009) identified three
major types of research methods involving users:
(1) ultra-light IR interaction experiments;
(2) interactive light IR experiments; and
(3) naturalistic IR field studies in the context of, for instance, an organisational
setting.
The first focuses on short-term IR interaction composed of one or two retrieval runs. The
second entails session-based multi-run interaction with more intensive monitoring such
as log analysis, interviews and observation. These studies can be run in a laboratory, in
naturalistic settings or on the internet through what Sanderson (2010) calls live labs.
Another method that has been growing in popularity since the appearance of web search
engines involves the study of user behaviour using query logs.
The two most popular measures for IR effectiveness are precision and recall
(Manning et al., 2008). Precision is the fraction of retrieved documents that are relevant
and recall is the fraction of relevant documents that are retrieved. Together with the F
measure, the weighted harmonic mean of precision and recall, these are the most
commonly used measures in unranked retrieval. In a ranked retrieval context
precision-recall curves can be plotted. A very common measure is the mean average
precision (MAP) and, in scenarios such as the web in which it is important to have good
results on the first pages, precision is also measured at fixed levels of retrieval
(e.g. precision at 10).
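To make these definitions concrete, the following Python sketch computes the ranked measures for a single query. The function names and the judgment list are illustrative rather than taken from the paper, and the binary average precision shown here is the quantity that the graded measure (gap) used later in the study generalises to non-binary relevance scales.

```python
from typing import Sequence


def precision_at_k(rels: Sequence[int], k: int) -> float:
    """Precision over the top-k ranked results; rels[i] is 1 if the
    document at rank i + 1 was judged relevant, 0 otherwise."""
    return sum(rels[:k]) / k


def average_precision(rels: Sequence[int], n_relevant: int) -> float:
    """Binary average precision: precision is accumulated at every rank
    holding a relevant document and divided by the total number of
    relevant documents known for the query."""
    hits, acc = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            acc += hits / rank
    return acc / n_relevant if n_relevant else 0.0


# Hypothetical judgments for one query: ranks 1, 3 and 4 are relevant,
# and the collection is assumed to hold four relevant documents in total.
rels = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
print(precision_at_k(rels, 10))               # 0.3
print(average_precision(rels, n_relevant=4))  # (1/1 + 2/3 + 3/4) / 4 ≈ 0.60
# MAP is simply the mean of average_precision over all queries in a test set.
```

The ap@5 and ap@10 variants reported in the study restrict the same computation to the first five or ten retrieved results, reflecting the web scenario in which users rarely look past the first page.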