Evaluating the effectiveness of information retrieval systems using effort-based relevance judgment

Aslib Journal of Information Management, Vol. 71 No. 1, 2019, pp. 2-17
DOI: https://doi.org/10.1108/AJIM-04-2018-0086
Published: 21 January 2019 (received 19 April 2018; revised 18 July 2018 and 2 September 2018; accepted 13 September 2018)
Funding: This research was supported by UMRG Programme RP059A-17SBS from Universiti Malaya and the Ministry of Higher Education, Malaysia.

Prabha Rajagopal and Sri Devi Ravana
University of Malaya, Kuala Lumpur, Malaysia
Yun Sing Koh
University of Auckland, Auckland, New Zealand, and
Vimala Balakrishnan
University of Malaya, Kuala Lumpur, Malaysia
Abstract
Purpose – Effort, in addition to relevance, is a major factor in the satisfaction and utility of a document to the actual user. The purpose of this paper is to propose a method for generating relevance judgments that incorporate effort without the involvement of human judges. The study then determines the variation in system rankings due to low-effort relevance judgments when evaluating retrieval systems at different depths of evaluation.
Design/methodology/approach – Effort-based relevance judgments are generated using a proposed boxplot approach for simple document features, HTML features and readability features. The boxplot approach is a simple yet repeatable way of classifying document effort while ensuring that outlier scores do not skew the grading of the entire set of documents.
Findings – Evaluating retrieval systems with low-effort relevance judgments has a stronger influence at shallow depths of evaluation than at deeper depths. The study shows that the difference in system rankings is due to low-effort documents and not to the number of relevant documents.
Originality/value – It is therefore crucial to evaluate retrieval systems at shallow depths using low-effort relevance judgments.
Keywords Information system, Information retrieval, TREC, Large-scale experimentation,
Relevance judgments, System-oriented evaluation
Paper type Research paper
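The abstract groups the signals used to estimate effort into simple document features, HTML features and readability features, but does not enumerate them in this excerpt. As a purely illustrative sketch, one feature of each kind might be computed as follows; the function names and the choice of Flesch Reading Ease are assumptions for illustration, not the authors' feature set.

```python
import re

def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores indicate easier (lower-effort) text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # Crude syllable estimate: count vowel groups per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

def effort_features(text, html=""):
    """Illustrative feature set: one simple, one HTML and one readability feature."""
    return {
        "doc_length": len(re.findall(r"\S+", text)),        # simple document feature
        "num_links": len(re.findall(r"<a\s", html, re.I)),   # HTML feature
        "reading_ease": flesch_reading_ease(text),           # readability feature
    }
```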
1. Introduction
There are two categories of information retrieval evaluation: system-oriented evaluation and user-oriented evaluation. One of the important aspects of system-oriented evaluation is the relevance judgments of a test collection, which record the relevance of documents with respect to queries. In the TREC environment, topic experts judge the documents with regard to each query. In user-oriented evaluations, the interaction of actual users with the retrieval systems is measured. Nonetheless, one of the disagreements between the two evaluation categories concerns the relevance assigned by expert judges versus the utility of the documents to actual users (Yilmaz et al., 2014).
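In TREC test collections these judgments are distributed as plain-text qrels files, one "topic-id iteration document-id relevance" record per line. The following is a minimal sketch of loading such a file into a lookup structure; the file name and example identifiers are illustrative only, not taken from this paper.

```python
from collections import defaultdict

def load_qrels(path):
    """Load TREC-style qrels: 'topic iteration docno relevance' per line."""
    qrels = defaultdict(dict)  # topic -> {docno: relevance grade}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 4:
                continue  # skip malformed lines
            topic, _iteration, docno, relevance = parts[:4]
            qrels[topic][docno] = int(relevance)
    return qrels

# Hypothetical usage:
# qrels = load_qrels("qrels.trec.txt")
# print(qrels["301"].get("FT911-3", 0))
```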
System-oriented evaluation has always prioritized relevance as the basis of user satisfaction; therefore, relevance is used to measure the effectiveness of retrieval systems. However, recent studies highlighted that the effort involved in retrieving relevant documents is equally important for user satisfaction (Verma et al., 2016; Yilmaz et al., 2014). Effort in this context refers to the amount of work needed by the user to find and identify the relevant content in a document, and it can be classified as low effort or high effort.
Low effort indicates that less work is needed by the user to identify the relevant content within a document, whereas high effort requires more work from the user to identify the relevant content.
Real users give up easily and do not put in as much effort as expert judges when identifying relevance in a document (Villa and Halvey, 2013; Yilmaz et al., 2014). Therefore, a retrieval system that returns low-effort documents is preferred by the user, as it requires less effort to identify the relevance of the documents than a system that retrieves high-effort documents. Effort, in addition to relevance, is a major factor in the satisfaction and utility of a document to the actual user. Consequently, it is vital to evaluate retrieval systems based on the amount of effort needed to identify the relevance of documents, in order to ensure user satisfaction.
Previous studies measured the importance of effort in various ways (Verma et al., 2016; Villa and Halvey, 2013), but a limited depth of evaluation and a limited number of retrieval systems (top ten only) were used to show the differences in system rankings due to low-effort relevance judgments. The differences in system rankings between original and low-effort relevance judgments beyond evaluation depth 10 within a test collection are unknown. These differences may vary widely because larger numbers of relevant documents are found at deeper depths of evaluation. Therefore, this study asks how system rankings change when systems are evaluated beyond depth 10 using low-effort relevance judgments. Depending on the type of user, real users may look beyond ten retrieved documents to fulfill their query (Sanderson, 2010); hence, it is necessary to evaluate retrieval systems beyond evaluation depth 10. Consequently, this study explores deeper depths of evaluation (up to 1,000), in contrast to a previous study (Verma et al., 2016) that evaluated only at depth 10 using the effectiveness metric P@10. This study also asks whether any change in system rankings arises from the low-effort relevance judgments themselves, since changes could also be due to the number of relevant documents; thus, the cause of any change in system rankings needs to be determined.
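For clarity, the effectiveness metric mentioned above, P@10, is precision computed over the top ten retrieved documents; evaluating at a deeper depth simply extends the cut-off. A minimal sketch, with made-up document identifiers purely for illustration:

```python
def precision_at_k(ranked_docs, relevant_docs, k=10):
    """Precision at depth k: fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_docs[:k]
    hits = sum(1 for doc in top_k if doc in relevant_docs)
    return hits / k

# Hypothetical run scored at a shallow and a deep cut-off.
run = ["d3", "d7", "d1", "d9", "d2", "d5", "d8", "d4", "d6", "d0"] + ["dx%d" % i for i in range(990)]
relevant = {"d1", "d2", "d5", "d9"}
print(precision_at_k(run, relevant, k=10))    # 0.4
print(precision_at_k(run, relevant, k=1000))  # 0.004
```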
Reducing the amount of work needed for judging relevance has long been a focus of the research community. Any advancement that reduces the workload of relevance judgments without jeopardizing the quality of evaluation is an added advantage (Guiver et al., 2009). Nevertheless, asking judges to report the amount of effort needed for relevance judgment (Verma et al., 2016; Yilmaz et al., 2014) may introduce variation in both the judgments (Carterette and Soboroff, 2010; Chandar et al., 2013; Scholer et al., 2011; Webber et al., 2012) and the reported effort. One way to overcome these drawbacks is to minimize or eliminate the involvement of human judges in obtaining effort information. Therefore, this study attempts to create the relevance judgments without human participation, which differs from previous studies (Verma et al., 2016; Yilmaz et al., 2014) that involved human assessors.
This study aims to propose a method for generating relevance judgments that incorporate effort without the involvement of human judges. The study also aims to determine the variation in system rankings due to low-effort relevance judgments when evaluating retrieval systems at different depths of evaluation. The major contribution of this study is to highlight the importance of low-effort relevant documents in retrieval system evaluation at different depths of evaluation.
Section 2 presents the research background, discussing the importance of effort in user satisfaction and the effect of relevance on system performance. Section 3 describes the effort feature classification, and Section 4 addresses the classification of document grades using the boxplot approach. The results and discussion section follows. Finally, the conclusion is drawn and future work is proposed.
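Sections 3 and 4 fall outside this excerpt, but the abstract describes the boxplot approach as grading document effort from quartile statistics so that outlier scores cannot skew the grades. A minimal sketch of what such quartile-based grading could look like follows; the cut-offs, grade labels and outlier clamping shown here are assumptions for illustration, not the authors' exact scheme.

```python
import statistics

def boxplot_effort_grades(scores):
    """Grade effort scores against boxplot statistics (quartiles and whiskers).

    Sketch of a quartile-based grading scheme; the paper's actual cut-offs and
    grade labels are defined in Section 4 and are not reproduced here.
    """
    q1, _median, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    lower_whisker, upper_whisker = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    grades = []
    for s in scores:
        s = min(max(s, lower_whisker), upper_whisker)  # clamp outliers so they cannot skew grading
        if s <= q1:
            grades.append("low effort")
        elif s >= q3:
            grades.append("high effort")
        else:
            grades.append("medium effort")
    return grades

# Hypothetical usage with per-document effort scores:
# print(boxplot_effort_grades([3.2, 4.1, 2.8, 15.0, 3.9, 4.4, 2.5, 3.1]))
```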
2. Research background
The following subsections highlight the importance of effort in user satisfaction and the effect of relevance on system performance.