A decision theoretic approach to combining information filtering

Published date: 25 September 2009
DOI: https://doi.org/10.1108/14684520911001918
Pages: 920-942
Author: Alexander Binun, Bracha Shapira, Yuval Elovici
Subject Matter: Information & knowledge management, Library & information science
Alexander Binun
Informatics III (Informatik III) Department, University of Bonn, Bonn,
Germany, and
Bracha Shapira and Yuval Elovici
Department of Information Systems Engineering and
Deutsche Telekom Laboratories, Ben-Gurion University, Beer-Sheva, Israel
Abstract
Purpose – The purpose of this paper is to present an extension to a framework based on the
information structure (IS) model for combining information filtering (IF) results. The main goal of the
framework is to combine the results of the different IF systems so as to maximise the expected payoff
(EP) to the user. In this paper we compare three different approaches to tuning the relevance
thresholds of individual IF systems that are being combined in order to maximise the EP to the user. In
the first approach we set the same threshold for each of the IF systems. In the second approach the
threshold of each IF system is tuned independently to maximise its own EP (“local optimisation”). In
the third approach the thresholds of the IF systems are jointly tuned to maximise the EP of the
combined system (“global optimisation”).
Design/methodology/approach – An empirical evaluation is conducted to examine the
performance of each approach using two IF systems based on somewhat different filtering
algorithms (TFIDF, OKAPI). Experiments are run using the TREC3, TREC6, and TREC7 test
collections.
Findings – The experiments reveal that, as expected, the third approach always outperforms the first
and the second, and that for some user profiles, the difference is significant. However, operational goals
argue against global optimisation, and the costs of meeting these operational goals are discussed.
Research limitations/implications – One limitation is the assumption of independence of the IF
systems: in real life systems usually use similar algorithms, so dependency might occur. The approach
also needs to be examined under the assumption of dependency between systems.
Practical implications – The main practical implications of this study lie in the empirical proof that
combination of filtering systems improves filtering results and the finding about the optimal
combination methods for the different user profiles. Many filtering applications exist (e.g. spam filters,
news personalisation systems, etc.) that can benefit from these findings.
Originality/value – The study presents and compares the contribution of three different combination
methods of filtering systems to the improvement of filtering results. It empirically shows the benefits of
each method and draws important conclusions about the combination of filtering systems.
Keywords Information control, Information modelling, Information retrieval
Paper type Research paper
Online Information Review, Vol. 33 No. 5, 2009, pp. 920-942
© Emerald Group Publishing Limited, 1468-4527
Refereed article received 14 October 2008; approved for publication 4 May 2009
The current issue and full text archive of this journal is available at www.emeraldinsight.com/1468-4527.htm
Introduction
Many information retrieval (IR) studies have shown that the user benefit of the output of
a combination of several systems is higher than that of each individual system (Bartell
et al., 1994; Croft, 2000; Fox and Shaw, 1994; Lee, 1997; Saracevic and Kantor, 1988). A
review of system combinations by Croft (2000) identified the following four approaches:
combination of multiple representations of documents in a single search; combination of
different queries as additional evidence of the searcher’s information needs; combination
of ranking algorithms; and combination of output from different search systems.
The framework presented by Elovici et al. (2005) performs the last type of
combination, i.e. it fuses the outputs of two information filtering (IF) systems,
selecting the fusion strategy that maximises the user’s expected payoff (EP) from
the combined system. The combination strategy is thus
dictated by user preferences (also known as user profiles). In this approach, IF systems
are treated as information structures (ISs) based on their performance characteristics
(which may be expressed by precision and recall); the optimal combination strategy to
achieve maximal payoff for the user is then derived. However, the observed precision
and recall of each system X depend on its relevance threshold T (that is, X treats the first T
documents at the top of its output stream as relevant). The original framework
presented by Elovici et al. (2005) did not
elaborate on how to set the relevance thresholds of the IF systems being combined in order to
maximise the EP of the user (i.e. how to tune each IF system optimally).
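The dependence of precision and recall on the threshold T can be illustrated with a minimal Python sketch. The function name, the document identifiers and the relevance judgments below are hypothetical and serve only as an illustration; they are not taken from the paper or from the TREC collections.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at(
    ranked: Sequence[str], relevant: Set[str], t: int
) -> Tuple[float, float]:
    """Precision and recall when the top t ranked documents are treated as relevant."""
    if t <= 0:
        return 0.0, 0.0
    delivered = ranked[:t]
    hits = sum(1 for doc in delivered if doc in relevant)
    precision = hits / len(delivered) if delivered else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked output of one IF system and hypothetical relevance judgments.
ranked_output = ["d3", "d7", "d1", "d9", "d4", "d2"]
truly_relevant = {"d3", "d1", "d2"}

for t in (1, 3, 6):
    p, r = precision_recall_at(ranked_output, truly_relevant, t)
    print(f"T={t}: precision={p:.2f}, recall={r:.2f}")
```

In this toy ranking, raising T from 1 to 6 increases recall while precision falls, which is why the choice of T affects the payoff a user receives.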
In this paper we extend the IF systems combination framework described by Elovici
et al. (2005) by analysing three approaches to calibrating the threshold of each of the
systems being combined. We clarify the contribution of fusion of output of IF systems
and analyse the effect of the thresholding method on the improvement.
In the first combination approach, we set the same threshold for each system and
analyse the combined output for different thresholds. In the second approach, the
threshold of each IF system is tuned independently to maximise its EP. Thus, each of
the systems being combined is optimally tuned before it is combined with the output of
the other systems. By conducting experiments on several TREC collections we show
that the result produced by this approach is often significantly worse than the optimal
one. The third approach is based on setting the thresholds of the systems to maximise
the EP of the combined system. We discuss the results of the experiments and suggest
how to improve the second approach, which has operational advantages over the third
approach, although the third provides the best results.
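The difference between the three approaches can be sketched in Python under simplifying assumptions. The payoff function (a fixed gain per relevant document delivered minus a fixed cost per non-relevant one) and the fusion rule (union of each system's top-T documents) are stand-ins chosen for illustration; they are not the IS-model expected payoff or the combination strategy of Elovici et al. (2005). All names and data are hypothetical.

```python
from itertools import product
from typing import Sequence, Set, Tuple


def payoff(delivered: Set[str], relevant: Set[str], gain: float, cost: float) -> float:
    """Assumed payoff: reward each relevant document, penalise each non-relevant one."""
    hits = len(delivered & relevant)
    return gain * hits - cost * (len(delivered) - hits)


def combine(a: Sequence[str], b: Sequence[str], t_a: int, t_b: int) -> Set[str]:
    """Assumed fusion rule: deliver the union of each system's top-T documents."""
    return set(a[:t_a]) | set(b[:t_b])


def local_thresholds(
    a: Sequence[str], b: Sequence[str], relevant: Set[str], gain: float, cost: float
) -> Tuple[int, int]:
    """Second approach: tune each system alone to maximise its own payoff."""
    best_a = max(range(len(a) + 1), key=lambda t: payoff(set(a[:t]), relevant, gain, cost))
    best_b = max(range(len(b) + 1), key=lambda t: payoff(set(b[:t]), relevant, gain, cost))
    return best_a, best_b


def global_thresholds(
    a: Sequence[str], b: Sequence[str], relevant: Set[str], gain: float, cost: float
) -> Tuple[int, int]:
    """Third approach: search all threshold pairs for the best combined payoff."""
    grid = product(range(len(a) + 1), range(len(b) + 1))
    return max(grid, key=lambda ts: payoff(combine(a, b, *ts), relevant, gain, cost))


# Hypothetical ranked outputs of two IF systems and hypothetical judgments.
system_a = ["d1", "d2", "x1", "d3"]
system_b = ["d1", "d2", "x2", "d3"]
relevant = {"d1", "d2", "d3"}
gain, cost = 1.0, 0.5  # one hypothetical user profile

uniform = payoff(combine(system_a, system_b, 2, 2), relevant, gain, cost)
t_local = local_thresholds(system_a, system_b, relevant, gain, cost)
local = payoff(combine(system_a, system_b, *t_local), relevant, gain, cost)
t_global = global_thresholds(system_a, system_b, relevant, gain, cost)
best = payoff(combine(system_a, system_b, *t_global), relevant, gain, cost)

print(f"uniform T=2: {uniform}, local {t_local}: {local}, global {t_global}: {best}")
```

On the data it is tuned on, the jointly tuned pair can never score lower than the other two, because its search space contains every pair they could select. In this toy case each locally tuned system accepts one non-relevant document to reach a third relevant one, so the combined output pays for both non-relevant documents, whereas joint tuning pays for only one.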
The experiments show that choosing the optimal threshold for each concrete user
profile yields much better results than those achieved when the threshold is constant
for all user profiles. We also note that the optimal thresholds strongly depend on the
specific collection. For example, we found and recorded the optimal thresholds for
TREC3. When these thresholds were applied to the TREC6 test set, the results were
significantly worse than the optimal ones.
The rest of the paper is organised as follows: the relevant background is presented
including a brief review of the IS model and its application to IF systems, followed by
an overview of related combination studies. Then we detail the approaches for setting
the relevance threshold parameters. The next section presents the results of evaluating
the performance of each of the threshold setting approaches, and the paper concludes
with a discussion and suggests future directions.
Background
Review of IR algorithms
Around 1978-1980 several IR engines were already in use. It appeared that
different retrieval engines returned quite dissimilar sets of relevant documents (Croft