Document text characteristics affect the ranking of the most relevant documents by expanded structured queries

Pages: 358-376
DOI: https://doi.org/10.1108/EUM0000000007087
Published date: 01 June 2001
Authors: Eero Sormunen, Jaana Kekäläinen, Jussi Koivisto, Kalervo Järvelin
Subject matter: Information & knowledge management, Library & information science
DOCUMENT TEXT CHARACTERISTICS AFFECT THE RANKING OF
THE MOST RELEVANT DOCUMENTS BY EXPANDED
STRUCTURED QUERIES
EERO SORMUNEN, JAANA KEKÄLÄINEN, JUSSI KOIVISTO
and KALERVO JÄRVELIN
{lieeso, lijakr, lijuko, likaja}@uta.fi
Department of Information Studies, University of Tampere
Finland
The increasing flood of documentary information through the
Internet and other information sources challenges the developers of
information retrieval systems. It is not enough that an IR system is
able to make a distinction between relevant and non-relevant
documents. The reduction of information overload requires that IR
systems provide the capability of screening the most valuable
documents out of the mass of potentially or marginally relevant
documents. This paper introduces a new concept-based method to
analyse the text characteristics of documents at varying relevance
levels. The results of the document analysis were applied in an
experiment on query expansion (QE) in a probabilistic IR system.
Statistical differences in textual characteristics of highly relevant and
less relevant documents were investigated by applying a facet
analysis technique. In highly relevant documents a larger number of
aspects of the request were discussed, searchable expressions for the
aspects were distributed over a larger set of text paragraphs, and a
larger set of unique expressions was used per aspect than in
marginally relevant documents. A query expansion experiment
verified that the findings of the text analysis can be exploited in
formulating more effective queries for best match retrieval in the
search for highly relevant documents. The results revealed that
expanded queries with concept-based structures performed better
than unexpanded queries or ‘natural language’ queries. Further, it
was shown that highly relevant documents benefit substantially more
from the concept-based QE in ranking than marginally relevant
documents.
1. INTRODUCTION
Fundamental problems of IR experiments are linked to the complex notion of
relevance [1–6]. One of the problems is that in most laboratory experiments
documents are judged either relevant or irrelevant with regard to the request.
Binary relevance cannot reflect the possibility that documents may be relevant to
different degrees; some documents contribute more information to the request,
some less without being totally irrelevant. Relevance has been assessed at multiple
levels in some studies of operational Boolean systems, but even then the levels
have been conflated into two categories at the analysis phase for the
Journal of Documentation, vol. 57, no. 3, May 2001, pp. 358–376
calculation of precision and recall [e.g. 7–9]. We therefore do not know how different
best match IR methods are able to rank documents of varying relevance
levels.
The need for IR methods that are more selective in retrieving highly relevant
documents is quite obvious in large databases like those provided by the Internet
search services [10, 11]. As more documents become available, the number of
potentially relevant items increases. From the user’s viewpoint, the major challenge
for IR systems is not how to differentiate between relevant and non-relevant
documents but rather how to separate the highly relevant documents from the
potentially relevant ones. In the evaluation of IR systems, this challenge causes pressure to raise the
threshold for what is accepted as relevant, i.e. what is relevant enough.
One interpretation of the degree of relevance is that highly relevant documents
tend to convey more information about the topic of interest than marginally relevant
ones. From this viewpoint one may hypothesise that highly relevant documents
tend to have the following characteristics:
1 (a) the topic is discussed in them at length;
(b) they deal with several aspects of the topic;
(c) they have many words pertaining to the topic of the request;
(d) authors use multiple unique expressions to refer to the concepts they
discuss in order to avoid tautology.
In contrast, marginal documents mention the topic briefly, present just one
aspect or contain just a few words referring to the topic, or discuss the topic from a
viewpoint not included in the request; no problem of tautology occurs in them. In
this paper, we test these hypotheses by analysing document text characteristics
(expressions used and concepts referred to) through the facet analysis technique
developed by Sormunen [12, 13].
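The hypothesised characteristics above can be made concrete with a small counting sketch. It is only an illustration, not the facet analysis procedure of [12, 13]: the function name `facet_statistics`, the `facets` dictionary (aspect name mapped to its searchable expressions), and the simple substring matching are all assumptions introduced here.

```python
from typing import Dict, List


def facet_statistics(paragraphs: List[str],
                     facets: Dict[str, List[str]]) -> Dict[str, float]:
    """Count simple text characteristics of one document.

    `facets` is a hypothetical mapping from an aspect of the request to the
    expressions that may refer to it. Matching is naive, case-insensitive
    substring search (an assumption; real analysis would need morphology).
    """
    lowered = [p.lower() for p in paragraphs]
    aspects_covered = 0
    unique_expressions = 0
    matching_paragraphs = set()

    for expressions in facets.values():
        found = set()
        for expr in expressions:
            needle = expr.lower()
            for index, text in enumerate(lowered):
                if needle in text:
                    found.add(needle)
                    matching_paragraphs.add(index)
        if found:
            aspects_covered += 1
            unique_expressions += len(found)

    per_aspect = unique_expressions / aspects_covered if aspects_covered else 0.0
    return {
        "aspects_covered": aspects_covered,
        "paragraphs_with_expressions": len(matching_paragraphs),
        "unique_expressions_per_aspect": per_aspect,
    }


# Toy document and facets, invented purely for illustration.
example_paragraphs = [
    "Query expansion improves ranking in IR systems.",
    "Expanded queries help retrieval.",
    "The weather was fine.",
]
example_facets = {
    "expansion": ["query expansion", "expanded queries"],
    "retrieval": ["retrieval", "ranking"],
    "evaluation": ["precision"],
}
stats = facet_statistics(example_paragraphs, example_facets)
```

Under the hypotheses, a highly relevant document would be expected to score higher on all three counts than a marginally relevant one for the same request.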
In best match retrieval, documents are ranked according to scores calculated
from the weights of search keys occurring in documents. These weights are typically
based on the frequency of a key in a document and on the inverse collection
frequency of the documents containing the key (tf.idf weighting) [14]. The development
of tf.idf weighting schemes has been based on statistical hypotheses about
document characteristics similar to those presented above (characteristic 1c).
However, we will emphasise in this paper that the analysis of document texts can
be elaborated, and further that the effectiveness of best match queries can be
improved, especially in retrieving the most valuable documents.
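As a minimal sketch of the best match scoring described above, the following computes a plain tf.idf score per document. It assumes raw term frequency and a logarithmic inverse document frequency, log(N/df); this is a textbook variant chosen for illustration and not necessarily the weighting used by the probabilistic system in the experiment. The function name `tfidf_scores` and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_scores(query_keys: List[str],
                 documents: Dict[str, List[str]]) -> Dict[str, float]:
    """Score each document by summing tf * idf over the distinct query keys."""
    n_docs = len(documents)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in documents.items()}
    keys = set(query_keys)

    # Document frequency: number of documents containing each key.
    df = {key: sum(1 for counts in term_counts.values() if key in counts)
          for key in keys}

    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for key in keys:
            if df[key] == 0:
                continue  # key occurs nowhere in the collection
            idf = math.log(n_docs / df[key])
            score += counts[key] * idf  # a Counter returns 0 for absent keys
        scores[doc_id] = score
    return scores


# Toy collection of tokenised documents, invented for illustration.
toy_documents = {
    "d1": ["query", "expansion", "query"],
    "d2": ["query", "structure"],
    "d3": ["weather"],
}
toy_scores = tfidf_scores(["expansion", "query"], toy_documents)
ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

Documents that contain more occurrences of rarer query keys rise in the ranking, which corresponds to characteristic 1c.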
Query structure refers to the syntactic structure of a query expression, marked
with query operators and parentheses. Best match queries may either have a structure
similar to Boolean queries, or queries may be ‘natural language queries’
without differentiated relations between search keys. In the former case, concepts
are identified (henceforth concept-based or strong structures); in the latter, concepts
are not identified, queries are mere sets of search keys, ‘natural language
queries’ (henceforth weak structures). In the mainstream of experimental IR
research weak query structures are nearly exclusively employed. However, recent
findings have shown the positive influence of concept-based query structuring.
For instance, strong query structures improve retrieval performance when queries
are expanded [15, 16]. The positive effect of strong query structures seems to hold
