INFORMATION RETRIEVAL TEST COLLECTIONS

DOI: https://doi.org/10.1108/eb026616
Published: 1 January 1976
Pages: 59-75
Authors: K. SPARCK JONES, C. J. VAN RIJSBERGEN
Subject matter: Information & knowledge management; Library & information science
PROGRESS IN DOCUMENTATION
INFORMATION RETRIEVAL TEST COLLECTIONS*
K. SPARCK JONES and C. J. VAN RIJSBERGEN
Computer Laboratory, University of Cambridge
Many retrieval experiments have been based on inadequate test collections,
and current research is hampered by the lack of proper collections. This short
review does not attempt a fully documented survey of all the collections
used in the past decade: hopefully representative examples have been studied
to throw light on the requirements test collections should meet, to show
how past collections have been defective, and to suggest guidelines for a
future 'ideal' test collection. The specifications for this collection can be
taken as an indirect comment on our present state of knowledge of major
retrieval system variables, and experience in conducting experiments.
TESTS AND TEST COLLECTIONS
Information retrieval experiments, particularly involving automatic indexing and searching, have been carried out for some fifteen years. The Cranfield 1 project may be taken as a starting point. These experiments have differed in purpose, scope, and scale, in methodology, data, and results. One major difference is between tests associated with operational systems, of which Lancaster's MEDLARS investigation is an example, and independent ones like those carried out by the SMART Project. Some aspects of retrieval testing have received a good deal of attention: methodology, and specifically the measurement of performance, have been extensively discussed, sometimes in the context of global systems models. In recent years the proper conduct of tests of on-line search techniques has presented new problems of controlling human factors and measuring performance.
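The measurement of performance mentioned above rests on comparing a system's retrieved set against a test collection's relevance judgments. The sketch below illustrates the two classic measures, precision and recall; the document identifiers and judgments are invented for illustration, not taken from any collection named here.

```python
# A minimal sketch of retrieval performance measurement:
# precision = relevant retrieved / total retrieved,
# recall    = relevant retrieved / total relevant.
# All identifiers below are hypothetical.

def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query's retrieved set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical relevance judgments for one query in a small test collection.
judged_relevant = ["d3", "d7", "d9", "d12"]
system_output = ["d3", "d5", "d7", "d8"]

p, r = precision_recall(system_output, judged_relevant)
print(p, r)  # 2 of 4 retrieved are relevant; 2 of 4 relevant are retrieved
```

Note that both measures are defined entirely by the collection's relevance judgments, which is why the quality of those judgments constrains what any test can show.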
But though some aspects of testing have been extensively discussed, the characteristics of retrieval test data, and their implications for the conduct of tests and interpretation of results, have been relatively neglected. The requirements that a test collection should satisfy have often not been explicitly considered, and many project reports say very little about the design criteria that were adopted when the test collections used were set up. In consequence it is difficult to know whether experimental results should be attributed to the particular techniques adopted, say in indexing, or to essential properties of the test material, like the subject spread of the document set. These problems are exacerbated by the fact that individual projects typically work with their own data, though some test collections, like the Cranfield 2 one, have been exploited by different projects. This makes it extremely difficult to compare the results obtained by different projects in some areas, for example that of index term weighting; so cumulative progress in understanding how retrieval systems work, through the correlation of a range of results, is very slow.
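Index term weighting, cited above as an area where results are hard to compare, can be sketched under common assumptions as inverse document frequency weighting: terms occurring in few documents receive higher weights. The particular formula and the toy collection below are illustrative only, not drawn from the article or from any project named here.

```python
import math

# Illustrative sketch of one common index term weighting scheme
# (inverse document frequency). The toy "collection" is invented.

collection = [
    {"boundary", "layer", "flow"},
    {"heat", "transfer", "boundary"},
    {"supersonic", "flow"},
    {"laminar", "flow", "layer"},
]

def idf_weight(term, docs):
    """log(N / n_t): N documents in total, n_t containing the term."""
    n_t = sum(1 for d in docs if term in d)
    if n_t == 0:
        return 0.0
    return math.log(len(docs) / n_t)

print(idf_weight("flow", collection))        # common term, low weight
print(idf_weight("supersonic", collection))  # rare term, high weight
```

The difficulty the authors point to is visible even in this sketch: a weight depends on the whole collection, so the same weighting formula tested on two differently constituted collections can yield results that are not directly comparable.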
* Editorial note. We are greatly indebted to the authors for preparing this survey at very short notice. The commissioned article was not forthcoming as promised.
Journal of Documentation, vol. 32, no. 1, March 1976, pp. 59-75.
It is evident that many of the test collections used in the past were inadequate in being small, and/or carelessly constructed, and/or inappropriate in character. The same criticisms apply to some of the collections currently available in machine-readable form for general use. For example it must be admitted that the widely used Cranfield subcollection of 42 requests and 200 documents is too small for proper experiments. Such collections are unsatisfactory on purely statistical grounds, and the increasing scale of operational systems makes results obtained with them appear irrelevant to real systems. At the same time it is by no means evident that research in information retrieval is no longer needed and that modern on-line systems, for example, are incapable of improvement through the application of results obtained from experiments. Even if the topics of future research differ from those of past research, test data is still required. Thus on the one hand past results, which suffer from being disconnected and limited, require validation by tests with adequate data; and on the other future projects need equally adequate data with which to work. It is unfortunately the case that the preparation of a test collection of any useful size and generality is a major enterprise, particularly where this includes bringing the collection up to a well-defined and convenient machine-readable format. Since many projects have set up collections for immediate use for their own purposes only, there has been a great deal of duplicated and wasted effort in data preparation.
For these reasons there is a growing feeling in the research community that
the study of information retrieval systems would be enormously benefited by the
provision of one or more test collections for common use, designed to meet a
range of requirements determined by past experience and likely future needs.
Such a collection or set of collections, which may be referred to as the 'ideal
collection(s)', would permit more effective research by satisfying requirements
for commonality between projects, hospitality to projects, adequacy for projects,
and convenience in projects. In 1975 the British Library Research and Development Department supported a study of ideal collection(s) characteristics and provision mechanisms by the authors.1 The present review considers the specifications of the collection(s) in terms of the way they summarize experience to date
in working with retrieval test collections. Indirectly, they throw light on the
present state of understanding of retrieval system variables.
PAST COLLECTIONS
Current views of the requirements to be met by the ideal collection(s) are largely based on experience with past collections. The general character of these collections can be illustrated by representative examples. (We have not attempted an exhaustive survey.) In the United Kingdom, collections used by major projects, or by several different projects, include the Cranfield 2, INSPEC, College of Librarianship Wales ISILT, UKCIS, University of Newcastle MEDLARS (MEDUSA), National Physical Laboratory (NPL) and Harwell NSA (UKAEA) collections.2-8 Full details of these collections are given in Appendix 1. As this appendix shows, not merely were the objectives of the projects very different; virtually every feature of the collections is different. They vary in size, ranging from some 500 to some 50,000 documents, and from around 50 to over 200 queries; in subject matter, though most are scientific; in indexing source, ranging from title to full text; in number and types of indexing language, ranging from simple title words to fully controlled subject headings; and in the treatment of relevance.
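Whatever their differences, the collections just described all share one underlying shape: a set of documents, a set of queries (requests), and relevance judgments linking the two. The sketch below is one illustrative machine-readable representation of that shape; the identifiers and texts are invented, and this is not a format used by any of the projects named above.

```python
# An illustrative machine-readable representation of a test collection:
# documents, queries, and relevance judgments linking the two.
# All identifiers and texts here are invented for illustration.

documents = {
    "D1": "On the measurement of retrieval performance",
    "D2": "Index term weighting in automatic indexing",
}

queries = {
    "Q1": "how should retrieval performance be measured?",
}

# Relevance judgments: for each query, the set of documents judged relevant.
relevance = {
    "Q1": {"D1"},
}

# A collection is only usable for experiments if every judgment refers
# to a real query and a real document.
assert all(q in queries for q in relevance)
assert all(d in documents for q in relevance for d in relevance[q])
print(len(documents), len(queries))
```

Bringing a collection "up to a well-defined and convenient machine-readable format", as the authors put it, amounts to making these three parts explicit, consistent, and cross-checked in some such scheme.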
