ON THE SPECIFICATION OF TERM VALUES IN AUTOMATIC INDEXING

Date: 01 April 1973
DOI: https://doi.org/10.1108/eb026562
Pages: 351-372
Author: G. SALTON, C. S. YANG
Subject matter: Information & knowledge management, Library & information science
THE JOURNAL OF DOCUMENTATION
VOLUME 29 NUMBER 4 DECEMBER 1973
ON THE SPECIFICATION OF TERM VALUES IN AUTOMATIC INDEXING
G. SALTON and C. S. YANG
Department of Computer Science, Cornell University, Ithaca
The existing practice in automatic indexing is reviewed, and it is shown
that the standard theories for the specification of term values (or weights)
are not adequate. New techniques are introduced for the assignment of weights
to index terms, based on the characteristics of individual document
collections. The effectiveness of some of the proposed methods is evaluated.
I. CURRENT INDEXING PRACTICE
TWO FUNDAMENTAL notions in the theory of automatic indexing are known
respectively as indexing exhaustivity and term specificity. Indexing
exhaustivity refers to the accuracy and depth with which the various topic
areas germane to a given document are reflected in the set of index terms
assigned to the document, whereas term specificity is a function of the
exactness with which a term characterizes a given subject. In general,
increasing exhaustivity implies a better recall performance, while increasing
term specificity means better precision. In particular, the more exhaustive
the indexing, that is, the more thorough the coverage of the various subject
areas, the more likely it is that relevant items are actually retrieved in
response to user queries, thus achieving high recall; similarly, the greater
the term specificity, that is, the more precise the definition of each term,
the less likely it is that extraneous non-relevant items are also retrieved,
thus achieving high precision. In a given user and collection context, one
must then look for an optimum level of specificity in the vocabulary, and an
optimum level of exhaustivity in the indexing to cover the recall and/or
precision performance desired by the user population.
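The trade-off between recall and precision described above can be made concrete with a small sketch. It uses the usual definitions (recall is the fraction of relevant items that are retrieved, precision is the fraction of retrieved items that are relevant). The function name, the document identifiers, and the two example searches are hypothetical and serve only as an illustration.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a single query.

    recall    = relevant items retrieved / all relevant items
    precision = relevant items retrieved / all retrieved items
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical toy data: four documents are relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# A search on highly specific terms: few items retrieved, all of them relevant.
narrow = recall_precision(["d1", "d2"], relevant)

# A search on exhaustively indexed items: more relevant items, but more noise.
broad = recall_precision(["d1", "d2", "d3", "d7", "d8", "d9"], relevant)

print("narrow search (recall, precision):", narrow)
print("broad search  (recall, precision):", broad)
```

In this toy case the narrow search trades recall for precision, and the broad search does the opposite, which is the behaviour the text attributes to term specificity and indexing exhaustivity respectively.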
In an actual operating environment, one may conjecture that indexing
exhaustivity has something to do with the number of index terms assigned
to a given document, particularly the number of higher frequency terms,
those largely responsible for the recall performance. Term specificity, on
the other hand, may be assumed to be related to the number of documents
to which a given term is assigned in a given collection, the idea being that
the smaller the document frequency, that is, the more concentrated the
assignment of a term to only a few documents in a collection, the more
likely it is that a given term is reasonably specific.1
The introduction of relationships between the indexing exhaustivity and
specificity on the one hand, and the frequency characteristics of the index
terms on the other, has led to certain indexing theories which have been
used widely in practice. Before reviewing the main theories, it is
convenient to distinguish two different frequency measures. The term
frequency $f_{ki}$ is the frequency of occurrence of term i in document k.
The total frequency of occurrence, $F_i$, of term i is then defined simply
as the sum of the individual term frequencies across the N documents of a
collection, that is,

$$F_i = \sum_{k=1}^{N} f_{ki}$$

A somewhat different measure is the document frequency $d_i$ of term i,
which measures the number of documents to which term i is assigned. In an
indexing system in which no weights are assigned to the terms, that is,
where $f_{ki}$ is equal to 1 for all k and all i whenever term i appears in
document k, and $f_{ki}$ is zero otherwise, the document frequency $d_i$
then equals the total frequency $F_i$ for all i.
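The three frequency measures just defined can be computed directly from a collection. The following is a minimal sketch, assuming that documents are plain strings and that terms are obtained by lower-casing and splitting on whitespace (a simplification; no stemming or stop-word removal is attempted). The function names and the three-document collection are hypothetical.

```python
from collections import Counter


def term_frequencies(docs):
    """f[k][i]: number of occurrences of term i in document k."""
    return [Counter(doc.lower().split()) for doc in docs]


def total_frequency(f):
    """F[i]: sum of f[k][i] over all documents k of the collection."""
    total = Counter()
    for counts in f:
        total.update(counts)
    return total


def document_frequency(f):
    """d[i]: number of documents to which term i is assigned."""
    d = Counter()
    for counts in f:
        d.update(counts.keys())  # each document contributes at most 1 per term
    return d


# Hypothetical toy collection of N = 3 documents.
docs = [
    "index term weight index",
    "term frequency weight",
    "index document",
]

f = term_frequencies(docs)
F = total_frequency(f)
d = document_frequency(f)
print("F:", dict(F))
print("d:", dict(d))

# Unweighted (binary) indexing: f[k][i] is 1 if term i appears in document k.
f_binary = [Counter(dict.fromkeys(counts, 1)) for counts in f]
F_binary = total_frequency(f_binary)
print("binary F equals d:", F_binary == d)
```

With weighted term frequencies the term "index" has a total frequency of 3 but a document frequency of 2; once the frequencies are reduced to binary values, the two measures coincide for every term, as stated in the text.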
Based on the concepts of term and document frequencies, a large variety
of indexing methods can be implemented using completely objective
criteria which depend only on the occurrence characteristics of terms in
documents. The first and best known of these is due to Luhn, and assumes
that the value, or weight, of a term assigned to a document is simply
proportional to the term frequency (TF); that is, the more often a term
occurs in the text of a document, the higher its weight.2 The Luhn theory
reflects the fact that high frequency terms are often essential for the
specification of document content and for the retrieval of relevant
information. In many environments, the standard term frequency weights do,
in fact, enhance the retrieval performance, particularly at the high recall
end of the performance curve, as shown in the example of Figure 1 for a
collection of 425 documents in world affairs taken from issues of Time
magazine published in 1963, and processed against twenty-four user
queries.* It may be
* A recall-precision graph such as that of Figure 1 is obtained by matching queries and
documents (using a cosine coefficient), and ranking all documents in decreasing order of
query-document similarity. Precision values are then computed at fixed recall levels of
0·1, 0·2, 0·3, etc., for each query, and the resulting values are averaged for a given
query set. When recall-precision graphs for different indexing or search methods are shown
in the same figure, the curve closest to the upper right-hand corner (where recall and
precision are both near 1) reflects the better performance.3
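
The evaluation procedure outlined in the footnote, combined with term frequency weighting, can be sketched as follows. Several details are assumptions rather than the authors' stated method: terms are obtained by whitespace splitting, the recall levels run from 0.1 to 1.0 in steps of 0.1, and the precision at a given recall level is taken as the highest precision observed at any rank where recall reaches that level (one common interpolation rule; the footnote does not say which rule was used). The documents, query, and relevance judgments are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term frequency weights: a term's weight is its number of occurrences."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine coefficient between two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items())  # a Counter yields 0 if t is absent
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Document ids in decreasing order of query-document similarity."""
    q = tf_vector(query)
    scores = {doc_id: cosine(q, tf_vector(text)) for doc_id, text in docs.items()}
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


RECALL_LEVELS = [i / 10 for i in range(1, 11)]  # assumed levels: 0.1 ... 1.0


def precision_at_recall_levels(ranking, relevant):
    """Precision at each fixed recall level for one query (assumed interpolation)."""
    points = []  # (recall, precision) after each relevant document is retrieved
    hits = 0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / position))
    curve = []
    for level in RECALL_LEVELS:
        reachable = [p for r, p in points if r >= level - 1e-9]
        curve.append(max(reachable, default=0.0))
    return curve


def average_curve(curves):
    """Average the per-query precision values at each recall level."""
    return [sum(column) / len(column) for column in zip(*curves)]


# Hypothetical toy data.
docs = {
    "d1": "soviet missile crisis cuba",
    "d2": "cuba sugar harvest",
    "d3": "missile test missile range",
    "d4": "election results europe",
}
ranking = rank("cuba missile", docs)
curve = precision_at_recall_levels(ranking, relevant={"d1", "d2"})
print(ranking)
print([round(p, 2) for p in curve])
```

For a query set, `average_curve` would be applied to the list of per-query curves to obtain the values plotted in a single recall-precision graph.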