INFORMATION RETRIEVAL BASED ON CONCEPTUAL DISTANCE IN IS‐A HIERARCHIES

Pages: 188-207
Published date: 01 February 1993
DOI: https://doi.org/10.1108/eb026913
Authors: JOON HO LEE, MYOUNG HO KIM, YOON JOON LEE
Subject matter: Information & knowledge management, Library & information science
JOON HO LEE, MYOUNG HO KIM and YOON JOON LEE
Department of Computer Science
Korea Advanced Institute of Science and Technology
373-1, Kusung-dong, Yusung-gu, Taejon, 305-701, Korea
There have been several document ranking methods to calculate the conceptual
distance or closeness between a Boolean query and a document. Though they
provide good retrieval effectiveness in many cases, they do not support effective
weighting schemes for queries and documents and also have several problems
resulting from inappropriate evaluation of Boolean operators. We propose a
new method called Knowledge-Based Extended Boolean Model (KB-EBM) in
which Salton's extended Boolean model is incorporated. KB-EBM evaluates
weighted queries and documents effectively, and avoids the problems of the
previous methods. KB-EBM provides high quality document rankings by using
term dependence information from is-a hierarchies. The performance
experiments show that the proposed method closely simulates human behaviour.
1. INTRODUCTION
RAPID ADVANCES IN SCIENCE AND TECHNOLOGY in the last three
decades have led us to call our society the information society: more
information is generated about more topics than ever before. In this
complicated society, we often need relevant information to carry out the tasks
at hand and to make intelligent decisions. In a large body of data it is
difficult to find the data actually needed at a given time, and to distinguish
relevant from extraneous data. The research area called information retrieval
(IR) was established in the early 1960s to develop effective computer-aided
processes for searching and extracting specific information.
In the past, most IR researchers focussed mainly on automatic indexing
and retrieval models, which are based on statistical techniques using
frequencies of words in documents. The performance of IR systems based on
such statistical techniques has improved considerably over the years, but it is
expected that these techniques will soon reach their limit. In recent years
interest has been shifting toward the use of artificial intelligence techniques
to provide effective retrieval over a large amount of textual information [1].
The most important property of artificial intelligence approaches is the use of
domain knowledge from information structure to retrieve relevant information.
The domain knowledge gives more exact term dependencies than the conventional
statistical measures [2], and this term dependence information can be used to
improve the retrieval effectiveness of IR systems.
Journal of Documentation, vol. 49, no. 2, June 1993, pp. 188-207
Since Quillian proposed the idea of a semantic network representation for
human knowledge [3], the semantic network has been used in the literature as
an information structure to construct a knowledge base [4]. The semantic
network is broadly described as any representation interlinking nodes with
arcs, where the nodes are concepts and the links are various kinds of
relationships between concepts. 'Is-a hierarchies' are defined as simplified
semantic networks in which only is-a relationships are permitted. It can be
shown that shortest path lengths between concepts in is-a hierarchies can be
used to measure conceptual distance between them. Although semantic
networks can contain other useful relationships such as synonym, related term
and so on, only is-a relationships are used to compute the conceptual distance.
This is because is-a relationships are most often used by humans in evaluating
query-document similarities, and are also sufficient to develop an inference
mechanism for high retrieval effectiveness [5, 6].
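
The idea of measuring conceptual distance by shortest path length can be
sketched as a breadth-first search over is-a links. This is a minimal
illustration only: the toy hierarchy, the concept names and the function names
(build_undirected, conceptual_distance) are assumptions made for the example
and are not taken from the paper, and each is-a link is assumed to count as one
unit of distance regardless of direction.

```python
from collections import deque

# Hypothetical toy is-a hierarchy: each child concept maps to its parent(s).
IS_A = {
    "information retrieval": ["computer science"],
    "databases": ["computer science"],
    "boolean retrieval": ["information retrieval"],
    "relational databases": ["databases"],
}


def build_undirected(is_a):
    """Turn child -> parents links into an undirected adjacency map."""
    graph = {}
    for child, parents in is_a.items():
        for parent in parents:
            graph.setdefault(child, set()).add(parent)
            graph.setdefault(parent, set()).add(child)
    return graph


def conceptual_distance(graph, source, target):
    """Return the number of is-a links on the shortest path between two
    concepts, or None if either concept is unknown or they are unconnected."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


graph = build_undirected(IS_A)
print(conceptual_distance(graph, "boolean retrieval", "relational databases"))
```

In this toy hierarchy the two leaf concepts are joined only through their
common ancestor, so the path climbs two is-a links and descends two more.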
There have been a few ranking methods based on the conceptual distance in
is-a hierarchies. They calculate the conceptual closeness or distance between a
Boolean query and a document. Though the previous methods provide good
retrieval effectiveness in many cases, they do not support an effective way of
evaluating the various weights such as query term, query clause and document
term weights. They also have several problems resulting from inappropriate
evaluation of Boolean operators AND, OR and NOT [7, 8]. First, using MIN or
MAX functions to evaluate OR operators may produce inappropriate results in
certain cases. Second, transforming input Boolean queries into minimal
disjunctive normal forms increases the complexity of computation. Third,
most of them suffer from inefficiency in evaluating NOT operators.
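
One way to see why a MAX function can be a poor fit for OR is the small sketch
below. It is an assumption-laden illustration, not a reproduction of any of
the previous methods: the closeness values are invented, and the particular
weakness shown (MAX looks at a single operand and ignores the rest) is a
general property of the function rather than a case quoted from the paper.

```python
def or_max(closeness_values):
    """Evaluate OR as the largest closeness among the operands."""
    return max(closeness_values)


def and_min(closeness_values):
    """Evaluate AND as the smallest closeness among the operands."""
    return min(closeness_values)


# Hypothetical closeness values in [0, 1] between each query term and a
# document, for the query (t1 OR t2).
doc_a = {"t1": 0.9, "t2": 0.8}  # close to both query terms
doc_b = {"t1": 0.9, "t2": 0.0}  # close to only one query term

score_a = or_max(doc_a.values())
score_b = or_max(doc_b.values())

# Both documents receive the same score, although doc_a matches more of
# the query; the second operand has no influence on the ranking.
print(score_a, score_b)
```

A ranking function that lets every operand contribute would separate the two
documents, which is the kind of behaviour the extended Boolean model is
designed to give.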
In this paper we propose a new ranking method based on the conceptual
distance in is-a hierarchies. The proposed method, called the Knowledge-Based
Extended Boolean Model (KB-EBM), exploits Salton's extended Boolean model
[9-11]. KB-EBM evaluates the various weights effectively and avoids the
problems of the previous methods. The proposed method also provides a high
quality of document rankings by using term dependence information, i.e. is-a
relationships from is-a hierarchies through the conceptual distance.
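
For orientation, the p-norm form of the extended Boolean model can be
sketched as follows. This shows only the standard weighted OR and AND
similarity functions of that model; it is not KB-EBM itself, and how KB-EBM
derives the document-side values from conceptual distance is not covered
here. The function names and the default p = 2 are assumptions made for the
example.

```python
def pnorm_or(doc_weights, query_weights, p=2.0):
    """Similarity of a document to a weighted OR query under the p-norm model.

    doc_weights and query_weights are parallel sequences of values in [0, 1].
    """
    denominator = sum(q ** p for q in query_weights)
    if denominator == 0:
        raise ValueError("at least one query weight must be non-zero")
    numerator = sum((q ** p) * (d ** p)
                    for d, q in zip(doc_weights, query_weights))
    return (numerator / denominator) ** (1.0 / p)


def pnorm_and(doc_weights, query_weights, p=2.0):
    """Similarity of a document to a weighted AND query under the p-norm model."""
    denominator = sum(q ** p for q in query_weights)
    if denominator == 0:
        raise ValueError("at least one query weight must be non-zero")
    numerator = sum((q ** p) * ((1.0 - d) ** p)
                    for d, q in zip(doc_weights, query_weights))
    return 1.0 - (numerator / denominator) ** (1.0 / p)


# Hypothetical document values for the query (t1 OR t2), equal query weights.
both_terms = pnorm_or([0.9, 0.8], [1.0, 1.0])
one_term = pnorm_or([0.9, 0.0], [1.0, 1.0])
print(both_terms > one_term)
```

With p = 1 both functions reduce to a weighted average of the document values,
and for equal query weights increasing p moves OR toward MAX and AND toward
MIN, so p controls how strictly the Boolean operators are interpreted.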
The remainder of this paper is organised as follows. Section 2 describes
previous ranking methods called Relevance, R-Distance and K-Distance. In
section 3 we propose a new method which exploits the extended Boolean
model. A brief introduction to the extended Boolean model is also given. The
results of the performance comparison are presented in section 4. Finally,
concluding remarks are given in section 5.
2. RANKING METHODS BASED ON CONCEPTUAL DISTANCE
In the context of Quillian's semantic networks, shortest path lengths between
two concepts are not sufficient to represent conceptual distance between those
concepts. However, when the paths are restricted to is-a links, the shortest
path length does measure conceptual distance. In other words, when is-a
hierarchies are defined as simplified semantic networks permitting only is-a