Multi-word terms selection for information retrieval

DOI: https://doi.org/10.1108/IDD-12-2021-0142
Published date: 28 June 2022
Pages: 74-87
Subject Matter: Library & information science, Library & information services, Lending, Document delivery, Collection building & management, Stock revision, Consortia
Authors: Chedi Bechikh Ali, Hatem Haddad, Yahya Slimani
Chedi Bechikh Ali
Institut National des Sciences Appliquées et de Technologie (INSAT), LISI, University of Carthage, Tunis, Tunisia
Hatem Haddad
iCompass, Tunis, Tunisia, and
Yahya Slimani
Institut Supérieur des Arts Multimédia (ISAMM), University of Manouba, Manouba, Tunisia
Abstract
Purpose: A number of approaches and algorithms have been proposed over the years as a basis for automatic indexing. Many of these approaches
suffer from precision inefficiency at low recall. The choice of indexing units has a great impact on search system effectiveness. The authors dive
beyond simple terms indexing to propose a framework for multi-word terms (MWT) filtering and indexing.
Design/methodology/approach: In this paper, the authors rely on ranking MWT to filter them, keeping the most effective ones for the indexing
process. The proposed model is based on filtering MWT according to their ability to capture the document topic and distinguish between different
documents from the same collection. The authors rely on the hypothesis that the best MWT are those that achieve the greatest association degree.
The experiments are carried out with English and French languages data sets.
Findings: The results indicate that this approach achieved precision enhancements at low recall, and it performed better than more advanced
models based on terms dependencies.
Originality/value: Using and testing different association measures to select MWT that best describe the documents to enhance the precision in
the first retrieved documents.
Keywords: Performance measurement, Statistics, Information systems, Information retrieval, Information science, Collection management,
Indexing, Multi-word terms, Association measure, Precision
Paper type: Research paper
1. Introduction
Existing information retrieval (IR) models are very well defined
from the theoretical point of view and give good results in
evaluation campaigns. However, these models are not efficient at
low recall level. Indeed, some documents can have a low ranking
despite being the most relevant. This can be explained by the fact
that these models do not take the complex grammatical structure
of queries and documents into account. Consequently, there is a
need to use models based on dependencies between terms to
improve the accuracy of IR systems. For classical systems, the
text is considered from a statistical point of view, and no
linguistic, syntactic or dependency information is used. We rely
on the hypothesis that a better understanding of the relationships
and dependencies that may exist between terms in queries and
documents can allow the IR system to perform better. In this
paper, we propose a document indexing model based on multi-word
terms (MWT) extracted using syntactic patterns and
statistical techniques to capture term dependencies. The
syntactic patterns make it possible to eliminate irrelevant
structures when extracting MWT, and the statistical measures
make it possible to filter MWT by using a weight to keep the most
efficient ones for MWT-based indexing. Combining
linguistic and statistical approaches for extracting MWT has proved
useful for terminology extraction from documents
(Pecina, 2010). This combination achieved high precision and high mean average
precision (MAP) in collocation extraction when using pointwise
mutual information (MI).
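As an illustration of weight-based filtering, the following minimal Python sketch scores candidate word pairs with pointwise mutual information and keeps the highest-ranked ones. It is not the model proposed in this paper: adjacent word pairs stand in for the syntactic patterns, the whitespace tokenizer is a simplification, and the names pmi_scores, select_candidates, min_count and top_k are hypothetical choices made for this example.

```python
import math
from collections import Counter


def pmi_scores(documents, min_count=2):
    """Return {(w1, w2): PMI} for adjacent word pairs seen at least min_count times."""
    unigrams, bigrams = Counter(), Counter()
    for text in documents:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        unigrams.update(tokens)
        # Adjacent pairs stand in for syntactic-pattern candidates (assumption).
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


def select_candidates(scores, top_k=10):
    """Keep the top_k pairs with the highest association score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:top_k]]


docs = [
    "information retrieval systems rank documents",
    "information retrieval models index documents",
    "the models rank the documents",
]
print(select_candidates(pmi_scores(docs, min_count=2), top_k=3))
```

The min_count threshold is included because PMI tends to overrate pairs that occur only once or twice; the value used here is arbitrary and would need tuning on a real collection.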
To characterize the extent to which the parts of an MWT are
semantically connected, we choose to use the notion of association
degree (Henry et al., 2018), used to classify MWT as
key MWT or non-key MWT. A key MWT
represents an important concept in a document or a query.
Indeed, a key MWT can be used to annotate documents with
terms that best describe the semantic content (Bendersky and
Croft, 2008). Statistical measures have been proposed to
measure the degree of association, including MI, the Dice coefficient
and log-likelihood (Kilgariff, 1992). In corpus linguistics, these
measures are used to identify collocations, defined as groups of
terms that tend to appear in close proximity to one another
significantly more often than one might predict based on the
The current issue and full text archive of this journal is available on Emerald
Insight at: https://www.emerald.com/insight/2398-6247.htm
Information Discovery and Delivery
51/1 (2023) 74-87
© Emerald Publishing Limited [ISSN 2398-6247]
[DOI 10.1108/IDD-12-2021-0142]
This research received no specific grant from any funding agency in the
public, commercial or not-for-profit sectors.
Received 29 December 2021
Revised 22 March 2022
7 May 2022
Accepted 4 June 2022