Multi-word terms selection for information retrieval
DOI | https://doi.org/10.1108/IDD-12-2021-0142 |
Published date | 28 June 2022 |
Date | 28 June 2022 |
Pages | 74-87 |
Subject Matter | Library & information science,Library & information services,Lending,Document delivery,Collection building & management,Stock revision,Consortia |
Author | Chedi Bechikh Ali,Hatem Haddad,Yahya Slimani |
Chedi Bechikh Ali
Institut National des Sciences Appliquées et de Technologie (INSAT), LISI, University of Carthage, Tunis, Tunisia
Hatem Haddad
iCompass, Tunis, Tunisia, and
Yahya Slimani
Institut Supérieur des Arts Multimédia (ISAMM), University of Manouba, Manouba, Tunisia
Abstract
Purpose – A number of approaches and algorithms have been proposed over the years as a basis for automatic indexing. Many of these approaches
suffer from poor precision at low recall. The choice of indexing units has a great impact on search system effectiveness. The authors go
beyond simple-term indexing to propose a framework for multi-word terms (MWT) filtering and indexing.
Design/methodology/approach – In this paper, the authors rely on ranking MWT to filter them, keeping the most effective ones for the indexing
process. The proposed model is based on filtering MWT according to their ability to capture the document topic and distinguish between different
documents from the same collection. The authors rely on the hypothesis that the best MWT are those that achieve the greatest association degree.
The experiments are carried out with English- and French-language data sets.
Findings – The results indicate that this approach achieved precision enhancements at low recall, and it performed better than more advanced
models based on terms dependencies.
Originality/value – Using and testing different association measures to select MWT that best describe the documents to enhance the precision in
the first retrieved documents.
Keywords Performance measurement, Statistics, Information systems, Information retrieval, Information science, Collection management,
Indexing, Multi-word terms, Association measure, Precision
Paper type Research paper
1. Introduction
Existing information retrieval (IR) models are very well defined
from the theoretical point of view and give good results in
evaluation campaigns. However, these models are not effective at
low recall levels. Indeed, some documents can have a low ranking
despite being the most relevant. This can be explained by the fact
that these models do not take the complex grammatical structure
of queries and documents into account. Consequently, there is a
need to use models based on dependencies between terms to
improve the accuracy of IR systems. For classical systems, the
text is considered from a statistical point of view, and no
linguistic, syntactic or dependency information is used. We rely
on the hypothesis that a better understanding of the relationships
and dependencies that may exist between terms in queries and
documents can allow the IR system to perform better. In this
paper, we propose a document indexing model based on multi-
word terms (MWT) extracted using syntactic patterns and
statistical techniques to capture term dependencies. The
syntactic patterns make it possible to eliminate irrelevant
structures when extracting MWT, and the statistical measures
make it possible to filter MWT by using a weight to keep the most
efficient ones for indexing based on MWT. The fusion of
linguistic and statistical approaches for extracting MWTs shows
its usefulness in the terminology extraction from documents
(Pecina, 2010). It achieved high precision and high mean average
precision (MAP) in collocation extraction when using pointwise
mutual information (MI).
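As an illustration of weighting candidates with pointwise MI, the following is a minimal sketch and not the authors' implementation. It assumes that candidates are simply adjacent word pairs in whitespace-tokenized text (the paper extracts candidates with syntactic patterns instead), and the names `rank_bigrams_by_pmi`, `dice` and `min_count` are hypothetical. The Dice coefficient, another association measure named in this introduction, is included for comparison.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(docs, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Assumption: a candidate is any pair of adjacent tokens; no syntactic
    filtering is applied in this sketch.
    """
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        # PMI is unstable for rare pairs, so skip pairs seen too few times.
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))

    # Highest association first; a filter would keep only the top-ranked pairs.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def dice(count_xy, count_x, count_y):
    """Dice coefficient computed from raw co-occurrence and term counts."""
    return 2 * count_xy / (count_x + count_y)


if __name__ == "__main__":
    toy_docs = [
        "information retrieval models rank documents",
        "information retrieval systems index documents",
        "the models index the documents",
    ]
    for pair, score in rank_bigrams_by_pmi(toy_docs):
        print(pair, round(score, 3))
```

Keeping only the highest-scoring pairs from such a ranking is one simple way to realise the filtering step described above; the threshold or cut-off would have to be chosen empirically.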
To characterize the extent to which the parts of an MWT are
semantically connected, we choose to use the notion of association
degree (Henry et al., 2018), used to classify MWT as
key MWT or non-key MWT. A key MWT is used for
representing an important concept in a document or a query.
Indeed, a key MWT can be used to annotate documents with
terms that best describe the semantic content (Bendersky and
Croft, 2008). Statistical measures have been proposed to
measure the degree of association, including MI, Dice coefficient
or log-likelihood (Kilgariff, 1992). In corpus linguistics, these
measures are used to identify collocations defined as groups of
terms “that tend to appear in close proximity to one another
significantly more often than one might predict based on the
The current issue and full text archive of this journal is available on Emerald
Insight at: https://www.emerald.com/insight/2398-6247.htm
Information Discovery and Delivery
51/1 (2023) 74–87
© Emerald Publishing Limited [ISSN 2398-6247]
[DOI 10.1108/IDD-12-2021-0142]
This research received no specific grant from any funding agency in the
public, commercial or not-for-profit sectors.
Received 29 December 2021
Revised 22 March 2022
7 May 2022
Accepted 4 June 2022