An evaluation of conflation accuracy using finite‐state transducers

Date: 01 May 2006
Pages: 328-349
DOI: https://doi.org/10.1108/00220410610666493
Published date: 01 May 2006
Authors: Carmen Galvez, Félix de Moya-Anegón
Subject matter: Information & knowledge management, Library & information science
Carmen Galvez and Félix de Moya-Anegón
Department of Information Science, University of Granada, Granada, Spain
Abstract
Purpose – To evaluate the accuracy of conflation methods based on finite-state transducers (FSTs).
Design/methodology/approach – Incorrectly lemmatized and stemmed forms may lead to the
retrieval of inappropriate documents. Experimental studies to date have focused on retrieval
performance, but very few on conflation performance. The process of normalization we used involved
a linguistic toolbox that allowed us to construct, through graphic interfaces, electronic dictionaries
represented internally by FSTs. The lexical resources developed were applied to a Spanish test corpus
for merging term variants in canonical lemmatized forms. Conflation performance was evaluated in
terms of an adaptation of recall and precision measures, based on accuracy and coverage, not actual
retrieval. The results were compared with those obtained using a Spanish version of the Porter
algorithm.
Findings – The conclusion is that the main strength of lemmatization is its accuracy, whereas its
main limitation is the underanalysis of variant forms.
Originality/value – The report outlines the potential of transducers in their application to
normalization processes.
Keywords Linguistics, Semantics, Programming and algorithm theory, Accuracy
Paper type Research paper
Introduction
Conflation is the process of matching and grouping together variants of the same term
that are semantically equivalent. A variant is defined as a text occurrence that is
conceptually related to an original term and can be used to search for information in
text databases (Sparck Jones and Tait, 1984; Tzoukermann et al., 1997; Jacquemin and
Tzoukermann, 1999). This is done by means of computational procedures known as
conflation algorithms, whose primary goal is the normalization of uniterms and
multiterms (Galvez et al., 2005). Uniterm conflation algorithms take into account the
common endings of the words that can be conflated. The programs that carry out this
process are called stemmers when they rely on non-linguistic techniques (stemming
algorithms), and lemmatizers when they rely on linguistic techniques (lemmatization
algorithms).
A stemmer tries to reduce various forms of a word to a single stem, defined as the
“base form,” from which inflected forms are derived. A common method of stemming is
affix removal based on a list of affixes and rules. A stemmer, however, operates on a
single word without knowledge of the context, and therefore cannot discriminate words
that may have different meanings depending on the context of their appearance. At the
same time, stemmers are typically easy to implement and run fast, yet their accuracy is
limited, making them inappropriate for some applications.
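
To illustrate the affix-removal approach described above, consider the following minimal sketch in Python (ours, for illustration only; the suffix list and the minimum stem length are invented assumptions, not rules from the original study):

# Minimal illustration of affix-removal stemming: strip the longest
# matching suffix from a small, hand-picked rule list. The suffix list
# and the minimum stem length are assumptions for illustration only.
SUFFIXES = ["ations", "ation", "ness", "ing", "ers", "er", "ed", "es", "s"]

def stem(word: str, min_stem: int = 3) -> str:
    """Return the word with its longest matching suffix removed."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["normalizations", "normalization", "normalizing", "normalized"]:
    print(w, "->", stem(w))

Note that the resulting stem ("normaliz" in this example) need not be a valid word, which is one reason why stemmers may conflate terms with different meanings.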
A lemmatizer attempts to obtain the lemma, defined as the combination of the stem
and its part-of-speech (POS) tag, which defines the role of terms in a sentence. The
correct identification of the syntactical category of a word in a sentence requires
knowledge of the grammar of a language, implying natural language processing (NLP).
A well-known method of lemmatization consists of a morphological analysis of the
variants and their reduction to lemmas. The process of lemmatization with finite-state
technology consists of standardizing terms by look-up in a dictionary, or lexicon,
configured as a lexical database in which entry terms are treated as equivalent forms
related to their canonical form, or lemma.
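
As a rough illustration of such dictionary look-up, the following sketch (a Python simplification; the Spanish entries are toy examples rather than the dictionaries built in this study) maps inflected forms to a lemma and a POS tag:

# Sketch of lexicon look-up lemmatization: each inflected entry maps to
# its lemma and part-of-speech tag. The Spanish entries are toy examples.
LEXICON = {
    "niñas":   ("niño", "NOUN"),
    "niños":   ("niño", "NOUN"),
    "cantaba": ("cantar", "VERB"),
    "cantan":  ("cantar", "VERB"),
}

def lemmatize(token: str):
    """Return (lemma, POS) for a known token, or mark it as unanalysed."""
    return LEXICON.get(token.lower(), (token, "UNKNOWN"))

for t in ["niñas", "cantaban"]:
    print(t, "->", lemmatize(t))

A form missing from the lexicon, such as "cantaban" here, is left unanalysed; this is the kind of underanalysis of variant forms noted in the findings.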
Literature review and evaluation measures
In our reading, the literature on automatic conflation methods in IR can be approached
from several frameworks that are not mutually exclusive, taking the following criteria
into account:
• non-linguistic vs linguistic techniques;
• language-independent vs language-dependent techniques; and
• similarity vs equivalence relations.
Within this general structure, we may classify the different means of reducing
morphological variants in IR as elimination of affixes, stemming, word segmentation,
n-grams, and linguistic morphology (Lennon et al., 1981). A categorization of methods
for reducing morphological variants begins with the distinction between manual
methods and automatic ones, the latter including: affix removal, successor variety,
n-gram matching, and table lookup (Frakes, 1992), whereas the conflation techniques
employed habitually in IR are stemming and lexical lookup (Paice, 1996).
Conflation based on stemming techniques involves the elimination of the longest
possible affixes, and so the algorithms applied in this way are known as longest match
or simple-removal algorithms. The ones most often used with the English language are
those of Lovins (1968), Dawson (1974), Porter (1980) and Paice (1990). The Porter
algorithm, available at the Snowball web site (2003), has been implemented with
French, Spanish, Italian and Portuguese, as well as with German, Norwegian, Swedish,
and other languages.
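
For reference, a Spanish Snowball (Porter-family) stemmer can be applied as in the sketch below, here through NLTK's implementation (assumed to be installed); the Spanish version of the Porter algorithm used in this study is not necessarily this package:

# Illustration only: applying a Spanish Snowball stemmer via NLTK.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("spanish")
for w in ["librería", "librerías", "libros", "libro"]:
    print(w, "->", stemmer.stem(w))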
The best-known string-similarity algorithms are those based on n-gram similarities,
an n-gram of a string being any substring of a fixed length n. These have been
extensively applied to IR-related tasks such as query expansion (Adamson and
Boreham, 1974; Lennon et al., 1981; Cavnar, 1994; Damashek, 1995; Robertson and
Willett, 1998). They are used as well for automatic spelling correction (Angell et al.,
1983; Kosinov, 2001), based on the assumption that the problems of morphological
variants and spelling variants are similar.
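
In the spirit of digram matching with the Dice coefficient (Adamson and Boreham, 1974), such a similarity measure might be sketched as follows; the choice of n = 2 and the use of bigram sets rather than multisets are simplifying assumptions:

# Sketch of n-gram string similarity: Dice coefficient over character
# bigram sets. Spelling variants share most of their bigrams and score highly.
def ngrams(word: str, n: int = 2) -> set:
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice(a: str, b: str, n: int = 2) -> float:
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(dice("normalization", "normalisation"))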
In language-dependent linguistic techniques, dictionaries are utilized to fuse lexical
variants into lemmas, by means of lemmatization algorithms. The first computational
implementation of this approach was with the PC-KIMMO parser (Karttunen, 1983),
later used as the scheme for the Xerox morphological analyzer developed by the
Multi-lingual Theory and Technology Group. One of the top applications of the Xerox
tool, designed for morphological parsing using finite-state technology, is the reduction
of lexical variants in IR systems. The XEROX-XRCE analyzer has been applied
to English, Dutch, German, Hungarian, French, Italian, Portuguese, and Spanish.
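
Although the Xerox tools and the toolbox used in this study are far richer, the basic idea of a lexicon compiled into a finite-state transducer can be sketched as follows (a toy Python example with invented entries): surface forms are consumed character by character, and the lemma with its POS tag is emitted at the accepting state.

# Toy finite-state transducer over a lexicon. This is a didactic sketch,
# not the Xerox analyzer or the authors' dictionaries; entries are invented.
def build_fst(entries):
    """entries: {surface_form: output_string} -> (transitions, finals)."""
    transitions, finals = {0: {}}, {}
    next_state = 1
    for surface, output in entries.items():
        state = 0
        for ch in surface:
            if ch not in transitions[state]:
                transitions[state][ch] = next_state
                transitions[next_state] = {}
                next_state += 1
            state = transitions[state][ch]
        finals[state] = output          # emit lemma + POS tag on acceptance
    return transitions, finals

def transduce(word, transitions, finals):
    state = 0
    for ch in word:
        if ch not in transitions[state]:
            return None                 # unanalysed variant
        state = transitions[state][ch]
    return finals.get(state)

fst = build_fst({"casas": "casa.N", "casaba": "casar.V"})
print(transduce("casas", *fst), transduce("casona", *fst))

Because entries sharing a prefix share transitions, such transducers remain compact and can be applied in time linear in the length of the input word.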