An evaluation of conflation accuracy using finite‐state transducers

Date: 01 May 2006
Pages: 328-349
DOI: https://doi.org/10.1108/00220410610666493
Published date: 01 May 2006
Authors: Carmen Galvez, Félix de Moya-Anegón
Subject matter: Information & knowledge management, Library & information science
Carmen Galvez and Félix de Moya-Anegón
Department of Information Science, University of Granada, Granada, Spain
Abstract
Purpose – To evaluate the accuracy of conflation methods based on finite-state transducers (FSTs).
Design/methodology/approach – Incorrectly lemmatized and stemmed forms may lead to the
retrieval of inappropriate documents. Experimental studies to date have focused on retrieval
performance, but very few on conflation performance. The process of normalization we used involved
a linguistic toolbox that allowed us to construct, through graphic interfaces, electronic dictionaries
represented internally by FSTs. The lexical resources developed were applied to a Spanish test corpus
for merging term variants in canonical lemmatized forms. Conflation performance was evaluated in
terms of an adaptation of recall and precision measures, based on accuracy and coverage, not actual
retrieval. The results were compared with those obtained using a Spanish version of the Porter
algorithm.
Findings – The conclusion is that the main strength of lemmatization is its accuracy, whereas its
main limitation is the underanalysis of variant forms.
Originality/value – The report outlines the potential of transducers in their application to
normalization processes.
Keywords Linguistics, Semantics, Programming and algorithm theory, Accuracy
Paper type Research paper
Introduction
Conflation is the process of matching and grouping together variants of the same term
that are semantically equivalent. A variant is defined as a text occurrence that is
conceptually related to an original term and can be used to search for information in
text databases (Sparck Jones and Tait, 1984; Tzoukermann et al., 1997; Jacquemin and
Tzoukermann, 1999). This is done by means of computational procedures known as
conflation algorithms, whose primary goal is the normalization of uniterms and
multiterms (Galvez et al., 2005). Uniterm conflation algorithms take into account the
common endings of the words that can be conflated. The programs that carry out this
process are called stemmers when they rely on non-linguistic techniques (stemming
algorithms), and lemmatizers when they rely on linguistic techniques (lemmatization
algorithms).
A stemmer tries to reduce various forms of a word to a single stem, defined as the
“base form,” from which inflected forms are derived. A common method of stemming is
affix removal based on a list of affixes and rules. A stemmer, however, operates on a
single word without knowledge of the context, and therefore cannot discriminate words
that may have different meanings depending on the context of their appearance. At the
same time, stemmers are typically easy to implement and run fast, yet their accuracy is
limited, making them inappropriate for some applications.
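
To illustrate the affix-removal approach described above, consider the following minimal sketch in Python (ours, for illustration only; the suffix list and the minimum stem length are invented assumptions, not rules from the original study):

# Minimal illustration of affix-removal stemming: strip the longest
# matching suffix from a small, hand-picked rule list. The suffix list
# and the minimum stem length are assumptions for illustration only.
SUFFIXES = ["ations", "ation", "ness", "ing", "ers", "er", "ed", "es", "s"]

def stem(word: str, min_stem: int = 3) -> str:
    """Return the word with its longest matching suffix removed."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["normalizations", "normalization", "normalizing", "normalized"]:
    print(w, "->", stem(w))

Note that the resulting stem ("normaliz" in this example) need not be a valid word, which is one reason why stemmers may conflate terms with different meanings.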
A lemmatizer attempts to obtain the lemma, defined as the combination of the stem
and its part-of-speech (POS) tag, which defines the role of terms in a sentence. The
correct identification of the syntactical category of a word in a sentence requires
knowledge of the grammar of a language, implying natural language processing (NLP).
A well-known method of lemmatization consists of a morphological analysis of the
variants and their reduction to lemmas. The process of lemmatization with finite-state
technology consists of standardizing terms by look-up in a dictionary, or lexicon,
configured as a lexical database in which entry terms are treated as equivalent forms
related to their canonical form, or lemma.
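
As a rough illustration of such dictionary look-up, the following sketch (a Python simplification; the Spanish entries are toy examples rather than the dictionaries built in this study) maps inflected forms to a lemma and a POS tag:

# Sketch of lexicon look-up lemmatization: each inflected entry maps to
# its lemma and part-of-speech tag. The Spanish entries are toy examples.
LEXICON = {
    "niñas":   ("niño", "NOUN"),
    "niños":   ("niño", "NOUN"),
    "cantaba": ("cantar", "VERB"),
    "cantan":  ("cantar", "VERB"),
}

def lemmatize(token: str):
    """Return (lemma, POS) for a known token, or mark it as unanalysed."""
    return LEXICON.get(token.lower(), (token, "UNKNOWN"))

for t in ["niñas", "cantaban"]:
    print(t, "->", lemmatize(t))

A form missing from the lexicon, such as "cantaban" here, is left unanalysed; this is the kind of underanalysis of variant forms noted in the findings.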
Literature review and evaluation measures
In our reading, the literature on automatic conflation methods in IR can be approached
from several frameworks that are not mutually exclusive, taking the following criteria
into account:
• non-linguistic vs linguistic techniques;
• language-independent vs language-dependent techniques; and
• similarity vs equivalence relations.
Within this general structure, we may classify the different means of reducing
morphological variants in IR as elimination of affixes, stemming, word segmentation,
n-grams, and linguistic morphology (Lennon et al., 1981). A categorization of methods
for reducing morphological variants begins with the distinction between manual
methods and automatic ones, the latter including: affix removal, successor variety,
n-gram matching, and table lookup (Frakes, 1992), whereas the conflation techniques
employed habitually in IR are stemming and lexical lookup (Paice, 1996).
Conflation based on stemming techniques involves the elimination of the longest
possible affixes, and so the algorithms applied in this way are known as longest match
or simple-removal algorithms. The ones most often used with the English language are
those of Lovins (1968), Dawson (1974), Porter (1980) and Paice (1990). The Porter
algorithm, available at the Snowball web site (2003), has been implemented with
French, Spanish, Italian and Portuguese, as well as with German, Norwegian, Swedish,
and other languages.
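
For reference, a Spanish Snowball (Porter-family) stemmer can be applied as in the sketch below, here through NLTK's implementation (assumed to be installed); the Spanish version of the Porter algorithm used in this study is not necessarily this package:

# Illustration only: applying a Spanish Snowball stemmer via NLTK.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("spanish")
for w in ["librería", "librerías", "libros", "libro"]:
    print(w, "->", stemmer.stem(w))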
The best-known string-similarity algorithms are those based on n-gram similarities,
an n-gram of a string being any substring of a fixed length n. These have been
extensively applied to IR-related tasks such as query expansion (Adamson and
Boreham, 1974; Lennon et al., 1981; Cavnar, 1994; Damashek, 1995; Robertson and
Willett, 1998). They are used as well for automatic spelling correction (Angell et al.,
1983; Kosinov, 2001), based on the assumption that the problems of morphological
variants and spelling variants are similar.
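
In the spirit of digram matching with the Dice coefficient (Adamson and Boreham, 1974), such a similarity measure might be sketched as follows; the choice of n = 2 and the use of bigram sets rather than multisets are simplifying assumptions:

# Sketch of n-gram string similarity: Dice coefficient over character
# bigram sets. Spelling variants share most of their bigrams and score highly.
def ngrams(word: str, n: int = 2) -> set:
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice(a: str, b: str, n: int = 2) -> float:
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(dice("normalization", "normalisation"))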
In language-dependent linguistic techniques, dictionaries are utilized to fuse lexical
variants into lemmas, by means of lemmatization algorithms. The first computational
implementation of this approach was with the PC-KIMMO parser (Karttunen, 1983),
later used as the scheme for the Xerox morphological analyzer developed by the
Multi-lingual Theory and Technology Group. One of the top applications of the Xerox
tool, designed for morphological parsing using finite-state technology, is the reduction
of lexical variants in IR systems. The XEROX-XRCE analyzer has been applied
to English, Dutch, German, Hungarian, French, Italian, Portuguese, and Spanish.
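
Although the Xerox tools and the toolbox used in this study are far richer, the basic idea of a lexicon compiled into a finite-state transducer can be sketched as follows (a toy Python example with invented entries): surface forms are consumed character by character, and the lemma with its POS tag is emitted at the accepting state.

# Toy finite-state transducer over a lexicon. This is a didactic sketch,
# not the Xerox analyzer or the authors' dictionaries; entries are invented.
def build_fst(entries):
    """entries: {surface_form: output_string} -> (transitions, finals)."""
    transitions, finals = {0: {}}, {}
    next_state = 1
    for surface, output in entries.items():
        state = 0
        for ch in surface:
            if ch not in transitions[state]:
                transitions[state][ch] = next_state
                transitions[next_state] = {}
                next_state += 1
            state = transitions[state][ch]
        finals[state] = output          # emit lemma + POS tag on acceptance
    return transitions, finals

def transduce(word, transitions, finals):
    state = 0
    for ch in word:
        if ch not in transitions[state]:
            return None                 # unanalysed variant
        state = transitions[state][ch]
    return finals.get(state)

fst = build_fst({"casas": "casa.N", "casaba": "casar.V"})
print(transduce("casas", *fst), transduce("casona", *fst))

Because entries sharing a prefix share transitions, such transducers remain compact and can be applied in time linear in the length of the input word.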