A STEMMING ALGORITHM FOR LATIN TEXT DATABASES

DOI: https://doi.org/10.1108/eb026966
Pages: 172-187
Published date: 01 February 1996
Authors: ROBYN SCHINKE, MARK GREENGRASS, ALEXANDER M. ROBERTSON, PETER WILLETT
Subject Matter: Information & knowledge management, Library & information science
ROBYN SCHINKE*, MARK GREENGRASS*, ALEXANDER M. ROBERTSON and PETER WILLETT1
Humanities Research Institute and Departments of History* and of Information
Studies, University of Sheffield, Western Bank, Sheffield S10 2TN
This paper describes the design of a stemming algorithm for searching
databases of Latin text. The algorithm uses a simple longest-match
approach with some recoding but differs from most stemmers in its
use of two separate suffix dictionaries (one for nouns and adjectives
and one for verbs) for processing query and database words. These
dictionaries and the associated stemming rules are arranged in such
a way that the stemmer does not need to know the grammatical
category of the word that is being stemmed. It is very easy to overstem
in Latin: the stemmer developed here tends, rather, towards
understemming, leaving sufficient grammatical information attached
to the stems resulting from its use to enable users to pursue very
specific searches for single grammatical forms of individual words.
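The longest-match idea with two suffix dictionaries can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the suffix lists are tiny hypothetical samples rather than the dictionaries developed in the paper, the minimum stem length is an arbitrary guard, the recoding step is omitted, and producing one stem from each dictionary is only one possible reading of how a stemmer can avoid needing the grammatical category.

```python
# Hypothetical, heavily abbreviated suffix lists; NOT the paper's dictionaries.
NOUN_SUFFIXES = ["ibus", "arum", "orum", "ius", "ae", "am", "as", "em", "es",
                 "is", "os", "um", "us", "a", "e", "i", "o", "u"]
VERB_SUFFIXES = ["ebant", "abant", "untur", "antur", "mus", "tis", "nt", "re",
                 "t", "s", "o", "m"]

MIN_STEM_LENGTH = 2  # assumed guard against stripping a word down to nothing


def strip_longest(word, suffixes, min_stem=MIN_STEM_LENGTH):
    """Remove the longest listed suffix that leaves at least min_stem letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no listed suffix applies; leave the word unchanged


def stem_both(word):
    """Return (noun_stem, verb_stem) without deciding the word's category."""
    word = word.lower()
    return (strip_longest(word, NOUN_SUFFIXES),
            strip_longest(word, VERB_SUFFIXES))


print(stem_both("portarum"))   # noun dictionary strips 'arum', verb dictionary strips 'm'
print(stem_both("portabant"))  # noun dictionary finds nothing, verb dictionary strips 'abant'
```

In this sketch both candidate stems are kept for every word, so no grammatical analysis is needed before stemming; how the two stems are then used for matching is left open here.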
INTRODUCTION
AN IMPORTANT COMPONENT of any system for text searching is the ability
to identify accurately the variant word forms that arise from grammatical
modifications or alternative spellings of the words in a user's query. Such variants
are normally encompassed by means of either right-hand truncation or stemming.
Truncation is carried out by the searcher, who removes as many letters
from the right-hand side of the word as seems appropriate to achieve a plausible
root, and the search then retrieves all words that commence with this root,
regardless of their endings. Stemming, conversely, is carried out automatically
by reducing all words with the same stem to a common form, typically by
removing the inflectional and derivational suffixes [1-3]. Stemming algorithms
for English have been discussed extensively in the literature [1-6] and there has
recently been much interest in stemming algorithms for other languages [7-12].
In this paper, we describe the development of a stemming algorithm to support
free-text searches of Latin databases, an increasing number of which have
become available to scholars over the last few years.
1 To whom all correspondence should be addressed. Email address: p.willett@sheffield.ac.uk
Journal of Documentation, vol. 52, no. 2, June 1996, pp. 172-187
In some respects, Latin is ideally suited to right-hand truncation, since it is
an inflected language which makes extensive use of suffixes to convey syntactic,
rather than semantic, information about words. In practice, however, a simple,
right-hand truncation search for a standard Latin word produces very poor
results for two principal reasons. The first problem is that many Latin words
have more than one distinct stem. For example, nouns such as vox (voice) yield
the stems 'vox-' and 'voc-', while verbs such as agere (to do, drive) exhibit a
minimum of three stems: 'ag-', 'eg-', and 'act-'. Accordingly, high recall will only
be achieved in a right-hand truncation search if the user knows all of the stems
of the query word and carries out several separate, but related, searches of the
database for each distinct query word. A further problem is that a simple
truncation search produces a large number of words that are unrelated to the
initial query. This arises not only because Latin roots tend to be quite short (e.g.
vox and agere as discussed above) but also because many Latin words with
different meanings have similar roots, e.g. the three words portus (harbour),
porta (gateway) and portare (to carry).
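Both problems can be seen in a small sketch of right-hand truncation treated as prefix matching. The function name and the ten-word vocabulary are hypothetical and chosen only for illustration; they are not drawn from any of the databases discussed in this paper.

```python
def truncation_search(stems, vocabulary):
    """Return, in sorted order, every word that begins with any given stem."""
    return sorted(w for w in vocabulary if any(w.startswith(s) for s in stems))


# Hypothetical vocabulary: forms of agere plus unrelated words with similar roots.
vocabulary = {"ago", "agit", "egi", "actum",    # forms of agere
              "ager", "agnus", "egeo",          # field, lamb, to lack
              "portus", "porta", "portare"}     # harbour, gateway, to carry

# One stem misses the forms built on 'eg-' and 'act-' (low recall).
print(truncation_search(["ag"], vocabulary))
# All three stems recover every form, but also pull in unrelated words.
print(truncation_search(["ag", "eg", "act"], vocabulary))
# A single short root conflates three different words.
print(truncation_search(["port"], vocabulary))
```

The first search misses egi and actum, the second finds them at the cost of also retrieving ager, agnus and egeo, and the third cannot separate portus, porta and portare at all.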
This combination of factors means that when Latin words are truncated to
their linguistic roots, such as ducere to 'duc-' or mater to 'matr-', the resultant
stems are so short that many other words may begin with the same three or
four letters, and many of these words may not be semantically related to the
query word. For example, when the verb ducere (to lead) was sought in the
Patrologia Latina (a database that contains a huge compendium of patristic and
medieval thought originally published in the middle of the nineteenth century)
with the stem 'duc-', a total of 404 different words was retrieved. Only 118
(29%) of these words, at most, can actually be variants of the query verb, since
Latin verbs have a maximum of 144 different forms and since twenty-six of the
forms of ducere commence with 'dux-', rather than 'duc-'. By similar reasoning,
only 7% at most of the 2,104 words retrieved in a search of the same database
for the verb fero, ferre, tuli, latum (to carry), using the three stems 'fer-', 'tul-'
and 'lat-', can possibly be variants of the query verb. The results are somewhat
better when words with longer roots are searched. For example, searching the
stem 'ambul-' for the verb ambulare (to walk) retrieved 129 different words, of
which 94 (73%) were found to be query variants. These occurrences are examples
of the general problem of overstemming, which occurs when too short a stem
remains after the removal of a word's ending(s), with the result that unrelated
words are conflated to the same stem. The converse of this, understemming,
occurs when too short a suffix is removed (so that related words are not all
conflated to the same stem). Both problems occur with any stemming algorithm
[13, 14]; in the particular context of Latin, we believe that understemming is of
less importance since it can be circumvented, at least in part, by an appropriate
algorithm design, as detailed below.
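The percentages quoted above follow from simple upper-bound arithmetic, reproduced in the short sketch below. The counts are those given in the text; the helper name is hypothetical, and for ferre it is assumed that all 144 possible forms count towards the bound, which reproduces the 7% figure.

```python
MAX_VERB_FORMS = 144  # maximum number of forms of a Latin verb, as stated in the text


def max_precision(possible_variants, retrieved):
    """Upper bound on the fraction of retrieved words that can be query variants."""
    return possible_variants / retrieved


# ducere: 26 of the 144 forms begin with 'dux-', so at most 118 can match 'duc-'.
duc = max_precision(MAX_VERB_FORMS - 26, 404)
# ferre: assumed bound of all 144 forms across the stems 'fer-', 'tul-' and 'lat-'.
fer = max_precision(MAX_VERB_FORMS, 2104)
# ambulare: 94 of the 129 retrieved words were found to be query variants.
ambul = max_precision(94, 129)

for label, value in [("duc-", duc), ("fer-/tul-/lat-", fer), ("ambul-", ambul)]:
    print(f"{label}: {value:.1%}")
```

The first two values are ceilings rather than measured precision, since not every possible form of a verb need occur in the database; only the figure for ambulare is an observed proportion.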
Right-hand truncation can be used to search Latin text databases, but only
if the user has sufficient knowledge of Latin to be able to enter all of the necessary
stems for a given query word and to identify the (possibly large number of)
related forms in the search output. In this paper, we describe a stemming
algorithm for Latin that seeks to provide effective access to such databases for
users with only a basic knowledge of the language. The next section provides a
brief overview of the main characteristics of stemming algorithms, and we then
discuss some of the main features of Latin that need to be reflected if a stemmer
is to perform effectively. The following, and longest, section details the