A STEMMING ALGORITHM FOR LATIN TEXT DATABASES

DOI	https://doi.org/10.1108/eb026966
Pages	172-187
Date	01 February 1996
Published date	01 February 1996
Author	ROBYN SCHINKE, MARK GREENGRASS, ALEXANDER M. ROBERTSON, PETER WILLETT
Subject Matter	Information & knowledge management, Library & information science

ROBYN SCHINKE*, MARK GREENGRASS*, ALEXANDER M. ROBERTSON† and PETER WILLETT†1

Humanities Research Institute and Departments of History* and of Information Studies†, University of Sheffield, Western Bank, Sheffield S10 2TN

This paper describes the design of a stemming algorithm for searching databases of Latin text. The algorithm uses a simple longest-match approach with some recoding but differs from most stemmers in its use of two separate suffix dictionaries (one for nouns and adjectives and one for verbs) for processing query and database words. These dictionaries and the associated stemming rules are arranged in such a way that the stemmer does not need to know the grammatical category of the word that is being stemmed. It is very easy to overstem in Latin: the stemmer developed here tends, rather, towards understemming, leaving sufficient grammatical information attached to the stems resulting from its use to enable users to pursue very specific searches for single grammatical forms of individual words.

INTRODUCTION

AN IMPORTANT COMPONENT of any system for text searching is the ability to identify accurately the variant word forms that arise from grammatical modifications or alternative spellings of the words in a user's query. Such variants are normally encompassed by means of either right-hand truncation or stemming. Truncation is carried out by the searcher, who removes as many letters from the right-hand side of the word as seems appropriate to achieve a plausible root, and the search then retrieves all words that commence with this root, regardless of their endings. Stemming, conversely, is carried out automatically by reducing all words with the same stem to a common form, typically by removing the inflectional and derivational suffixes [1-3]. Stemming algorithms for English have been discussed extensively in the literature [1-6] and there has recently been much interest in stemming algorithms for other languages [7-12]. In this paper, we describe the development of a stemming algorithm to support free-text searches of Latin databases, an increasing number of which have become available to scholars over the last few years.
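The contrast between the two approaches can be illustrated with a minimal Python sketch. It is an illustration only: the suffix list, the minimum stem length of three letters, the function names and the small English vocabulary are all hypothetical, and the code does not reproduce any of the published stemmers cited above.

```python
# Hypothetical suffix list, ordered longest first so that the longest
# matching suffix is the one removed.
SUFFIXES = ("ions", "ing", "ion", "ed", "s")


def truncation_search(root, vocabulary):
    """Right-hand truncation: return every word that begins with the
    root chosen by the searcher, regardless of its ending."""
    return [word for word in vocabulary if word.startswith(root)]


def stem(word):
    """Remove the longest matching suffix, provided that at least three
    letters remain (an arbitrary threshold chosen for this sketch)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemming_search(query, vocabulary):
    """Stemming: conflate automatically all words sharing the query's stem."""
    target = stem(query)
    return [word for word in vocabulary if stem(word) == target]


vocabulary = ["connect", "connected", "connecting",
              "connection", "connections", "connive"]

# The searcher must choose the root; too short a root admits 'connive'.
print(truncation_search("conn", vocabulary))

# The stemmer conflates the five forms of 'connect' without user intervention.
print(stemming_search("connections", vocabulary))
```

In this toy vocabulary the truncation search returns all six words, whereas the stemming search returns only the five forms that reduce to 'connect'.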

1 To whom all correspondence should be addressed. Email address: p.willett@sheffield.ac.uk

Journal of Documentation, vol. 52, no. 2, June 1996, pp. 172-187

In some respects, Latin is ideally suited to right-hand truncation, since it is an inflected language which makes extensive use of suffixes to convey syntactic, rather than semantic, information about words. In practice, however, a simple, right-hand truncation search for a standard Latin word produces very poor results for two principal reasons. The first problem is that many Latin words have more than one distinct stem. For example, nouns such as vox (voice) yield the stems 'vox-' and 'voc-', while verbs such as agere (to do, drive) exhibit a minimum of three stems: 'ag-', 'eg-', and 'act-'. Accordingly, high recall will only be achieved in a right-hand truncation search if the user knows all of the stems of the query word and carries out several separate, but related, searches of the database for each distinct query word. A further problem is that a simple truncation search produces a large number of words that are unrelated to the initial query. This arises not only because Latin roots tend to be quite short (e.g. vox and agere as discussed above) but also because many Latin words with different meanings have similar roots, e.g. the three words portus (harbour), porta (gateway) and portare (to carry).
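Both problems can be seen in a small Python sketch of a multi-stem truncation search. The ten-word vocabulary and the function name are hypothetical and chosen purely for illustration (ager, 'field', is included as a word that is unrelated to agere); no real database is being queried.

```python
def truncation_search(stems, vocabulary):
    """Return, in alphabetical order, every word that begins with any
    of the stems supplied by the user."""
    prefixes = tuple(stems)  # str.startswith accepts a tuple of prefixes
    return sorted(word for word in vocabulary if word.startswith(prefixes))


# Hypothetical vocabulary: forms of vox and agere, the unrelated noun
# ager, and the three 'port-' words mentioned in the text.
vocabulary = ["vox", "vocis", "vocem", "agit", "egit", "actum",
              "ager", "portus", "porta", "portare"]

# Problem 1: a single stem gives poor recall ('vox' itself is missed).
print(truncation_search(["voc"], vocabulary))

# The user must know and enter every stem of the query word.
print(truncation_search(["vox", "voc"], vocabulary))

# Problem 2: short stems retrieve unrelated words ('ager' for agere).
print(truncation_search(["ag", "eg", "act"], vocabulary))

# Words with different meanings share the root 'port-'.
print(truncation_search(["port"], vocabulary))
```

With this vocabulary, the search on 'voc-' alone omits vox, the three-stem search for agere also retrieves ager, and the search on 'port-' cannot distinguish harbour, gateway and carrying.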

This combination of factors means that when Latin words are truncated to their linguistic roots, such as ducere to 'duc-' or mater to 'matr-', the resultant stems are so short that many other words may begin with the same three or four letters, and many of these words may not be semantically related to the query word. For example, when the verb ducere (to lead) was sought in the Patrologia Latina (a database that contains a huge compendium of patristic and medieval thought originally published in the middle of the nineteenth century) with the stem 'duc-', a total of 404 different words was retrieved. Only 118 (just 29%) of these words, at most, can actually be variants of the query verb, since Latin verbs have a maximum of 144 different forms and since twenty-six of the forms of ducere commence with 'dux-', rather than 'duc-'. By similar reasoning, only 7% at most of the 2,104 words retrieved in a search of the same database for the verb fero, ferre, tuli, latum (to carry), using the three stems 'fer-', 'tul-' and 'lat-', can possibly be variants of the query verb. The results are somewhat better when words with longer roots are searched. For example, searching the stem 'ambul-' for the verb ambulare (to walk) retrieved 129 different words, of which 94 (73%) were found to be query variants. These occurrences are examples of the general problem of overstemming, which occurs when too short a stem remains after the removal of a word's ending(s), with the result that unrelated words are conflated to the same stem. The converse of this, understemming, occurs when too short a suffix is removed (so that related words are not all conflated to the same stem). Both problems occur with any stemming algorithm [13, 14]; in the particular context of Latin, we believe that understemming is of less importance since it can be circumvented, at least in part, by an appropriate algorithm design, as detailed below.
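The percentages quoted above follow from simple arithmetic, which the short Python sketch below reproduces. All counts are those given in the text; the variable and function names are introduced here for illustration, and the figure for ferre assumes that the 7% bound is obtained by dividing the maximum of 144 verb forms by the 2,104 words retrieved, which is our reading of "by similar reasoning".

```python
MAX_VERB_FORMS = 144  # maximum number of forms of a Latin verb (from the text)


def share(possible_variants, retrieved):
    """Fraction of the retrieved words that can be variants of the query."""
    return possible_variants / retrieved


# ducere: 26 of the 144 forms begin with 'dux-', so at most 118 begin with 'duc-'.
duc_possible = MAX_VERB_FORMS - 26
duc_share = share(duc_possible, 404)

# ferre: assumed upper bound of all 144 forms among the 2,104 words retrieved.
fer_share = share(MAX_VERB_FORMS, 2104)

# ambulare: 94 of the 129 retrieved words were found to be query variants.
ambul_share = share(94, 129)

for label, value in [("duc-", duc_share),
                     ("fer-/tul-/lat-", fer_share),
                     ("ambul-", ambul_share)]:
    print(f"{label}: {value:.1%}")
```

Rounded to whole percentages, the three values agree with the 29%, 7% and 73% reported above.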

Right-hand truncation can be used to search Latin text databases, but only if the user has sufficient knowledge of Latin to be able to enter all of the necessary stems for a given query word and to identify the (possibly large number of) related forms in the search output. In this paper, we describe a stemming algorithm for Latin that seeks to provide effective access to such databases for users with only a basic knowledge of the language. The next section provides a brief overview of the main characteristics of stemming algorithms, and we then discuss some of the main features of Latin that need to be reflected if a stemmer is to perform effectively. The following, and longest, section details the
