Morphological typology of languages for IR

Date: 01 June 2001
Published date: 01 June 2001
Pages: 330-348
DOI: https://doi.org/10.1108/EUM0000000007085
Author: Ari Pirkola
Subject Matter: Information & knowledge management, Library & information science
MORPHOLOGICAL TYPOLOGY OF LANGUAGES FOR IR
ARI PIRKOLA
pirkola@cc.jyu.fi
Department of Information Studies, University of Tampere
PO Box 607, 33101 Tampere, Finland
This paper presents a morphological classification of languages from
the IR perspective. Linguistic typology research has shown that the
morphological complexity of every language in the world can be
described by two variables, index of synthesis and index of fusion.
These variables provide a theoretical basis for IR research handling
morphological issues. A common theoretical framework is needed in
particular because of the increasing significance of cross-language
retrieval research and CLIR systems processing different languages.
The paper elaborates the linguistic morphological typology for the
purposes of IR research. It studies how the indexes of synthesis and
fusion could be used as practical tools in mono- and cross-lingual
IR research. The need for semantic and syntactic typologies is
discussed. The paper also reviews studies made in different
languages on the effects of morphology and stemming in IR.
1. INTRODUCTION
There are at least 4,000 languages in the world [1, 2]. The precise figure depends
on, for example, where the line is drawn between a dialect and a distinct
language.¹ Languages are classified on the one hand on the basis of their supposed
genetic relationships into language families, and on the other on linguistic
grounds. The language families include Indo-European (the largest family,
including the western languages), Finno-Ugric (including Finnish and Hungarian) and
Sino-Tibetan (including Chinese). Some languages are difficult to include in the
established families, and they are called isolates (e.g. Japanese). The traditional
morphological typology distinguishes four language types. The syntactic
typology formulated by Greenberg divides languages into different types on the basis of
the order of sentence elements [4].
This paper presents a morphological classification of languages from the
standpoint of IR. It considers morphology associated with texts, i.e. the written
form of languages. IR research is an international research area. Monolingual
research is performed in different languages. Cross-language information
retrieval (CLIR) has become an important research area on a global scale [5–7]. It
is difficult to follow and do research if one does not master the languages
involved. This difficulty could be relieved by a common linguistic framework
applicable to IR. This study collects the results of morphological typology
research done in linguistics and combines the results into a theoretical framework
for IR research. The present paper shows that the variation in morphological
properties among the world’s languages is high. However, it also shows that the
same morphological processes affect all languages and that all languages can be
described using the same morphological variables. This paper also discusses
lexical-semantic variation in languages, but the theoretical framework only covers
the structure of words.
¹ Saussure discusses the difference between a language and a dialect [3].
Journal of Documentation, vol. 57, no. 3, May 2001, pp. 330–348
The aim of the paper is also to provide practical tools for IR research, in
particular for text retrieval research. Text retrieval refers to retrieving documents
from text databases, i.e. electronic collections of documents, such as magazine,
journal and newspaper articles. Morphological typology research has shown that
it is possible to describe the morphological complexity of each language using
two variables, index of synthesis and index of fusion [8–10]. The former
describes the amount of affixation in an individual language, and the latter the
ease with which affixes can be segmented in words in a language. It is proposed
in the present paper that, for each language, these variables could be utilised in IR
within a language and across languages as practical tools in system development
and evaluation.
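As a rough illustration of how the amount of affixation might be quantified, the sketch below computes a morphemes-per-word ratio over a small sample. This ratio is one common way the index of synthesis is operationalised in the typology literature; the passage above does not define the calculation, so the formula, the function name index_of_synthesis and the hand-segmented sample are all assumptions made for illustration only. The index of fusion is not computed here.

```python
def index_of_synthesis(segmented_words):
    """Return the average number of morphemes per word.

    `segmented_words` is a list of words, each given as a list of its
    morphemes. The segmentation is assumed to be supplied by hand or by
    an external morphological analyser (not shown here).
    """
    if not segmented_words:
        raise ValueError("sample must contain at least one word")
    morpheme_count = sum(len(word) for word in segmented_words)
    return morpheme_count / len(segmented_words)


# Hypothetical hand-segmented sample: 4 words, 7 morphemes in total.
sample = [["sing", "s"], ["sing", "ing"], ["the"], ["dog", "s"]]
ratio = index_of_synthesis(sample)  # 7 morphemes / 4 words
```

Under this assumed definition, a higher ratio would indicate more affixation per word. A realistic estimate would need a much larger sample and a consistent segmentation policy.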
The rest of this paper is organised as follows. Section 2 considers the central
concepts of morphology. Section 3 considers the most important morphological
phenomena related to information retrieval, i.e. inflection, derivation and
compound words, and reviews studies done on the effects of stemming in IR. Section
4 presents the traditional morphological typology as well as the recent one based
on the variables of index of synthesis and index of fusion. In Section 5 the recent
morphological typology is subcategorised for the purpose of IR. Section 6
considers how languages differ in inflection, derivation and the frequency of
compound words. Section 7 discusses how the indexes of synthesis and fusion
could be utilised in empirical IR research and system development. In Section 8
the need for semantic and syntactic typologies is discussed. Section 9 presents
conclusions.
2. CORE CONCEPTS OF MORPHOLOGY
Morphology is the field of linguistics which studies word structure and formation.
It is composed of inflectional morphology and derivational morphology [9, 11,
12]. Inflection is defined as the use of morphological methods to create
inflectional word forms from a lexeme.² Inflectional word forms indicate grammatical
relations between words. Derivational morphology is concerned with the
derivation of new words from other words using derivational affixes. Compounding is
another method for forming new words. A compound word (or a compound) is
defined as a word formed from two or more words written together. The
component words are themselves independent words (free morphemes).
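To make the relation between a lexeme and its inflectional word forms concrete from a retrieval point of view, the following sketch maps word forms back to their lexeme so that a query form and a document form can match. It uses the word forms of sing given in the paper's footnote on lexemes. The lookup-table approach and the names LEXEMES, build_form_index and normalise are illustrative assumptions, not a method described in the paper.

```python
# Word forms of the lexeme "sing", as listed in the paper's footnote.
LEXEMES = {"sing": ["sing", "sang", "sung", "sings", "singing"]}


def build_form_index(lexemes):
    """Invert a lexeme -> word forms table into word form -> lexeme."""
    index = {}
    for lexeme, forms in lexemes.items():
        for form in forms:
            index[form.lower()] = lexeme
    return index


def normalise(tokens, form_index):
    """Replace known word forms by their lexeme; lowercase unknown tokens."""
    return [form_index.get(t.lower(), t.lower()) for t in tokens]


FORM_INDEX = build_form_index(LEXEMES)
query = normalise(["sang"], FORM_INDEX)
document = normalise(["She", "sings", "daily"], FORM_INDEX)
# The query and the document share the lexeme although the surface forms differ.
matches = set(query) & set(document)
```

A table of this kind only covers the forms it lists; handling a whole vocabulary would require a morphological analyser or a stemmer, which is outside the scope of this sketch.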
A morpheme is the smallest unit of a language which has a meaning [9, 15].
² A lexeme is a set of word forms which belong together [13], or a word considered as a
lexical unit, in abstraction from the specific word forms it takes in specific constructions
[14]. For example, the lexeme sing has the following word forms or inflectional forms:
sing, sang, sung, sings, singing.
