A study on automatic creation of a comparable document collection in cross‐language information retrieval

Date: 01 May 2006
Published date: 01 May 2006
DOI: https://doi.org/10.1108/00220410610666510
Pages: 372-387
Author: Tuomas Talvensaari, Jorma Laurikkala, Kalervo Järvelin, Martti Juhola
Subject Matter: Information & knowledge management, Library & information science
Tuomas Talvensaari and Jorma Laurikkala
Department of Computer Sciences, University of Tampere, Finland
Kalervo Järvelin
Department of Information Studies, University of Tampere, Finland, and
Martti Juhola
Department of Computer Sciences, University of Tampere, Finland
Abstract
Purpose – To present a method for creating a comparable document collection from two document
collections in different languages.
Design/methodology/approach – The best query keys were extracted from a Finnish source
collection (articles of the newspaper Aamulehti) with the relative average term frequency formula. The
keys were translated into English with a dictionary-based query translation program. The resulting
lists of words were used as queries that were run against the target collection (Los Angeles Times
articles) with the nearest neighbor method. The documents were aligned with unrestricted and
date-restricted alignment schemes, which were also combined.
Findings – The combined alignment scheme was found to be the best when the relatedness of the
document pairs was assessed with a five-degree relevance scale. Of the 400 document pairs, roughly 40
percent were highly or fairly related and 75 percent included at least lexical similarity.
Research limitations/implications – The number of alignment pairs was small due to the short
common time period of the two collections, and their geographical (and thus, topical) remoteness.
In the future, our aim is to build larger comparable corpora in various languages and use them as a source of
translation knowledge for the purposes of cross-language information retrieval (CLIR).
Practical implications – Readily available parallel corpora are scarce. With this method, two
unrelated document collections can relatively easily be aligned to create a CLIR resource.
Originality/value – The method can be applied to weakly linked collections and morphologically
complex languages, such as Finnish.
Keywords Information retrieval, Document management, Language and literature
Paper type Research paper
The current issue and full text archive of this journal is available at
www.emeraldinsight.com/0022-0418.htm
This research was funded, in part, by Tampere Graduate School in Information Science and
Engineering (TISE) and the Academy of Finland, Project Nos. 80771, 177033, 200844, 202185,
204978, 206568 and 1209960. The InQuery search engine was provided by the Center for
Intelligent Information Retrieval at the University of Massachusetts. FINTWOL (morphological
description of Finnish): Copyright (c) Kimmo Koskenniemi and Lingsoft plc. 1983-1993. TWOL-R
(run-time two-level program): Copyright (c) Kimmo Koskenniemi and Lingsoft plc.
1983-1992. GlobalDix Dictionary Software was used for automatic word-by-word translations.
Copyright (c) 1998 Kielikone plc, Finland. The SNOWBALL stemmers by Martin Porter.
Received June 2005
Revised September 2005
Accepted October 2005
Journal of Documentation
Vol. 62 No. 3, 2006
pp. 372-387
© Emerald Group Publishing Limited
0022-0418
DOI 10.1108/00220410610666510
1. Introduction
In traditional information retrieval tasks, queries and documents are in the same
language. Conversely, in cross-language information retrieval (CLIR) (Oard and
Diekema, 1998), the language of the queries (source language) and the language of the
document collection (target language) are different. The problem is basically similar to
that of the single language searches: to find documents in a collection that best match
the user’s request, but, additionally, we have to cross the language barrier somehow.
After the huge growth of the multilingual internet, CLIR has become more and more
important (Grefenstette, 1998).
There are various approaches to query formation in CLIR. Oard and Diekema (1998)
present a framework in which query formulation is examined from the viewpoints of
different matching strategies and the sources of translation knowledge needed in the
matching. Cognate matching does not involve actual translation. Instead, rules to
identify similarities in spelling or pronunciation are applied. For instance, proper nouns
and technical terms can be similar between languages. Such words often vary a little,
which allows the usage of approximate string matching, such as n-grams (Pirkola et al.,
2002a, 2003). Conversely, query translation, document translation and interlingual
matching techniques require deeper translation knowledge, which can be drawn from
ontologies, bilingual dictionaries, machine translation lexicons, or corpora. The query
translation approach is the most popular in CLIR and is also used in this study.
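
As an illustration of the approximate string matching mentioned above, the following minimal sketch compares words by their character n-grams. It assumes plain padded bigrams and the Dice coefficient; the function names and the example vocabulary are hypothetical, and the cited studies may use different n-gram variants and similarity measures.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a lower-cased, padded word."""
    padded = f"_{word.lower()}_"  # padding marks the word boundaries
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient over the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_cognate(source_word, target_vocabulary, n=2):
    """Pick the target word whose n-grams overlap most with the source word."""
    return max(target_vocabulary,
               key=lambda w: dice_similarity(source_word, w, n))


# Assumed example: a Finnish spelling matched against English index terms.
print(best_cognate("Kalifornia", ["California", "calendar", "Florida"]))
```

Because the two spellings share most of their bigrams, the Finnish form is matched to its English counterpart without any dictionary lookup.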
The target document collection could be translated into the source language, but it
would be a very complex task, yielding low quality, and it is far easier to translate
concise queries (Oard and Dorr, 1996; Oard and Diekema, 1998). The interlingual
techniques convert both the query and the documents into a language-independent
representation. However, some of these techniques, such as latent semantic
indexing, are computationally expensive.
The translation techniques are not mutually exclusive, but, on the contrary, can be
used jointly. Dictionary-based translation, where the query keys are simply replaced
by their counterparts in a bilingual dictionary (Hull and Grefenstette, 1996), is often the
starting point. Using this technique alone is nonetheless problematic according to
Ballesteros and Croft (1998), who listed three weaknesses. First, some of the translation
alternatives may not correspond to the words of the query in the sense desired by the
user. Extraneous terms increase the ambiguity of the query, which in turn damages
retrieval performance. Second, dictionaries are limited in scope. Special terms and
proper nouns are often absent from general dictionaries. Third, the recognition and
translation of phrases constructed from several words can be difficult. However, this is
not a major problem in languages where such phrases often form compound words
spelled together. The ambiguity introduced by translating the query can be dealt
with in many ways (Pirkola et al., 2001). For example, in part-of-speech tagging
the translation alternatives that have the same part-of-speech as the source language
words are selected.
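
The basic dictionary-based step and two of the weaknesses listed above can be sketched as follows. This is a minimal sketch under stated assumptions: the toy Finnish-English dictionary and the function name translate_query are hypothetical and do not represent the translation program used in this study.

```python
# Toy Finnish-English dictionary; the entries are illustrative assumptions.
TOY_DICTIONARY = {
    "kuusi": ["spruce", "six"],   # ambiguous: a tree or a numeral
    "metsä": ["forest", "wood"],
    "tuho": ["destruction", "damage"],
}


def translate_query(source_keys, dictionary):
    """Replace each query key with all of its dictionary translations.

    Returns (target_query, untranslated), where untranslated lists the
    keys that the dictionary does not cover.
    """
    target_query, untranslated = [], []
    for key in source_keys:
        translations = dictionary.get(key)
        if translations:
            # Every alternative is kept, including senses the user
            # did not intend, which adds ambiguity to the query.
            target_query.extend(translations)
        else:
            # Proper nouns and special terms often end up here.
            untranslated.append(key)
    return target_query, untranslated


query, missing = translate_query(["kuusi", "metsä", "Aamulehti"], TOY_DICTIONARY)
print(query)    # translation alternatives for the covered keys
print(missing)  # keys with no dictionary entry
```

The word "six" enters the target query although the user presumably meant the tree, and the proper noun is left untranslated, which illustrates the first two weaknesses.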
Corpus-based translation utilizes multilingual document collections, where the
documents of the two languages are aligned as pairs so that the documents in each pair are
translations of each other (parallel corpora) or at least deal with the same topic
(comparable corpora). Sheridan and Ballerini (1996) used an aligned corpus to
expand the source language queries with target language words that co-occur with
the query keys in the aligned documents, thus achieving a “translation effect.”
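
The idea of expanding a query with co-occurring target language words can be sketched as follows. This is a simplified illustration, not the method of Sheridan and Ballerini (1996): it assumes tokenized document pairs and scores target words by raw co-occurrence counts, whereas a practical system would apply some form of term weighting. The function name expand_query and the example data are hypothetical.

```python
from collections import Counter


def expand_query(source_keys, aligned_pairs, top_k=3):
    """Return up to top_k target words co-occurring with the source keys.

    aligned_pairs is a list of (source_tokens, target_tokens) tuples.
    A target word scores one point for each aligned pair whose source
    side contains at least one query key.
    """
    keys = set(source_keys)
    scores = Counter()
    for source_tokens, target_tokens in aligned_pairs:
        if keys & set(source_tokens):
            scores.update(set(target_tokens))
    return [word for word, _ in scores.most_common(top_k)]


# Assumed toy corpus of aligned Finnish-English token lists.
pairs = [
    (["vaalit", "presidentti"], ["election", "president"]),
    (["vaalit", "tulos"], ["election", "result"]),
    (["jalkapallo"], ["football", "match"]),
]
print(expand_query(["vaalit"], pairs, top_k=1))
```

No dictionary is consulted here: the target word is obtained only because it appears in the documents aligned with those containing the query key.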