A study on automatic creation of a comparable document collection in cross‐language information retrieval

Date	01 May 2006
Published date	01 May 2006
DOI	https://doi.org/10.1108/00220410610666510
Pages	372-387
Author	Tuomas Talvensaari, Jorma Laurikkala, Kalervo Järvelin, Martti Juhola
Subject Matter	Information & knowledge management, Library & information science

A study on automatic creation of a comparable document collection in cross-language information retrieval

Tuomas Talvensaari and Jorma Laurikkala

Department of Computer Sciences, University of Tampere, Finland

Kalervo Järvelin

Department of Information Studies, University of Tampere, Finland, and

Martti Juhola

Department of Computer Sciences, University of Tampere, Finland

Abstract

Purpose – To present a method for creating a comparable document collection from two document collections in different languages.

Design/methodology/approach – The best query keys were extracted from a Finnish source collection (articles of the newspaper Aamulehti) with the relative average term frequency formula. The keys were translated into English with a dictionary-based query translation program. The resulting lists of words were used as queries that were run against the target collection (Los Angeles Times articles) with the nearest neighbor method. The documents were aligned with unrestricted and date-restricted alignment schemes, which were also combined.

Findings – The combined alignment scheme was found to be the best when the relatedness of the document pairs was assessed with a five-degree relevance scale. Of the 400 document pairs, roughly 40 percent were highly or fairly related and 75 percent included at least lexical similarity.

Research limitations/implications – The number of alignment pairs was small due to the short common time period of the two collections and their geographical (and thus topical) remoteness. In the future, our aim is to build larger comparable corpora in various languages and use them as a source of translation knowledge for the purposes of cross-language information retrieval (CLIR).

Practical implications – Readily available parallel corpora are scarce. With this method, two unrelated document collections can relatively easily be aligned to create a CLIR resource.

Originality/value – The method can be applied to weakly linked collections and morphologically complex languages, such as Finnish.

Keywords Information retrieval, Document management, Language and literature

Paper type Research paper

The current issue and full text archive of this journal is available at www.emeraldinsight.com/0022-0418.htm

This research was funded, in part, by Tampere Graduate School in Information Science and Engineering (TISE) and the Academy of Finland, Project Nos. 80771, 177033, 200844, 202185, 204978, 206568 and 1209960. The InQuery search engine was provided by the Center for Intelligent Information Retrieval at the University of Massachusetts. FINTWOL (morphological description of Finnish): Copyright (c) Kimmo Koskenniemi and Lingsoft plc. 1983-1993. TWOL-R (run-time two-level program): Copyright (c) Kimmo Koskenniemi and Lingsoft plc. 1983-1992. GlobalDix Dictionary Software was used for automatic word-by-word translations.

Received June 2005

Revised September 2005

Accepted October 2005

Journal of Documentation

Vol. 62 No. 3, 2006

pp. 372-387

© Emerald Group Publishing Limited

0022-0418

DOI 10.1108/00220410610666510

1. Introduction

In traditional information retrieval tasks, queries and documents are in the same language. In cross-language information retrieval (CLIR) (Oard and Diekema, 1998), by contrast, the language of the queries (source language) and the language of the document collection (target language) are different. The problem is basically similar to that of single-language searches: to find the documents in a collection that best match the user's request. In addition, however, the language barrier has to be crossed somehow. With the huge growth of the multilingual internet, CLIR has become more and more important (Grefenstette, 1998).

There are various approaches to query formation in CLIR. Oard and Diekema (1998) present a framework where query formulation is examined from the viewpoints of different matching strategies and the sources of translation knowledge needed in the matching. Cognate matching does not involve actual translation. Instead, rules to identify similarities in spelling or pronunciation are applied. For instance, proper nouns and technical terms can be similar between languages. Such words often differ only slightly, which allows the use of approximate string matching, such as n-grams (Pirkola et al., 2002a, 2003). Conversely, query translation, document translation and interlingual matching techniques require deeper translation knowledge, which can be drawn from ontologies, bilingual dictionaries, machine translation lexicons, or corpora. The query translation approach is the most popular in CLIR and is also used in this study. The target document collection could be translated into the source language, but this would be a very complex task yielding low-quality results, and it is far easier to translate concise queries (Oard and Dorr, 1996; Oard and Diekema, 1998). The interlingual techniques convert both the query and the documents into a language-independent representation. However, some of these techniques, such as latent semantic indexing, are computationally expensive.
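As a rough illustration of the n-gram matching mentioned above for cognates, the following minimal sketch compares words by their shared character bigrams. The Dice coefficient, the bigram length, and the function names are assumptions made for this example; they are not taken from Pirkola et al. or from this study.

```python
def char_ngrams(word: str, n: int = 2) -> set:
    """Return the set of character n-grams of a lowercased word."""
    w = word.lower()
    if len(w) < n:
        return {w}
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over the two words' n-gram sets (0.0 to 1.0)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def best_cognates(source_word, target_vocabulary, n=2, top_k=3):
    """Rank target-language words by n-gram similarity to the source word."""
    scored = [(dice_similarity(source_word, t, n), t) for t in target_vocabulary]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(t, round(s, 3)) for s, t in scored[:top_k]]


# Hypothetical example: a Finnish spelling variant against English candidates.
candidates = ["california", "newspaper", "calendar"]
ranking = best_cognates("kalifornia", candidates)
```

Here the Finnish spelling "kalifornia" shares eight of its nine bigrams with "california", so that candidate is ranked first without any dictionary lookup.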

The translation techniques are not mutually exclusive; on the contrary, they can be used jointly. Dictionary-based translation, where the query keys are simply replaced by their counterparts in a bilingual dictionary (Hull and Grefenstette, 1996), is often the starting point. Using this technique alone is nonetheless problematic according to Ballesteros and Croft (1998), who listed three weaknesses. First, some of the translation alternatives may not correspond to the words of the query in the sense desired by the user. Extraneous terms increase the ambiguity of the query, which in turn damages retrieval performance. Second, dictionaries are limited in scope. Special terms and proper nouns are often absent from general dictionaries. Third, the recognition and translation of phrases constructed from several words can be difficult. However, this is not a major problem in languages where such phrases often form compound words spelled together. The ambiguity introduced by translating the query can be dealt with in many ways (Pirkola et al., 2001). For example, in part-of-speech tagging the translation alternatives that have the same part of speech as the source language words are selected.
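The first two weaknesses listed above can be made concrete with a minimal sketch of dictionary-based query translation. The toy dictionary, its entries, and the function name are hypothetical and chosen only for illustration; they do not reproduce the translation program or dictionary used in the study.

```python
# Toy Finnish-English dictionary; the entries are illustrative assumptions.
TOY_DICTIONARY = {
    "kuu": ["moon", "month"],  # ambiguous key: every sense is kept
    "lento": ["flight"],
    "tutkimus": ["research", "study", "examination"],
}


def translate_query(keys, dictionary):
    """Replace each source key with all of its dictionary translations.

    Keys missing from the dictionary (for example proper nouns) are passed
    through unchanged and also reported in a separate list.
    """
    translated, untranslated = [], []
    for key in keys:
        alternatives = dictionary.get(key.lower())
        if alternatives:
            translated.append((key, list(alternatives)))
        else:
            translated.append((key, [key]))
            untranslated.append(key)
    return translated, untranslated


translated, untranslated = translate_query(["kuu", "lento", "Apollo"], TOY_DICTIONARY)
# Flatten the alternatives into the target-language query.
target_query = [word for _, words in translated for word in words]
```

In this sketch the key "kuu" contributes both "moon" and "month" to the target query, which is the kind of extraneous term that increases ambiguity, while "Apollo" is not in the dictionary and is carried over untranslated.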

Corpus-based translation utilizes multilingual document collections, where the documents of the two languages are aligned as pairs so that the documents in each pair are translations of each other (parallel corpora) or at least deal with the same topic (comparable corpora). Sheridan and Ballerini (1996) used an aligned corpus to expand the source language queries with target language words that co-occur with the query keys in the aligned documents, thus achieving a "translation effect."
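The general idea of obtaining target language words from aligned documents can be sketched as follows. This is a deliberately simplified illustration, not the method of Sheridan and Ballerini (1996): it assumes a tiny hypothetical corpus of already tokenized, lowercased document pairs and ranks target words by the number of aligned pairs in which they co-occur with a query key.

```python
from collections import Counter

# Hypothetical aligned pairs: (Finnish tokens, English tokens).
ALIGNED_PAIRS = [
    (["vaalit", "presidentti", "ehdokas"], ["election", "president", "candidate"]),
    (["vaalit", "äänestys"], ["election", "vote"]),
    (["jalkapallo", "ottelu"], ["football", "match"]),
]


def expand_query(query_keys, aligned_pairs, top_k=3):
    """Return target words that co-occur most often with the query keys."""
    keys = {k.lower() for k in query_keys}
    counts = Counter()
    for source_tokens, target_tokens in aligned_pairs:
        # A pair contributes only if its source side contains a query key.
        if keys & set(source_tokens):
            counts.update(set(target_tokens))
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]


expansion = expand_query(["vaalit"], ALIGNED_PAIRS, top_k=2)
```

A practical system would weight terms rather than count raw co-occurrences, since frequent function words would otherwise dominate the ranking.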

