Bilingual lexical extraction based on word alignment for improving corpus search

Date: 05 August 2019
Pages: 722-739
DOI: https://doi.org/10.1108/EL-03-2019-0056
Published date: 05 August 2019
Author: Jelena Andonovski, Branislava Šandrih, Olivera Kitanović
Subject Matter: Information & knowledge management, Information & communications technology, Internet
Jelena Andonovski
University of Belgrade, University Library Svetozar Markovic, Belgrade, Serbia
Branislava Šandrih
University of Belgrade, Faculty of Philology, Belgrade, Serbia, and
Olivera Kitanović
University of Belgrade, Faculty of Mining and Geology, Belgrade, Serbia
Abstract
Purpose: This paper aims to describe the structure of an aligned Serbian-German literary corpus
(SrpNemKor) contained in a digital library Bibliša. The goal of the research was to create a benchmark
Serbian-German annotated corpus searchable with various query expansions.
Design/methodology/approach: The presented research is particularly focused on the enhancement of
bilingual search queries in a full-text search of the aligned SrpNemKor collection. The enhancement is based on
using existing lexical resources such as Serbian morphological electronic dictionaries and the bilingual lexical
database Termi.
Findings: For the purpose of this research, the lexical database Termi is enriched with a bilingual list of
German-Serbian translated pairs of lexical units. The list of correct translation pairs was extracted from
SrpNemKor, evaluated and integrated into Termi. Also, Serbian morphological e-dictionaries are updated
with new entries extracted from the Serbian part of the corpus.
Originality/value: A bilingual search of SrpNemKor in Bibliša is available within the user-friendly
platform. The enriched database Termi enables semantic enhancement and refinement of users' search query
based on synonyms both in Serbian and German at a very high level. Serbian morphological e-dictionaries
facilitate the morphological expansion of search queries in Serbian, thereby enabling the analysis of concepts
and concept structures by identifying terms assigned to the concept, and by establishing relations between
terms in Serbian and German, which makes Bibliša a valuable Web tool that can support research and
analysis of SrpNemKor.
Keywords Digital libraries, Aligned parallel corpora, Bilingual lexical resources,
Lexical unit extraction, Bilingual search
Paper type Research paper
1. Introduction
Aligned multilingual corpora (bodies of text in parallel translation, also known as bitexts)
have become an essential resource for work in multilingual natural language processing
(NLP). These corpora represent the relationships between units in a source language and
their translation in a target language and, thus, they are an important digital resource for
establishing equivalents between languages. More specifically, these types of corpora allow
researchers to determine the frequency of the occurrence of a particular word or a phrase
defined in a search query in two or more languages, their grammatical forms and variants,
as well as their semantic correlation with other words and phrases and their forms in two or
more languages.

The authors would like to thank Prof Dr Krstev Cvetana and Prof Dr Ranka Stanković for
carefully rereading our paper several times and providing numerous suggestions for its improvement.
This research was supported by the Serbian Ministry of Education and Science under the grants
178006 and TR 33003.
Received 1 March 2019
Revised 28 May 2019
13 July 2019
Accepted 30 July 2019
The Electronic Library
Vol. 37 No. 4, 2019
pp. 722-739
© Emerald Publishing Limited
0264-0473
DOI 10.1108/EL-03-2019-0056
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0264-0473.htm

The Belgrade NLP group, researchers from the University of Belgrade of various
backgrounds, has been developing language resources and tools for the processing of
Serbian for decades, some of which are aligned multilingual corpora. The aligned texts
developed are stored in the Corpus of Contemporary Serbian and in the aligned textual
collections supported by the digital library Bibliša, both developed with the aim of enabling
advanced search possibilities in the above-mentioned bilingual aligned textual collections.
The Corpus of Contemporary Serbian (Utvić, 2013) contains several parallel corpora aligned
with mostly literary texts but also texts from other domains, such as general news, scientific
journals, Web journalism, health, law, education and movie subtitles[1].
The digital library Bibliša[2] contains collections of aligned texts from several scientific
journals published bilingually in Serbia, texts produced within international projects, and
some parallel corpora. This paper is focused on the Serbian-German literary corpus
(SrpNemKor), which is stored in Bibliša. Furthermore, the paper analyses the possibilities of
bilingual search of SrpNemKor in Bibliša based on existing bilingual Serbian-German
lexical resources.
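To make the idea of resource-based bilingual search more concrete, the following is a minimal sketch of how a single Serbian lemma might be expanded with synonyms, inflected forms and German equivalents before a corpus is searched. It is an illustration under stated assumptions, not the Bibliša implementation: the three small dictionaries, their entries and the function name expand_query are hypothetical stand-ins for the Serbian morphological e-dictionaries and the Termi database.

```python
from typing import Dict, List, Set

# Hypothetical toy resources (assumptions for illustration only);
# the real e-dictionaries and Termi are far richer.
MORPH_FORMS: Dict[str, List[str]] = {
    "kuća": ["kuća", "kuće", "kući", "kuću", "kućom"],
    "dom": ["dom", "doma", "domu", "domom"],
}
SYNONYMS_SR: Dict[str, List[str]] = {"kuća": ["dom"]}
TRANSLATIONS_SR_DE: Dict[str, List[str]] = {"kuća": ["Haus"], "dom": ["Heim"]}


def expand_query(lemma: str) -> Dict[str, Set[str]]:
    """Expand a Serbian lemma semantically, morphologically and bilingually."""
    # Semantic expansion: the lemma itself plus its listed synonyms.
    lemmas = [lemma] + SYNONYMS_SR.get(lemma, [])
    serbian: Set[str] = set()
    german: Set[str] = set()
    for item in lemmas:
        # Morphological expansion: all known inflected forms,
        # falling back to the lemma when no forms are listed.
        serbian.update(MORPH_FORMS.get(item, [item]))
        # Bilingual expansion: German equivalents of each lemma.
        german.update(TRANSLATIONS_SR_DE.get(item, []))
    return {"sr": serbian, "de": german}


expanded = expand_query("kuća")
print(sorted(expanded["sr"]))
print(sorted(expanded["de"]))
```

The two resulting term sets could then be matched against the Serbian and German sides of an aligned collection, so that a query typed as one lemma also retrieves segments containing its inflected forms, synonyms and translations.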
The next section presents a brief overview of previous work related to bilingual
terminology extraction based on parallel corpora. Section 3 provides the structure of
SrpNemKor and, subsequently, describes the pre-processing and alignment process of
selected texts, as well as the language tools and resources used for these purposes. Section 4
introduces the web tool Bibliša, as well as the incorporation of SrpNemKor within this tool,
while the search advantages of SrpNemKor within Bibliša, followed by the evaluation of
bilingual corpus search, are described in Section 5. In Section 6, the achieved results and
plans for further work are put forward.
2. Related work
2.1 Bilingual lexical extraction
Over the years, various researchers have used different techniques for multi-word unit
(MWU) extraction and alignment to compile bilingual lexica. These approaches differ in
terms of their methodology, the resources used, the languages involved and the purpose for
which they have been built.
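As a rough illustration of what compiling a bilingual lexicon from aligned text can involve, the sketch below scores candidate translation pairs by their co-occurrence in sentence-aligned segments, using the Dice coefficient. This is a deliberately simple, generic technique chosen for illustration; it is not the method of any of the works cited here, it handles single words rather than MWUs, and the toy bitext and function names are assumptions.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical sentence-aligned toy bitext (Serbian, German), already tokenised.
BITEXT: List[Tuple[List[str], List[str]]] = [
    (["stara", "kuća"], ["altes", "Haus"]),
    (["nova", "kuća"], ["neues", "Haus"]),
    (["stara", "knjiga"], ["altes", "Buch"]),
]


def dice_scores(
    bitext: List[Tuple[List[str], List[str]]]
) -> Dict[Tuple[str, str], float]:
    """Dice coefficient 2*c(s,t) / (c(s) + c(t)) over aligned segments."""
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    pair_count: Counter = Counter()
    for src, tgt in bitext:
        # Count each word at most once per segment.
        src_set, tgt_set = set(src), set(tgt)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def best_translations(
    bitext: List[Tuple[List[str], List[str]]]
) -> Dict[str, Tuple[str, float]]:
    """Keep the highest-scoring target candidate for every source word."""
    best: Dict[str, Tuple[str, float]] = {}
    for (s, t), score in dice_scores(bitext).items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


lexicon = best_translations(BITEXT)
for source_word, (target_word, score) in sorted(lexicon.items()):
    print(source_word, target_word, round(score, 2))
```

On realistic data such candidate lists are noisy, which is one reason the approaches surveyed in this section differ in the additional resources they bring in and in how the extracted pairs are evaluated.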
In several cases, the bilingual lists of MWUs were compiled to improve statistical
machine translation of an existing machine translation system (Arcan et al., 2017; Bouamor
et al., 2012; Irvine and Callison-Burch, 2016; Naguib, 2016; Oliver, 2017; Semmar, 2018;
Tsvetkov and Wintner, 2010); for the development of an existing language resource in a
target language on the basis of a corresponding resource in a source language, where examples
include the development of the Slovenian WordNet (Vintar and Fišer, 2008) based on the
English WordNet and the development of the bilingual terminology based on the aligned
corpora for the library and information science domain (Krstev et al., 2018); or for the
presentation of bilingual correspondences between two languages, for example,
correspondences in a Slovak-Bulgarian parallel corpus (Garabík and Dimitrova, 2015).
Some of these approaches rely on the existence of a seed lexicon (Semmar, 2018; Tsvetkov
and Wintner, 2010; Xu et al., 2015) or existing translation memories and phrase tables (Oliver,
2017), while in some cases the existence of additional resources, in addition to the input