Semi-automatic extraction of multiword terms from domain-specific corpora

Date: 04 June 2018
Published date: 04 June 2018
Pages: 550-567
DOI: https://doi.org/10.1108/EL-06-2017-0128
Author: Vesna Pajić, Staša Vujičić Stanković, Ranka Stanković, Miloš Pajić
Subject Matter: Information & knowledge management, Information & communications technology, Internet
Vesna Pajić
Department for Mathematics and Physics, University of Belgrade, Belgrade, Serbia
Staša Vujičić Stanković
Department for Computer Science and Informatics, University of Belgrade, Belgrade, Serbia
Ranka Stanković
Chair of Applied Mathematics and Informatics, University of Belgrade, Belgrade, Serbia, and
Miloš Pajić
Department of Agricultural Engineering, University of Belgrade, Belgrade, Serbia
Abstract
Purpose: A hybrid approach is presented, which combines linguistic and statistical information to
semi-automatically extract multiword term candidates from texts.
Design/methodology/approach: The method is designed to be domain and language independent,
focusing on languages with rich morphology. Here, it is used for extracting multiword terms from texts in
Serbian, belonging to the agricultural engineering domain, as a use case. Predefined syntactic structures were
used for multiword terms. For each structure, a finite state transducer was developed, which recognizes text
sequences having that structure and outputs the sequence in a normalized form, so that different inflectional
forms of the same multiword term can be counted properly. Term candidates were further filtered by their
frequencies and evaluated by two domain experts.
Findings: By using language resources, such as electronic dictionaries and grammars, 928 multiword
terms were extracted out of 1,523 multiword terms that were recognized as candidates from a corpus having
42,260 different simple word forms; 870 of these were new, not already contained in the existing electronic
dictionary of compounds for Serbian, and they were used to enrich the dictionary.
Originality/value: The paper presents methodology that can significantly contribute to the development of
terminology lexicons in different areas. In this particular use case, some important agricultural engineering
concepts were extracted from the text, but this approach could be used for other domains and languages as well.
Keywords: Digital documents, Data analysis, Evaluation, Information retrieval, Data processing,
Foreign languages, Data retrieval, Document handling
Paper type: Research paper
Received 9 June 2017. Revised 12 August 2017 and 28 August 2017. Accepted 30 September 2017.
The Electronic Library, Vol. 36 No. 3, 2018, pp. 550-567. © Emerald Publishing Limited, ISSN 0264-0473.
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0264-0473.htm
This paper is part of the research funded by the Ministry of Education, Science and Technological
Development of the Republic of Serbia, Ref. No. 178006 and III 47003.
1. Introduction
Multiword expressions (MWEs) are lexical units composed of more than one word, which are
syntactically, semantically, pragmatically and/or statistically idiosyncratic (Baldwin and
Kim, 2010). Domain-specific MWEs are usually referred to as multiword terms (MWTs). It is
estimated that they constitute a significant portion of terminology; over 70 per cent of the
terms are complex lexical units (da Graça Krieger and Finatto, 2004). It is difficult to identify
them automatically using existing methods because there are relatively few MWT instances
in big corpora, which cannot be spotted by exploiting their statistical properties only. The
extraction is even more complex for languages with rich morphological systems, such as
Serbian (Mariani, 2005; Vitas et al., 2005).
Collection and extraction of MWTs are two of the most important steps in the process
of creating a terminological lexicon, and they are the most time-consuming. Human
expert engagement cannot and should not be avoided, but such work could be
significantly facilitated by well-designed automatic or semi-automatic extraction
procedures. Thus, the present focus is on developing a method for identifying and
extracting MWTs directly from domain-specific corpora, which is suitable for processing
morphologically rich languages.
In this paper, a hybrid approach is presented, which combines linguistic and statistical
information to extract term candidates from texts in languages with rich morphological
systems. The method is designed to be domain and language independent, although the
current focus is on identification and extraction of MWTs from texts in Serbian, belonging
to the domain of agricultural engineering as a use case.
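To make the idea concrete, the following is a minimal sketch of the general pipeline outlined in the abstract: match predefined syntactic structures, normalize each match so that inflected variants are counted together, then filter by frequency. It is not the authors' implementation. The paper uses finite state transducers and electronic dictionaries, whereas this sketch swaps in a plain dictionary lookup over adjacent word pairs. The names `LEMMAS`, `PATTERNS` and `extract_candidates`, the part-of-speech tags and the English toy vocabulary are all assumptions made for illustration. Normalizing to lemmas is also a simplification, since a normalized Serbian term would have to preserve agreement between its components.

```python
from collections import Counter

# Hypothetical toy dictionary: inflected form -> (lemma, part-of-speech tag).
# English words are used only to keep the illustration readable.
LEMMAS = {
    "rotary": ("rotary", "A"),
    "heavy": ("heavy", "A"),
    "harrow": ("harrow", "N"),
    "harrows": ("harrow", "N"),
    "soil": ("soil", "N"),
    "soils": ("soil", "N"),
}

# Assumed two-word structures: adjective + noun, noun + noun.
PATTERNS = {("A", "N"), ("N", "N")}


def extract_candidates(tokens, min_freq=2):
    """Count normalized two-word candidates and keep the frequent ones."""
    counts = Counter()
    for first, second in zip(tokens, tokens[1:]):
        a = LEMMAS.get(first.lower())
        b = LEMMAS.get(second.lower())
        if a is None or b is None:
            continue  # word unknown to the toy dictionary
        if (a[1], b[1]) in PATTERNS:
            # Key on lemmas so that inflected variants share one counter.
            counts[(a[0], b[0])] += 1
    return {cand: freq for cand, freq in counts.items() if freq >= min_freq}


text = "the rotary harrow breaks heavy soil and rotary harrows loosen heavy soils"
candidates = extract_candidates(text.split())
```

In this toy input, "rotary harrow" and "rotary harrows" fall under the same normalized key, which is the effect the normalization step is meant to achieve before frequencies are compared.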
2. Related work
In the past two decades, there has been considerable natural language processing (NLP)
research into MWEs (Liang et al., 2017b; Nakov and Hearst, 2013; Ramisch, 2015; Tsvetkov
and Wintner, 2014). Work on MWEs in English still dominates, although there has been
some research in languages other than English, such as Czerepowicka and Savary (2015) for
Polish; Liang et al. (2017a) for Chinese; Macken and Tezcan (2016) for Dutch;
Mandravickaite and Krilavičius (2017) for Latvian and Lithuanian; and Zaninello and
Nissim (2010) for Italian.
The problem of MWE extraction from literary texts in Serbian was described in detail by
Krstev et al. (2014). The authors presented finite state automata (FSA) for describing MWEs
that have a predictable structure and potentially infinite number of instances (e.g. date and
time expressions). They also identified the most frequent structures of Serbian MWEs.
These structures will be further explained later in this paper as they were used for
extracting MWTs in the current experiment.
Automatic term extraction is an important part of NLP systems. It is used for lexicon
creation, acquisition of novel terms, text classification, text indexing, machine-assisted
translation and other NLP tasks. Different approaches to MWT extraction, linguistics- or
statistics-based (or both), have been published recently (Cram and Daille, 2016;
Sclano and Velardi, 2007; Verberne et al., 2016; Vivaldi and Rodríguez, 2007; Yin et al., 2016;
Zhang and Wu, 2012). Most of the methods used for MWT extraction today are hybrid; that
is, they usually integrate statistical information, such as frequencies of n-grams and
collocations, with linguistic information, such as syntactic patterns of expressions. There is
no consensus on which method is best, or even on whether a single best method exists. The
choice depends on what expressions are considered MWTs and on the level of their
compositionality, the text domain, language specifics and application needs. As statistical
information, different frequency and association measures are used in the MWT extraction
process, such as T-score (Church et al., 1991), the log-likelihood ratio (LLR; Dunning, 1993),
C/NC value (Frantzi et al., 1998) and keyness (Scott and Tribble, 2006).
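As an illustration of two of these association measures, the sketch below computes the t-score and the log-likelihood ratio for a two-word candidate from raw counts. It uses the commonly cited textbook formulations, t = (O - E) / sqrt(O) and G² = 2 Σ O ln(O / E) over the 2x2 contingency table, which may differ in detail from the variants used in the works cited above. The function and parameter names are assumptions: `c12` is the frequency of the word pair, `c1` and `c2` are the frequencies of its first and second word, and `n` is the total number of pairs in the corpus.

```python
import math


def t_score(c12, c1, c2, n):
    """t-score of a word pair: (observed - expected) / sqrt(observed)."""
    expected = c1 * c2 / n  # expected pair count if the words were independent
    return (c12 - expected) / math.sqrt(c12)


def log_likelihood_ratio(c12, c1, c2, n):
    """G^2 statistic computed from the 2x2 contingency table of a word pair."""
    observed = [
        [c12, c1 - c12],                 # first word with / without second word
        [c2 - c12, n - c1 - c2 + c12],   # other first words with / without it
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                e = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / e)
    return 2.0 * g2
```

Both measures are zero when the pair occurs exactly as often as independence would predict and grow as the observed frequency exceeds the expected one. In a hybrid setting, such a score would typically be used to rank or filter the candidates that already match a syntactic pattern.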