Semi-automatic extraction of multiword terms from domain-specific corpora

Date	04 June 2018
Published date	04 June 2018
Pages	550-567
DOI	https://doi.org/10.1108/EL-06-2017-0128
Author	Vesna Pajić,Staša Vujičić Stanković,Ranka Stanković,Miloš Pajić
Subject Matter	Information & knowledge management,Information & communications technology,Internet

Semi-automatic extraction of multiword terms from domain-specific corpora

Vesna Pajić

Department for Mathematics and Physics, University of Belgrade, Belgrade, Serbia

Staša Vujičić Stanković

Department for Computer Science and Informatics, University of Belgrade, Belgrade, Serbia

Ranka Stanković

Chair of Applied Mathematics and Informatics, University of Belgrade, Belgrade, Serbia, and

Miloš Pajić

Department of Agricultural Engineering, University of Belgrade, Belgrade, Serbia

Abstract

Purpose – A hybrid approach is presented, which combines linguistic and statistical information to semi-automatically extract multiword term candidates from texts.

Design/methodology/approach – The method is designed to be domain and language independent, focusing on languages with rich morphology. Here, it is used for extracting multiword terms from texts in Serbian, belonging to the agricultural engineering domain, as a use case. Predefined syntactic structures were used for multiword terms. For each structure, a finite state transducer was developed, which recognizes text sequences having that structure and outputs the sequence in a normalized form, so that different inflectional forms of the same multiword term can be counted properly. Term candidates were further filtered by their frequencies and evaluated by two domain experts.

Findings – By using language resources, such as electronic dictionaries and grammars, 928 multiword terms were extracted out of 1,523 multiword terms that were recognized as candidates from a corpus having 42,260 different simple word forms; 870 of these were new, not already contained in the existing electronic dictionary of compounds for Serbian, and they were used to enrich the dictionary.

Originality/value – The paper presents methodology that can significantly contribute to the development of terminology lexicons in different areas. In this particular use case, some important agricultural engineering concepts were extracted from the text, but this approach could be used for other domains and languages as well.

Keywords Digital documents, Data analysis, Evaluation, Information retrieval, Data processing, Foreign languages, Data retrieval, Document handling

Paper type Research paper
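The counts reported in the Findings can be turned into proportions with a quick check. The snippet below is only an arithmetic illustration; the variable names and the informal labels in the comments are ours and do not come from the paper.

```python
# Counts as reported in the abstract.
candidates = 1523   # multiword term candidates recognized in the corpus
accepted = 928      # candidates accepted as multiword terms
new_terms = 870     # accepted terms not yet in the dictionary of compounds

# Share of candidates that were accepted as terms.
print(f"{accepted / candidates:.2%}")   # 60.93%

# Share of accepted terms that were new to the dictionary.
print(f"{new_terms / accepted:.2%}")    # 93.75%

# Remaining counts implied by the figures above.
print(candidates - accepted)            # 595 candidates not accepted
print(accepted - new_terms)             # 58 terms already in the dictionary
```
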

Received 9 June 2017
Revised 12 August 2017, 28 August 2017
Accepted 30 September 2017

The Electronic Library, Vol. 36 No. 3, 2018, pp. 550-567

The current issue and full text archive of this journal is available on Emerald Insight at: www.emeraldinsight.com/0264-0473.htm

This paper is part of the research funded by the Ministry of Education, Science and Technological Development of the Republic of Serbia, Ref. No. 178006 and III 47003.

1. Introduction

Multiword expressions (MWEs) are lexical units composed of more than one word, which are syntactically, semantically, pragmatically and/or statistically idiosyncratic (Baldwin and Kim, 2010). Domain-specific MWEs are usually referred to as multiword terms (MWTs). It is estimated that they constitute a significant portion of terminology; over 70 per cent of the terms are complex lexical units (da Graça Krieger and Finatto, 2004). It is difficult to identify them automatically using existing methods because there are relatively few MWT instances in big corpora, which cannot be spotted by exploiting their statistical properties only. The extraction is even more complex for languages with rich morphological systems, such as Serbian (Mariani, 2005; Vitas et al., 2005).

Collection and extraction of MWTs are two of the most important steps in the process of creating a terminological lexicon, and they are the most time-consuming. Human expert engagement cannot and should not be avoided, but such work could be significantly facilitated by well-designed automatic or semi-automatic extraction procedures. Thus, the present focus is on developing a method for identifying and extracting MWTs directly from domain-specific corpora, which is suitable for processing morphologically rich languages.

In this paper, a hybrid approach is presented, which combines linguistic and statistical information to extract term candidates from texts in languages with rich morphological systems. The method is designed to be domain and language independent, although the current focus is on identification and extraction of MWTs from texts in Serbian, belonging to the domain of agricultural engineering as a use case.
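The three steps named in the abstract (recognizing a predefined syntactic structure, normalizing inflected forms, and filtering by frequency) can be illustrated with a minimal sketch. It is not the finite state transducers described in the paper: the toy lexicon, the function name `extract_candidates`, the adjective-noun pattern and the example sentences are all assumptions made for illustration, and the normalized form is approximated by a sequence of lemmas rather than by an agreement-aware citation form.

```python
from collections import Counter

# Hypothetical toy lexicon: inflected form -> (lemma, part of speech).
# A real system would consult a full morphological dictionary instead.
LEXICON = {
    "poljoprivredna": ("poljoprivredni", "A"),
    "poljoprivrednu": ("poljoprivredni", "A"),
    "nova": ("nov", "A"),
    "mašina": ("mašina", "N"),
    "mašinu": ("mašina", "N"),
    "kupio": ("kupiti", "V"),
    "je": ("jesam", "V"),
    "radi": ("raditi", "V"),
}


def extract_candidates(tokens, pattern=("A", "N"), min_freq=2):
    """Count lemma sequences whose parts of speech match `pattern`."""
    # Unknown tokens (including punctuation) map to None and never match.
    tagged = [LEXICON.get(token.lower()) for token in tokens]
    counts = Counter()
    width = len(pattern)
    for start in range(len(tagged) - width + 1):
        window = tagged[start:start + width]
        if all(entry is not None and entry[1] == pos
               for entry, pos in zip(window, pattern)):
            # Normalization step: different inflected forms share one key.
            counts[tuple(entry[0] for entry in window)] += 1
    # Frequency filter: keep only keys seen at least `min_freq` times.
    return {key: n for key, n in counts.items() if n >= min_freq}


# Illustrative text with two inflected forms of the same adjective-noun pair.
tokens = ("Poljoprivredna mašina radi . Kupio je poljoprivrednu mašinu . "
          "Nova mašina radi .").split()
result = extract_candidates(tokens)
print(result)  # {('poljoprivredni', 'mašina'): 2}
```

Counting surface forms instead of lemma keys would give each inflected variant a frequency of one, so none would pass the threshold; this is the counting problem that normalization is meant to address.
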

2. Related work

In the past two decades, there has been considerable natural language processing (NLP) research into MWEs (Liang et al., 2017b; Nakov and Hearst, 2013; Ramisch, 2015; Tsvetkov and Wintner, 2014). Work on MWEs in English still dominates, although there has been some research in languages other than English, such as Czerepowicka and Savary (2015) for Polish; Liang et al. (2017a) for Chinese; Macken and Tezcan (2016) for Dutch; Mandravickaite and Krilavičius (2017) for Latvian and Lithuanian; and Zaninello and Nissim (2010) for Italian.

The problem of MWE extraction from literary texts in Serbian was described in detail by Krstev et al. (2014). The authors presented finite state automata (FSA) for describing MWEs that have a predictable structure and potentially infinite number of instances (e.g. date and time expressions). They also identified the most frequent structures of Serbian MWEs. These structures will be further explained later in this paper as they were used for extracting MWTs in the current experiment.
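To illustrate what a predictable structure with an open-ended set of instances means, the sketch below recognizes simple date expressions with a regular expression, which describes a language that a finite state automaton can accept. It is not one of the automata of Krstev et al. (2014); the pattern, the English month names and the function name `find_dates` are assumptions chosen for illustration.

```python
import re

# Assumed date shape: day (1-31), English month name, four-digit year.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE = re.compile(rf"\b([1-9]|[12]\d|3[01]) ({MONTHS}) (\d{{4}})\b")


def find_dates(text):
    """Return every substring of `text` that matches the date pattern."""
    return [match.group(0) for match in DATE.finditer(text)]


sample = "Received 9 June 2017, revised 12 August 2017 and 28 August 2017."
print(find_dates(sample))
# ['9 June 2017', '12 August 2017', '28 August 2017']
```

The pattern checks only the shape of the expression, so an impossible date such as 31 February would still be accepted.
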

Automatic term extraction is an important part of NLP systems. It is used for lexicon creation, acquisition of novel terms, text classification, text indexing, machine-assisted translation and other NLP tasks. Different approaches to MWT extraction, linguistics- or statistics-based (or both), have been published recently (Cram and Daille, 2016; Sclano and Velardi, 2007; Verberne et al., 2016; Vivaldi and Rodríguez, 2007; Yin et al., 2016; Zhang and Wu, 2012). Most of the methods used for MWT extraction today are hybrid; that is, they usually integrate statistical information, such as frequencies of n-grams and collocations, with linguistic information, such as syntactic patterns of expressions. There is no consensus on the best method, or even on whether one exists. The choice depends on what expressions are considered MWTs and on the level of their compositionality, the text domain, language specifics and application needs. As statistical information, different frequency and association measures are used in the MWT extraction process, such as T-score (Church et al., 1991), the log-likelihood ratio (LLR; Dunning, 1993), C/NC value (Frantzi et al., 1998) and keyness (Scott and Tribble, 2006).
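Two of the measures mentioned above, the T-score and the log-likelihood ratio, can be computed for a two-word candidate from four corpus counts. The sketch below follows their commonly used textbook formulations for a 2x2 contingency table; the function names and the example counts are hypothetical and are not taken from the paper.

```python
import math


def t_score(o11, c1, c2, n):
    """T-score of a bigram: (observed - expected) / sqrt(observed).

    o11: bigram count, c1: count of the first word, c2: count of the
    second word, n: number of bigrams in the corpus. Assumes o11 > 0.
    """
    expected = c1 * c2 / n
    return (o11 - expected) / math.sqrt(o11)


def log_likelihood_ratio(o11, c1, c2, n):
    """Log-likelihood ratio: 2 * sum(O * ln(O / E)) over the 2x2 table."""
    observed = [o11, c1 - o11, c2 - o11, n - c1 - c2 + o11]
    expected = [c1 * c2 / n, c1 * (n - c2) / n,
                (n - c1) * c2 / n, (n - c1) * (n - c2) / n]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


# Hypothetical counts: each word occurs 10 times among 100 bigrams.
independent = (1, 10, 10, 100)   # pair seen as often as chance predicts
associated = (8, 10, 10, 100)    # pair seen far more often than chance

print(t_score(*independent), log_likelihood_ratio(*independent))
print(t_score(*associated), log_likelihood_ratio(*associated))
```

Both measures are zero when the observed count equals the count expected under independence, and both grow as the pair co-occurs more often than chance would predict.
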

