Grammar checking and relation extraction in text: approaches, techniques and open challenges

Pages: 373-394
DOI: https://doi.org/10.1108/DTA-01-2019-0001
Publication Date: 01 July 2019
Author: Nora Madi, Rawan Al-Matham, Hend Al-Khalifa
Subject: Library & information science
Nora Madi, Rawan Al-Matham and Hend Al-Khalifa
Department of Information Technology,
King Saud University, Riyadh, Saudi Arabia
Abstract
Purpose: The purpose of this paper is to provide an overall review of grammar checking and relation
extraction (RE) literature, their techniques and the open challenges associated with them, and, finally, to
suggest future directions.
Design/methodology/approach: The review of grammar checking and RE was carried out using the
following protocol: we prepared research questions, planned the search strategy, defined paper selection
criteria to distinguish relevant works, extracted data from these works and, finally, analyzed and synthesized
the data.
Findings: The output of error detection models could be used for creating a profile of a certain writer. Such
profiles can be used for author identification, native language identification or even estimating the level of
education, to name a few. The automatic extraction of relations could be used to build or complete electronic
lexical thesauri and knowledge bases.
Originality/value: Grammar checking is the process of detecting and sometimes correcting erroneous
words in the text, while RE is the process of detecting and categorizing predefined relationships between
entities or words that were identified in the text. The authors found that the most obvious challenge is the lack
of data sets, especially for low-resource languages. Also, the lack of unified evaluation methods hinders the
ability to compare results.
Keywords: Evaluation, Review, Approaches, Techniques, Grammar checking, Relation extraction
Paper type: Literature review
1. Introduction
Natural language processing (NLP) refers to processing human languages automatically
using computational algorithms. Grammar checking (GC) has gained popularity in the area
of NLP and different approaches have been used in order to build grammar error detection
and/or correction systems (Madi and Al-khalifa, 2018). Various works use machine learning,
mainly classification, to solve this task. Classification requires engineering features from the
training data, for example, semantic relations (Xiang et al., 2013). Semantic relations are
the associations that exist between the meanings of words.
Relation extraction (RE) is another NLP task concerned with detecting or classifying the
relationships between words. RE has been utilized for many purposes such as extracting
semantic relations for creating lexical thesauri and extracting grammatical relations for
building grammar error detection and/or correction systems (Maynard et al., 2016).
Several review papers on grammar checking and RE were found in the literature. For
example, Madi and Al-khalifa (2018) reviewed previous work on grammatical error
correction (GEC) and detection systems, discussed some shared tasks and covered
works that follow syntax-based, rule-based and machine learning approaches, as well as
works that automate GEC and detection using deep learning. On the other hand, review
papers on RE varied according to the specific language reviewed, such as English. For
instance, Pawar et al. (2017) conducted a comprehensive survey of RE for English, followed
by Chinese and Arabic.
A review of grammar checking and RE provides an objective procedure for identifying
the extent of the research that is available; to the best of the authors' knowledge, no prior
review exists that focuses on both tasks. With this in mind, this paper aims to provide an
overall review of grammar checking and RE literature, their techniques, the open challenges
associated with them and, finally, suggested future directions.
The rest of this paper is structured as follows: Section 2 provides a brief background on
grammar checking and RE and their approaches. Section 3 shows the methodology of this
literature review, which includes search strategy and inclusion criteria. Section 4 describes
the results of the review by answering the review research questions. Section 5 discusses the
main findings of this review. Finally, Section 6 concludes the paper with open challenges
and future work.
Data Technologies and Applications, Vol. 53 No. 3, 2019, pp. 373-394
© Emerald Publishing Limited, 2514-9288
DOI 10.1108/DTA-01-2019-0001
Received 10 January 2019
Revised 4 March 2019
Accepted 31 March 2019
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/2514-9288.htm
2. Background
2.1 Grammar checking (GC)
GC is the process of detecting and sometimes correcting incorrect words in a piece of text.
Various approaches have been used for the task of grammar checking, mainly the
rule-based approach, the syntax-based approach, the machine learning approach and neural
networks. Rule-based checking uses a set of manually specified rules (patterns) that are matched
against the text. Text is considered flawed if it matches one of the rules.
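To make the rule-based idea concrete, the following is a minimal sketch in Python rather than any system covered by this review. The three rules, their messages and the function name `check` are hypothetical and chosen only for illustration; real rule sets are far larger and usually operate on POS-tagged text rather than raw strings.

```python
import re

# Hypothetical hand-written rules: each pairs an error pattern with a message.
RULES = [
    (re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE), "repeated word"),
    (re.compile(r"\ba\s+[aeiou]\w*", re.IGNORECASE), "use 'an' before a vowel"),
    (re.compile(r"\b(he|she|it)\s+(have|do|are)\b", re.IGNORECASE),
     "subject-verb disagreement"),
]


def check(text):
    """Return (matched_text, message) for every rule that matches the text."""
    findings = []
    for pattern, message in RULES:
        for match in pattern.finditer(text):
            findings.append((match.group(0), message))
    return findings


if __name__ == "__main__":
    for matched, message in check("She have a apple."):
        print(f"{matched!r}: {message}")
```

A text is flagged as soon as any rule matches, so a text that triggers no rule is only "not known to be wrong", which is a common limitation of pattern-based checkers.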
In syntax-based checking, the morphology and syntax of the text are fully analyzed.
This involves a lexical database, a morphological analyzer and a parser. According to the
grammar of a language, the parser creates a syntactic tree structure for each sentence.
If complete parsing does not succeed, the text is considered incorrect.
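The parse-or-reject decision can be sketched with a small CYK recognizer. This is an illustrative toy under stated assumptions: the grammar `BINARY` and the word list `LEXICON` are hypothetical, the lexicon stands in for the lexical database, and no morphological analysis is performed.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form: (left, right) -> parents.
BINARY = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
# Hypothetical lexicon standing in for a lexical database.
LEXICON = {
    "the": {"Det"}, "a": {"Det"},
    "cat": {"N"}, "dog": {"N"},
    "chased": {"V"}, "saw": {"V"},
}


def parses(sentence):
    """Return True if the whole sentence can be parsed as an S (CYK)."""
    words = sentence.lower().split()
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the categories that cover words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((left, right), set())
    return "S" in table[0][n - 1]


if __name__ == "__main__":
    print(parses("the cat chased a dog"))
    print(parses("the cat chased dog a"))
```

A sentence is treated as incorrect whenever no complete parse exists, which also means that any correct sentence outside the grammar's coverage is rejected.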
Furthermore, machine learning methods for grammar checking fall into two types:
classification and statistical-based methods. Statistical models are used to acquire
linguistic knowledge from large text corpora. This approach employs a part-of-speech (POS)
annotated corpus to create a list of POS tag sequences known as n-grams (Shaalan, 2005).
Statistical models can assign a probability to a new sequence of words based on the
counts of observed word combinations in the training corpus. Sequences with high
probabilities are considered correct, whereas uncommon sequences could include mistakes
(Leacock et al., 2014).
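A minimal sketch of the statistical approach, assuming the input is already POS-tagged: bigram counts over tag sequences are turned into probabilities, and low-probability bigrams are flagged. The tiny `TAGGED_CORPUS`, the add-one smoothing and the `threshold` value are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical POS-tagged training corpus (one tag sequence per sentence).
TAGGED_CORPUS = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["PRON", "VERB", "DET", "ADJ", "NOUN"],
]


def train(corpus):
    """Count tag bigrams and their left contexts, with sentence boundaries."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tags in corpus:
        padded = ["<s>"] + list(tags) + ["</s>"]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
        vocab.update(padded[1:])
    return contexts, bigrams, len(vocab)


def bigram_probability(prev, cur, model):
    """P(cur | prev) with add-one smoothing (an assumption of this sketch)."""
    contexts, bigrams, vocab_size = model
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)


def suspicious_bigrams(tags, model, threshold=0.15):
    """Return the tag bigrams whose probability falls below the threshold."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    return [
        (prev, cur)
        for prev, cur in zip(padded, padded[1:])
        if bigram_probability(prev, cur, model) < threshold
    ]


MODEL = train(TAGGED_CORPUS)

if __name__ == "__main__":
    print(suspicious_bigrams(["DET", "DET", "NOUN", "VERB"], MODEL))
```

With this toy corpus, the unseen bigram (DET, DET) falls below the threshold while the bigrams of an attested sequence do not; in practice the threshold would need to be tuned on held-out data.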
In the classification approach, however, a classifier is trained on correct text. All
systems use the tokens in the immediate context of a potential error site as features
(3 to 7 tokens to the left and right of the error site). Other common features are POS tags, parse
trees, n-grams and grammatical relations. Models are trained using classifiers such as
maximum entropy, Naïve Bayes and support vector machines. Using the trained model,
errors can be detected and corrected by comparing the original word in the text with the most
appropriate candidate predicted by the classifier (Leacock et al., 2014; Krishna Chaitanya
and Bhattacharyya, 2017).
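The classification approach can be sketched with a small Naïve Bayes model written in plain Python. Everything specific here is an assumption for illustration: the confusion set `CANDIDATES` (three prepositions), the window size, the six training sentences and the helper names. The model is trained on correct text and then compares each original preposition with its own prediction.

```python
import math
from collections import Counter, defaultdict

CANDIDATES = {"in", "on", "at"}  # hypothetical confusion set
WINDOW = 3                       # context tokens on each side (assumption)

# Hypothetical "correct text" used for training.
CORRECT_TEXT = [
    "we meet on monday", "she arrived on friday",
    "they live in london", "he works in paris",
    "the class starts at noon", "we left at midnight",
]


def context_features(tokens, i, window=WINDOW):
    """Position-tagged tokens around index i, excluding the token itself."""
    return [f"{off}:{tokens[i + off]}"
            for off in range(-window, window + 1)
            if off != 0 and 0 <= i + off < len(tokens)]


class NaiveBayes:
    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for features, label in examples:
            self.label_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocab.update(features)

    def predict(self, features):
        total = sum(self.label_counts.values())

        def score(label):
            # Log prior plus add-one smoothed log likelihoods.
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            return math.log(self.label_counts[label] / total) + sum(
                math.log((self.feature_counts[label][f] + 1) / denom)
                for f in features)

        return max(sorted(self.label_counts), key=score)


def training_examples(sentences):
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token in CANDIDATES:
                yield context_features(tokens, i), token


def suggest_corrections(sentence, model):
    """Return (index, original, predicted) where the model disagrees."""
    tokens = sentence.lower().split()
    suggestions = []
    for i, token in enumerate(tokens):
        if token in CANDIDATES:
            predicted = model.predict(context_features(tokens, i))
            if predicted != token:
                suggestions.append((i, token, predicted))
    return suggestions


MODEL = NaiveBayes()
MODEL.fit(training_examples(CORRECT_TEXT))

if __name__ == "__main__":
    print(suggest_corrections("we meet in monday", MODEL))
```

Only context tokens are used as features in this sketch; the POS tags, parse trees and grammatical relations mentioned above would be added as further feature strings in the same way.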
Deep learning is a branch of machine learning. It covers a family of learning techniques
known as neural networks. Although all machine learning can be described as learning to make
predictions based on past observations, deep learning approaches work by learning to
correctly represent the data so that it is appropriate for prediction. Deep learning
approaches work by feeding data into a network that applies successive transformations
to the input data until a final transformation predicts a label (Goldberg, 2017).
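The idea of successive transformations ending in a label can be illustrated with a forward pass through a two-layer network in plain Python. The weights below are hypothetical and hand-picked rather than learned, and the two labels are placeholders; a real system would learn the weights from data and use much larger input representations.

```python
import math


def linear(x, weights, bias):
    """One affine transformation: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical, hand-picked weights (not trained).
W1, B1 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
W2, B2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
LABELS = ["correct", "incorrect"]


def predict(x):
    """Transform the input layer by layer, then pick the most likely label."""
    hidden = relu(linear(x, W1, B1))          # first transformation
    probs = softmax(linear(hidden, W2, B2))   # final transformation
    return LABELS[probs.index(max(probs))], probs


if __name__ == "__main__":
    print(predict([2.0, 0.5]))
    print(predict([0.5, 2.0]))
```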
2.2 Relation extraction (RE)
RE is the process of finding and categorizing predefined relationships between entities or
words that were identified in the text. The main objective of RE is extracting tuples of the
form: relation<word1, word2> (Maynard et al., 2016). Several approaches have been widely
used for RE, such as rule-based, semi-supervised (bootstrapping), supervised, distant
supervision and unsupervised approaches.
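As an illustration of the rule-based approach and of the tuple form above, the following sketch extracts (relation, word1, word2) triples with two lexical patterns. The patterns, the relation names `hypernym` and `part_of`, and the function name are hypothetical; they are not taken from any system covered by this review.

```python
import re

# Hypothetical lexical patterns; each yields relation<word1, word2>.
PATTERNS = [
    ("hypernym", re.compile(r"(\w+) such as (\w+)")),
    ("part_of", re.compile(r"(\w+) is part of (?:the |a |an )?(\w+)")),
]


def extract_relations(text):
    """Return (relation, word1, word2) tuples found by the patterns."""
    lowered = text.lower()
    tuples = []
    for relation, pattern in PATTERNS:
        for match in pattern.finditer(lowered):
            tuples.append((relation, match.group(1), match.group(2)))
    return tuples


if __name__ == "__main__":
    sample = "Fruits such as apples are healthy. The engine is part of the car."
    for relation, word1, word2 in extract_relations(sample):
        print(f"{relation}<{word1}, {word2}>")
```

Hand-written patterns of this kind tend to be precise but cover few cases, which is one motivation for the semi-supervised and supervised alternatives listed above.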