FactQA: question answering over domain knowledge graph based on two-level query expansion

Publication Date: 22 November 2019
Author: Xiaoming Zhang, Mingming Meng, Xiaoling Sun, Yu Bai
Subject: Library & information science, Librarianship/library management, Library technology, Information behaviour & retrieval, Metadata, Information & knowledge management, Information & communications technology, Internet
Xiaoming Zhang, Mingming Meng, Xiaoling Sun and Yu Bai
School of Information Science and Engineering,
Hebei University of Science and Technology, Shijiazhuang, China
Purpose: With the advent of the era of Big Data, the scale of knowledge graphs (KGs) in various domains is
growing rapidly, and the large amount of knowledge they hold can benefit question answering (QA)
research. However, a KG, which consists of entities and relations, is structurally inconsistent
with natural language queries, so QA systems based on KGs still face difficulties. The
purpose of this paper is to propose a method for answering domain-specific questions based on a KG, making
information queries over a domain KG more convenient.
Design/methodology/approach: The authors propose a method, FactQA, to answer factual questions
about a specific domain. A series of logical rules is designed to transform factual questions into triples,
in order to resolve the structural inconsistency between the user's question and the domain knowledge. Then,
query expansion strategies and filtering strategies are proposed at two levels (i.e. words and triples in
the question). When matching the question with the domain knowledge, the method considers not only the
similarity values between the words in the question and the resources in the domain knowledge but also the
tag information of these words. The tag information is obtained by parsing the question using Stanford
CoreNLP. In this paper, a KG in the metallic materials domain is used to illustrate the FactQA method.
Findings: The designed logical rules show stable time performance when transforming factual questions into
triples. Additionally, filtering the synonym expansion results of the words in the question improves the
expansion quality of the triple representation of the question. Considering the tag information of the words
in the question during data matching helps to filter out wrong matches.
Originality/value: Although FactQA is proposed for domain-specific QA, it can also be applied to
domains other than the metallic materials domain. For a question that cannot be answered, FactQA
generates a new related question to answer, providing the user with as much of the information they
probably need as possible. FactQA could facilitate users' information queries based on the emerging KGs.
Keywords: DBpedia, Knowledge graph, Question answering, Question expansion, Question matching,
Question understanding
Paper type: Research paper
1. Introduction
With the explosive growth in the scale of knowledge graph (KG) data, users in
specific domains want to obtain the information they need accurately and quickly. As a form
of information retrieval, question answering (QA) answers natural language
questions with a simple and accurate natural language result, and the user barely needs to be
concerned with the format or structure of the data source. QA systems have been studied for the open domain
(Park et al., 2015; Sun et al., 2015) and for many specific domains, e.g. medicine (Balikas et al.,
2012; Hristovski et al., 2015; Goodwin and Harabagiu, 2016), agriculture (Gaikwad et al.,
2015), cooking (Yin et al., 2015), geography (Zhao et al., 2016) and tourism (Pathak and
Mishra, 2016; Kahaduwa et al., 2017).
There are a variety of possible data sources for a QA system, e.g. web pages (Liu et al.,
2010; Grappy et al., 2011), community QA pairs (Cheng et al., 2015; Quan et al., 2015) and KGs
(Unger et al., 2012; Athira et al., 2013; Zou et al., 2014; Ilievski and Feng, 2017). With the
enrichment of the data in open KGs, the advantages of KG-based QA systems are
manifested in three aspects: KGs are large-scale, covering a large number of entities and
concepts; KGs are structured and contain rich semantic relations, which facilitates the
acquisition of data relevant to a concept or word; and the data in KGs are of good quality,
usually being well annotated by domain experts or cross-validated against
multiple data sources.
Data Technologies and
Vol. 54 No. 1, 2020
pp. 34-63
© Emerald Publishing Limited
DOI 10.1108/DTA-02-2019-0029
Received 18 February 2019
Revised 27 July 2019
Accepted 9 October 2019
This research was funded by the Natural Science Foundation of Hebei Province (Grant No.
F2018208116), Hebei Science and Technology Support Program (No. 16210312D) and Key Project of
Hebei Education Department (Grant No. ZD2015099).
However, answering domain-specific questions based on the corresponding
domain KG still poses challenges. These include the structural heterogeneity between
the natural language question and the KG, which contains many entities and relations,
and the possible semantic heterogeneity between the elements in the question and the
elements in the KG; both have a significant influence on QA performance. Thus, the
structured representation of the natural language question and semantic query expansion
are the essential issues behind these challenges. The former facilitates the representation of
question understanding, showing the query intent in a more formal way. The latter
enriches the expression of a specific question so as to address the diversity of question
expression, which is key to correctly discovering the matching information in
the KG and increases the chances of finding the corresponding answer.
Therefore, we propose a domain-specific QA method, FactQA, based on two-level (i.e. words
and triples) query expansion. The metallic materials domain, which has long been an
area of concern to us, is used to illustrate the proposed method. Since there
are still few mature KGs in the metallic materials domain, we extract and use the metallic
materials knowledge residing in DBpedia (Lehmann, 2015; Zaveri et al., 2013), a large,
high-quality open KG that contains nearly 30m entities and hundreds of millions of triples.
The main contributions of this paper are summarized as follows:
(1) A series of logical rules is proposed to transform a question into a structured
representation. Factual questions are understood from two aspects (i.e. words
and relations) to obtain the tag information of the words in the question and the
relations in the question, respectively. A set of logical rules is then designed to
integrate the tag information and the relations into structured triples
(i.e. question triples (QTs), as defined in this paper).
(2) Two-level expansion and filtering strategies are designed for
question expansion. The question is expanded at two levels (i.e. seed concepts
(SCs) and QTs), and at each level, the expansion results are refined by filtering out
results that are possibly wrong. A QT is composed of several SCs, so the expansion quality
of the SCs in a QT jointly determines the expansion quality of this QT. The
expansion of the SCs is based on WordNet[1], and the expansion results are filtered
based on Word2Vec[2] and UMBC (Han et al., 2013). The retained expansion
results are used to expand the QT, and the expansion results of the
QT are further filtered using UMBC. Thus, the two rounds of filtering improve the
quality of the question expansion.
(3) A matching strategy based on similarity calculation and tag information is proposed
to match a question with the domain knowledge. A group of rewrite rules for QTs is
designed to facilitate the matching process. The similarity values between the words
in the question and the domain knowledge are calculated based on the string
similarity algorithm SMOA (Stoilos et al., 2005). Then, the tag information of the
words, which is obtained by parsing the question using Stanford CoreNLP[3], is used
to validate the matching. Even if a word in the question and one of its
matches have a high string similarity value, this match should be filtered out if
the word and the match do not have the same tag information. Therefore, the tag
information could help to improve the accuracy of matching.
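The flow described in contributions (2) and (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name and value in it is an assumption made for illustration: the SYNONYMS table stands in for WordNet lookups, the SCORES table stands in for Word2Vec/UMBC similarity, the standard-library difflib.SequenceMatcher stands in for SMOA, the tags "NOUN" and "VERB" stand in for the tag information produced by Stanford CoreNLP, and the thresholds are arbitrary.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {
    "density": ["denseness", "concentration"],
    "alloy": ["metal", "admixture"],
}

# Hypothetical scores standing in for Word2Vec/UMBC semantic similarity.
SCORES = {
    ("density", "denseness"): 0.9,
    ("density", "concentration"): 0.4,
    ("alloy", "metal"): 0.8,
    ("alloy", "admixture"): 0.3,
}


def similarity(a, b):
    """Semantic similarity in [0, 1]; identical words score 1.0."""
    if a == b:
        return 1.0
    return SCORES.get((a, b), SCORES.get((b, a), 0.0))


def expand_seed(word, threshold=0.5):
    """Level 1: expand a seed concept, then drop weakly related synonyms."""
    kept = [s for s in SYNONYMS.get(word, []) if similarity(word, s) >= threshold]
    return [word] + kept


def expand_triple(triple, threshold=0.92):
    """Level 2: combine the retained seed expansions, then filter whole triples."""

    def score(candidate):
        # Mean similarity between each original element and its replacement.
        pairs = zip(triple, candidate)
        return sum(similarity(o, c) for o, c in pairs) / len(triple)

    candidates = product(*(expand_seed(w) for w in triple))
    return [c for c in candidates if score(c) >= threshold]


def match(word, tag, resources, threshold=0.7):
    """Keep KG resources that are string-similar to the word AND share its tag."""
    hits = []
    for label, resource_tag in resources:
        ratio = SequenceMatcher(None, word.lower(), label.lower()).ratio()
        if ratio >= threshold and resource_tag == tag:
            hits.append(label)
    return hits


if __name__ == "__main__":
    print(expand_seed("density"))
    print(expand_triple(("alloy", "density", "?x")))
    resources = [("Density", "NOUN"), ("densify", "VERB"), ("Hardness", "NOUN")]
    print(match("density", "NOUN", resources))
```

In this toy data, "densify" is close to "density" as a string but carries a different tag, so match discards it. This mirrors the point made in contribution (3) that a high string similarity alone is not enough to accept a match.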
