A cooperative crowdsourcing framework for knowledge extraction in digital humanities – cases on Tang poetry

Published date: 23 February 2020
Pages: 243-261
DOI: https://doi.org/10.1108/AJIM-07-2019-0192
Authors: Liang Hong, Wenjun Hou, Zonghui Wu, Huijie Han
Subject Matter: Library & information science, Information behaviour & retrieval, Information & knowledge management, Information management & governance, Information management
Liang Hong, Wenjun Hou, Zonghui Wu and Huijie Han
School of Information Management, Wuhan University, Wuhan, China
Abstract
Purpose: This paper proposes a knowledge extraction framework that extracts knowledge,
including entities and the relationships between them, from unstructured texts in digital humanities (DH).
Design/methodology/approach: The proposed cooperative crowdsourcing framework (CCF) uses both
human-computer cooperation and crowdsourcing to achieve high-quality and scalable knowledge extraction.
CCF integrates active learning with a novel category-based crowdsourcing mechanism to help domain
experts label and verify extracted knowledge.
Findings: The case study shows that CCF can effectively and efficiently extract knowledge from
multi-sourced heterogeneous data in the field of Tang poetry. Specifically, CCF achieves higher
knowledge extraction accuracy than state-of-the-art methods; the active learning mechanism maximizes
the contribution of feedback to the training model; and the proposed category-based crowdsourcing
mechanism scales up effective human-computer collaboration by considering the specialization of
workers in different categories of tasks.
Research limitations/implications: This research proposes CCF to enable high-quality and scalable
knowledge extraction in the field of Tang poetry. CCF can be generalized to other fields of DH by introducing
domain knowledge and experts.
Practical implications: The extracted knowledge is machine-understandable and can support research
on Tang poetry and knowledge-driven intelligent applications in DH.
Originality/value: CCF is the first human-in-the-loop knowledge extraction framework that integrates
active learning and crowdsourcing mechanisms; the human-computer cooperation method uses the feedback of
domain experts through the active learning mechanism; the category-based crowdsourcing mechanism
considers the match between the categories of DH data and the specialties of domain experts.
Keywords: Crowdsourcing, Human-computer cooperation, Knowledge extraction, Digital humanities, Tang
poetry
Paper type: Research paper
1. Introduction
With the continuous development of information technology, researchers have begun to use
the interdisciplinary research methods of digital humanities (DH) to open up a new paradigm
for humanities research. Consequently, a great deal of DH data has been accumulated,
such as subject databases, electronic archives, knowledge bases and webpages.
These multi-source heterogeneous data, which are difficult for computers to read and
understand, increase the difficulty and workload of DH research. Therefore, it is necessary
to extract machine-understandable knowledge from the data and organize the extracted
knowledge into a knowledge graph to support DH research.
Tang poetry is representative of traditional Chinese literature and one of the highest
achievements of Chinese poetry. More than 50,000 poems were written by over
2,200 poets in the Tang dynasty, and they have had a far-reaching influence on Chinese culture and
even world culture. At present, a large number of experts in China conduct
research on Tang poetry and have made fruitful achievements (Li, 2010).
As one of the important fields of DH, Tang poetry has accumulated a large amount of data
resources. However, these resources are scattered, sparse and lack effective organization.
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/2050-3806.htm
Received 31 July 2019
Revised 15 November 2019, 4 January 2020, 5 January 2020, 14 January 2020
Accepted 27 January 2020
Aslib Journal of Information Management, Vol. 72 No. 2, 2020, pp. 243-261
© Emerald Publishing Limited, 2050-3806
DOI 10.1108/AJIM-07-2019-0192
In the field of Tang poetry, knowledge extraction can provide a solution to transform
multi-source heterogeneous data into intelligent linked data, i.e. entities and relationships
between them. This, in turn, provides a solid foundation for knowledge association and
reasoning, which supports DH studies and intelligent applications.
In this paper, we study the problem of extracting knowledge from unstructured texts of
Tang poetry. This is not an easy task because of the large scale of the data and the unique
characteristics of Tang poetry. First, Tang poetry is a type of ancient Chinese text that has
its own vocabulary, sentence patterns, grammar and rhyme schemes. Second,
state-of-the-art knowledge extraction techniques such as machine learning and deep learning
lack training instances and prior knowledge (Alani et al., 2003), and thus cannot be directly
applied to such humanities research. Last but not least, knowledge extraction relying on
domain experts is costly and does not scale to a large amount of data. Although some studies
(Plaisant, 2006) combined the efforts of domain experts and computer systems in the DH field,
they still cannot solve the mismatch in speed and scale between computer
processing (e.g. machine learning) and human work. For instance, machine learning
algorithms can generate hundreds of thousands of training results in 1 min, while domain
experts can label only several of them in the same time. Moreover, because Tang poetry is
both specialized and broad, each domain expert is proficient in only a small portion of
the whole domain knowledge.
To address the above challenges, we propose a cooperative crowdsourcing framework
(CCF) to extract knowledge from Tang poetry data effectively and efficiently. CCF improves
the quality of extracted knowledge by introducing the feedback of domain experts through
active learning mechanisms. Meanwhile, the quality and scalability of labeling by domain
experts are also improved by introducing a category-based crowdsourcing mechanism.
CCF consists of an Input Engine, a Machine Extraction Engine and a Crowdsourcing Engine.
In the Input Engine, we use an entropy-based non-dictionary word segmentation method to
generate a Tang poetry corpus, which is the basis of domain knowledge extraction. We then
propose an active learning mechanism to extract knowledge using machine learning
algorithms and crowdsourcing. Domain experts can label or correct extraction results
(i.e. entities and their relations) to help train the learning models interactively. Meanwhile, we
propose a category-based crowdsourcing mechanism to help domain experts label
extraction results. The hypothesis is that workers (i.e. domain experts) are specialized in one
or more categories of knowledge and can achieve high accuracy in them. Specifically, we first classify
tasks based on categories (e.g. theme, poet), and then assign tasks to workers who have high
accuracy on the corresponding categories.
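
The two mechanisms described above can be illustrated with a minimal sketch. It is not the implementation used in CCF: the entropy-based uncertainty score, the greedy assignment rule, the per-worker capacity and all names (select_uncertain, assign_by_category, worker_accuracy, the sample tasks) are assumptions introduced here for illustration only. The sketch first picks the extraction results the model is least certain about, then gives each one to the available worker with the highest recorded accuracy in that task's category.

```python
import math
from collections import defaultdict


def prediction_entropy(probs):
    """Shannon entropy of a predicted label distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_uncertain(tasks, k):
    """Active-learning step (assumed): keep the k most uncertain tasks."""
    ranked = sorted(tasks, key=lambda t: prediction_entropy(t["probs"]), reverse=True)
    return ranked[:k]


def assign_by_category(tasks, worker_accuracy, capacity):
    """Greedy category-based assignment (assumed rule).

    Each task goes to the worker with the highest accuracy in the task's
    category among workers who still have spare capacity.
    """
    load = defaultdict(int)
    assignment = {}
    for task in tasks:
        candidates = [
            (accuracy.get(task["category"], 0.0), name)
            for name, accuracy in worker_accuracy.items()
            if load[name] < capacity
        ]
        if not candidates:
            break  # every worker is at capacity
        _, best = max(candidates)
        assignment[task["id"]] = best
        load[best] += 1
    return assignment


# Hypothetical data: model confidence per task and per-category worker accuracy.
tasks = [
    {"id": "t1", "category": "poet", "probs": [0.5, 0.5]},
    {"id": "t2", "category": "theme", "probs": [0.9, 0.1]},
    {"id": "t3", "category": "theme", "probs": [0.6, 0.4]},
]
worker_accuracy = {
    "w1": {"poet": 0.95, "theme": 0.60},
    "w2": {"poet": 0.70, "theme": 0.90},
}

selected = select_uncertain(tasks, k=2)
plan = assign_by_category(selected, worker_accuracy, capacity=1)
print(plan)
```

In this toy run the confident prediction t2 is skipped, the poet task goes to the worker who is strongest on poets, and the theme task goes to the worker who is strongest on themes. A real system would also have to estimate worker accuracy from verified answers, which this sketch takes as given.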
We build a human-computer cooperation and crowdsourcing platform based on CCF.
In total, 30 domain professionals, including professionals in DH and Tang poetry, were invited
to participate in experiments. Experimental results on Tang poetry data show that CCF can
extract high-quality knowledge with good scalability. The extracted knowledge reveals
inherent associations among entities of Tang poetry, which form a knowledge graph for
global and fine-grained DH studies and intelligent applications.
This paper is organized as follows. In Section 2, we review the related literature. Section 3
provides the framework for this paper. In Sections 4 and 5, we present CCF in detail. Section 6
reports the experiment and case study on Tang poetry. Section 7 concludes the paper.
2. Literature review
2.1 Crowdsourcing in digital humanities
Crowdsourcing is an open call for contributions from workers of the crowd to carry out
human intelligence tasks (Kazai, 2011). In the era of big data, the popularity and development
of the internet have greatly increased the scope and participation of crowdsourcing.
