A cooperative crowdsourcing framework for knowledge extraction in digital humanities – cases on Tang poetry


Published date	23 February 2020
Pages	243-261
DOI	https://doi.org/10.1108/AJIM-07-2019-0192
Author	Liang Hong, Wenjun Hou, Zonghui Wu, Huijie Han
Subject Matter	Library & information science, Information behaviour & retrieval, Information & knowledge management, Information management & governance, Information management


Liang Hong, Wenjun Hou, Zonghui Wu and Huijie Han

School of Information Management, Wuhan University, Wuhan, China

Abstract

Purpose – This paper proposes a knowledge extraction framework to extract knowledge,

including entities and the relationships between them, from unstructured texts in digital humanities (DH).

Design/methodology/approach – The proposed cooperative crowdsourcing framework (CCF) uses both

human–computer cooperation and crowdsourcing to achieve high-quality and scalable knowledge extraction.

CCF integrates active learning with a novel category-based crowdsourcing mechanism to help domain

experts label and verify extracted knowledge.

Findings – The case study shows that CCF can effectively and efficiently extract knowledge from

multi-sourced heterogeneous data in the field of Tang poetry. Specifically, CCF achieves higher accuracy of

knowledge extraction than the state-of-the-art methods; the active learning mechanism maximizes the

contribution of feedback to the training model; and the proposed category-based crowdsourcing mechanism

scales up effective human–computer collaboration by considering the specialization of workers in

different categories of tasks.

Research limitations/implications – This research proposes CCF to enable high-quality and scalable

knowledge extraction in the field of Tang poetry. CCF can be generalized to other fields of DH by introducing

domain knowledge and experts.

Practical implications – The extracted knowledge is machine-understandable and can support the research

of Tang poetry and knowledge-driven intelligent applications in DH.

Originality/value – CCF is the first human-in-the-loop knowledge extraction framework that integrates

active learning and crowdsourcing mechanisms; the human–computer cooperation method uses the feedback of

domain experts through the active learning mechanism; the category-based crowdsourcing mechanism

considers the matching between categories of DH data and the specialties of domain experts.

Keywords Crowdsourcing, Human–computer cooperation, Knowledge extraction, Digital humanities, Tang

poetry

Paper type Research paper

1. Introduction

With the continuous development of information technology, researchers have begun to use

interdisciplinary research methods of digital humanities (DH) to open up a new paradigm

for humanities research. Consequently, a great deal of DH data has been accumulated,

such as subject databases, electronic archives, knowledge bases, webpages and so on.

The multi-source heterogeneous data, which are difficult for computers to read and

understand, increase the difficulty and workload of DH research. Therefore, it is necessary

to extract machine-understandable knowledge from the data and organize the extracted

knowledge into a knowledge graph to support DH research.

Tang poetry is representative of traditional Chinese literature and one of the highest

achievements of Chinese poetry. More than 50,000 poems were written by over

2,200 poets in the Tang dynasty, and they have had a far-reaching influence on Chinese culture and

even world culture. At present, a large number of experts in China conduct

research on Tang poetry and have made fruitful achievements (Li, 2010).

As one of the important fields of DH, Tang poetry has accumulated a large amount of data

resources. However, these resources are scattered, sparse and lack effective organization.


The current issue and full text archive of this journal is available on Emerald Insight at:

https://www.emerald.com/insight/2050-3806.htm

Received 31 July 2019

Revised 15 November 2019

4 January 2020

5 January 2020

14 January 2020

Accepted 27 January 2020

Aslib Journal of Information

Management

Vol. 72 No. 2, 2020

pp. 243-261

2050-3806

DOI 10.1108/AJIM-07-2019-0192

In the field of Tang poetry, knowledge extraction can provide a solution to transform

multi-source heterogeneous data into “intelligent” linked data, i.e. entities and relationships

between them. This, in turn, provides a solid foundation for knowledge association and

reasoning, which supports DH studies and intelligent applications.

In this paper, we study the problem of extracting knowledge from unstructured texts of

Tang poetry. This is not an easy task because of the large scale of the data and the unique

characteristics of Tang poetry. Firstly, Tang poetry is a type of ancient Chinese text that has

distinctive vocabulary, sentence patterns, grammar and rhyme schemes. Secondly,

state-of-the-art knowledge extraction techniques such as machine learning and deep learning

lack training instances and prior knowledge (Alani et al., 2003), and thus cannot be directly

applied to such humanities research. Last but not least, knowledge extraction relying on

domain experts is costly and not scalable to a large amount of data. Although some studies

(Plaisant, 2006) combined the efforts of domain experts and computer systems in the DH field,

they still cannot resolve the mismatch in speed and scale between computer

processing (e.g. machine learning) and human work. For instance, machine learning

algorithms can generate hundreds of thousands of training results in 1 min, while domain

experts can label only a few of them in the same time. Moreover, because Tang poetry is

both specialized and wide-ranging, each domain expert is proficient in only a small portion of

the whole domain knowledge.

To address the above challenges, we propose a cooperative crowdsourcing framework

(CCF) to extract knowledge from Tang poetry data effectively and efficiently. CCF improves

the quality of extracted knowledge by introducing feedback from domain experts through

active learning mechanisms. Meanwhile, the quality and scalability of labeling by domain

experts are also improved by introducing a category-based crowdsourcing mechanism.
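
As an illustration only, the following Python sketch shows one common way an active learning loop can decide which machine-extracted items to send to domain experts: entropy-based uncertainty sampling. The selection strategy, the function names (entropy, select_for_labeling) and the sample probabilities are assumptions made for this sketch; this part of the paper does not state which selection criterion CCF uses.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` items the model is least sure about.

    predictions: dict mapping an item id to a list of class probabilities.
    This is plain uncertainty sampling, an assumed stand-in for whatever
    criterion the framework actually applies.
    """
    ranked = sorted(predictions, key=lambda i: entropy(predictions[i]), reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for three candidate entities.
predictions = {
    "e1": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "e2": [0.40, 0.35, 0.25],  # near-uniform prediction, high entropy
    "e3": [0.70, 0.20, 0.10],
}

# With a labeling budget of two, the two most uncertain items are queued.
queue = select_for_labeling(predictions, budget=2)
print(queue)
```

Items the model already classifies confidently are skipped, so the limited labeling time of experts is spent where a correction is most likely to change the trained model.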

CCF contains three components: an Input Engine, a Machine Extraction Engine and a Crowdsourcing Engine.

In the Input Engine, we use an entropy-based non-dictionary word segmentation method to

generate a Tang poetry corpus, which is the basis of domain knowledge extraction. We then

propose an active learning mechanism to extract knowledge using machine learning

algorithms and crowdsourcing. Domain experts can label or correct extraction results

(i.e. entities and their relations) to help train the learning models interactively. Meanwhile, we

propose a category-based crowdsourcing mechanism to help domain experts label

extraction results. The hypothesis is that workers (i.e. domain experts) are specialized in one

or more categories of knowledge and can achieve high accuracy. Specifically, we first classify

tasks based on categories (e.g. theme, poet), and then assign tasks to workers who have high

accuracy on the corresponding categories.
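
The two steps just described (classify tasks by category, then route each task to a worker with high accuracy on that category) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm: the accuracy threshold of 0.8, the tie-breaking by current workload, and all names (assign_tasks, accuracy, tasks) are hypothetical choices made for illustration.

```python
from collections import defaultdict


def assign_tasks(tasks, accuracy, threshold=0.8):
    """Assign each task to a worker who is accurate on the task's category.

    tasks: list of (task_id, category) pairs, already classified by category.
    accuracy: dict mapping worker -> {category: historical accuracy}.
    threshold: assumed minimum accuracy for a worker to qualify.
    Returns a dict mapping task_id -> worker, or None if nobody qualifies.
    """
    load = defaultdict(int)  # number of tasks already given to each worker
    assignment = {}
    for task_id, category in tasks:
        qualified = [
            w for w, acc in accuracy.items() if acc.get(category, 0.0) >= threshold
        ]
        if not qualified:
            assignment[task_id] = None
            continue
        # Assumed policy: prefer the least-loaded qualified worker,
        # breaking ties by higher accuracy on this category.
        best = min(qualified, key=lambda w: (load[w], -accuracy[w][category]))
        assignment[task_id] = best
        load[best] += 1
    return assignment


# Hypothetical per-category accuracies for two workers.
accuracy = {
    "w1": {"poet": 0.95, "theme": 0.60},
    "w2": {"poet": 0.85, "theme": 0.90},
}
tasks = [("t1", "poet"), ("t2", "poet"), ("t3", "theme"), ("t4", "place")]

assignment = assign_tasks(tasks, accuracy)
print(assignment)
```

In this toy run the two poet tasks are split between both qualified workers, the theme task goes only to the worker who meets the threshold on themes, and the task in an uncovered category stays unassigned.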

We build a human–computer cooperation and crowdsourcing platform based on CCF.

In total, 30 domain professionals, including professionals in DH and Tang poetry, were invited

to participate in experiments. Experimental results on Tang poetry data show that CCF can

extract high-quality knowledge with good scalability. The extracted knowledge reveals

inherent associations among entities of Tang poetry, which form a knowledge graph for

global and fine-grained DH studies and intelligent applications.

This paper is organized as follows. In Section 2, we review the related literature. Section 3

provides a framework for this paper. In Sections 4 and 5, we present CCF in detail. Section 6 presents

the experiment and case study on Tang poetry. Section 7 concludes the paper.

2. Literature review

2.1 Crowdsourcing in digital humanities

Crowdsourcing is an open call for contributions from workers of the crowd to carry out

human intelligence tasks (Kazai, 2011). In the era of big data, the popularity and development

of the internet have greatly increased the scope and participation of crowdsourcing.

