Task design and assignment of full-text generation on mass Chinese historical archives in digital humanities. A crowdsourcing approach

Pages: 262-286
Date: 25 March 2020
Published date: 25 March 2020
DOI: https://doi.org/10.1108/AJIM-09-2019-0245
Author: Jihong Liang, Hao Wang, Xiaojing Li
Subject Matter: Library & information science, Information behaviour & retrieval, Information & knowledge management, Information management & governance, Information management
Jihong Liang, Hao Wang and Xiaojing Li
Renmin University of China, Beijing, China
Abstract
Purpose: The purpose of this paper is to explore the task design and assignment of full-text generation on
mass Chinese historical archives (CHAs) by crowdsourcing, with special attention paid to how best to divide
full-text generation tasks into smaller ones assigned to crowdsourced volunteers and to improve the
digitization of mass CHAs and the data-oriented processing of the digital humanities.
Design/methodology/approach: This paper starts from the complexities of character recognition in mass
CHAs, takes the Sheng Xuanhuai Archives crowdsourcing project of Shanghai Library as a case study, and uses
the theories of archival science, including the diplomatics of Chinese archival documents and the historical
approach of Chinese archival traditions, as its theoretical basis and analysis methods. The results are
generated through this comprehensive research.
Findings: This paper points out that volunteer tasks of full-text generation include transcription,
punctuation, proofreading, metadata description, segmentation, and attribute annotation in digital humanities.
It provides a metadata element set for volunteers to use in creating or revising metadata descriptions and
also provides an attribute tag set. The two sets can be used across the humanities to construct overall
observations about texts and the archives of which they are a part. Along these lines, this paper presents
significant insights for application in outlining the principles, methods, activities, and procedures of
crowdsourced full-text generation for mass CHAs.
Originality/value: This study is the first to explore and identify the effective design and allocation of tasks
for crowdsourced volunteers completing full-text generation on CHAs in digital humanities.
Keywords: Full-text generation, Task design and assignment, Crowdsourcing, Transcription, Metadata
description, Text annotation
Paper type: Research paper
Received 13 September 2019
Revised 1 December 2019 and 8 February 2020
Accepted 11 February 2020
Aslib Journal of Information Management, Vol. 72 No. 2, 2020, pp. 262-286
© Emerald Publishing Limited, 2050-3806
DOI 10.1108/AJIM-09-2019-0245
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/2050-3806.htm
The authors would like to thank Cuijuan Xia and Meredith Doviak for providing information and data
on the Sheng Xuanhuai Archives transcription project and the Citizen Archivist project. The authors would also
like to thank the anonymous reviewers for their comments and questions, which have helped to improve
the quality of the paper, and the editor for patience with the paper. The paper is supported by the National
Social Science Foundation of China (Grant No. 10&ZD132).
Introduction
In the Chinese archival field, "Chinese historical archives" (hereafter "CHAs") refers to the
physical archives that have been maintained since they were created. Generally, few
historical archives exist for the periods before the Song and Yuan dynasties, while many
archives exist from the Ming and Qing dynasties and the Chinese Republic. Here, it is
important to note that the archival materials are the sequential accumulation of events,
activities, and functions of governments at all levels, organizations, and families in history; in
some sense, then, these archives remember China's entire history. Researchers usually spend
great amounts of time and energy going through mass archives, especially if they specialize
in the histories of the Ming and Qing dynasties or the Republic: these materials were often
written by hand, and a number of them are illegible and/or damaged. Collective collaboration
among researchers, supported by suitable incentives, is therefore necessary, as is the overall
observation of mass archives. Despite the difficulties involved in parsing them, CHAs play an important role in
scholarship and are thus a major focus of current transcription work. Transcription and
collation make CHAs clear, easy to use, and thus, more broadly, helpful for historical research.
To speed up such research, it is necessary to decipher the characters in mass CHAs and transcribe
them in a recognizable standard font that allows these texts to be easily compiled and
published; however, to date, this work has largely been undertaken by professional human
transcribers and thus requires a great deal of time, expense, and effort.
At present, the humanities are establishing new methodologies for the digital age that use
information technologies, giving rise to the emerging interdisciplinary field of the digital
humanities. One area of research in the digital humanities seeks to apply digital analysis tools to
process mass research materials, and overall observation by machine is one of the core
philosophies in the system design (Chen et al., 2011). The image digitization of paper-media
literature can supply machine-readable images, and its finding aids, produced by optical character
recognition (OCR) or manual input, can be searched, but the contents of the images cannot be
searched, analyzed quantitatively, or processed by machine (DFG, 2009). Full-text generation,
which produces text that can be read by a machine, is the basic work in the digital humanities, and technical tools
and public participation in the era of Web 2.0 supply a new solution for the full-text generation of
mass CHAs. However, manual input and OCR have limitations: manual input is time-consuming
and labor-intensive; meanwhile, OCR has a high error rate and requires particular levels of text
clarity and picture quality. To address this asymmetry, the digital humanities are working to
optimize the use of full-text generation for mass archives.
Notably, the possibility of public participation in full-text generation in the era of Web
2.0, or "crowdsourcing," may enable the use of full-text generation for mass historical
archives. The concept of crowdsourcing was first proposed by Jeff Howe in WIRED in 2006
(Howe, 2006). Based on Howe's concept, Brabham defines crowdsourcing as an online,
distributed problem-solving and production model (Brabham, 2008). For our purposes, it is
helpful to note that crowdsourcing is an Internet-based cooperative mode of gathering
knowledge and acquiring popular knowledge, typically with the objective of helping the
creator of the crowdsourced task achieve his/her goal.
In China, a large number of CHAs have been scanned since the 1990s and are now being
prepared for data-oriented processing, for which full-text generation is necessary and basic work.
Full-text generation is still a bottleneck in the digital humanities. Crowdsourcing tools can
combine manual input with OCR so that the two reinforce each other.
Task design and assignment of full-text generation by crowdsourcing are directly
linked to the difficulty, integrality, and accuracy of the recognition and to the foundation of
digital analysis of the contents of mass CHAs. While China is now enjoying its first
crowdsourced full-text generation project, the Sheng Xuanhuai Archives Transcription
Project at Shanghai Library, on which this study focuses, almost no scholarly
work has yet been done on the design and assignment of tasks for crowdsourced full-text
generation. For example, while some researchers have separately discussed the project as a
case study, they focused on exploring the participants' motivations in the sustained
stages of citizen science projects and the factors influencing task performance (Zhang
et al., 2018; Han et al., 2019), not on the design and assignment of tasks for mass CHAs
themselves. This study responds to this gap in the research by uncovering the tasks
involved in crowdsourced full-text generation for mass CHAs, how to break them up, and
how to assign them to crowdsourced volunteers. To be sure, the design and assignment of
the tasks involved in crowdsourced full-text generation are directly linked to the
difficulty, integrality, and accuracy of the digital recognition. This study designed a basic
workflow for crowdsourced full-text generation with a focus on the principles, methods,