Toward the optimized crowdsourcing strategy for OCR
Omri Suissa, Avshalom Elmalech and Maayan Zhitomirsky-Geffet
Bar-Ilan University, Ramat Gan, Israel
Purpose – Digitization of historical documents is a challenging task in many digital humanities projects. A
popular approach for digitization is to scan the documents into images, and then convert the images into text
using optical character recognition (OCR) algorithms. However, the outcome of OCR processing of historical
documents is usually inaccurate and requires post-processing error correction. The purpose of this paper is to
investigate how crowdsourcing can be utilized to correct OCR errors in historical text collections, and which
crowdsourcing methodology is the most effective in different scenarios and for various research objectives.
Design/methodology/approach – A series of experiments with different micro-task structures and text
lengths was conducted with 753 workers on Amazon's Mechanical Turk platform. The workers had to fix
OCR errors in a selected historical text. To analyze the results, new accuracy and efficiency measures were developed.
Findings – The analysis suggests that, in terms of accuracy, the optimal text length is medium (paragraph-
sized) and the optimal structure of the experiment is a two-phase design with a scanned image. In terms of efficiency, the
best results were obtained when using longer texts in a single-stage structure with no image.
Practical implications – The study provides practical recommendations to researchers on how to build the
optimal crowdsourcing task for OCR post-correction. The developed methodology can also be utilized to
create gold-standard historical texts for automatic OCR post-correction.
Originality/value – This is the first attempt to systematically investigate the influence of various factors on
crowdsourcing-based OCR post-correction and propose an optimal strategy for this process.
Keywords Crowdsourcing, Digital humanities, Historical texts, OCR post-correction, Task decomposition
Paper type Research paper
In many digital humanities projects, there is a need to automatically analyze the content of
large collections of paper-based documents. Digitization of historical text collections is a
complex task essential both for research and preservation of cultural heritage. The first step
toward this goal is to scan historical documents (books, manuscripts or newspaper pages)
into high-resolution images and then to use optical character recognition (OCR)
technology to convert the images into text. For instance, the Library of Congress has a vast
digital historical collection (https://chroniclingamerica.loc.gov/), which has been digitized
using OCR to preserve and make it publicly available. The British Newspaper Archive
(www.britishnewspaperarchive.co.uk/) maintains an extensive digitized collection with
advanced discovery tools (Lansdall-Welfare et al., 2017). Even commercial enterprises have
initiated large-scale digitization projects based on OCR technology. For example, the Google
Books project has already scanned more than 25 million books and continues to scan 6,000 pages
per hour. The quality of the OCR technology is a critical aspect of this process.
Unfortunately, OCRed historical texts still contain a significant percentage of errors that
undermine further analysis, search and preservation.
OCR errors come in several forms: insertions, deletions, substitutions, transpositions of
one or two characters, splitting and concatenation of words, or a combination of several error
types in one word (Reynaert, 2008). Many OCR errors differ from human
spelling mistakes, and therefore common spelling-correction algorithms are not suitable for correcting them.
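These error types can be made concrete by comparing an OCR output against its ground-truth transcription. A standard way to quantify them, though not one taken from this paper, is the Levenshtein (edit) distance; the sketch below is a minimal illustration, and the OCR confusions shown (such as "rn" misread as "m", or a long s misread as "f") are invented examples of the error categories listed above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b, computed with
    the classic two-row dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion from a
                curr[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),   # substitution (or match)
            ))
        prev = curr
    return prev[-1]

# Invented examples of the error types above: a substitution-plus-deletion
# ("rn" read as "m"), a substitution (long s read as "f"), and a word
# split by a spurious space.
ground_truth = "modern historical records"
ocr_output = "modem hiftorical rec ords"
distance = levenshtein(ocr_output, ground_truth)

# Character error rate (CER): edit distance normalized by the length of
# the reference text.
cer = distance / len(ground_truth)
```

Note that because several of these confusions span more than one character, the edit distance alone does not distinguish error types; it only gives an aggregate count, which is why post-correction methods often classify errors separately.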
Received 30 July 2019
Accepted 30 October 2019
The current issue and full text archive of this journal is available on Emerald Insight.
Vol. 72 No. 2, 2020