Toward the optimized crowdsourcing strategy for OCR
Omri Suissa, Avshalom Elmalech and Maayan Zhitomirsky-Geffet
Bar-Ilan University, Ramat Gan, Israel
Purpose – Digitization of historical documents is a challenging task in many digital humanities projects. A
popular approach for digitization is to scan the documents into images, and then convert the images into text
using optical character recognition (OCR) algorithms. However, the outcome of OCR processing of historical
documents is usually inaccurate and requires post-processing error correction. The purpose of this paper is to
investigate how crowdsourcing can be utilized to correct OCR errors in historical text collections, and which
crowdsourcing methodology is the most effective in different scenarios and for various research objectives.
Design/methodology/approach – A series of experiments with different micro-task structures and text
lengths was conducted with 753 workers on Amazon's Mechanical Turk platform. The workers had to fix
OCR errors in a selected historical text. To analyze the results, new accuracy and efficiency measures were developed.
Findings – The analysis suggests that, in terms of accuracy, the optimal text length is medium (paragraph-
sized) and the optimal structure of the experiment is a two-phase design with a scanned image. In terms of efficiency, the
best results were obtained when using longer texts in a single-stage structure with no image.
Practical implications – The study provides practical recommendations to researchers on how to build the
optimal crowdsourcing task for OCR post-correction. The developed methodology can also be utilized to
create gold-standard historical texts for automatic OCR post-correction.
Originality/value – This is the first attempt to systematically investigate the influence of various factors on
crowdsourcing-based OCR post-correction and propose an optimal strategy for this process.
Keywords Crowdsourcing, Digital humanities, Historical texts, OCR post-correction, Task decomposition
Paper type Research paper
In many digital humanities projects, there is a need to automatically analyze the content of
large collections of paper-based documents. Digitization of historical text collections is a
complex task essential both for research and preservation of cultural heritage. The first step
toward this goal is to scan historical documents (books, manuscripts or newspaper pages)
into high-resolution images and then to use optical character recognition (OCR)
technology to convert the images into text. For instance, the Library of Congress has a vast
digital historical collection (https://chroniclingamerica.loc.gov/), which has been digitized
using OCR to preserve and make it publicly available. The British Newspaper Archive
(www.britishnewspaperarchive.co.uk/) maintains an extensive digitized collection with
advanced discovery tools (Lansdall-Welfare et al., 2017). Even commercial enterprises have
initiated large-scale digitization projects based on OCR technology. For example, the Google
Books project has already scanned more than 25 million books and continues to scan 6,000 pages
per hour. The quality of the OCR technology is a critical aspect of this process.
Unfortunately, OCRed historical texts still contain a significant percentage of errors that
undermine further analysis, search and preservation.
OCR errors come in several forms: insertions, deletions, substitutions, transpositions of
one or two characters, splitting and concatenation of words, or a combination of several error
types in one word (Reynaert, 2008). Many OCR errors differ from human
spelling mistakes, and therefore common spelling-correction algorithms are not suitable for correcting them.
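These error types can be made concrete by comparing an OCR output against its ground-truth transcription. A standard way to quantify them, though not one taken from this paper, is the Levenshtein (edit) distance; the sketch below is a minimal illustration, and the OCR confusions shown (such as "rn" misread as "m", or a long s misread as "f") are invented examples of the error categories listed above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b, computed with
    the classic two-row dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion from a
                curr[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),   # substitution (or match)
            ))
        prev = curr
    return prev[-1]

# Invented examples of the error types above: a substitution-plus-deletion
# ("rn" read as "m"), a substitution (long s read as "f"), and a word
# split by a spurious space.
ground_truth = "modern historical records"
ocr_output = "modem hiftorical rec ords"
distance = levenshtein(ocr_output, ground_truth)

# Character error rate (CER): edit distance normalized by the length of
# the reference text.
cer = distance / len(ground_truth)
```

Note that because several of these confusions span more than one character, the edit distance alone does not distinguish error types; it only gives an aggregate count, which is why post-correction methods often classify errors separately.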
Received 30 July 2019
Accepted 30 October 2019
The current issue and full text archive of this journal is available on Emerald Insight.
Vol. 72 No. 2, 2020