Toward the optimized crowdsourcing strategy for OCR
Omri Suissa, Avshalom Elmalech and Maayan Zhitomirsky-Geffet
Bar-Ilan University, Ramat Gan, Israel
Purpose – Digitization of historical documents is a challenging task in many digital humanities projects. A
popular approach for digitization is to scan the documents into images, and then convert the images into text
using optical character recognition (OCR) algorithms. However, the outcome of OCR processing of historical
documents is usually inaccurate and requires post-processing error correction. The purpose of this paper is to
investigate how crowdsourcing can be utilized to correct OCR errors in historical text collections, and which
crowdsourcing methodology is the most effective in different scenarios and for various research objectives.
Design/methodology/approach – A series of experiments with different micro-task structures and text
lengths was conducted with 753 workers on Amazon's Mechanical Turk platform. The workers had to fix
OCR errors in a selected historical text. To analyze the results, new accuracy and efficiency measures were developed.
Findings – The analysis suggests that, in terms of accuracy, the optimal text length is medium (paragraph-
sized) and the optimal structure of the experiment is a two-phase design with a scanned image. In terms of efficiency, the
best results were obtained when using longer texts in a single-stage structure with no image.
Practical implications – The study provides practical recommendations to researchers on how to build the
optimal crowdsourcing task for OCR post-correction. The developed methodology can also be utilized to
create gold-standard historical texts for automatic OCR post-correction.
Originality/value – This is the first attempt to systematically investigate the influence of various factors on
crowdsourcing-based OCR post-correction and propose an optimal strategy for this process.
Keywords Crowdsourcing, Digital humanities, Historical texts, OCR post-correction, Task decomposition
Paper type Research paper
In many digital humanities projects, there is a need to automatically analyze the content of
large collections of paper-based documents. Digitization of historical text collections is a
complex task essential both for research and preservation of cultural heritage. The first step
toward this goal is to scan historical documents (books, manuscripts or newspaper pages)
into high-resolution images and then to use optical character recognition (OCR)
technology to convert the images into text. For instance, the Library of Congress has a vast
digital historical collection (https://chroniclingamerica.loc.gov/), which has been digitized
using OCR to preserve and make it publicly available. The British Newspaper Archive
(www.britishnewspaperarchive.co.uk/) maintains an extensive digitized collection with
advanced discovery tools (Lansdall-Welfare et al., 2017). Even commercial enterprises have
initiated large-scale digitization projects based on OCR technology. For example, the Google
Books project has already scanned more than 25 million books and continues to scan 6,000 pages
per hour. The quality of the OCR technology is a critical aspect of this process.
Unfortunately, OCRed historical texts still contain a significant percentage of errors that
undermine further analysis, search and preservation.
OCR errors come in several forms: insertions, deletions, substitutions, transpositions of
one or two characters, splitting and concatenation of words, or a combination of several error
types in one word (Reynaert, 2008). Many OCR errors differ from human
spelling mistakes, and therefore common spelling-correction algorithms are not suitable for correcting them.
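These error types can be made concrete by comparing an OCR output against its ground-truth transcription. A standard way to quantify them, though not one taken from this paper, is the Levenshtein (edit) distance; the sketch below is a minimal illustration, and the OCR confusions shown (such as "rn" misread as "m", or a long s misread as "f") are invented examples of the error categories listed above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b, computed with
    the classic two-row dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion from a
                curr[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),   # substitution (or match)
            ))
        prev = curr
    return prev[-1]

# Invented examples of the error types above: a substitution-plus-deletion
# ("rn" read as "m"), a substitution (long s read as "f"), and a word
# split by a spurious space.
ground_truth = "modern historical records"
ocr_output = "modem hiftorical rec ords"
distance = levenshtein(ocr_output, ground_truth)

# Character error rate (CER): edit distance normalized by the length of
# the reference text.
cer = distance / len(ground_truth)
```

Note that because several of these confusions span more than one character, the edit distance alone does not distinguish error types; it only gives an aggregate count, which is why post-correction methods often classify errors separately.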
Received 30 July 2019
Accepted 30 October 2019
The current issue and full text archive of this journal is available on Emerald Insight.
Vol. 72 No. 2, 2020