Development and customization of in-house developed OCR and its evaluation

DOI: https://doi.org/10.1108/EL-01-2018-0011
Date: 01 October 2018
Pages: 766-781
Published date: 01 October 2018
Authors: Rajeswari S., Sai Baba Magapu
Subject matter: Information & knowledge management, Information & communications technology, Internet
Development and customization of in-house developed OCR and its evaluation
Rajeswari S.
Homi Bhabha National Institute,
Indira Gandhi Center for Atomic Research, Kalpakkam, India, and
Sai Baba Magapu
Department of Natural Science and Engineering,
National Institute of Advanced Studies, Karnataka, India
Abstract
Purpose: The purpose of this paper is to develop a text extraction tool for scanned documents that would extract text and build the keywords corpus and key phrases corpus for the document without manual intervention.
Design/methodology/approach: For text extraction from scanned documents, a Web-based optical character recognition (OCR) tool was developed. OCR is a well-established technology, so to develop the OCR, Microsoft Office document imaging tools were used. To account for the commonly encountered problem of skew being introduced, a method to detect and correct the skew introduced in the scanned documents was developed and integrated with the tool. The OCR tool was customized to build keywords and key phrases corpus for every document.
Findings: The developed tool was evaluated using a 100-document corpus to test the various properties of OCR. The tool had above 99 per cent word-read accuracy for text-only image documents. The customization of the OCR was tested with samples of microfiches, samples of journal pages from back volumes and samples from newspaper clips, and the results are discussed in the summary. The tool was found to be useful for text extraction and processing.
Social implications: The scanned documents are converted to keywords and key phrases corpus. The tool could be used to build metadata for scanned documents without manual intervention.
Originality/value: The tool is used to convert unstructured data (in the form of image documents) to structured data (the document is converted into keywords and key phrases databases). In addition, the image document is converted to an editable and searchable document.
Keywords: Key phrases, Key words, Optical character recognition, Skew detection and correction, Stemming, Stop words
Paper type: Technical paper
1. Introduction
In the digital era, converting the information available in print form into electronic form has become a necessity, to enable integration of all the available resources. With hardware becoming available to scan documents, more and more information in print form is becoming available in digital form. Information resource centres and institutional repositories hold large numbers of such scanned documents. To retrieve metadata from these documents, text extraction tools are necessary. Optical character recognition
The authors thank Prof G. Sivakumar, IITB, for his suggestions and useful discussions.
Received 14 January 2018; revised 16 May 2018 and 29 May 2018; accepted 21 July 2018.
The Electronic Library, Vol. 36 No. 5, 2018, pp. 766-781. © Emerald Publishing Limited. ISSN 0264-0473. DOI 10.1108/EL-01-2018-0011.
(OCR) tools are used to extract text from scanned documents. OCR is a technology that converts different types of documents, such as scanned documents, PDF files or images captured by a digital camera, into an editable and searchable form. Early versions needed to be trained with images of each character and could work with only a limited number of fonts. These methods require a large pixel library of different fonts, along with their attributes like italics, bold, capitals, size variations, etc. Using an existing pixel library reduces the effort and time to be put into the development of an OCR tool and into further customization of the tool. In the present study, open-source libraries were used for the development of a Web-based OCR tool to extract text from scanned documents. Developing an in-house OCR tool would make it possible to store the extracted text in the format required for further processing. When a document is scanned, one of the problems commonly encountered is the introduction of skew. The OCR tools available do not directly correct the skew, warranting pre-processing. In the present development, the necessary skew correction features were added to the Web-based OCR tool, thereby enhancing the efficiency of the text extraction. The paper describes the development of the tool and the assessment of its text extraction accuracy for a variety of documents.
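The text above does not yet detail how skew is detected. One common approach, sketched below purely for illustration (the function names, the search range and the projection-profile criterion are assumptions, not the implementation used in this tool), is to estimate the skew angle as the rotation that makes the horizontal projection profile of the ink pixels most sharply peaked:

```python
import math

def projection_variance(points, angle_deg, height):
    # points: (x, y) coordinates of ink pixels on the page.
    # Shear each pixel by the candidate angle and accumulate a row profile;
    # a correctly deskewed page gives sharp peaks (high variance) in it.
    theta = math.radians(angle_deg)
    profile = [0] * height
    for x, y in points:
        row = int(round(y - x * math.tan(theta)))
        if 0 <= row < height:
            profile[row] += 1
    mean = sum(profile) / height
    return sum((v - mean) ** 2 for v in profile) / height

def estimate_skew(points, height, search=(-5.0, 5.0), step=0.5):
    # Exhaustive search over candidate angles (degrees); the best angle
    # is the one whose sheared profile is most sharply peaked.
    angles = []
    a = search[0]
    while a <= search[1] + 1e-9:
        angles.append(round(a, 3))
        a += step
    return max(angles, key=lambda a: projection_variance(points, a, height))
```

Once the angle is estimated, the page is rotated by its negative before the recognition stages run.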
Commercially available OCRs have features like extraction of text from invoices, images and multiple-column documents, and storing the output in files of the desired type and in the desired directories. It is desirable that an OCR tool should allow customization to build databases of keywords and key phrases from the OCR output documents. In the keywords builder, stop-word removal and stemming of words are necessary. Such features are not readily available in commercial OCR tools. The tool developed as part of the current work has all the above-mentioned features. The details are discussed in the subsequent sections.
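The keywords-builder pipeline described above (tokenize, remove stop words, stem) can be sketched as follows. The stop list and the crude suffix-stripping rules are illustrative stand-ins only, not the list or the stemmer used by the paper's tool:

```python
# Illustrative stop list; a real keywords builder would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}

def naive_stem(word):
    # Strip a few common English suffixes; a crude stand-in for a
    # full stemmer such as Porter's algorithm.
    for suffix in ("ing", "tion", "ers", "er", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_keywords(text):
    # Lowercase, strip surrounding punctuation, drop stop words, stem.
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]
```

The resulting word stems can then be stored per document to form the keywords corpus.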
2. Optical character recognition steps to extract text from image file
OCR extracts text from scanned files. Scanned files are files with extensions like PDF, TIFF, JPG and PNG. The first step implemented in most OCRs is binarization (Sezgin and Sankur, 2004). Binarization can be broadly categorized as global (Gonzalez and Woods, 2002) and local (Gllavata et al., 2003). In this process, the entire file is converted into 1s and 0s using algorithms; any colour or grey scale found in the file is removed. Converting the entire file into 1s and 0s based on a single threshold is called Otsu binarization (Otsu, 1979). The alternative to Otsu is adaptive thresholding (Sauvola et al., 1997), in which the entire page is segmented and localized thresholding is applied to the different segments (Kaur et al., 2013). The OCR then segments the given file into pages.
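Otsu's global method, cited above, chooses the single threshold that maximizes the between-class variance of the grey-level histogram. A minimal self-contained sketch of that computation (operating on a flat list of 8-bit grey values, which is a simplifying assumption):

```python
def otsu_threshold(pixels):
    # Build a 256-bin histogram of 8-bit grey levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    sum_bg, w_bg = 0.0, 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]            # background weight up to threshold t
        if w_bg == 0:
            continue
        w_fg = total - w_bg        # foreground weight
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        m_bg = sum_bg / w_bg       # background mean
        m_fg = (sum_all - sum_bg) / w_fg  # foreground mean
        var_between = w_bg * w_fg * (m_bg - m_fg) ** 2
        if var_between > best_var: # keep the threshold with maximum
            best_var, best_t = var_between, t  # between-class variance
    return best_t

def binarize(pixels, threshold):
    # Pixels above the threshold map to 1, the rest to 0.
    return [1 if p > threshold else 0 for p in pixels]
```

Sauvola-style adaptive thresholding would instead apply a computation like this per local window, using the window's mean and standard deviation.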
Page analysis is the next major routine used in the character recognition process. In this process, page elements like blocks of text, tables, lines and single characters are searched for (Kahan et al., 1987). Line segmentation consists of slicing a page of text into its different lines. This step also analyses interline spacing and line skew, and separates touching lines. Word segmentation isolates one word from another using the inter-word spacing in the document (Zuva et al., 2011). Character segmentation separates the various letters of a word. If the characters have the same width (fixed pitch), character segmentation is easy. The problem is more complex when the width of the letters varies, which is called proportional pitch (Microsoft Office, 2016). The actual character recognition extracts characteristics out of each isolated shape and assigns a symbol to it.
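The line segmentation step described above is often implemented with a horizontal projection profile: rows containing ink belong to a text line, and blank rows separate consecutive lines. A minimal sketch under that assumption (word segmentation works the same way on the vertical profile within each line):

```python
def segment_lines(image):
    # image: list of rows, each a list of 0/1 pixels (1 = ink).
    # A row belongs to a text line if it contains any ink pixel.
    profile = [sum(row) for row in image]
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink and start is None:
            start = y                      # a new line begins here
        elif not ink and start is not None:
            lines.append((start, y - 1))   # a blank row ends the line
            start = None
    if start is not None:                  # line running to the last row
        lines.append((start, len(image) - 1))
    return lines                           # (top, bottom) row per line
```

Real pages need the extra handling the text mentions, such as touching lines and residual skew, which this sketch ignores.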
The pixels of a scanned image need to be organized into characters. To do this, a library of characters in pixel form is necessary, and the library chosen should support different
