Development and customization of in-house developed OCR and its evaluation

DOI: https://doi.org/10.1108/EL-01-2018-0011
Date: 01 October 2018
Pages: 766-781
Published date: 01 October 2018
Authors: Rajeswari S., Sai Baba Magapu
Subject matter: Information & knowledge management, Information & communications technology, Internet
Development and customization of in-house developed OCR and its evaluation
Rajeswari S.
Homi Bhabha National Institute,
Indira Gandhi Center for Atomic Research, Kalpakkam, India, and
Sai Baba Magapu
Department of Natural Science and Engineering,
National Institute of Advanced Studies, Karnataka, India
Abstract
Purpose: The purpose of this paper is to develop a text extraction tool for scanned documents that would extract text and build the keywords corpus and key phrases corpus for the document without manual intervention.
Design/methodology/approach: For text extraction from scanned documents, a Web-based optical character recognition (OCR) tool was developed. OCR is a well-established technology, so to develop the OCR, Microsoft Office document imaging tools were used. To account for the commonly encountered problem of skew being introduced, a method to detect and correct the skew introduced in the scanned documents was developed and integrated with the tool. The OCR tool was customized to build keywords and key phrases corpus for every document.
Findings: The developed tool was evaluated using a 100-document corpus to test the various properties of OCR. The tool had above 99 per cent word-read accuracy for text-only image documents. The customization of the OCR was tested with samples of microfiches, samples of journal pages from back volumes and samples from newspaper clips, and the results are discussed in the summary. The tool was found to be useful for text extraction and processing.
Social implications: The scanned documents are converted to keywords and key phrases corpus. The tool could be used to build metadata for scanned documents without manual intervention.
Originality/value: The tool is used to convert unstructured data (in the form of image documents) to structured data (the document is converted into keywords and key phrases databases). In addition, the image document is converted to an editable and searchable document.
Keywords: Key phrases, Key words, Optical character recognition, Skew detection and correction, Stemming, Stop words
Paper type: Technical paper
1. Introduction
In the digital era, converting the information available in print form into electronic form has become a necessity, to enable integration of all the available resources. With hardware becoming available to scan documents, more and more information in print form is becoming available in digital form. Information resource centres and institutional repositories hold large numbers of such scanned documents. To retrieve metadata from these documents, text extraction tools are necessary. Optical character recognition
The authors thank Prof G. Sivakumar, IITB, for his suggestions and useful discussions.
Received 14 January 2018; revised 16 May 2018 and 29 May 2018; accepted 21 July 2018.
The Electronic Library, Vol. 36 No. 5, 2018, pp. 766-781. © Emerald Publishing Limited. ISSN 0264-0473. DOI 10.1108/EL-01-2018-0011.
(OCR) tools are used to extract text from scanned documents. OCR is a technology that converts different types of documents, such as scanned documents, PDF files or images captured by a digital camera, into an editable and searchable form. Early versions needed to be trained with images of each character and could work with only a limited number of fonts. These methods require a large pixel library of different fonts, along with their attributes like italics, bold, capitals, size variations, etc. Using an existing pixel library reduces the effort and time to be put into the development of an OCR tool and into further customization of the tool. In the present study, open-source libraries were used for the development of a Web-based OCR tool to extract text from scanned documents. Developing an in-house OCR tool would make it possible to store the extracted text in the format required for further processing. When a document is scanned, one of the problems commonly encountered is the introduction of skew. The OCR tools available do not directly correct the skew, warranting pre-processing. In the present development, the necessary skew correction features were added to the Web-based OCR tool, thereby enhancing the efficiency of the text extraction. The paper describes the development of the tool and the assessment of its text extraction accuracy for a variety of documents.
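The text above does not yet detail how skew is detected. One common approach, sketched below purely for illustration (the function names, the search range and the projection-profile criterion are assumptions, not the implementation used in this tool), is to estimate the skew angle as the rotation that makes the horizontal projection profile of the ink pixels most sharply peaked:

```python
import math

def projection_variance(points, angle_deg, height):
    # points: (x, y) coordinates of ink pixels on the page.
    # Shear each pixel by the candidate angle and accumulate a row profile;
    # a correctly deskewed page gives sharp peaks (high variance) in it.
    theta = math.radians(angle_deg)
    profile = [0] * height
    for x, y in points:
        row = int(round(y - x * math.tan(theta)))
        if 0 <= row < height:
            profile[row] += 1
    mean = sum(profile) / height
    return sum((v - mean) ** 2 for v in profile) / height

def estimate_skew(points, height, search=(-5.0, 5.0), step=0.5):
    # Exhaustive search over candidate angles (degrees); the best angle
    # is the one whose sheared profile is most sharply peaked.
    angles = []
    a = search[0]
    while a <= search[1] + 1e-9:
        angles.append(round(a, 3))
        a += step
    return max(angles, key=lambda a: projection_variance(points, a, height))
```

Once the angle is estimated, the page is rotated by its negative before the recognition stages run.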
Commercially available OCRs have features like extraction of text from invoices, images and multiple-column documents, and storing the output in files of the desired type and in the desired directories. It is desirable that an OCR tool should allow customization to build databases of keywords and key phrases from the OCR output documents. In the keywords builder, stop-word removal and stemming of words are necessary. Such features are not readily available in commercial OCR tools. The tool developed as part of the current work has all the above-mentioned features. The details are discussed in the subsequent sections.
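The keywords-builder pipeline described above (tokenize, remove stop words, stem) can be sketched as follows. The stop list and the crude suffix-stripping rules are illustrative stand-ins only, not the list or the stemmer used by the paper's tool:

```python
# Illustrative stop list; a real keywords builder would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}

def naive_stem(word):
    # Strip a few common English suffixes; a crude stand-in for a
    # full stemmer such as Porter's algorithm.
    for suffix in ("ing", "tion", "ers", "er", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_keywords(text):
    # Lowercase, strip surrounding punctuation, drop stop words, stem.
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]
```

The resulting word stems can then be stored per document to form the keywords corpus.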
2. Optical character recognition steps to extract text from image file
OCR extracts text from scanned files. Scanned files are files with extensions like PDF, TIFF, JPG and PNG. The first step implemented in most OCRs is binarization (Sezgin and Sankur, 2004). Binarization can be broadly categorized as global (Gonzalez and Woods, 2002) and local (Gllavata et al., 2003). In this process, the entire file is converted into 1s and 0s using algorithms; any colour or grey scale found in the file is removed. Converting the entire file into 1s and 0s based on a single threshold is called Otsu binarization (Otsu, 1979). The alternative to Otsu is adaptive thresholding (Sauvola et al., 1997), in which the entire page is segmented and localized thresholding is applied to the different segments (Kaur et al., 2013). The OCR then segments the given file into pages.
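Otsu's global method, cited above, chooses the single threshold that maximizes the between-class variance of the grey-level histogram. A minimal self-contained sketch of that computation (operating on a flat list of 8-bit grey values, which is a simplifying assumption):

```python
def otsu_threshold(pixels):
    # Build a 256-bin histogram of 8-bit grey levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    sum_bg, w_bg = 0.0, 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]            # background weight up to threshold t
        if w_bg == 0:
            continue
        w_fg = total - w_bg        # foreground weight
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        m_bg = sum_bg / w_bg       # background mean
        m_fg = (sum_all - sum_bg) / w_fg  # foreground mean
        var_between = w_bg * w_fg * (m_bg - m_fg) ** 2
        if var_between > best_var: # keep the threshold with maximum
            best_var, best_t = var_between, t  # between-class variance
    return best_t

def binarize(pixels, threshold):
    # Pixels above the threshold map to 1, the rest to 0.
    return [1 if p > threshold else 0 for p in pixels]
```

Sauvola-style adaptive thresholding would instead apply a computation like this per local window, using the window's mean and standard deviation.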
Page analysis is the next major routine used in the character recognition process. In this process, page elements like blocks of text, tables, lines and single characters are searched for (Kahan et al., 1987). Line segmentation consists of slicing a page of text into its different lines. This step also analyses interline spacing and line skew, and separates touching lines. Word segmentation isolates one word from another using the inter-word spacing in the document (Zuva et al., 2011). Character segmentation separates the various letters of a word. If the characters have the same width (fixed pitch), character segmentation is easy. The problem is more complex when the width of the letters varies, which is called proportional pitch (Microsoft Office, 2016). The actual character recognition extracts characteristics out of each isolated shape and assigns a symbol to it.
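The line segmentation step described above is often implemented with a horizontal projection profile: rows containing ink belong to a text line, and blank rows separate consecutive lines. A minimal sketch under that assumption (word segmentation works the same way on the vertical profile within each line):

```python
def segment_lines(image):
    # image: list of rows, each a list of 0/1 pixels (1 = ink).
    # A row belongs to a text line if it contains any ink pixel.
    profile = [sum(row) for row in image]
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink and start is None:
            start = y                      # a new line begins here
        elif not ink and start is not None:
            lines.append((start, y - 1))   # a blank row ends the line
            start = None
    if start is not None:                  # line running to the last row
        lines.append((start, len(image) - 1))
    return lines                           # (top, bottom) row per line
```

Real pages need the extra handling the text mentions, such as touching lines and residual skew, which this sketch ignores.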
The pixels of a scanned image need to be organized into characters. To do this, a library of characters in pixel form is necessary, and the library chosen should support different
