Development and customization of in-house developed OCR and its evaluation
DOI | https://doi.org/10.1108/EL-01-2018-0011 |
Date | 01 October 2018 |
Pages | 766-781 |
Published date | 01 October 2018 |
Author | Rajeswari S., Sai Baba Magapu |
Subject Matter | Information & knowledge management, Information & communications technology, Internet |
Rajeswari S.
Homi Bhabha National Institute,
Indira Gandhi Center for Atomic Research, Kalpakkam, India, and
Sai Baba Magapu
Department of Natural Science and Engineering,
National Institute of Advanced Studies, Karnataka, India
Abstract
Purpose – The purpose of this paper is to develop a text extraction tool for scanned documents that would
extract text and build the keywords corpus and key phrases corpus for the document without manual
intervention.
Design/methodology/approach – For text extraction from scanned documents, a Web-based optical
character recognition (OCR) tool was developed. OCR is a well-established technology, so to develop the OCR,
Microsoft Office document imaging tools were used. To account for the commonly encountered problem of
skew being introduced, a method to detect and correct the skew introduced in the scanned documents was
developed and integrated with the tool. The OCR tool was customized to build keywords and key phrases
corpus for every document.
Findings – The developed tool was evaluated using a 100 document corpus to test the various properties of
OCR. The tool had above 99 per cent word read accuracy for text only image documents. The customization of
the OCR was tested with samples of Microfiches, sample of Journal pages from back volumes and samples
from newspaper clips and the results are discussed in the summary. The tool was found to be useful for text
extraction and processing.
Social implications – The scanned documents are converted to keywords and key phrases corpus. The
tool could be used to build metadata for scanned documents without manual intervention.
Originality/value – The tool is used to convert unstructured data (in the form of image documents) to
structured data (the document is converted into keywords, and key phrases database). In addition, the image
document is converted to editable and searchable document.
Keywords Key phrases, Key words, Optical character recognition, Skew detection and correction,
Stemming, Stop words
Paper type Technical paper
Received 14 January 2018
Revised 16 May 2018, 29 May 2018
Accepted 21 July 2018
The Electronic Library, Vol. 36 No. 5, 2018, pp. 766-781
© Emerald Publishing Limited, 0264-0473
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0264-0473.htm
The authors thank Prof G. Sivakumar, IITB, for his suggestions and useful discussions.
1. Introduction
In the digital era, converting the information available in print form into electronic form
has become a necessity, to enable integration of all the resources available. With hardware
becoming available to scan documents, more and more information in print form is
becoming available in digital form. Information resource centres and institutional
repositories have large holdings of such scanned documents. To retrieve metadata
from these documents, text extraction tools are necessary. Optical character recognition
(OCR) tools are used to extract text from scanned documents. OCR is a technology that
converts different types of documents, such as scanned documents, PDF files or images
captured by a digital camera, into editable and searchable form. Early versions
needed to be trained with images of each character and could work only with a limited
number of fonts. These methods require a large pixel library of different fonts
along with their attributes such as italics, bold, capitals and size variation. Using an existing
pixel library reduces the effort and time required to develop an OCR tool and to
customize it further. In the present study, open-source libraries were used for
the development of a Web-based OCR tool to extract text from the scanned documents.
Developing an in-house OCR tool makes it possible to store the extracted text in the
format that is required for further processing. When a document is scanned, one of the
problems commonly encountered is the introduction of skew in the scanned document.
The OCR tools available did not correct the skew directly, warranting the application of
pre-processing. In the present development, the necessary skew correction features were added
to the Web-based OCR tool, thereby enhancing the efficiency of the text
extraction. The paper describes the development of the tool and the assessment of its text
extraction accuracy for a variety of documents.
Commercially available OCR tools offer features such as extraction of text from invoices,
images and multi-column documents, and storage of the output in files and
directories of the user's choice. It is desirable that an OCR tool can also be customized to build
databases of keywords and key phrases from the OCR output documents. In the keywords
builder, stop word removal and stemming of words are necessary. Such features are not
readily available in the commercial OCR tools. The tool developed as part of the current
work has all the above-mentioned features. The details are discussed in the subsequent
sections.
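To make the keywords builder concrete, the following is a minimal sketch of stop word removal followed by stemming. It is an illustration under stated assumptions, not the tool's implementation: the stop word list, the suffix rules and the names `STOP_WORDS`, `SUFFIXES`, `naive_stem` and `build_keywords` are all hypothetical, and the suffix stripper is a crude stand-in for a proper stemming algorithm.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
              "of", "the", "to", "was", "were"}

# Hypothetical suffix rules, tried in order. This is a naive stripper,
# not an established stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_keywords(text):
    """Return a Counter of stemmed tokens after dropping stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)
```

For the input "The scanned documents were scanned for scanning", this sketch drops "the", "were" and "for", and counts the stem "scann" three times and "document" once. The over-stemmed "scann" shows why a production keywords builder would more likely rely on an established stemmer.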
2. Optical character recognition – steps to extract text from image file
OCR extracts text from scanned files, that is, files with extensions such as PDF, TIFF,
JPG and PNG. The first step implemented in most OCR tools is binarization (Sezin and
Sankur, 2004). In this process, the entire file is converted into 1s and 0s using
algorithms, and any colour or grey scale found in the file is removed. Binarization
can be broadly categorized as global (Gonzalez and Woods, 2002) and local
(Gllavata et al., 2003). Converting the entire file into 1s and 0s based on a single
threshold is global binarization; Otsu's method (Otsu, 1979) is a widely used way of
choosing that threshold. The alternative is adaptive thresholding (Sauvola et al., 1997).
Here, the entire page is segmented, and localised thresholding is applied to the
different segments (Kaur et al., 2013).
Then, the OCR tool segments the given file into pages.
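The difference between global and local binarization can be sketched in a few lines. The code below is a minimal illustration, not the tool's implementation. It assumes 8-bit grey values held in a list of rows and assumes the convention 1 = ink (dark), 0 = background. The function names are hypothetical. The local variant simply applies Otsu's threshold tile by tile, a simplification standing in for adaptive methods such as Sauvola's, which use the local mean and standard deviation.

```python
def otsu_threshold(pixels):
    """Return the threshold t in 0..255 that maximizes between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    A uniform input has no valid split and yields t = 0.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of their intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize_global(image):
    """Binarize the whole image with one Otsu threshold (1 = ink, assumed)."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= t else 0 for p in row] for row in image]


def binarize_local(image, tile):
    """Binarize each tile x tile block with its own Otsu threshold."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            t = otsu_threshold([image[r][c] for r in rows for c in cols])
            for r in rows:
                for c in cols:
                    out[r][c] = 1 if image[r][c] <= t else 0
    return out
```

For the unevenly lit sample `[[100, 200, 20, 90], [200, 100, 90, 20]]`, the global threshold comes out at 100, so the darker background value 90 in the right half is marked as ink. With `tile=2`, each half gets its own threshold and the result is `[[1, 0, 1, 0], [0, 1, 0, 1]]`.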
Page analysis is the next major routine in the character recognition process. In this
process, page elements such as blocks of text, tables, lines and single characters are located
(Kahan et al., 1987). Line segmentation slices a page of text into its separate
lines. This step also analyses interline spacing and line skew, and separates touching lines.
Word segmentation isolates one word from another using the inter-word spacing in the
document (Zuva et al., 2011). Character segmentation separates the individual letters of a
word. If the characters have the same width (fixed pitch), character segmentation is easy.
The problem is more complex when the width of the letters varies, which is called
proportional pitch (Microsoft Office, 2016). The actual character recognition extracts
characteristics from each isolated shape and assigns a symbol.
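Line and word segmentation as described above can be illustrated with projection profiles, a common textbook approach that is assumed here and is not necessarily the method used in the tool. The sketch takes a binarized page as a list of rows with 1 = ink. The function names and the `min_gap` parameter are hypothetical.

```python
def runs(flags):
    """Return (start, end) pairs, end exclusive, for each run of true values."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans


def segment_lines(page):
    """Rows containing any ink form text lines; blank rows separate them."""
    return runs([any(row) for row in page])


def segment_words(page, line, min_gap):
    """Split one text line into words at gaps of at least min_gap blank columns."""
    top, bottom = line
    width = len(page[0])
    ink_cols = [any(page[r][c] for r in range(top, bottom)) for c in range(width)]
    words = []
    for start, end in runs(ink_cols):
        # A gap narrower than min_gap is treated as inter-character spacing,
        # so the span is merged into the previous word.
        if words and start - words[-1][1] < min_gap:
            words[-1] = (words[-1][0], end)
        else:
            words.append((start, end))
    return words
```

On a small page whose rows 1 to 2 hold ink in columns 0, 2, 6 and 7, `segment_lines` returns the row span `(1, 3)`, and `segment_words` with `min_gap=2` merges columns 0 and 2 into one word `(0, 3)` while keeping `(6, 8)` separate. Touching lines and skewed text would defeat this simple profile, which is why the step also has to analyse line skew.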
The pixels of a scanned image need to be organized into characters. To do this, a library
of characters in pixel form is necessary, and the library chosen should support different