The integration of document image processing and text retrieval principles

Pages: 273-278
DOI: https://doi.org/10.1108/eb045245
Published date: 01 April 1993
Author: Niël van der Merwe
Subject matter: Information & knowledge management, Library & information science
Niël van der Merwe
Xcel, PO Box 20355, Alkantrant 0005, South Africa
Abstract: This paper will discuss the integration of document image processing and text retrieval principles in order to process and load existing paper documents automatically into an electronic document database. Such a database broadens the user's capability to retrieve relevant information more accurately, without going through costly processes to get paper documents into electronic text. The principles of document image processing systems, as well as the problems and shortcomings of most of today's document image processing systems, will be discussed. Concept retrieval, the latest development in text retrieval, will then be discussed with specific reference to the ability of the TOPIC intelligent text retrieval system to allow users to build up a knowledge base of search objects or concepts that can be used at any time by all users of the system. The paper will then look specifically at the automatic processing of paper documents by converting the scanned document page images to electronic text. It will cover the use of optical character recognition (OCR) technology, the indexing and loading of the documents into a text database, the automatic linking of the documents to the related document images, and the retrieval technology available in TOPIC, specifically the TYPO operator, which was developed to handle so-called dirty data such as common misspellings, character transpositions and the 'dirty' text received as output from the OCR process. A possible solution for loading paper documents quickly and cost-effectively into an electronic document database will be discussed and demonstrated in detail. The advantages and disadvantages of this approach will be discussed with specific reference to an electronic news clipping service application.
1. Introduction
Today only 10% of the information used by an organisation is in electronic form, compared to 90% of information that is printed and not immediately machine-readable (Martin 1990). Unfortunately, this is the critical mass of information, namely incoming business correspondence, magazines, newspapers, external R&D reports and reference works. Therefore it is important to look at possibilities of processing and loading existing paper documents automatically in an electronic document database.
2. Principles of a document image processing system
Traditionally a document image processing system is understood as a system that makes it possible to capture images of paper documents and to index, store and retrieve them as electronic images on a computer screen. Images of the paper pages of a document are captured using a scanner. Normally all the page images of a specific document are stored in a multi-page image file. Under control of an application program these documents are indexed by someone who first reads through the document and then assigns certain keywords. The multi-page document image files are then often stored on optical discs. If we take newspaper clippings as an example, the application program provides access to a database of non-image, field-oriented data such as the title and author of an article, the name of the newspaper, the page number and the date, all of which are stored on magnetic disk. This database also contains references to the document images that are stored on optical disc and are related to a specific database record. This enables a user to retrieve information from the database by querying the field-oriented information, and when a specific record is retrieved the related image can also be viewed on screen.
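The record-to-image link described above can be sketched in a few lines. This is only an illustrative sketch, not the system discussed in the paper: it assumes an in-memory SQLite database in place of the SQL database on magnetic disk, and the table name `clipping`, the columns `image_volume` and `image_file`, the helper functions and all sample values are hypothetical.

```python
import sqlite3

# Hypothetical schema: the field-oriented clipping data lives in a relational
# table, and each record carries a reference to its multi-page image file.
SCHEMA = """
CREATE TABLE clipping (
    id           INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    author       TEXT,
    newspaper    TEXT NOT NULL,
    page_number  INTEGER,
    published    TEXT,            -- ISO date string
    keywords     TEXT,            -- keywords assigned by the human indexer
    image_volume TEXT NOT NULL,   -- label of the optical disc (assumed)
    image_file   TEXT NOT NULL    -- multi-page image file name (assumed)
);
"""


def open_index():
    """Create an empty in-memory index database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def add_clipping(conn, **fields):
    """Insert one indexed clipping and return its record id."""
    columns = ", ".join(fields)
    placeholders = ", ".join("?" for _ in fields)
    cur = conn.execute(
        f"INSERT INTO clipping ({columns}) VALUES ({placeholders})",
        tuple(fields.values()),
    )
    return cur.lastrowid


def find_images(conn, newspaper, keyword):
    """Query the field-oriented data and return the linked image references."""
    rows = conn.execute(
        "SELECT title, image_volume, image_file FROM clipping "
        "WHERE newspaper = ? AND keywords LIKE ? ORDER BY published",
        (newspaper, f"%{keyword}%"),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_index()
    # Invented sample record, for illustration only.
    add_clipping(
        conn,
        title="Rates cut",
        author="A. Writer",
        newspaper="Daily News",
        page_number=3,
        published="1993-04-01",
        keywords="interest rates; banking",
        image_volume="DISC-001",
        image_file="000123.tif",
    )
    for title, volume, image in find_images(conn, "Daily News", "banking"):
        print(title, volume, image)
```

The user only ever queries the field-oriented columns; the image reference returned with each record tells the viewer which file to fetch from optical disc. Retrieval is therefore limited to whatever the indexer chose to record.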
The main hardware components of a document image processing system normally consist of a scanner, personal computers with high resolution monitors, high resolution printers (at least 300 dpi), optical disc storage devices (either WORM or CD-ROM technology can be used), either image processing hardware boards or image processing software that can do the compression and decompression of the stored images, and finally an application program that is connected to a database (most of today's systems use an SQL-type relational database for this purpose, such as Oracle, Sybase, SQLBase, SQL Server, Informix, etc.). The system is normally run on a local area network so that multiple users can access the stored information.
Documents are captured by using scanners and created as
digital paper pages. It is important to note that a scanner
'senses' the whole page as a rectangular array of dots or pic-
The Electronic Library, Vol. 11, No. 4/5, August/October 1993, p. 273
