Heuristics for identification of bibliographic elements from title pages

Date: 01 December 2004
Published date: 01 December 2004
DOI: https://doi.org/10.1108/07378830410570494
Pages: 389-396
Authors: Durga Sankar Rath, A.R.D. Prasad
Subject Matter: Information & knowledge management, Library & information science
The authors
Durga Sankar Rath is a Lecturer in the Department of Library
and Information Science, Ravindra Bharati University,
Kolkata, India.
A.R.D. Prasad is an Associate Professor, Documentation
Research and Training Centre, Indian Statistical Institute,
Bangalore, Karnataka, India.
Keywords
Bibliographic systems, Data handling, Cataloguing,
Classification schemes, Information operations
Abstract
This paper presents a methodology for automatic identification
of bibliographic data elements from the title pages of books.
It also enumerates the steps involved, such as scanning the title pages,
running optical character recognition (OCR) software, generating
HTML files out of title pages and applying heuristics to identify
the bibliographic data elements. Much of the paper deals with
the surveys undertaken to analyze the characteristics of various
bibliographic descriptive elements like title, author, publisher
and other elements. The first survey deals with the sequence of
the bibliographic data in the title pages. The second survey deals
with the font size, font type and the proximity of each
bibliographic element on the title pages. The survey results are
then used to derive heuristics for a rule-based expert system
that can identify the bibliographic elements on
the title pages. The results of the system are presented, along
with problems encountered.
Electronic access
The Emerald Research Register for this journal is
available at
www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is
available at
www.emeraldinsight.com/0737-8831.htm
Introduction
One of the most time-consuming technical
operations in libraries is cataloguing. The
cataloguing process describes each item in a
collection, organizes the description into a
coherent structure of relationships, and provides a
tool in the form of a catalogue to access any
document in a library. Although the work involved
in cataloguing is very time consuming and not
easily automated, libraries have long tried to
reduce the amount of time and effort involved
(Akiyama, 1990). The process of determining
bibliographic data from title pages of the
documents is complex, yet systematic. An
investigation of the intellectual process involved
may yield a few heuristics to design an expert
system paradigm that can automatically identify
the bibliographic data elements from the title
pages.
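As a rough illustration of what such heuristics might look like, the
sketch below labels the lines of a title page using font size and
sequence, the two kinds of evidence the surveys in this paper examine.
The three rules, the `Line` type and the `label_title_page` function are
hypothetical and chosen only for illustration; they are not the rules
derived from the surveys.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Line:
    text: str
    font_size: float  # assumed to be read from the OCR-generated HTML
    position: int     # 0-based order of the line on the title page


def label_title_page(lines: List[Line]) -> Dict[int, str]:
    """Return a tentative label for each line, keyed by its position."""
    if not lines:
        return {}
    ordered = sorted(lines, key=lambda ln: ln.position)
    labels: Dict[int, str] = {}

    # Hypothetical rule 1: lines set in the largest font form the title.
    largest = max(ln.font_size for ln in ordered)
    for ln in ordered:
        if ln.font_size == largest:
            labels[ln.position] = "title"

    # Hypothetical rule 2: the last line names the publisher,
    # unless it has already been labelled as part of the title.
    labels.setdefault(ordered[-1].position, "publisher")

    # Hypothetical rule 3: unlabelled lines after the title are
    # treated as authors; anything before it is left as "other".
    title_end = max(p for p, lab in labels.items() if lab == "title")
    for ln in ordered:
        if ln.position not in labels:
            labels[ln.position] = "author" if ln.position > title_end else "other"
    return labels
```

A real system would need many more rules, and would have to resolve
conflicts between them, but the shape is the same: observable layout
features go in, tentative element labels come out.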
The process of descriptive cataloguing begins
with the identification of bibliographic data about
an item. They include the following (Hagler and
Simmons, 1982):
- Title and Statement of Responsibility area
  (i.e. the name of the item and names
  designating its intellectual responsibility).
- Edition area.
- Publication and distribution area.
- Series area.
- Note area.
- Standard Number area.
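The areas listed above can be pictured as the fields of a single record.
The following is a minimal sketch under that assumption; the class name,
field names and the `missing_areas` helper are hypothetical and are not
taken from any cataloguing standard or from the system described here.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class BibliographicDescription:
    # One field per descriptive area; names are illustrative only.
    title: Optional[str] = None
    statement_of_responsibility: List[str] = field(default_factory=list)
    edition: Optional[str] = None
    publication_distribution: Optional[str] = None
    series: Optional[str] = None
    notes: List[str] = field(default_factory=list)
    standard_number: Optional[str] = None

    def missing_areas(self) -> List[str]:
        """Names of areas that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A record filled in from a title page alone would typically leave several
areas empty, and `missing_areas()` makes that gap explicit.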
Some bibliographic data can be easily found in the
item itself; others may come from other sources. To
ensure that all items are described in the
same way, using at least the same starting point for
gathering data, cataloguers have introduced the notion
of a “chief source of information”.
The “chief source of information” is defined as
“the source of bibliographic data to be given first
preference as the source from which a
bibliographic description (or portion thereof) is
prepared” (Gorman and Winkler, 1988). For
monographs, the prescribed first preference as a
source of information for descriptive cataloguing
is the title page, “the page that occurs very near
the beginning of a book that contains the most
complete bibliographic information about the book”
(Gorman and Winkler, 1978). “The title page serves a
purpose of information ... and as a means of
distinction and identification” (Wyner, 1980).
The purpose of this study is to investigate ways
in which artificial intelligence techniques can be
applied to the cataloguing process. The basic problem
is to analyze, in terms of the conceptual level and
logical flows, the way a computer can be taught to
recognize bibliographic elements from the title
Library Hi Tech
Volume 22 · Number 4 · 2004 · pp.389-396
© Emerald Group Publishing Limited · ISSN 0737-8831
DOI 10.1108/07378830410570494