Heuristics for identification of bibliographic elements from verso of title pages

Date: 01 December 2004
Published date: 01 December 2004
Pages: 397-403
DOI: https://doi.org/10.1108/07378830410570502
Author: A.R.D. Prasad, Durga Sankar Rath
Subject Matter: Information & knowledge management, Library & information science
The authors
A.R.D. Prasad is an Associate Professor in the Documentation
Research and Training Centre, Indian Statistical Institute,
Bangalore, Karnataka, India.
Durga Sankar Rath is a Lecturer in the Department of Library
and Information Science, Ravindra Bharati University,
Kolkata, India.
Keywords
Bibliographic systems, Information operations, Data handling,
Cataloguing, Classification schemes
Abstract
This paper presents a methodology to capture bibliographic data
from the verso of the title pages of documents. A survey has
been undertaken to identify the syntactic and semantic features
of bibliographic elements on the verso of title pages. These
features include the font size, line numbers and appearance of
certain strings of characters. Emphasis is given to the study of
“cataloguing-in-publication” data. The results of the survey are
used to develop heuristics which can help in developing a
program to automatically identify the various bibliographic data
elements. The backs of the title pages are scanned and stored as
HTML pages using optical character recognition (OCR) software.
The heuristics are then applied to the HTML pages. A few samples
of input and the output generated are presented. Finally, the
problems related to OCR and the heuristics are enumerated.
Introduction
The present paper deals with a part of our attempt
to develop a heuristics-based expert system for
automatic identification of bibliographic
descriptive elements from the title page and the back of the
title page of documents. As the title pages of
documents do not contain all the required
bibliographic descriptive elements (conspicuously
the date of publication, ISBN and other elements),
we have to resort to the back of the title page.
An additional advantage of the back of the title page is
that the cataloguing-in-publication (CIP) data
can be used to cross-check the data found on the
title page, especially the title, author and publisher.
However, this paper deals with the back of the title
page only.
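The cross-check between the title page and the CIP data mentioned above can be pictured as a tolerant string comparison. The following Python sketch is one possible way to do it; the normalisation rules, the 0.85 threshold and the function names `normalise` and `fields_agree` are assumptions made for illustration and are not part of the system described in this paper.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case the text, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def fields_agree(title_page_value, cip_value, threshold=0.85):
    """Return True when two field values are similar enough to count as a match.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    a = normalise(title_page_value)
    b = normalise(cip_value)
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

A comparison of this kind tolerates differences in case and punctuation, which are common between a typeset title page and a CIP entry; how much OCR noise it should tolerate would have to be decided from real samples.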
For the bibliographic description we should
resort to the various locations in the document
itself, as well as tap information found in some
outside sources. Although Ranganathan (1965)
insisted more on using the
title page of the document as the chief source of
information, the AA code and the Anglo-American
Cataloging Rules 1 (AACR1) laid emphasis on
various other pages and the document as a whole.
As defined by the Anglo-American Cataloging
Rules 2 (AACR2), revised edition (1988), the
verso is the left-hand page of a book, usually
bearing an even page number (Gorman and
Winkler (Eds), 1988, p. 624).
The back of the title page is customarily known
as the verso, or copyright page. This page should
carry: the address of the publisher; a copyright line
and associated information; the date of first
publication and those of reprints or new editions;
the country in which the book was printed and (in
some jurisdictions) the name and address of the
printer; CIP (Cataloguing in publication) data if
available; and the ISBN (International Standard
Book Number).
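As a rough illustration of how some of these elements might be spotted from character strings alone, the following Python sketch scans the lines of a verso page for a copyright line, a CIP heading and an ISBN, and checks the ISBN-10 check digit. The regular expressions, the function names (`isbn10_is_valid`, `classify_line`, `scan_verso`) and the sample lines are assumptions made for illustration; they are not the heuristics developed in this paper.

```python
import re

# Illustrative patterns only (assumptions, not the paper's heuristics).
PATTERNS = {
    "copyright": re.compile(r"(©|\(c\)|\bcopyright\b)", re.IGNORECASE),
    "cip": re.compile(r"catalogu?ing[\s-]+in[\s-]+publication", re.IGNORECASE),
    "isbn": re.compile(r"\bISBN[\s:]*([0-9][0-9Xx\- ]{8,15}[0-9Xx])"),
}


def isbn10_is_valid(raw):
    """Check an ISBN-10: the weighted digit sum must be divisible by 11."""
    chars = [c for c in raw.upper() if c.isdigit() or c == "X"]
    # 'X' (value 10) is only allowed as the final check character.
    if len(chars) != 10 or "X" in chars[:9]:
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c))
                for i, c in enumerate(chars))
    return total % 11 == 0


def classify_line(line):
    """Return the labels of every pattern that matches the line."""
    return [label for label, pattern in PATTERNS.items() if pattern.search(line)]


def scan_verso(lines):
    """Map 1-based line numbers to the labels found on each matching line."""
    found = {}
    for number, line in enumerate(lines, start=1):
        labels = classify_line(line)
        if labels:
            found[number] = labels
    return found


# Hypothetical verso text, invented for this example.
sample = [
    "Published by Example Press, 1 Sample Road, Kolkata",
    "Copyright (c) 2004 Example Press",
    "First published 2004",
    "Library of Congress Cataloging-in-Publication Data",
    "ISBN 0-306-40615-2",
]
```

Calling `scan_verso(sample)` labels lines 2, 4 and 5 of the sample as the copyright line, the CIP heading and the ISBN respectively. Elements without a fixed marker string, such as the publisher's address, would need other cues such as position on the page or font size.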
Rule 2.0B of AACR2R states:
Except Title and Statement of responsibility (that
also includes colophon, which again may appear at
the Verso of the title page), Edition, Publication,
Physical description, Series, Note, Standard
Number and terms of availability can be taken from
any of the preliminary pages, colophon, or from
any source. All these are described as the
prescribed sources of information for the respective
fields (Gorman and Winkler (Eds), 1988, p. 13).
Library Hi Tech
Volume 22 · Number 4 · 2004 · pp.397-403
© Emerald Group Publishing Limited · ISSN 0737-8831
DOI 10.1108/07378830410570502
Received: 20 June 2003
Revised: 30 November 2003
Accepted: 16 February 2004