Duplicate detection algorithms of bibliographic descriptions

DOI: https://doi.org/10.1108/07378830810880379
Published: 13 June 2008
Pages: 287-301
Authors: Anestis Sitas, Sarantos Kapidakis
Subject matter: Information & knowledge management; Library & information science
Anestis Sitas
School of Philosophy, Aristotle University of Thessaloniki, and
School of Library Science, Technological Institute of Thessaloniki,
Thessaloniki, Greece, and
Sarantos Kapidakis
Archive and Library Sciences Department, Ionian University, Paleo Anaktoro,
Greece
Abstract
Purpose – The purpose of this paper is to focus on duplicate record detection algorithms used in
bibliographic databases.
Design/methodology/approach – Individual algorithms, their application process for duplicate
detection and their results are described based on available literature (published articles), information
found at various library web sites and follow-up e-mail communications.
Findings – Algorithms are categorized according to their application as a process of a single step or
two consecutive steps. The results of deletion, merging, and temporary and virtual consolidation of
duplicate records are studied.
Originality/value – The paper presents an overview of the duplication detection algorithms and an
up-to-date state of their application in different library systems.
Keywords Cataloguing, Algorithms, Bibliographic systems, Records management
Paper type Research paper
Introduction
The ideal setup for a library catalogue would be to register a unique bibliographic
record for each bibliographic entity. However, bibliographic databases include several
types of duplicate records. Even when the search cues are clearly specified, locating the
correct entry remains an issue that requires further investigation as new materials are
added in a variety of media. Duplicate records slow down the indexing process,
significantly increase the cost of saving and managing data, and delay retrieval. As a
result, duplicate records constitute a system deficiency and compromise quality control
for all parties involved, namely users, catalogers, and technical staff. Shared cataloging
further aggravates the problem because, through the automated systems, each member
library of a system can access the other members’ records. Administrators have to
improve the quality of the bibliographic database and keep it functional and “clean”.
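The kind of two-step approach surveyed in the paper can be illustrated with a minimal sketch: a coarse match key first narrows the candidate set (blocking), and a weighted field-by-field comparison then decides which candidates are duplicates. The field names, key construction, weights, and threshold below are illustrative assumptions, not the keys or weights of any specific system the paper describes.

```python
# Hypothetical two-step duplicate detection sketch.
# Step 1 (blocking): group records by a coarse match key.
# Step 2 (comparison): score candidate pairs field by field.
import re
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and strip punctuation so superficial differences do not block a match."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def match_key(record):
    """Coarse key: first 10 characters of the normalized title plus the year."""
    return (normalize(record["title"])[:10], record.get("year", ""))


def similarity(a, b):
    """Weighted similarity over illustrative fields (title weighted above author)."""
    score = 0.0
    for field, weight in [("title", 0.6), ("author", 0.4)]:
        ratio = SequenceMatcher(
            None, normalize(a.get(field, "")), normalize(b.get(field, ""))
        ).ratio()
        score += weight * ratio
    return score


def find_duplicates(records, threshold=0.9):
    """Return id pairs of records judged to describe the same document."""
    blocks = defaultdict(list)
    for rec in records:                      # step 1: blocking
        blocks[match_key(rec)].append(rec)
    pairs = []
    for group in blocks.values():            # step 2: pairwise comparison
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if similarity(group[i], group[j]) >= threshold:
                    pairs.append((group[i]["id"], group[j]["id"]))
    return pairs
```

The blocking step is what makes such algorithms practical on large catalogs: only records sharing a key are compared, so the expensive pairwise scoring runs within small groups rather than across the whole database.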
Duplicate records
In the environment of bibliographic databases, duplicate records can be defined as
two or more records that describe the same document (defined as any information
resource). Duplicate records can cause problems in the following areas:
Received 11 October 2007
Revised 22 October 2007
Accepted 27 January 2008
Library Hi Tech, Vol. 26 No. 2, 2008, pp. 287-301
© Emerald Group Publishing Limited, 0737-8831
DOI 10.1108/07378830810880379