Classification of scientific publications according to library controlled vocabularies. A new concept matching-based approach

DOI: https://doi.org/10.1108/LHT-03-2013-0030
Date: 18 November 2013
Pages: 725-747
Published date: 18 November 2013
Author: Arash Joorabchi, Abdulhussain E. Mahdi
Subject Matter: Library & information science, Librarianship/library management, Library technology
REGULAR PAPER
Classification of scientific
publications according to library
controlled vocabularies
A new concept matching-based approach
Arash Joorabchi and Abdulhussain E. Mahdi
Electronic and Computer Engineering Department, University of Limerick,
Limerick, Ireland
Abstract
Purpose – This paper aims to report on the design and development of a new approach for automatic
classification and subject indexing of research documents in scientific digital libraries and repositories
(DLR) according to library controlled vocabularies such as DDC and FAST.
Design/methodology/approach – The proposed concept matching-based approach (CMA) detects
key Wikipedia concepts occurring in a document and searches the OPACs of conventional libraries via
querying the WorldCat database to retrieve a set of MARC records which share one or more of the
detected key concepts. Then the semantic similarity of each retrieved MARC record to the document is
measured and, using an inference algorithm, the DDC classes and FAST subjects of those MARC
records which have the highest similarity to the document are assigned to it.
Findings – The performance of the proposed method in terms of the accuracy of the DDC classes and
FAST subjects automatically assigned to a set of research documents is evaluated using standard
information retrieval measures of precision, recall, and F1. The authors demonstrate the superiority of
the proposed approach in terms of accuracy performance in comparison to a similar system currently
deployed in a large-scale scientific search engine.
Originality/value – The proposed approach enables the development of a new type of subject
classification system for DLR, and addresses some of the problems similar systems suffer from, such
as the problem of imbalanced training data encountered by machine learning-based systems, and the
problem of word-sense ambiguity encountered by string matching-based systems.
Keywords Libraries, Information retrieval, Concept matching, Subject indexing, WorldCat, Wikipedia,
Scientific digital libraries and repositories, Metadata generation, Subject metadata,
Dewey Decimal Classification (DDC), FAST subject headings, Automatic classification
Paper type Research paper
Received 5 March 2013
Revised 26 June 2013, 8 September 2013
Accepted 9 September 2013
Library Hi Tech, Vol. 31 No. 4, 2013, pp. 725-747
© Emerald Group Publishing Limited, 0737-8831
The current issue and full text archive of this journal is available at
www.emeraldinsight.com/0737-8831.htm
This work is supported by the OCLC/ALISE Library & Information Science Research Grant
Program (LISRGP) 2012 and Irish Research Council New Foundations scheme 2012.
Both authors contributed equally to this work.
1. Introduction
The use of open access scientific digital libraries and repositories (DLR) is fast-growing
within research and academic communities. They provide open access platforms for
efficient dissemination of research output by individuals or groups in research-oriented
organisations such as universities, research and development companies, national
research labs, centres, and institutes. The research output comprises scientific
publications including journal articles, conference papers, technical reports, theses and
dissertations, book chapters, and other materials about the theory, practice, and results of
scientific inquiry. The size of DLR collections varies from a few thousand items, e.g. small
institutional repositories, to hundreds of thousands, e.g. arXiv (http://arxiv.org), and even
millions, e.g. PMC (www.ncbi.nlm.nih.gov/pmc) (Adamick and Reznik-Zellen, 2010). Also,
specialized search engines such as CiteSeerX (http://citeseerx.ist.psu.edu) and BASE
(www.base-search.net) harvest, aggregate and index up to tens of millions of academic
open access materials archived in institutional repositories, authors’ webpages, etc. As the
practice of open-access archiving grows due to the policy and enforcement initiatives
taken by many research funding agencies, and as DLR software systems mature, it is
expected that the size of DLR collections will grow exponentially. However, as these
collections grow in size, finding the most relevant and up-to-date archived materials
becomes challenging for the patrons. This is due to the fact that a great majority of
current DLR systems rely solely on traditional keyword-based search methods which are
prone to yield a large volume of indiscriminate search results irrespective of their content.
Therefore, in order to facilitate precision search and discovery of archived materials,
which enables patrons to focus their exploration efforts on the most relevant items of
interest and reduces the recall effort, i.e. the ratio of desired to examined, we need to go
beyond the traditional keyword-based search methods currently deployed.
Classification and subject indexing of archived materials according to library
controlled vocabularies can enhance the performance of DLR search and discovery
services. They also facilitate browsing the collections by category, e.g. Dewey Decimal
Classification (DDC) system, or by subject, e.g. Library of Congress Subject Headings
(LCSH). For example, the study of users' navigation behaviours in a large-scale
European meta subject gateway, Renardus, via log analysis by Traugott et al. (2004)
showed that the directory-style of browsing in the DDC-based browsing structure was
clearly the dominant activity, constituting 60 per cent of all activities. However,
manual classification and subject indexing of archived materials in DLR collections is a
resource-intensive task which requires expert cataloguers in each knowledge domain
represented in the collection and is therefore deemed impractical in many cases due to
the sheer volume of new materials published on a daily basis. For example, reportedly
the number of new publications in the field of biomedical science alone exceeds 1,800 a
day (Hunter and Cohen, 2006). Methods and approaches reported in the library and
information science literature to address this problem by automating the classification
and subject indexing process can be divided into two main categories:
(1) String matching-based systems. These systems rely on a method which
consists of string-to-string matching between words in a list of terms extracted
from library thesauri and classification schemes, and words in the textual
content of the document to be classified. In this approach, an unlabelled
document can be thought of as a search query against the library classification
schemes and thesauri, where the search results include the most probable
classes and subjects for the document. One of the well-known examples of such
systems is the Scorpion project by OCLC Research (Roger et al., 1997; Godby
and Smith, 2000-2002). Scorpion builds a set of reference clusters for DDC
classes and deploys a term-frequency distance measure to find the most
relevant cluster (and consequently DDC class) for the document to be classified.
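The general idea of treating an unlabelled document as a query against per-class term lists can be illustrated with a minimal sketch. It is not the Scorpion implementation: the three reference clusters are toy text invented for illustration rather than actual DDC schedule data, the function names are hypothetical, and cosine similarity over raw term counts is assumed here as one possible term-frequency measure.

```python
import math
import re
from collections import Counter

# Hypothetical reference clusters: DDC class number -> descriptive terms.
# Toy text for illustration only, not actual DDC schedule data.
REFERENCE_CLUSTERS = {
    "004": "computer science data processing computer systems",
    "025": "library operations cataloguing classification indexing",
    "610": "medicine health disease clinical treatment",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_classes(document, clusters=REFERENCE_CLUSTERS):
    """Treat the document as a query and rank classes by similarity."""
    doc_tf = Counter(tokenize(document))
    scores = {
        ddc: cosine_similarity(doc_tf, Counter(tokenize(terms)))
        for ddc, terms in clusters.items()
    }
    # Highest-scoring class first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_classes(
    "Automatic classification and subject indexing of library collections"
)
best_class, best_score = ranking[0]
```

Because the match is made purely on surface strings, a word that carries different senses in different classes would contribute equally to each of them, which is the word-sense ambiguity problem noted in the abstract.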