Classification of scientific publications according to library controlled vocabularies. A new concept matching-based approach

DOI	https://doi.org/10.1108/LHT-03-2013-0030
Date	18 November 2013
Pages	725-747
Published date	18 November 2013
Author	Arash Joorabchi, Abdulhussain E. Mahdi
Subject Matter	Library & information science, Librarianship/library management, Library technology

REGULAR PAPER

Classification of scientific publications according to library controlled vocabularies

A new concept matching-based approach

Arash Joorabchi and Abdulhussain E. Mahdi

Electronic and Computer Engineering Department, University of Limerick, Limerick, Ireland

Abstract

Purpose – This paper aims to report on the design and development of a new approach for automatic classification and subject indexing of research documents in scientific digital libraries and repositories (DLR) according to library controlled vocabularies such as DDC and FAST.

Design/methodology/approach – The proposed concept matching-based approach (CMA) detects key Wikipedia concepts occurring in a document and searches the OPACs of conventional libraries by querying the WorldCat database to retrieve a set of MARC records which share one or more of the detected key concepts. The semantic similarity of each retrieved MARC record to the document is then measured and, using an inference algorithm, the DDC classes and FAST subjects of the MARC records most similar to the document are assigned to it.

Findings – The accuracy of the DDC classes and FAST subjects that the proposed method automatically assigns to a set of research documents is evaluated using the standard information retrieval measures of precision, recall, and F1. The authors demonstrate that the proposed approach is more accurate than a similar system currently deployed in a large-scale scientific search engine.

Originality/value – The proposed approach enables the development of a new type of subject classification system for DLR, and addresses some of the problems similar systems suffer from, such as the problem of imbalanced training data encountered by machine learning-based systems, and the problem of word-sense ambiguity encountered by string matching-based systems.

Keywords Libraries, Information retrieval, Concept matching, Subject indexing, WorldCat, Wikipedia, Scientific digital libraries and repositories, Metadata generation, Subject metadata, Dewey Decimal Classification (DDC), FAST subject headings, Automatic classification

Paper type Research paper
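
The abstract names precision, recall, and F1 as the evaluation measures. As a minimal sketch of how these measures are commonly computed for the set of subjects assigned to one document (the abstract does not say how scores are averaged across documents, and the function name and subject labels below are made up for illustration):

```python
def precision_recall_f1(assigned, expected):
    """Score a set of automatically assigned labels against a reference set."""
    assigned, expected = set(assigned), set(expected)
    true_positives = len(assigned & expected)
    # Precision: share of assigned labels that are correct.
    precision = true_positives / len(assigned) if assigned else 0.0
    # Recall: share of reference labels that were found.
    recall = true_positives / len(expected) if expected else 0.0
    # F1: harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical subject labels, not taken from the paper's data set.
assigned = {"Information retrieval", "Digital libraries", "Machine learning"}
expected = {"Information retrieval", "Digital libraries", "Metadata", "Cataloging"}
p, r, f1 = precision_recall_f1(assigned, expected)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Two of the three assigned labels are correct and two of the four reference labels are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.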

Library Hi Tech, Vol. 31 No. 4, 2013, pp. 725-747
© Emerald Group Publishing Limited, 0737-8831
DOI 10.1108/LHT-03-2013-0030

Received 5 March 2013
Revised 26 June 2013, 8 September 2013
Accepted 9 September 2013

The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

This work is supported by the OCLC/ALISE Library & Information Science Research Grant Program (LISRGP) 2012 and Irish Research Council New Foundations scheme 2012.

Both authors contributed equally to this work.

1. Introduction

The use of open access scientific digital libraries and repositories (DLR) is growing fast within research and academic communities. They provide open access platforms for efficient dissemination of research output by individuals or groups in research-oriented organisations such as universities, research and development companies, national research labs, centres, and institutes. The research output comprises scientific publications including journal articles, conference papers, technical reports, theses and dissertations, book chapters, and other materials about the theory, practice, and results of scientific inquiry. The size of DLR collections varies from a few thousand items, e.g. small institutional repositories, to hundreds of thousands, e.g. arXiv (http://arxiv.org), and even millions, e.g. PMC (www.ncbi.nlm.nih.gov/pmc) (Adamick and Reznik-Zellen, 2010). Also, specialized search engines such as CiteSeerX (http://citeseerx.ist.psu.edu) and BASE (www.base-search.net) harvest, aggregate and index up to tens of millions of academic open access materials archived in institutional repositories, authors' webpages, etc. As the practice of open-access archiving grows due to the policy and enforcement initiatives taken by many research funding agencies, and as DLR software systems mature, it is expected that the size of DLR collections will grow exponentially. However, as these collections grow in size, finding the most relevant and up-to-date archived materials becomes challenging for patrons. This is because the great majority of current DLR systems rely solely on traditional keyword-based search methods, which are prone to yield a large volume of indiscriminate search results irrespective of their content. Therefore, in order to facilitate precision search and discovery of archived materials, which enables patrons to focus their exploration efforts on the most relevant items of interest and reduces the recall effort, i.e. the ratio of desired items to examined items, we need to go beyond the traditional keyword-based search methods currently deployed.

Classification and subject indexing of archived materials according to library controlled vocabularies can enhance the performance of DLR search and discovery services. They also facilitate browsing the collections by category, e.g. using the Dewey Decimal Classification (DDC) system, or by subject, e.g. using Library of Congress Subject Headings (LCSH). For example, a log-analysis study of users' navigation behaviour in a large-scale European meta subject gateway, Renardus, by Traugott et al. (2004) showed that directory-style browsing in the DDC-based browsing structure was clearly the dominant activity, constituting 60 per cent of all activities. However, manual classification and subject indexing of archived materials in DLR collections is a resource-intensive task which requires expert cataloguers in each knowledge domain represented in the collection and is therefore deemed impractical in many cases due to the sheer volume of new materials published on a daily basis. For example, the number of new publications in the field of biomedical science alone reportedly exceeds 1,800 a day (Hunter and Cohen, 2006). Methods and approaches reported in the library and information science literature to address this problem by automating the classification and subject indexing process can be divided into two main categories:

(1) String matching-based systems. These systems rely on string-to-string matching between words in a list of terms extracted from library thesauri and classification schemes, and words in the textual content of the document to be classified. In this approach, an unlabelled document can be thought of as a search query against the library classification schemes and thesauri, where the search results include the most probable classes and subjects for the document. A well-known example of such systems is the Scorpion project by OCLC Research (Roger et al., 1997; Godby and Smith, 2000–2002). Scorpion builds a set of reference clusters for DDC classes and deploys a term-frequency distance measure to find the most relevant cluster (and consequently DDC class) for the document to be classified.
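
The general idea of treating a document as a query against a vocabulary can be sketched as follows. This is an illustrative toy and not Scorpion's actual algorithm: the vocabulary, its class labels and term lists, and the simple term-frequency score are all assumptions made for the example.

```python
import re
from collections import Counter

# Toy vocabulary mapping a class label to a few terms.
# Labels and terms are made up for illustration, not real DDC captions.
VOCABULARY = {
    "025.04 Information storage and retrieval": [
        "information", "retrieval", "search", "indexing", "library",
    ],
    "006.3 Artificial intelligence": [
        "learning", "neural", "intelligence", "classification",
    ],
    "570 Biology": ["cell", "gene", "protein", "species"],
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def rank_classes(document, vocabulary):
    """Rank classes by how often their terms occur verbatim in the document."""
    counts = Counter(tokenize(document))
    scores = {}
    for label, terms in vocabulary.items():
        score = sum(counts[term] for term in terms)
        if score:
            scores[label] = score
    # Highest score first; ties broken alphabetically by label.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


doc = (
    "Subject indexing improves search and retrieval in a digital library; "
    "automatic classification supports library search."
)
ranking = rank_classes(doc, VOCABULARY)
for label, score in ranking:
    print(score, label)
```

Because the match is purely on strings, `rank_classes("cell phone network", VOCABULARY)` returns the biology class, which illustrates the word-sense ambiguity problem the abstract attributes to string matching-based systems.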
