Enhancing document modeling by means of open topic models. Crossing the frontier of classification schemes in digital libraries by example of the DDC

Date	20 November 2009
Pages	520-539
Published date	20 November 2009
DOI	https://doi.org/10.1108/07378830911007646
Author	Alexander Mehler, Ulli Waltinger
Subject Matter	Information & knowledge management, Library & information science

Enhancing document modeling by means of open topic models

Crossing the frontier of classification schemes in digital libraries by example of the DDC

Alexander Mehler and Ulli Waltinger

Faculty of Technology, Bielefeld University, Bielefeld, Germany

Abstract

Purpose – The purpose of this paper is to present a topic classification model using the Dewey Decimal Classification (DDC) as the target scheme. This is to be done by exploring metadata as provided by the Open Archives Initiative (OAI) to derive document snippets as minimal document representations. The reason is to reduce the effort of document processing in digital libraries. Further, the paper seeks to perform feature selection and extension by means of social ontologies and related web-based lexical resources. This is done to provide reliable topic-related classifications while circumventing the problem of data sparseness. Finally, the paper aims to evaluate the model by means of two language-specific corpora. The paper bridges digital libraries, on the one hand, and computational linguistics, on the other. The aim is to make accessible computational linguistic methods to provide thematic classifications in digital libraries based on closed topic models such as the DDC.

Design/methodology/approach – The approach takes the form of text classification, text-technology, computational linguistics, computational semantics, and social semantics.

Findings – It is shown that SVM-based classifiers perform best by exploring certain selections of OAI document metadata.

Research limitations/implications – The findings show that it is necessary to further develop SVM-based DDC-classifiers by using larger training sets, possibly for more than two languages, in order to get better F-measure values.

Originality/value – Algorithmic and formal-mathematical information is provided on how to build DDC-classifiers for digital libraries.

Keywords Document management, Modelling, Digital libraries

Paper type Research paper

1. Introduction

It is beyond any doubt that automatic content classification is of utmost interest in digital libraries (Lossau, 2004). The idea is to provide content-related add-ons which allow for improving retrieval and document processing. In this introduction, we give a short overview of competing approaches in this field of research which focus on condensed document representations as provided, for example, by keyword lists or summaries.

The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm

The authors gratefully acknowledge financial support from the German Research Foundation (DFG) through the EC 277 Cognitive Interaction Technology, the Research Group 437 Text Technological Information Modeling and the DFG-LIS-Project P2P-Agents for Thematic Structuring and Search Optimization in Digital Libraries at Bielefeld University. They also thank Bielefeld University Library which kindly provided the test data used in this article.

Received 5 August 2009; revised 10 September 2009; accepted 14 September 2009

Library Hi Tech, Vol. 27 No. 4, 2009, pp. 520-539
© Emerald Group Publishing Limited, 0737-8831
DOI 10.1108/07378830911007646

An early approach to clustering document summaries at different levels of thematic granularity is the scatter-gather method (Cutting et al., 1992; Hearst and Pedersen, 1996). In recent years, variants of the Suffix Tree Clustering (STC) algorithm (Meyer zu Eißen, 2007; Stein and Meyer zu Eißen, 2003; Zamir and Etzioni, 1999; Stefanowski and Weiss, 2003) have also attracted attention in this field of research. These variants explore common sub-phrases of documents, which are judged to be similar because of their common suffix trees. An alternative approach with a focus on hierarchical document classification has been introduced by Zhang and Dong (2004), who explore search query snippets instead of summaries as the main source of document representation. These and related approaches form the core of search engines such as Vivísimo (Valdes-Perez et al., 2000), Mapuccino (Maarek et al., 2000) and Carrot (Osinski and Weiss, 2005), which perform post-retrieval document clustering. That is, they detect topic labels of thematic clusters based on document snippets (e.g., titles) as retrieved by search queries (Kules et al., 2006). The idea behind this approach is to enhance the identification of relevant documents by eliminating the need to skim large numbers of irrelevant texts.
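The phrase-sharing idea behind STC-style snippet clustering can be illustrated with a minimal sketch. It is a deliberate simplification and not the algorithm of the cited works: instead of building a suffix tree and merging overlapping base clusters, it indexes word n-grams directly and keeps every phrase that occurs in at least two snippets. The function names, the n-gram range and the toy snippets are assumptions made for illustration only.

```python
from collections import defaultdict


def word_ngrams(text, n_max=3):
    """Return the set of word n-grams (n = 2..n_max) of one snippet."""
    words = text.lower().split()
    grams = set()
    for n in range(2, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.add(" ".join(words[i:i + n]))
    return grams


def base_clusters(snippets, n_max=3, min_docs=2):
    """Map each phrase shared by at least min_docs snippets to the ids of those snippets."""
    index = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        for gram in word_ngrams(text, n_max):
            index[gram].add(doc_id)
    return {phrase: ids for phrase, ids in index.items() if len(ids) >= min_docs}


# Toy snippets (hypothetical); the shared phrase doubles as a candidate cluster label.
snippets = [
    "suffix tree clustering of web snippets",
    "web snippets and suffix tree clustering",
    "dewey decimal classification in libraries",
]
clusters = base_clusters(snippets)
for phrase, ids in sorted(clusters.items()):
    print(phrase, sorted(ids))
```

In this toy run the first two snippets are grouped under their shared phrases, while the third shares no phrase with the others and stays unclustered.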

This approach is easily transferred to the area of digital libraries where document snippets are given by subject-related metadata. A metadata protocol which has recently become increasingly prominent is the Open Archives Initiative-Protocol for Metadata Harvesting (OAI-PMH). This protocol implements a standardized metadata model for facilitating exchange between repositories. Approaches to document clustering in digital libraries have focused, among other things, on extending search queries and metadata entries of documents (Hagedorn et al., 2007; Rosenberg and Borgman, 1992). In this case, clustering is performed to detect the subject area of documents based on a predefined classification scheme, that is, a closed topic model (Newman et al., 2007).
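How a document snippet might be derived from OAI-PMH metadata can be sketched as follows. The `ListRecords` verb, the `oai_dc` metadata prefix and the two XML namespaces are part of the OAI-PMH and Dublin Core standards; the repository URL, the sample record and the choice of fields (title, subject, description) are assumptions for illustration and do not reproduce the selection of metadata evaluated in the article.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard namespace of the Dublin Core element set used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def snippet_from_record(xml_text, fields=("title", "subject", "description")):
    """Concatenate the text of the chosen Dublin Core fields into one snippet."""
    root = ET.fromstring(xml_text)
    parts = []
    for field in fields:
        for element in root.iter("{%s}%s" % (DC_NS, field)):
            if element.text and element.text.strip():
                parts.append(element.text.strip())
    return " ".join(parts)


# Hypothetical record, shaped like the metadata part of an oai_dc response.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example title</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>Digital libraries</dc:subject>
  <dc:description>A short abstract.</dc:description>
</oai_dc:dc>"""

url = list_records_url("https://repository.example.org/oai")  # placeholder host
snippet = snippet_from_record(SAMPLE)
print(url)
print(snippet)
```

Restricting the snippet to a few subject-bearing fields keeps the document representation small, which is the point of working with metadata rather than full texts.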

In this article, we present a topic classification model which uses the Dewey Decimal Classification (DDC) (OCLC, 2008) as the target scheme. Our approach is novel in two senses. On the one hand, we use metadata as provided by the Open Archives Initiative (OAI) to derive document snippets as minimized document representations. This is done to reduce the time and space complexity of document processing. On the other hand, we perform feature selection and feature extension by means of social ontologies and related web-based lexical resources. This is done to provide reliable topic-related classifications while circumventing the problem of data sparseness. In a nutshell, the article provides a model of topic-related document classifications whose semantics is explored by means of web-based resources of semantic relatedness and whose document model is mainly based on OAI data.
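Why feature extension helps against data sparseness can be shown with a small sketch: a short snippet yields few terms, so each term is supplemented by related terms that carry a reduced weight. The lexicon below is a hypothetical stand-in for the social ontologies and web-based lexical resources mentioned above, and the fixed weight of 0.5 is an arbitrary assumption; neither is taken from the article.

```python
from collections import Counter

# Hypothetical toy lexicon of related terms (stand-in for a real lexical resource).
RELATED = {
    "svm": ["classification", "machine learning"],
    "ddc": ["classification", "library"],
}


def extend_features(tokens, related=RELATED, weight=0.5):
    """Return term weights: 1.0 per observed token, plus `weight` per related term."""
    features = Counter()
    for token in tokens:
        features[token] += 1.0
    for token in tokens:
        for neighbour in related.get(token, []):
            features[neighbour] += weight
    return features


features = extend_features(["svm", "ddc"])
for term, value in sorted(features.items()):
    print(term, value)
```

Two snippets that share no surface term can still overlap after extension, here through the added term "classification".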

The article is structured as follows: in Section 2, we describe several reference points of document modeling in digital libraries. We do that to shed light on how to cross the frontier of classification schemes, i.e. moving from closed topic models toward open topic models. Next, in Section 3, we describe our test corpora and the representation of documents by means of OAI metadata. In Section 4, we introduce a search engine-based classifier for the DDC which integrates social semantic knowledge to enhance document representation. Further, in Section 5, we present an experiment in DDC classification using two different corpora and five different DDC-related classifiers. This experiment is discussed in detail in Section 6. Finally, Section 7 concludes and suggests prospects for future work.
