Enhancing document modeling by means of open topic models. Crossing the frontier of classification schemes in digital libraries by example of the DDC

Date: 20 November 2009
Pages: 520-539
Published date: 20 November 2009
DOI: https://doi.org/10.1108/07378830911007646
Author: Alexander Mehler, Ulli Waltinger
Subject Matter: Information & knowledge management, Library & information science
Alexander Mehler and Ulli Waltinger
Faculty of Technology, Bielefeld University, Bielefeld, Germany
Abstract
Purpose – The purpose of this paper is to present a topic classification model using the Dewey Decimal
Classification (DDC) as the target scheme. This is to be done by exploring metadata as provided by the
Open Archives Initiative (OAI) to derive document snippets as minimal document representations. The
reason is to reduce the effort of document processing in digital libraries. Further, the paper seeks to
perform feature selection and extension by means of social ontologies and related web-based lexical
resources. This is done to provide reliable topic-related classifications while circumventing the problem
of data sparseness. Finally, the paper aims to evaluate the model by means of two language-specific
corpora. The paper bridges digital libraries, on the one hand, and computational linguistics, on the other.
The aim is to make accessible computational linguistic methods to provide thematic classifications in
digital libraries based on closed topic models such as the DDC.
Design/methodology/approach – The approach takes the form of text classification,
text-technology, computational linguistics, computational semantics, and social semantics.
Findings – It is shown that SVM-based classifiers perform best by exploring certain selections of
OAI document metadata.
Research limitations/implications – The findings show that it is necessary to further develop
SVM-based DDC-classifiers by using larger training sets possibly for more than two languages in
order to get better F-measure values.
Originality/value – Algorithmic and formal-mathematical information is provided on how to build
DDC-classifiers for digital libraries.
Keywords Document management, Modelling, Digital libraries
Paper type Research paper
1. Introduction
It is beyond any doubt that automatic content classification is of utmost interest in
digital libraries (Lossau, 2004). The idea is to provide content-related add-ons which
allow for improving retrieval and document processing. In this introduction, we give a
short overview of competing approaches in this field of research which focus on
condensed document representations as provided, for example, by keyword lists or
summaries.
The authors gratefully acknowledge financial support from the German Research Foundation
(DFG) through the EC 277 Cognitive Interaction Technology, the Research Group 437 Text
Technological Information Modeling and the DFG-LIS-Project P2P-Agents for Thematic
Structuring and Search Optimization in Digital Libraries at Bielefeld University. They also thank
Bielefeld University Library which kindly provided the test data used in this article.
Received 5 August 2009
Revised 10 September 2009
Accepted 14 September 2009
Library Hi Tech
Vol. 27 No. 4, 2009
pp. 520-539
© Emerald Group Publishing Limited
0737-8831
DOI 10.1108/07378830911007646
An early approach to clustering document summaries at different levels of thematic
granularity is the scatter-gather method (Cutting et al., 1992; Hearst and Pedersen,
1996). In recent years, variants of the Suffix Tree Clustering (STC) algorithm (Meyer zu
Eißen, 2007; Stein and Meyer zu Eißen, 2003; Zamir and Etzioni, 1999; Stefanowski and
Weiss, 2003) have also attracted attention in this field of research. These variants explore
common sub-phrases of documents which are judged to be similar because of their
common suffix trees. An alternative approach with a focus on hierarchical document
classification has been introduced by Zhang and Dong (2004), who explore search
query snippets instead of summaries as the main source of document representation.
These and related approaches form the core of search engines such as Vivísimo
(Valdes-Perez et al., 2000), Mapuccino (Maarek et al., 2000) and Carrot (Osinski and
Weiss, 2005), which perform post-retrieval document clustering. That is, they detect
topic labels of thematic clusters based on document snippets (e.g., titles) as retrieved by
search queries (Kules et al., 2006). The idea behind this approach is to enhance the
identification of relevant documents by eliminating the need to skim large numbers of
irrelevant texts.
This approach is easily transferred to the area of digital libraries where document
snippets are given by subject-related metadata. A metadata protocol that has recently
become increasingly prominent is the Open Archives Initiative Protocol for
Metadata Harvesting (OAI-PMH). This protocol implements a standardized metadata
model for facilitating exchange between repositories. Approaches to document
clustering in digital libraries have focused, among other things, on extending search
queries and metadata entries of documents (Hagedorn et al., 2007; Rosenberg and
Borgman, 1992). In this case, clustering is performed to detect the subject area of
documents based on a predefined classification scheme, that is, a closed topic model
(Newman et al., 2007).
In this article, we present a topic classification model which uses the Dewey Decimal
Classification (DDC) (OCLC, 2008) as the target scheme. Our approach is novel in two
senses. On the one hand, we use metadata as provided by the Open Archives Initiative
(OAI) to derive document snippets as minimized document representations. This is
done to reduce the time and space complexity of document processing. On the other
hand, we perform feature selection and feature extension by means of social ontologies
and related web-based lexical resources. This is done to provide reliable topic-related
classifications while circumventing the problem of data sparseness. In a nutshell, the
article provides a model of topic-related document classifications whose semantics is
explored by means of web-based resources of semantic relatedness and whose
document model is mainly based on OAI data.
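To make the notion of a document snippet concrete, the following minimal Python sketch derives one from a single OAI-PMH record in the Dublin Core (oai_dc) format. It is an illustration under stated assumptions, not the implementation used in this article: the function name record_to_snippet, the choice of fields (title, subject, description) and the example record are all hypothetical, and only the two XML namespaces are taken from the oai_dc format itself.

```python
import xml.etree.ElementTree as ET

# Namespaces of the OAI-PMH Dublin Core (oai_dc) metadata format.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Assumption: the selection of fields is illustrative only.
SNIPPET_FIELDS = ("title", "subject", "description")


def record_to_snippet(oai_dc_xml: str, fields=SNIPPET_FIELDS) -> str:
    """Concatenate the text of selected Dublin Core elements into one snippet."""
    root = ET.fromstring(oai_dc_xml)
    parts = []
    for field in fields:
        # Collect every occurrence of the element, e.g. several dc:subject entries.
        for elem in root.iter(f"{{{NS['dc']}}}{field}"):
            text = (elem.text or "").strip()
            if text:
                parts.append(text)
    return " ".join(parts)


# Hypothetical record, made up for illustration.
EXAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Topic models for digital libraries</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>Text classification</dc:subject>
  <dc:description>A short abstract.</dc:description>
</oai_dc:dc>
"""

snippet = record_to_snippet(EXAMPLE_RECORD)
```

A snippet of this kind is far shorter than the full text, which is the sense in which OAI metadata reduces the time and space complexity of document processing; elements outside the selected fields, such as dc:creator, are simply ignored.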
The article is structured as follows: in Section 2, we describe several reference points
of document modeling in digital libraries. We do that to shed light on how to cross the
frontier of classification schemes, i.e. moving from closed topic models toward open
topic models. Next, in Section 3, we describe our test corpora and the representation of
documents by means of OAI metadata. In Section 4, we introduce a search
engine-based classifier for the DDC which integrates social semantic knowledge to
enhance document representation. Further, in Section 5, we present an experiment in
DDC classification using two different corpora and five different DDC-related
classifiers. This experiment is discussed in detail in Section 6. Finally, Section 7
concludes and suggests prospects for future work.