Automated subject classification of textual web documents

Date: 01 May 2006
Pages: 350-371
DOI: https://doi.org/10.1108/00220410610666501
Author: Koraljka Golub
Subject Matter: Information & knowledge management, Library & information science
Koraljka Golub
Department of Information Technology, Lund University, Lund, Sweden
Abstract
Purpose – To provide an integrated perspective on the similarities and differences between approaches
to automated classification in different research communities (machine learning, information retrieval
and library science), and to point to problems with the approaches and with automated classification as such.
Design/methodology/approach – A range of works dealing with automated classification of
full-text web documents are discussed. Explorations of individual approaches are given in the following
sections: special features (description, differences, evaluation), application and characteristics of
web pages.
Findings – Provides major similarities and differences between the three approaches: document
pre-processing and utilization of web-specific document characteristics are common to all the
approaches; major differences are in the algorithms applied and in whether the vector space model
and controlled vocabularies are employed. Problems of automated classification are recognized.
Research limitations/implications – The paper does not attempt to provide an exhaustive
bibliography of related resources.
Practical implications – As an integrated overview of approaches from different research
communities with application examples, it is very useful for students in library and information
science and computer science, as well as for practitioners. Researchers from one community gain
information on how similar tasks are conducted in different communities.
Originality/value – To the author’s knowledge, no review paper on automated text classification
has attempted to discuss more than one community’s approach from an integrated perspective.
Keywords Automation, Classification, Internet, Document management, Controlled languages
Paper type Literature review
1. Introduction
Classification is, for the purpose of this paper, defined as:
... the multistage process of deciding on a property or characteristic of interest,
distinguishing things or objects that possess that property from those which lack it, and
grouping things or objects that have the property or characteristic in common into a class.
Other essential aspects of classification are establishing relationships among classes and
making distinctions within classes to arrive at subclasses and finer divisions (Chan, 1994,
p. 259).
Automated subject classification (in further text: automated classification) denotes
machine-based organization of related information objects into topically related
groups. In this process human intellectual processes are replaced by, for example,
statistical and computational linguistics techniques. In the literature on automated
classification, the terms automatic and automated are both used. Here the term
automated is chosen because it more directly implies that the process is machine-based.
Many thanks to Traugott Koch, Anders Ardö, Tatjana Aparac Jelušić, Johan Eklund, Ingo
Frommholz, Repke de Vries and the Journal of Documentation reviewers for providing valuable
feedback on earlier versions of the paper.
Received February 2005. Revised August 2005. Accepted September 2005.
Journal of Documentation, Vol. 62 No. 3, 2006, pp. 350-371. © Emerald Group Publishing Limited, ISSN 0022-0418.
Automated classification has been a challenging research issue for several
decades. The major motivation has been the high cost of manual classification.
Interest has grown rapidly since 1997, when search engines could no longer rely on
text retrieval techniques alone because the number of available documents grew
exponentially. Owing to the ever-increasing number of documents, there is a
danger that recognized objectives of bibliographic systems would get left behind;
automated means could be a solution to preserve them (Svenonius, 2000, pp. 20-1,
30). Automated classification of text finds its use in a wide variety of applications,
such as: organizing documents into subject categories for topical browsing,
including grouping search results by subject; topical harvesting; personalized
routing of news articles; filtering of unwanted content for internet browsers; and
many others (Sebastiani, 2002; Jain et al., 1999).
The narrower focus of this paper is the automated classification of textual web
documents into subject categories for browsing. Web documents have specific
characteristics such as hyperlinks and anchors, metadata, and structural information,
all of which could serve as complementary features to improve automated
classification. On the other hand, they are rather heterogeneous; many of them
contain little text, the metadata provided are sparse and can be misused, structural tags
can also be misused, and titles can be general (“home page”, “untitled document”).
Browsing in this paper refers to seeking documents via a hierarchical structure of
subject classes into which the documents have been classified. Research has shown that
people find browsing useful in a number of information-seeking situations, such as
when not looking for a specific item, when inexperienced in searching (Koch and
Zettergren, 1999), or when unfamiliar with the subject in question and its terminology or
structure (Schwartz, 2001, p. 76).
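As an illustration of the web-specific characteristics mentioned above, the following minimal sketch pulls the title, metadata and anchor text out of a page using Python's standard `html.parser` module. It is not taken from any of the reviewed systems; the class `WebFeatureParser`, the function `extract_features`, the set `GENERIC_TITLES` and the sample page are all hypothetical names and data introduced here for illustration only.

```python
from html.parser import HTMLParser

# Titles too general to help classification (the two examples given in the text).
GENERIC_TITLES = {"home page", "untitled document"}


class WebFeatureParser(HTMLParser):
    """Collect the title, named meta elements and anchor text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.metadata = {}   # meta name -> content
        self.anchors = []    # list of (href, anchor text) pairs
        self._in_title = False
        self._current_href = None
        self._anchor_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.metadata[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self._current_href = attrs["href"]
            self._anchor_text = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._current_href is not None:
            text = "".join(self._anchor_text).strip()
            self.anchors.append((self._current_href, text))
            self._current_href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._current_href is not None:
            self._anchor_text.append(data)


def extract_features(html):
    """Return the web-specific features of a page as a plain dictionary."""
    parser = WebFeatureParser()
    parser.feed(html)
    parser.close()
    title = parser.title.strip()
    return {
        "title": title,
        "generic_title": title.lower() in GENERIC_TITLES,
        "metadata": parser.metadata,
        "anchors": parser.anchors,
    }


# Hypothetical sample page, for illustration only.
SAMPLE = """<html><head><title>Home page</title>
<meta name="keywords" content="classification, libraries">
</head><body><a href="/ddc.html">Dewey Decimal Classification</a></body></html>"""

features = extract_features(SAMPLE)
print(features)
```

Flagging a general title such as "home page" lets a classifier down-weight it rather than treat it as topical evidence, which is one possible way of handling the heterogeneity noted above.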
In the literature, terms such as classification, categorization and clustering are used
to represent different approaches. In their broadest sense these terms could be
considered synonymous, which is probably one of the reasons why they are
used interchangeably in the literature, even within the same research communities.
For example, Hartigan (1996, p. 2) says: “The term cluster analysis is used most
commonly to describe the work in this book, but I much prefer the term classification...”
Or: “...classification or categorization is the task of assigning objects from a universe
to two or more classes or categories” (Manning and Schütze, 1999, p. 575).
In this paper the terms text categorization and document clustering are chosen because
they tend to be the prevalent terms in the literature of the corresponding communities.
The terms document classification and mixed approach are used in order to consistently
distinguish between the four approaches. Descriptions of the approaches are given
below:
(1) Text categorization. A machine-learning approach in which information
retrieval methods are also applied. It consists of three main parts:
manually categorizing a number of documents into pre-defined categories, learning the
characteristics of those documents, and categorizing new documents. In
machine-learning terminology, text categorization is known as supervised
learning, since the process is “supervised” by learning the categories’
characteristics from manually categorized documents.
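
The three parts of text categorization described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it uses a multinomial naive Bayes classifier with add-one smoothing, which is one common supervised technique and not a method prescribed by this paper, and the functions `tokenize`, `train` and `categorize` as well as the toy training data are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_docs):
    """Part 2: learn word statistics from manually categorized documents."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of training documents
    vocabulary = set()
    for text, category in labelled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def categorize(text, model):
    """Part 3: assign a new document to the most probable pre-defined category."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # skip words never seen during training
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            count = word_counts[category][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Part 1: a hypothetical, manually categorized training set.
TRAINING = [
    ("library catalogue subject headings and classification schemes", "library science"),
    ("cataloguing rules for books in the library", "library science"),
    ("neural network training with gradient descent", "machine learning"),
    ("support vector machines learn from labelled training examples", "machine learning"),
]

model = train(TRAINING)
print(categorize("subject headings in a library catalogue", model))
```

The "supervision" lies entirely in `TRAINING`: the quality of the learned word statistics, and hence of the assignments, depends on how many manually categorized documents are available for each category.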