Automated subject classification of textual web documents

Date	01 May 2006
Pages	350-371
DOI	https://doi.org/10.1108/00220410610666501
Published date	01 May 2006
Author	Koraljka Golub
Subject Matter	Information & knowledge management,Library & information science

Department of Information Technology, Lund University, Lund, Sweden

Abstract

Purpose – To provide an integrated perspective on the similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and to point to problems with the approaches and with automated classification as such.

Design/methodology/approach – A range of works dealing with automated classification of full-text web documents are discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application and characteristics of web pages.

Findings – Provides major similarities and differences between the three approaches: document pre-processing and utilization of web-specific document characteristics are common to all the approaches; the major differences are in the algorithms applied, and in whether or not the vector space model and controlled vocabularies are employed. Problems of automated classification are recognized.

Research limitations/implications – The paper does not attempt to provide an exhaustive bibliography of related resources.

Practical implications – As an integrated overview of approaches from different research communities with application examples, the paper is useful for students in library and information science and computer science, as well as for practitioners. Researchers from one community can see how similar tasks are conducted in other communities.

Originality/value – To the author’s knowledge, no review paper on automated text classification has attempted to discuss more than one community’s approach from an integrated perspective.

Keywords Automation, Classification, Internet, Document management, Controlled languages

Paper type Literature review

1. Introduction

Classification is, for the purpose of this paper, defined as:

... the multistage process of deciding on a property or characteristic of interest, distinguishing things or objects that possess that property from those which lack it, and grouping things or objects that have the property or characteristic in common into a class. Other essential aspects of classification are establishing relationships among classes and making distinctions within classes to arrive at subclasses and finer divisions (Chan, 1994, p. 259).

Automated subject classification (in further text: automated classification) denotes machine-based organization of related information objects into topically related groups. In this process human intellectual processes are replaced by, for example, statistical and computational linguistics techniques. In the literature on automated classification, the terms automatic and automated are both used. Here the term automated is chosen because it more directly implies that the process is machine-based.

Many thanks to Traugott Koch, Anders Ardö, Tatjana Aparac Jelušić, Johan Eklund, Ingo Frommholz, Repke de Vries and the Journal of Documentation reviewers for providing valuable feedback on earlier versions of the paper.

Journal of Documentation, Vol. 62 No. 3, 2006, pp. 350-371. © Emerald Group Publishing Limited, ISSN 0022-0418. DOI 10.1108/00220410610666501. Received February 2005; revised August 2005; accepted September 2005. The current issue and full text archive of this journal is available at www.emeraldinsight.com/0022-0418.htm

Automated classification has been a challenging research issue for several decades. A major motivation has been the high cost of manual classification. Interest has grown rapidly since 1997, when search engines could no longer rely on text retrieval techniques alone because the number of available documents was growing exponentially. Owing to the ever-increasing number of documents, there is a danger that the recognized objectives of bibliographic systems will be left behind; automated means could be a solution to preserve them (Svenonius, 2000, pp. 20-1, 30). Automated classification of text is used in a wide variety of applications, such as: organizing documents into subject categories for topical browsing, including grouping search results by subject; topical harvesting; personalized routing of news articles; filtering of unwanted content for internet browsers; and many others (Sebastiani, 2002; Jain et al., 1999).

The narrower focus of this paper is automated classification of textual web documents into subject categories for browsing. Web documents have specific characteristics such as hyperlinks and anchors, metadata, and structural information, all of which could serve as complementary features to improve automated classification. On the other hand, they are rather heterogeneous: many of them contain little text, the metadata provided are sparse and can be misused, structural tags can also be misused, and titles can be general (“home page”, “untitled document”). Browsing in this paper refers to seeking documents via a hierarchical structure of subject classes into which the documents have been classified. Research has shown that people find browsing useful in a number of information-seeking situations, such as when not looking for a specific item, when inexperienced in searching (Koch and Zettergren, 1999), or when unfamiliar with the subject in question and its terminology or structure (Schwartz, 2001, p. 76).
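As an illustration of the web-specific characteristics mentioned above, the following minimal sketch separates title, metadata, heading and anchor text from the remaining body text of a page, so that each could later be weighted differently as a classification feature. It is not taken from any of the reviewed systems: the class name `WebFeatureExtractor`, the field names, the choice of tags and the sample page are all assumptions made for this example, and it relies only on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to the feature field their text is filed under.
FIELD_BY_TAG = {"title": "title", "h1": "headings", "h2": "headings",
                "h3": "headings", "a": "anchors"}


class WebFeatureExtractor(HTMLParser):
    """Collect title, metadata, heading and anchor text separately from body text."""

    def __init__(self):
        super().__init__()
        self.features = {"title": [], "metadata": [], "headings": [],
                         "anchors": [], "body": []}
        self._open = []  # currently open tags that decide where text is filed

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            # Keep only the keywords and description metadata, if they have content.
            if name in ("keywords", "description") and attr.get("content"):
                self.features["metadata"].append(attr["content"])
        elif tag in FIELD_BY_TAG:
            self._open.append(tag)

    def handle_endtag(self, tag):
        # Only well-nested tags are handled; malformed markup is not repaired.
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        field = FIELD_BY_TAG[self._open[-1]] if self._open else "body"
        self.features[field].append(text)


# Invented sample page, including a deliberately general title.
page = """
<html><head>
<title>Home page</title>
<meta name="keywords" content="automated classification, web documents">
</head><body>
<h1>Subject browsing</h1>
<p>Documents are grouped by topic. See the <a href="/engineering">engineering index</a>.</p>
</body></html>
"""

extractor = WebFeatureExtractor()
extractor.feed(page)
features = extractor.features
```

The sample page also shows the heterogeneity problem noted above: its title ("Home page") says nothing about the subject, so a classifier would have to lean on the other fields.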

In the literature, terms such as classification, categorization and clustering are used to represent different approaches. In their broadest sense these terms could be considered synonymous, which is probably one of the reasons why they are used interchangeably in the literature, even within the same research communities. For example, Hartigan (1996, p. 2) says: “The term cluster analysis is used most commonly to describe the work in this book, but I much prefer the term classification...” Or: “...classification or categorization is the task of assigning objects from a universe to two or more classes or categories” (Manning and Schütze, 1999, p. 575).

In this paper the terms text categorization and document clustering are chosen because they tend to be the prevalent terms in the literature of the corresponding communities. The terms document classification and mixed approach are used in order to distinguish consistently between the four approaches. Descriptions of the approaches are given below:

(1) Text categorization. This is a machine-learning approach in which information retrieval methods are also applied. It consists of three main parts: categorizing a number of documents into pre-defined categories, learning the characteristics of those documents, and categorizing new documents. In machine-learning terminology, text categorization is known as supervised learning, since the process is “supervised” by learning the categories’ characteristics from manually categorized documents.
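The three parts of text categorization described in item (1) can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: it uses a simple multinomial Naive Bayes classifier with add-one smoothing as one possible learning method, which the text above does not prescribe, and the function names (`tokenize`, `train`, `categorize`), the two categories and the tiny training set are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Part 2: learn word statistics of each category from categorized documents."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of training documents
    vocabulary = set()
    for text, category in labeled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def categorize(text, model):
    """Part 3: assign a new document to the most probable learned category."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocabulary]  # ignore unseen words
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior of the category plus smoothed log likelihood of each word.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for token in tokens:
            count = word_counts[category][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Part 1: a (hypothetical) set of documents manually assigned to pre-defined categories.
training_set = [
    ("library catalogue classification scheme and subject headings", "library science"),
    ("cataloguing rules for library collections", "library science"),
    ("neural network training with gradient descent", "machine learning"),
    ("support vector machine and learning algorithms", "machine learning"),
]

model = train(training_set)
prediction = categorize("a new learning algorithm for neural network", model)
print(prediction)
```

A real system would need far more training documents per category, and the choice of learning algorithm and document representation would normally be evaluated rather than fixed in advance as it is here.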
