Binary k-nearest neighbor for text categorization
Songbo Tan
Software Department, Institute of Computing Technology,
Chinese Academy of Sciences, People’s Republic of China
Abstract
Purpose – With the ever-increasing volume of text data available via the internet, it is important that documents are classified into manageable and easy-to-understand categories. This paper proposes the use of binary k-nearest neighbor (BKNN) for text categorization.
Design/methodology/approach – The paper describes the traditional k-nearest neighbor (KNN) classifier, introduces BKNN and outlines experimental results.
Findings – The experimental results indicate that BKNN requires much less CPU time than KNN, without loss of classification performance.
Originality/value – The paper demonstrates that BKNN can be an efficient and effective algorithm for text categorization.
Keywords Classification, Information retrieval, Data handling
Paper type General review
Introduction
With the ever-increasing volume of text data available via the internet, it is important
that documents are classified into manageable and easy-to-understand categories. Text
categorization aims to attach predefined labels to previously unseen documents
automatically. This is an active research area in information retrieval, machine
learning and natural language processing. A number of machine learning algorithms
have been introduced to deal with text classification: k-nearest neighbor (KNN) (Yang
and Liu, 1999), centroid-based classifier (Han and Karypis, 2000), Naive Bayes (Lewis,
1998), decision trees (Lewis and Ringuette, 1994), and support vector machines (SVM)
(Joachims, 1998).
Of the existing methods, KNN is a simple classification algorithm that is very easy to
implement, since it does not require a training phase. Furthermore, experimental
research shows that KNN offers good performance in most cases (Yang and Liu, 1999;
Yang, 1999). However, KNN is inefficient because it requires a large amount of CPU
time to compute the similarity between a test document and each training document,
and then to sort these similarities. This drawback makes it unsuitable for applications
where classification efficiency is crucial, for example online text classification, in
which the classifier has to respond to large numbers of documents arriving
simultaneously as a stream.
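As an illustration of the procedure just described, the following is a minimal sketch of a traditional KNN text classifier in Python. The bag-of-words term-weight dictionaries, the cosine similarity measure and the unweighted majority vote are common choices assumed here for illustration; the paper's exact similarity function and voting scheme (e.g. the similarity-weighted voting of Yang and Liu, 1999) may differ.

```python
# Minimal sketch of a traditional KNN text classifier.
# Representation (bag-of-words dicts), cosine similarity and majority
# voting are illustrative assumptions, not the paper's exact setup.
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(test_doc, training_docs, k=5):
    """Label a test document by majority vote among its k nearest neighbors.

    training_docs: list of (term_weight_dict, label) pairs.
    The full scan plus sort below is exactly the similarity computation
    and sorting step the text identifies as the efficiency bottleneck.
    """
    sims = [(cosine_similarity(test_doc, doc), label)
            for doc, label in training_docs]
    sims.sort(key=lambda pair: pair[0], reverse=True)  # the costly sort step
    votes = Counter(label for _, label in sims[:k])
    return votes.most_common(1)[0][0]

# Toy usage with hypothetical term weights:
train = [({"ball": 1.0, "goal": 2.0}, "sports"),
         ({"match": 1.5, "goal": 1.0}, "sports"),
         ({"stock": 2.0, "market": 1.0}, "finance")]
print(knn_classify({"goal": 1.0, "market": 0.5}, train, k=3))  # -> "sports"
```

Because every test document triggers a scan over the entire training set followed by a sort, classification cost grows linearly with the number of training documents, which is the inefficiency the paper sets out to address.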
In order to improve the efficiency of KNN, some researchers resort to pruning the
training samples (Guan and Zhou, 2002; Zhang and Mani, 2003). The pruning strategy
may perform well in traditional machine learning problems (Micó et al., 1994;
Dasarathy et al., 2000), but it can damage the classification quality of KNN for text
categorization (Guan and Zhou, 2002).
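The following is a hedged sketch of the general pruning idea, not a reproduction of any of the cited algorithms: a training document is dropped when the rest of the set already classifies it correctly, on the assumption that it contributes little to the decision boundary. It reuses knn_classify from the sketch above; the function name and parameters are illustrative.

```python
# Hedged sketch of training-sample pruning (leave-one-out style), not the
# exact method of Guan and Zhou (2002) or the other cited work.
# Assumes knn_classify from the previous sketch is in scope.
def prune_training_set(training_docs, k=3):
    """Keep only documents that the rest of the set misclassifies.

    Such "boundary" documents carry the class information; interior
    documents are dropped so KNN scans a smaller set at test time.
    Practical pruning methods typically also retain seed documents
    for each class so no class disappears entirely.
    """
    kept = []
    for i, (doc, label) in enumerate(training_docs):
        others = training_docs[:i] + training_docs[i + 1:]
        if knn_classify(doc, others, k=k) != label:
            kept.append((doc, label))
    return kept
```

The speed-up comes from scanning fewer documents at test time, and it is precisely this removal of training material that the cited work finds can hurt classification quality for text.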
With text classification, the documents are
Refereed article received 13 April 2005
Accepted 12 May 2005
Online Information Review, Vol. 29 No. 4, 2005, pp. 391-399
© Emerald Group Publishing Limited, 1468-4527
DOI 10.1108/14684520510617839
