Artificial bee colony algorithm for feature selection and improved support vector machine for text classification

Date: 19 August 2019
Published: 19 August 2019
Pages: 154-170
DOI: https://doi.org/10.1108/IDD-09-2018-0045
Authors: Janani Balakumar, S. Vijayarani Mohan
Janani Balakumar and S. Vijayarani Mohan
Department of Computer Science, Bharathiar University, Coimbatore, India
Abstract
Purpose: Owing to the huge volume of documents available on the internet, text classification has become a necessary task for handling these documents. To achieve optimal text classification results, feature selection, an important stage, is used to curtail the dimensionality of text documents by choosing suitable features. The main purpose of this research work is to classify personal computer documents based on their content.
Design/methodology/approach: This paper proposes a new algorithm for feature selection based on artificial bee colony (ABCFS) to enhance text classification accuracy. The proposed algorithm (ABCFS) is evaluated on real and benchmark data sets and compared against existing feature selection approaches such as information gain and the χ² statistic. To justify the efficiency of the proposed algorithm, the support vector machine (SVM) and an improved SVM classifier are used in this paper.
Findings: The experiments were conducted on real and benchmark data sets. The real data set was collected in the form of documents stored on a personal computer, and the benchmark data sets were collected from the Reuters and 20 Newsgroups corpora. The results demonstrate the performance of the proposed feature selection algorithm by enhancing text document classification accuracy.
Originality/value: This paper proposes a new ABCFS algorithm for feature selection, evaluates its efficiency and improves the support vector machine. Here, the ABCFS algorithm is used to select features from unstructured text documents. In existing work, ABC-based feature selection has been applied only to structured data; no such algorithm exists for text. The proposed algorithm classifies documents automatically based on their content.
Keywords: Information technology, Information science, Information retrieval, Information management, Information systems, Document management, Text classification, Feature selection, Information gain, χ² statistic, Artificial bee colony, Support vector machine, Improved SVM
Paper type: Research paper
1. Introduction
Text document classification is a frequently used technique in the field of text mining and machine learning. It is the process of assigning a document to one or more predefined categories. Recently, text classification has received increasing attention from researchers in the fields of text mining, information retrieval (IR), machine learning and artificial intelligence (Sebastiani, 2002). The main goal is to learn a classifier over labelled instances so that the category-assignment process can be performed automatically using machine learning techniques (Leopold and Kindermann, 2002). The main problem of text document classification is that documents have a set of high-dimensional features that may reduce the performance of the text classification system (Apté et al., 1994).
Feature selection is the method of selecting features from a set of high-dimensional features. Choosing the best subset with the minimum number of features improves the performance of the text classification system (Aghdam et al., 2009). A feature selection algorithm picks an important set of features and removes redundant, noisy and irrelevant data; the result can then be used in the subsequent classification task (Chen et al., 2009).
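As a minimal illustration of this filter idea (not the paper's ABCFS algorithm; the function name, toy documents and the document-frequency score are all invented here for the example), one can score each term and keep only the top-ranked ones:

```python
from collections import Counter

def select_top_k_terms(docs, k):
    """Filter-style feature selection: score each term by document
    frequency and keep only the k highest-scoring terms."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    # Rank terms by how many documents contain them; ties break alphabetically.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:k]

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog barked at the mailman",
]
print(select_top_k_terms(docs, 3))
```

In practice the score would be information gain or χ² rather than raw document frequency, but the keep-the-best-scoring-terms structure is the same.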
Commonly, text can be represented in two different ways: as a bag of words or as strings. A document represented as a set of words with their associated frequencies is called a bag of words; a document that retains the sequence of words is called a string representation. From these representations, feature selection chooses the optimal number of features. This research work proposes a new feature selection algorithm based on artificial bee colony (ABCFS) for text classification. The basic artificial bee colony (ABC) technique is used to resolve the optimization problem in the
text classification task, which simulates the foraging behavior of a bee colony. This algorithm was proposed by Karaboga and Ozturk (2010) for continuous function optimization.

The current issue and full text archive of this journal is available on Emerald Insight at: www.emeraldinsight.com/2398-6247.htm
Information Discovery and Delivery, 47/3 (2019), pp. 154-170
© Emerald Publishing Limited [ISSN 2398-6247]
[DOI 10.1108/IDD-09-2018-0045]
Received 26 September 2018
Revised 8 November 2018, 10 December 2018, 20 January 2019, 11 February 2019, 18 February 2019, 27 February 2019, 11 March 2019
Accepted 21 March 2019
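To make the foraging metaphor concrete, the sketch below shows a generic, simplified ABC-style search over binary feature masks. This is an illustrative toy, not the authors' ABCFS algorithm: the function names, colony parameters and the synthetic fitness function are all assumptions introduced for the example.

```python
import random

def abc_feature_search(num_features, fitness, colony=10, limit=5, iters=50, seed=0):
    """Simplified ABC-style search over binary feature masks.
    Each "food source" is a bit vector (1 = feature selected); bees
    perturb sources and scouts replace exhausted ones."""
    rng = random.Random(seed)
    sources = [[rng.randint(0, 1) for _ in range(num_features)] for _ in range(colony)]
    trials = [0] * colony
    best = max(sources, key=fitness)

    def neighbour(mask):
        # Flip one randomly chosen bit, mimicking a local search step.
        m = mask[:]
        i = rng.randrange(num_features)
        m[i] = 1 - m[i]
        return m

    for _ in range(iters):
        for i in range(colony):              # employed + onlooker phases (merged)
            cand = neighbour(sources[i])
            if fitness(cand) > fitness(sources[i]):
                sources[i], trials[i] = cand, 0
            else:
                trials[i] += 1
            if trials[i] > limit:            # scout phase: abandon the source
                sources[i] = [rng.randint(0, 1) for _ in range(num_features)]
                trials[i] = 0
        best = max(sources + [best], key=fitness)
    return best

# Toy fitness: features 0-2 are "relevant", extra features are penalised.
def fitness(mask):
    return sum(mask[:3]) - 0.2 * sum(mask[3:])

print(abc_feature_search(8, fitness))
```

In a real text-classification setting the fitness of a mask would be the accuracy of a classifier (e.g. SVM) trained on the selected features, which is far more expensive than this toy function.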
The rest of the paper is organized as follows. Section 2 introduces the work related to ABC, support vector machines (SVM) and feature selection methods. Section 3 describes the methods of feature selection and document classification, and the proposed algorithm (ABCFS) combined with the improved SVM (ISVM) algorithm. Section 4 provides the performance measures used to validate the performance of the text classification task. Section 5 provides the observations of the experiments carried out on real and benchmark data sets, as well as a comparative analysis of the existing feature selection methods and the proposed methods. Section 6 provides the results of numerous experiments implemented to illustrate the effectiveness of the proposed algorithm. Section 7 presents the conclusion of this research work.
2. Related works
In the area of text classication and feature selection with
optimization, some of the researchersexploit the advanced and
persistent techniques for text document classication and
feature selectionfor text classication.
Kannan et al. (2006) proposed a document frequency (DF) threshold technique for the feature selection phase. In the classification phase, the k-nearest neighbor (kNN) algorithm and the support vector machine (SVM) algorithm were used. This study achieved a precision of 0.95 and showed that the kNN algorithm is suitable for Arabic text classification. The experiments were performed on Arabic newspaper articles collected from a variety of newspaper websites available online, including Al-Jazeera, Al-Nahar, Al-Hayat, Al-Ahram and Al-Dostor. Al-Harbi et al. (2008) used a decision tree as the classifier and the chi-square (χ²) statistic for feature selection. The result of this study was evaluated by computing the accuracy: the number of correctly classified documents divided by the total number of documents in the testing data set. The testing and training data were based on the Arabic Newswire and Arabic Gigaword corpora. The authors reported an average accuracy of 0.68 with SVM and 0.78 with C5.0.
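The χ² score used in such work can be computed directly from a 2×2 term/category contingency table. The sketch below is a generic illustration with invented counts, not data from any of the cited studies:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of a term/category pair from a 2x2 contingency
    table: n11 = in-category docs containing the term, n10 = out-of-category
    docs containing the term, n01/n00 the same counts without the term."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

# A term that appears in 40 of 50 in-category docs but only 10 of 50 others.
score = chi_square(40, 10, 10, 40)
print(round(score, 2))  # prints 36.0
```

A high score means the term's presence is strongly dependent on the category, so the term is a good feature; a term distributed evenly across categories scores 0.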
Subanya and Rajalaxmi (2014) investigated a new feature selection method based on ABC to recognize cardiovascular disease. They used a benchmark data set taken from the UCI repository with an SVM classifier to evaluate the proposed method. The results show an accuracy of 0.86, which proved better than that produced by feature selection methods based on reverse ranking.
Schiezaro and Pedrini (2013) presented a feature selection method based on the ABC algorithm. The results show that a reduced feature set can achieve the best classification accuracy; for some data sets, the accuracy improved significantly even as the number of features was reduced. The proposed algorithm offered better results when compared with other techniques. As future work, they suggested developing a filter approach combining the ABC algorithm, entropy and mutual information. Karaboga and Ozturk (2010) proposed an ABC algorithm and tested it on fuzzy clustering for classifying different data sets. Several benchmark data sets were collected from the UCI repository. The results show that the ABC optimization algorithm was successful, attaining a lower classification error percentage of 16.32 per cent. The proposed ABC fuzzy clustering achieved 8.09 per cent less classification error when compared with fuzzy c-means.
An accelerated ABC (A-ABC) method was proposed with two modifications, implemented on the ABC algorithm to improve its local search capability and convergence speed. The modifications were called modification rate (MR) and step size (SS). The results show that A-ABC performs well and converges faster than the standard version of the ABC algorithm. This method was compared with standard ABC on seven different benchmark functions to validate the effects of the MR value and the SS modification (Ozkis and Babalik, 2014). O'Keefe and Koprinska (2009) proposed feature selectors and feature weights with Naive Bayes and SVM classifiers. In this work, the authors used two new feature selection methods and three feature weighting methods. Sentiment analysis recognizes whether the opinion in a document is positive or negative with respect to a topic. The experimental results show that it was possible to maintain a state-of-the-art classification accuracy of 87.15 per cent while using less than 36 per cent of the features.
Ghany et al. (2015) proposed a binary algorithm for feature selection. To check the effectiveness of this method, they used several benchmark data sets and compared it with two well-known bio-inspired methods, the genetic algorithm (GA) and particle swarm optimization (PSO). The results showed that the proposed binary algorithm outperformed GA and PSO in improving classification performance and reducing the feature set; they reported a classification error between 0.024 and 0.297. Younus et al. (2015) proposed a new PSO method for feature selection to handle Arabic text summarization. The proposed method was tested and compared with five existing works. They described a precision of 0.67, which was not the best result compared with the existing methods. They recommended improving the proposed PSO and investigating new swarm intelligence techniques such as evolutionary strategies.
Chandrashekar and Sahin (2014) presented different types of feature selection methods for high-dimensional data sets and provided an outline of several feature selection techniques. The main objective of their paper is to deliver a standard introduction to variable elimination that can be applied to a wide array of machine learning problems. In particular, the authors focused on filter, wrapper and embedded methods. They also applied the feature selection techniques to standard data sets to establish their efficiency.
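The filter/wrapper distinction above is easy to illustrate: where a filter scores features independently (as in the χ² example), a wrapper repeatedly consults the classifier itself. The sketch below shows a generic greedy forward-selection wrapper with an invented toy evaluator standing in for a trained classifier's accuracy:

```python
def greedy_forward_select(features, evaluate, max_features=None):
    """Wrapper-style selection: greedily add the feature whose inclusion
    most improves the evaluator, stopping when no addition helps."""
    selected = []
    remaining = list(features)
    best_score = evaluate(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, f = max(scored)
        if score <= best_score:   # no candidate improves the evaluator
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score

# Toy evaluator: pretend features "a" and "b" are useful, "c" is noise.
gains = {"a": 0.3, "b": 0.2, "c": -0.1}
evaluate = lambda feats: 0.5 + sum(gains[f] for f in feats)
print(greedy_forward_select(["a", "b", "c"], evaluate))
```

Wrappers like this are usually more accurate than filters but far more expensive, since every candidate subset requires retraining the classifier; that cost is what motivates population-based searches such as ABC, GA and PSO.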
Liu and Yu (2005) presented feature selection concepts and algorithms. The study also reviews existing feature selection algorithms for classification and clustering, groups and compares different algorithms within a categorizing framework based on search strategies, evaluation criteria and data mining tasks, reveals unattempted combinations and provides guidelines for selecting feature selection algorithms. For the categorization task, the authors built an integrated system for intelligent feature selection, with a unifying platform proposed as an intermediate step. They used some of the real-
