A novel committee selection mechanism for combining classifiers to detect unsolicited emails
Published date | 14 November 2016 |
Pages | 524-548 |
DOI | https://doi.org/10.1108/VJIKMS-07-2015-0042 |
Author | Shrawan Kumar Trivedi, Shubhamoy Dey |
Subject Matter | Information & knowledge management, Knowledge management, Knowledge management systems |
Shrawan Kumar Trivedi
Department of Information Systems and Business Analytics,
School of Management, BML Munjal University, Gurgaon, India, and
Shubhamoy Dey
Department of Information Systems,
Indian Institute of Management Indore, Indore, India
Abstract
Purpose – The email is an important medium for sharing information rapidly. However, spam, being
a nuisance in such communication, motivates the building of a robust filtering system with high
classification accuracy and good sensitivity towards false positives. In that context, this paper aims to
present a combined classifier technique using a committee selection mechanism where the main
objective is to identify a set of classifiers so that their individual decisions can be combined by a
committee selection procedure for accurate detection of spam.
Design/methodology/approach – For training and testing of the relevant machine learning
classifiers, text mining approaches are used in this research. Three data sets (Enron, SpamAssassin and
LingSpam) have been used to test the classifiers. Initially, pre-processing is performed to extract the
features associated with the email files. In the next step, the extracted features are taken through a
dimensionality reduction method where non-informative features are removed. Subsequently, an
informative feature subset is selected using genetic feature search. Thereafter, the proposed classifiers
are tested on those informative features and the results compared with those of other classifiers.
Findings – For building the proposed combined classifier, three different studies have been
performed. The first study identifies the effect of boosting algorithms on two probabilistic classifiers:
Bayesian and Naïve Bayes. In that study, AdaBoost has been found to be the best algorithm for
performance boosting. The second study was on the effect of different kernel functions on the support
vector machine (SVM) classifier, where SVM with the normalized polynomial (NP) kernel was observed to
be the best. The last study was on combining classifiers with committee selection, where the committee
members were the best classifiers identified by the first study, i.e. Bayesian and Naïve Bayes with
AdaBoost, and the committee president was selected from the second study, i.e. SVM with NP kernel.
Results show that combining the identified classifiers to form a committee machine gives excellent
performance accuracy with a low false positive rate.
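As a rough illustration of the member/president arrangement described in the Findings, the sketch below combines the decisions of several member classifiers and defers to a president when the members disagree. This excerpt does not state how the committee actually combines decisions, so the tie-break rule, the function name `committee_decision` and the three keyword-based stand-in classifiers are all assumptions made for illustration, not the authors' procedure.

```python
from typing import Callable, Sequence

# A classifier maps an email body to 1 (spam) or 0 (legitimate).
Classifier = Callable[[str], int]

def committee_decision(members: Sequence[Classifier],
                       president: Classifier,
                       email: str) -> int:
    """Hypothetical rule: accept a unanimous member vote, else ask the president."""
    votes = [member(email) for member in members]
    if votes and all(vote == votes[0] for vote in votes):
        return votes[0]
    return president(email)

# Stand-in classifiers (assumptions). In the paper's setting the members would be
# boosted Bayesian and Naive Bayes models and the president an SVM with NP kernel.
def member_a(email: str) -> int:
    return int("free" in email.lower())

def member_b(email: str) -> int:
    return int("winner" in email.lower())

def president(email: str) -> int:
    return int("click here" in email.lower())

unanimous = committee_decision([member_a, member_b], president,
                               "You are a winner of a free prize")
split_vote = committee_decision([member_a, member_b], president,
                                "Free agenda for Monday's meeting")
```

In the second call the members disagree, so the stand-in president decides and the message is treated as legitimate.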
Research limitations/implications – This research is focused on the classification of email spam
written in the English language. Only the body (text) parts of the emails have been used. Image spam has not
been included in this work. We have restricted our work to email messages only. None of the other
types of messages, such as short message service or multi-media messaging service, were a part of this
study.
Practical implications – This research proposes a method of dealing with the issues and challenges
faced by internet service providers and organizations that use email. The proposed model provides not
only better classification accuracy but also a low false positive rate.
Received 1 August 2015
Revised 10 February 2016
Accepted 13 July 2016
VINE Journal of Information and
Knowledge Management Systems
Vol. 46 No. 4, 2016
pp. 524-548
© Emerald Group Publishing Limited
2059-5891
Originality/value – The proposed combined classifier is a novel classifier designed for accurate
classification of email spam.
Keywords SVM, Bayesian, Probabilistic classifiers, Naïve Bayes, Function-based classifiers,
Kernel functions, Combining classifiers, Stacking, Committee machine
Paper type Technical paper
1. Introduction
In the present automated world, sharing information is important to be competitive and
sustainable in business. Email is a rapid and inexpensive medium of communication. It
is a popular medium of interaction between people and has become a part of life itself
(Whittaker and Moody, 2005). However, spam (unsolicited bulk email) has become a
nuisance in such communication. Research interest in spam classification has grown
recently as the volume of spam increases day by day. A study observes that
66 per cent of all business emails are spam (Kaspersky Spam Statistics, 2014). This rapid
growth leads to serious problems such as unnecessarily filling users’ mailboxes,
engulfing important emails, consuming storage space and bandwidth and requiring
time to sort them.
Legal and other simplistic methods like blacklisting, keyword-based filtering, etc.,
have shown limited effect in countering such problems. However, content-based
filtering using machine learning methods is reported to be promising in the literature.
Nowadays, spam classification is becoming a challenging area due to the complex
nature of spam. Complexity is defined here as modification of content, such as
tokenization (modifying words, e.g. “free” written as fr33) and obfuscation
(hiding features by adding HTML or other codes, e.g. “free” coded as
frexe or FR3E) (Heydari et al., 2015; Goodman et al., 2007), which changes the
information carried by features so as to create barriers in distinguishing spam from legitimate
emails. Many machine learning classifiers have been tested in an attempt to tackle these
problems. Some of them, such as probabilistic classifiers [Bayesian (Koller and Sahami,
1997a; Jatana and Sharma, 2014) and Naïve Bayes (NB) (Farid et al., 2014; Lewis and
Gale, 1994)] and support vector machine (SVM) (Drucker et al., 1999), have been found to
be good performers in the literature. Significantly good accuracy has been reported even in
the presence of the complexity discussed above. The Bayesian technique is well known,
as it has the interesting concept of finding the informative features/words with the help
of deviation from the mean.
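To make the tokenization problem above concrete, the following minimal sketch shows an exact keyword match missing "fr33" and a simple character-substitution step recovering it. The substitution table `LEET_MAP` and both function names are hypothetical and are not taken from the paper; real spam uses far more varied modifications, which is one reason learned, content-based classifiers are studied instead of fixed rules.

```python
# Hypothetical character substitutions often used to disguise words.
LEET_MAP = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})

def keyword_filter(text, keywords=("free",)):
    """Flag a message if any whitespace-separated token exactly matches a keyword."""
    tokens = (token.strip(".,!?") for token in text.lower().split())
    return any(token in keywords for token in tokens)

def normalized_keyword_filter(text, keywords=("free",)):
    """Undo the assumed substitutions first, then apply the same keyword match."""
    return keyword_filter(text.lower().translate(LEET_MAP), keywords)

plain_hit = keyword_filter("Get your FREE gift")
disguised_hit = keyword_filter("Get your fr33 gift")
recovered_hit = normalized_keyword_filter("Get your fr33 gift")
```

Here `plain_hit` is true, `disguised_hit` is false because the token no longer matches, and `recovered_hit` is true once the substitutions are reversed.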
The first part of this research is on probabilistic classifiers (Bayesian and NB) and the
concept of boosting [Bagging, Boosting (with re-sampling) and AdaBoost] to improve
data sampling for better learning during training. Boosting methods use a voting
mechanism where a single classifier is formulated as the linear combination of many
weak classifiers.
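The idea of a single classifier formed as a weighted linear combination of weak classifiers can be sketched with a small AdaBoost implementation over one-feature threshold rules (decision stumps). This is a generic textbook-style sketch, not the paper's experimental setup: the toy feature vectors (assumed counts of two words per email), the labels (+1 spam, -1 legitimate) and all function names are assumptions for illustration.

```python
import math

def stump_predict(x, feature, threshold, polarity):
    """Weak classifier: a one-feature threshold test returning +1 or -1."""
    return polarity if x[feature] >= threshold else -polarity

def train_adaboost(X, y, rounds=3):
    """Return a list of (alpha, feature, threshold, polarity) weighted stumps."""
    n = len(X)
    weights = [1.0 / n] * n          # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for feature in range(len(X[0])):
            for threshold in sorted({x[feature] for x in X}):
                for polarity in (1, -1):
                    error = sum(w for x, label, w in zip(X, y, weights)
                                if stump_predict(x, feature, threshold, polarity) != label)
                    if best is None or error < best[0]:
                        best = (error, feature, threshold, polarity)
        error, feature, threshold, polarity = best
        error = min(max(error, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # vote weight of this stump
        ensemble.append((alpha, feature, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * label * stump_predict(x, feature, threshold, polarity))
                   for x, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted linear combination of the weak classifiers' votes."""
    score = sum(alpha * stump_predict(x, f, t, p) for alpha, f, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (assumption): [count of "free", count of "meeting"] per email.
X = [[3, 0], [2, 0], [4, 1], [0, 2], [0, 1], [1, 3]]
y = [1, 1, 1, -1, -1, -1]
model = train_adaboost(X, y)
```

The toy data is deliberately easy (a single stump already separates it); on harder data the reweighting step is what forces later stumps to focus on the examples earlier ones got wrong.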
For SVM classifiers, a good choice of kernel function is important and at the same
time difficult. A good kernel function provides efficient learning to the SVM during
training. In the second part of our study, a number of different kernel functions have
been compared.
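For readers unfamiliar with kernel functions, the sketch below writes out three of them in plain Python. The normalized polynomial form used here, K(x, y) / sqrt(K(x, x) K(y, y)) with K a polynomial kernel, is one common definition and is assumed for illustration; this excerpt does not give the paper's exact formula, its parameter values or the full list of kernels compared.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree."""
    return (dot(x, y) + coef0) ** degree

def normalized_poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel rescaled so that every vector has self-similarity 1."""
    return poly_kernel(x, y, degree, coef0) / math.sqrt(
        poly_kernel(x, x, degree, coef0) * poly_kernel(y, y, degree, coef0))

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

# Two toy feature vectors (assumption), e.g. term counts for two emails.
u = [1.0, 2.0]
v = [3.0, 0.0]
similarity = normalized_poly_kernel(u, v)
```

Normalization keeps kernel values on a comparable scale regardless of vector length, which matters for text data where documents differ widely in the number of terms.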
In the final part of this research, a method for combining classifiers with committee
selection has been proposed where the individual classification decisions of the good
classifiers are combined. The committee members are chosen from the best performers