A novel committee selection mechanism for combining classifiers to detect unsolicited emails

Published date: 14 November 2016
Pages: 524-548
DOI: https://doi.org/10.1108/VJIKMS-07-2015-0042
Author: Shrawan Kumar Trivedi, Shubhamoy Dey
Subject Matter: Information & knowledge management, Knowledge management, Knowledge management systems
Shrawan Kumar Trivedi
Department of Information Systems and Business Analytics,
School of Management, BML Munjal University, Gurgaon, India, and
Shubhamoy Dey
Department of Information Systems,
Indian Institute of Management Indore, Indore, India
Abstract
Purpose – Email is an important medium for sharing information rapidly. However, spam, being
a nuisance in such communication, motivates the building of a robust filtering system with high
classification accuracy and good sensitivity towards false positives. In that context, this paper aims to
present a combined classifier technique using a committee selection mechanism where the main
objective is to identify a set of classifiers so that their individual decisions can be combined by a
committee selection procedure for accurate detection of spam.
Design/methodology/approach – For training and testing of the relevant machine learning
classifiers, text mining approaches are used in this research. Three data sets (Enron, SpamAssassin and
LingSpam) have been used to test the classifiers. Initially, pre-processing is performed to extract the
features associated with the email files. In the next step, the extracted features are taken through a
dimensionality reduction method where non-informative features are removed. Subsequently, an
informative feature subset is selected using genetic feature search. Thereafter, the proposed classifiers
are tested on those informative features and the results compared with those of other classifiers.
Findings – For building the proposed combined classifier, three different studies have been
performed. The first study identifies the effect of boosting algorithms on two probabilistic classifiers:
Bayesian and Naïve Bayes. In that study, AdaBoost has been found to be the best algorithm for
performance boosting. The second study was on the effect of different kernel functions on the support
vector machine (SVM) classifier, where SVM with the normalized polynomial (NP) kernel was observed to
be the best. The last study was on combining classifiers with committee selection, where the committee
members were the best classifiers identified by the first study, i.e. Bayesian and Naïve Bayes with
AdaBoost, and the committee president was selected from the second study, i.e. SVM with NP kernel.
Results show that combining the identified classifiers to form a committee machine gives excellent
performance accuracy with a low false positive rate.
Research limitations/implications – This research is focused on the classification of email spam
written in the English language. Only the body (text) parts of the emails have been used. Image spam has not
been included in this work. We have restricted our work to email messages only. None of the other
types of messages, like short message service or multi-media messaging service, were a part of this
study.
Practical implications – This research proposes a method of dealing with the issues and challenges
faced by internet service providers and organizations that use email. The proposed model provides not
only better classification accuracy but also a low false positive rate.
Received 1 August 2015
Revised 10 February 2016
Accepted 13 July 2016
VINE Journal of Information and
Knowledge Management Systems
Vol. 46 No. 4, 2016
pp. 524-548
© Emerald Group Publishing Limited
2059-5891
DOI 10.1108/VJIKMS-07-2015-0042
Originality/value – The proposed combined classifier is a novel classifier designed for accurate
classification of email spam.
Keywords SVM, Bayesian, Probabilistic classifiers, Naïve Bayes, Function-based classifiers,
Kernel functions, Combining classifiers, Stacking, Committee machine
Paper type Technical paper
1. Introduction
In the present automated world, sharing information is important to be competitive and
sustainable in business. Email is a rapid and inexpensive medium of communication. It
is a popular medium of interaction between people and has become a part of life itself
(Whittaker and Moody, 2005). However, spam (unsolicited bulk email) has become a
nuisance in such communication. Researchers' interest in the spam classification domain
has grown recently, as the volume of spam is increasing day by day. A study observes that
66 per cent of all business emails are spam (Kaspersky Spam Statistics, 2014). This rapid
growth leads to serious problems such as unnecessary filling of users’ mailboxes,
engulfing of important emails, consuming storage space and bandwidth and requiring
time to sort them.
Legal and other simplistic methods like blacklisting, keyword-based filtering, etc.,
have shown limited effect in countering such problems. However, content-based
filtering using machine learning methods is reported to be promising in the literature.
Nowadays, spam classification is becoming a challenging area due to the complex
nature of the spam. Complexity is defined as the modifications of content, such as
tokenization (modifying words, such as “free” being written as fr33) and obfuscation
(which hides features by adding HTML or some other codes, such as “free” coded as
frexe or FR3E) (Heydari et al., 2015; Goodman et al., 2007), etc., to change the
information of features so as to create barriers in distinguishing spam from legitimate
emails. Many machine learning classifiers have been tested in an attempt to tackle these
problems. Some of them, such as probabilistic classifiers [Bayesian (Koller and Sahami,
1997a; Jatana and Sharma, 2014) and Naïve Bayes (NB) (Farid et al., 2014; Lewis and
Gale, 1994)] and support vector machine (SVM) (Drucker et al., 1999), have been found to
be good performers in the literature. Significantly good accuracy has been reported even in
the presence of the complexity discussed above. The Bayesian technique is well known,
as it has the interesting concept of finding the informative features/words with the help
of deviation from the mean.
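
To make the tokenization and obfuscation problem concrete, the following minimal Python sketch shows how a rewritten word such as fr33 becomes a different feature from "free" in a bag-of-words representation, and how a simple look-alike mapping could undo it. The `LOOKALIKES` table and the helper names are hypothetical illustrations, not part of the method described in this paper.

```python
import re

# Hypothetical look-alike table (an assumption for illustration only);
# real spam uses many more substitutions than these.
LOOKALIKES = str.maketrans(
    {"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)

def tokenize(text):
    """Lower-case the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9@$]+", text.lower())

def normalize(token):
    """Map look-alike characters back to letters."""
    return token.translate(LOOKALIKES)

message = "Get FR33 gifts, totally fr33!"
raw_tokens = tokenize(message)
clean_tokens = [normalize(t) for t in raw_tokens]

# Without normalization, the feature "free" never appears in the message.
print("free" in raw_tokens)
# After normalization, both obfuscated occurrences map to "free".
print(clean_tokens.count("free"))
```

A mapping this naive also garbles genuine numbers (a year or a price, for instance), which is one reason learned, content-based classifiers are generally preferred over hand-written rules.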
The rst part of this research is on probabilistic classiers (Bayesian and NB) and the
concept of boosting [Bagging, Boosting (with re-sampling) and AdaBoost] to improve
data sampling for better learning during training. Boosting methods use voting
mechanism where a single classier is formulated as the linear combination of many
weak classiers.
For SVM classiers, a good choice of kernel function is important and at the same
time difcult. A good kernel function provides efcient learning to the SVM during
training. In the second part of our study, a number of different kernel functions have
been compared.
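
As an illustration of what a kernel function computes, the sketch below implements a polynomial kernel and a normalized variant in plain Python. The exact form of the normalized polynomial kernel used in the study is not given in this excerpt, so the definition here, K(x, y) / sqrt(K(x, x) K(y, y)) with K(x, y) = (x·y + 1)^d, is an assumption based on one common formulation.

```python
import math

def poly_kernel(x, y, degree=2):
    """Polynomial kernel (x . y + 1) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1.0) ** degree

def normalized_poly_kernel(x, y, degree=2):
    """Polynomial kernel rescaled so that every vector has unit self-similarity
    (assumed definition; see the lead-in)."""
    return poly_kernel(x, y, degree) / math.sqrt(
        poly_kernel(x, x, degree) * poly_kernel(y, y, degree)
    )

# Two hypothetical term-count vectors for two emails.
a = [1, 0, 2]
b = [0, 1, 1]

print(normalized_poly_kernel(a, a))  # self-similarity
print(normalized_poly_kernel(a, b))  # similarity between the two emails
```

Normalization keeps kernel values in a bounded range regardless of vector length, which may help when emails differ widely in size.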
In the nal part of this research, a method for combining classiers with committee
selection has been proposed where the individual classication decisions of the good
classiers are combined. The committee members are chosen from the best performers