Spam classification: a comparative analysis of different boosted decision tree approaches

Pages 298-320
DOI https://doi.org/10.1108/JSIT-11-2017-0105
Published date 13 August 2018
Author Shrawan Kumar Trivedi, Prabin Kumar Panigrahi
Subject Matter Information & knowledge management, Information systems, Information & communications technology
Shrawan Kumar Trivedi
Indian Institute of Management Sirmaur, Sirmaur, India, and
Prabin Kumar Panigrahi
Indian Institute of Management Indore, Indore, India
Abstract
Purpose: Email spam classification is now becoming a challenging area in the domain of text
classification. Precise and robust classifiers are not only judged by classification accuracy but also by
sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails)
towards the accurate classification, captured by both false positive and false negative rates. This paper aims
to present a comparative study between various decision tree classifiers (such as AD tree, decision stump and
REP tree) with/without different boosting algorithms (bagging, boosting with re-sample and AdaBoost).
Design/methodology/approach: Artificial intelligence and text mining approaches have been
incorporated in this study. Each decision tree classifier in this study is tested on informative words/features
selected from the two publicly available data sets (SpamAssassin and LingSpam) using a greedy step-wise
feature search method.
Findings: Outcomes of this study show that without boosting, the REP tree provides high performance
accuracy with the AD tree ranking as the second-best performer. Decision stump is found to be the
under-performing classifier of this study. However, with boosting, the combination of REP tree and AdaBoost
compares favourably with other classification models. If the metrics false positive rate and performance
accuracy are taken together, AD tree and REP tree with AdaBoost were both found to carry out an effective
classification task. Greedy stepwise has proven its worth in this study by selecting a subset of valuable
features to identify the correct class of emails.
Research limitations/implications: This research is focussed on the classification of email spam
written in the English language only. The proposed models work with the content (words/
features) of email data that is mostly found in the body of the mail. Image spam has not been included in this
study. Other messages such as short message service or multi-media messaging service were not included in
this study.
Practical implications: In this research, a boosted decision tree approach has been proposed and used
to classify email spam and ham files; this is found to be a highly effective approach in comparison with other
state-of-the-art models used in other studies. This classifier may be tested for different applications and may
provide new insights for developers and researchers.
Originality/value: A comparison of decision tree classifiers with/without ensemble has been presented
for spam classification.
Keywords AdaBoost, Bagging, Boosting, Decision tree classifiers, Greedy stepwise feature search
Paper type Research paper
Received 2 November 2017
Revised 9 March 2018, 15 March 2018, 21 March 2018, 24 May 2018
Accepted 26 July 2018
Journal of Systems and Information Technology, Vol. 20 No. 3, 2018, pp. 298-320
© Emerald Publishing Limited, 1328-7265
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/1328-7265.htm
1. Introduction
The business environment in the modern-day world is highly competitive with little scope
for complacency. In the context of an open system of communications, the exchange and
sharing of information has become indispensable for organisations. A seamless exchange of
information amongst organisations leads to better understanding and value creation. At the
same time, all forms of communication should be cost-effective and affordable. The cheapest
method of exchanging electronically stored messages between people is electronic mail or
email. With email messages, it is possible to send text files as well as non-text files such as
audio and image files. Emails can be exchanged over the internet as well as over public and
private networks.
One of the problems with email usage is unsolicited messages sent over the internet; this
is known as spam. Typically, entities use spam to send messages to a large number of email
users for the purposes of advertising, phishing or accessing information. By sponsoring
advertisements through email spam, companies can earn large sums of money. As a consequence,
email users face different types of problems when they receive unsolicited messages that can
be harmful and unwanted.
According to recent research, the proportion of spam on the
internet, especially in the form of emails, has risen to a staggering 70 per cent of the entire
worldwide flow of emails (Aladin Knowledge System). Spam occupies a major portion
of the available mailbox space, resulting in significant time being wasted to remove it
(Lai, 2007). The thin demarcation between a legitimate and an unwanted email creates the
necessity for a precise and robust spam filter.
The expertise of spammers makes the issue more challenging. Spammers use various
methods to send spam that are a challenge to anti-spam systems. One of the methods used
by spammers is tokenization, where the structure or the content of the email is fragmented
into multiple parts or divisions, such as "three" written as "thr33", or adding an HTML link to
the email. The email itself is very different from any legitimate information sent to the user
(Trivedi and Dey, 2013a).
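The "thr33" example can be made concrete with a minimal Python sketch of how a filter might undo such character substitutions before extracting word features. This is an illustration only, not a method described in this paper: the substitution table `LEET_MAP` and the helper names `normalize_token` and `tokenize` are hypothetical, and a real filter would need a far richer set of rules.

```python
import re

# Hypothetical substitution table (an assumption for illustration):
# digits and symbols that are commonly swapped in for letters.
LEET_MAP = str.maketrans({
    "0": "o", "1": "l", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize_token(token):
    """Lower-case a token and undo simple character substitutions."""
    lowered = token.lower()
    # Only rewrite tokens that contain at least one letter, so that
    # genuine numbers such as "2018" are left untouched.
    if any(ch.isalpha() for ch in lowered):
        return lowered.translate(LEET_MAP)
    return lowered


def tokenize(text):
    """Split text into word-like tokens and normalise each one."""
    return [normalize_token(t) for t in re.findall(r"[A-Za-z0-9@$]+", text)]
```

For instance, `tokenize("Win thr33 prizes")` maps the obfuscated token back to "three", so it can be matched against the same feature as the plain spelling.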
In this paper, a comparative study of different decision tree classifiers that classify emails
into spam (unsolicited) or ham (legitimate) has been carried out. Using two data sets available
in the public domain, all the decision tree classifiers are tested and compared with/without
a boosting algorithm. After comparison, a precise and robust decision tree model with boosting
has been successfully formed for email spam classification applications.
In current literature, decision tree classifiers have a prominent place. They are sometimes
found to be more user-friendly than other popular classifiers such as neural networks and
support vector machines due to their capability to handle the presentation of data in an efficient
way. A number of approaches to develop a decision tree have been successfully proposed
(Quinlan, 1993). Another important feature of decision trees is their flexibility in handling
real-valued and categorical attributes as well as items with missing attributes. In addition,
decision trees have the capacity to classify problems with more than two classes efficiently and can
be modified to deal with regression problems (Kingsford and Salzberg, 2008).
This study recommends a decision tree classifier in association with boosting
algorithms; the combination improves the precision and efficiency of the classifier.
The boosted classifier uses the boosting technique and a voting mechanism to elevate performance.
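To illustrate how boosting and weighted voting can lift a weak tree learner, the following is a minimal sketch of AdaBoost over decision stumps in plain Python. It is not the authors' implementation or experimental setup: the toy one-feature data set, the labels (+1 standing for spam, -1 for ham) and all function names are assumptions made for illustration.

```python
import math


def train_stump(X, y, w):
    """Return (error, feature, threshold, sign) of the stump with the
    lowest weighted error; it predicts `sign` when feature > threshold."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] > thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] > thr else -sign


def adaboost(X, y, rounds=3):
    """Train up to `rounds` stumps, re-weighting misclassified examples."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)         # avoid division by zero
        if err >= 0.5:                     # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)   # this stump's vote weight
        ensemble.append((alpha, stump))
        # Increase the weight of mistakes, decrease the weight of hits.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all stumps: +1 (spam) or -1 (ham)."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data (an assumption): one numeric feature, labels not separable
# by any single threshold.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single = adaboost(X, y, rounds=1)
boosted = adaboost(X, y, rounds=3)
```

On this toy data a single stump misclassifies three of the ten points, whereas the weighted vote of three boosted stumps fits all ten training points. This is the effect the voting mechanism is meant to achieve, although training accuracy on a toy set says nothing about performance on real email data.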
The rest of the paper is presented as follows. Section 2 discusses the related work. Section
3 covers the machine learning classifier. Section 4 discusses the experimental design. Section 5
discusses results and findings, followed by a conclusion in Section 6.
2. Related work
Spam classification is a challenging area because of the smart activities of spammers; they
frequently alter the spam words to introduce different forms of attack. A number of machine
learning based classifiers have been highlighted in the literature to counter the above
attacks. This section shows some existing work related to this research.