Spam classification: a comparative analysis of different boosted decision tree approaches
Pages | 298-320 |
DOI | https://doi.org/10.1108/JSIT-11-2017-0105 |
Published date | 13 August 2018 |
Author | Shrawan Kumar Trivedi, Prabin Kumar Panigrahi |
Subject Matter | Information & knowledge management, Information systems, Information & communications technology |
Shrawan Kumar Trivedi
Indian Institute of Management Sirmaur, Sirmaur, India, and
Prabin Kumar Panigrahi
Indian Institute of Management Indore, Indore, India
Abstract
Purpose – Email spam classification is becoming a challenging area in the domain of text
classification. Precise and robust classifiers are judged not only by classification accuracy but also by
sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails),
which are captured by the false positive and false negative rates. This paper aims
to present a comparative study of various decision tree classifiers (such as AD tree, decision stump and
REP tree) with and without different boosting algorithms (bagging, boosting with re-sample and AdaBoost).
Design/methodology/approach – Artificial intelligence and text mining approaches have been
incorporated in this study. Each decision tree classifier in this study is tested on informative words/features
selected from two publicly available data sets (SpamAssassin and LingSpam) using a greedy stepwise
feature search method.
Findings – Outcomes of this study show that, without boosting, the REP tree provides high performance
accuracy, with the AD tree ranking as the second-best performer. Decision stump is found to be the
under-performing classifier of this study. However, with boosting, the combination of REP tree and AdaBoost
compares favourably with other classification models. If the metrics false positive rate and performance
accuracy are taken together, AD tree and REP tree with AdaBoost were both found to carry out an effective
classification task. The greedy stepwise search has proven its worth in this study by selecting a subset of valuable
features to identify the correct class of emails.
Research limitations/implications – This research is focussed on the classification of spam emails
that are written in the English language only. The proposed models work with the content (words/
features) of email data that is mostly found in the body of the mail. Image spam has not been included in this
study. Other messages, such as short message service or multi-media messaging service, were not included in
this study.
Practical implications – In this research, a boosted decision tree approach has been proposed and used
to classify email spam and ham files; this is found to be a highly effective approach in comparison with other
state-of-the-art models used in other studies. This classifier may be tested for different applications and may
provide new insights for developers and researchers.
Originality/value – A comparison of decision tree classifiers with and without ensemble methods has been presented
for spam classification.
Keywords Adaboost, Bagging, Boosting, Decision tree classifiers, Greedy stepwise feature search
Paper type Research paper
Received 2 November 2017
Revised 9 March 2018
15 March 2018
21 March 2018
24 May 2018
Accepted 26 July 2018
Journal of Systems and Information Technology
Vol. 20 No. 3, 2018
pp. 298-320
© Emerald Publishing Limited
1328-7265
DOI 10.1108/JSIT-11-2017-0105
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/1328-7265.htm
1. Introduction
The business environment in the modern-day world is highly competitive with little scope
for complacency. In the context of an open system of communications, the exchange and
sharing of information has become indispensable for organisations. A seamless exchange of
information amongst organisations leads to better understanding and value creation. At the
same time, all forms of communication should be cost-effective and affordable. The cheapest
method of exchanging electronically stored messages between people is electronic mail or
“email”. With email messages, it is possible to send text files as well as non-text files such as
audio and image files. Emails can be exchanged over the internet as well as over public and
private networks.
One of the problems with email usage is unsolicited messages sent over the internet; this
is known as spam. Typically, entities use spam to send messages to a large number of email
users for the purposes of advertising, phishing or accessing information. By sponsoring
advertisements through email spam, companies can earn substantial revenue. As a consequence,
email users face different types of problems when they receive unsolicited messages that can
be harmful and unwanted.
According to recent research, spam, especially in the form of emails, has risen to a
staggering 70 per cent of the entire worldwide flow of emails (Aladin Knowledge System).
Spam occupies a major portion of the available mailbox space, and significant time is
wasted in removing it (Lai, 2007). The thin demarcation between a legitimate and an
unwanted email creates the necessity for a precise and robust spam filter.
The expertise of spammers makes the issue more challenging. Spammers use various
methods to send spam that are a challenge to anti-spam systems. One of the methods used
by spammers is “tokenization”, where the structure or the content of the email is fragmented
into multiple parts or divisions, such as “three” written as “thr33”, or an HTML link is added to
the email. The email itself is very different from any legitimate information sent to the user
(Trivedi and Dey, 2013a).
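The “thr33” example can be made concrete with a small pre-processing sketch. This is only an illustration and not the authors' method: the substitution table `LEET_MAP` and the helper names `normalize_token` and `tokenize` are hypothetical, and a real filter would need a much richer set of rules.

```python
import re
from typing import List

# Hypothetical substitution table: digits that spammers commonly swap in for letters.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"})


def normalize_token(token: str) -> str:
    """Lower-case a token and undo digit-for-letter substitutions."""
    lowered = token.lower()
    # Leave pure numbers (for example a year) untouched.
    if lowered.isdigit():
        return lowered
    return lowered.translate(LEET_MAP)


def tokenize(text: str) -> List[str]:
    """Split text into alphanumeric tokens and normalize each one."""
    return [normalize_token(t) for t in re.findall(r"[A-Za-z0-9]+", text)]


print(tokenize("Get thr33 fr33 g1fts in 2018"))
```

After normalization, an obfuscated token maps back to the same feature as its plain spelling, so a word-based classifier sees one feature instead of many rare variants.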
In this paper, a comparative study of different decision tree classifiers that classify emails
into spam (unsolicited) or ham (legitimate) has been carried out. Using two data sets available
in the public domain, all the decision tree classifiers are tested and compared with and without
“boosting algorithms”. After comparison, a precise and robust decision tree model with boosting
has been identified for email spam classification applications.
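The comparison rests on the measures named in the abstract. The following sketch shows how they could be computed from true and predicted labels, using the abstract's convention that sensitivity refers to legitimate (ham) emails and specificity to unsolicited (spam) emails. The function name `evaluate` and the toy labels are assumptions made for illustration; they are not taken from the study.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, sensitivity and specificity for "ham" / "spam" labels.

    Sensitivity is the share of ham emails classified as ham and specificity
    is the share of spam emails classified as spam (the abstract's convention).
    """
    pairs = list(zip(y_true, y_pred))
    ham_ok = sum(1 for t, p in pairs if t == "ham" and p == "ham")
    spam_ok = sum(1 for t, p in pairs if t == "spam" and p == "spam")
    n_ham = sum(1 for t in y_true if t == "ham")
    n_spam = sum(1 for t in y_true if t == "spam")
    return {
        "accuracy": (ham_ok + spam_ok) / len(pairs) if pairs else 0.0,
        "sensitivity": ham_ok / n_ham if n_ham else 0.0,
        "specificity": spam_ok / n_spam if n_spam else 0.0,
    }


# Made-up labels for illustration only.
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]
y_pred = ["ham", "ham", "ham", "spam", "spam", "spam", "ham", "ham"]
print(evaluate(y_true, y_pred))
```

The two error rates are the complements of sensitivity and specificity; which of them is called the false positive rate depends on which class is treated as positive.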
In the current literature, decision tree classifiers have a prominent place. They are
sometimes found to be more user-friendly than other popular classifiers, such as neural networks and
support vector machines, because of their capability to handle the presentation of data in an efficient
way. A number of approaches to develop a decision tree have been successfully proposed
(Quinlan, 1993). Another important feature of decision trees is their flexibility in handling
real-valued and categorical attributes as well as items with missing attributes. In addition,
decision trees have the capacity to classify problems with more than two classes efficiently and can
be modified to deal with regression problems (Kingsford and Salzberg, 2008).
This study recommends a decision tree classifier in association with boosting
algorithms; the combination improves the precision and efficiency of the classifier.
The resulting classifier uses the boosting technique and a voting mechanism to elevate performance.
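How boosting and a voting mechanism lift a weak tree can be sketched with AdaBoost over decision stumps. This is a minimal pure-Python illustration under stated assumptions, not the implementation used in the study: the data are a made-up toy set of three binary word-presence features, an email is labelled spam (+1) when at least two of the three words occur and ham (-1) otherwise, and the helper names are hypothetical.

```python
import itertools
import math


def stump_predict(stump, x):
    """A stump tests one feature: returns polarity if it is present, else -polarity."""
    feature, polarity = stump
    return polarity if x[feature] == 1 else -polarity


def best_stump(X, y, weights):
    """Pick the (feature, polarity) pair with the lowest weighted error."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for polarity in (1, -1):
            stump = (feature, polarity)
            err = sum(w for xi, yi, w in zip(X, y, weights)
                      if stump_predict(stump, xi) != yi)
            if err < best_err:
                best, best_err = stump, err
    return best, best_err


def adaboost(X, y, rounds=3):
    """Train a list of (alpha, stump) pairs with the AdaBoost reweighting rule."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, weights)
        if err >= 0.5:  # no stump is better than chance on the current weights
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified emails, decrease the rest.
        weights = [w * math.exp(-alpha * yi * stump_predict(stump, xi))
                   for xi, yi, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps: +1 means spam, -1 means ham."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data (made up): every combination of three word-presence features.
X = list(itertools.product([0, 1], repeat=3))
y = [1 if sum(x) >= 2 else -1 for x in X]

ensemble = adaboost(X, y, rounds=3)
errors = sum(predict(ensemble, xi) != yi for xi, yi in zip(X, y))
print("training errors with 3 boosted stumps:", errors)
```

On this toy set a single stump misclassifies two of the eight emails, while the weighted vote of three stumps reproduces the majority rule exactly, which is the effect the voting mechanism is meant to deliver.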
The rest of the paper is organised as follows. Section 2 discusses the related work. Section
3 covers the machine learning classifiers. Section 4 discusses the experimental design. Section 5
discusses the results and findings, followed by a conclusion in Section 6.
2. Related work
Spam classification is a challenging area because of the smart activities of spammers; they
frequently alter spam words to introduce different forms of attack. A number of machine
learning-based classifiers have been highlighted in the literature to counter these
attacks. This section reviews some existing work related to this research.