Spam classification: a comparative analysis of different boosted decision tree approaches

Pages	298-320
DOI	https://doi.org/10.1108/JSIT-11-2017-0105
Published date	13 August 2018
Author	Shrawan Kumar Trivedi, Prabin Kumar Panigrahi
Subject Matter	Information & knowledge management, Information systems, Information & communications technology

Spam classification: a comparative analysis of different boosted decision tree approaches

Shrawan Kumar Trivedi

Indian Institute of Management Sirmaur, Sirmaur, India, and

Prabin Kumar Panigrahi

Indian Institute of Management Indore, Indore, India

Abstract

Purpose – Email spam classification is becoming a challenging area in the domain of text classification. Precise and robust classifiers are judged not only by classification accuracy but also by sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails), which are captured by the false positive and false negative rates. This paper aims to present a comparative study of various decision tree classifiers (such as AD tree, decision stump and REP tree) with and without different boosting algorithms (bagging, boosting with re-sample and AdaBoost).

Design/methodology/approach – Artificial intelligence and text mining approaches have been incorporated in this study. Each decision tree classifier in this study is tested on informative words/features selected from two publicly available data sets (SpamAssassin and LingSpam) using a greedy step-wise feature search method.

Findings – Outcomes of this study show that without boosting, the REP tree provides high performance accuracy with the AD tree ranking as the second-best performer. Decision stump is found to be the under-performing classifier of this study. However, with boosting, the combination of REP tree and AdaBoost compares favourably with other classification models. If the metrics false positive rate and performance accuracy are taken together, AD tree and REP tree with AdaBoost were both found to carry out an effective classification task. Greedy stepwise has proven its worth in this study by selecting a subset of valuable features to identify the correct class of emails.

Research limitations/implications – This research is focussed on the classification of email spam written in the English language only. The proposed models work with the content (words/features) of email data that is mostly found in the body of the mail. Image spam has not been included in this study. Other messages such as short message service or multi-media messaging service were also not included.

Practical implications – In this research, a boosted decision tree approach has been proposed and used to classify email spam and ham files; this is found to be a highly effective approach in comparison with other state-of-the-art models used in other studies. This classifier may be tested for different applications and may provide new insights for developers and researchers.

Originality/value – A comparison of decision tree classifiers with/without ensemble has been presented for spam classification.

Keywords Adaboost, Bagging, Boosting, Decision tree classifiers, Greedy stepwise feature search

Paper type Research paper

Journal of Systems and Information Technology, Vol. 20 No. 3, 2018, pp. 298-320
ISSN 1328-7265
Received 2 November 2017; revised 9 March 2018, 15 March 2018, 21 March 2018 and 24 May 2018; accepted 26 July 2018
The current issue and full text archive of this journal is available on Emerald Insight at: www.emeraldinsight.com/1328-7265.htm

1. Introduction

The business environment in the modern-day world is highly competitive, with little scope for complacency. In the context of an open system of communications, the exchange and sharing of information has become indispensable for organisations. A seamless exchange of information amongst organisations leads to better understanding and value creation. At the same time, all forms of communication should be cost-effective and affordable. The cheapest method of exchanging electronically stored messages between people is electronic mail or “email”. With email messages, it is possible to send text files as well as non-text files such as audio and image files. Emails can be exchanged over the internet as well as over public and private networks.

One of the problems with email usage is unsolicited messages sent over the internet; this is known as spam. Typically, entities use spam to send messages to a large number of email users for the purposes of advertising, phishing or accessing information. By sponsoring advertisements through email spam, companies can earn large sums of money. As a consequence, email users face different types of problems when they receive unsolicited messages that can be harmful and unwanted.

According to recent research, the proportion of spam on the internet, especially in the form of emails, has risen to a staggering 70 per cent of the entire worldwide flow of emails (Aladin Knowledge System). These spams occupy a major portion of the available mailbox space, resulting in significant time being wasted to remove them (Lai, 2007). The thin demarcation between a legitimate and an unwanted email creates the necessity for a precise and robust spam filter.

The expertise of spammers makes the issue more challenging. Spammers use various methods to send spams that are a challenge to anti-spam systems. One of the methods used by spammers is “tokenization”, where the structure or the content of the email is fragmented into multiple parts or divisions, such as “three” written as “thr33”, or adding an HTML link to the email. The email itself is very different from any legitimate information sent to the user (Trivedi and Dey, 2013a).
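
To illustrate the kind of obfuscation described above, the following is a minimal Python sketch, not part of the study, of how a filter might map a disguised token such as “thr33” back to “three” before feature extraction. The substitution table `LEET_MAP` and the function names are hypothetical choices made for this example; the paper does not specify any such normalisation step.

```python
import re
from typing import List

# Hypothetical substitution table (an assumption for this sketch, not from the paper).
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

# A token is a run of letters, digits or the two symbols handled above.
TOKEN_RE = re.compile(r"[A-Za-z0-9@$]+")


def normalise_token(token: str) -> str:
    """Undo simple character substitutions in tokens that mix letters with digits or symbols."""
    has_letter = any(ch.isalpha() for ch in token)
    has_other = any(not ch.isalpha() for ch in token)
    if has_letter and has_other:
        # Only mixed tokens are rewritten, so plain numbers such as "2018" are left alone.
        return token.translate(LEET_MAP).lower()
    return token.lower()


def normalise_text(text: str) -> List[str]:
    """Split text into tokens and normalise each one."""
    return [normalise_token(tok) for tok in TOKEN_RE.findall(text)]


if __name__ == "__main__":
    print(normalise_text("Buy thr33 now"))
```

A table of this kind is easy to evade, which is one reason learned classifiers are preferred over fixed rules for this task.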

In this paper, a comparative study of different decision tree classifiers that classify emails into spam (unsolicited) or ham (legitimate) has been carried out. Using two data sets available in the public domain, all the decision tree classifiers are tested and compared with and without boosting algorithms. After comparison, a precise and robust decision tree model with boosting has been successfully formed for the email spam classification application.
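
A comparison of this kind rests on the measures named in the abstract: accuracy, sensitivity, specificity and the false positive and false negative rates. The sketch below shows one way these could be computed from actual and predicted labels. It assumes ham is the positive class, which is what the abstract's definitions imply (sensitivity as correctly classified legitimate emails); many spam-filtering papers use the opposite convention, so the `positive` parameter is left configurable. The function name `evaluate` and the sample labels are illustrative only and are not data from the study.

```python
def evaluate(actual, predicted, positive="ham"):
    """Return accuracy, sensitivity, specificity, FPR and FNR for a binary task."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1  # positive example classified correctly
            else:
                fn += 1  # positive example missed
        else:
            if p == positive:
                fp += 1  # negative example wrongly accepted as positive
            else:
                tn += 1  # negative example classified correctly

    def ratio(num, den):
        # Avoid division by zero when a class is absent from the sample.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


if __name__ == "__main__":
    # Made-up labels for illustration only.
    actual = ["ham", "ham", "ham", "spam", "spam"]
    predicted = ["ham", "ham", "spam", "spam", "ham"]
    for name, value in evaluate(actual, predicted).items():
        print(f"{name}: {value:.3f}")
```

Reporting the error rates alongside accuracy matters because the two kinds of mistake have different costs: losing a legitimate email is usually worse than letting one spam message through.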

In current literature, decision tree classifiers have a prominent place. A decision tree is sometimes found to be more user-friendly than other popular classifiers such as neural networks and support vector machines because of its capability to handle the presentation of data efficiently. A number of approaches to develop a decision tree have been successfully proposed (Quinlan, 1993). Another important feature of a decision tree is its flexibility in handling real-valued and categorical attributes as well as items with missing attributes. In addition, decision trees have the capacity to classify problems with more than two classes efficiently and can be modified to deal with regression problems (Kingsford and Salzberg, 2008).

This study recommends a decision tree classifier in association with boosting algorithms; the overall delivery of the classifier makes for better precision and efficiency. The resulting classifier uses the boosting technique and a voting mechanism to elevate performance.
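
The idea of combining boosting with a voting mechanism can be sketched in a few lines of Python. The example below is a minimal, generic version of discrete AdaBoost over decision stumps, in which each stump casts a vote weighted by its accuracy. It is an illustration under stated assumptions, not the authors' configuration: the toy word-count features, the labels (+1 for spam, -1 for ham) and all function names are invented for this sketch.

```python
import math


def stump_predict(stump, row):
    """A stump votes `sign` when feature j exceeds the threshold, otherwise `-sign`."""
    _, j, thr, sign = stump
    return sign if row[j] > thr else -sign


def train_stump(X, y, w):
    """Search every feature, threshold and polarity for the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                candidate = (0.0, j, thr, sign)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(candidate, row) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def adaboost(X, y, rounds=5):
    """Train up to `rounds` stumps, re-weighting misclassified examples each round."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)  # clamp to avoid division by zero
        if err >= 0.5:
            break  # no better than chance, stop adding stumps
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this stump's vote
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all stumps: +1 means spam, -1 means ham."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Toy features: [count of "free", count of "meeting"]; labels are invented.
    X = [[3, 0], [2, 0], [0, 2], [0, 1], [1, 0], [0, 0]]
    y = [1, 1, -1, -1, 1, -1]
    model = adaboost(X, y, rounds=3)
    print([predict(model, row) for row in X])
```

Bagging differs mainly in that each base tree is trained on a bootstrap resample and all votes carry equal weight, rather than being weighted by training error as here.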

The rest of the paper is presented as follows. Section 2 discusses the related work. Section 3 covers the machine learning classifier. Section 4 discusses the experimental design. Section 5 discusses the results and findings, followed by a conclusion in Section 6.

2. Related work

Spam classification is a challenging area because of the smart activities of spammers; they frequently alter the spam words to introduce different forms of attack. A number of machine learning-based classifiers have been highlighted in the literature to counter these attacks. This section shows some existing work related to this research.

