A novel committee selection mechanism for combining classifiers to detect unsolicited emails


Published date	14 November 2016
Pages	524-548
DOI	https://doi.org/10.1108/VJIKMS-07-2015-0042
Author	Shrawan Kumar Trivedi, Shubhamoy Dey
Subject Matter	Information & knowledge management, Knowledge management, Knowledge management systems

A novel committee selection mechanism for combining classifiers to detect unsolicited emails

Shrawan Kumar Trivedi

Department of Information Systems and Business Analytics,

School of Management, BML Munjal University, Gurgaon, India, and

Shubhamoy Dey

Department of Information Systems,

Indian Institute of Management Indore, Indore, India

Abstract

Purpose – Email is an important medium for sharing information rapidly. However, spam, being a nuisance in such communication, motivates the building of a robust filtering system with high classification accuracy and good sensitivity towards false positives. In that context, this paper aims to present a combined classifier technique using a committee selection mechanism where the main objective is to identify a set of classifiers so that their individual decisions can be combined by a committee selection procedure for accurate detection of spam.

Design/methodology/approach – For training and testing of the relevant machine learning classifiers, text mining approaches are used in this research. Three data sets (Enron, SpamAssassin and LingSpam) have been used to test the classifiers. Initially, pre-processing is performed to extract the features associated with the email files. In the next step, the extracted features are taken through a dimensionality reduction method where non-informative features are removed. Subsequently, an informative feature subset is selected using genetic feature search. Thereafter, the proposed classifiers are tested on those informative features and the results compared with those of other classifiers.

Findings – For building the proposed combined classifier, three different studies have been performed. The first study identifies the effect of boosting algorithms on two probabilistic classifiers: Bayesian and Naïve Bayes. In that study, AdaBoost has been found to be the best algorithm for performance boosting. The second study was on the effect of different kernel functions on the support vector machine (SVM) classifier, where SVM with normalized polynomial (NP) kernel was observed to be the best. The last study was on combining classifiers with committee selection where the committee members were the best classifiers identified by the first study, i.e. Bayesian and Naïve Bayes with AdaBoost, and the committee president was selected from the second study, i.e. SVM with NP kernel. Results show that combining the identified classifiers to form a committee machine gives excellent performance accuracy with a low false positive rate.

Research limitations/implications – This research is focused on the classification of email spam written in the English language. Only the body (text) parts of the emails have been used. Image spam has not been included in this work. We have restricted our work to email messages only. None of the other types of messages, such as short message service or multimedia messaging service, were a part of this study.

Practical implications – This research proposes a method of dealing with the issues and challenges faced by internet service providers and organizations that use email. The proposed model provides not only better classification accuracy but also a low false positive rate.

The current issue and full text archive of this journal is available on Emerald Insight at:

www.emeraldinsight.com/2059-5891.htm

VJIKMS

46,4

524

Received 1 August 2015

Revised 10 February 2016

Accepted 13 July 2016

VINE Journal of Information and Knowledge Management Systems

Vol. 46 No. 4, 2016

pp. 524-548

© Emerald Group Publishing Limited

2059-5891

DOI 10.1108/VJIKMS-07-2015-0042

Originality/value – The proposed combined classifier is a novel classifier designed for accurate classification of email spam.

Keywords SVM, Bayesian, Probabilistic classifiers, Naïve Bayes, Function-based classifiers, Kernel functions, Combining classifiers, Stacking, Committee machine

Paper type Technical paper

1. Introduction

In the present automated world, sharing information is important to be competitive and sustainable in business. Email is a rapid and inexpensive medium of communication. It is a popular medium of interaction between people and has become a part of life itself (Whittaker and Moody, 2005). However, spam (unsolicited bulk email) has become a nuisance in such communication. Recently, researchers’ interest in the spam classification domain has increased as the volume of spam grows day by day. A study observes that 66 per cent of all business emails are spam (Kaspersky Spam Statistics, 2014). This rapid growth leads to serious problems such as unnecessary filling of users’ mailboxes, engulfing of important emails, consumption of storage space and bandwidth and the time required to sort them.

Legal and other simplistic methods like blacklisting, keyword-based filtering, etc., have shown limited effect in countering such problems. However, content-based filtering using machine learning methods is reported to be promising in the literature. Nowadays, spam classification is becoming a challenging area due to the complex nature of spam. Complexity is defined as the modification of content, such as tokenization (modifying words, such as “free” being written as fr33) and obfuscation (which hides features by adding HTML or some other codes, such as “free” coded as frexe or FR3E) (Heydari et al., 2015; Goodman et al., 2007), etc., to change the information of features so as to create barriers in distinguishing spam from legitimate emails. Many machine learning classifiers have been tested in an attempt to tackle these problems. Some of them, such as probabilistic classifiers [Bayesian (Koller and Sahami, 1997a; Jatana and Sharma, 2014) and Naïve Bayes (NB) (Farid et al., 2014; Lewis and Gale, 1994)] and support vector machine (SVM) (Drucker et al., 1999), have been found to be good performers in the literature. Significantly good accuracy has been reported even in the presence of the complexity discussed above. The Bayesian technique is well known, as it has the interesting concept of finding the informative features/words with the help of deviation from the mean.
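
To make the effect of such content modification concrete, the following minimal sketch shows how a naive keyword-based filter can be defeated by the variants mentioned above. The keyword set, the function name and the sample messages are hypothetical and serve only as an illustration; they are not taken from the study.

```python
# Hypothetical keyword list and helper, for illustration only.
SPAM_KEYWORDS = {"free"}


def keyword_filter(text: str) -> bool:
    """Flag a message if any whitespace-separated token is a known spam keyword."""
    tokens = text.lower().split()
    return any(token in SPAM_KEYWORDS for token in tokens)


# Hypothetical sample messages: one plain, two with modified spellings of "free".
plain = "claim your free gift"
tokenized = "claim your fr33 gift"
obfuscated = "claim your FR3E gift"

# Map each message to the filter's verdict (True = flagged as spam).
results = {msg: keyword_filter(msg) for msg in (plain, tokenized, obfuscated)}
```

Only the unmodified message is flagged here, which is one reason why exact keyword matching is considered too brittle and content-based learning methods are preferred.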

The rst part of this research is on probabilistic classiers (Bayesian and NB) and the

concept of boosting [Bagging, Boosting (with re-sampling) and AdaBoost] to improve

data sampling for better learning during training. Boosting methods use voting

mechanism where a single classier is formulated as the linear combination of many

weak classiers.

For SVM classiers, a good choice of kernel function is important and at the same

time difcult. A good kernel function provides efcient learning to the SVM during

training. In the second part of our study, a number of different kernel functions have

been compared.
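
As an illustration of what a kernel function computes, the sketch below implements one common definition of a normalized polynomial kernel: a polynomial kernel K(x, y) = (x·y + c)^d divided by sqrt(K(x, x) K(y, y)). This excerpt does not define the NP kernel, so the formula, the degree and the constant term used here are assumptions, and the function names are hypothetical.

```python
import math


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree; parameter values are assumptions."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def normalized_poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel rescaled so that every vector has unit self-similarity."""
    kxy = poly_kernel(x, y, degree, coef0)
    kxx = poly_kernel(x, x, degree, coef0)
    kyy = poly_kernel(y, y, degree, coef0)
    return kxy / math.sqrt(kxx * kyy)


# Two hypothetical, orthogonal feature vectors.
x = [1.0, 0.0]
y = [0.0, 1.0]

self_similarity = normalized_poly_kernel(x, x)
cross_similarity = normalized_poly_kernel(x, y)
```

Normalization makes the similarity of any vector with itself equal to 1, so that emails of very different lengths are compared on the same scale.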

In the nal part of this research, a method for combining classiers with committee

selection has been proposed where the individual classication decisions of the good

classiers are combined. The committee members are chosen from the best performers

