Detection of phishing websites using a novel twofold ensemble model

DOI: https://doi.org/10.1108/JSIT-09-2017-0074
Pages: 321-357
Published date: 13 August 2018
Author: Kalyan Nagaraj, Biplab Bhattacharjee, Amulyashree Sridhar, Sharvani GS
Subject Matter: Information & knowledge management, Information systems, Information & communications technology
Kalyan Nagaraj
Department of Computer Science and Engineering,
RV College of Engineering, Bangalore, India
Biplab Bhattacharjee
School of Management Studies, National Institute of Technology, Calicut, India, and
Amulyashree Sridhar and Sharvani GS
Department of Computer Science and Engineering,
RV College of Engineering, Bangalore, India
Abstract
Purpose: Phishing is one of the major threats affecting businesses worldwide in current times.
Organizations and customers face the hazards arising out of phishing attacks because of anonymous
access to vulnerable details. Such attacks often result in substantial financial losses. Thus, there
is a need for effective intrusion detection techniques to identify and possibly nullify the effects
of phishing. Classifying phishing and non-phishing web content is a critical task in information
security protocols, and foolproof mechanisms have yet to be implemented in practice. The purpose of
the current study is to present an ensemble machine learning model for classifying phishing websites.
Design/methodology/approach: A publicly available data set comprising 10,068 instances of
phishing and legitimate websites was used to build the classifier model. Feature extraction was
performed by deploying a group of methods, and the relevant features extracted were used for
building the model. A twofold ensemble learner was developed by feeding the results of a random
forest (RF) classifier into a feedforward neural network (NN). Performance of the ensemble
classifier was validated using k-fold cross-validation. The twofold ensemble learner was
implemented as a user-friendly, interactive decision support system for classifying websites as
phishing or legitimate.
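The twofold idea described above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: bagged decision stumps stand in for a full random forest, the network has a single small hidden layer, and only the first-stage vote fraction is fed to the network (the abstract does not say whether the original features are passed along as well). All function names, hyperparameters and the toy data are assumptions made for illustration.

```python
import math
import random

def train_stump(X, y, feats):
    """Fit the rule 'row[f] >= t means phishing' with the fewest training errors."""
    best = None
    for f in feats:
        vals = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between neighbouring distinct values.
        thresholds = [(a + b) / 2 for a, b in zip(vals, vals[1:])] or vals
        for t in thresholds:
            errors = sum((row[f] >= t) != (label == 1) for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]

def train_forest(X, y, n_trees, rng):
    """Stage 1 (simplified stand-in for RF): stumps on bootstrap rows and random feature subsets."""
    n, d = len(X), len(X[0])
    k = max(1, int(math.sqrt(d)))
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]
        feats = rng.sample(range(d), k)
        forest.append(train_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest

def forest_score(forest, row):
    """Fraction of stumps voting 'phishing' (label 1) for one row."""
    return sum(row[f] >= t for f, t in forest) / len(forest)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_nn(inputs, targets, hidden, epochs, lr, rng):
    """Stage 2: one-input feedforward network (one sigmoid hidden layer) trained by SGD."""
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, target in zip(inputs, targets):
            h = [sigmoid(w1[j] * x + b1[j]) for j in range(hidden)]
            out = sigmoid(sum(w2[j] * h[j] for j in range(hidden)) + b2)
            delta = out - target  # cross-entropy gradient at the output unit
            for j in range(hidden):
                grad_h = delta * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= lr * delta * h[j]
                w1[j] -= lr * grad_h * x
                b1[j] -= lr * grad_h
            b2 -= lr * delta
    return w1, b1, w2, b2

def nn_predict(params, x):
    w1, b1, w2, b2 = params
    h = [sigmoid(w * x + b) for w, b in zip(w1, b1)]
    return sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)

def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold i holds out every k-th row starting at i."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        yield [j for j in range(n) if j not in held_out], test

def cross_validate(X, y, k=3, seed=0):
    """Mean held-out accuracy of the two-stage model over k folds."""
    rng = random.Random(seed)
    accuracies = []
    for train, test in k_fold_indices(len(X), k):
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        forest = train_forest(Xtr, ytr, n_trees=15, rng=rng)
        scores = [forest_score(forest, row) for row in Xtr]
        nn = train_nn(scores, ytr, hidden=3, epochs=300, lr=0.5, rng=rng)
        preds = [int(nn_predict(nn, forest_score(forest, X[i])) >= 0.5) for i in test]
        accuracies.append(sum(p == y[i] for p, i in zip(preds, test)) / len(test))
    return sum(accuracies) / k

# Made-up toy data: 4 features; phishing rows (label 1) have uniformly higher values.
X = [[0.1 * ((i + j) % 3) + (0.8 if i % 2 else 0.0) for j in range(4)] for i in range(36)]
y = [i % 2 for i in range(36)]
print(cross_validate(X, y))
```

The toy data are deliberately easy to separate, so the printed accuracy says nothing about real phishing data. The point is only the data flow: the first stage is trained on each training fold, its vote fraction becomes the input of the second stage, and accuracy is measured on the held-out fold.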
Findings: Experimental simulations were performed to assess and compare the performance of the
ensemble classifiers. Statistical tests indicated that the RF_NN model gave superior performance,
with an accuracy of 93.41 per cent and a minimal mean squared error of 0.000026.
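For readers unfamiliar with the two reported metrics, the following sketch shows one common way to compute them from predicted probabilities. The abstract does not state exactly how the mean squared error was computed, so the definitions, the 0.5 threshold and the sample values below are assumptions for illustration only.

```python
def accuracy(labels, probs, threshold=0.5):
    """Share of predictions falling on the correct side of the threshold."""
    hits = sum((p >= threshold) == (t == 1) for t, p in zip(labels, probs))
    return hits / len(labels)

def mean_squared_error(labels, probs):
    """Average squared gap between the predicted probability and the 0/1 label."""
    return sum((p - t) ** 2 for t, p in zip(labels, probs)) / len(labels)

# Made-up example: 1 = phishing, 0 = legitimate.
labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.4, 0.8, 0.1]
print(accuracy(labels, probs))
print(mean_squared_error(labels, probs))
```

In this example four of the five predictions fall on the correct side of the threshold, giving an accuracy of 0.8, while the mean squared error is about 0.092.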
Research limitations/implications: The research data set used in this study is publicly
available and easy to analyze. Comparative analysis with other real-time data sets of recent origin
must be performed to ensure generalization of the model against various security breaches. Different
variants of phishing threats must be detected rather than focusing particularly on phishing
website detection.
Originality/value: To the best of the authors' knowledge, the twofold ensemble model has not been
applied to the classification of phishing websites in any previous study.
Keywords: Machine learning, Ensemble learner, Intelligent systems, Phishing website
Paper type: Research paper
Received 7 September 2017
Revised 16 January 2018, 16 May 2018, 14 June 2018
Accepted 25 July 2018
Journal of Systems and Information Technology, Vol. 20 No. 3, 2018, pp. 321-357
© Emerald Publishing Limited 1328-7265
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/1328-7265.htm
1. Introduction
Exponential expansion of data in digital media over the years has resulted in corresponding
growth of e-commerce transaction volumes. The Internet has provided a digital platform to
encourage rapid communication between suppliers and end-users for information
dissemination of their products (Basu and Muylle, 2003). The rising use of social media platforms
by consumers has led marketers to use these media for product promotions. Such
modernizations have gathered immense popularity over the last decade and are expected to continue
expanding in the coming years (Ho et al., 2007). Recent progress with smartphone devices,
tablets and notebooks has further led to an explosion of information, ranging from terabytes to
petabytes.
However, it is important to observe the other side of this data expansion, which is the
inherent threat to the security of such content. This escalation of privacy concerns is
predominantly because of numerous threats that gain fraudulent access, resulting in loss of
sensitive information. These security attacks are commonly classified as active or passive.
In active attacks, secure systems are circumvented to gain access to legitimate information,
whereas passive attacks set up sniffer devices to detect secure information (Summerville et al.,
2005). Phishing is an instance of passive online fraud, defined as a deceitful act of feigning
trustworthiness to acquire protected credentials by retrieving emails, passwords, usernames and
credit card transactions (Caswell and Orebaugh, 2005). Different flavors of phishing attacks are
witnessed, among which website attacks are the predominant ones. It is worth noting the method
adopted by phishers for accessing sensitive information. Initially, an attacker creates a visually
identical replica of a legitimate website (referred to as a phishing website), and subsequently,
such sites seek unscrupulous access to customers' private details through false identification
for the acquisition of monetary benefits (Yu et al., 2008).
At present, there is a paradigm shift toward social media attacks, recurrently detected on the
Twitter and Facebook platforms (Chandrasekaran et al., 2006; Grier et al., 2010). Such phishing
threats are consistently observed in several domains, as they intend to exploit human ignorance,
leading to security breaches. Some recent statistical reports have stressed the increasing trend
and intensity of these security threats. According to the internet security report released by
Symantec, a leading cyber security organization, phishing emails have largely contributed to
business email compromise (BEC) threats, resulting in a loss of $3 bn (Internet Security Threat
Report, 2017). Likewise, the latest figures from the Anti Phishing Working Group (APWG) (Phishing
Activity Trends Report 1 Quarter, 2010), released for the last quarter of 2016, have provided
details of about 20 million new malware. This report showed that the volume of phishing
websites increased by 250 per cent compared with the last quarter of 2015
(Phishing Activity Trends Report, 2017). These reports indicate a likelihood of an exponential
increase in phishing attacks in future, causing tremendous harm to security.
With such a distressing upsurge of phishing attacks, a pressing need arises to formulate improved
systematic solutions for safeguarding consumers from diverse malicious security threats. In
response to the proliferation of phishing attacks, numerous phishing detection practices are being
implemented by different agencies to protect information systems used by consumers and
establishments. Several authors have proposed various methods of phishing detection in past studies
(Zhang et al., 2007; Garera et al., 2007; Dong et al., 2008; Medvet et al., 2008; Yu et al., 2009;
Afroz and Greenstadt, 2011; Singh et al., 2011; Zhuang et al., 2012; Marchal et al., 2014; Rao and
Ali, 2015; Ahmed and Abdullah, 2016; Y.A. Abutair and Belghith, 2017). However, because of the
abundant varieties of cyber threats and the immense competition among numerous security vendors,
not all forms of attacks may be mitigated completely. In the context of phishing attacks, the
majority of anti-phishing tools fail to target specific flavors of these attacks, as they tackle
them in bulk without any knowledge base of previous and future waves of attacks (Dhamija et al.,
2006). Mitigating phishing attacks requires effective analytical strategies and techniques
supported by adept manual intervention. Such proactive prediction of phishing threats enhances
security and safeguards consumer details, which also helps in minimizing the hazards of online
monetary transactions at domestic and global levels (Khonji et al., 2013).
It is important to distinguish clearly between legitimate and phishing websites so that
strategies can be devised to mitigate security threats to both customers and organizations.
These threats begin when a single breach of confidentiality is followed by numerous further
threats, leading to unauthorized financial transactions and organizational losses. As a
consequence, mathematical models must be adopted by business website domains to identify consumers
who are on the verge of a privacy threat (Ryan, 2001; Wieland et al., 2008; Liu and Terzi, 2009).
Accordingly, business analysts are at present trying to adopt different strategies for
understanding the implications of phishing attacks for cyber security modules (Buczak and Guven,
2016).
From the users' perspective, it is imperative for webmasters to identify phishing websites
precisely within a predefined timeframe to curtail phishing threats. With the current explosion
of information among web domains and increasing demand for online businesses, business analytics
and intelligence (BAI) applications have paved the way for value-added recognition and deterrence
of cyber threats (Zuech et al., 2015). In this respect, machine learning and data mining
algorithms are viewed as supporting tools for business analytics to uncover phishing attacks by
studying the historical behavioral patterns of websites (Qabajeh and Thabtah, 2014; Smadi et al.,
2015; Jain and Gupta, 2017).
It is essential to understand the significance of predictor variables that can indicate
whether a website is of the phishing or non-phishing type. Most studies have considered
predictor attributes based on objects from the World Wide Web (WWW) and Uniform Resource Locator
(URL) based features, along with third-party features describing a webpage (Atighetchi and Pal,
2009; Wang et al., 2009; Alkhozae and Batar, 2011). Feature selection techniques are among the
best-known machine learning methodologies for extracting relevant attributes from data sets
(Guyon and Elisseeff, 2003; van der Maaten et al., 2009). Based on the features extracted, several
mathematical models have been developed to classify phishing websites.
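As a concrete illustration of feature selection on categorical website attributes, the sketch below ranks features by information gain with respect to the phishing label. Information gain is chosen here only as a familiar example; the text does not say which selection methods the authors used. The feature names and the six-row data set are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Reduction in label entropy obtained by splitting on one categorical feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = {name: information_gain([r[i] for r in rows], labels)
             for i, name in enumerate(names)}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)

# Hypothetical binary features; label 1 = phishing, 0 = legitimate.
names = ["has_ip_address", "long_url", "uses_https"]
rows = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 0], [0, 0, 0]]
labels = [1, 1, 1, 0, 0, 0]
for name, gain in rank_features(rows, labels, names):
    print(name, round(gain, 3))
```

In this toy data the first feature separates the classes perfectly and ranks highest, while the last carries no information about the label and ranks lowest. A selection step would keep only the features above some chosen gain threshold.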
The foremost objective of detecting phishing attacks using business analytics
applications is to classify websites into two categories, phishing and non-phishing, and
to provide better strategies for preventing phishing attacks (Whittaker et al., 2010). To this
end, several data mining algorithms such as Logistic Regression (LR), Naïve Bayes
(NB), Support Vector Machines (SVM), Decision Trees (DT), Association Rules mining (AR)
and Neural Networks (NN) have been adopted for building predictive models intended for detecting
phishing threats. Many studies have presented comparative analyses of these algorithms
on different sets of phishing data (Bratko et al., 2006; Basnet et al., 2008; Lakshmi and
Vijaya, 2012; Panda et al., 2012; Chu et al., 2013; James et al., 2013; Soska and Christin, 2014;
Singh et al., 2015; Jeeva and Rajsingh, 2016; Ramesh, Gupta and Gamya, 2017). Other
published works have focused on individual algorithms for identifying phishing threats
(Huang et al., 2012; Feroz and Mengel, 2014; Akinyelu and Adewumi, 2014; Abdelhamid,
2015; Li et al., 2016). Also, several researchers have focused on developing
ensemble models by combining discrete algorithms (Patil and Sherekar, 2013; Montazer and