Phishing web site detection using diverse machine learning algorithms

Date: 02 January 2020
Published date: 02 January 2020
Pages: 65-80
DOI: https://doi.org/10.1108/EL-05-2019-0118
Author: Ammara Zamir, Hikmat Ullah Khan, Tassawar Iqbal, Nazish Yousaf, Farah Aslam, Almas Anjum, Maryam Hamdani
Subject Matter: Information & knowledge management, Information & communications technology, Internet
Ammara Zamir
Department of Computer Science, University of Wah, Quaid Avenue,
Wah Cantt, Pakistan and Department of Computer Science,
COMSATS University Islamabad Wah Campus, Islamabad, Pakistan
Hikmat Ullah Khan and Tassawar Iqbal
Department of Computer Science, COMSATS University Islamabad,
Wah Campus, Islamabad, Pakistan
Nazish Yousaf
Department of Computer Science, University of Wah, Quaid Avenue,
Wah Cantt, Pakistan and Department of Computer and Software Engineering,
College of Electrical and Mechanical Engineering, Islamabad, Pakistan
Farah Aslam
Department of Computer Science, University of Wah, Quaid Avenue,
Wah Cantt, Pakistan
Almas Anjum
Department of Computer and Software Engineering,
College of Electrical and Mechanical Engineering, Islamabad, Pakistan, and
Maryam Hamdani
Department of Computer Science, University of Wah, Quaid Avenue,
Wah Cantt, Pakistan
Abstract
Purpose: This paper aims to present a framework to detect phishing websites using a stacking model.
Phishing is a type of fraud used to access users' credentials. The attackers access users' personal and sensitive
information for monetary purposes. Phishing affects diverse fields, such as e-commerce, online business,
banking and digital marketing, and is ordinarily carried out by sending spam emails and developing identical
websites resembling the original websites. As people surf the targeted website, the phishers hijack their
personal information.
Design/methodology/approach: Features of the phishing data set are analysed using feature
selection techniques including information gain, gain ratio, Relief-F and recursive feature elimination
(RFE). Two features are proposed combining the strongest and weakest attributes.
Principal component analysis with diverse machine learning algorithms (random forest [RF],
neural network [NN], bagging, support vector machine, Naïve Bayes and k-nearest neighbour) is applied
to the proposed and remaining features. Afterwards, two stacking models, Stacking1 (RF + NN + Bagging)
and Stacking2 (kNN + RF + Bagging), are applied by combining the highest scoring classifiers to improve the
classification accuracy.
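As a rough illustration of two of the ranking criteria named above, the following sketch computes information gain and gain ratio for discrete features in plain Python. It is not the authors' implementation: the toy labels and the feature names `has_ip_in_url` and `uses_https` are invented for the example, and Relief-F and RFE are not covered here.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalised by the entropy of the feature itself."""
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return information_gain(feature, labels) / split_info


# Toy data, invented for illustration: 1 = phishing, 0 = legitimate.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "has_ip_in_url": [1, 1, 1, 1, 0, 0, 0, 0],  # separates the classes perfectly
    "uses_https": [1, 0, 1, 0, 1, 0, 1, 0],     # unrelated to the class
}

# Rank the features from most to least informative.
ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], labels),
    reverse=True,
)
for name in ranking:
    ig = information_gain(features[name], labels)
    gr = gain_ratio(features[name], labels)
    print(f"{name}: IG={ig:.3f} GR={gr:.3f}")
```

A feature that splits the classes cleanly scores close to 1 bit on this toy data, while a feature unrelated to the class scores close to 0, which is the sense in which such criteria separate strong attributes from weak ones.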
Received 14 May 2019
Revised 8 September 2019
24 October 2019
Accepted 6 November 2019
The Electronic Library
Vol. 38 No. 1, 2020
pp. 65-80
© Emerald Publishing Limited
0264-0473
DOI 10.1108/EL-05-2019-0118
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/0264-0473.htm
Findings: The proposed features played an important role in improving the accuracy of all the classifiers.
The results show that RFE plays an important role in removing the least important feature from the data set.
Furthermore, Stacking1 (RF + NN + Bagging) outperformed all other classifiers in terms of classification
accuracy, detecting phishing websites with 97.4% accuracy.
Originality/value: This research is novel in that no previous research focusses on using a feed
forward NN and ensemble learners for detecting phishing websites.
Keywords: Classification-based techniques, Ensemble learners, Feed forward neural network,
Phishing detection, Neural networks, Stacking models, Ensemble techniques, Feature selection,
Malicious URLs
Paper type: Research paper
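The pipeline summarised in the abstract can be sketched with scikit-learn as below. This is a minimal sketch under stated assumptions, not the authors' code: the data are synthetic (`make_classification`), the logistic regression meta-learner and every hyperparameter are illustrative choices the abstract does not specify, and the PCA step and the two proposed features are omitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a phishing data set (assumption: not the paper's data).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# RFE configured to drop only the single least important feature.
selector = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=X.shape[1] - 1,
)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# Base learners mirroring "Stacking1": random forest, feed forward NN, bagging.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("nn", MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)),
    ("bagging", BaggingClassifier(n_estimators=20, random_state=0)),
]

# The meta-learner is an assumption; the abstract does not name one.
stacking1 = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5,
)
stacking1.fit(X_train_sel, y_train)

accuracy = stacking1.score(X_test_sel, y_test)
print(f"Held-out accuracy on synthetic data: {accuracy:.3f}")
```

Swapping the `"nn"` entry for a `KNeighborsClassifier` would give the analogous sketch of Stacking2 (kNN + RF + Bagging). Accuracy on this synthetic data says nothing about the 97.4% figure reported for the real data set.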
1. Introduction
In recent years, social networks have become a virtual meeting place for the general public.
Unfortunately, while connecting through social networks, people experience phishing
attacks. Phishing is a cybercrime which risks a user's privacy, may execute malware attacks
and often steals their sensitive information. Phishing is carried out by using different
engineering techniques including: instant messages (Jakobsson, 2018); fraudulent emails or
mimicking an online bank, auction or payment sites; and directing people to fake Web pages
(Rodríguez et al., 2019) that resemble a login page to a genuine site. Phishing attacks
increased drastically in 2019, according to the Anti-Phishing Working Group, which detected
a total of 180,768 phishing websites in 2019 (Anti-Phishing Working Group,
APWG, 2019). Also, according to a Proofpoint survey, people who use social websites are
more exposed to potential phishing threats
(www.proofpoint.com/us/security-awareness/post/latest-phishing-rst-2019, accessed 8 May 2019).
A website phishing attack is carried out by spoofing legal identities, such as a legitimate
website. A malicious website succeeds at obtaining some user information and can lead a
user to additional malicious links that consequently gain access to even more of the user's
sensitive or personal information. To achieve this goal, identical websites are created which
so closely resemble the original website that the duplication cannot be detected.
Phishing attacks cause great damage to the economy, intellectual property and national
security (Vayansky and Kumar, 2018).
Phishing harms industries including e-commerce and internet banking. Several techniques
exist to protect users from phishing attacks, including the heuristic approach (Babagoli et al.,
2019), the rule-based approach (Adewole et al., 2019) and a supervised machine learning (ML)
approach (Sahingoz et al., 2019). Supervised ML algorithms are widely used for classification
(Alzubi et al., 2018; Hawashin et al., 2019) and are the most popular of the techniques used
to detect phishing websites. Kumar and Chaudhary (2017) introduced a framework based on
machine learning for e-commerce-based mobile applications to detect malware. This approach
detects mobile phishing and protects against information leakage (Kumar and Chaudhary, 2017).
Internet banking is also affected by phishing. The rule-based approach was introduced by
Moghimi and Varjani (2016) to detect phishing in internet banking using four sets of features.
A support vector machine (SVM) is applied to classify the Web pages, and the proposed
framework achieved 99 per cent accuracy in detecting phishing Web pages in internet
banking (Moghimi and Varjani, 2016).
This research study focusses on a supervised ML approach to detect phishing websites.
The contributions of this research study are as follows:
Application of feature selection algorithms including: gain ratio (GR), information
gain (IG), Relief-F and recursive feature elimination (RFE) to analyse the importance