Malicious web domain identification using online credibility and performance data by considering the class imbalance issue

Published date: 08 April 2019
DOI: https://doi.org/10.1108/IMDS-02-2018-0072
Pages: 676-696
Authors: Zhongyi Hu, Raymond Chiong, Ilung Pranata, Yukun Bao, Yuqing Lin
Subject Matter: Information & knowledge management, Information systems, Data management systems, Knowledge management, Knowledge sharing, Management science & operations, Supply chain management, Supply chain information systems, Logistics, Quality management/systems
Zhongyi Hu
School of Information Management, Wuhan University, Wuhan, China
Raymond Chiong and Ilung Pranata
School of Electrical Engineering and Computing, The University of Newcastle,
Callaghan, Australia
Yukun Bao
School of Management, Huazhong University of Science and Technology,
Wuhan, China, and
Yuqing Lin
School of Electrical Engineering and Computing, The University of Newcastle,
Callaghan, Australia
Abstract
Purpose: Malicious web domain identification is of significant importance to the security protection of
internet users. With online credibility and performance data, the purpose of this paper is to investigate the use
of machine learning techniques for malicious web domain identification by considering the class imbalance
issue (i.e. there are more benign web domains than malicious ones).
Design/methodology/approach: The authors propose an integrated resampling approach to handle class
imbalance by combining the synthetic minority oversampling technique (SMOTE) and particle swarm
optimisation (PSO), a population-based meta-heuristic algorithm. The authors use SMOTE for
oversampling and PSO for undersampling.
Findings: By applying eight well-known machine learning classifiers, the proposed integrated resampling
approach is comprehensively examined using several imbalanced web domain data sets with different
imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm
that the proposed approach is highly effective.
Practical implications: This study not only encourages the practical use of online credibility and
performance data for identifying malicious web domains but also provides an effective resampling approach
for handling the class imbalance issue in the area of malicious web domain identification.
Originality/value: Online credibility and performance data are applied to build malicious web domain
identification models using machine learning techniques. An integrated resampling approach is proposed to
address the class imbalance issue. The performance of the proposed approach is confirmed based on
real-world data sets with different imbalance ratios.
Keywords: Particle swarm optimization, Imbalance class distribution, Malicious web domain,
Synthetic minority oversampling technique, Online data, Credibility and performance,
Information security, Internet users
Paper type: Research paper
Industrial Management & Data Systems
Vol. 119 No. 3, 2019
pp. 676-696
© Emerald Publishing Limited
0263-5577
DOI 10.1108/IMDS-02-2018-0072
Received 14 February 2018
Revised 14 May 2018, 1 September 2018
Accepted 2 October 2018
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0263-5577.htm
1. Introduction
Online malicious attacks represent a big threat to internet users' privacy and security
(Dong-Her et al., 2004). Phishing attacks, which deceive users into sharing passwords and
their private information, and malware attacks, which secretly access and infect users'
computers by distributing viruses and malicious software, are two main types of malicious
attacks. These attacks not only result in immediate monetary loss but also shatter users'
trust in engaging in future online activities (San-Martín and Jimenez, 2017).
In light of the above concerns, numerous studies have investigated different
approaches to identify malicious web domains. These approaches can be broadly
categorised into the following two types: blacklist-based and machine learning-based
approaches. The blacklist-based approach, which is a typical way to identify malicious
sites, detects malicious web domains by comparing the domains a user visits against a
user-verified blacklist (https://crypto.stanford.edu/SpoofGuard/). Although it is very
straightforward and has been widely used in different browser toolbars, some studies
have confirmed the ineffectiveness of such techniques (e.g. Tsai et al., 2011; Purkait, 2015).
The main reason is that it is not only challenging but also almost impractical for any
blacklist to be up to date all the time (Ma et al., 2009a, b; Pranata et al., 2012). A more
effective way to tackle the problem is to identify malicious web domains using machine
learning techniques. Many popular machine learning techniques have been successfully
applied to identify malicious websites with lexical and host-based features extracted from
URLs (Ma et al., 2009a, b; Blum et al., 2010; Abutair and Belghith, 2017). Some studies also
built malicious web domain identification models using machine learning techniques with
features extracted from web page content (Zhang et al., 2011; Moghimi and Varjani, 2016;
Tan et al., 2016). Instead of extracting features from URLs or web page content, we
explored the use of online credibility and performance data to identify malicious web
domains with machine learning techniques (Hu et al., 2016). Our results confirmed that,
with this kind of data, the examined machine learning approaches could accurately
identify malicious web domains.
However, our previous work did not consider the class imbalance issue, a commonly
known obstacle in building a machine learning classifier that can successfully distinguish
minority samples from majority ones (Japkowicz and Stephen, 2002; Ko et al., 2017).
In practice, one would naturally expect the number of malicious web domains to be
much smaller than that of benign ones. Along this line of research, only a limited number of studies
(e.g. see Kegelmeyer et al., 2013; Ye et al., 2010) have paid attention to data imbalance when
developing machine learning models for malicious web domain identification. Generally,
resampling techniques such as oversampling and undersampling are useful for addressing
the data imbalance issue. However, even with advanced sampling techniques,
undersampling may discard some potentially useful data, and oversampling may
increase the likelihood of overfitting (Japkowicz and Stephen, 2002). There is no universal
agreement about which kinds of methods are better (Zhu et al., 2018).
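To make the trade-off above concrete, the following is a minimal Python sketch of plain random undersampling and random oversampling for a two-class problem. It is an illustration only and not the approach proposed in this paper; the function names (split_by_class, imbalance_ratio, random_undersample, random_oversample) and the choice to balance the classes exactly are assumptions made for the example.

```python
import random
from collections import Counter


def split_by_class(samples, labels, minority_label):
    """Split (sample, label) pairs into minority and majority lists."""
    pairs = list(zip(samples, labels))
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]
    return minority, majority


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_undersample(samples, labels, minority_label, seed=0):
    """Keep every minority sample and an equally sized random subset of the
    majority class; the remaining majority samples are discarded."""
    rng = random.Random(seed)
    minority, majority = split_by_class(samples, labels, minority_label)
    return minority + rng.sample(majority, len(minority))


def random_oversample(samples, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until both classes match in
    size; the exact copies are what may encourage overfitting."""
    rng = random.Random(seed)
    minority, majority = split_by_class(samples, labels, minority_label)
    n_copies = len(majority) - len(minority)
    copies = [rng.choice(minority) for _ in range(n_copies)]
    return majority + minority + copies
```

With 90 benign and 10 malicious samples, for instance, undersampling in this sketch throws away 80 benign samples, whereas oversampling adds 80 duplicated malicious samples, which mirrors the information-loss and overfitting concerns raised above.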
In this paper, with online credibility and performance data, we propose an integrated
resampling approach to address the class imbalance issue in identifying malicious web
domains. Specifically, we integrate two types of resampling strategies by starting with
oversampling the minority class moderately, followed by undersampling the majority class
to a similar size as the oversampled minority class. For oversampling, we use the synthetic
minority oversampling technique (SMOTE). To undersample the majority class, an
evolutionary undersampling approach based on the particle swarm optimisation (PSO)
algorithm is proposed. Four data sets with different imbalance ratios are used to verify the
performance of the proposed integrated resampling approach for malicious web domain
identification. Three metrics, namely the F-measure, geometric mean (GMean) and the area
under the ROC curve (AUC), are considered in the evaluation process. Eight well-known
machine learning techniques are included as classifiers to build identification models
with the proposed resampling strategy as well as five other commonly used strategies.
We apply two-stage statistical tests to check the statistical differences between results
obtained by the tested approaches.
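The two-step idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in this paper: smote_like is a bare-bones SMOTE-style interpolation between a minority sample and one of its k nearest minority neighbours, and the PSO-based undersampling is replaced by plain random selection as a stand-in, since the PSO procedure is not specified in this section. The names smote_like, integrated_resample, f_measure_and_gmean and the parameter oversample_factor are hypothetical. The metric helper assumes the balanced F-measure (F1) and GMean computed as the square root of sensitivity times specificity, with the malicious class treated as positive.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


def integrated_resample(majority, minority, oversample_factor=2.0, k=3, seed=0):
    """Oversample the minority class moderately, then shrink the majority
    class to the same size as the enlarged minority class."""
    rng = random.Random(seed)
    n_new = int(len(minority) * (oversample_factor - 1))
    minority_aug = list(minority) + smote_like(minority, n_new, k, seed)
    n_keep = min(len(majority), len(minority_aug))
    # Assumption: random selection stands in for the PSO-based undersampling.
    majority_kept = rng.sample(list(majority), n_keep)
    return majority_kept, minority_aug


def f_measure_and_gmean(tp, fp, tn, fn):
    """Return (F1, GMean) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    f1 = 2 * precision * recall / (precision + recall)
    return f1, math.sqrt(recall * specificity)
```

Because each synthetic point lies between two existing minority samples, the enlarged minority class stays within the region already covered by real malicious domains, while the undersampling step controls the final class ratio. An evolutionary undersampler would replace the random selection with a search over which majority samples to keep.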