Malicious web domain identification using online credibility and performance data by considering the class imbalance issue

Published date	08 April 2019
DOI	https://doi.org/10.1108/IMDS-02-2018-0072
Pages	676-696
Author	Zhongyi Hu, Raymond Chiong, Ilung Pranata, Yukun Bao, Yuqing Lin
Subject Matter	Information & knowledge management, Information systems, Data management systems, Knowledge management, Knowledge sharing, Management science & operations, Supply chain management, Supply chain information systems, Logistics, Quality management/systems

Malicious web domain identification using online credibility and performance data by considering the class imbalance issue

Zhongyi Hu
School of Information Management, Wuhan University, Wuhan, China

Raymond Chiong and Ilung Pranata
School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, Australia

Yukun Bao
School of Management, Huazhong University of Science and Technology, Wuhan, China, and

Yuqing Lin
School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, Australia

Abstract

Purpose – Malicious web domain identification is of significant importance to the security protection of internet users. With online credibility and performance data, the purpose of this paper is to investigate the use of machine learning techniques for malicious web domain identification by considering the class imbalance issue (i.e. there are more benign web domains than malicious ones).

Design/methodology/approach – The authors propose an integrated resampling approach to handle class imbalance by combining the synthetic minority oversampling technique (SMOTE) and particle swarm optimisation (PSO), a population-based meta-heuristic algorithm. The authors use SMOTE for oversampling and PSO for undersampling.

Findings – By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain data sets with different imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm that the proposed approach is highly effective.

Practical implications – This study not only inspires the practical use of online credibility and performance data for identifying malicious web domains but also provides an effective resampling approach for handling the class imbalance issue in the area of malicious web domain identification.

Originality/value – Online credibility and performance data are applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world data sets with different imbalance ratios.

Keywords Particle swarm optimization, Imbalance class distribution, Malicious web domain, Synthetic minority oversampling technique, Online data, Credibility and performance, Information security, Internet users

Paper type Research paper

Industrial Management & Data Systems, Vol. 119 No. 3, 2019, pp. 676-696
ISSN 0263-5577
Received 14 February 2018
Revised 14 May 2018 and 1 September 2018
Accepted 2 October 2018
The current issue and full text archive of this journal is available on Emerald Insight at: www.emeraldinsight.com/0263-5577.htm

1. Introduction

Online malicious attacks represent a big threat to internet users’ privacy and security (Dong-Her et al., 2004). Phishing attacks, which deceive users into sharing passwords and their private information, and malware attacks, which secretly access and infect users’ computers by distributing viruses and malicious software, are two main types of malicious attacks. These attacks not only result in immediate monetary loss but also shatter users’ trust in engaging in future online activities (San-Martín and Jimenez, 2017).

In light of the above concerns, numerous studies have investigated different approaches to identify malicious web domains. These approaches can be broadly categorised into two types: blacklist-based and machine learning-based approaches. The blacklist-based approach, which is a typical way to identify malicious sites, detects malicious web domains by comparing the domains a user visits against a user-verified blacklist (https://crypto.stanford.edu/SpoofGuard/). Although it is very straightforward and has been widely used in different browser toolbars, some studies have confirmed the ineffectiveness of such techniques (e.g. Tsai et al., 2011; Purkait, 2015). The main reason is that it is not only challenging but also almost impractical for any blacklist to be up-to-date all the time (Ma et al., 2009a, b; Pranata et al., 2012). A more effective way to tackle the problem is to identify malicious web domains using machine learning techniques. Many popular machine learning techniques have been successfully applied to identify malicious websites with lexical and host-based features extracted from URLs (Ma et al., 2009a, b; Blum et al., 2010; Abutair and Belghith, 2017). Some studies also built malicious web domain identification models using machine learning techniques with features extracted from web page content (Zhang et al., 2011; Moghimi and Varjani, 2016; Tan et al., 2016). Instead of extracting features from URLs or web page content, we explored the use of online credibility and performance data to identify malicious web domains with machine learning techniques (Hu et al., 2016). Our results confirmed that with this kind of data, the examined machine learning approaches could accurately identify malicious web domains.

However, our previous work did not consider the class imbalance issue, a commonly known obstacle in building a machine learning classifier that can successfully distinguish minority samples from majority ones (Japkowicz and Stephen, 2002; Ko et al., 2017). In practice, one would naturally expect the number of malicious web domains to be far smaller than the number of benign ones. Along this line of research, only a limited number of studies (e.g. see Kegelmeyer et al., 2013; Ye et al., 2010) have paid attention to data imbalance when developing machine learning models for malicious web domain identification. Generally, resampling techniques such as oversampling and undersampling are useful for addressing the data imbalance issue. However, even with advanced sampling techniques, undersampling may discard some potentially useful data, and oversampling may increase the likelihood of overfitting (Japkowicz and Stephen, 2002). There is no universal agreement about which kinds of methods are better (Zhu et al., 2018).
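To illustrate why class imbalance is an obstacle, the following minimal sketch compares plain accuracy with two imbalance-aware metrics, F-measure and geometric mean, for a trivial classifier that labels every domain as benign. The data (95 benign and 5 malicious domains) and all function names are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the malicious class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def f_measure(tp, fp, tn, fn):
    # Harmonic mean of precision and recall; taken as 0 when there are no true positives.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def g_mean(tp, fp, tn, fn):
    # Geometric mean of the true positive rate and the true negative rate.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)

# Hypothetical data: 95 benign (0) and 5 malicious (1) domains.
y_true = [0] * 95 + [1] * 5
always_benign = [0] * 100  # a classifier that never flags anything

counts = confusion_counts(y_true, always_benign)
print(accuracy(*counts), f_measure(*counts), g_mean(*counts))  # 0.95 0.0 0.0
```

The always-benign classifier reaches 95 per cent accuracy while missing every malicious domain, which is why metrics that account for the minority class are preferable on imbalanced data.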

In this paper, with online credibility and performance data, we propose an integrated resampling approach to address the class imbalance issue in identifying malicious web domains. Specifically, we integrate two types of resampling strategies by starting with oversampling the minority class moderately, followed by undersampling the majority class to a similar size as the oversampled minority class. For oversampling, we use the synthetic minority oversampling technique (SMOTE). To undersample the majority class, an evolutionary undersampling approach based on the particle swarm optimisation (PSO) algorithm is proposed. Four data sets with different imbalance ratios are used to verify the performance of the proposed integrated resampling approach for malicious web domain identification. Three metrics, namely F-measure, geometric mean (GMean) and the area under the ROC curve (AUC), are considered in the evaluation process. Eight well-known machine learning techniques are included as classifiers to build identification models with the proposed resampling strategy as well as five other commonly used strategies. We apply two-stage statistical tests to check the statistical differences between results obtained by the tested approaches.
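The two-step idea described above (oversample the minority class moderately, then undersample the majority class to a similar size) can be sketched as follows. This is only an illustration under stated assumptions: the SMOTE step is a simplified pure-Python version of the standard interpolation scheme, and the PSO-based undersampling is replaced by plain random undersampling as a stand-in, since the details of the PSO procedure are not given in this section. The function names, the `oversample_ratio` parameter and the toy two-dimensional data are all hypothetical.

```python
import math
import random

def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

def random_undersample(majority, n_keep, rng=None):
    """Stand-in for PSO-based undersampling (an assumption of this sketch):
    keep a random subset of the majority class."""
    if rng is None:
        rng = random.Random(0)
    return rng.sample(majority, n_keep)

def integrated_resample(majority, minority, oversample_ratio=2.0, k=3, seed=0):
    """Oversample the minority class, then shrink the majority class to match."""
    rng = random.Random(seed)
    target = int(len(minority) * oversample_ratio)
    new_minority = minority + smote(minority, target - len(minority), k, rng)
    n_keep = min(len(majority), len(new_minority))
    new_majority = random_undersample(majority, n_keep, rng)
    return new_majority, new_minority

# Hypothetical toy data: 40 benign and 5 malicious domains with two features each.
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)]
minority = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(5)]

new_majority, new_minority = integrated_resample(majority, minority)
print(len(new_majority), len(new_minority))  # 10 10
```

In this sketch the majority subset is chosen at random; an evolutionary method such as PSO would instead search for the subset that optimises some classifier-quality criterion.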
