Feature distillation and accumulated selection for automated fraudulent publisher classification from user click data of online advertising
DOI: https://doi.org/10.1108/DTA-09-2021-0233
Published: 6 January 2022
Pages: 602-625
Subject matter: Library & information science, Librarianship/library management, Library technology, Information behaviour & retrieval, Metadata, Information & knowledge management, Information & communications technology, Internet
Authors: Deepti Sisodia, Dilip Singh Sisodia
Feature distillation and accumulated selection for automated fraudulent publisher classification from user click data of online advertising
Deepti Sisodia and Dilip Singh Sisodia
Computer Science and Engineering, National Institute of Technology Raipur, Raipur, India
Abstract
Purpose: The problem of choosing the most useful features from hundreds of features in time-series user click data arises in online advertising when classifying fraudulent publishers. Selecting feature subsets is a key issue in such classification tasks. In practice, filter approaches are common; however, they neglect the correlations among features. Conversely, wrapper approaches cannot be applied because of their complexity. Moreover, existing feature selection methods in particular cannot handle such data, which is one of the major causes of instability in feature selection.
Design/methodology/approach: To overcome such issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to investigate the optimal subset of relevant features for analyzing publishers' fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where an accumulated evaluation of the relevant feature subset is performed to search for an optimal feature subset using effective machine learning (ML) models.
Findings: Empirical results show enhanced classification performance with the proposed features in terms of average precision, recall, F1-score and AUC for publisher identification and classification.
Originality/value: FDAS is evaluated on the FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics: first, with the original features; second, with relevant feature subsets selected by feature selection (FS) methods; and third, with the optimal feature subset obtained by the proposed approach. An ANOVA significance test is conducted to demonstrate significant differences between independent features.
Keywords: Fraudulent publisher, FDAS, Feature selection, Feature distillation, Accumulated selection, Majority voting
Paper type: Research paper
1. Introduction
With the development of state-of-the-art techniques and global communication, fraud is growing drastically (Sisodia et al., 2018). In the pay-per-click (PPC) online advertising model, the advertising commissioner acts as a central coordinator between advertisers and publishers. The advertiser provides the advertisements to the advertising commissioner based on the planned budget and pays a commission for every generated click (Berrar, 2012, 2016). The publisher communicates with the advertising commissioner to display ads on their web pages and to receive user clicks and a commission proportionate to the
generated clicks (Xu et al., 2014). However, the generated clicks may come from genuine publishers, deceptive software agents or other illegitimate means. Publishers' monetization of illegitimate clicks is known as click fraud, and such publishers are termed fraudulent publishers. Therefore, a click fraud detection (CFD) model is required to prevent click fraud.
A CFD model is a predictive model that identifies the click-log behavior of a publisher. The publisher's behavior is assessed based on the clicks they generate (Perera et al., 2013). The model predicts the legitimacy of the publisher's current click-log behavior using the click logs the publisher generated in the past. In detecting click fraud, feature engineering is vital for constructing feature variables that appropriately summarize and represent the publisher's conduct from raw click-log records (Cadenas et al., 2013). However, filtering a large number of features and transforming them into significant ones is a crucial task of the CFD model, as irrelevant features in the training data may produce erroneous results (Li et al., 2017). The major task in this regard is finding a feature subset that can identify the anomalous behavior, termed feature selection (Hoque et al., 2014). Feature selection is the process of selecting relevant feature subsets for use in model construction (Liu et al., 2017). An appropriate selection of features may improve generalization, reduce training time, enhance the model's interpretability, improve accuracy and reduce prediction time.
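Purely as an illustration of such feature construction (and not the actual FDMA2012 feature set), the sketch below derives a few aggregate behavioral variables from a raw click log; the column names 'publisher_id', 'ip' and 'timestamp' are hypothetical:

```python
# Illustrative sketch only: aggregate feature variables of the kind a CFD model
# could derive from raw click-log records. The DataFrame schema assumed here
# ('publisher_id', 'ip', 'timestamp') is hypothetical, not the FDMA2012 schema.
import pandas as pd

def summarize_publisher_clicks(clicks: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-publisher behavioral features from a raw click log."""
    clicks = clicks.sort_values("timestamp")          # 'timestamp' assumed datetime64
    grouped = clicks.groupby("publisher_id")
    features = pd.DataFrame({
        "total_clicks": grouped.size(),               # overall click volume
        "unique_ips": grouped["ip"].nunique(),        # diversity of click sources
        "median_interclick_secs": grouped["timestamp"].apply(
            lambda t: t.diff().dt.total_seconds().median()  # burstiness proxy
        ),
    })
    features["clicks_per_ip"] = features["total_clicks"] / features["unique_ips"]
    return features.reset_index()
```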
Though several conventional FS methods exist, they result in high computational complexity: the feature space is vast for n features, which makes the search increasingly difficult as the number of features grows (Kohavi and John, 1997).
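To make the scale of that search concrete, an exhaustive wrapper search over n features would in principle have to evaluate every non-empty feature subset:

```latex
% Number of candidate feature subsets for n features
\sum_{k=1}^{n} \binom{n}{k} = 2^{n} - 1
```

Even n = 30 features already yields over a billion candidate subsets, which motivates narrower search strategies such as the accumulated evaluation over a distilled ranking used here.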
Therefore, this work proposes a hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), which differs from conventional FS in how it selects the best subset of features, as shown in Figure 1. A traditional FS method selects the best subset of features using statistical measures while focusing on individual features to identify their relative importance. However, a feature might not be significant on its own yet be an effective influencer when accumulated with other features. In comparison, the proposed FDAS feature selection approach finds the best subset of features in two phases. The first phase, feature distillation, identifies the highly correlated features and obtains an optimal subset using majority voting, which retains the highly voted, commonly ranked features. In the second phase, accumulated selection, FDAS overcomes the limitation of conventional FS methods through an iterative accumulation of features. The proposed hybrid FS approach evaluates the significance of features by accumulating a single feature at a time into the combination of features and training a model over it. It elects the combination of features that results in the model performing best according to the performance metric.
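A minimal sketch of this two-phase idea is given below, assuming scikit-learn-style utilities; the particular rankers (ANOVA F-score, mutual information, RFE), the classifier, the value of k and the vote threshold are illustrative assumptions, not the exact configuration reported for FDAS:

```python
# Minimal sketch of the two FDAS phases, assuming scikit-learn-style utilities.
# The rankers, classifier, value of k and vote threshold are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def feature_distillation(X, y, k=10, min_votes=2):
    """Phase 1: majority voting over the top-k features of several FS methods."""
    votes = np.zeros(X.shape[1], dtype=int)
    # Filter methods: each casts one vote per feature it ranks in its top k.
    for score_fn in (f_classif, mutual_info_classif):
        votes += SelectKBest(score_fn, k=k).fit(X, y).get_support().astype(int)
    # Wrapper method (recursive feature elimination) also casts its votes.
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y)
    votes += rfe.support_.astype(int)
    # Distilled subset: features voted for by at least `min_votes` methods,
    # ordered by vote count.
    return [i for i in np.argsort(-votes) if votes[i] >= min_votes]

def accumulated_selection(X, y, candidates, cv=5, scoring="f1"):
    """Phase 2: accumulate one candidate feature at a time; keep the best subset."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    current, best_subset, best_score = [], [], -np.inf
    for idx in candidates:
        current.append(idx)                           # accumulate the next feature
        score = cross_val_score(model, X[:, current], y, cv=cv, scoring=scoring).mean()
        if score > best_score:                        # retain the best accumulation
            best_subset, best_score = list(current), score
    return best_subset, best_score

# Usage on hypothetical arrays X (n_samples x n_features) and labels y:
# subset, _ = accumulated_selection(X, y, feature_distillation(X, y))
```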
The optimal features determined by FDAS were used to train standard classification models for fraudulent publisher classification. The proposed feature selection has three aims: (1) enhancing the predictive performance of models, (2) making the models cost-effective and (3) giving a better understanding of the procedure underlying data generation. The major contributions of this study are summarized as follows:
(1) Proposed a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), to find an optimal subset of features for fraudulent publisher classification.
(2) Majority voting is used to obtain a highly voted, commonly ranked relevant feature subset utilizing eight filter and wrapper feature selection methods.
(3) The feature accumulation process assesses the significance of an optimal subset of features toward designing a predictive model.
