A novel semi-supervised self-training method based on resampling for Twitter fake account identification

DOI: https://doi.org/10.1108/DTA-07-2021-0196
Published date: 29 November 2021
Pages: 409-428
Subject Matter: Library & information science, Librarianship/library management, Library technology, Information behaviour & retrieval, Metadata, Information & knowledge management, Information & communications technology, Internet
Author: Ziming Zeng, Tingting Li, Shouqiang Sun, Jingjing Sun, Jie Yin
Ziming Zeng and Tingting Li
School of Information Management, Wuhan University, Wuhan, China;
Center for Studies of Information Resources, Wuhan University, Wuhan, China and
Laboratory Center for Library and Information Science, Wuhan, China, and
Shouqiang Sun, Jingjing Sun and Jie Yin
School of Information Management, Wuhan University, Wuhan, China
Abstract
Purpose: Twitter fake accounts are bot accounts created by third-party organizations to influence public
opinion, spread commercial propaganda or impersonate others. Effective identification of bot accounts
helps the public judge disseminated information accurately. However, in actual fake account
identification, manually labeling Twitter accounts is expensive and inefficient, and the labeled data are
usually imbalanced across classes. To this end, the authors propose a novel framework to solve these problems.
Design/methodology/approach: In the proposed framework, the authors introduce semi-supervised
self-training and apply it to a real Twitter account data set from Kaggle. Specifically, the
authors first train a classifier on an initial small amount of labeled account data, then use the trained classifier
to automatically label large-scale unlabeled account data. Next, high-confidence instances are iteratively selected
from the unlabeled data to expand the labeled data. Finally, an expanded Twitter account training set is obtained.
Notably, a resampling technique is integrated into the self-training process, and the classes are
balanced at the initial stage of the self-training iteration.
Findings: The proposed framework effectively improves labeling efficiency and reduces the influence of
class imbalance. It shows excellent identification results with six different base classifiers, especially when
the initial set of labeled Twitter accounts is small.
Originality/value: This paper provides novel insights into identifying Twitter fake accounts. First, the
authors take the lead in introducing a self-training method to automatically label Twitter accounts in a
semi-supervised setting. Second, the resampling technique is integrated into the self-training process to
effectively reduce the influence of class imbalance on identification performance.
Keywords: Bot accounts, Class imbalance data, Semi-supervised learning, Self-training method,
Resampling technique
Paper type: Research paper
1. Introduction
Twitter is one of the most popular social media platforms, and its rapid development has greatly
lowered the threshold for public information release. Information of many kinds spreads rapidly
and has different effects as it does so. Many illegal technical organizations exploit these
characteristics to register large numbers of bot accounts. These bot accounts indirectly
stimulate and incite radical people by hijacking hot topics, thus forming a huge fake public
opinion field (Al-Rawi et al., 2019; Zhang et al., 2019). Therefore, seeking out these unwelcome
bots has become part of the Internet's collective agenda.
The challenge of bot account identification has been taken seriously by our research
community (Subrahmanian et al., 2016). Various approaches have been proposed to detect
social bots (Ferrara et al., 2016). The research on bot account identification mainly adopts
supervised learning methods: examples of human accounts and bots, labeled as such, are
fed to machine learning algorithms, and the trained models are then used to classify unseen
accounts. Pozzana and Ferrara (2020) analyzed short-term behavioral trends in human accounts
and found that these trends have a cognitive origin, while bots show no such trends
because of the automatic nature of their activities. These findings were then used
to create and evaluate a machine-learning algorithm to distinguish the activities generated by bots
from those generated by humans. Varol et al. (2017) extracted more than a thousand features from the public data
and metadata of users (i.e. friends, tweet content and sentiment, network patterns, activity
time series, etc.). Then, manually labeled Twitter accounts were used to test and evaluate the
classification performance of traditional supervised algorithms. Perdana and
Muliawati (2015) manually reviewed and labeled crawled Twitter accounts and proposed a new
method to distinguish spammers from legitimate accounts by using time
interval entropy and tweet similarity. Although these methods are quite successful,
supervised learning methods for bot account identification are expensive in that they require
experts to scrutinize huge amounts of data to produce the input for model training.

Funding: This paper is supported by the National Social Science Fund of China (No. 21BTQ046).
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/2514-9288.htm
Received 27 July 2021
Revised 26 September 2021
Accepted 22 October 2021
Data Technologies and Applications
Vol. 56 No. 3, 2022
pp. 409-428
© Emerald Publishing Limited
2514-9288
DOI 10.1108/DTA-07-2021-0196
More conservative studies estimate that the number of bots on Twitter lies somewhere
between 5% and 9% of the overall population (Morstatter and Nazer, 2016). This means that in
the actual identification of fake accounts, the number of bot accounts will be relatively small, resulting
in a serious class imbalance in the training set. We have observed that most of the training sets
used in previous research on fake account identification have the problem of class imbalance.
For example, Loyola et al. (2019) used a pattern-based classification mechanism for social bot
detection. The data set used to verify the mechanism followed an
imbalanced class distribution. Kudugunta and Ferrara (2018) proposed a deep neural network based on
a contextual long short-term memory architecture to detect bots at the tweet level. However, the
number of tweets generated by legitimate accounts was about 2.4 times that of bot accounts, which led to
more bot accounts being ignored in the identification process. In the training data collected in
Morstatter et al. (2016), the fraction of users who were identified as bots only accounted for
about 7.5%. Therefore, Morstatter et al. (2016) proposed a BoostOR model to balance the recall
and precision of bot accounts, which reduced the negative influence of class imbalance on
identification results. Clearly, it is necessary to deal effectively with the imbalanced
data set, which can greatly improve the identification efficiency of bot accounts.
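
To make the cost of class imbalance concrete, the following minimal sketch (not taken from the paper) scores a trivial classifier that labels every account as human. It assumes a hypothetical set of 1,000 accounts in which 7.5% are bots, mirroring the fraction reported for Morstatter et al. (2016); the function name `bot_metrics` and the sample size are illustrative assumptions.

```python
def bot_metrics(y_true, y_pred, positive="bot"):
    """Accuracy plus precision and recall for the positive (bot) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when no bots are predicted or present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical data: 75 bots and 925 humans (7.5% bots).
y_true = ["bot"] * 75 + ["human"] * 925
# A degenerate classifier that never predicts "bot".
y_pred = ["human"] * 1000

metrics = bot_metrics(y_true, y_pred)
print(metrics)
```

Under these assumptions the classifier reaches an accuracy of 0.925 while its recall on bot accounts is 0, which is why accuracy alone can hide the failure to identify the minority class.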
All the studies mentioned above rely on crawling and collecting the characteristic
information of multiple Twitter accounts and then manually labeling whether they are bot
accounts to obtain training data. However, the training data obtained in this way have two
limitations. First, the manual labeling process is inefficient and costly, which limits the
size of the effective training data. Second, in the actual research of fake account identification,
researchers have to contend with a serious imbalance between the distribution of bot accounts
and human accounts. This imbalanced instance distribution makes the identification
accuracy for bot accounts much lower than that for human accounts. Therefore, we take the lead in
approaching the problem from the perspective of semi-supervised learning (SSL), which almost no
prior work has done, and propose a
semi-supervised self-training framework based on the resampling technique (SSSTR) to break
through the above limitations. The main contributions of this framework are as follows:
(1) A semi-supervised self-training method is introduced to automatically expand the
labeled Twitter accounts, which greatly improves the labeling efficiency and reduces
the labeling cost.
(2) The improved self-training method based on the resampling technique is used to
balance the class distribution of bot accounts and human accounts.
(3) The proposed framework is tested on a real Twitter account data set, and the
experimental results show that the framework achieves better performance.
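
The self-training loop with resampling described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' SSSTR implementation: the base classifier is a stand-in nearest-centroid model, the resampling step is plain random oversampling applied at the start of every iteration (one possible reading of the description), and the confidence score, the `threshold` value, all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def oversample(X, y, rng):
    """Random oversampling: duplicate minority rows until all classes match the largest."""
    counts = Counter(y)
    target = max(counts.values())
    X_bal, y_bal = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_bal.append(rng.choice(rows))
            y_bal.append(label)
    return X_bal, y_bal


def fit_centroids(X, y):
    """Stand-in base classifier (an assumption): one mean feature vector per class."""
    sums, counts = {}, Counter(y)
    for x, lab in zip(X, y):
        acc = sums.setdefault(lab, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_with_confidence(centroids, x):
    """Return (label, confidence), where confidence is a normalized inverse distance."""
    weights = {}
    for lab, c in centroids.items():
        dist = sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5
        weights[lab] = 1.0 / (dist + 1e-9)
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_iter=10, seed=0):
    """Grow the labeled set with high-confidence pseudo-labels from the unlabeled pool."""
    rng = random.Random(seed)
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_iter):
        # Balance the classes before training in this iteration.
        X_bal, y_bal = oversample(X_lab, y_lab, rng)
        model = fit_centroids(X_bal, y_bal)
        remaining, added = [], 0
        for x in pool:
            label, conf = predict_with_confidence(model, x)
            if conf >= threshold:
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:
            break  # nothing confident enough was found, or the pool is exhausted
    return X_lab, y_lab, pool


# Toy two-feature accounts (hypothetical values): three humans and one bot are labeled.
X_lab = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0)]
y_lab = ["human", "human", "human", "bot"]
X_unlab = [(0.1, 0.1), (4.9, 5.1), (5.2, 4.8), (2.5, 2.5)]

X_new, y_new, leftover = self_train(X_lab, y_lab, X_unlab)
print(Counter(y_new), leftover)
```

In this toy run the three unlabeled accounts lying close to a class centroid are pseudo-labeled and added, while the ambiguous account at (2.5, 2.5) stays in the unlabeled pool. Any classifier that outputs class probabilities could replace the centroid model, and other resampling techniques could replace random oversampling.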