A novel semi-supervised self-training method based on resampling for Twitter fake account identification

DOI	https://doi.org/10.1108/DTA-07-2021-0196
Published date	29 November 2021
Pages	409-428
Subject Matter	Library & information science, Librarianship/library management, Library technology, Information behaviour & retrieval, Metadata, Information & knowledge management, Information & communications technology, Internet
Author	Ziming Zeng, Tingting Li, Shouqiang Sun, Jingjing Sun, Jie Yin

A novel semi-supervised self-training method based on resampling for Twitter fake account identification

Ziming Zeng and Tingting Li

School of Information Management, Wuhan University, Wuhan, China;

Center for Studies of Information Resources, Wuhan University, Wuhan, China and

Laboratory Center for Library and Information Science, Wuhan, China, and

Shouqiang Sun, Jingjing Sun and Jie Yin

School of Information Management, Wuhan University, Wuhan, China

Abstract

Purpose – Twitter fake accounts refer to bot accounts created by third-party organizations to influence public opinion, spread commercial propaganda or impersonate others. The effective identification of bot accounts helps the public accurately judge the information being disseminated. However, in actual fake account identification, manually labeling Twitter accounts is expensive and inefficient, and the labeled data are usually imbalanced across classes. To this end, the authors propose a novel framework to solve these problems.

Design/methodology/approach – In the proposed framework, the authors introduce the concept of semi-supervised self-training learning and apply it to a real Twitter account data set from Kaggle. Specifically, the authors first train the classifier on the initial small amount of labeled account data, then use the trained classifier to automatically label large-scale unlabeled account data. Next, high-confidence instances are iteratively selected from the unlabeled data to expand the labeled data. Finally, an expanded Twitter account training set is obtained. Notably, the resampling technique is integrated into the self-training process, and the class distribution is balanced at the initial stage of the self-training iteration.

Findings – The proposed framework effectively improves labeling efficiency and reduces the influence of class imbalance. It shows excellent identification results on six different base classifiers, especially when the initial set of labeled Twitter accounts is small.

Originality/value – This paper provides novel insights into identifying Twitter fake accounts. First, the authors take the lead in introducing a self-training method that automatically labels Twitter accounts in a semi-supervised setting. Second, the resampling technique is integrated into the self-training process to effectively reduce the influence of class imbalance on identification performance.

Keywords Bot accounts, Class imbalance data, Semi-supervised learning, Self-training method, Resampling technique

Paper type Research paper

1. Introduction

Twitter is one of the most popular social media platforms, and its rapid development has greatly lowered the threshold for publishing information. Diverse information can have different effects as it spreads rapidly. Many illegal technical organizations exploit these characteristics to register large numbers of bot accounts. These bot accounts indirectly stimulate and incite radical people by hijacking hot topics, thus forming a huge field of fake public opinion (Al-Rawi et al., 2019; Zhang et al., 2019). Therefore, seeking out these unwelcome bots has become part of the Internet’s collective agenda.

The challenge of bot account identification has been taken seriously by our research community (Subrahmanian et al., 2016). Various approaches have been proposed to detect social bots (Ferrara et al., 2016). Research on bot account identification mainly adopts supervised learning methods: examples of human accounts and bots, labeled as such, are fed to machine learning algorithms, and the trained models are then used to classify unseen accounts. Pozzana and Ferrara (2020) analyzed short-term behavioral trends in human accounts and found that these trends are related to a cognitive origin, while bots show no such trends because of the automatic nature of their activities. These findings were then codified to create and evaluate a machine-learning algorithm that distinguishes activity generated by bots from activity generated by humans. Varol et al. (2017) extracted more than a thousand features from the public data and metadata of users (i.e. friends, tweet content and sentiment, network patterns, activity time series, etc.). Manually labeled Twitter accounts were then used to test and evaluate the classification performance of traditional supervised algorithms. Perdana and Muliawati (2015) manually reviewed and labeled crawled Twitter accounts and proposed a new method to distinguish spammers from legitimate accounts using time interval entropy and tweet similarity. Although these methods are quite successful, supervised approaches to bot account identification are expensive because they require experts to scrutinize huge amounts of data as input for model training.

Funding: This paper is supported by the National Social Science Fund of China (No. 21BTQ046).

The current issue and full text archive of this journal is available on Emerald Insight at: https://www.emerald.com/insight/2514-9288.htm

Received 27 July 2021; Revised 26 September 2021; Accepted 22 October 2021

Data Technologies and Applications, Vol. 56 No. 3, 2022, pp. 409-428, 2514-9288, DOI 10.1108/DTA-07-2021-0196

More conservative studies estimate that the number of bots on Twitter lies somewhere between 5% and 9% of the overall population (Morstatter and Nazer, 2016). This means that in the actual identification of fake accounts, the number of bot accounts will not be large, resulting in a serious class imbalance in the training set. We have observed that most of the training sets used in previous research on fake account identification suffer from class imbalance. For example, Loyola et al. (2019) used a pattern-based classification mechanism for social bot detection. The data set used to verify the mechanism had an imbalanced class distribution. Kudugunta and Ferrara (2018) proposed a deep neural network based on a contextual long short-term memory architecture to detect bots at the tweet level. However, the number of tweets generated by legitimate accounts was about 2.4 times that of bot accounts, which led to more bot accounts being ignored in the identification process. In the training data collected in Morstatter et al. (2016), the fraction of users identified as bots was only about 7.5%. Therefore, Morstatter et al. (2016) proposed a BoostOR model to balance the recall and precision of bot accounts, which reduced the negative influence of class imbalance on identification results. Clearly, the imbalanced data set must be dealt with effectively, which can greatly improve the identification efficiency of bot accounts.
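To make the cost of class imbalance concrete, the following minimal sketch scores a trivial classifier that labels every account as human. The sample of 1,000 accounts is hypothetical; only the 7.5% bot share is taken from the figure quoted above, and the function name `detection_scores` is an illustrative choice rather than anything from the cited studies.

```python
def detection_scores(y_true, y_pred, positive="bot"):
    """Return (accuracy, precision, recall), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Convention assumed here: a ratio with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical sample: 1,000 accounts, 75 of them bots (7.5%).
y_true = ["bot"] * 75 + ["human"] * 925
always_human = ["human"] * 1000  # a classifier that never flags a bot

print(detection_scores(y_true, always_human))  # (0.925, 0.0, 0.0)
```

The accuracy of 92.5% looks strong, yet not a single bot is found, which is why recall and precision on the bot class are the more informative measures when the classes are imbalanced.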

All the studies mentioned above rely on crawling and collecting the characteristic information of multiple Twitter accounts and then manually labeling whether they are bot accounts to obtain training data. However, training data obtained in this way have two limitations. First, the manual labeling process is inefficient and costly, which limits the size of the effective training data. Second, in actual research on fake account identification, researchers have to contend with a serious imbalance between the distributions of bot accounts and human accounts. This imbalanced instance distribution makes the identification accuracy for bot accounts much lower than that for human accounts. Therefore, we take the lead in approaching the problem from the perspective of semi-supervised learning (SSL), which almost no one has done so far, and propose a semi-supervised self-training framework based on the resampling technique (SSSTR) to break through the above limitations. The main contributions of this framework are as follows:

(1) A semi-supervised self-training method is introduced to automatically expand the labeled Twitter accounts, which greatly improves the labeling efficiency and reduces the labeling cost.

(2) The improved self-training method based on the resampling technique is used to balance the class distribution of bot accounts and human accounts.

(3) The proposed framework is tested on a real Twitter account data set, and the experimental results show that the framework achieves better performance.
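The idea behind contributions (1) and (2) can be illustrated with a minimal sketch of self-training combined with resampling. It is not the authors' SSSTR implementation: the nearest-centroid base classifier, the inverse-distance confidence score, random oversampling as the resampling step, rebalancing at the start of every round, the 0.8 confidence threshold and the toy two-feature data are all assumptions made for illustration.

```python
import random
from collections import Counter


def oversample(X, y, rng):
    """Random oversampling: duplicate minority instances until all classes are equal in size."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            Xb.append(row)
            yb.append(label)
    return Xb, yb


class CentroidClassifier:
    """Toy base classifier (assumption): nearest class centroid."""

    def fit(self, X, y):
        counts, sums = Counter(y), {}
        for xi, yi in zip(X, y):
            acc = sums.setdefault(yi, [0.0] * len(xi))
            for j, v in enumerate(xi):
                acc[j] += v
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_proba(self, x):
        # Inverse-distance weights, normalised so the class scores sum to 1.
        weights = {}
        for c, mu in self.centroids.items():
            dist = sum((a - b) ** 2 for a, b in zip(x, mu)) ** 0.5
            weights[c] = 1.0 / (dist + 1e-9)
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def predict(self, x):
        proba = self.predict_proba(x)
        return max(proba, key=proba.get)


def self_train(X_lab, y_lab, X_unl, threshold=0.8, max_iter=10, seed=0):
    """Grow the labeled set with confident pseudo-labels, rebalancing before each fit."""
    rng = random.Random(seed)
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(max_iter):
        Xb, yb = oversample(X_lab, y_lab, rng)  # balance classes before training
        clf = CentroidClassifier().fit(Xb, yb)
        remaining, added = [], 0
        for x in pool:
            proba = clf.predict_proba(x)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:  # keep only high-confidence pseudo-labels
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:
            break
    Xb, yb = oversample(X_lab, y_lab, rng)
    return CentroidClassifier().fit(Xb, yb), X_lab, y_lab, pool


# Toy data (hypothetical): six labeled humans, two labeled bots, five unlabeled accounts.
X_lab = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (5, 5), (5, 6)]
y_lab = ["human"] * 6 + ["bot"] * 2
X_unl = [(0.1, 0.2), (0.9, 0.4), (5.2, 5.4), (4.8, 5.9), (2.7, 2.9)]

clf, X_out, y_out, pool = self_train(X_lab, y_lab, X_unl)
print(Counter(y_out), "still unlabeled:", pool)
```

In this toy run the accounts close to either cluster receive pseudo-labels, while the ambiguous account midway between the clusters stays unlabeled because its confidence never reaches the threshold.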
