“Less is more”. Mining useful features from Twitter user profiles for Twitter user classification in the public health domain

Published date: 17 December 2019
DOI: https://doi.org/10.1108/OIR-05-2019-0143
Pages: 213-237
Author: Ziqi Zhang, Georgica Bors
Subject Matter: Library & information science, Information behaviour & retrieval, Collection building & management, Bibliometrics, Databases, Information & knowledge management, Information & communications technology, Internet, Records management & preservation, Document management
Ziqi Zhang
Information School, University of Sheffield, Sheffield, UK, and
Georgica Bors
Department of Computer Science, University of Sheffield, Sheffield, UK
Abstract
Purpose This work studies automated user classification on Twitter in the public health domain, a task
that is essential to many public health-related research works on social media but has not been addressed.
The purpose of this paper is to obtain empirical knowledge on how to optimise the classifier performance on
this task.
Design/methodology/approach A sample of 3,100 Twitter users who tweeted about different health
conditions was manually coded into the six most common stakeholder types. The authors propose new, simple
features extracted from the short Twitter profiles of these users, and compare a large set of classification
models (including state-of-the-art ones) that use more complex features and different algorithms on this
data set.
Findings The authors show that user classification in the public health domain is a very challenging
task, as the best result the authors can obtain on this data set is only 59 per cent in terms of F1 score.
Compared to state-of-the-art, the methods can obtain significantly better (10 percentage points in F1 on a
best-against-best basis) results when using only a small set of 40 features extracted from the short
Twitter user profile texts.
Originality/value The work is the first to study the different types of users that engage in health-related
communication on social media, applicable to a broad range of health conditions rather than specific ones
studied in the previous work. The methods are implemented as open source tools, and together with data, are
the first of this kind. The authors believe these will encourage future research to further improve this
important task.
Keywords Social media, Machine learning, Twitter, Public health, Data science
Paper type Research paper
1. Introduction
In recent years, social media platforms such as discussion forums and social networks
have been growing rapidly as a channel for communication about, and engagement with, public
health-related matters. Among these, Twitter has become the most commonly used platform
for this purpose (Thackeray et al., 2012), due to its support for real-time dissemination of
information and personal opinions. Twitter is a social networking and microblogging
platform where users post and interact with messages, or tweets. It enables its users to
engage in effective and real-time information sharing and dialogic relationship building
with each other (Park et al., 2016). It offers interactive features such as the ability to follow
users to form networks, retweet (i.e. republish and reshare), quote, like and reply to tweets,
and to embed rich media including hyperlinks, multimedia, hashtags (a notion of topic)
as well as symbols within tweets.
Due to the potential of Twitter to provide insight into public views and opinions related
to health and the ability to retrieve data at little cost, it has become a valuable resource for
research (Moorhead et al., 2013). Currently, research based on Twitter in the health domain
can be generally divided into two types: one that studies health-related content shared on
Twitter, and the other that studies users who engage with such content.
Online Information Review, Vol. 44 No. 1, 2020, pp. 213-237
© Emerald Publishing Limited, 1468-4527
DOI 10.1108/OIR-05-2019-0143
Received 1 May 2019
Revised 21 September 2019
Accepted 19 November 2019
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/1468-4527.htm
This paper does not contain any studies with human participants performed by any of the authors
(nevertheless data collection from Twitter was still subject to the authors' institutions' internal
ethical approval).
The majority of previous work belongs to the research of content analysis. This covers
work that applies data mining to discover novel patterns that predict future events such
as disease outbreaks (Szomszor et al., 2012), or to enhance our existing knowledge in areas such as
pharmacovigilance (Ginn et al., 2014); studies that analyse the nature (e.g. content, quantity)
of information sharing concerning particular health conditions on Twitter (Thackeray et al.,
2012; Tsuya et al., 2014; Rosenkrantz et al., 2016); and research that aims to understand the
impact of such shared content in terms of engaging audience and growing communities
(Ferguson et al., 2014; Singh and John, 2015; Brady et al., 2017; Rabarison et al., 2017).
In contrast, work on user analysis in the health domain is very limited. This typically
involves user profiling based on demographic characteristics or interests. We argue that
this is an equally important area since the identification and characterisation of different
user types allow us to understand dominant or emerging topics, influential users, the
composition of a community and the information exchange patterns therein. Such
knowledge will allow us to better connect information seekers with providers, which will be
of key interest to public health stakeholders. For example, public health agencies and
healthcare providers can better target their audience for the promotion of information and
services; information seekers and service users can better find credible information to fulfil
their informational needs. While there exists a wealth of literature on social media user
profiling in general, this work is limited to either non-health contexts (Tinati et al., 2012; Uddin
et al., 2014), or specific health-related issues such as smoking and drug use (Kim et al.,
2017; Kursuncu et al., 2018). Methods and findings from these studies are ad hoc and not
directly applicable to the general public health domain.
In this work, we study the empirical task of automatically classifying Twitter users that
engage in health-related information sharing, using natural language processing (NLP) and
machine learning techniques. We refer to the different types of users as stakeholders,
representing different interests and information needs. Our contributions are empirical and
threefold. First, we present the first study on automatic user classification in the general public
health domain, whereas previous work only tackled single health conditions, with classification
schemes that are not applicable to other problems. We propose a generic classification scheme and
release both our code and data to foster further research in this area. Second, we provide a comparative
analysis of the popular machine learning algorithms and features used for social media user
classification on this specific task. We show empirically that this is a very challenging task,
as many well-established methods from other domains obtain only mediocre
results. Third, we propose a new method to capture useful features based on the short Twitter profile
texts of different stakeholders. Compared to state-of-the-art, such features are easier to
extract and are shown to be significantly more effective on this specific task: one of
our models using only 40 features significantly outperforms the best performing
state-of-the-art method (by 10 percentage points), which uses thousands of features extracted by complex
processes (e.g. topic modelling) from tweets, as well as additional corpora.
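To make the idea of simple profile-text features concrete, the following is a minimal sketch and not the method evaluated in this paper. It assumes a hypothetical set of cue words and hypothetical stakeholder labels (the actual classification scheme and the 40 features are described in Section 3), and it assumes a macro-averaged F1 for evaluation, which may differ from the averaging used in our experiments.

```python
import re
from collections import Counter

# Hypothetical cue words per stakeholder label. These labels and words are
# illustrative assumptions, NOT the paper's classification scheme or features.
CUES = {
    "health_professional": {"nurse", "doctor", "gp", "physician", "rn"},
    "charity": {"charity", "nonprofit", "fundraising", "donate"},
    "patient": {"survivor", "diagnosed", "living", "fighter"},
}


def profile_features(profile):
    """Count how many cue words of each label occur in a profile text."""
    tokens = re.findall(r"[a-z]+", profile.lower())
    counts = Counter(tokens)
    return {label: sum(counts[w] for w in cues) for label, cues in CUES.items()}


def predict(profile, default="other"):
    """Pick the label with the most cue-word hits, or `default` if none match."""
    feats = profile_features(profile)
    best = max(feats, key=feats.get)
    return best if feats[best] > 0 else default


def f1_per_class(gold, pred, label):
    """F1 for one label: harmonic mean of precision and recall."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the labels present in `gold`."""
    labels = sorted(set(gold))
    return sum(f1_per_class(gold, pred, label) for label in labels) / len(labels)
```

For instance, `predict("Registered nurse and mum of two")` matches the single cue word "nurse", while a profile with no cue words falls back to the default label. A trained classifier would replace the simple argmax rule; the sketch only illustrates that a handful of profile-derived counts can serve as the input representation.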
The rest of the paper is organised as follows. Section 2 presents a brief literature review.
Section 3 describes our methodology in detail. This is followed by Section 4 that presents
and discusses the results. Then, Section 5 discusses the limitations of this work, and
Section 6 concludes this paper with future research directions.
2. Background
We first discuss literature in the context of public health-related communication on Twitter.
This includes studies of both content analysis (Section 2.1) and user analysis (Section 2.2).
We then review related work from a methodological point of view, to cover automated user