Optimized discovery of discourse topics in social media: science communication about COVID-19 in Brazil
| Date | 23 September 2024 |
| Pages | 180-198 |
| DOI | https://doi.org/10.1108/DTA-03-2024-0283 |
| Published date | 23 September 2024 |
| Author | Bernardo Cerqueira de Lima, Renata Maria Abrantes Baracho, Thomas Mandl, Patricia Baracho Porto |
Bernardo Cerqueira de Lima and Renata Maria Abrantes Baracho
Federal University of Minas Gerais, Belo Horizonte, Brazil
Thomas Mandl
University of Hildesheim, Hildesheim, Germany, and
Patricia Baracho Porto
Pontifical Catholic University of Minas Gerais, Belo Horizonte, Brazil
Abstract
Purpose – Social media platforms that disseminated scientific information to the public during the COVID-19
pandemic highlighted the importance of scientific communication. Content creators in the field, as
well as researchers who study the impact of scientific information online, are interested in how people react to
these information resources and how they judge them. This study aims to devise a framework for processing
large social media datasets and finding specific feedback on content delivery, enabling scientific content creators
to gain insights into how the public perceives scientific information.
Design/methodology/approach – To collect public reactions to scientific information, the study focused on
Twitter users who are doctors, researchers, science communicators or representatives of research institutes,
and processed their replies for two years from the start of the pandemic. The study aimed to develop a
solution powered by topic modeling, enhanced by manual validation and other machine learning techniques
such as word embeddings, that is capable of filtering massive social media datasets in search of documents
related to reactions to scientific communication. The architecture developed in this paper can be replicated for
finding documents related to any niche topic in social media data. As a final step of our framework, we also
fine-tuned a large language model to perform the classification task with even higher accuracy,
removing the need for further human validation after the first step.
Findings – We provide a framework that receives a large document dataset and, with a small degree of
human validation at different stages, is able to identify the documents within the corpus
that are relevant to a very underrepresented niche theme, with much higher precision than
traditional state-of-the-art machine learning algorithms. Performance was improved even further by
fine-tuning a large language model based on BERT, which would allow such a model to classify even
larger unseen datasets in search of reactions to scientific communication without the need for further manual
validation or topic modeling.
Research limitations/implications – The challenges of scientific communication are heightened by the
rampant increase of misinformation on social media and by the difficulty of competing in the saturated attention
economy of the social media landscape. Our study aimed to create a solution that scientific
content creators could use to better locate and understand constructive feedback on their content and how it is
received, which can be hidden as a minor subject among hundreds of thousands of comments. By leveraging
an ensemble of techniques ranging from heuristics to state-of-the-art machine learning algorithms, we created
a framework that is able to detect texts related to very niche subjects in very large datasets, given only a small
number of example texts related to the subject as input.
This work was funded by the Volkswagen Foundation in Germany (Volkswagenstiftung) with the
grant A133902 (Project Information Behavior and Media Discourse during the Corona Crisis: An
interdisciplinary Analysis – InDisCo). Further financial support was provided by the Coordination for
the Improvement of Higher Education Personnel (CAPES) from Brazil.
These authors contributed equally to this work.
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/2514-9288.htm
Received 4 March 2024
Revised 14 July 2024
Accepted 13 August 2024
Data Technologies and
Applications
Vol. 59 No. 1, 2025
pp. 180-198
© Emerald Publishing Limited
2514-9288
DOI 10.1108/DTA-03-2024-0283
Practical implications – With this tool, scientific content creators can sift through their social media
following and quickly understand how to adapt their content to their current users' needs and standards of
content consumption.
Originality/value – This study aimed to find reactions to scientific communication in social media. We
applied three methods with human intervention and compared their performance. This study shows, for the
first time, the topics of interest that were discussed in Brazil during the COVID-19 pandemic.
Keywords Pandemic, Topic modeling, Machine learning, Science communication
Paper type Research paper
1. Introduction
The COVID-19 pandemic allowed the observation of information-seeking behavior and the
effects of the availability of diverse information resources on the population and individuals.
The acceptance of information might even have had an impact on individual health, as a
study on Brazil suggests (Burni et al., 2023). Social media platforms have emerged as crucial
channels for diverse scientific communication content, addressing the unique informational
demands that arise in times of crisis. Much misinformation was observed online during the
COVID-19 crisis (Langguth et al., 2023), and responses applying AI have been
developed (Nakov et al., 2022).
Recipients of medical and other crucial information face a considerable challenge in
discerning its reliability and trustworthiness (Barnwal et al., 2019). The state of knowledge in
a society depends on the available sources, and during the COVID-19 crisis much
distorted information was available (Campolino et al., 2022). The design of scientific
communication artifacts is decisive for human information-seeking behavior during
crises (Soroya et al., 2021). Media creators, in particular, must learn to discern the different
criteria by which their content and communication are consumed and judged by their
followers (Jaki, 2021). The design of multimodal scientific information can be manifold, and
many options exist. Further exploration of optimal methods for disseminating scientific
information is still necessary (Rodríguez Estrada and Davis, 2015).
This study zeroes in on dissecting online discussions surrounding the COVID-19 crisis,
with a specific emphasis on science communication. The primary objective is to create a
solution that aids creators of social media channels in navigating a large volume of feedback
to their content, helping them filter out constructive feedback in regard to their own method
of communication and the design of their products. Our focus is directed toward identifying
and analyzing a distinct subset of comments posted in science communication channels in
response to the presented content. Such an approach has not been carried out for science
communication to the best of our knowledge. For the purposes of this study, 1.12 million
tweets were collected, constituting comments from a network composed of Brazilian
scientists, governmental bodies, doctors and scientific communicators. It is noteworthy that
the majority of these comments are part of the broader discourse on the COVID-19 crisis and
consist of political viewpoints and general crisis-related commentary. They are therefore
not relevant for our study. This is a consequence of the discursive character of social media
platforms such as X (formerly Twitter).
Initial efforts to filter this data involved keyword search. However, the content is too
diverse to allow a successful search based on a few words. As a next step,
traditional topic modeling algorithms, such as latent Dirichlet allocation (LDA) (Blei et al.,
2003), were applied. However, these methods revealed inadequacies in handling niche topics
within short documents and large data collections (Cerqueira de Lima et al., 2023; Mandl et al.,
2023). Following manual validation of the topic model, an ensemble method for document
filtering was devised. This method involved constructing a word dictionary composed of the
most relevant words in topics related to scientific communication and their top-n closest
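The dictionary-based filtering idea outlined above can be illustrated with a minimal sketch: a seed dictionary is expanded with each word's top-n nearest neighbours in an embedding space, and documents are kept when they contain dictionary words. The toy vectors, the English example words, the function names and the `min_hits` threshold are all assumptions made for illustration. This is not the authors' implementation, and a real setting would use trained word embeddings over the Portuguese tweet corpus.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, illustration only).
EMBEDDINGS = {
    "video":       [0.90, 0.10, 0.00],
    "explanation": [0.80, 0.30, 0.10],
    "didactic":    [0.70, 0.40, 0.00],
    "thread":      [0.85, 0.20, 0.05],
    "president":   [0.00, 0.10, 0.90],
    "election":    [0.10, 0.00, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_dictionary(seed_words, embeddings, top_n):
    """Return the seed words plus each seed's top_n closest words."""
    expanded = set(seed_words)
    for seed in seed_words:
        if seed not in embeddings:
            continue  # seeds without a vector cannot be expanded
        neighbours = sorted(
            (w for w in embeddings if w != seed),
            key=lambda w: cosine(embeddings[seed], embeddings[w]),
            reverse=True,
        )
        expanded.update(neighbours[:top_n])
    return expanded


def filter_documents(documents, dictionary, min_hits=1):
    """Keep documents containing at least min_hits dictionary words."""
    kept = []
    for doc in documents:
        tokens = doc.lower().split()
        hits = sum(1 for token in tokens if token in dictionary)
        if hits >= min_hits:
            kept.append(doc)
    return kept


dictionary = expand_dictionary(["explanation"], EMBEDDINGS, top_n=2)
docs = ["Great thread thank you", "The election was unfair"]
relevant = filter_documents(docs, dictionary)
```

In this sketch, raising `top_n` widens the dictionary and tends to trade precision for recall, which is one reason a manual validation step on the expanded word list may be useful.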