A corpus of debunked and verified user-generated videos

Publication Date: 11 February 2019
Author: Olga Papadopoulou, Markos Zampoglou, Symeon Papadopoulos, Ioannis Kompatsiaris
Subject: Library & information science, Information behaviour & retrieval, Collection building & management, Bibliometrics, Databases, Information & knowledge management, Information & communications technology, Internet, Records management & preservation, Document management
Olga Papadopoulou, Markos Zampoglou, Symeon Papadopoulos and
Ioannis Kompatsiaris
Centre for Research and Technology Hellas, Information Technologies Institute,
Thermi, Greece
Purpose: As user-generated content (UGC) is entering the news cycle alongside content captured by news
professionals, it is important to detect misleading content as early as possible and avoid disseminating it. The
purpose of this paper is to present an annotated dataset of 380 user-generated videos (UGVs), 200 debunked
and 180 verified, along with 5,195 near-duplicate reposted versions of them, and a set of automatic verification
experiments aimed to serve as a baseline for future comparisons.
Design/methodology/approach: The dataset was formed using a systematic process combining text
search and near-duplicate video retrieval, followed by manual annotation using a set of journalism-inspired
guidelines. Following the formation of the dataset, the automatic verification step was carried out using
machine learning over a set of well-established features.
Findings: Analysis of the dataset shows distinctive patterns in the spread of verified vs debunked videos,
and the application of state-of-the-art machine learning models shows that the dataset poses a particularly
challenging problem to automatic methods.
Research limitations/implications: Practical limitations constrained the current collection to three
platforms: YouTube, Facebook and Twitter. Furthermore, there exists a wealth of information that can be
drawn from the dataset analysis, which goes beyond the constraints of a single paper. Extension to other
platforms and further analysis will be the object of subsequent research.
Practical implications: The dataset analysis indicates directions for future automatic video verification
algorithms, and the dataset itself provides a challenging benchmark.
Social implications: Having a carefully collected and labelled dataset of debunked and verified videos is
an important resource both for developing effective disinformation-countering tools and for supporting media
literacy activities.
Originality/value: Besides its importance as a unique benchmark for research in automatic verification,
the analysis also allows a glimpse into the dissemination patterns of UGC, and possible telltale differences
between fake and real content.
Keywords: Video verification, Fake news, Disinformation detection, User-generated content, Social media
Paper type: Research paper
1. Introduction
User-generated content (UGC), i.e. media content generated by non-professional bystanders
during unfolding newsworthy events, has become an essential component of evolving news
stories. The ubiquity of capturing devices means that it is very likely that bystanders are
capturing relevant content and sharing it through various web and social media platforms.
News professionals are pressed by competition to integrate such content in their stories, but
verifying it first is essential to any news provider's reputation (Hermida and Thurman, 2008).
Automatic and semi-automatic tools have the potential to considerably ease and speed
up the verification of UGC.
News content verification through automated means is a relatively young field,
comprising a set of distinct disciplines, including rumour analysis (Zubiaga et al., 2018),
multimedia forensics (Zampoglou et al., 2017), classification of social media content
(Castillo et al., 2011), web mining and multimedia retrieval (Xie et al., 2011). A recent survey
(Kumar and Shah, 2018) presented an analysis of known patterns of disinformation
dissemination and approaches to the automatic detection of false information.
Online Information Review, Vol. 43 No. 1, 2019, pp. 72-88
© Emerald Publishing Limited
DOI 10.1108/OIR-03-2018-0101
Received 20 March 2018; Revised 17 July 2018 and 2 October 2018; Accepted 4 October 2018
The current issue and full text archive of this journal is available on Emerald Insight at:
This work has been supported by the InVID project, partially funded by the European Commission
under Contract No. H2020-687786.
This paper forms part of a special section "Social media mining for journalism".
Datasets are an important asset for understanding and addressing the problem of news
content verification, and range from collections of tampered multimedia content and social
media posts, to rumours, i.e. cascades of unverified information. Carefully designed
datasets may contribute both to better understanding the patterns of disinformation
dissemination and to training and evaluating automatic detection systems.
This paper deals with user-generated video (UGV) verification, specifically with the
effort to discern whether a suspect video conveys factual information or disinformation, in
other words, for the sake of brevity, whether the video is "real" or "fake". The paper presents the
first large-scale video verification dataset, consisting of 380 videos and their 5,195
near-duplicates collected from YouTube (YT), Facebook (FB) and Twitter (TW), including a
number of "fake" and "real" UGVs and numerous other versions of those videos that were
subsequently posted online. The dataset is supplemented with 77,258 tweets that contain
links to the dataset's videos. The dataset, named Fake Video Corpus 2018 (FVC-2018), which
has been made publicly available[1], was gathered using a systematic process and can
provide insights into the nature of disinformation, and the types of "fake" and "real"
content circulating on the web. It is also intended to serve as a benchmark for automatic content
verification methods.
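As an illustration of the structure described above, an annotated video and its near-duplicate reposts could be grouped into a single record. The following is only a minimal sketch: the class names, field names and label strings are hypothetical and are not taken from the released FVC-2018 files.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical record layout; names are illustrative assumptions,
# not the schema of the released FVC-2018 files.
PLATFORMS = {"YT", "FB", "TW"}  # YouTube, Facebook, Twitter


@dataclass
class Post:
    url: str
    platform: str  # one of PLATFORMS

    def __post_init__(self):
        if self.platform not in PLATFORMS:
            raise ValueError(f"unknown platform: {self.platform}")


@dataclass
class Cascade:
    """One annotated video together with its near-duplicate reposts."""
    original: Post
    label: str  # assumed labels: "fake" (debunked) or "real" (verified)
    near_duplicates: list = field(default_factory=list)

    def platform_counts(self):
        """Count all versions of the video (original included) per platform."""
        posts = [self.original, *self.near_duplicates]
        return Counter(p.platform for p in posts)


def summarise(cascades):
    """Return (number of annotated videos, number of near-duplicates, label counts)."""
    labels = Counter(c.label for c in cascades)
    n_duplicates = sum(len(c.near_duplicates) for c in cascades)
    return len(cascades), n_duplicates, labels
```

Grouping reposts under the video they copy keeps the per-video label in one place and makes per-platform spread easy to count.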
2. Related work
The area of multimedia verification consists of several fields of study, tackling various
aspects of the problem from different viewpoints.
2.1 Multimedia forensics
A large part of related research concerns tampering detection and image/video forensics
algorithms. Proposed algorithms attempt to detect and localise image modifications, either
actively by embedding watermarks in multimedia content and monitoring their integrity
(Dadkhah et al., 2014; Botta et al., 2015), or passively by searching for telltale self-repetitions
(Zandi et al., 2016; Ferreira et al., 2016) or inconsistencies in the image. Such inconsistencies
may appear in the pixel domain or the compressed domain depending on the specific
process of tampering. A recent survey and evaluation of such algorithms can be found in
Zampoglou et al. (2017). Generally, such content-based approaches suffer from a number of
issues that often render them inapplicable. One problem is their limited robustness with
respect to image transformations. When the images or videos are recompressed or rescaled,
as is often the case with social media uploads, the traces of the tampering tend to
disappear (Zampoglou et al., 2017). Another limitation is that such approaches are only
relevant in specific cases of disinformation. There are cases where a multimedia item is used
to convey false information not by altering its content but by altering its context. One
typical such tactic is to reuse content from a past event and present it as if it were
captured during a current one. Another is to misrepresent the content, e.g. the location
where it was taken or the identities of depicted people. In such cases, an approach must be
able to evaluate the context of the post (e.g. the profile of the uploader, the linguistic
characteristics of the accompanying post or the collective characteristics of all posts sharing
the same item) rather than its actual content.
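The contextual cues mentioned above (uploader profile, linguistic characteristics of the accompanying text) can be turned into a simple feature vector for a classifier. The sketch below is a minimal illustration under stated assumptions: the function name, its inputs and the chosen features are hypothetical and are not the feature set used in this paper.

```python
import re


def context_features(title, uploader_followers, uploader_account_age_days):
    """Compute a few illustrative context features for a video post.

    All feature definitions here are assumptions made for illustration.
    """
    words = re.findall(r"[A-Za-z']+", title)
    n_words = len(words)
    # Words written entirely in capitals (ignoring single letters).
    n_upper = sum(1 for w in words if len(w) > 1 and w.isupper())
    return {
        "title_length": len(title),
        "num_words": n_words,
        "frac_uppercase_words": n_upper / n_words if n_words else 0.0,
        "num_exclamations": title.count("!"),
        "has_question_mark": "?" in title,
        "uploader_followers": uploader_followers,
        "uploader_account_age_days": uploader_account_age_days,
    }
```

Such features describe how and by whom an item was posted rather than what it depicts, which is why they remain usable when the video itself is unaltered.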
2.2 Automated fact checking
In automated fact checking (Hassan et al., 2015), statements are isolated and their
veracity is evaluated using reliable databases providing structured knowledge such as