A corpus of debunked and verified user-generated videos

Pages: 72-88
DOI: https://doi.org/10.1108/OIR-03-2018-0101
Publication date: 11 February 2019
Authors: Olga Papadopoulou, Markos Zampoglou, Symeon Papadopoulos, Ioannis Kompatsiaris
Subjects: Library & information science; Information behaviour & retrieval; Collection building & management; Bibliometrics; Databases; Information & knowledge management; Information & communications technology; Internet; Records management & preservation; Document management
Olga Papadopoulou, Markos Zampoglou, Symeon Papadopoulos and
Ioannis Kompatsiaris
Centre for Research and Technology Hellas, Information Technologies Institute,
Thermi, Greece
Abstract
Purpose: As user-generated content (UGC) is entering the news cycle alongside content captured by news professionals, it is important to detect misleading content as early as possible and avoid disseminating it. The purpose of this paper is to present an annotated dataset of 380 user-generated videos (UGVs), 200 debunked and 180 verified, along with 5,195 near-duplicate reposted versions of them, and a set of automatic verification experiments intended to serve as a baseline for future comparisons.
Design/methodology/approach: The dataset was formed using a systematic process combining text search and near-duplicate video retrieval, followed by manual annotation using a set of journalism-inspired guidelines. Following the formation of the dataset, the automatic verification step was carried out using machine learning over a set of well-established features.
Findings: Analysis of the dataset shows distinctive patterns in the spread of verified vs debunked videos, and the application of state-of-the-art machine learning models shows that the dataset poses a particularly challenging problem for automatic methods.
Research limitations/implications: Practical limitations constrained the current collection to three platforms: YouTube, Facebook and Twitter. Furthermore, there exists a wealth of information that can be drawn from the dataset analysis, which goes beyond the constraints of a single paper. Extension to other platforms and further analysis will be the object of subsequent research.
Practical implications: The dataset analysis indicates directions for future automatic video verification algorithms, and the dataset itself provides a challenging benchmark.
Social implications: Having a carefully collected and labelled dataset of debunked and verified videos is an important resource both for developing effective disinformation-countering tools and for supporting media literacy activities.
Originality/value: Besides its importance as a unique benchmark for research in automatic verification, the analysis also allows a glimpse into the dissemination patterns of UGC and possible telltale differences between fake and real content.
Keywords Video verification, Fake news, Disinformation detection, User-generated content, Social media,
Dataset
Paper type Research paper
1. Introduction
User-generated content (UGC), i.e. media content generated by non-professional bystanders during unfolding newsworthy events, has become an essential component of evolving news stories. The ubiquity of capturing devices means that bystanders are very likely to be capturing relevant content and sharing it through various web and social media platforms. News professionals are pressed by competition to integrate such content in their stories, but verifying it first is essential to any news provider's reputation (Hermida and Thurman, 2008). Automatic and semi-automatic tools have the potential to considerably ease and speed up the verification of UGC.
News content verification through automated means is a relatively young field,
comprising a set of distinct disciplines, including rumour analysis (Zubiaga et al., 2018),
multimedia forensics (Zampoglou et al., 2017), classification of social media content (Castillo et al., 2011), and web mining and multimedia retrieval (Xie et al., 2011). A recent survey (Kumar and Shah, 2018) presented an analysis of known patterns of disinformation dissemination and approaches to the automatic detection of false information.

Online Information Review, Vol. 43 No. 1, 2019, pp. 72-88. © Emerald Publishing Limited, ISSN 1468-4527. DOI 10.1108/OIR-03-2018-0101. Received 20 March 2018; revised 17 July and 2 October 2018; accepted 4 October 2018.
This work has been supported by the InVID project, partially funded by the European Commission under Contract No. H2020-687786. This paper forms part of the special section "Social media mining for journalism".
Datasets are an important asset for understanding and addressing the problem of news
content verification, and range from collections of tampered multimedia content and social
media posts, to rumours, i.e. cascades of unverified information. Carefully designed
datasets may contribute both to better understanding the patterns of disinformation
dissemination and to training and evaluating automatic detection systems.
This paper deals with user-generated video (UGV) verification, specifically with the effort to discern whether a suspect video conveys factual information or disinformation; in other words, for the sake of brevity, whether the video is "real" or "fake". The paper presents the first large-scale video verification dataset, consisting of 380 videos and their 5,195 near-duplicates collected from YouTube (YT), Facebook (FB) and Twitter (TW), including a number of fake and real UGVs and numerous other versions of those videos that were subsequently posted online. The dataset is supplemented with 77,258 tweets that contain links to the dataset's videos. The dataset, named Fake Video Corpus 2018 (FVC-2018), which
has been made publicly available[1], was gathered using a systematic process and can
provide insights into the nature of disinformation, and the types of fake and real
content circulating the web. It is also aimed to serve as a benchmark for automatic content
verification methods.
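To illustrate the kind of analysis such a corpus supports, the sketch below summarises per-label counts and repost volumes over a handful of mock records. The field names (`label`, `platform`, `near_duplicates`) and the records themselves are illustrative assumptions for this sketch, not the actual FVC-2018 annotation schema.

```python
from collections import Counter

# Mock records mimicking, in simplified form, the kind of annotations a
# video verification corpus provides; the schema here is hypothetical.
videos = [
    {"id": "v1", "label": "fake", "platform": "YouTube",  "near_duplicates": 30},
    {"id": "v2", "label": "real", "platform": "Twitter",  "near_duplicates": 5},
    {"id": "v3", "label": "fake", "platform": "Facebook", "near_duplicates": 12},
    {"id": "v4", "label": "real", "platform": "YouTube",  "near_duplicates": 2},
]

def summarise(records):
    """Count videos per label and total near-duplicate reposts per label."""
    labels = Counter(r["label"] for r in records)
    reposts = Counter()
    for r in records:
        reposts[r["label"]] += r["near_duplicates"]
    return labels, reposts

labels, reposts = summarise(videos)
print(labels["fake"], labels["real"])    # 2 2
print(reposts["fake"], reposts["real"])  # 42 7
```

Aggregates of this kind are what underlie the paper's comparison of spread patterns between debunked and verified videos.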
2. Related work
The area of multimedia verification consists of several fields of study, tackling various
aspects of the problem from different viewpoints.
2.1 Multimedia forensics
A large part of related research concerns tampering detection and image/video forensics
algorithms. Proposed algorithms attempt to detect and localise image modifications, either
actively by embedding watermarks in multimedia content and monitoring their integrity
(Dadkhah et al., 2014; Botta et al., 2015), or passively by searching for telltale self-repetitions
(Zandi et al., 2016; Ferreira et al., 2016) or inconsistencies in the image. Such inconsistencies
may appear in the pixel domain or the compressed domain depending on the specific
process of tampering. A recent survey and evaluation of such algorithms can be found in
Zampoglou et al. (2017). Generally, such content-based approaches suffer from a number of
issues that often render them inapplicable. One problem is their limited robustness to image transformations: when images or videos are recompressed or rescaled, as is often the case with social media uploads, the traces of tampering tend to disappear (Zampoglou et al., 2017). Another limitation is that such approaches are only
relevant in specific cases of disinformation. There are cases where a multimedia item is used
to convey false information not by altering its content but by altering its context. One
typical approach is to reuse content from a past event and present it as if it had been captured during a current one. Another is to misrepresent the content, e.g. the location
where it was taken or the identities of depicted people. In such cases, an approach must be
able to evaluate the context of the post (e.g. the profile of the uploader, the linguistic
characteristics of the accompanying post or the collective characteristics of all posts sharing
the same item) rather than its actual content.
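The "telltale self-repetitions" mentioned above can be made concrete with a toy copy-move check: split the frame into blocks and flag blocks whose pixel content recurs verbatim elsewhere. This is a minimal sketch, not any of the cited detectors; real methods match robust features precisely so that matches survive the recompression and rescaling discussed above, whereas this version, using plain nested lists as a stand-in for an image, only catches exact copies.

```python
import random

def find_duplicate_blocks(img, block=8):
    """Flag pairs of non-overlapping blocks with identical pixel content.
    Naive stand-in for copy-move detection: exact matching only."""
    seen, dupes = {}, []
    h, w = len(img), len(img[0])
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            # Use the block's pixel values themselves as a hashable key.
            key = tuple(tuple(img[y + dy][x + dx] for dx in range(block))
                        for dy in range(block))
            if key in seen:
                dupes.append((seen[key], (y, x)))
            else:
                seen[key] = (y, x)
    return dupes

# Synthetic "image": random pixels, with one region pasted over another
# to simulate a copy-move forgery.
rng = random.Random(0)
img = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]
for dy in range(8):
    for dx in range(8):
        img[16 + dy][16 + dx] = img[dy][dx]

print(find_duplicate_blocks(img))  # reports the pasted region
```

The exact-match key is the design shortcut here: replacing it with a quantised feature descriptor is what gives practical detectors their robustness, at the cost of having to cluster near-matches instead of testing equality.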
2.2 Automated fact checking
In automated fact checking (Hassan et al., 2015), statements are isolated and their
veracity is evaluated using reliable databases providing structured knowledge such as
