Published July 8, 2019 | Version v1
Dataset Open

Proppy Corpus 1.0

  • 1. Università di Bologna
  • 2. University of Texas at Arlington
  • 3. Qatar Computing Research Institute, HBKU


# Proppy Corpus 1.0

Version 1.0: March 25, 2019

This file describes the corpus Proppy 1.0. This is the corpus used in the
paper "Proppy: Organizing the News Based on Their Propagandistic Content"

(see citation section).

The corpus contains 52k articles from 100+ news outlets. Each article is
labeled as either “propagandistic” (positive class) or “non-propagandistic”
(negative class). The labeling was done indirectly using a technique known as
distant supervision, i.e. an article is considered propagandistic if it comes
from a news outlet that has been labeled as propagandistic by human annotators.

## Data format

We provide the corpus in three tsv files, including training, development, and
testing partitions.

The data is tab-separated. Each line represents one article, with the following

1. article_text: the text of the article retrieved via newspaper3k package.
2. event_location: the geographical location - collected from GDELT.
3. average_tone: measures the impact of the event - collected from GDELT
4. article_date: article's publish date - collected from GDELT.
5. article_ID: GDELT ID , unique among the dataset's articles.
6. article_URL: the direct URL for the published article in its source website.
7. MBFC_factuality_label: factuality label for the source from MBFC
8. article_URL
9. MBFC_factuality_label   
10. URL_to_MBFC_page        
11. source_name     
12. MBFC_notes_about_source
13. MBFC_bias_label
14. source_URL
15. propaganda_label

## About

The corpus was downloaded using MBFC metadata to identify propagandistic vs
non-propagandistic sources. Specific URLs where then gathered with GDELT and
contents downloaded with newspaper3k

## Credit

Please cite the following paper when using this corpus:

A. Barrón-Cedeño, G. Da San Martino, I. Jaradat, and P. Nakov.
Proppy: Organizing news coverage on the basis of their propagandistic content.
Information Processing and Management 56(5), pp. 1849-1864. 2019

author = "Barr\'{o}n-Cede\~no, Alberto and
    Da San Martino, Giovanni and
    Jaradat, Israa and
    Nakov, Preslav",
title = "{Proppy: Organizing the news based on their propagandistic content}",
journal = "Information Processing & Management",
volume = "56",
number = "5",
pages = "1849 - 1864",
year = "2019",
issn = "0306-4573",
doi = "",
url = "",

## Authors

Alberto Barrón-Cedeno;
Israa Jaradat;
Giovani Da San Martino;
Preslav Nakov


Files (208.2 MB)

Name Size Download all
20.4 MB Download
41.0 MB Download
146.7 MB Download
2.6 kB Preview Download

Additional details

Related works

Is supplement to
10.1016/j.ipm.2019.03.005 (DOI)