Proppy Corpus 1.0
- 1. Università di Bologna
- 2. University of Texas at Arlington
- 3. Qatar Computing Research Institute, HBKU
Description
# Proppy Corpus 1.0
Version 1.0: March 25, 2019
---------------------------
This file describes the corpus Proppy 1.0. This is the corpus used in the
paper "Proppy: Organizing the News Based on Their Propagandistic Content"
(see citation section).
The corpus contains 52k articles from 100+ news outlets. Each article is
labeled as either “propagandistic” (positive class) or “non-propagandistic”
(negative class). The labeling was done indirectly using a technique known as
distant supervision, i.e. an article is considered propagandistic if it comes
from a news outlet that has been labeled as propagandistic by human annotators.
## Data format
We provide the corpus in three tsv files, including training, development, and
testing partitions.
The data is tab-separated. Each line represents one article, with the following
information:
1. article_text: the text of the article retrieved via newspaper3k package.
2. event_location: the geographical location - collected from GDELT.
3. average_tone: measures the impact of the event - collected from GDELT
4. article_date: article's publish date - collected from GDELT.
5. article_ID: GDELT ID , unique among the dataset's articles.
6. article_URL: the direct URL for the published article in its source website.
7. MBFC_factuality_label: factuality label for the source from MBFC
8. article_URL
9. MBFC_factuality_label
10. URL_to_MBFC_page
11. source_name
12. MBFC_notes_about_source
13. MBFC_bias_label
14. source_URL
15. propaganda_label
## About
The corpus was downloaded using MBFC metadata to identify propagandistic vs
non-propagandistic sources. Specific URLs where then gathered with GDELT and
contents downloaded with newspaper3k
## Credit
Please cite the following paper when using this corpus:
A. Barrón-Cedeño, G. Da San Martino, I. Jaradat, and P. Nakov.
Proppy: Organizing news coverage on the basis of their propagandistic content.
Information Processing and Management 56(5), pp. 1849-1864. 2019
@article{BARRONCEDENO20191849,
author = "Barr\'{o}n-Cede\~no, Alberto and
Da San Martino, Giovanni and
Jaradat, Israa and
Nakov, Preslav",
title = "{Proppy: Organizing the news based on their propagandistic content}",
journal = "Information Processing & Management",
volume = "56",
number = "5",
pages = "1849 - 1864",
year = "2019",
issn = "0306-4573",
doi = "https://doi.org/10.1016/j.ipm.2019.03.005",
url = "http://www.sciencedirect.com/science/article/pii/S0306457318306058",
}
## Authors
Alberto Barrón-Cedeno;
Israa Jaradat;
Giovani Da San Martino;
Preslav Nakov
Files
README.md
Additional details
Related works
- Is supplement to
- 10.1016/j.ipm.2019.03.005 (DOI)