This file describes the corpus Proppy 1.0. This is the corpus used in the paper "Proppy: Organizing the News Based on Their Propagandistic Content" (see citation section).
The corpus contains 52k articles from 100+ news outlets. Each article is labeled as either “propagandistic” (positive class) or “non-propagandistic” (negative class). The labeling was done indirectly using a technique known as distant supervision, i.e. an article is considered propagandistic if it comes from a news outlet that has been labeled as propagandistic by human annotators.
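The distant supervision idea described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name, the outlet names, and the data layout are hypothetical, chosen only to show how an article inherits the label of its source.

```python
from typing import Dict, List, Tuple


def label_by_source(
    articles: List[Tuple[str, str]],
    source_is_propagandistic: Dict[str, bool],
) -> List[Tuple[str, str]]:
    """Give each article the label of the outlet it comes from.

    `articles` holds (outlet, text) pairs; `source_is_propagandistic` maps an
    outlet to the judgment human annotators made about that outlet.
    Both structures are assumptions made for this illustration.
    """
    labeled = []
    for outlet, text in articles:
        if outlet not in source_is_propagandistic:
            continue  # no outlet-level judgment, so no label can be inferred
        is_propaganda = source_is_propagandistic[outlet]
        label = "propagandistic" if is_propaganda else "non-propagandistic"
        labeled.append((text, label))
    return labeled


# Hypothetical outlets and texts, for illustration only.
outlet_labels = {"outlet-a.example": True, "outlet-b.example": False}
sample = [
    ("outlet-a.example", "first article"),
    ("outlet-b.example", "second article"),
    ("outlet-c.example", "article from an unlabeled outlet"),
]
labeled_sample = label_by_source(sample, outlet_labels)
```

No article is inspected individually: the label depends only on the outlet, which is why the labeling is called indirect.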
We provide the corpus in three TSV files: the training, development, and testing partitions.
The data is tab-separated. Each line represents one article, with the following information:
The corpus was built using Media Bias/Fact Check (MBFC) metadata to distinguish propagandistic from non-propagandistic sources. Specific URLs were then gathered with GDELT, and the article contents were downloaded with newspaper3k.
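As a rough sketch of how the tab-separated partitions might be read, the snippet below uses Python's standard csv module. The helper names and the file path are hypothetical, and the code makes no assumption about which columns are present: each article is returned as a plain list of fields.

```python
import csv
from typing import Iterable, Iterator, List


def iter_articles(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of tab-separated fields per non-empty line."""
    # Article text can be long, so raise the default per-field size limit.
    csv.field_size_limit(2**31 - 1)
    # QUOTE_NONE keeps quotation marks inside article text as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row


def load_partition(path: str) -> List[List[str]]:
    """Read one partition file; `path` is whatever name the file has locally."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(iter_articles(handle))


# Small in-memory check with made-up lines (not real corpus data).
demo_rows = list(iter_articles(['a\tb "q"\tc\n', "\n", "d\te\tf\n"]))
```

Calling `load_partition` on each of the three files would give one list of rows per partition; the field order then has to be matched against the column description of the corpus.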
Please cite the following paper when using this corpus:
A. Barrón-Cedeño, G. Da San Martino, I. Jaradat, and P. Nakov. Proppy: Organizing the news based on their propagandistic content. Information Processing & Management 56(5), pp. 1849-1864. 2019.
@article{BARRONCEDENO20191849,
  author = "Barr\'{o}n-Cede{\~n}o, Alberto and Da San Martino, Giovanni and Jaradat, Israa and Nakov, Preslav",
  title = "{Proppy: Organizing the news based on their propagandistic content}",
  journal = "Information Processing \& Management",
  volume = "56",
  number = "5",
  pages = "1849 - 1864",
  year = "2019",
  issn = "0306-4573",
  doi = "https://doi.org/10.1016/j.ipm.2019.03.005",
  url = "http://www.sciencedirect.com/science/article/pii/S0306457318306058",
}
Alberto Barrón-Cedeño; Israa Jaradat; Giovanni Da San Martino; Preslav Nakov