# Proppy Corpus 1.0

Version 1.0: March 25, 2019
---------------------------

This file describes the Proppy 1.0 corpus, which was used in the paper "Proppy: Organizing the news based on their propagandistic content" (see the Credit section).

The corpus contains 52k articles from 100+ news outlets. Each article is labeled as either “propagandistic” (positive class) or “non-propagandistic” (negative class). The labeling was done indirectly using a technique known as distant supervision, i.e., an article is considered propagandistic if it comes from a news outlet that has been labeled as propagandistic by human annotators.

## Data format

We provide the corpus in three TSV files: the training, development, and testing partitions. The data is tab-separated. Each line represents one article, with the following information:

1. article_text: the text of the article, retrieved via the newspaper3k package.
2. event_location: the geographical location, collected from GDELT.
3. average_tone: measures the impact of the event, collected from GDELT.
4. article_date: the article's publication date, collected from GDELT.
5. article_ID: GDELT ID, unique among the dataset's articles.
6. article_URL: the direct URL of the published article on its source website.
7. MBFC_factuality_label: factuality label for the source, from MBFC.
8. article_URL
9. MBFC_factuality_label
10. URL_to_MBFC_page
11. source_name
12. MBFC_notes_about_source
13. MBFC_bias_label
14. source_URL
15. propaganda_label

## About

The corpus was built using MBFC metadata to identify propagandistic vs. non-propagandistic sources. Specific URLs were then gathered with GDELT, and their contents were downloaded with newspaper3k.

## Credit

Please cite the following paper when using this corpus:

A. Barrón-Cedeño, G. Da San Martino, I. Jaradat, and P. Nakov. Proppy: Organizing the news based on their propagandistic content. Information Processing and Management 56(5), pp. 1849-1864, 2019.

    @article{BARRONCEDENO20191849,
      author = "Barr\'{o}n-Cede\~no, Alberto and Da San Martino, Giovanni and Jaradat, Israa and Nakov, Preslav",
      title = "{Proppy: Organizing the news based on their propagandistic content}",
      journal = "Information Processing \& Management",
      volume = "56",
      number = "5",
      pages = "1849 - 1864",
      year = "2019",
      issn = "0306-4573",
      doi = "https://doi.org/10.1016/j.ipm.2019.03.005",
      url = "http://www.sciencedirect.com/science/article/pii/S0306457318306058",
    }

## Authors

Alberto Barrón-Cedeño; Israa Jaradat; Giovanni Da San Martino; Preslav Nakov
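
As an illustration of the tab-separated layout described under Data format, the following is a minimal Python sketch for reading one partition and counting its labels. It is not part of the corpus distribution. It assumes that the files have no header row, that fields are not quoted, and that propaganda_label is the last field on each line, as the field list suggests. The function names `read_proppy_tsv` and `label_counts` are hypothetical, as is the file name in the usage note.

```python
import csv
from collections import Counter


def read_proppy_tsv(path):
    """Yield each article in a TSV partition as a list of string fields."""
    # Article texts can exceed the csv module's default field size limit,
    # so raise it to a large fixed value.
    csv.field_size_limit(2**31 - 1)
    with open(path, encoding="utf-8", newline="") as handle:
        # Assumption: fields are plain tab-separated text with no quoting,
        # so quote characters inside an article are kept literally.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row:  # skip completely empty lines
                yield row


def label_counts(path):
    """Count the values found in the last field of every line.

    Assumption: the last field is propaganda_label.
    """
    return Counter(row[-1] for row in read_proppy_tsv(path))
```

For example, `label_counts("train.tsv")` (with a hypothetical file name) would return a `Counter` mapping each label value to the number of articles carrying it. Checking `len(row)` on the first few rows is a quick way to confirm how many fields the files actually contain before relying on positional indices.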