Data for PAN at SemEval 2019 Task 4: Hyperpartisan News Detection

Johannes Kiesel; Maria Mestre; Rishabh Shukla; Emmanuel Vincent; David Corney; Payam Adineh; Benno Stein; Martin Potthast

doi:10.5281/zenodo.1489920

Published November 22, 2018 | Version Training and validation v1

Dataset Open

1. Bauhaus-Universität Weimar
2. Factmata Ltd.
3. Leipzig University

Training and validation data for the PAN @ SemEval 2019 Task 4: Hyperpartisan News Detection.

The data is split into multiple files. The articles are contained in the files with names starting with "articles-" (which validate against the XML schema article.xsd). The ground-truth information is contained in the files with names starting with "ground-truth-" (which validate against the XML schema ground-truth.xsd).
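As a rough illustration of how an "articles-" file and its matching "ground-truth-" file might be combined, the following Python sketch joins the two by article ID. The element name `article` and the attribute names `id` and `hyperpartisan` are assumptions made for this example; the authoritative definitions are in article.xsd and ground-truth.xsd. The sketch parses small in-memory strings; for the full files a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be more appropriate.

```python
import xml.etree.ElementTree as ET


# ASSUMPTION: the element name "article" and the attributes "id" and
# "hyperpartisan" are illustrative; consult article.xsd and ground-truth.xsd
# for the real structure.
def load_labels(ground_truth_xml):
    """Map article ID -> True/False from a ground-truth XML string."""
    root = ET.fromstring(ground_truth_xml)
    return {
        el.get("id"): el.get("hyperpartisan") == "true"
        for el in root.iter("article")
    }


def load_articles(articles_xml):
    """Map article ID -> plain text (markup stripped) from an articles XML string."""
    root = ET.fromstring(articles_xml)
    return {
        el.get("id"): "".join(el.itertext()).strip()
        for el in root.iter("article")
    }


def join_articles_with_labels(articles_xml, ground_truth_xml):
    """Return (id, text, label) tuples for articles that have a label."""
    labels = load_labels(ground_truth_xml)
    articles = load_articles(articles_xml)
    return [
        (article_id, text, labels[article_id])
        for article_id, text in articles.items()
        if article_id in labels
    ]
```

Articles without a matching ground-truth entry are skipped rather than raising an error, which is a design choice of this sketch and not something the dataset description prescribes.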

The first part of the data (filename contains "bypublisher") is labeled by the overall bias of the publisher as provided by BuzzFeed journalists or MediaBiasFactCheck.com. It contains a total of 750,000 articles, half of which (375,000) are hyperpartisan and half of which are not. Half of the articles that are hyperpartisan (187,500) are on the left side of the political spectrum, half are on the right side. This data is split into a training set (80%, 600,000 articles) and a validation set (20%, 150,000 articles), where no publisher that occurs in the training set also occurs in the validation set. Similarly, none of the publishers in those sets will occur in the test set.
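The counts stated above can be checked with a few lines of arithmetic, and the publisher-disjointness property can be expressed as a set operation. The sketch below is only an illustration: the function name `publishers_disjoint` and the publisher names in the usage example are hypothetical, and how publishers are identified in the actual files is not specified here.

```python
# Counts taken from the description of the "bypublisher" part.
TOTAL_ARTICLES = 750_000

hyperpartisan = TOTAL_ARTICLES // 2             # half are hyperpartisan
non_hyperpartisan = TOTAL_ARTICLES - hyperpartisan
left_leaning = hyperpartisan // 2               # half of the hyperpartisan articles
right_leaning = hyperpartisan - left_leaning

training_size = TOTAL_ARTICLES * 80 // 100      # 80% training
validation_size = TOTAL_ARTICLES - training_size  # 20% validation


def publishers_disjoint(training_publishers, validation_publishers):
    """True if no publisher occurs in both splits (hypothetical helper)."""
    return set(training_publishers).isdisjoint(validation_publishers)
```

For example, `publishers_disjoint(["a.example", "b.example"], ["c.example"])` evaluates to `True`, whereas adding `"a.example"` to the second list makes it `False`.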

The second part of the data (filename contains "byarticle") is labeled through crowdsourcing on an article basis. The data contains only articles for which a consensus among the crowdsourcing workers existed. It contains a total of 645 articles. Of these, 238 (37%) are hyperpartisan and 407 (63%) are not. We will use a similar (but balanced!) test set. Again, none of the publishers in this set will occur in the test set.

Note that article IDs are unique only within each part ("bypublisher" or "byarticle"), so the same ID may occur in both parts.
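Because IDs are not unique across the parts, one way to combine labels from both parts without collisions is to key each entry by a (part, ID) pair. The following is a minimal sketch; the function name `merge_parts` and the example IDs are hypothetical.

```python
def merge_parts(parts):
    """Merge per-part label dictionaries using (part name, article ID) keys.

    `parts` maps a part name such as "bypublisher" or "byarticle" to a
    dictionary of article ID -> label.
    """
    merged = {}
    for part_name, labels in parts.items():
        for article_id, label in labels.items():
            # The composite key keeps equal IDs from different parts apart.
            merged[(part_name, article_id)] = label
    return merged
```

With this keying, an ID that appears in both parts yields two separate entries instead of one overwriting the other.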

The collection (including labels) is licensed under a Creative Commons Attribution 4.0 International License.

Acknowledgements: Thanks to Jonathan Miller for his assistance in cleaning the data!

Files (1.3 GB)

Name	MD5	Size
article.xsd	b1208c0e3043b680904f53e260b4ce16	2.1 kB
articles-training-byarticle-20181122.zip	0f03ccac40669f0daf0f42abdda72af6	971.8 kB
articles-training-bypublisher-20181122.zip	0b236915cf270ab9d0c0d8dfc26c645b	980.8 MB
articles-validation-bypublisher-20181122.zip	5b6e3e7c06a232892ce4d2f28bf05c88	337.4 MB
ground-truth-training-byarticle-20181122.zip	5e1183e7f50bf502d5554c33c98bbd2c	28.5 kB
ground-truth-training-bypublisher-20181122.zip	415bb734755e59fab8b3a0ae6c44e862	22.4 MB
ground-truth-validation-bypublisher-20181122.zip	4207ba7cf6d919e72a67a28e2af19108	5.2 MB
ground-truth.xsd	81dd0e153d6f78ca10a5599da6aac66e	1.6 kB
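
The MD5 checksums listed in the table above can be used to check downloads for corruption. The following Python sketch computes a file's MD5 digest in chunks and compares it against the listed values. The function names and the assumption that all files sit in one local directory under their original names are choices made for this example.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file table.
EXPECTED_MD5 = {
    "article.xsd": "b1208c0e3043b680904f53e260b4ce16",
    "articles-training-byarticle-20181122.zip": "0f03ccac40669f0daf0f42abdda72af6",
    "articles-training-bypublisher-20181122.zip": "0b236915cf270ab9d0c0d8dfc26c645b",
    "articles-validation-bypublisher-20181122.zip": "5b6e3e7c06a232892ce4d2f28bf05c88",
    "ground-truth-training-byarticle-20181122.zip": "5e1183e7f50bf502d5554c33c98bbd2c",
    "ground-truth-training-bypublisher-20181122.zip": "415bb734755e59fab8b3a0ae6c44e862",
    "ground-truth-validation-bypublisher-20181122.zip": "4207ba7cf6d919e72a67a28e2af19108",
    "ground-truth.xsd": "81dd0e153d6f78ca10a5599da6aac66e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each expected file found in `directory` to True (match) or False."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.exists():
            results[name] = md5_of_file(path) == expected
    return results
```

Files that are not present in the directory are simply left out of the result, so a partial download can still be checked.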

Additional details

Is referenced by: https://pan.webis.de/semeval19/semeval19-web/ (URL)

	All versions	This version
Views	23,075	9,978
Downloads	13,414	8,086
Data volume	12.0 TB	10.4 TB
