Dataset Open Access

PAN Wikipedia Quality Flaw Corpus 2012 (PAN-WQF-12)

Anderka, Maik; Stein, Benno; Völske, Michael

The PAN Wikipedia Quality Flaw Corpus 2012, PAN-WQF-12, provides human-labeled English Wikipedia articles that contain specific quality flaws.

The corpus comprises 1,592,226 articles extracted from the English Wikipedia snapshot of January 4, 2012. A subset of 208,228 articles is labeled with ten specific quality flaws, which are listed in the following table. The labeling is based on human-defined cleanup tags. In addition, the corpus comprises 1,383,998 articles that have not been tagged with any cleanup tag.

Files (3.7 GB)
Name: pan-wikipedia-quality-flaw-corpus-2012.tar.gz
Size: 3.7 GB
MD5: 3d3ec4d71c707def537e7225169201df
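
Because the archive is several gigabytes, it can be worth checking the download against the MD5 checksum given in the file listing above before unpacking it. The following is a minimal sketch, assuming the archive was saved under its original file name in the current working directory; the function names `md5_of_file` and `verify_archive` are illustrative and not part of the corpus distribution.

```python
import hashlib
from pathlib import Path

# Checksum as published in the file listing for the corpus archive.
EXPECTED_MD5 = "3d3ec4d71c707def537e7225169201df"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read chunk by chunk so the 3.7 GB archive never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path: the archive's original name in the working directory.
    archive = Path("pan-wikipedia-quality-flaw-corpus-2012.tar.gz")
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.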
  • Maik Anderka and Benno Stein. Overview of the 1st International Competition on Quality Flaw Prediction in Wikipedia. In Pamela Forner, Jussi Karlgren, and Christa Womser-Hacker, editors, Working Notes Papers of the CLEF 2012 Evaluation Labs, September 2012. ISBN 978-88-904810-3-1. ISSN 2038-4963.

