Dataset Open Access
The PAN Wikipedia Quality Flaw Corpus 2012, PAN-WQF-12, provides human-labeled English Wikipedia articles that contain specific quality flaws.
The corpus comprises 1,592,226 articles extracted from the English Wikipedia snapshot from January 4th, 2012. A subset of 208,228 articles is labled with ten specific quality flaws, which are listed in the following table. The labeling is based on human-defined cleanup tags. In addition, the corpus comprises 1,383,998 articles that have not been tagged with any cleanup tag.
Maik Anderka and Benno Stein. Overview of the 1st International Competition on Quality Flaw Prediction in Wikipedia. In Pamela Forner, Jussi Karlgren, and Christa Womser-Hacker, editors, Working Notes Papers of the CLEF 2012 Evaluation Labs, September 2012. ISBN 978-88-904810-3-1. ISSN 2038-4963.