Published July 22, 2011 | Version v1
Dataset Open

PAN Wikipedia Vandalism Corpus 2011 (PAN-WVC-11)

  • Bauhaus-Universität Weimar

Description

The PAN Wikipedia Vandalism Corpus 2011 (PAN-WVC-11) is a corpus for the evaluation of automatic vandalism detectors for Wikipedia. For research purposes the corpus can be used free of charge.

This corpus supplements the PAN-WVC-10, which features only English edits. Using both corpora together gives more representative results.

The corpus comprises 29949 edits on 24351 Wikipedia articles, of which 2813 have been identified as vandalism. It features 9985 English edits, 9990 German edits, and 9974 Spanish edits. The corpus was annotated using Amazon's Mechanical Turk: each edit was presented to several annotators, who were asked to decide whether it is vandalism or a regular edit, and the agreement among the annotators was analyzed in order to label the edit.
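As a quick sanity check on the figures above, the following minimal sketch confirms that the per-language counts add up to the stated total and computes the share of vandalism edits. The variable and function names are illustrative only and are not part of the corpus or any accompanying tooling.

```python
# Edit counts per language, copied from the corpus description.
edits_per_language = {"English": 9985, "German": 9990, "Spanish": 9974}
total_edits = 29949       # total number of edits stated in the description
vandalism_edits = 2813    # edits identified as vandalism


def vandalism_share(vandalism: int, total: int) -> float:
    """Return the fraction of edits labeled as vandalism."""
    return vandalism / total


# The three language subsets should account for every edit in the corpus.
assert sum(edits_per_language.values()) == total_edits

share = vandalism_share(vandalism_edits, total_edits)
print(f"Vandalism share: {share:.1%}")
```

Fewer than one in ten edits is vandalism, so the classes are imbalanced; plain accuracy is likely to be a misleading metric when evaluating a detector on this corpus.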

Files

pan-wikipedia-vandalism-corpus-2011.zip (388.8 MB)
md5:926089315581b133be244e3e6dcca28c
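
The MD5 checksum listed above can be used to confirm that the archive was downloaded intact. The sketch below is one way to do this with Python's standard library; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local path is assumed to be wherever the archive was saved.

```python
import hashlib

# Checksum as listed on the record page for the corpus archive.
EXPECTED_MD5 = "926089315581b133be244e3e6dcca28c"


def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the 388.8 MB archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5) -> bool:
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("pan-wikipedia-vandalism-corpus-2011.zip")` should return `True` for an uncorrupted download saved under that name in the current directory.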

Additional details

References

  • Martin Potthast and Teresa Holfeld. Overview of the 2nd International Competition on Wikipedia Vandalism Detection. In Vivien Petras, Pamela Forner, and Paul D. Clough, editors, Notebook Papers of CLEF 2011 Labs and Workshops, September 2011. ISBN 978-88-904810-1-7. ISSN 2038-4963.
  • Benno Stein, Martin Potthast, Alberto Barrón-Cedeño, Paolo Rosso, Efstathios Stamatatos, and Moshe Koppel. 4th International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2010). SIGIR Forum, 45(1):45-48, June 2011.