Dataset Open Access

PAN Plagiarism Corpus 2009 (PAN-PC-09)

Potthast, Martin; Stein, Benno; Eiselt, Andreas; Barrón-Cedeño, Alberto; Rosso, Paolo

This corpus is outdated. Please use its successor PAN-PC-11: https://doi.org/10.5281/zenodo.3250095

The PAN plagiarism corpus 2009 (PAN-PC-09) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.

The PAN-PC-09 contains documents in which artificial plagiarism has been inserted automatically. The plagiarism cases have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of random variables. The variables include the percentage of plagiarism in the whole corpus, the percentage of plagiarism per document, the length of a single plagiarized section, and the degree of obfuscation per plagiarized section.
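
As a rough illustration of how such a random plagiarist might operate, the following minimal sketch inserts obfuscated source passages into a suspicious document. It is not the program used to build PAN-PC-09: the function names, the parameters (`doc_fraction`, `min_len`, `max_len`, `max_obfuscation`), and the word-shuffling obfuscation are all assumptions made for illustration, and the corpus-wide plagiarism percentage is left out.

```python
import random


def shuffle_obfuscate(passage, degree, rng):
    """Hypothetical obfuscation: permute a fraction `degree` of the words."""
    words = passage.split()
    k = int(round(degree * len(words)))
    if k < 2:
        return passage  # nothing to permute
    positions = rng.sample(range(len(words)), k)
    targets = positions[:]
    rng.shuffle(targets)
    out = words[:]
    for src, dst in zip(positions, targets):
        out[dst] = words[src]
    return " ".join(out)


def random_plagiarist(suspicious, source, *, doc_fraction, min_len,
                      max_len, max_obfuscation, seed=0):
    """Insert randomly chosen, obfuscated source passages into `suspicious`.

    doc_fraction    -- inserted words as a fraction of the original document
    min_len/max_len -- bounds on the length (in words) of one passage
    max_obfuscation -- upper bound on the per-passage obfuscation degree
    All of these are illustrative stand-ins for the random variables
    described above, not the corpus's actual settings.
    """
    rng = random.Random(seed)
    src_words = source.split()
    susp_words = suspicious.split()
    budget = int(doc_fraction * len(susp_words))  # words still to insert
    cases = []
    while budget >= min_len and len(src_words) >= min_len:
        length = min(rng.randint(min_len, max_len), budget, len(src_words))
        start = rng.randrange(len(src_words) - length + 1)
        degree = rng.uniform(0.0, max_obfuscation)
        passage = shuffle_obfuscate(
            " ".join(src_words[start:start + length]), degree, rng)
        pos = rng.randrange(len(susp_words) + 1)
        susp_words[pos:pos] = passage.split()
        cases.append({"source_start": start, "length": length,
                      "obfuscation": degree})
        budget -= length
    return " ".join(susp_words), cases
```

Fixing the seed makes the generated cases reproducible, which is convenient when the same synthetic document is needed across experiments.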

Files (2.4 GB)
Name                                   Size       MD5
pan-plagiarism-corpus-2009.part1.rar   1.0 GB     82093e3464b5bff97dd01d99bee5d095
pan-plagiarism-corpus-2009.part2.rar   1.0 GB     ecc755dbdb9c7599f1c7d4f842e53ec2
pan-plagiarism-corpus-2009.part3.rar   393.6 MB   016c64c3713bf65df03d9203bab2df93
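
The MD5 checksums listed for the three archive parts can be used to check that a download is complete and uncorrupted. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `verify` and the assumption that all parts sit in one local directory are illustrative only.

```python
import hashlib
from pathlib import Path

# Checksums as listed for the three archive parts.
EXPECTED_MD5 = {
    "pan-plagiarism-corpus-2009.part1.rar": "82093e3464b5bff97dd01d99bee5d095",
    "pan-plagiarism-corpus-2009.part2.rar": "ecc755dbdb9c7599f1c7d4f842e53ec2",
    "pan-plagiarism-corpus-2009.part3.rar": "016c64c3713bf65df03d9203bab2df93",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True if present with a matching MD5."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in chunks keeps memory use small even for the 1.0 GB parts.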
  • Martin Potthast, Benno Stein, Andreas Eiselt, Alberto Barrón-Cedeño, and Paolo Rosso. Overview of the 1st International Competition on Plagiarism Detection. In Benno Stein et al., editors, SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2009), pages 1-9, September 2009. CEUR-WS.org. ISSN 1613-0073.

                   All versions   This version
Views              273            273
Downloads          293            293
Data volume        265.7 GB       265.7 GB
Unique views       244            244
Unique downloads   74             74
