Dataset Open Access

PAN Plagiarism Corpus 2010 (PAN-PC-10)

Potthast, Martin; Stein, Benno; Eiselt, Andreas; Barrón-Cedeño, Alberto; Rosso, Paolo

This corpus is outdated. Please use its successor PAN-PC-11: https://doi.org/10.5281/zenodo.3250095

The PAN plagiarism corpus 2010 (PAN-PC-10) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.

The PAN-PC-10 contains two kinds of documents: those in which artificial plagiarism has been inserted automatically, and those in which simulated plagiarism has been inserted manually. The former were constructed using a so-called random plagiarist, a computer program that constructs plagiarism according to a number of parameters; the latter were obtained by crowdsourcing via Amazon's Mechanical Turk.

Files (1.8 GB)
Name, Size, Checksum
pan-plagiarism-corpus-2010.part1.rar, 1.1 GB, md5:66e4f2801f097da2c1537453d6edf4ee
pan-plagiarism-corpus-2010.part2.rar, 700.2 MB, md5:629861d970aeda647ff7b7c4c1cc70f4
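
Since the archive is split into two large parts, it may be worth checking each download against the MD5 checksums listed above before extracting. The following is a minimal sketch, not an official tool: the function names `md5_of_file` and `verify_downloads` are illustrative, and it assumes both parts were saved under their original file names in a single directory.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "pan-plagiarism-corpus-2010.part1.rar": "66e4f2801f097da2c1537453d6edf4ee",
    "pan-plagiarism-corpus-2010.part2.rar": "629861d970aeda647ff7b7c4c1cc70f4",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a 1 GB archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'.

    Assumes (hypothetically) that the parts sit in `directory`
    under their original names.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    for name, status in verify_downloads().items():
        print(f"{name}: {status}")
```

A "mismatch" status usually points to an incomplete or corrupted download, in which case fetching the affected part again is the simplest remedy.
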
  • Alberto Barrón-Cedeño, Martin Potthast, Paolo Rosso, Benno Stein, and Andreas Eiselt. Corpus and Evaluation Measures for Automatic Plagiarism Detection. In Nicoletta Calzolari et al., editors, 7th Conference on International Language Resources and Evaluation (LREC 10), May 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7.

