PAN Arabic External Plagiarism Detection Shared Task Corpus

Bensalem, Imene; Boukhalfa, Imène; Rosso, Paolo; Chikhi, Salim

doi:10.5281/zenodo.6607799

Published August 10, 2015 | Version v1

Dataset Open


1. Constantine 2 University
2. Universitat Politècnica de València

Evaluation Corpus for ARAbic EXternal plagiarism detection (ExAra Corpus)

This corpus was used in the AraPlagDet 2015 shared task.

More details can be found at https://araplagdet.misc-lab.org/ or https://pan.webis.de/fire15/pan15-web/index.html

If you publish a paper about your experiments with the ExAra corpus, please cite the following paper:

Bensalem, I., Boukhalfa, I., Rosso, P., Abouenour, L., Darwish, K., & Chikhi, S.: Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection. In P. Majumder, M. Mitra, M. Agrawal, & P. Mehta (Eds.), Post Proceedings of the Workshops at the 7th Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, December 4-6, CEUR proceedings vol. 1587 (pp. 111–122). CEUR-WS.org (2015).

We encourage you to compare your method, tested on ExAra, with the methods of the AraPlagDet competition described in the paper above.

I. SYNOPSIS

The ExAra corpus comprises 2345 documents. Almost half of them (the suspicious documents) contain passages borrowed from the other half (the source documents), simulating documents that contain plagiarized fragments. The corpus has 2 parts: training and test.

II. DESCRIPTION

Each part of the corpus (training and test) consists mainly of 3 datasets: 2 sets of text files and 1 set of XML files. The 2 sets of text files are the suspicious documents (i.e., the documents that contain artificial plagiarism) and the source documents (i.e., the documents from which the suspicious passages have been plagiarized). The 3rd set contains XML files holding the plagiarism annotations: for each plagiarized passage, they give its starting offset and its length in both the suspicious and the source document (offset and length are both expressed in characters). A suspicious document file (.txt) and its plagiarism annotation file (.xml) share the same name.
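The annotation layout described above can be read with a few lines of Python. This is a minimal sketch, not the official loader. It assumes the XML files follow the usual PAN convention of `feature` elements carrying `this_offset`, `this_length`, `source_reference`, `source_offset` and `source_length` attributes; these names are not stated in this description and should be checked against the actual files. The directory arguments of `annotation_path` are hypothetical, and "characters" is assumed to mean Unicode code points, which is what Python string indexing uses.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from pathlib import Path

# Attribute names assumed from the PAN annotation convention (not confirmed here).
REQUIRED = ("this_offset", "this_length", "source_reference",
            "source_offset", "source_length")


@dataclass
class PlagiarismCase:
    this_offset: int       # start of the passage in the suspicious document
    this_length: int       # length of the passage in the suspicious document
    source_reference: str  # file name of the source document
    source_offset: int     # start of the passage in the source document
    source_length: int     # length of the passage in the source document


def parse_annotations(xml_data) -> list:
    """Return one PlagiarismCase per annotated passage (accepts str or bytes)."""
    root = ET.fromstring(xml_data)
    cases = []
    for feature in root.iter("feature"):
        attrs = feature.attrib
        if not all(key in attrs for key in REQUIRED):
            continue  # skip elements that do not describe a plagiarism case
        cases.append(PlagiarismCase(
            this_offset=int(attrs["this_offset"]),
            this_length=int(attrs["this_length"]),
            source_reference=attrs["source_reference"],
            source_offset=int(attrs["source_offset"]),
            source_length=int(attrs["source_length"]),
        ))
    return cases


def passage(text: str, offset: int, length: int) -> str:
    """Slice a passage given a character offset and a character length."""
    return text[offset:offset + length]


def annotation_path(suspicious_txt: Path, annotation_dir: Path) -> Path:
    """Map a suspicious .txt file to the .xml file that shares its name."""
    return annotation_dir / (suspicious_txt.stem + ".xml")
```

With a suspicious text and a source text loaded as strings, `passage(suspicious, case.this_offset, case.this_length)` and `passage(source, case.source_offset, case.source_length)` give the two sides of one annotated case. For obfuscated cases the two passages will generally differ.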

III. PURPOSE

The purpose of the ExAra corpus is to evaluate automatic plagiarism detection methods, notably methods that follow the external approach. This approach uncovers plagiarized passages on the basis of their similarity to passages in the source documents.

Note that some suspicious documents in the ExAra corpus contain religious quotations (e.g., from the Quran and the Hadith) and common phrases. Some of these also appear in source documents, so simple plagiarism detection software may flag them as plagiarism. However, quotations and common phrases are legitimate cases of text reuse and are not annotated in the ExAra XML files. Plagiarism detection systems evaluated on ExAra should therefore not treat religious quotations and common phrases as plagiarism cases unless they appear as part of a larger plagiarism case.

IV. BUILDING METHODS

The documents that make up the ExAra corpus do not contain actual plagiarism cases. They are artificial suspicious documents in which plagiarism was created automatically by software that takes fragments of text from one or more source documents and inserts them into another document according to a set of parameters, namely the percentage of plagiarism and the lengths of the plagiarized passages. Some of the plagiarized fragments were obfuscated manually or automatically before being inserted into the suspicious documents.

This building method is the same as the one used to construct the PAN 2009-2011 plagiarism detection corpora (see http://pan.webis.de for more information on the PAN competition and its corpora).
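To make the insertion procedure concrete, here is a toy sketch of the idea. It is not the software used to build ExAra: the function name `make_suspicious`, its parameters, and the random choices are all illustrative assumptions, and no obfuscation is applied. It copies fragments from source texts into a host text at random positions until a target share of plagiarized characters is reached, and records the offset and length of each fragment on both sides.

```python
import random


def make_suspicious(host, sources, ratio, min_len, max_len, seed=0):
    """Insert source fragments into `host`; return (text, annotations).

    ratio   : target share of plagiarized characters in the output (0 <= ratio < 1)
    min_len : minimum fragment length in characters (>= 1)
    max_len : maximum fragment length in characters
    """
    if not 0 <= ratio < 1 or not 1 <= min_len <= max_len:
        raise ValueError("invalid parameters")
    if not sources or any(not text for text in sources.values()):
        raise ValueError("sources must be non-empty texts")
    rng = random.Random(seed)
    names = sorted(sources)

    # Characters to insert so that inserted / (host + inserted) reaches ratio.
    target = ratio * len(host) / (1 - ratio)
    fragments, total = [], 0
    while total < target:
        name = rng.choice(names)
        length = min(rng.randint(min_len, max_len), len(sources[name]))
        offset = rng.randint(0, len(sources[name]) - length)
        fragments.append((name, offset, length))
        total += length

    # Random insertion points in the host, processed from left to right.
    positions = sorted(rng.randint(0, len(host)) for _ in fragments)
    parts, annotations = [], []
    cursor = 0   # how much of the host has been copied so far
    out_len = 0  # length of the output built so far
    for pos, (name, offset, length) in zip(positions, fragments):
        parts.append(host[cursor:pos])
        out_len += pos - cursor
        cursor = pos
        annotations.append({
            "this_offset": out_len, "this_length": length,
            "source_reference": name,
            "source_offset": offset, "source_length": length,
        })
        parts.append(sources[name][offset:offset + length])
        out_len += length
    parts.append(host[cursor:])
    return "".join(parts), annotations
```

Because the fragments are copied verbatim, every annotated span in the output equals the corresponding span in its source. In the real corpus this holds only for the cases without obfuscation.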

V. LANGUAGE AND ENCODING

All the text documents in this corpus are written in Arabic and encoded in UTF-8 without BOM.

VI. HOW TO CITE THE CORPUS?

If you publish a paper about your experiments with the ExAra corpus, please cite the following paper:
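Since the annotation offsets are expressed in characters and Arabic letters take two bytes each in UTF-8, character offsets and byte offsets differ. The sketch below shows one way to load a document so that string indices line up with character offsets. The helper name `decode_document` is an assumption for illustration. Decoding the raw bytes directly also avoids Python's newline translation, which could shift offsets if a file used CRLF line endings (whether any do is not stated here).

```python
import codecs
from pathlib import Path


def decode_document(raw: bytes) -> str:
    """Decode a corpus file as strict UTF-8, rejecting an unexpected BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        # The corpus is described as BOM-free; a BOM would shift every offset by one.
        raise ValueError("unexpected UTF-8 BOM")
    return raw.decode("utf-8")


def read_document(path: Path) -> str:
    """Read bytes and decode them, so no newline translation takes place."""
    return decode_document(path.read_bytes())


sample = "نص عربي"  # 6 Arabic letters and 1 space
char_count = len(sample)                  # 7 characters
byte_count = len(sample.encode("utf-8"))  # 13 bytes
```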

Bensalem, I., Boukhalfa, I., Rosso, P., Abouenour, L., Darwish, K., & Chikhi, S.: Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection. In P. Majumder, M. Mitra, M. Agrawal, & P. Mehta (Eds.), Post Proceedings of the Workshops at the 7th Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, December 4-6, CEUR proceedings vol. 1587 (pp. 111–122). CEUR-WS.org (2015).


VII. WARNING

Note that the Arabic texts may contain quotations from the Quran and the Hadith. Because text insertion is automatic and at random positions, plagiarized text may have been inserted unintentionally between Quranic verses or between sentences of a Hadith cited in a document, and the inserted passages may therefore alter the meaning of the original text. For these reasons, this corpus must not be used outside the purpose for which it was built. Examples of inappropriate use include using the corpus documents as a source of knowledge or distributing them without mentioning that they contain borrowed text. If you are not interested in plagiarism detection and are keeping the corpus because it contains articles you want to read, then this corpus is not the right source. Please refer instead to the sources mentioned in (Bensalem et al. 2015) (i.e., the paper above), where you can find the original content of the articles you are looking for.

We emphasize that we are not responsible for the results of any use of this corpus other than the evaluation of external plagiarism detection methods.

VIII. CONTACT US

We will be happy to hear about your experience in using the ExAra corpus. Please do not hesitate to contact us at the following email address: bens.imene@gmail.com

Imene Bensalem¹, Imene Boukhalfa¹, Paolo Rosso², Salim Chikhi¹

¹MISC Lab., Constantine 2 University, Algeria

²PRHLT, Universitat Politècnica de València, Spain

Files

ExAraCorpusPAN2015.rar (16.0 MB), md5: e2ad38494806e6010a640731de54734b
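The MD5 checksum listed for the archive can be used to verify a download. A minimal sketch using Python's standard `hashlib` follows; the helper names `md5_of_file` and `is_intact` are assumptions, and the local path in the usage note is hypothetical.

```python
import hashlib
from pathlib import Path

# Checksum as listed for ExAraCorpusPAN2015.rar in this record.
EXPECTED_MD5 = "e2ad38494806e6010a640731de54734b"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `is_intact(Path("ExAraCorpusPAN2015.rar"))` should return `True` for an uncorrupted copy saved under that name in the current directory.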

Additional details

Bensalem, I., Boukhalfa, I., Rosso, P., Abouenour, L., Darwish, K., & Chikhi, S.: Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection. In P. Majumder, M. Mitra, M. Agrawal, & P. Mehta (Eds.), Post Proceedings of the Workshops at the 7th Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, December 4-6, CEUR proceedings vol. 1587 (pp. 111–122). CEUR-WS.org (2015).
