Evaluation Corpus for ARAbic EXternal plagiarism detection (ExAra Corpus) 

This corpus was used in the AraPlagDet 2015 shared task.
More details can be found at: https://araplagdet.misc-lab.org/
                           or https://pan.webis.de/fire15/pan15-web/index.html

	I.    SYNOPSIS
	II.   DESCRIPTION
	III.  PURPOSE
	IV.   BUILDING METHODS 
	V.    LANGUAGE AND ENCODING
	VI.   HOW TO CITE THE CORPUS?
	VII.  WARNING
	VIII. CONTACT US

I. SYNOPSIS 

The ExAra corpus comprises 2345 documents; almost half of them (suspicious documents)
contain passages borrowed from the other half (source documents) to simulate
documents that contain plagiarized fragments. The corpus has 2 parts: training
and test.

II. DESCRIPTION 

Each part of the corpus (training and test) consists mainly of 3 datasets:
2 sets of text files and 1 set of XML files. The 2 sets of text files are
the suspicious documents (i.e., the documents that contain artificial plagiarism) and
the source documents (i.e., the documents from which the suspicious passages have
been plagiarized). The 3rd set contains XML files, which are the plagiarism
annotations, i.e., they provide for each plagiarized passage its starting offset and its
length in both the suspicious and source documents (offset and length are both expressed
in characters). A suspicious document file (.txt) and its plagiarism annotation file (.xml)
share the same name.
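
As an illustration of how such an annotation could be read, here is a minimal
Python sketch. This README does not specify the XML schema, so the element name
(feature) and the attribute names (this_offset, this_length, source_reference,
source_offset, source_length) are assumptions modeled on the PAN corpus format;
check them against an actual annotation file. The sample file names and values
are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class PlagiarismCase:
    suspicious_offset: int  # start of the passage in the suspicious document
    suspicious_length: int  # length of the passage in the suspicious document
    source_name: str        # file name of the source document
    source_offset: int      # start of the passage in the source document
    source_length: int      # length of the passage in the source document


def read_cases(xml_text: str) -> list[PlagiarismCase]:
    """Parse one annotation file. Attribute names are ASSUMED (PAN-style)."""
    root = ET.fromstring(xml_text)
    cases = []
    for feature in root.iter("feature"):
        # Skip elements that do not point at a source document.
        if feature.get("source_reference") is None:
            continue
        cases.append(PlagiarismCase(
            suspicious_offset=int(feature.get("this_offset")),
            suspicious_length=int(feature.get("this_length")),
            source_name=feature.get("source_reference"),
            source_offset=int(feature.get("source_offset")),
            source_length=int(feature.get("source_length")),
        ))
    return cases


def passage(text: str, offset: int, length: int) -> str:
    """Slice a passage by character offset and length (not bytes)."""
    return text[offset:offset + length]


# Hypothetical annotation, not taken from the corpus.
SAMPLE_XML = """<document reference="suspicious-document0001.txt">
  <feature name="plagiarism" this_offset="10" this_length="5"
           source_reference="source-document0042.txt"
           source_offset="3" source_length="5" />
</document>"""

cases = read_cases(SAMPLE_XML)
```

Because offsets are expressed in characters, the slice is taken on the decoded
text (a Python str), not on the raw UTF-8 bytes.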

III. PURPOSE 

The purpose of the ExAra corpus is to evaluate automatic plagiarism detection methods, notably
methods of the external approach. This approach consists of uncovering the plagiarized
passages on the basis of their similarity with passages in the source documents.
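
To make the idea of similarity-based detection concrete, here is a minimal
Python sketch that scores a suspicious text against one source by the share of
its word 3-grams that also occur in the source. This is only an illustrative
baseline, not one of the AraPlagDet methods. The function names, the whitespace
tokenization, and the choice of n = 3 are assumptions; real systems for Arabic
usually add normalization and locate passages rather than score whole documents.

```python
def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of a text (whitespace tokenization)."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngram_ratio(suspicious: str, source: str, n: int = 3) -> float:
    """Fraction of the suspicious text's n-grams that also occur in the source."""
    suspicious_grams = word_ngrams(suspicious, n)
    if not suspicious_grams:
        return 0.0
    shared = suspicious_grams & word_ngrams(source, n)
    return len(shared) / len(suspicious_grams)


# Placeholder tokens stand in for Arabic words.
ratio = shared_ngram_ratio("w1 w2 w3 w4 w5", "x w2 w3 w4 y")
```

A high ratio only suggests that the pair deserves a closer, passage-level
comparison; it does not by itself identify the offsets of a plagiarism case.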

It should be noted that some suspicious documents in the ExAra corpus contain religious
quotations (e.g., Quran and Hadith) and common phrases. Some of them also appear in some
source documents, and hence simple plagiarism detection software may flag them as
plagiarism. However, quotations and common phrases are legitimate cases of text reuse and are
not annotated in the XML files in ExAra. Therefore, plagiarism detection systems evaluated
on ExAra should not treat religious quotations and common phrases as plagiarism cases
unless they appear as part of a larger plagiarism case.

IV. BUILDING METHODS 

The documents that compose the ExAra corpus do not contain actual plagiarism
cases. They are instead artificial suspicious documents in which
plagiarism was created automatically by software that takes fragments
of text from one or more source documents and inserts them into another
one according to a set of parameters, namely the percentage of plagiarism
and the lengths of the plagiarized passages. Some of the plagiarized fragments are
obfuscated manually or automatically before being inserted into the suspicious documents.
This building method is the same as the one used to construct the PAN 2009-2011 plagiarism
detection corpora (see http://pan.webis.de for more information on the PAN competition and its
corpora).
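
The insertion step described above can be pictured with the following minimal
Python sketch. It is not the software used to build ExAra: the function name,
the dictionary keys, and the uniform choice of insertion position are
assumptions for illustration, and obfuscation is left out.

```python
import random


def insert_fragment(suspicious: str, source: str, source_offset: int,
                    length: int, rng: random.Random) -> tuple[str, dict]:
    """Copy a fragment of the source into the suspicious text at a random position."""
    fragment = source[source_offset:source_offset + length]
    position = rng.randint(0, len(suspicious))  # both ends inclusive
    new_text = suspicious[:position] + fragment + suspicious[position:]
    # Record where the fragment landed, in characters, as an annotation would.
    annotation = {
        "this_offset": position,
        "this_length": len(fragment),
        "source_offset": source_offset,
        "source_length": len(fragment),
    }
    return new_text, annotation


rng = random.Random(0)  # fixed seed so the sketch is repeatable
text, ann = insert_fragment("AAAAAAAAAA", "0123456789", 2, 4, rng)
```

Whatever position is drawn, slicing the new text with the recorded offset and
length gives back the copied fragment.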

V. LANGUAGE AND ENCODING 

All the text documents of this corpus are written in Arabic
and encoded in UTF-8 without BOM.
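
Since the annotation offsets are counted in characters, it helps to load the
files in a way that neither drops nor rewrites characters. The Python sketch
below is one possible way to do this; the function name and the decision to
reject a file that starts with a BOM are assumptions, not requirements of the
corpus.

```python
import codecs
from pathlib import Path


def read_corpus_text(path: str) -> str:
    """Read a corpus file as UTF-8 without translating line endings."""
    raw = Path(path).read_bytes()
    # The corpus is documented as BOM-free; a BOM suggests the file was
    # re-saved by an editor, which would shift character offsets by one.
    if raw.startswith(codecs.BOM_UTF8):
        raise ValueError(f"unexpected UTF-8 BOM in {path}")
    # Decoding the bytes directly keeps any "\r\n" as two characters.
    return raw.decode("utf-8")
```

Reading the bytes and decoding them explicitly avoids the newline translation
that text-mode file reading applies by default in Python.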
 

VI. HOW TO CITE THE CORPUS?

If you publish a paper about your experiments using the ExAra corpus,
please cite the following paper:

Bensalem, I., Boukhalfa, I., Rosso, P., Abouenour, L., Darwish, K., & Chikhi, S.: 
Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection. 
In P. Majumder, M. Mitra, M. Agrawal, & P. Mehta (Eds.), Post Proceedings of the Workshops
at the 7th Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, 
December 4-6, CEUR proceedings vol. 1587 (pp. 111–122). CEUR-WS.org (2015).

We encourage you to compare your method tested on ExAra with the methods of the AraPlagDet
competition described in the paper above.

VII. WARNING 

It should be noted that the Arabic texts may contain quotations from the
Quran and the Hadith. Because text insertion is automatic and at random
positions, plagiarized text may be inserted unintentionally between Quranic
verses or sentences of a Hadith cited in a document. Hence, the inserted
passages may alter the meaning of the original text. For these reasons, this
corpus must not be used outside the purpose for which it was built. Examples
of inappropriate use include using the corpus documents as a source of
knowledge or distributing them without mentioning that they contain
borrowed texts. If you are not interested in plagiarism detection and
you want the corpus only because it contains articles you wish to read,
then this corpus is not the right source. Please refer instead to the
sources mentioned in (Bensalem et al. 2015) (i.e., the paper above), where
you can find the original content of the articles you are looking for.
We emphasize that we are not responsible for the results of any use of this
corpus other than the evaluation of external plagiarism detection methods.

VIII. CONTACT US

We will be happy to hear from you about your experience in using the ExAra
corpus. Please do not hesitate to contact us at the following email
address: bens.imene@gmail.com

Imene Bensalem¹, Imene Boukhalfa¹, Paolo Rosso², Salim Chikhi¹
¹MISC Lab., Constantine 2 University, Algeria
²PRHLT, Universitat Politècnica de València, Spain 

