Published January 31, 2014
| Version v1
Dataset
Restricted
FIRE14 Detection of SOurce COde Re-use
- 1. Universitat Politècnica de València
- 2. Universidad Autonoma Metropolitana
Description
This data was used for the PAN shared task on source code re-use detection at FIRE2014.
Please find the task description at https://pan.webis.de/fire14/pan14-web/index.html.
THIS DATA
For the training phase we provide an annotated corpus in which each file carries its programming language extension. The annotations state whether a text fragment has been re-used and, if so, what its source is.
- The collection consists of source code files written in Java and C.
- Re-use is committed in both programming languages, but ONLY at the monolingual level.
- The Java collection contains 259 source code files, from 000.java to 258.java.
- The C collection contains 79 source code files, from 000.c to 078.c.
- Relevance judgements represent re-use in both directions (a→b and b→a).
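
The file naming and the two-directional relevance judgements described above can be sketched as follows. This is a minimal illustration, not part of the dataset: the helper names `expected_files` and `symmetric_pairs` are hypothetical, and the on-disk format of the relevance judgements is not specified here, so pairs are simply modelled as tuples of file names.

```python
def expected_files(count, ext):
    """List the zero-padded file names 000.<ext> .. (count-1).<ext>."""
    return [f"{i:03d}.{ext}" for i in range(count)]


def symmetric_pairs(pairs):
    """Expand re-use pairs so that both a->b and b->a are present."""
    expanded = set()
    for a, b in pairs:
        expanded.add((a, b))
        expanded.add((b, a))
    return expanded


# Training collection sizes as stated in the description.
java_files = expected_files(259, "java")  # 000.java .. 258.java
c_files = expected_files(79, "c")         # 000.c .. 078.c

# Hypothetical judgement, used only to illustrate the expansion.
judgements = symmetric_pairs({("000.java", "001.java")})
```

Comparing `java_files` and `c_files` against a directory listing is one way to check that a download is complete.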
In the test phase the only annotation that will be provided in the corpus is the programming language extensions.
- It is divided by programming language (C/C++ and Java), so you do not need any pre-processing to identify the programming language of the source code files.
- Each programming language folder contains 6 folders (A1, B1, B2, C1 and C2), each of which contains a specific scenario with monolingual re-use.
- There is no re-use between scenarios, so you only need to look for re-used cases among the source code files inside each folder.
- Each file name consists of the name of the scenario to which the file belongs plus an identifier. For example, file "B10021" belongs to scenario B1 and its identifier number is 0021.
- Re-use cannot exist between source code files that belong to different scenarios. For example, you do not have to submit a re-used case between files "B10021" and "B20013": the first belongs to scenario B1, while the second belongs to B2.
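
The naming rule above can be sketched in code as follows. This is a minimal illustration under stated assumptions: the scenario name is taken to be the first two characters of the file name and the remainder the identifier, as in the "B10021" example; any file extension is stripped first; and the helper names `parse_name` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from pathlib import PurePath


def parse_name(filename):
    """Split a test file name into (scenario, identifier).

    Assumes a two-character scenario prefix, e.g. "B10021" -> ("B1", "0021").
    """
    stem = PurePath(filename).stem  # drops an extension if one is present
    return stem[:2], stem[2:]


def candidate_pairs(filenames):
    """Return unordered file pairs restricted to the same scenario."""
    by_scenario = defaultdict(list)
    for name in filenames:
        scenario, _ = parse_name(name)
        by_scenario[scenario].append(name)

    pairs = []
    for scenario in sorted(by_scenario):
        # Pairs are only formed inside one scenario, never across scenarios.
        pairs.extend(combinations(sorted(by_scenario[scenario]), 2))
    return pairs
```

With the files from the example, `candidate_pairs(["B10021", "B20013"])` returns an empty list, since the two files belong to different scenarios and are never compared.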