Published September 2, 2021 | Version 1.0.0
Dataset Open

Super Code Clone Detection - 88 (SCD-88)

Creators

  • 1. Anonymous

Description

SCD-88 is the Python-specific subset of the Cross-Language Clone Detection dataset, which was originally extracted from AtCoder, a popular online judge. We reformulate the original classification task as a retrieval task: given a code snippet (the query) and a collection of candidates, the goal is to return the top-k candidates with the same semantics as the query. Models can therefore be evaluated with the MAP@R score. MAP@R is the mean of the average precision scores over all queries, where each average precision is computed over the R most similar samples retrieved for that query. For a given query, R is the number of other code snippets in the same class, i.e. R = 129 in this dataset. The newly sampled dataset contains a total of 11,440 examples, split as follows: 7,800 / 1,040 / 2,600 (Train / Valid / Test).
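To make the metric concrete, the following is a minimal sketch of MAP@R in plain Python. It is not the dataset authors' evaluation script. It assumes the definition commonly used in the metric-learning literature, where a hit at rank i contributes the precision at i and the sum is divided by R. The function names, the dot-product similarity, and the toy embeddings are all illustrative assumptions.

```python
from typing import Sequence


def average_precision_at_r(query_label, ranked_labels: Sequence) -> float:
    """AP@R for one query (assumed definition, see lead-in).

    ranked_labels holds the labels of all other samples, ordered from most
    to least similar to the query; the query itself must be excluded.
    """
    # R = number of other samples that share the query's class.
    r = sum(1 for label in ranked_labels if label == query_label)
    if r == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Only the R most similar candidates are inspected.
    for rank, label in enumerate(ranked_labels[:r], start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / r


def map_at_r(embeddings: Sequence[Sequence[float]], labels: Sequence) -> float:
    """Mean AP@R over all samples, each used once as the query.

    Similarity is a plain dot product here; this is an assumption, and any
    other similarity function could be substituted.
    """
    scores = []
    for i, (query, query_label) in enumerate(zip(embeddings, labels)):
        candidates = [
            (sum(a * b for a, b in zip(query, other)), labels[j])
            for j, other in enumerate(embeddings)
            if j != i  # exclude the query itself
        ]
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        ranked = [label for _, label in candidates]
        scores.append(average_precision_at_r(query_label, ranked))
    return sum(scores) / len(scores)


# Toy data (hypothetical): two classes with two samples each, so R = 1.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["a", "a", "b", "b"]
toy_score = map_at_r(toy_embeddings, toy_labels)
```

In the toy data every query's nearest neighbour belongs to its own class, so `toy_score` is 1.0. For SCD-88 the same procedure would apply with R = 129 for every query.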

Files

label-map.txt

Files (2.5 MB)

Size         Checksum
870 Bytes    md5:8128481913a0f5cace8998144f3f1a15
513.5 kB     md5:bc2a63a97197c32d18b9abac6b20d1b3
1.7 MB       md5:8b6d5f947147aba6530376ed472da7bf
322.4 kB     md5:3ea5d45a8099d3cb62c55e29d6ff01c7