Super Code Clone Detection - 88 (SCD-88)

Anonymous

doi:10.5281/zenodo.5388452

Published September 2, 2021 | Version 1.0.0

Dataset Open

SCD-88 is the Python-specific subset of the Cross-Language Clone Detection dataset, which was originally extracted from AtCoder, a popular online judge. We reformulate the original classification task as a retrieval task: given a code snippet (the query) and a collection of candidates as input, the task is to return the top-k candidates with the same semantics as the query. Models can hence be evaluated by the MAP@R score. MAP@R is defined as the mean of the average precision scores, each of which is computed over the R most similar samples retrieved for a query. For a given query, R is the number of other codes in the same class, i.e. R = 129 in this dataset. The newly sampled dataset contains 11,440 examples in total, split as follows: 7,800 / 1,040 / 2,600 (Train / Valid / Test).
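The MAP@R definition above can be illustrated with a minimal sketch. It is not the dataset's official evaluator: the function name map_at_r, the integer label encoding, and the assumption that each query comes with a precomputed similarity ranking of the other examples are all illustrative choices. The sketch also assumes the common convention of dividing each query's summed precision by R.

```python
from collections import Counter
from typing import List, Sequence


def map_at_r(labels: Sequence[int], ranked: Sequence[Sequence[int]]) -> float:
    """Hypothetical MAP@R helper (not the official evaluation script).

    labels[i] is the class of example i.
    ranked[i] lists the indices of the *other* examples, sorted from most
    to least similar to query i (the query itself must not appear).
    """
    class_sizes = Counter(labels)
    per_query: List[float] = []
    for query, order in enumerate(ranked):
        # R = number of other examples sharing the query's class.
        r = class_sizes[labels[query]] - 1
        if r == 0:
            continue  # no relevant candidates, so the query is skipped
        hits = 0
        precision_sum = 0.0
        for rank, candidate in enumerate(order[:r], start=1):
            if labels[candidate] == labels[query]:
                hits += 1
                precision_sum += hits / rank  # precision at this rank
        per_query.append(precision_sum / r)
    return sum(per_query) / len(per_query)


# Toy data: two classes of three examples each, so R = 2 for every query.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_ranked = [
    [1, 3, 2, 4, 5],
    [0, 2, 3, 4, 5],
    [3, 4, 0, 1, 5],
    [4, 5, 0, 1, 2],
    [0, 3, 5, 1, 2],
    [3, 0, 4, 1, 2],
]
toy_score = map_at_r(toy_labels, toy_ranked)
print(round(toy_score, 4))  # 0.5417
```

In the toy data the per-query scores are 0.5, 1.0, 0.0, 1.0, 0.25 and 0.5, whose mean is 3.25 / 6. With SCD-88 itself, each ranking would be truncated at R = 129 instead of 2.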

Files (2.5 MB)

Name	MD5	Size
label-map.txt	8128481913a0f5cace8998144f3f1a15	870 Bytes
test.jsonl	bc2a63a97197c32d18b9abac6b20d1b3	513.5 kB
train.jsonl	8b6d5f947147aba6530376ed472da7bf	1.7 MB
valid.jsonl	3ea5d45a8099d3cb62c55e29d6ff01c7	322.4 kB
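The MD5 checksums listed above can be used to check downloaded copies of the files. The following is a minimal sketch using only the Python standard library. The helper names md5_of and verify_downloads are hypothetical, and the sketch assumes the four files were saved under their original names in a single local directory.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "label-map.txt": "8128481913a0f5cace8998144f3f1a15",
    "test.jsonl": "bc2a63a97197c32d18b9abac6b20d1b3",
    "train.jsonl": "8b6d5f947147aba6530376ed472da7bf",
    "valid.jsonl": "3ea5d45a8099d3cb62c55e29d6ff01c7",
}


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory: str) -> Dict[str, bool]:
    """Map each expected file name to True if it exists and its MD5 matches."""
    root = Path(directory)
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = root / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results
```

For example, verify_downloads("scd88") returns a dictionary such as {"train.jsonl": True, ...}, where "scd88" is a placeholder directory name. A False value means the file is missing or its content differs from the published version.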

	All versions	This version
Views	596	592
Downloads	155	154
Data volume	124.8 MB	124.3 MB
