Published January 27, 2020 | Version v1
Dataset
Open
Raw C Code Corpus
- 1. The University of Edinburgh
- 2. Free University of Bozen-Bolzano
- 3. Google Research
Description
A raw code corpus for the C programming language: it includes only the C source files of each repository, without any preprocessing.
The corpus was used to generate the C training, validation, testing, and BPE encoding sets for the experiments performed in the paper: Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code.
Files
(2.5 GB)
MD5 checksum | Size |
---|---|
md5:cd5e06cec9a72518f37724987880e00d | 2.5 GB |
md5:c7f4a07083ff6adb2bbcc41aec763d11 | 49.4 kB |
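The MD5 values listed for the two files can be used to check that a download completed intact. The following is a minimal sketch, not an official tool for this record: the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and because the file names are not shown in this listing, the path to the downloaded file has to be supplied by the reader.

```python
import hashlib

# MD5 values copied from the Files listing of this record.
LISTED_MD5_LARGE = "cd5e06cec9a72518f37724987880e00d"  # 2.5 GB file
LISTED_MD5_SMALL = "c7f4a07083ff6adb2bbcc41aec763d11"  # 49.4 kB file


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks.

    Chunked reading avoids loading a multi-gigabyte archive into memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected_hex):
    """Return True if the file's MD5 equals the expected hex digest."""
    # Accept an optional "md5:" prefix, as used in the listing.
    expected = expected_hex.lower().removeprefix("md5:")
    return md5_of_file(path) == expected
```

For example, `matches_listed_md5(path_to_download, LISTED_MD5_LARGE)` should return `True` for an intact copy of the 2.5 GB file, where `path_to_download` is wherever the reader saved it. `str.removeprefix` requires Python 3.9 or later.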
Additional details
Related works
- Is source of
- Dataset: 10.5281/zenodo.3628638 (DOI)
- Other: 10.5281/zenodo.3628628 (DOI)
Funding
- UK Research and Innovation
- EPSRC Centre for Doctoral Training in Data Science EP/L016427/1