Published January 27, 2020 | Version v1
Dataset
Open
Raw Python Code Corpus
- 1. The University of Edinburgh
- 2. Free University of Bozen-Bolzano
- 3. Google Research
Description
A raw code corpus for the Python programming language: it contains only the Python source files of each repository, without any preprocessing.
The corpus was used to generate the Python training, validation, test, and BPE encoding sets for the experiments in the paper "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code".
Files
(2.0 GB)
Checksum | Size |
---|---|
md5:e51aef174ae9225c666837d2d1f5baf5 | 2.0 GB |
md5:8c5c43f5bee90f7012ca0211a9d381b9 | 480.3 kB |
Additional details
Related works
- Is source of
- Dataset: 10.5281/zenodo.3628636 (DOI)
- Other: 10.5281/zenodo.3628628 (DOI)
Funding
- EPSRC Centre for Doctoral Training in Data Science EP/L016427/1
- UK Research and Innovation