Published January 27, 2020 | Version v1
Dataset Open

Raw Python Code Corpus

  • 1. The University of Edinburgh
  • 2. Free University of Bozen-Bolzano
  • 3. Google Research

Description

A raw corpus of Python code: it contains only the Python source files of each repository, without any preprocessing.
The corpus was used to generate the Python training, validation, test, and BPE encoding sets for the experiments in the paper "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code".
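
Because the files are not preprocessed, a consumer has to locate and decode them without help from the corpus. The sketch below shows one way to walk a local copy; it assumes the archive has already been extracted to a directory (the `corpus_root` argument is a hypothetical path, and the layout of one sub-directory per repository is an assumption, not something this record states). The names `iter_python_sources` and `read_source` are illustrative only.

```python
from pathlib import Path


def iter_python_sources(corpus_root):
    """Yield every *.py file under corpus_root in a stable, sorted order."""
    for path in sorted(Path(corpus_root).rglob("*.py")):
        if path.is_file():  # skip directories that happen to end in .py
            yield path


def read_source(path):
    """Read one source file as text.

    Raw files are not guaranteed to be valid UTF-8, so undecodable
    bytes are replaced instead of raising an exception.
    """
    return Path(path).read_text(encoding="utf-8", errors="replace")
```

Replacing undecodable bytes keeps a long pass over the corpus from stopping at the first badly encoded file; a stricter pipeline might prefer to skip such files instead.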

Files

Files (2.0 GB)

Checksum                                Size
md5:e51aef174ae9225c666837d2d1f5baf5    2.0 GB
md5:8c5c43f5bee90f7012ca0211a9d381b9    480.3 kB
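
The MD5 values listed above can be used to check that a download completed intact. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listed_md5` are illustrative, and the path of the downloaded file is up to the reader, since the file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte archive never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed value such as 'md5:e51a...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_md5(local_path, "md5:e51aef174ae9225c666837d2d1f5baf5")` should return `True` for an uncorrupted copy of the 2.0 GB file, where `local_path` is wherever the reader saved it. MD5 is adequate for detecting accidental corruption but is not a defence against deliberate tampering.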

Additional details

Related works

Is source of
Dataset: 10.5281/zenodo.3628636 (DOI)
Other: 10.5281/zenodo.3628628 (DOI)

Funding

UK Research and Innovation
EPSRC Centre for Doctoral Training in Data Science EP/L016427/1