Published January 27, 2020 | Version v1
Dataset Open

Raw C Code Corpus

  • 1. The University of Edinburgh
  • 2. Free University of Bozen-Bolzano
  • 3. Google Research

Description

A raw code corpus for the C programming language: it includes only the C source files of each repository, without any preprocessing.
The corpus was used to generate the C training, validation, testing, and BPE encoding sets for the experiments in the paper "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code".

Files

Files (2.5 GB)

Checksum                               Size
md5:cd5e06cec9a72518f37724987880e00d   2.5 GB
md5:c7f4a07083ff6adb2bbcc41aec763d11   49.4 kB
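
The MD5 digests listed above can be used to check that a download completed intact. The following is a minimal sketch, not part of the dataset itself: the helper names `md5_of_file` and `matches_listed_checksum` are hypothetical, and the file names are not given on this page, so the path passed in is whatever name the downloaded file has locally.

```python
import hashlib

# MD5 digests as listed on this page, with the "md5:" prefix removed.
LISTED_MD5 = {
    "cd5e06cec9a72518f37724987880e00d",  # the 2.5 GB file
    "c7f4a07083ff6adb2bbcc41aec763d11",  # the 49.4 kB file
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a multi-gigabyte archive is never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """Return True if the file's MD5 digest is one of the digests listed above."""
    return md5_of_file(path) in LISTED_MD5
```

Calling `matches_listed_checksum` on a downloaded file returns `False` when the digest matches neither listed value, which usually indicates a truncated or corrupted download.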

Additional details

Related works

Is source of
Dataset: 10.5281/zenodo.3628638 (DOI)
Other: 10.5281/zenodo.3628628 (DOI)

Funding

UK Research and Innovation
EPSRC Centre for Doctoral Training in Data Science EP/L016427/1