Raw C Code Corpus

Rafael-Michael Karampatsis; Hlib Babii; Romain Robbes; Charles Sutton; Andrea Janes

doi:10.5281/zenodo.3628775

Published January 27, 2020 | Version v1

Dataset Open

1. The University of Edinburgh
2. Free University of Bozen-Bolzano
3. Google Research

A raw code corpus for the C programming language, i.e., it includes only the C source files of each repository, without any preprocessing.
The corpus was used to generate the C training, validation, test, and BPE encoding sets for the experiments in the paper "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code".
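
Because the corpus ships as a single compressed archive of unprocessed source files, it can be read member by member without unpacking all 2.5 GB to disk. The following is a minimal sketch under stated assumptions: the function name `iter_source_files` is hypothetical, the internal directory layout of the archive is not documented on this page, and selecting files by the `.c` and `.h` suffixes is a guess that should be adjusted after inspecting the archive.

```python
import tarfile


def iter_source_files(archive_path, suffixes=(".c", ".h")):
    """Yield (member_name, text) for regular files whose names end in `suffixes`.

    `suffixes` must be a tuple of strings. The default is an assumption about
    the corpus layout, not something stated by the dataset record.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(suffixes):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            # Real-world source files are not always valid UTF-8, so undecodable
            # bytes are replaced rather than raising an error.
            yield member.name, handle.read().decode("utf-8", errors="replace")
```

A typical use would be `for name, text in iter_source_files("c-corpus.tar.gz"): ...`, which keeps only one file in memory at a time.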

Files (2.5 GB)

Name	MD5	Size
c-corpus.tar.gz	cd5e06cec9a72518f37724987880e00d	2.5 GB
c_dataset_stats.tar.gz	c7f4a07083ff6adb2bbcc41aec763d11	49.4 kB
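
The MD5 values listed for the two files can be used to check that a download completed without corruption. The sketch below assumes the archives were saved under their original names in the current directory; the helper names `md5_of_file` and `verify_download` are hypothetical and not part of any tooling shipped with the dataset.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "c-corpus.tar.gz": "cd5e06cec9a72518f37724987880e00d",
    "c_dataset_stats.tar.gz": "c7f4a07083ff6adb2bbcc41aec763d11",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name."""
    expected = EXPECTED_MD5[Path(path).name]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    for name in EXPECTED_MD5:
        if Path(name).exists():
            status = "OK" if verify_download(name) else "MISMATCH"
            print(f"{name}: {status}")
        else:
            print(f"{name}: not found in current directory")
```

Reading in chunks keeps memory use constant, which matters for the 2.5 GB corpus archive.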

Additional details

Is source of: Dataset: 10.5281/zenodo.3628638 (DOI); Other: 10.5281/zenodo.3628628 (DOI)

UK Research and Innovation
EPSRC Centre for Doctoral Training in Data Science EP/L016427/1


	All versions	This version
Views	1,558	1,549
Downloads	407	405
Data volume	895.4 GB	890.5 GB

Resource type: Dataset
Publisher: Zenodo
Conference: 42nd International Conference on Software Engineering (ICSE 2020), Seoul, South Korea, 23-29 May 2020

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: January 27, 2020
Modified: January 29, 2020
