Published February 26, 2022 | Version v4
Dataset | Open Access

A Systematic Evaluation of Large Language Models of Code

  • Carnegie Mellon University

Description

These are datasets for the paper:

"A Systematic Evaluation of Large Language Models of Code"

https://arxiv.org/pdf/2202.13169.pdf

The code is available at: https://github.com/VHellendoorn/Code-LMs


The file "unseen_test_sets.tar.gz" contains test sets of ~100 files in each of 12 programming languages.

These files are not included in The Pile, and thus models such as GPT-Neo, GPT-J, and GPT-NeoX were not trained on them.

In the paper, we use these test sets to compare a variety of language models of code, including OpenAI's Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, as well as our PolyCoder model.
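For a first look at these test sets, a short Python sketch like the one below tallies the files per language. It assumes (not confirmed here) that the archive groups files into one top-level directory per language, so the layout should be checked against the repository.

    import tarfile
    from collections import defaultdict

    # Sketch: tally test files per language in unseen_test_sets.tar.gz.
    # Assumption: the archive has one top-level directory per language.
    ARCHIVE = "unseen_test_sets.tar.gz"  # local path to the downloaded archive

    counts = defaultdict(int)
    sizes = defaultdict(int)
    with tarfile.open(ARCHIVE, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            lang = member.name.split("/")[0]  # assumed per-language directory
            counts[lang] += 1
            sizes[lang] += member.size

    for lang in sorted(counts):
        print(f"{lang}: {counts[lang]} files, {sizes[lang] / 1e6:.2f} MB")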


The file "index.zip" includes an index of the training set file paths and commit SHAs.


The other files, such as "2-7B-150K.tar", are trained model checkpoints, as explained at https://github.com/VHellendoorn/Code-LMs.

Notes

https://arxiv.org/abs/2202.13169

Files (16.3 GB)

MD5                                 Size
a4a367be4fbee217bca4e19ef203cf99    810.8 MB
530400d496940107b1745a0317812795    810.8 MB
ca615c1547288fe29b750992259dea8c    2.3 GB
dbdba8f93e511cf9da3407d04c6c3066    5.6 GB
9d503f66a809240b1d6d59b3b0569f40    5.6 GB
eaf63ec6e69478bde42b1b28a46a1881    1.3 GB
6473b2cb1a475de40309832237922c2d    898.7 kB
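The MD5 checksums above can be used to verify downloads, for example with the Python sketch below. Both the filename and the particular checksum shown are placeholders; pair them according to the file you actually fetched from this record.

    import hashlib

    # Sketch: verify a downloaded file against one of the MD5 checksums
    # listed above. Both the filename and the checksum below are
    # placeholders; match them to the file you actually downloaded.
    def md5sum(path, chunk_size=1 << 20):
        digest = hashlib.md5()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    expected = "a4a367be4fbee217bca4e19ef203cf99"  # checksum copied from the listing
    actual = md5sum("downloaded_file.tar")         # replace with your local filename
    print("OK" if actual == expected else f"checksum mismatch: {actual}")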