Preprocessed Java Code Corpus
- 1. The University of Edinburgh
- 2. Free University of Bozen-Bolzano
- 3. Google Research
Description
A preprocessed code corpus for the Java programming language.
The corpus was used for the experiments in the paper "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code".
It contains preprocessed, tokenized files for training, validation, testing, and learning the BPE encoding.
BPE-segmented versions of these files are also included for three encoding sizes (2000, 5000, and 10000 BPE merge operations), together with the learned BPE encodings.
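As a rough illustration of what a "merge operation" is, the following is a toy sketch of byte-pair encoding (BPE) learning in Python. It is not the tooling used to build this corpus; the function name `learn_bpe`, the `</w>` end-of-token marker, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def learn_bpe(tokens, num_merges):
    """Learn up to num_merges BPE merge operations from a list of tokens.

    Hypothetical helper for illustration only; the corpus files were
    produced with the authors' own pipeline, not with this function.
    """
    # Each token starts as a sequence of characters plus an end marker.
    vocab = Counter(tuple(tok) + ("</w>",) for tok in tokens)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every token is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic (an assumption of this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

Under these assumptions, an encoding size of 2000 would correspond to calling `learn_bpe(tokens, 2000)` on the tokens of the file reserved for learning the encoding; a larger number of merges yields longer subword units and a larger vocabulary.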
Corresponding versions of these files in which compound identifiers are split on camelCase and snake_case, as in (Allamanis et al., 2015), are also included, along with the corresponding subtoken maps.
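The identifier splitting mentioned above can be sketched roughly as follows. This is a minimal Python illustration, not the exact procedure used for the corpus: the function name `split_identifier`, the regular expression, the handling of digits and acronyms, and the choice to preserve case are all assumptions.

```python
import re

# Matches, in order: an acronym followed by a capitalized word ("HTTP" in
# "HTTPResponse"), an optionally capitalized lowercase word, a remaining
# run of capitals, or a run of digits. Other characters are dropped.
_SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(identifier):
    """Split an identifier on snake_case and camelCase boundaries.

    Hypothetical helper for illustration; case is preserved here, which
    may differ from the preprocessing actually applied to the corpus.
    """
    subtokens = []
    for part in identifier.split("_"):
        subtokens.extend(_SUBTOKEN.findall(part))
    return subtokens


print(split_identifier("getHTTPResponse"))  # ['get', 'HTTP', 'Response']
print(split_identifier("max_line_length"))  # ['max', 'line', 'length']
```

A subtoken map in this setting would presumably record how each original identifier relates to its subtokens so that the split can be reversed; its exact file format is not described here.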
Files
(44.3 GB)
MD5 checksum | Size |
---|---|
md5:791b09f293b05c3634e9456529a7e2d0 | 16.0 MB |
md5:f3ace37318b03f0998a07e77e9ad0bce | 26.6 MB |
md5:0530b1bf49c51fb572f01b8734dc37e8 | 23.3 MB |
md5:c840b355323300ea55d5228844a30773 | 21.5 MB |
md5:e883a9b348a256864e028d016fdc679c | 19.6 MB |
md5:8fc123022f5cbd4b94ea0aa1c31e216e | 92.0 kB |
md5:37565b2cc9f82241f5775a9c7bf59cd0 | 15.0 kB |
md5:06008892a8548f4eb32172d25e1e08f5 | 41.9 kB |
md5:f9a4cf623b390521300da0ca0aa32230 | 27.0 MB |
md5:55fe9337a8aab06a208d472541fe7402 | 32.6 MB |
md5:602e4f68cdc4f6a8e820e4d683567a85 | 37.6 MB |
md5:ef8846b450222ae41cc2911c04a99fc8 | 34.3 MB |
md5:72f869445a3d2fb5fbc936599bb63075 | 28.5 MB |
md5:1dda29f651a0f81119d5fb8ef0ed33b0 | 7.2 GB |
md5:c1e46043f0ed118742e665a9662568a5 | 8.9 GB |
md5:8cae7fd663195448aca3eaff5ab3ebdb | 10.3 GB |
md5:f9eca73d2aa6a8acf91602712810618c | 9.4 GB |
md5:8be24a41fe933cca455b1585da2c9286 | 7.7 GB |
md5:5b23ef86fbb5634d7297aa738b8a642b | 81.1 MB |
md5:aa35dc652563a03e44a69b2c40ffe337 | 100.3 MB |
md5:9136a53cc1ddde287e4beee1d7c93a76 | 117.4 MB |
md5:113cbd5ac18003fa4bc5fa79bf3b35a7 | 106.2 MB |
md5:3877f3074d57de23002d0fef35d52b42 | 87.7 MB |
md5:f244e89bac59b38411856a02767f0020 | 18.8 MB |
md5:6ddc1adee8faff46303a62d7c919d4c1 | 27.8 MB |
md5:39f658792ff91192e400e1a0a75ae4bb | 20.1 MB |
md5:b779fd896953efecd5b0ca9974f88bec | 19.7 MB |
md5:889d6129be6b5221398030ed85f7cb26 | 65.7 kB |
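Since several of the files are multiple gigabytes, it may be worth checking a download against its listed checksum. The following is a minimal Python sketch using the standard `hashlib` module; the function names `md5_of_file` and `matches_listing` and the example path are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks so that
    multi-gigabyte files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:791b09f2...'.

    The 'md5:' prefix is optional; comparison is case-insensitive.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("downloaded_file", "md5:889d6129be6b5221398030ed85f7cb26")` would return `True` only if the file at the placeholder path `downloaded_file` has that digest.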
Additional details
Related works
- Is referenced by
- Dataset: 10.5281/zenodo.3628636 (DOI)
- Dataset: 10.5281/zenodo.3628638 (DOI)
- Is source of
- Other: DOI 10.5281/zenodo.3628628 (Handle)
Funding
- EPSRC Centre for Doctoral Training in Data Science EP/L016427/1
- UK Research and Innovation