Preprocessed Java Code Corpus

doi:10.5281/zenodo.3628522

Published January 27, 2020 | Version 1.0

Dataset Open

1. The University of Edinburgh
2. Free University of Bozen-Bolzano
3. Google Research

A preprocessed code corpus for the Java programming language.
The corpus was used for the experiments in the paper Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code.
It contains preprocessed and tokenized files for training, validation, testing, and learning the BPE encoding.
BPE-segmented versions of these files are also included for three encoding sizes, i.e., 2000, 5000, and 10000 BPE merge operations, together with the learned BPE encodings.
Corresponding versions in which compound identifiers are split on camelCase and snake_case, as in (Allamanis et al., 2015), are also included, along with the corresponding subtoken maps.
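
To illustrate what splitting compound identifiers on camelCase and snake_case means, the following is a minimal Python sketch. It is not the script used to build this corpus; the function name `split_identifier` and the regular expression are assumptions made for illustration, and the exact rules used for the released files (for example, how digits or runs of capitals are handled) may differ.

```python
import re

# Hypothetical helper, not part of the corpus tooling.
# Matches, in order: a run of capitals not followed by a lowercase letter
# (an acronym such as "HTTP"), an optionally capitalized lowercase word,
# or a run of digits. Any other character is dropped by this sketch.
_SUBTOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split an identifier into subtokens on snake_case and camelCase boundaries."""
    subtokens = []
    for chunk in identifier.split("_"):
        if chunk:  # skip empty pieces caused by leading or doubled underscores
            subtokens.extend(_SUBTOKEN.findall(chunk))
    return subtokens
```

For example, under these assumed rules `split_identifier("parseHTTPResponse")` yields `["parse", "HTTP", "Response"]` and `split_identifier("max_value")` yields `["max", "value"]`.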

Files

Files (44.3 GB)

Name	Size	Download all
id_map_java_test_pre md5:791b09f293b05c3634e9456529a7e2d0	16.0 MB	Download
id_map_java_test_pre_bpe_2000 md5:f3ace37318b03f0998a07e77e9ad0bce	26.6 MB	Download
id_map_java_test_pre_bpe_5000 md5:0530b1bf49c51fb572f01b8734dc37e8	23.3 MB	Download
id_map_java_test_slp_bpe_10000 md5:c840b355323300ea55d5228844a30773	21.5 MB	Download
id_map_java_test_slp_pre_sub md5:e883a9b348a256864e028d016fdc679c	19.6 MB	Download
java_encoding_pre.enc_bpe_10000 md5:8fc123022f5cbd4b94ea0aa1c31e216e	92.0 kB	Download
java_encoding_pre.enc_bpe_2000 md5:37565b2cc9f82241f5775a9c7bf59cd0	15.0 kB	Download
java_encoding_pre.enc_bpe_5000 md5:06008892a8548f4eb32172d25e1e08f5	41.9 kB	Download
java_test_pre md5:f9a4cf623b390521300da0ca0aa32230	27.0 MB	Download
java_test_pre_enc_bpe_10000 md5:55fe9337a8aab06a208d472541fe7402	32.6 MB	Download
java_test_pre_enc_bpe_2000 md5:602e4f68cdc4f6a8e820e4d683567a85	37.6 MB	Download
java_test_pre_enc_bpe_5000 md5:ef8846b450222ae41cc2911c04a99fc8	34.3 MB	Download
java_test_pre_sub md5:72f869445a3d2fb5fbc936599bb63075	28.5 MB	Download
java_training_huge_pre md5:1dda29f651a0f81119d5fb8ef0ed33b0	7.2 GB	Download
java_training_huge_pre_enc_bpe_10000 md5:c1e46043f0ed118742e665a9662568a5	8.9 GB	Download
java_training_huge_pre_enc_bpe_2000 md5:8cae7fd663195448aca3eaff5ab3ebdb	10.3 GB	Download
java_training_huge_pre_enc_bpe_5000 md5:f9eca73d2aa6a8acf91602712810618c	9.4 GB	Download
java_training_huge_pre_sub md5:8be24a41fe933cca455b1585da2c9286	7.7 GB	Download
java_training_pre md5:5b23ef86fbb5634d7297aa738b8a642b	81.1 MB	Download
java_training_pre_enc_bpe_10000 md5:aa35dc652563a03e44a69b2c40ffe337	100.3 MB	Download
java_training_pre_enc_bpe_2000 md5:9136a53cc1ddde287e4beee1d7c93a76	117.4 MB	Download
java_training_pre_enc_bpe_5000 md5:113cbd5ac18003fa4bc5fa79bf3b35a7	106.2 MB	Download
java_training_pre_sub md5:3877f3074d57de23002d0fef35d52b42	87.7 MB	Download
java_validation_pre md5:f244e89bac59b38411856a02767f0020	18.8 MB	Download
java_validation_pre_enc_bpe_2000 md5:6ddc1adee8faff46303a62d7c919d4c1	27.8 MB	Download
java_validation_pre_sub md5:39f658792ff91192e400e1a0a75ae4bb	20.1 MB	Download
subtoken_map md5:b779fd896953efecd5b0ca9974f88bec	19.7 MB	Download
testProjects md5:889d6129be6b5221398030ed85f7cb26	65.7 kB	Download
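
Because several of the files are multiple gigabytes, it may be worth checking a download against the MD5 value listed above. The following is a minimal Python sketch using the standard `hashlib` module; the function names `md5_of_file` and `matches_listing` and the chunk size are illustrative assumptions, not part of the dataset.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_listing(path, expected_md5):
    """Compare a file's MD5 digest with a checksum copied from the file listing."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Assuming the file `testProjects` has been saved to the current directory, `matches_listing("testProjects", "889d6129be6b5221398030ed85f7cb26")` should return `True` for an intact download.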

Additional details

Is referenced by: Dataset: 10.5281/zenodo.3628636 (DOI); Dataset: 10.5281/zenodo.3628638 (DOI)
Is source of: Other: DOI 10.5281/zenodo.3628628 (Handle)

EPSRC Centre for Doctoral Training in Data Science EP/L016427/1: UK Research and Innovation

	All versions	This version
Views	836	189
Downloads	2,545	275
Data volume	2.8 TB	751.2 GB
