Published January 27, 2020 | Version 1.0
Dataset Open

Preprocessed Java Code Corpus

  • 1. The University of Edinburgh
  • 2. Free University of Bozen-Bolzano
  • 3. Google Research

Description

A preprocessed code corpus for the Java programming language.
The corpus was used for the experiments in the paper "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code".
It contains preprocessed, tokenized files for training, validation, and testing, as well as for learning the BPE encoding.
BPE-segmented versions of these files are also included for three encoding sizes, i.e., 2000, 5000, and 10000 BPE merge operations, together with the learned BPE encodings.
Analogous versions, in which compound identifiers are split on camelCase and snake_case as in (Allamanis et al., 2015), are also included, along with the corresponding subtoken maps.
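
The two segmentation schemes described above can be illustrated with a minimal Python sketch. It is not the preprocessing code used to build the corpus: the helper names `split_identifier` and `apply_bpe`, the regular expression, and the toy merge list are assumptions made for illustration, and the corpus's own learned encodings and subtoken maps may follow different conventions (for example, end-of-word markers).

```python
import re

# Hypothetical pattern: acronym runs, capitalised or lowercase words,
# remaining uppercase runs, and digit runs.
_CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(name):
    """Split a compound identifier on snake_case and camelCase boundaries."""
    parts = []
    for chunk in name.split("_"):          # snake_case boundaries first
        parts.extend(_CAMEL.findall(chunk))  # then camelCase boundaries
    return parts


def apply_bpe(token, merges):
    """Segment a token by applying BPE merge operations in learned order.

    `merges` is an ordered list of symbol pairs; each pair is merged
    wherever it occurs before the next pair is considered.
    """
    symbols = list(token)  # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


if __name__ == "__main__":
    print(split_identifier("getHTTPResponse"))  # ['get', 'HTTP', 'Response']
    print(split_identifier("max_value"))        # ['max', 'value']
    # Toy merge list, not taken from the dataset's learned encodings.
    print(apply_bpe("getter", [("g", "e"), ("ge", "t")]))  # ['get', 't', 'e', 'r']
```

With more merge operations, frequent tokens tend to remain whole while rare ones are broken into smaller units, which is why the corpus ships one segmentation per encoding size.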

Files (44.7 GB)

MD5 checksum                              Size
md5:791b09f293b05c3634e9456529a7e2d0      16.0 MB
md5:f3ace37318b03f0998a07e77e9ad0bce      26.6 MB
md5:0530b1bf49c51fb572f01b8734dc37e8      23.3 MB
md5:c840b355323300ea55d5228844a30773      21.5 MB
md5:e883a9b348a256864e028d016fdc679c      19.6 MB
md5:2ec8f60bb569a18758b9fbe7ca1f8659      318.7 MB
md5:8fc123022f5cbd4b94ea0aa1c31e216e      92.0 kB
md5:37565b2cc9f82241f5775a9c7bf59cd0      15.0 kB
md5:06008892a8548f4eb32172d25e1e08f5      41.9 kB
md5:f9a4cf623b390521300da0ca0aa32230      27.0 MB
md5:55fe9337a8aab06a208d472541fe7402      32.6 MB
md5:602e4f68cdc4f6a8e820e4d683567a85      37.6 MB
md5:ef8846b450222ae41cc2911c04a99fc8      34.3 MB
md5:72f869445a3d2fb5fbc936599bb63075      28.5 MB
md5:1dda29f651a0f81119d5fb8ef0ed33b0      7.2 GB
md5:c1e46043f0ed118742e665a9662568a5      8.9 GB
md5:8cae7fd663195448aca3eaff5ab3ebdb      10.3 GB
md5:f9eca73d2aa6a8acf91602712810618c      9.4 GB
md5:8be24a41fe933cca455b1585da2c9286      7.7 GB
md5:5b23ef86fbb5634d7297aa738b8a642b      81.1 MB
md5:aa35dc652563a03e44a69b2c40ffe337      100.3 MB
md5:9136a53cc1ddde287e4beee1d7c93a76      117.4 MB
md5:113cbd5ac18003fa4bc5fa79bf3b35a7      106.2 MB
md5:3877f3074d57de23002d0fef35d52b42      87.7 MB
md5:f244e89bac59b38411856a02767f0020      18.8 MB
md5:6ddc1adee8faff46303a62d7c919d4c1      27.8 MB
md5:39f658792ff91192e400e1a0a75ae4bb      20.1 MB
md5:b779fd896953efecd5b0ca9974f88bec      19.7 MB
md5:889d6129be6b5221398030ed85f7cb26      65.7 kB
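
The MD5 checksums listed above can be used to check that a downloaded file arrived intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listing` are hypothetical, and the file path passed to them is whatever name the file has on the local disk.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks so that
    multi-gigabyte files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed checksum such as 'md5:791b...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):  # drop the prefix used in the listing
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listing("some_downloaded_file", "md5:791b09f293b05c3634e9456529a7e2d0")` returns `True` only if the local file (a placeholder name here) has exactly that digest.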

Additional details

Related works

Is new version of
Dataset: 10.5281/zenodo.3628522 (DOI)
Is referenced by
Dataset: 10.5281/zenodo.3628636 (DOI)
Dataset: 10.5281/zenodo.3628638 (DOI)
Is source of
Other: DOI 10.5281/zenodo.3628628 (Handle)

Funding

EPSRC Centre for Doctoral Training in Data Science EP/L016427/1
UK Research and Innovation