Published February 7, 2019 | Version v1
Dataset Open

Source Code Embeddings

  • 1. Athens University of Economics and Business

Description

A set of six pretrained fastText models for semantic representations of source code. 

Each model was trained on high-quality GitHub repositories whose primary language is one of Java, Python, C++, C#, C, or PHP. To collect the training data, 13,144 repositories were cloned and 2,402,790,348 lines of code were read from their files and preprocessed, yielding a total of 944,467,560 tokens of clean training data.

For further details refer to the following paper: 

Efstathiou, V., Spinellis, D., 2019. "Semantic Source Code Models Using Identifier Embeddings". In 16th International Conference on Mining Software Repositories: Data Showcase Track. MSR'19.
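
As an illustration of how identifier embeddings such as these are typically used, the following is a minimal sketch that ranks identifiers by cosine similarity to a query vector. It is not taken from the paper. The helper names (`cosine`, `rank_by_similarity`, `load_vectors`) are hypothetical, and `load_vectors` assumes that the third-party `fasttext` Python package is installed and that the downloaded model is in fastText binary format. The vectors in the demo are toy values, not output of the models.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, candidates):
    """Return (identifier, score) pairs sorted from most to least similar to `query`."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def load_vectors(model_path, identifiers):
    """Look up vectors for `identifiers` in a fastText binary model.

    Assumption: the `fasttext` package is installed and `model_path`
    points to a downloaded model file in fastText binary format.
    """
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.load_model(model_path)
    return {name: model.get_word_vector(name).tolist() for name in identifiers}


if __name__ == "__main__":
    # Toy 2-d vectors stand in for real embeddings; they are illustrative only.
    toy = {"read": [0.9, 0.1], "write": [0.8, 0.3], "banana": [-0.2, 1.0]}
    for name, score in rank_by_similarity(toy["read"], toy):
        print(f"{name}\t{score:.3f}")
```

With a real model, the dictionary returned by `load_vectors` can be passed directly to `rank_by_similarity`. Because fastText builds vectors from character n-grams, it can also return a vector for an identifier that did not occur in the training data.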

Files

Files (13.2 GB)

Checksum                                Size
md5:0a1797b09aa8020deaea4096e2dad518    3.1 GB
md5:0331aa4fad384854552f79b9f7d382dc    2.6 GB
md5:68de3f02881ff244033ee7a4fc4a7135    1.6 GB
md5:f6701447ee02802c8dcc35a76c40d661    2.8 GB
md5:2e691933bd4b7a5114cf09a06c91c1ed    1.4 GB
md5:85127ad0f34bfeb1a17edf2cead912e8    1.6 GB
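
Since each file is several gigabytes, it may be worth checking a download against its listed MD5 checksum before loading it. The sketch below uses only Python's standard `hashlib` module. The function names are hypothetical, and the file path is supplied by the user because the listing above does not include file names.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or as bare hex."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python verify_md5.py <downloaded-file> <listed-checksum>
    path, listed = sys.argv[1], sys.argv[2]
    print("OK" if matches_checksum(path, listed) else "MISMATCH")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.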

Additional details

Funding

European Commission: CROSSMINER – Developer-Centric Knowledge Mining from Large Open-Source Software Repositories (grant 732223)