Dataset Open Access

Source Code Embeddings

Efstathiou, Vasiliki; Spinellis, Diomidis

A set of six pretrained fastText models for semantic representations of source code. 

Each model was trained on high-quality GitHub repositories whose primary language is one of Java, Python, C++, C#, C, or PHP. To collect the training data, 13,144 repositories were cloned, and 2,402,790,348 lines of code were read from 944,467,560 files and preprocessed, yielding a total of 944,467,560 tokens of clean training data.
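As a rough illustration of how one of these models might be queried, the sketch below compares two identifiers by the cosine similarity of their vectors. It assumes the .bin files are in the native fastText binary format and can be opened with the third-party `fasttext` Python package (`fasttext.load_model`, `get_word_vector`); the record itself does not state this. The function name `identifier_similarity` and the example identifiers are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Similarity is undefined for a zero vector; report 0.0 by convention.
        return 0.0
    return float(dot / (norm_u * norm_v))


def identifier_similarity(model_path, first, second):
    """Load a model and compare two identifiers (hypothetical helper).

    Assumption: model_path points to one of the downloaded .bin files and
    the file is loadable by the `fasttext` package (pip install fasttext).
    """
    import fasttext  # imported lazily so the helper above works without it

    model = fasttext.load_model(model_path)
    return cosine_similarity(
        model.get_word_vector(first),
        model.get_word_vector(second),
    )
```

With a model downloaded locally, a call such as `identifier_similarity("java-ftskip-dim100-ws5.bin", "list", "array")` would return a value between -1 and 1. Loading a multi-gigabyte model takes time and memory, so in practice the loaded model would be reused across queries rather than reloaded on each call.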

For further details refer to the following paper: 

Efstathiou, V., Spinellis, D., 2019. "Semantic Source Code Models Using Identifier Embeddings". In 16th International Conference on Mining Software Repositories: Data Showcase Track (MSR '19).

Files (13.2 GB)

Name                            Size      MD5
c-ftskip-dim100-ws5.bin         3.1 GB    0a1797b09aa8020deaea4096e2dad518
cpp-ftskip-dim100-ws5.bin       2.6 GB    0331aa4fad384854552f79b9f7d382dc
csharp-ftskip-dim100-ws5.bin    1.6 GB    68de3f02881ff244033ee7a4fc4a7135
java-ftskip-dim100-ws5.bin      2.8 GB    f6701447ee02802c8dcc35a76c40d661
php-ftskip-dim100-ws5.bin       1.4 GB    2e691933bd4b7a5114cf09a06c91c1ed
python-ftskip-dim100-ws4.bin    1.6 GB    85127ad0f34bfeb1a17edf2cead912e8
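
Because each file is several gigabytes, it may be worth checking a download against its published MD5 checksum before use. The following is a minimal sketch using only the Python standard library; the checksums are copied from the file listing, while the helper names `md5_of_file` and `verify_download` are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksums as published in the file listing of this record.
EXPECTED_MD5 = {
    "c-ftskip-dim100-ws5.bin": "0a1797b09aa8020deaea4096e2dad518",
    "cpp-ftskip-dim100-ws5.bin": "0331aa4fad384854552f79b9f7d382dc",
    "csharp-ftskip-dim100-ws5.bin": "68de3f02881ff244033ee7a4fc4a7135",
    "java-ftskip-dim100-ws5.bin": "f6701447ee02802c8dcc35a76c40d661",
    "php-ftskip-dim100-ws5.bin": "2e691933bd4b7a5114cf09a06c91c1ed",
    "python-ftskip-dim100-ws4.bin": "85127ad0f34bfeb1a17edf2cead912e8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the published checksum.

    Raises KeyError if the file name is not one of the six listed files.
    """
    expected = EXPECTED_MD5[Path(path).name]
    return md5_of_file(path) == expected
```

For example, `verify_download("php-ftskip-dim100-ws5.bin")` returns `True` only when the local file matches the listed checksum; a `False` result suggests a truncated or corrupted download.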