Code4ML: a Large-scale Dataset of annotated Machine Learning Code
Authors/Creators
- 1. NRU HSE
Description
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.
The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernel metadata (kernels_meta.csv), and competition metadata (competitions_meta.csv). Manually annotated code blocks are provided as a separate table (murkup_data.csv). Because this table stores only the numeric id of each code block's semantic type, we also provide a mapping from that id to the semantic class and subclass (vertices.csv).
Snippet records (code_blocks.csv) can be joined to kernel metadata via kernel_id. Kernel metadata is linked to Kaggle competition information through comp_name. To ensure data quality, kernels_meta.csv includes only notebooks with an available Kaggle score.
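The linkage described above can be sketched with the Python standard library. This is a minimal illustration, not the authors' tooling: it assumes that kernel_id appears as a column in both code_blocks.csv and kernels_meta.csv, and that comp_name appears in both kernels_meta.csv and competitions_meta.csv. The helper names `index_by` and `join_snippets` are hypothetical.

```python
import csv
from typing import Dict, Iterable, Iterator, TextIO


def index_by(rows: Iterable[Dict[str, str]], key: str) -> Dict[str, Dict[str, str]]:
    """Build a lookup from a key column to its row (last row wins on duplicates)."""
    return {row[key]: row for row in rows}


def join_snippets(
    code_blocks: TextIO, kernels_meta: TextIO, competitions_meta: TextIO
) -> Iterator[Dict[str, str]]:
    """Yield each snippet row enriched with kernel and competition metadata.

    Assumed column names: kernel_id and comp_name, as named in the description.
    Snippets whose kernel is absent from kernels_meta are skipped.
    """
    kernels = index_by(csv.DictReader(kernels_meta), "kernel_id")
    competitions = index_by(csv.DictReader(competitions_meta), "comp_name")
    # Stream the snippet table row by row, since it is by far the largest file.
    for block in csv.DictReader(code_blocks):
        kernel = kernels.get(block["kernel_id"])
        if kernel is None:
            continue
        competition = competitions.get(kernel["comp_name"], {})
        # Later dicts win on column-name clashes, so snippet fields take priority.
        yield {**competition, **kernel, **block}
```

To use it, open the three files with `open(path, newline="", encoding="utf-8")` and pass the handles in the order shown. If a code field exceeds the csv module's default field size limit, `csv.field_size_limit` can raise it.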
The corpus can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
Files
Total size: 991.6 MB

| MD5 checksum | Size |
|---|---|
| 522702708d3fb2e291150c2db39e72bb | 986.3 MB |
| f95c7ae7ca75769a2581d97ca8dc7fd2 | 584.0 kB |
| 7e79487b2299a9124434359e774776ae | 2.4 MB |
| b619b6448c824614d6938536d19f8fef | 2.3 MB |
| 0afd0b11950f36b3df30007f026dd212 | 407 Bytes |
| 7676fda7db7d014247226227c8dda180 | 2.5 kB |
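A downloaded file can be checked against its published MD5 checksum. The sketch below uses only `hashlib` from the standard library; the function names `md5_of_file` and `matches_checksum` are hypothetical, and the caller has to supply the checksum that belongs to the file being checked.

```python
import hashlib
from pathlib import Path
from typing import Union


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Union[str, Path], expected_md5: str) -> bool:
    """Compare a file's MD5 digest with a published checksum, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Chunked reading matters here because the largest file is close to 1 GB. MD5 is suitable for detecting a corrupted download, not for guarding against deliberate tampering.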