There is a newer version of the record available.

Published June 2, 2022 | Version 1.0.0
Dataset Open

Code4ML: a Large-scale Dataset of annotated Machine Learning Code

Description

We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose. 

The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernel metadata (kernels_meta.csv), and competition metadata (competitions_meta.csv). Manually annotated code blocks are provided in a separate table (murkup_data.csv). Because this table identifies the semantic type of each code block by a numeric id, we also provide a mapping from that id to the semantic class and subclass (vertices.csv).

Snippet information (code_blocks.csv) can be joined with kernel metadata (kernels_meta.csv) via kernel_id. Kernel metadata is in turn linked to Kaggle competition information (competitions_meta.csv) through comp_name. To ensure data quality, kernels_meta.csv includes only notebooks with an available Kaggle score.
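
The two links described above (kernel_id and comp_name) can be followed with a small streaming join. The following is a minimal sketch using only the Python standard library, not an official loader for the dataset. The function names index_by and join_snippets are hypothetical, and the only column names assumed are kernel_id and comp_name, as stated in the description; any other columns are simply passed through.

```python
import csv
from typing import Dict, Iterable, Iterator, TextIO


def index_by(rows: Iterable[dict], key: str) -> Dict[str, dict]:
    """Build a lookup table from the value of `key` to the row itself."""
    return {row[key]: row for row in rows}


def join_snippets(
    code_blocks: TextIO, kernels: TextIO, competitions: TextIO
) -> Iterator[dict]:
    """Yield one merged record per snippet whose kernel metadata is known.

    Hypothetical helper: assumes code_blocks and kernels share a `kernel_id`
    column, and kernels and competitions share a `comp_name` column.
    """
    # The two metadata tables are small, so they are held in memory.
    kernel_by_id = index_by(csv.DictReader(kernels), "kernel_id")
    comp_by_name = index_by(csv.DictReader(competitions), "comp_name")

    # The snippet table is large, so it is streamed row by row.
    for block in csv.DictReader(code_blocks):
        kernel = kernel_by_id.get(block["kernel_id"])
        if kernel is None:
            continue  # no metadata row for this kernel_id: skip the snippet
        # A missing competition row contributes no extra fields.
        competition = comp_by_name.get(kernel["comp_name"], {})
        # On a column-name clash, the snippet's value wins, then the kernel's.
        yield {**competition, **kernel, **block}
```

To use it on the real files, open code_blocks.csv, kernels_meta.csv, and competitions_meta.csv with newline="" (as the csv module recommends) and pass the three handles to join_snippets. Streaming the snippet table keeps memory use low for a file of this size.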

The corpus can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

Files

Files (991.6 MB)

Checksum                                Size
md5:522702708d3fb2e291150c2db39e72bb    986.3 MB
md5:f95c7ae7ca75769a2581d97ca8dc7fd2    584.0 kB
md5:7e79487b2299a9124434359e774776ae    2.4 MB
md5:b619b6448c824614d6938536d19f8fef    2.3 MB
md5:0afd0b11950f36b3df30007f026dd212    407 Bytes
md5:7676fda7db7d014247226227c8dda180    2.5 kB
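
The md5 values in the listing can be used to check that a download is intact. The sketch below is one possible way to do this with the Python standard library. The helper names md5_of_file and matches_listing are hypothetical, and the listing does not say which checksum belongs to which file, so that pairing is left to the caller.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:<hex digest>'.

    Hypothetical helper: accepts the value with or without the 'md5:' prefix.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks avoids loading a file close to 1 GB into memory at once. A False result suggests an incomplete or corrupted download, or a checksum paired with the wrong file.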