There is a newer version of the record available.

Published June 2, 2022 | Version 1.0.1
Dataset Open

Code4ML: a Large-scale Dataset of annotated Machine Learning Code

Description

We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose. 

The data is organized as a set of CSV tables. The central entities are raw code blocks collected from Kaggle (code_blocks.csv), kernel metadata (kernels_meta.csv), and competition metadata (competitions_meta.csv). Manually annotated code blocks are provided in a separate table (murkup_data.csv). Because this table stores only the numeric id of each code block's semantic type, we also provide a mapping from that id to the semantic class and subclass (vertices.csv).
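As a minimal sketch of how the numeric semantic-type id might be resolved to a class and subclass, the example below uses small in-memory stand-ins for vertices.csv and the annotated table. The column names `id`, `semantic_class`, `semantic_subclass`, and `vertex_id`, as well as all row values, are hypothetical placeholders; the real headers should be checked in the files themselves.

```python
import csv
import io

# Toy stand-ins for vertices.csv and the annotated table.
# All column names and values here are hypothetical, for illustration only.
VERTICES_CSV = """id,semantic_class,semantic_subclass
1,toy_class_a,toy_subclass_x
2,toy_class_b,toy_subclass_y
"""

MARKUP_CSV = """code_block,vertex_id
df = load(),1
plot(df),2
"""


def load_vertex_labels(vertices_file):
    """Map each numeric id to a (class, subclass) pair."""
    return {
        int(row["id"]): (row["semantic_class"], row["semantic_subclass"])
        for row in csv.DictReader(vertices_file)
    }


def label_annotations(markup_file, labels):
    """Yield annotated rows with the numeric id resolved to names."""
    for row in csv.DictReader(markup_file):
        sem_class, sem_subclass = labels[int(row["vertex_id"])]
        yield {**row, "semantic_class": sem_class, "semantic_subclass": sem_subclass}


labels = load_vertex_labels(io.StringIO(VERTICES_CSV))
annotated = list(label_annotations(io.StringIO(MARKUP_CSV), labels))
for row in annotated:
    print(row["code_block"], "->", row["semantic_class"], "/", row["semantic_subclass"])
```

With the real files, the `io.StringIO` objects would be replaced by opened file handles, for example `open("vertices.csv", newline="")`.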

Snippet records (code_blocks.csv) can be joined with kernel metadata via kernel_id. Kernel metadata is linked to Kaggle competition information through comp_name. To ensure data quality, kernels_meta.csv includes only notebooks with an available Kaggle score.
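The two links described above can be sketched as a pair of dictionary lookups. Only `kernel_id` and `comp_name` come from the description; the other columns (`code_block`, `kaggle_score`, `description`) and all row values are hypothetical, and the sketch assumes that code blocks without a matching row in the kernel table should simply be skipped.

```python
import csv
import io

# Toy stand-ins for the three tables. Only kernel_id and comp_name are
# documented join keys; every other column and value is hypothetical.
CODE_BLOCKS_CSV = "kernel_id,code_block\n10,import pandas as pd\n10,df.head()\n99,print(1)\n"
KERNELS_CSV = "kernel_id,comp_name,kaggle_score\n10,toy-competition,0.9\n"
COMPETITIONS_CSV = "comp_name,description\ntoy-competition,A made-up competition\n"


def index_by(rows, key):
    """Build a dict mapping row[key] to the row (the last duplicate wins)."""
    return {row[key]: row for row in rows}


def attach_metadata(code_blocks, kernels, competitions):
    """Yield each code block merged with its kernel and competition rows.

    Blocks whose kernel_id has no match in the kernel table are skipped.
    """
    kernels_by_id = index_by(kernels, "kernel_id")
    comps_by_name = index_by(competitions, "comp_name")
    for block in code_blocks:
        kernel = kernels_by_id.get(block["kernel_id"])
        if kernel is None:
            continue
        comp = comps_by_name.get(kernel["comp_name"], {})
        yield {**comp, **kernel, **block}


joined = list(
    attach_metadata(
        csv.DictReader(io.StringIO(CODE_BLOCKS_CSV)),
        csv.DictReader(io.StringIO(KERNELS_CSV)),
        csv.DictReader(io.StringIO(COMPETITIONS_CSV)),
    )
)
for row in joined:
    print(row["comp_name"], row["kernel_id"], row["code_block"])
```

Because the code-block table is by far the largest, it is streamed row by row here while the two smaller metadata tables are held in memory as lookup dictionaries.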

Automatic classifications of the code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to the code_blocks indices.
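One possible way to perform this mapping is sketched below. It assumes that `code_blocks_index` is the zero-based position of a row in code_blocks.csv, which the description does not state explicitly and should be verified against the data. The `predicted_vertex_id` and `code_block` columns and all row values are hypothetical.

```python
import csv
import io

# Toy stand-ins. code_blocks_index is the documented link column; the other
# columns and values are hypothetical.
CODE_BLOCKS_CSV = "kernel_id,code_block\n10,import pandas as pd\n10,df.head()\n11,model.fit(X)\n"
PREDS_CSV = "code_blocks_index,predicted_vertex_id\n2,7\n0,3\n"


def join_predictions(code_blocks, predictions):
    """Yield code-block rows merged with their prediction rows.

    Assumes code_blocks_index is the zero-based row position in code_blocks.
    Results come out in code-block order, not prediction order.
    """
    wanted = {int(p["code_blocks_index"]): p for p in predictions}
    for position, block in enumerate(code_blocks):
        pred = wanted.get(position)
        if pred is not None:
            yield {**block, **pred}


matched = list(
    join_predictions(
        csv.DictReader(io.StringIO(CODE_BLOCKS_CSV)),
        csv.DictReader(io.StringIO(PREDS_CSV)),
    )
)
for row in matched:
    print(row["code_blocks_index"], row["code_block"], row["predicted_vertex_id"])
```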

The corpus can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

Files

Files (1.0 GB)

Checksum Size
md5:81227d0d1fed6b74d6544b25fbce4f1f 986.6 MB
md5:7e7f8e07ee3e2937fe19e119d30a85c7 582.3 kB
md5:44eafce3cfb073e0a2c93adffa939c87 54.9 MB
md5:738958a342398cdbb82eb5473405c4c6 2.2 MB
md5:2d1d10dcfc66a49d6520c2ab70db8cfc 2.3 MB
md5:0afd0b11950f36b3df30007f026dd212 407 Bytes
md5:9a71351669f045e04fd702134f7cde96 2.4 kB