Code4ML: a Large-scale Dataset of annotated Machine Learning Code

Anastasia Drozdova; Polina Guseva; Ekaterina Trofimova; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy

doi:10.5281/zenodo.7312803

Published June 2, 2022 | Version 1.0.0

Dataset Open

1. NRU HSE

We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernel metadata (kernels_meta.csv), and competition metadata (competitions_meta.csv). Manually annotated code blocks are provided as a separate table (markup_data.csv). Because this table stores the numeric id of each code block's semantic type, we also provide a mapping from that id to the semantic class and subclass (vertices.csv).

Snippet records (code_blocks.csv) can be joined to kernel metadata via kernel_id. Kernel metadata is linked to Kaggle competition information through comp_name. To ensure data quality, kernels_meta.csv includes only notebooks with an available Kaggle score.
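The linkage described above can be sketched with pandas. This is a minimal illustration, not the authors' own tooling: only the key columns kernel_id and comp_name come from the description, while the other column names (code_block, description), the sample rows, and the helper link_tables are hypothetical stand-ins. With the real files, each frame would instead be loaded with pd.read_csv.

```python
import pandas as pd

# Tiny stand-in tables. Only kernel_id and comp_name are documented keys;
# every other column name and value here is made up for illustration.
code_blocks = pd.DataFrame({
    "kernel_id": [1, 1, 2, 3],
    "code_block": [
        "import pandas as pd",
        "df = pd.read_csv('train.csv')",
        "model.fit(X, y)",
        "print('no scored kernel for this snippet')",
    ],
})
kernels_meta = pd.DataFrame({
    "kernel_id": [1, 2],
    "comp_name": ["titanic", "house-prices"],
})
competitions_meta = pd.DataFrame({
    "comp_name": ["titanic", "house-prices"],
    "description": ["Predict survival.", "Predict sale prices."],
})


def link_tables(blocks, kernels, competitions):
    """Attach kernel and competition metadata to each code snippet."""
    # Inner join: snippets whose kernel is absent from the kernel table are dropped.
    with_kernels = blocks.merge(kernels, on="kernel_id", how="inner")
    # Left join: keep every remaining snippet even if a competition row is missing.
    return with_kernels.merge(competitions, on="comp_name", how="left")


linked = link_tables(code_blocks, kernels_meta, competitions_meta)
print(linked[["kernel_id", "comp_name", "code_block"]])
```

The inner join on kernel_id is a deliberate choice in this sketch: since kernels_meta.csv keeps only notebooks with a Kaggle score, it restricts the snippets to those from scored notebooks. A left join would keep all snippets instead.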

The corpus can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

Files

Files (991.6 MB)

Name	MD5	Size
code_blocks.csv	522702708d3fb2e291150c2db39e72bb	986.3 MB
competitions_meta.csv	f95c7ae7ca75769a2581d97ca8dc7fd2	584.0 kB
kernels_meta.csv	7e79487b2299a9124434359e774776ae	2.4 MB
markup_data.csv	b619b6448c824614d6938536d19f8fef	2.3 MB
NOTICE.txt	0afd0b11950f36b3df30007f026dd212	407 Bytes
vertices.csv	7676fda7db7d014247226227c8dda180	2.5 kB
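The MD5 checksums listed with the files can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library. The helper names md5_of_file and verify_download are hypothetical, and the commented usage line assumes the files sit in the current working directory.

```python
import hashlib

# Checksums copied from the file listing on this record.
EXPECTED_MD5 = {
    "code_blocks.csv": "522702708d3fb2e291150c2db39e72bb",
    "competitions_meta.csv": "f95c7ae7ca75769a2581d97ca8dc7fd2",
    "kernels_meta.csv": "7e79487b2299a9124434359e774776ae",
    "markup_data.csv": "b619b6448c824614d6938536d19f8fef",
    "NOTICE.txt": "0afd0b11950f36b3df30007f026dd212",
    "vertices.csv": "7676fda7db7d014247226227c8dda180",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file at path matches the expected MD5 digest."""
    return md5_of_file(path) == expected_hex.lower()


# Example usage, assuming vertices.csv was downloaded to the working directory:
# verify_download("vertices.csv", EXPECTED_MD5["vertices.csv"])
```

Reading in chunks matters mainly for code_blocks.csv, which is close to 1 GB and need not be loaded into memory at once.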

	All versions	This version
Views	3,971	444
Downloads	3,813	531
Data volume	1.8 TB	235.1 GB
