There is a newer version of the record available.

Published June 2, 2022 | Version 0.0.0
Dataset Open

Code4ML: a Large-scale Dataset of annotated Machine Learning Code

Description

We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

The data is organized as a set of tables. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored as a .csv file.

Each competition has a text description and metadata reflecting the characteristics of the competition and its dataset, as well as the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.
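As a minimal sketch of loading a competition's dataset, the snippet below wraps the Kaggle command-line client. It assumes the `kaggle` CLI is installed, that API credentials are configured, and that the competition rules have been accepted on the Kaggle site. The description does not say which column of competitions.csv holds the competition identifier, so the slug is passed in explicitly; `build_download_command` and `download_competition_data` are hypothetical helper names.

```python
import subprocess
from pathlib import Path
from typing import List


def build_download_command(competition_slug: str, target_dir: str) -> List[str]:
    """Return the argv for downloading one competition's data files."""
    if not competition_slug:
        raise ValueError("competition_slug must be non-empty")
    # `kaggle competitions download -c <slug> -p <dir>` is the CLI form used here.
    return [
        "kaggle", "competitions", "download",
        "-c", competition_slug,
        "-p", str(Path(target_dir)),
    ]


def download_competition_data(competition_slug: str, target_dir: str) -> None:
    """Run the Kaggle CLI; needs network access and valid credentials."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_download_command(competition_slug, target_dir), check=True)
```

Keeping the command construction separate from its execution makes it possible to inspect or log the command before any download starts.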

The code blocks and their metadata are split into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels published up to 2020 (сode_blocks_upto_20.csv) and those from 2021 (сode_blocks_21.csv), with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.
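Because the larger code block file is over a gigabyte, it may be convenient to stream the rows rather than load them at once. The sketch below counts code blocks and distinct parent notebooks across several open files, which is one way to compare a download against the totals quoted above. The column name `kernel_id` is a hypothetical placeholder, since the description does not list the column names; replace it with the actual header.

```python
import csv
from typing import Iterable, Set, TextIO, Tuple

# Hypothetical column name for the parent notebook id; check the real header.
NOTEBOOK_ID_COLUMN = "kernel_id"


def count_blocks_and_notebooks(
    files: Iterable[TextIO],
    notebook_column: str = NOTEBOOK_ID_COLUMN,
) -> Tuple[int, int]:
    """Stream CSV rows and return (number of blocks, number of distinct notebooks)."""
    # Code cells can be long, so raise the default per-field size limit.
    csv.field_size_limit(2**31 - 1)
    blocks = 0
    notebooks: Set[str] = set()
    for handle in files:
        for row in csv.DictReader(handle):
            blocks += 1
            notebooks.add(row[notebook_column])
    return blocks, len(notebooks)
```

When passing real files, open them with `newline=""` so that the `csv` module handles line breaks inside quoted code cells correctly.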

Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).
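The error flag and the relevance score can be combined to select the most trustworthy labeled snippets. The following is a sketch only: the column names `marks` and `errors`, the encoding of the error flag as `1`/`true`/`yes`, and the threshold of 4 are all assumptions for illustration, not taken from the dataset description.

```python
from typing import Dict, Iterable, Iterator


def select_reliable_snippets(
    rows: Iterable[Dict[str, str]],
    min_relevance: float = 4,
    relevance_column: str = "marks",   # hypothetical column name
    error_column: str = "errors",      # hypothetical column name
) -> Iterator[Dict[str, str]]:
    """Yield rows without a code-error flag whose relevance is at least min_relevance."""
    for row in rows:
        # Assumed encoding of the flag; adjust to the values found in the file.
        has_error = row[error_column].strip().lower() in {"1", "true", "yes"}
        if not has_error and float(row[relevance_column]) >= min_relevance:
            yield row
```

The rows can come from `csv.DictReader`, which yields one dictionary of strings per line of the markup file.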

Because the marked-up code block data contains the numeric id of each block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).
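The id-to-type mapping described above can be applied as a simple lookup. In this sketch the column names `graph_vertex_id`, `semantic_type`, and `subclass` are hypothetical placeholders, as the description does not give the headers of either file. Ids are kept as strings so that values read from the two .csv files compare equal without type conversion.

```python
from typing import Dict, Iterable, List, Mapping, Tuple


def build_type_lookup(
    mapping_rows: Iterable[Mapping[str, str]],
    id_column: str = "graph_vertex_id",   # hypothetical column name
    type_column: str = "semantic_type",   # hypothetical column name
    subclass_column: str = "subclass",    # hypothetical column name
) -> Dict[str, Tuple[str, str]]:
    """Map each numeric semantic type id to its (type, subclass) pair."""
    return {
        row[id_column]: (row[type_column], row[subclass_column])
        for row in mapping_rows
    }


def label_snippets(
    snippet_rows: Iterable[Mapping[str, str]],
    lookup: Mapping[str, Tuple[str, str]],
    id_column: str = "graph_vertex_id",
) -> List[Dict[str, str]]:
    """Return copies of the snippet rows with readable type and subclass added."""
    labeled = []
    for row in snippet_rows:
        # Ids missing from the mapping are marked rather than dropped.
        semantic_type, subclass = lookup.get(row[id_column], ("unknown", "unknown"))
        labeled.append({**row, "semantic_type": semantic_type, "subclass": subclass})
    return labeled
```

Marking unmatched ids as `unknown` instead of discarding them keeps the row count unchanged, which makes mismatches between the two files easy to spot.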

The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

Files

Files (1.4 GB)

Checksum Size
md5:d4711ce0f96fc732bf5f61e851e68fa7 2.5 kB
md5:b77029473c4b5155b34f38c6b036aed2 1.3 MB
md5:d7df0f9bac959b6da727b155c1ef68c3 3.3 MB
md5:1375c277f484835af7ae84085819223f 101.1 MB
md5:4e262822d76982b55f9b363395c2a616 1.3 GB
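
The md5 checksums listed above can be used to verify a downloaded file. The sketch below reads the file in chunks so that even the largest table does not have to fit in memory; `md5_of_file` and `matches_checksum` are hypothetical helper names, and the expected value may be given with or without the `md5:` prefix.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with an expected value such as 'md5:<hex>'."""
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.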