Dataset Open Access

A Code Token Type Taxonomy-enhanced dataset with pre-computed token types for Python150k

Le, Kim Tuyen; Rashidi, Gabriel; Andrzejak, Artur

Code Token Type Taxonomy (CT3) is a methodology for refined evaluation of ML-based code completion approaches.

We publish the CT3-enhanced dataset, which provides a pre-computed token type for each token in the Python150k dataset.

The dataset was produced in the empirical study for the following paper:

Kim Tuyen Le, Gabriel Rashidi, and Artur Andrzejak. A Methodology for Refined Evaluation of ML-based Code Completion Approaches. In Special Issue on Programming Language Processing, Data Mining and Knowledge Discovery.

Please read the README.txt file for detailed information on the structure of the enhanced dataset.

Files (1.1 GB)