There is a newer version of this record available.

Dataset Open Access

A Code Token Type Taxonomy-enhanced dataset with pre-computed token types for Python150k

Le, Kim Tuyen; Rashidi, Gabriel; Andrzejak, Artur

Code Token Type Taxonomy (CT3) is a methodology for refined evaluation of ML-based code completion approaches.

We published the CT3-enhanced dataset with pre-computed token types for each token in the Python150k dataset.

The dataset was produced in the empirical study reported in the following paper:

Kim Tuyen Le, Gabriel Rashidi, and Artur Andrzejak. A Methodology for Refined Evaluation of ML-based Code Completion Approaches. In KDD Workshop on Programming Language Processing (PLP), August 14-18, 2021 (Virtual).

Please see the README.txt file for details on how the enhanced dataset is structured.
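As a rough illustration of why per-token type labels are useful for a refined evaluation, the sketch below groups code completion results by token type and reports accuracy for each group separately. The record layout (token type, expected token, predicted token) and the type labels are hypothetical placeholders chosen for this example; they are not the dataset's actual format or CT3's actual categories, which are described in README.txt and the paper.

```python
from collections import defaultdict


def accuracy_by_token_type(records):
    """Return {token_type: accuracy} for (token_type, expected, predicted) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for token_type, expected, predicted in records:
        totals[token_type] += 1
        if expected == predicted:
            hits[token_type] += 1
    return {token_type: hits[token_type] / totals[token_type] for token_type in totals}


# Hypothetical records: the type labels and tokens are illustrative only.
records = [
    ("identifier", "foo", "foo"),
    ("identifier", "bar", "baz"),
    ("keyword", "return", "return"),
    ("literal", "0", "0"),
]

result = accuracy_by_token_type(records)
for token_type, accuracy in sorted(result.items()):
    print(f"{token_type}: {accuracy:.2f}")
```

Reporting one number per token type, rather than a single aggregate accuracy, makes it visible when a model does well on one kind of token and poorly on another.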

Files (1.1 GB)

