Published November 28, 2021 | Version v2
Dataset Open

A Code Token Type Taxonomy-enhanced dataset with pre-computed token types for Python150k

  • 1. Heidelberg University

Description

Code Token Type Taxonomy (CT3) is a methodology for refined evaluation of ML-based code completion approaches.

We published the CT3-enhanced dataset with pre-computed token types for each token in the Python150k dataset.

The dataset was produced in the empirical study for the following paper:

Kim Tuyen Le, Gabriel Rashidi, and Artur Andrzejak. A Methodology for Refined Evaluation of ML-based Code Completion Approaches. In Special Issue on Programming Language Processing, Data Mining and Knowledge Discovery.

Please read the README.txt file for details on how the enhanced dataset is structured.

Files

CT3-dataset-journal-20211128.zip (1.1 GB)
md5:52f69f659e875a8c4f937552bb96830a