Dataset Open Access

ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference

Kevin Jesse; Premkumar T. Devanbu

In this paper, we present ManyTypes4TypeScript, a very large corpus for training and evaluating machine-learning models for sequence-based type inference in TypeScript. The dataset includes over 9 million type annotations, across 13,953 projects and 539,571 files. The dataset is approximately 10x larger than analogous type inference datasets for Python, and is the largest available for TypeScript. We also provide API access to the dataset, which can be integrated into any tokenizer and used with any state-of-the-art sequence-based model. Finally, we provide analysis and performance results for state-of-the-art code-specific models, for baselining. ManyTypes4TypeScript is available on Huggingface and Zenodo.

This dataset was collected on January 22, 2022, and deduplicated with Allamanis's code deduplication tool.

Files (2.2 GB)

Name                          Size
ManyTypes4TypeScript.tar.gz   2.2 GB
md5:923726dcfeb40e1f27fb25856718143a
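
Because the archive is large, it can be worth checking the download against the md5 value listed above before unpacking it. The following is a minimal sketch of such a check, not part of the dataset's own tooling. The helper names `md5_of_file` and `verify_archive` are hypothetical, and the script assumes the archive was saved under its listed file name in the current directory.

```python
import hashlib
from pathlib import Path

# md5 value as listed for ManyTypes4TypeScript.tar.gz on this page.
EXPECTED_MD5 = "923726dcfeb40e1f27fb25856718143a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat for a multi-gigabyte archive.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path: the archive saved under its listed name.
    archive = Path("ManyTypes4TypeScript.tar.gz")
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually points to an interrupted or corrupted download, in which case fetching the archive again is the simplest remedy.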
                   All versions   This version
Views              334            254
Downloads          291            274
Data volume        639.7 GB       602.4 GB
Unique views       274            218
Unique downloads   195            181