ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference

Kevin Jesse; Premkumar T. Devanbu

doi:10.5281/zenodo.6387001

Published March 7, 2022 | Version v2

Dataset Open

1. University of California, Davis

In this paper, we present ManyTypes4TypeScript, a very large corpus for training and evaluating machine-learning models for sequence-based type inference in TypeScript. The dataset includes over 9 million type annotations, across 13,953 projects and 539,571 files. The dataset is approximately 10x larger than analogous type inference datasets for Python, and is the largest available for TypeScript. We also provide API access to the dataset, which can be integrated into any tokenizer and used with any state-of-the-art sequence-based model. Finally, we provide analysis and performance results for state-of-the-art code-specific models, for baselining. ManyTypes4TypeScript is available on Hugging Face and Zenodo.

This dataset was collected on January 22, 2022, and deduplicated with Allamanis's code deduplication tool.
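To make "sequence-based type inference" with an arbitrary tokenizer more concrete, here is a minimal sketch of how word-level type labels might be carried over to subword pieces. This record does not describe the dataset's on-disk layout, so the word/label representation, the helper names, and the toy tokenizer are all assumptions made for illustration, not the authors' API.

```python
from typing import Callable, List, Optional, Tuple


def toy_subword_tokenize(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a real subword tokenizer: fixed-size chunks."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def align_labels(
    words: List[str],
    labels: List[Optional[str]],
    tokenize: Callable[[str], List[str]] = toy_subword_tokenize,
) -> Tuple[List[str], List[Optional[str]]]:
    """Expand word-level labels to subword pieces.

    Assumed convention: only the first piece of a word keeps the word's type
    label; continuation pieces get None so a training loop can skip them.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    pieces: List[str] = []
    piece_labels: List[Optional[str]] = []
    for word, label in zip(words, labels):
        for k, piece in enumerate(tokenize(word)):
            pieces.append(piece)
            piece_labels.append(label if k == 0 else None)
    return pieces, piece_labels


# Illustrative input: the identifier `count` is annotated with type `number`.
words = ["let", "count", ":", "number", "=", "0"]
labels = [None, "number", None, None, None, None]
pieces, piece_labels = align_labels(words, labels)
```

Labelling only the first piece of each word is one common choice for token classification; any real tokenizer can be passed in through the `tokenize` parameter.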

Files (2.2 GB)

Name	Checksum	Size
ManyTypes4TypeScript.tar.gz	md5:923726dcfeb40e1f27fb25856718143a	2.2 GB

Additional details

U.S. National Science Foundation
SHF: Large: Collaborative Research: Exploiting the Naturalness of Software 1414172

Usage statistics

	All versions	This version
Views	1,635	1,254
Downloads	414	376
Data volume	1.2 TB	1.1 TB

Resource type

Dataset

Publisher

Zenodo

Conference

Mining Software Repositories Conference (MSR), Pittsburgh, Pennsylvania, United States, 23-24 May 2022

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: March 27, 2022
Modified: March 27, 2022
