There is a newer version of the record available.

Published March 1, 2021 | Version v0.4
Dataset Open

ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference

  • 1. Delft University of Technology

Description

  • The dataset was gathered on September 17, 2020. It contains more than 5.4K Python repositories hosted on GitHub. See the file ManyTypes4PyDataset.spec for the repository URLs and their commit SHAs.
  • The dataset is de-duplicated using the CD4Py tool. The list of duplicate files is provided in the duplicate_files.txt file.
  • All of its Python projects are processed into JSON-formatted files. These contain a seq2seq representation of each source file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file.
  • The dataset is split into train, validation, and test sets by source code file. The list of files and their corresponding sets is provided in the dataset_split.csv file.
  • Notable changes to each version of the dataset are documented in CHANGELOG.md.
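
As an illustration of how the split file described above might be consumed, the following minimal Python sketch groups source files by their assigned set. The row layout is an assumption, not taken from this record: each row is taken to be "<set name>,<file path>" with no header row, and the function name group_files_by_set and the sample paths are hypothetical. Check the actual layout of dataset_split.csv before relying on it.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def group_files_by_set(rows: Iterable[str]) -> Dict[str, List[str]]:
    """Group source file paths by the set they are assigned to.

    ASSUMPTION: each row looks like "<set name>,<file path>" and there is
    no header row. The real dataset_split.csv may be laid out differently.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in csv.reader(rows):
        if len(record) < 2:
            continue  # skip blank or malformed rows
        set_name, file_path = record[0].strip(), record[1].strip()
        groups[set_name].append(file_path)
    return dict(groups)


# Hypothetical rows standing in for the contents of dataset_split.csv.
sample_rows = [
    "train,repos/a/x.py",
    "valid,repos/b/y.py",
    "train,repos/c/z.py",
]
for set_name, files in group_files_by_set(sample_rows).items():
    print(set_name, len(files))
```

To apply this to the real file, pass an open file handle (opened with newline="") in place of sample_rows. Any iterable of text lines works, since the function only hands its input to csv.reader.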

Files

ManyTypes4PyDataset-v0.4.zip

Files (395.5 MB)

ManyTypes4PyDataset-v0.4.zip (395.5 MB)
md5:ea8f7416609812142eafee4355c93fee
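
The MD5 checksum listed above can be used to check that a downloaded copy of the archive is intact. The sketch below is one way to do that with Python's standard hashlib module. The function names md5_of_file and verify_archive are hypothetical, and the local path passed to them is whatever location the archive was saved to.

```python
import hashlib

# Checksum as listed for ManyTypes4PyDataset-v0.4.zip on this record.
EXPECTED_MD5 = "ea8f7416609812142eafee4355c93fee"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, verify_archive("ManyTypes4PyDataset-v0.4.zip") would return True for an uncorrupted download saved under that name in the current directory. Reading in fixed-size chunks keeps memory use low for a file of this size.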

Additional details

Funding

European Commission: FASTEN – Fine-Grained Analysis of Software Ecosystems as Networks (grant 825328)