There is a newer version of the record available.

Published March 12, 2021 | Version v0.5
Dataset Open

ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference

  • Delft University of Technology

Description

  • The dataset was gathered on September 17, 2020. It contains more than 5.4K Python repositories hosted on GitHub. See the file ManyTypes4PyDataset.spec for the repository URLs and their commit SHAs.
  • The dataset is de-duplicated using the CD4Py tool. The list of duplicate files is provided in the duplicate_files.txt file.
  • All of its Python projects are processed into JSON-formatted files. These contain a seq2seq representation of each source file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file.
  • The dataset is split into train, validation, and test sets by source code file. The list of files and their corresponding sets is provided in the dataset_split.csv file.
  • Name-based visible type hints for processed projects are stored in the extracted_visible_types folder.
  • Notable changes to each version of the dataset are documented in CHANGELOG.md.
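
As a rough illustration of how the split file might be consumed, the sketch below groups file paths by set. It assumes that each row of dataset_split.csv has the form `<set name>,<file path>` with no header row; that layout is an assumption, not something stated in this record, so check the file before relying on it. The function name `load_split` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_split(csv_text):
    """Group file paths by the set they belong to.

    ASSUMPTION: every row looks like `<set name>,<file path>` and there is
    no header row. Adjust the column indices if the real file differs.
    """
    splits = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        set_name, file_path = row[0].strip(), row[1].strip()
        splits[set_name].append(file_path)
    return dict(splits)
```

To use it on the real file, read the text first, for example `load_split(open("dataset_split.csv", encoding="utf-8").read())`, and then index the result by whatever set names the file actually uses.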

Files (393.8 MB)

  • md5:ef5e1c1ece891eaf08171fcf2f36d234 (393.8 MB)
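
The MD5 checksum listed above can be used to check that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; the archive's file name is not given here, so the path is left as a parameter, and the helper names `md5_of_file` and `verify_download` are hypothetical.

```python
import hashlib

# Checksum as listed in this record for the 393.8 MB file.
EXPECTED_MD5 = "ef5e1c1ece891eaf08171fcf2f36d234"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Reading in chunks keeps memory use flat regardless of archive size. A `False` result from `verify_download` suggests a truncated or corrupted download, or that the file belongs to a different version of the record.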

Additional details

Funding

European Commission
FASTEN - Fine-Grained Analysis of Software Ecosystems as Networks (grant 825328)