There is a newer version of the record available.

Published January 14, 2021 | Version v0.2
Dataset Open

ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference

  • Delft University of Technology

Description

  • The dataset was gathered on Sep. 17th, 2020. It contains more than 5.4K Python repositories hosted on GitHub. See the file ManyTypes4PyDataset.spec for the repository URLs and their commit SHAs.
  • The dataset is de-duplicated using the CD4Py tool. The list of duplicate files is provided in the duplicate_files.txt file.
  • All of its Python projects are processed into JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file.
  • The dataset is split into train, validation, and test sets by source code file. The list of files and their corresponding sets is provided in the dataset_split.csv file.
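
As a rough illustration of how the split and duplicate lists might be combined, the sketch below groups file paths by set name and skips files marked as duplicates. The layouts are assumptions, not taken from this record: dataset_split.csv is assumed to have two columns (set name, then file path) with no header row, and duplicate_files.txt is assumed to hold one file path per line. The function names are hypothetical. Inspect the actual files before relying on it.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def read_duplicates(lines: Iterable[str]) -> Set[str]:
    """Collect duplicate file paths, assuming one path per line."""
    return {line.strip() for line in lines if line.strip()}


def group_files_by_set(
    rows: Iterable[List[str]], duplicates: Set[str]
) -> Dict[str, List[str]]:
    """Group file paths by set name, skipping known duplicates.

    Assumes each row is [set_name, file_path]; shorter rows are ignored.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        if len(row) < 2:
            continue
        set_name, file_path = row[0].strip(), row[1].strip()
        if file_path not in duplicates:
            grouped[set_name].append(file_path)
    return dict(grouped)


def load_from_disk(split_path: str, duplicates_path: str) -> Dict[str, List[str]]:
    """Read both files from disk (paths are supplied by the caller)."""
    with open(duplicates_path, encoding="utf-8") as dup_file:
        duplicates = read_duplicates(dup_file)
    with open(split_path, newline="", encoding="utf-8") as split_file:
        return group_files_by_set(csv.reader(split_file), duplicates)
```

Calling `load_from_disk("dataset_split.csv", "duplicate_files.txt")` would then return a dictionary keyed by whatever set names the CSV uses.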

Files

Files (385.4 MB)

md5:13aa9d13a694fd9cb5615ac8b9dba920
385.4 MB
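
The MD5 checksum listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; the archive file name is not given on this page, so the path passed in is left to the caller, and the helper names are hypothetical.

```python
import hashlib

# Checksum as listed on this record.
EXPECTED_MD5 = "13aa9d13a694fd9cb5615ac8b9dba920"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large archive never sits fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

If `verify_download` returns `False` for the downloaded archive, the file is likely incomplete or corrupted and should be fetched again.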

Additional details

Funding

European Commission: FASTEN – Fine-Grained Analysis of Software Ecosystems as Networks (grant 825328)