Published January 29, 2021 | Version v0.3
Dataset
Open
ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference
Description
- The dataset was gathered on September 17, 2020. It contains more than 5.4K Python repositories hosted on GitHub. See the file ManyTypes4PyDataset.spec for the repository URLs and their commit SHAs.
- The dataset was de-duplicated using the CD4Py tool. The list of duplicate files is provided in the duplicate_files.txt file.
- All Python projects in the dataset are processed into JSON-formatted files. These contain a seq2seq representation of each source file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file.
- The dataset is split into train, validation, and test sets by source code file. The list of files and their corresponding sets is provided in the dataset_split.csv file.
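As a rough illustration of how the split file might be consumed, the sketch below groups file paths by set name. It assumes that each row of dataset_split.csv has two columns, the set name followed by the file path, with no header row; that layout is an assumption and should be checked against the actual file. The function name `load_split` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_split(csv_file):
    """Group file paths by set name from an open CSV file object.

    Assumes two columns per row: set name, then file path (no header).
    """
    split = defaultdict(list)
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        set_name, file_path = row[0].strip(), row[1].strip()
        split[set_name].append(file_path)
    return dict(split)


# Hypothetical sample rows standing in for dataset_split.csv.
sample = io.StringIO(
    "train,repos/a/x.py\n"
    "valid,repos/b/y.py\n"
    "test,repos/c/z.py\n"
    "train,repos/a/w.py\n"
)
splits = load_split(sample)
for name, files in splits.items():
    print(name, len(files))
```

With the real file, the same function could be called as `load_split(open("dataset_split.csv", newline=""))`, provided the column assumption holds.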
Files
(385.3 MB)
Checksum | Size |
---|---|
md5:d5f2caf6c31c7beb9ff6f024a884f8ec | 385.3 MB |