Published August 24, 2021 | Version v0.7
Dataset
Open
ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference
Description
- The dataset was gathered from GitHub on Sep. 17th, 2020.
- It has clean and complete versions (from v0.7):
  - The clean version has 5.1K type-checked Python repositories and 1.2M type annotations.
  - The complete version has 5.2K Python repositories and 3.3M type annotations.
- In the clean version, the dataset's source files are type-checked using mypy.
- The dataset is also de-duplicated using the CD4Py tool.
- Check out the README.MD file for the description of the dataset.
- Notable changes to each version of the dataset are documented in CHANGELOG.md.
- The dataset's scripts and utilities are available on its GitHub repository.
Files
(1.1 GB)
Size | Checksum |
---|---|
1.1 GB | md5:cd66425dca48cd59423c8097a47cd6f1 |
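The MD5 checksum listed above can be used to confirm that a downloaded archive is intact. The following is a minimal sketch using only Python's standard `hashlib` module. The helper names `md5_of_file` and `verify_download` and the archive path in the usage comment are hypothetical, chosen for illustration; they are not part of the dataset's own tooling.

```python
import hashlib

# MD5 checksum as listed on this page for the 1.1 GB archive.
EXPECTED_MD5 = "cd66425dca48cd59423c8097a47cd6f1"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a large archive does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


# Usage (the path is a placeholder for wherever the archive was saved):
# if not verify_download("path/to/downloaded_archive"):
#     raise SystemExit("Checksum mismatch: download again.")
```

Reading in chunks keeps memory use constant regardless of archive size. MD5 is adequate here for detecting accidental corruption during download, though it is not a defense against deliberate tampering.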
Additional details
References
- A. Mir, E. Latoskinas and G. Gousios, "ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference," in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 2021, pp. 585-589. doi: 10.1109/MSR52588.2021.00079