There is a newer version of this record available.

Dataset Open Access

ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference

Mir, Amir M.; Latoskinas, Evaldas; Gousios, Georgios

  • The dataset was gathered from GitHub on Sep. 17th, 2020.
  • It has more than 5.2K Python repositories and 4.2M type annotations.
  • The dataset is also de-duplicated using the CD4Py tool.
  • Check out the README.MD file for the description of the dataset.
  • Notable changes to each version of the dataset are documented in
  • The dataset's scripts and utilities are available on its GitHub repository.
Files (564.6 MB)
  • A. Mir, E. Latoskinas and G. Gousios, "ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference," in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 2021, pp. 585-589. doi: 10.1109/MSR52588.2021.00079

                  All versions   This version
Views             1,460          584
Downloads         579            244
Data volume       399.2 GB       137.8 GB
Unique views      1,035          430
Unique downloads  408            154


Cite as