Published April 27, 2025 | Version 1.3
Dataset Open

ProteinGym

Description

Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins to address our most pressing challenges in climate, agriculture and healthcare. Despite an increase in machine learning-based protein modeling methods, assessing their effectiveness is problematic due to the use of distinct, often contrived, experimental datasets and variable performance across different protein families. Addressing these challenges requires scale. To that end we introduce ProteinGym v1.0, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, and curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 40 high-performing models from various subfields (e.g., mutation effects, inverse folding) within a unified benchmark. We open-source the corresponding codebase, datasets, MSAs, structures, and predictions, and develop a user-friendly website that facilitates comparisons across all settings.

Files

clinical_indels.csv

Files (10.0 GB)

Checksum                                Size
md5:7ec4c434a9dfe841bdc0bf73384f9100    1.5 MB
md5:5799650aad6b67cd2e088a72be52d2e8    5.9 GB
md5:d4c6c52b04cfe4ec84eba1aef870ffd6    31.3 MB
md5:bdb2d2e63f2225521e9b41de2697f738    2.1 MB
md5:78fa03e133b073f80725d99fc166e53b    3.4 MB
md5:c08aa57780a9d0d345fe48ed364d166f    2.2 MB
md5:7f7b2d94bbb7339b3191fe975d4afd36    8.8 MB
md5:ed0ed89828ab196b55fad4d19a0787b2    34.0 MB
md5:53b3c9be428aac3724703a877c014f9c    13.4 MB
md5:a4694317e8b24bdba1bd6eee1e38f613    44.5 kB
md5:d3e6fe9f5555d4cb35a9d3872bd1882d    1.5 GB
md5:509b0d1106583e71a202b04bfb713f06    45.2 MB
md5:7054632956c1501e6020016af6a71c4c    6.7 MB
md5:ca1a4d46941ef33cc972245347118c7d    43.0 MB
md5:c434631737013fceb56efc98056151e0    208.7 kB
md5:f5b156ac00a0dbe5f6e61af5d36bc35d    19.2 MB
md5:d9039ee270eb446ef14d42cf4baf258b    269.8 MB
md5:4a8af28095059afe7c05cb7fec373c2f    2.2 MB
md5:bc8228fe86225726032950cb779dbb87    15.9 MB
md5:442881616899de02cd1c1b0a57badd37    19.2 MB
md5:d2f8a5dcb7d4ebcb7370803533a16e51    10.7 MB
md5:64d731a47759514a19032de9e446c630    112.2 MB
md5:3f011b8582e4b52e83ca2d73415a06a3    2.8 MB
md5:f6700254c1452041e3e4e452656c04ab    13.2 MB
md5:b7db8253045ea9ea041c74623651a6f6    64.8 MB
md5:7912b6a1f6c80b0b78c12a15c84d96fc    1.9 GB
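
Each file in this record is listed with an MD5 checksum, which can be used to confirm that a large download completed without corruption. The following is a minimal sketch of such a check using only the Python standard library. The helper names `md5_of_file` and `matches_listed_checksum` are hypothetical and are not part of the ProteinGym codebase; the local path and the checksum to compare against must be supplied by the user from the listing above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a local file against a listed value such as 'md5:7ec4...'.

    The optional 'md5:' prefix is dropped and the comparison is
    case-insensitive.
    """
    expected = listed.strip().split(":", 1)[-1].lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listed_checksum(local_path, listed_value)` returns `True` when the downloaded file matches the listed checksum. A `False` result most likely indicates a truncated or corrupted download, in which case the file should be fetched again.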

Additional details

Dates

Updated
2025-04-27

References

  • Notin, P., Kollasch, A.W., Ritter, D., Niekerk, L.V., Paul, S., Spinner, H., Rollins, N.J., Shaw, A., Orenbuch, R., Weitzman, R., Frazer, J., Dias, M., Franceschi, D., Gal, Y., & Marks, D.S. (2023). ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design. Neural Information Processing Systems.