Protein Language Models: Is Scaling Necessary?
Description
Public protein sequence databases contain samples from the fitness landscape explored by nature. Protein language models (pLMs) pre-trained on these sequences aim to capture this landscape for tasks such as property prediction and protein design. Following the trend in natural language processing, pLMs have been scaled up continuously. However, the premise that scale leads to better performance assumes that the source databases provide an accurate representation of the underlying fitness landscape, an assumption that is likely false. By developing an efficient codebase, designing a modern architecture, and addressing data quality concerns such as sample bias, we introduce AMPLIFY, a best-in-class pLM that is orders of magnitude less expensive to train and deploy than previous models. Furthermore, to support the scientific community and democratize the training of pLMs, we have open-sourced AMPLIFY's pre-training codebase, data, and model checkpoints.
Files
AMPLIFY_120M.zip
Total size: 114.9 GB
MD5 checksum | Size |
---|---|
62f234197d08315b115adc17469bb33b | 439.2 MB |
2ffe857e4a6c8a357b7e3394c6b73a71 | 439.2 MB |
c112c9fd162327b6d22a2204e3797bc9 | 1.3 GB |
8537d438b48da6ae1f064e292d4d3496 | 1.3 GB |
55a59fc01cf6610b45f04dd5e6777205 | 53.4 MB |
b85b0bef0b4d6bbc660a33f681dfbf29 | 357.9 MB |
c69f687d33ebf7fb60bd29cf27a91f70 | 265.9 MB |
4271432906f9dc4251419b643d002d77 | 829.0 kB |
a1b94b582262cecb9bbad919d4d3600d | 152.9 MB |
dfd27017a75b23c362ed22f3b808c0a9 | 1.2 MB |
cd11e638e292aea4483a93c1d9a19d51 | 3.2 MB |
68c238f7215541f2914d853d6b28e65a | 3.3 MB |
b74aeae04270a9e7daf4abf1b4e15a8d | 97.7 GB |
64e7a46a946d5d1bd1a4f5c83ef6b268 | 12.9 GB |
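The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch, not part of the AMPLIFY codebase: the helper names `md5_of_file` and `verify_download` and the local file path are hypothetical, and because the listing does not say which checksum belongs to which file, the pairing in the usage example is an assumption that must be checked against the record itself.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks.

    Chunked reading keeps memory use flat even for the multi-GB archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True when the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # Hypothetical local path; the file-to-checksum pairing is an assumption.
    archive = Path("downloaded_file.zip")
    expected = "62f234197d08315b115adc17469bb33b"
    if archive.exists():
        print("OK" if verify_download(archive, expected) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a guarantee against deliberate tampering.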
Additional details
Additional titles
- Subtitle (English): AMPLIFY checkpoints and datasets
Related works
- Is supplement to: Preprint, DOI 10.1101/2024.09.23.614603
Software
- Repository URL: https://github.com/chandar-lab/AMPLIFY
- Programming language: Python
- Development Status: Active