Published September 23, 2024 | Version 1.0.0
Model | Open

Protein Language Models: Is Scaling Necessary?

  • 1. Mila - Quebec Artificial Intelligence Institute
  • 2. Amgen (United States)
  • 3. Polytechnique Montréal
  • 4. Canadian Institute for Advanced Research

Description

Public protein sequence databases contain samples from the fitness landscape explored by nature. Protein language models (pLMs) pre-trained on these sequences aim to capture this landscape for tasks like property prediction and protein design. Following the trend in natural language processing, pLMs have been scaled up continuously. However, the premise that scale leads to better performance assumes that source databases provide an accurate representation of the underlying fitness landscape, which is likely false. By developing an efficient codebase, designing a modern architecture, and addressing data quality concerns such as sample bias, we introduce AMPLIFY, a best-in-class pLM that is orders of magnitude less expensive to train and deploy than previous models. Furthermore, to support the scientific community and democratize the training of pLMs, we have open-sourced AMPLIFY's pre-training codebase, data, and model checkpoints.

Files

AMPLIFY_120M.zip

Files (114.9 GB)

Checksum Size
md5:62f234197d08315b115adc17469bb33b 439.2 MB
md5:2ffe857e4a6c8a357b7e3394c6b73a71 439.2 MB
md5:c112c9fd162327b6d22a2204e3797bc9 1.3 GB
md5:8537d438b48da6ae1f064e292d4d3496 1.3 GB
md5:55a59fc01cf6610b45f04dd5e6777205 53.4 MB
md5:b85b0bef0b4d6bbc660a33f681dfbf29 357.9 MB
md5:c69f687d33ebf7fb60bd29cf27a91f70 265.9 MB
md5:4271432906f9dc4251419b643d002d77 829.0 kB
md5:a1b94b582262cecb9bbad919d4d3600d 152.9 MB
md5:dfd27017a75b23c362ed22f3b808c0a9 1.2 MB
md5:cd11e638e292aea4483a93c1d9a19d51 3.2 MB
md5:68c238f7215541f2914d853d6b28e65a 3.3 MB
md5:b74aeae04270a9e7daf4abf1b4e15a8d 97.7 GB
md5:64e7a46a946d5d1bd1a4f5c83ef6b268 12.9 GB
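
Because several of these files are very large, it can be worth checking a download against its listed md5 value before using it. The sketch below is one way to do that with Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listed_md5` are illustrative and are not part of the AMPLIFY codebase. The file is read in chunks so that a multi-gigabyte archive does not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:62f2...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call would look like `matches_listed_md5("some_download.zip", "md5:...")`, where the path is a placeholder for a downloaded file and the second argument is that file's entry copied from the listing. The listing as reproduced here does not pair file names with checksums, so the correct pairing has to be read from the record page itself.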

Additional details

Additional titles

Subtitle (English)
AMPLIFY checkpoints and datasets

Related works

Is supplement to
Preprint: 10.1101/2024.09.23.614603 (DOI)

Software

Repository URL
https://github.com/chandar-lab/AMPLIFY
Programming language
Python
Development Status
Active