Published October 30, 2025 | Version v1
Model Open

MotifAE: Unsupervised Discovery of Functional Motifs from Protein Language Model

Authors/Creators

  • 1. Columbia University

Description

representative_2.3M_seq.csv contains representative proteins from structure-based clustering of the AlphaFold structure database. The last-layer ESM2-650M embeddings of these proteins were used to train SAE and MotifAE.

SAE_step_80000.pt and MotifAE_step_80000.pt are checkpoints of the two models at 80,000 training steps. SAE was trained with a reconstruction loss and an L1 norm penalty; MotifAE was trained with an additional local similarity loss.
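The two training objectives described above can be illustrated with a minimal NumPy sketch. Several things here are assumptions rather than details stated in this record: the ReLU encoder with a linear decoder, the coefficient names `l1_coeff` and `local_coeff`, and in particular the form of the local similarity term, which is written as a hypothetical penalty on differences between latent activations of adjacent residues. The repository linked under Software holds the actual definitions.

```python
import numpy as np


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode per-residue embeddings x (L, d) into latents z (L, m) and decode back.

    Assumption: ReLU encoder and linear decoder, a common SAE layout.
    """
    z = np.maximum(x @ w_enc + b_enc, 0.0)  # non-negative latent activations
    x_hat = z @ w_dec + b_dec               # reconstruction of the embeddings
    return z, x_hat


def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    """Reconstruction loss plus an L1 penalty on the latent activations."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return float(recon + l1_coeff * sparsity)


def local_similarity_penalty(z):
    """HYPOTHETICAL stand-in for the local similarity loss.

    Penalizes squared differences between latents of neighboring residues.
    The record does not define this term; this form is only an illustration.
    """
    if z.shape[0] < 2:
        return 0.0
    return float(np.mean(np.sum((z[1:] - z[:-1]) ** 2, axis=-1)))


def motifae_style_loss(x, x_hat, z, l1_coeff=1e-3, local_coeff=1e-2):
    """SAE loss plus the hypothetical local term, weighted by local_coeff."""
    return sae_loss(x, x_hat, z, l1_coeff) + local_coeff * local_similarity_penalty(z)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d_model, d_latent = 12, 8, 32  # toy sizes, not the real dimensions
    x = rng.normal(size=(seq_len, d_model))
    w_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
    w_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
    z, x_hat = sae_forward(x, w_enc, np.zeros(d_latent), w_dec, np.zeros(d_model))
    print("SAE-style loss:", sae_loss(x, x_hat, z))
    print("MotifAE-style loss:", motifae_style_loss(x, x_hat, z))
```

Under these assumptions, the only difference between the two objectives is the extra term that couples neighboring positions in the sequence; the reconstruction and sparsity terms are shared.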

412pros_ddG_ML.csv contains deep mutational scanning data on protein folding stability, which were used to train MotifAE-G. 1404_stability_associated_features.pt contains the features selected using MotifAE-G.

Files

Files (1.7 GB)

Checksum Size
md5:5149210e9d5c7b45eebd51a4008c7f40 165.0 kB
md5:9702c3d046b0c9c8fd78666a9751f7f7 52.3 MB
md5:cabb07008f26705f07e18344607a2d51 419.6 MB
md5:2faacfd5c802d9f13c0f455582269e52 765.1 MB
md5:9e424e4f50e0e30ddd98818430d1c775 419.6 MB
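The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_listing` are illustrative, not part of the MotifAE repository. Because the listing does not pair each checksum with a file name, the sketch only checks whether a file's digest appears among the listed values.

```python
import hashlib

# MD5 digests copied from the file listing of this record.
LISTED_MD5 = {
    "5149210e9d5c7b45eebd51a4008c7f40",
    "9702c3d046b0c9c8fd78666a9751f7f7",
    "cabb07008f26705f07e18344607a2d51",
    "2faacfd5c802d9f13c0f455582269e52",
    "9e424e4f50e0e30ddd98818430d1c775",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path):
    """Return True if the file's MD5 digest is one of the listed checksums."""
    return md5_of_file(path) in LISTED_MD5
```

For example, `matches_listing("MotifAE_step_80000.pt")` should return True for an intact copy of that checkpoint, assuming it was saved under that name in the working directory. Reading in chunks matters here because the largest file in the record is several hundred megabytes.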

Additional details

Dates

Available
2025-11-04

Software

Repository URL
https://github.com/CHAOHOU-97/MotifAE
Programming language
Python