Affinity Map: Few-Shot Protein Family Classification via Prototypical Networks: Benchmarking Sequence Encoders and Episodic ESM-2 Fine-Tuning
Description
Protein family annotation is a cornerstone of computational biology, yet the acquisition of large, curated per-family corpora is laborious and often infeasible for rare families. We present Affinity Map, a meta-learning framework that frames protein family classification as a few-shot learning problem: given only K labelled examples from a previously unseen family, the model must correctly assign new sequences to that family. We systematically benchmark encoder quality under this episodic framework, ranging from a lightweight 1D-CNN trained from scratch through compositional k-mer baselines to a frozen ESM-2 protein language model and episodic LoRA fine-tuning, all evaluated under Prototypical Networks with N-way K-shot tasks sampled from the Pfam database.
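The N-way K-shot setup described above can be illustrated with a minimal sketch. It assumes sequences have already been mapped to fixed-length embedding vectors by some encoder, and it uses squared Euclidean distance to class prototypes, the common choice for Prototypical Networks; the abstract does not state which distance Affinity Map uses. The function names, family labels, and toy 2-D embeddings are hypothetical and are not taken from the repository.

```python
import random

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def squared_euclidean(a, b):
    """Squared Euclidean distance (assumed metric, not confirmed by the source)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_prototypes(support):
    """support maps family name -> list of K support embeddings."""
    return {family: mean_vector(vecs) for family, vecs in support.items()}

def classify(query, prototypes):
    """Assign the query embedding to the family with the nearest prototype."""
    return min(prototypes, key=lambda fam: squared_euclidean(query, prototypes[fam]))

def sample_episode(embeddings_by_family, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode: a support set and labelled queries."""
    families = rng.sample(sorted(embeddings_by_family), n_way)
    support, queries = {}, []
    for fam in families:
        picked = rng.sample(embeddings_by_family[fam], k_shot + n_query)
        support[fam] = picked[:k_shot]
        queries.extend((vec, fam) for vec in picked[k_shot:])
    return support, queries

# Toy data: three hypothetical families as well-separated 2-D clusters.
rng = random.Random(0)
centers = {"FAM_A": [0.0, 0.0], "FAM_B": [10.0, 0.0], "FAM_C": [0.0, 10.0]}
toy = {
    fam: [[c + rng.uniform(-1, 1) for c in center] for _ in range(8)]
    for fam, center in centers.items()
}

support, queries = sample_episode(toy, n_way=3, k_shot=5, n_query=3, rng=rng)
prototypes = build_prototypes(support)
accuracy = sum(classify(vec, prototypes) == fam for vec, fam in queries) / len(queries)
print(f"episode accuracy: {accuracy:.2f}")
```

In a real evaluation the toy clusters would be replaced by encoder outputs (CNN, k-mer, or ESM-2 embeddings), and accuracy would be averaged over many sampled episodes.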
Evaluation on 24 held-out test families shows that: (1) a CNN ProtoNet trained from scratch reaches 71.0% at K=5; (2) a 3-mer frequency ProtoNet reaches 86.2%; (3) a frozen ESM-2 encoder reaches 88.7% at K=5; and (4) episodic LoRA fine-tuning of ESM-2 exhibits a K-dependent interaction: LoRA gains +2.5 pp over frozen ESM-2 at K=1 (p < 0.001) but underperforms frozen ESM-2 at K >= 2, indicating that episodic adaptation improves single-shot retrieval at the cost of multi-shot prototype quality. All pairwise CNN vs. baseline differences are statistically significant (paired Wilcoxon, p < 0.001). Per-epoch learning curves, a named confusion matrix, PCA/UMAP embedding visualisations, and comprehensive baseline comparisons provide biologically interpretable diagnostics throughout.
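As an illustration of the compositional baseline, the sketch below turns a protein sequence into a 3-mer frequency vector over the 20 standard amino acids (20^3 = 8000 dimensions). The normalisation, the alphabet, and the handling of non-standard residues are assumptions made for this example; the abstract does not specify how Affinity Map builds its k-mer features, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)

def kmer_vocabulary(k=3):
    """All k-mers over the standard alphabet, in a fixed lexicographic order."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]

def kmer_frequencies(sequence, k=3):
    """Relative frequency of each k-mer in the sequence.

    k-mers containing characters outside AMINO_ACIDS (e.g. 'X') are ignored,
    which is an assumption of this sketch.
    """
    vocab = kmer_vocabulary(k)
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[kmer] / total for kmer in vocab]

# "ACDACD" contains the 3-mers ACD, CDA, DAC, ACD.
vec = kmer_frequencies("ACDACD")
vocab = kmer_vocabulary(3)
print(len(vec), vec[vocab.index("ACD")])
```

Vectors of this kind can be passed to a nearest-prototype classifier in place of learned embeddings, which is what makes them a useful training-free reference point for the learned encoders.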
Files
| Name | Size | Checksum |
|---|---|---|
| affinity_map_paper.pdf | 2.2 MB | md5:e62b0e1c1aad4cfd2c23110f3ba2cfae |
Additional details
Software
- Repository URL
- https://github.com/MDerazNasr/Protein-fewshot
- Programming language
- Python, NumPy, Scilab
- Development Status
- Abandoned