There is a newer version of the record available.

Published February 2, 2025 | Version v2
Dataset Open

Learning the Rules of Peptide Self-assembly through Data Mining with Large Language Models

  • 1. Massachusetts Institute of Technology
  • 2. University of Cambridge

Description

Peptides are biologically ubiquitous and important molecules that self-assemble into diverse structures. While extensive research has explored the effects of chemical composition and environmental conditions on self-assembly, a systematic study that consolidates these data to uncover global rules is lacking. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and literature mining with a large language model. The resulting database contains more than 1,000 experimental data entries with information about peptide sequence, experimental conditions, and the corresponding self-assembly phases. Using these data, we train and evaluate machine learning models that classify the assembly phase efficiently and with accuracy above 80%. Moreover, we fine-tune our GPT model for peptide literature mining on the developed dataset; the fine-tuned model extracts information from academic publications markedly better than the pre-trained model. This workflow can make the search for self-assembling peptide candidates more efficient by guiding experimental work, while also deepening our understanding of the mechanisms governing peptide self-assembly.

 

--- phase_data_clean.csv: more than 1,000 peptide self-assembly data entries collected under different experimental conditions.

--- trainset.jsonl and testset.jsonl: the data we used to fine-tune the LLM.

--- fine-tuning.ipynb: code used to fine-tune the ChatGPT model.

--- pretrain.ipynb: code used to test the pre-trained ChatGPT model.

--- train_and_inference.ipynb: code that uses the mined data to train and test an ML predictor for phase classification.
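
Before opening the notebooks, it can help to inspect the data files directly. The following is a minimal sketch, not the authors' code: it assumes that phase_data_clean.csv has a header row and UTF-8 encoding, and that each non-empty line of the .jsonl files is one JSON object. The record does not document the column names or record keys, so the sketch only reports what it finds. The helper names `read_csv_rows`, `read_jsonl`, and `summarize_keys` are hypothetical.

```python
import csv
import json
from pathlib import Path


def read_csv_rows(path):
    """Return (column_names, rows) for a CSV file that has a header row."""
    # Assumption: UTF-8 text; "utf-8-sig" also tolerates a leading BOM.
    with Path(path).open(newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


def read_jsonl(path):
    """Return one parsed JSON value per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}, line {line_number}: {err}") from err
    return records


def summarize_keys(records):
    """Count how often each top-level key appears across dict records."""
    counts = {}
    for record in records:
        if isinstance(record, dict):
            for key in record:
                counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # File names are taken from the record; files must be in the working directory.
    if Path("phase_data_clean.csv").exists():
        columns, rows = read_csv_rows("phase_data_clean.csv")
        print("phase_data_clean.csv:", len(rows), "rows, columns:", columns)
    for name in ("trainset.jsonl", "testset.jsonl"):
        if Path(name).exists():
            records = read_jsonl(name)
            print(name, len(records), "records, keys:", summarize_keys(records))
```

Printing the discovered columns and keys first avoids hard-coding field names that the record itself does not specify.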

Files

Files (3.7 MB)

md5:0316fcb938cdb7b08bebc5b920c304ac (234.3 kB)
md5:3c02db78d27eb64843d78f0246b59244 (306.8 kB)
md5:fd5003eba9aa7626684259601c4eb71a (512.6 kB)
md5:fc856e34bf9b289de05505ffd309abaf (323.8 kB)
md5:f3e0beef6b70eb436d5cbcf075ed86bd (1.4 MB)
md5:8f8369d808476a1f7e78a4754c60e5ac (875.5 kB)
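
The MD5 checksums listed above can be used to confirm that a download is intact. The sketch below is one way to do this with Python's standard library. Because the listing here does not show which checksum belongs to which file, the hypothetical helper `matches_listing` only checks whether a file's digest appears among the listed values; it does not confirm a specific name-to-checksum pairing.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing of this record.
# Assumption: the mapping from digest to file name is unknown here,
# so the digests are kept as an unordered set.
LISTED_MD5 = {
    "0316fcb938cdb7b08bebc5b920c304ac",
    "3c02db78d27eb64843d78f0246b59244",
    "fd5003eba9aa7626684259601c4eb71a",
    "fc856e34bf9b289de05505ffd309abaf",
    "f3e0beef6b70eb436d5cbcf075ed86bd",
    "8f8369d808476a1f7e78a4754c60e5ac",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path):
    """Return True if the file's MD5 digest is one of the listed checksums."""
    return md5_of_file(path) in LISTED_MD5


if __name__ == "__main__":
    # File names are taken from the record description.
    for name in (
        "phase_data_clean.csv",
        "trainset.jsonl",
        "testset.jsonl",
        "fine-tuning.ipynb",
        "pretrain.ipynb",
        "train_and_inference.ipynb",
    ):
        if Path(name).exists():
            status = "listed" if matches_listing(name) else "NOT listed"
            print(f"{name}: {md5_of_file(name)} ({status})")
```

MD5 is adequate for detecting accidental corruption during download, which is its purpose here; it should not be relied on as protection against deliberate tampering.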