Learning the Rules of Peptide Self-assembly through Data Mining with Large Language Models
Description
Peptides are biologically ubiquitous and important molecules that self-assemble into diverse structures. While extensive research has explored the effects of chemical composition and environmental conditions on self-assembly, a systematic study that consolidates these data to uncover global rules is lacking. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and literature mining with a large language model. As a result, we collect more than 1,000 experimental data entries with information about peptide sequence, experimental conditions and the corresponding self-assembly phases. Using these data, we train and evaluate machine learning models that achieve high accuracy (> 80%) and efficiency in assembly phase classification. Moreover, we fine-tune a GPT model for peptide literature mining with the developed dataset; the fine-tuned model extracts information from academic publications markedly better than the pre-trained model. This workflow can make the search for self-assembling peptide candidates more efficient by guiding experimental work, while also deepening our understanding of the mechanisms governing peptide self-assembly.
--- phase_data_clean.csv: more than 1,000 peptide self-assembly data entries collected under different experimental conditions.
--- trainset.jsonl and testset.jsonl: the data used for fine-tuning the LLM.
--- fine-tuning.ipynb: code used to fine-tune the ChatGPT model.
--- pretrain.ipynb: code used to test the pre-trained ChatGPT model.
--- train_and_inference.ipynb: code that uses the mined data to train and test an ML predictor for phase classification.
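A quick way to inspect the data files listed above before opening the notebooks is sketched below. It is only an illustration, not the authors' code: the record does not document the column names of phase_data_clean.csv or the fields of the JSONL records, so the helpers (`read_jsonl` and `summarize_csv`, both hypothetical names) make no assumption about the schema and only report what they find.

```python
import csv
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed JSON object per non-empty line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV file with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # first row is assumed to be the header
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, `summarize_csv("phase_data_clean.csv")` would return the column names and the number of data entries, and `len(read_jsonl("trainset.jsonl"))` the number of fine-tuning examples, assuming the files have been downloaded into the working directory.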
Files
(3.7 MB)
MD5 checksum | Size |
---|---|
md5:0316fcb938cdb7b08bebc5b920c304ac | 234.3 kB |
md5:3c02db78d27eb64843d78f0246b59244 | 306.8 kB |
md5:fd5003eba9aa7626684259601c4eb71a | 512.6 kB |
md5:fc856e34bf9b289de05505ffd309abaf | 323.8 kB |
md5:f3e0beef6b70eb436d5cbcf075ed86bd | 1.4 MB |
md5:8f8369d808476a1f7e78a4754c60e5ac | 875.5 kB |
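The `md5:` values listed for the files can be used to check that a download is intact. The sketch below uses Python's standard `hashlib` module; the function names are illustrative, and because this listing does not show which checksum belongs to which file name, the pairing of path and checksum is left to the caller.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:0316fc...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]  # drop the 'md5:' prefix used in the listing
    return md5_of_file(path) == expected
```

A call such as `matches_listed_checksum(downloaded_path, "md5:0316fcb938cdb7b08bebc5b920c304ac")` returns `True` only if the local file is byte-for-byte identical to the file that checksum was computed from.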