Discovery of Electron Hole-hopping Redox Mutations in Myoglobin by Deep Mutational Learning
Description
Data and scripts of the manuscript "Language Model-guided Discovery of Hole-hopping Redox Mutations in Myoglobin" by Christoph Küng, Alperen Dalkıran, Rosario Vanella, Diego A. Oyarzún and Michael A. Nash
Experimental File Descriptions
Details for reproducing results can be found in the Supplementary information of the manuscript.
PacBio lookup table generation
The scripts needed to process the raw long-read sequencing file PB4_hifi_1800.fastq can be found in C_Scripts.zip.
Generate_LUT.ipynb
: Creates the lookup table connecting barcodes to variants.
Notebook to reproduce DMS scores
The raw Illumina reads are stored in Illumina_raw.zip. Processing them with the scripts in C_Scripts.zip yields the files stored in Illumina_processed bins.zip.
ill_tag1_bins.ipynb
: Merges all outputs from the RIB script into one file, 230807_PB4_ML_2R.tsv.
Activity_Script.ipynb
: Creates the fitness scores for all variants based on their distribution among the bins.
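To illustrate how a score can be derived from a variant's distribution across sorting bins, here is a minimal sketch that computes a read-count-weighted mean bin value. This is an assumption made for illustration only: the actual calculation, normalization, and bin values are defined in Activity_Script.ipynb and the Supplementary Information. The function name weighted_bin_score and the example counts are hypothetical.

```python
def weighted_bin_score(counts, bin_values=None):
    """Return the read-weighted mean bin value for one variant.

    counts     -- read counts per sorting bin (hypothetical example data)
    bin_values -- numeric value assigned to each bin; defaults to 1..n
                  (an assumption, not the values used in the manuscript)
    """
    if bin_values is None:
        bin_values = list(range(1, len(counts) + 1))
    if len(bin_values) != len(counts):
        raise ValueError("counts and bin_values must have the same length")
    total = sum(counts)
    if total == 0:
        return None  # variant was not observed in any bin
    return sum(c * v for c, v in zip(counts, bin_values)) / total


# A variant found mostly in the highest bin scores close to that bin's value.
example_score = weighted_bin_score([0, 10, 30, 60])
print(example_score)
```

Returning None for unobserved variants keeps them distinguishable from variants with a genuinely low score.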
ML File Descriptions
Scripts for Reproducing Original Results
predict_fitness_with_mlp_embeddings.py
: Trains and evaluates a Multi-Layer Perceptron (MLP) model to predict fitness using pre-computed protein embeddings.
predict_fitness_with_cnn_onehot.py
: Trains and evaluates a Convolutional Neural Network (CNN) to predict fitness using one-hot encoded protein sequences.
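For readers unfamiliar with one-hot encoding, the sketch below shows the general idea in plain Python: each residue becomes a 20-element vector with a single 1. The alphabet order, the handling of unknown residues, and the function name one_hot_encode are assumptions for illustration; the encoding actually used is defined in predict_fitness_with_cnn_onehot.py.

```python
# 20 canonical amino acids, ordered alphabetically by one-letter code (assumed order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Encode a protein sequence as a list of 20-element 0/1 rows."""
    encoded = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:  # unknown residues stay all-zero (an assumption)
            row[idx] = 1
        encoded.append(row)
    return encoded


# Short placeholder sequence, not the full myoglobin sequence.
matrix = one_hot_encode("MGLS")
print(len(matrix), len(matrix[0]))
```

The resulting length × 20 matrix can be converted to a tensor and passed to a convolutional layer.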
Scripts for Predicting and Analyzing Fitness of New Sequences
create_embeddings.py
: Generates protein embeddings for a list of sequences from a CSV file. It supports both ESM-3 (ESM3_OPEN_SMALL) and ProtTrans (Prot_T5_XL) models.
get_predictions_for_embeddings.py
: Loads the 5 pre-trained MLP models (one for each cross-validation fold) and predicts fitness scores for new sequences using their pre-computed embeddings.
analyze_prediction_consensus.py
: Analyzes the prediction results from the 5 folds. It identifies the top-performing sequences that are common across all folds, calculates their average fitness score and standard deviation, and identifies the specific mutations relative to the wild-type sequence.
Data and Model Files
filtered_fitness_data.csv
: The fitness data used for model training and evaluation. fitness_rep1 and fitness_rep2 are replicate measurements from the wet lab experiments.
protein_embeddings_esm3.csv / protein_embeddings_prottrans.csv
: Pre-computed protein embeddings for the sequences in filtered_fitness_data.csv.
double_mutations_subset_1000.csv
: An example input file with 1,000 sequences to demonstrate the prediction and analysis pipeline.
combined_predictions.csv
: This large file contains the final prediction results for ~4.2 million double-mutant variants. Each score aggregates the predictions of 20 different models (4 seeds × 5 folds).
models/
: This directory contains the pre-trained model weights (.pt files) for each fold of both the ESM- and ProtTrans-based MLP models (e.g., ProtTrans_model_fold1.pt, ESM_model_fold1.pt).
Dependencies and Environment
Installation
It is recommended to use the specified versions of the following Python libraries. You can install them using pip:
For the main training/evaluation and analysis scripts
pip install torch==2.4.1 scikit-learn==1.5.1 pandas==2.2.2 numpy==1.26.4 verstack==4.1.4
For the embedding generation scripts
pip install transformers sentencepiece accelerate "huggingface_hub[cli]" esm
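To confirm that an environment matches the pinned versions listed for the main scripts, a small check along the following lines can help. It is a sketch using only the standard library; the helper name check_versions is hypothetical, and the embedding packages are not checked because no versions are specified for them.

```python
from importlib import metadata

# Pinned versions copied from the pip command for the main scripts.
PINNED = {
    "torch": "2.4.1",
    "scikit-learn": "1.5.1",
    "pandas": "2.2.2",
    "numpy": "1.26.4",
    "verstack": "4.1.4",
}


def check_versions(pinned):
    """Map each package to (wanted, found, matches); found is None if missing."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (wanted, found, found == wanted)
    return report


for name, (wanted, found, ok) in check_versions(PINNED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: wanted {wanted}, found {found} [{status}]")
```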
Hardware Requirements
A GPU is recommended for all operations, especially for generating embeddings and training models. All scripts can also be run on a CPU, but they will be significantly slower.
How to Use This Repository
There are two main use cases for this repository: reproducing our original performance results and generating and analyzing fitness predictions for new myoglobin variants.
1. Reproducing the Original Performance Results
To reproduce the R² values in Figure 3C, run the predict_fitness_with_mlp_embeddings.py and predict_fitness_with_cnn_onehot.py scripts. See the inline comments in the scripts for configuration.
python predict_fitness_with_mlp_embeddings.py
python predict_fitness_with_cnn_onehot.py
2. Generating and Analyzing Predictions for New Sequences
This complete workflow allows you to predict the fitness of new myoglobin sequences and identify the most promising candidates.
Step 1: Prepare Input Sequences
Create a CSV file containing a single column named sequence with your protein sequences. An example is provided in double_mutations_subset_1000.csv.
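A minimal sketch of writing such an input file with the standard library is shown below. The output file name my_sequences.csv, the helper name write_sequence_csv, and the two short sequences are hypothetical placeholders; real inputs would be full-length myoglobin variants.

```python
import csv


def write_sequence_csv(sequences, path):
    """Write sequences to a CSV file with a single 'sequence' column."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence"])  # header expected by the pipeline
        for seq in sequences:
            writer.writerow([seq])


# Placeholder sequences for illustration only.
write_sequence_csv(["MGLSDGEWQL", "MGLSDGEWQV"], "my_sequences.csv")
```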
Step 2: Generate Protein Embeddings
Use create_embeddings.py to generate embeddings for your sequences. Configure the embedding model type and the input_csv file path inside the script.
python create_embeddings.py
Step 3: Get Fitness Predictions
Use get_predictions_for_embeddings.py to predict fitness scores from the generated embeddings. Ensure the pre-trained models are in the models/ directory. Configure the embedding type inside the script.
python get_predictions_for_embeddings.py
This will create 5 prediction files (one for each model fold) in a Predictions/ directory.
Step 4: Analyze Predictions and Identify Consensus Hits
Use the analyze_prediction_consensus.py script to find the most robust candidates among the predictions generated in the previous step. This step identifies sequences that are consistently ranked highly by all models.
Configure the script: Open analyze_prediction_consensus.py and set the embedding type to match the previous steps. You can also adjust the top_n parameter to control how many of the best sequences from each fold are considered for the consensus analysis.
Run the analysis:
python analyze_prediction_consensus.py
This will produce a final summary file (e.g., Predictions/common_top_100_sequences_across_folds_with_ESM_mutations.csv) containing only the sequences found in the top N of all five folds, along with their average predicted score, standard deviation, and specific mutations.
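The consensus logic described in this step can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code in analyze_prediction_consensus.py: predictions are held in memory as one dictionary per fold rather than read from the files in Predictions/, the sample standard deviation is used, mutations are numbered from position 1, and all function names and example values are hypothetical.

```python
from statistics import mean, stdev


def find_mutations(wild_type, variant):
    """List substitutions such as 'L3A' (wild-type residue, 1-based position, new residue)."""
    if len(wild_type) != len(variant):
        raise ValueError("sequences must have equal length")
    return [
        f"{wt}{pos}{mut}"
        for pos, (wt, mut) in enumerate(zip(wild_type, variant), start=1)
        if wt != mut
    ]


def consensus_top_n(fold_predictions, top_n):
    """Return (sequence, mean score, std) for sequences in the top N of every fold.

    fold_predictions -- list of {sequence: predicted_score} dicts, one per fold
    """
    top_sets = []
    for preds in fold_predictions:
        ranked = sorted(preds, key=preds.get, reverse=True)
        top_sets.append(set(ranked[:top_n]))
    common = set.intersection(*top_sets)
    summary = []
    for seq in common:
        scores = [preds[seq] for preds in fold_predictions]
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary.append((seq, mean(scores), spread))
    # Best average score first.
    return sorted(summary, key=lambda row: row[1], reverse=True)


# Toy example with two folds and a four-residue placeholder "wild type".
wild_type = "MGLS"
folds = [
    {"MGAS": 0.9, "MGLT": 0.8, "AGLS": 0.1},
    {"MGAS": 0.7, "AGLS": 0.6, "MGLT": 0.2},
]
hits = consensus_top_n(folds, top_n=2)
for seq, avg, sd in hits:
    print(seq, round(avg, 3), round(sd, 3), find_mutations(wild_type, seq))
```

Taking the intersection of the per-fold top-N sets, rather than ranking by the average alone, filters out sequences that score highly in only some of the folds.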
Citation
If you use the data or code from this repository in your work, please cite:
[XXX]