Discovery of Electron Hole-hopping Redox Mutations in Myoglobin by Deep Mutational Learning

Küng, Christoph; Dalkıran, Alperen; Vanella, Rosario; Oyarzún, Diego; Nash, Michael

doi:10.5281/zenodo.16781055

Published August 27, 2025 | Version v1

1. University of Basel
2. Middle East Technical University
3. The University of Edinburgh

Data and scripts of the manuscript "Language Model-guided Discovery of Hole-hopping Redox Mutations in Myoglobin" by Christoph Küng, Alperen Dalkıran, Rosario Vanella, Diego A. Oyarzún and Michael A. Nash

Experimental File Descriptions

Details for reproducing results can be found in the Supplementary information of the manuscript.

PacBio lookup table generation

The scripts needed to process the raw long-read sequencing file PB4_hifi_1800.fastq are in C_Scripts.zip.

Generate_LUT.ipynb: Creates the lookup table (LUT) linking barcodes to variants.

Notebook to reproduce DMS scores

The raw Illumina reads are stored in Illumina_raw.zip. Processing them with the scripts in C_Scripts.zip produces the files stored in Illumina_processed bins.zip.

ill_tag1_bins.ipynb: Merges all outputs from the RIB script into one file, 230807_PB4_ML_2R.tsv.

Activity_Script.ipynb: Calculates the fitness scores for all variants from their distribution among the bins.
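As a rough illustration of how a score can be derived from a variant's distribution among sorting bins, the sketch below computes a read-weighted mean of per-bin values. This estimator is an assumption made for illustration only; the calculation in Activity_Script.ipynb may differ, for example in normalization or read-count filtering. The function name, bin values, and read counts are hypothetical.

```python
def weighted_bin_score(read_counts, bin_values):
    """Read-weighted mean of bin values for one variant.

    read_counts: reads observed for the variant in each bin (hypothetical input).
    bin_values:  one representative value per bin, e.g. the bin index.
    """
    if len(read_counts) != len(bin_values):
        raise ValueError("read_counts and bin_values must have the same length")
    total = sum(read_counts)
    if total == 0:
        return None  # variant was not observed in any bin
    return sum(c * v for c, v in zip(read_counts, bin_values)) / total


# Hypothetical example: four bins valued 1..4, variant found mostly in the upper bins.
score = weighted_bin_score([0, 10, 30, 60], [1, 2, 3, 4])
print(score)  # 3.5
```

A variant concentrated in the higher bins receives a higher score, while a variant with no reads returns None so that it can be filtered out downstream.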

ML File Descriptions

Scripts for Reproducing Original Results

predict_fitness_with_mlp_embeddings.py: Trains and evaluates a Multi-Layer Perceptron (MLP) model to predict fitness using pre-computed protein embeddings.

predict_fitness_with_cnn_onehot.py: Trains and evaluates a Convolutional Neural Network (CNN) to predict fitness using one-hot encoded protein sequences.
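For readers unfamiliar with one-hot encoding, the following minimal sketch shows the general idea: each residue becomes a 20-element vector with a single 1. The alphabet order and the handling of unexpected characters are assumptions; predict_fitness_with_cnn_onehot.py may use a different layout.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues (order is an assumption)
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a len(sequence) x 20 matrix (list of lists) of 0/1 values."""
    matrix = []
    for residue in sequence:
        idx = AA_INDEX.get(residue)
        if idx is None:
            raise ValueError(f"unexpected residue: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[idx] = 1
        matrix.append(row)
    return matrix


encoded = one_hot_encode("MGLS")  # short made-up fragment
print(len(encoded), len(encoded[0]))  # 4 20
```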

Scripts for Predicting and Analyzing Fitness of New Sequences

create_embeddings.py: Generates protein embeddings for a list of sequences from a CSV file. It supports both ESM-3 (ESM3_OPEN_SMALL) and ProtTrans (Prot_T5_XL) models.

get_predictions_for_embeddings.py: Loads the 5 pre-trained MLP models (one for each cross-validation fold) and predicts fitness scores for new sequences using their pre-computed embeddings.

analyze_prediction_consensus.py: Analyzes the prediction results from the 5 folds. It identifies the top-performing sequences that are common across all folds, calculates their average fitness score and standard deviation, and identifies the specific mutations relative to the wild-type sequence.
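Identifying mutations relative to the wild type can be sketched as a position-by-position comparison. The example below assumes substitutions only (equal-length sequences) and 1-based numbering from the first residue; the numbering convention in analyze_prediction_consensus.py is not stated here and may differ. The sequences are made-up fragments, not myoglobin.

```python
def list_mutations(wild_type, variant):
    """Return substitutions as strings like 'L3V' (1-based positions)."""
    if len(wild_type) != len(variant):
        raise ValueError("sequences must have equal length (substitutions only)")
    return [
        f"{wt}{pos}{mut}"
        for pos, (wt, mut) in enumerate(zip(wild_type, variant), start=1)
        if wt != mut
    ]


wild_type = "MGLSDGEW"  # hypothetical fragment
variant = "MGVSDGEF"    # differs at positions 3 and 8
print(list_mutations(wild_type, variant))  # ['L3V', 'W8F']
```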

Data and Model Files

filtered_fitness_data.csv: The fitness data used for model training and evaluation. fitness_rep1 and fitness_rep2 are replicate measurements from the wet lab experiments.

protein_embeddings_esm3.csv / protein_embeddings_prottrans.csv: Pre-computed protein embeddings for the sequences in filtered_fitness_data.csv.

double_mutations_subset_1000.csv: An example input file with 1,000 sequences to demonstrate the prediction and analysis pipeline.

combined_predictions.csv: This large file contains the final prediction results for ~4.2 million double mutant variants. The scores in this file are an aggregation of predictions from 20 (4 seeds × 5 folds) different models.

models/ directory: This directory contains the pre-trained model weights (.pt files) for each fold of both the ESM and ProtTrans-based MLP models (e.g., ProtTrans_model_fold1.pt, ESM_model_fold1.pt, etc.).

Dependencies and Environment

Installation

It is recommended to use the specified versions of the following Python libraries. You can install them using pip:

For the main training/evaluation and analysis scripts

pip install torch==2.4.1 scikit-learn==1.5.1 pandas==2.2.2 numpy==1.26.4 verstack==4.1.4

For the embedding generation scripts

pip install transformers sentencepiece accelerate "huggingface_hub[cli]" esm
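After installation, the pinned versions can be checked with a short script such as the sketch below. It uses only the standard library (importlib.metadata). The helper name check_versions is hypothetical, and local build suffixes such as "+cu121" are ignored on the assumption that they do not matter for reproducibility.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the installation command above.
PINNED = {
    "torch": "2.4.1",
    "scikit-learn": "1.5.1",
    "pandas": "2.2.2",
    "numpy": "1.26.4",
    "verstack": "4.1.4",
}


def check_versions(pinned):
    """Return {package: (expected, found)} for each mismatch or missing package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = version(name).split("+")[0]  # drop local build suffix
        except PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


for name, (expected, found) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

No output means that all five packages match the pinned versions.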

Hardware Requirements

A GPU is recommended for all operations, especially for generating embeddings and training models. All scripts can also be run on a CPU, but they will be significantly slower.
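To see which device PyTorch would use on a given machine, a check along the following lines can help. This is a generic sketch, not code taken from the repository scripts; how the scripts themselves select a device is not described here.

```python
def pick_device():
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to select
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```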

How to Use This Repository

There are two main use cases for this repository: reproducing our original performance results and generating and analyzing fitness predictions for new myoglobin variants.

1. Reproducing the Original Performance Results

To reproduce the R² values in Figure 3C, use the predict_fitness_with_mlp_embeddings.py and predict_fitness_with_cnn_onehot.py scripts. See the inline comments in the scripts for configuration.

python predict_fitness_with_mlp_embeddings.py
python predict_fitness_with_cnn_onehot.py

2. Generating and Analyzing Predictions for New Sequences

This complete workflow allows you to predict the fitness of new myoglobin sequences and identify the most promising candidates.

Step 1: Prepare Input Sequences

Create a CSV file containing a single column named sequence with your protein sequences. An example is provided in double_mutations_subset_1000.csv.
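An input file in this format can be written with the standard csv module. The sketch below is illustrative only: the helper names, the output location, and the placeholder sequence are assumptions, and the real wild-type myoglobin sequence should be substituted.

```python
import csv
import os
import tempfile


def substitute(sequence, position, residue):
    """Return sequence with the residue at a 1-based position replaced."""
    return sequence[: position - 1] + residue + sequence[position:]


def write_sequence_csv(path, sequences):
    """Write sequences to a CSV file with a single 'sequence' column."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence"])
        for seq in sequences:
            writer.writerow([seq])


wild_type = "MGLSDGEW"  # placeholder fragment, not the real myoglobin sequence
double_mutant = substitute(substitute(wild_type, 3, "V"), 8, "F")

# Written to the temp directory here; any path can be used.
output_path = os.path.join(tempfile.gettempdir(), "my_sequences.csv")
write_sequence_csv(output_path, [wild_type, double_mutant])
print(output_path)
```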

Step 2: Generate Protein Embeddings

Use create_embeddings.py to generate embeddings for your sequences. Set the embedding model type and the input_csv file path inside the script.

python create_embeddings.py

Step 3: Get Fitness Predictions

Use get_predictions_for_embeddings.py to predict fitness scores from the generated embeddings. Make sure the pre-trained models are in the models/ directory, and set the embedding type inside the script.

python get_predictions_for_embeddings.py

This will create 5 prediction files (one for each model fold) in a Predictions/ directory.

Step 4: Analyze Predictions and Identify Consensus Hits

Use analyze_prediction_consensus.py to find the most robust candidates among the predictions generated in the previous step. This step identifies sequences that are consistently ranked highly by all models.

Configure the script: Open analyze_prediction_consensus.py and set the embedding type to match the previous steps. You can also adjust the top_n parameter to control how many of the best sequences from each fold are considered for the consensus analysis.

Run the analysis:

python analyze_prediction_consensus.py

This will produce a final summary file (e.g., Predictions/common_top_100_sequences_across_folds_with_ESM_mutations.csv) containing only the sequences found in the top N of all five folds, along with their average predicted score, standard deviation, and specific mutations.
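The consensus logic described in this step can be sketched as follows: take the top N sequences of each fold, intersect those sets, and summarize the scores of the survivors. This is a simplified illustration using in-memory dictionaries and the sample standard deviation; the data layout, the function name, and the choice of standard deviation are assumptions and may not match analyze_prediction_consensus.py.

```python
from statistics import mean, stdev


def consensus_top_n(fold_predictions, top_n):
    """fold_predictions: list of {sequence: predicted_score} dicts, one per fold.

    Returns {sequence: (mean_score, std_score)} for sequences ranked in the
    top_n of every fold, ordered by mean score (highest first).
    """
    if not fold_predictions:
        return {}
    top_sets = []
    for preds in fold_predictions:
        ranked = sorted(preds, key=preds.get, reverse=True)
        top_sets.append(set(ranked[:top_n]))
    common = set.intersection(*top_sets)
    summary = {}
    for seq in common:
        scores = [preds[seq] for preds in fold_predictions]
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[seq] = (mean(scores), spread)
    return dict(sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True))


# Hypothetical scores for three sequences across three folds.
folds = [
    {"A": 0.9, "B": 0.8, "C": 0.1},
    {"A": 0.7, "B": 0.2, "C": 0.6},
    {"A": 0.8, "B": 0.5, "C": 0.4},
]
result = consensus_top_n(folds, top_n=2)
print(result)  # only "A" is in the top 2 of every fold
```

A smaller top_n makes the consensus stricter, so fewer sequences survive the intersection.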

Citation

If you use the data or code from this repository in your work, please cite:

[XXX]

Files (13.9 GB)

Name	MD5	Size
230807_PB4_ML_2R.tsv	83f09a14e7ef92e564ade94297ceb6e2	13.5 MB
Activity_Script.ipynb	fa8524f1f8f49b216c35b77dd08813af	608.6 kB
analyze_prediction_consensus.py	6171494cc9eb084c11690d37d185608b	2.7 kB
C_Scripts.zip	584c1ce528c598aa5cb425f4296aad11	11.7 kB
combined_predictions.csv	39d26897dac9a017766a6f916b25add5	2.4 GB
create_embeddings.py	9b668d2c39903ea550255439b6d88c2f	4.4 kB
double_mutations_subset_1000.csv	cef035c102dec0643dedcee0bae52c41	155.0 kB
ESM_model_fold1.pt	c6e8b7067a17f313c9c0583295aa73b9	3.8 MB
ESM_model_fold2.pt	e12132a9c4d1a1ddef173afd4212dc49	3.8 MB
ESM_model_fold3.pt	7a59783e0e0b889ea0855587649e9b65	3.8 MB
ESM_model_fold4.pt	150bfbea00114161e7cc094f1b90cb5b	3.8 MB
ESM_model_fold5.pt	68d0d30fc4700ecf85adfe2031955a5f	3.8 MB
Figure_raw_data.zip	21d40d98a0707a44c91b7dd177d1fe62	490.7 kB
filtered_fitness_data.csv	af1bf1173456b9dd2c369abbfb29af40	1.2 MB
Generate_LUT.ipynb	4846a702f4040d290c84a344edfc8b0b	17.5 kB
get_predictions_for_embeddings.py	7b9274f20fdddb969c1ac223859b72f5	3.9 kB
HIFI_barcoded_WT_Myoglobin.dna	f41aa9da1bc8c804dac1dc0b91ec07ef	12.4 MB
ill_ref.fa	eb90cd3f9ab672591d421a95f78d5073	128 Bytes
ill_tag1_bins.ipynb	7fa6afff4d6c39ed51ba274a92cd5430	25.8 kB
Illumina_processed bins.zip	09f8fea2c561a407ab5172433bbfc814	5.0 MB
Illumina_raw.zip	8a8779a0adfd97f92b95e7a268ccb84e	4.8 GB
LUT.tsv	e89f56e79f362a430e2d97b511e2de99	3.5 MB
NNK_Primers_Myoglobin.xlsx	27325fee47733d3f3bc436d7530e00c7	13.0 kB
PB4_hifi_1800.fastq	991b4ef83d846acb7c57c0a4b588bdf5	6.4 GB
predict_fitness_with_cnn_onehot.py	4e3e6708f49343a243398e80946d72f8	7.4 kB
predict_fitness_with_mlp_embeddings.py	9ca7b45a3d7813008bef92f3fcbe18c8	8.3 kB
protein_embeddings_double_mutations_subset_1000_ESM.csv	d7a4ca8bd96f1085a74e05a3f6e0e239	15.7 MB
protein_embeddings_double_mutations_subset_1000_ProtTrans.csv	ad9de8ac293c0124cd4125f9632cfd5e	12.7 MB
protein_embeddings_esm3.csv	0504a71262a1f442392d8e82e4987185	95.8 MB
protein_embeddings_prottrans.csv	8e6ccea5e90f67a14cd3c2754f6d73f8	77.6 MB
ProtTrans_model_fold1.pt	c9f547124040cc8f5b3d79ba632340e8	2.8 MB
ProtTrans_model_fold2.pt	75f98fe86b1f23af4c788b843f32e10c	2.8 MB
ProtTrans_model_fold3.pt	d7062d35bccfc9ef84a071faffba28e5	2.8 MB
ProtTrans_model_fold4.pt	5c68e03162ab223aa917432463458e54	2.8 MB
ProtTrans_model_fold5.pt	1bd7fdf178f6bc30d8294a18e2b09142	2.8 MB
pUC19_Myoglobin.dna	228ed6dbed81b716f54bab39c3e02bab	7.9 MB
README.md	139ec0be11140bc1fd10e0dc79f45d6e	5.4 kB
ref1.fasta	f060f3573e4736292ca81d6478c0016f	1.6 kB
Stability_Scores.xlsx	69d2a50f150fa836eb8e8ae53fe72025	80.1 kB

