Published August 27, 2025 | Version v1
Publication Open

Discovery of Electron Hole-hopping Redox Mutations in Myoglobin by Deep Mutational Learning

  • 1. University of Basel
  • 2. Middle East Technical University
  • 3. The University of Edinburgh

Description

Data and scripts for the manuscript "Language Model-guided Discovery of Hole-hopping Redox Mutations in Myoglobin" by Christoph Küng, Alperen Dalkıran, Rosario Vanella, Diego A. Oyarzún and Michael A. Nash.

 

Experimental File Descriptions

Details for reproducing the results can be found in the Supplementary Information of the manuscript.

PacBio lookup table generation

The scripts needed to process the raw long-read sequencing file PB4_hifi_1800.fastq can be found in C_Scripts.zip.

Generate_LUT.ipynb: Creates the lookup table connecting barcodes to variants.
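As a rough illustration of what a barcode-to-variant lookup table involves, the sketch below assigns each barcode the variant seen most often among its reads. It is not taken from Generate_LUT.ipynb: the function name build_lookup_table, the min_reads threshold, the majority-vote rule, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_lookup_table(reads, min_reads=2):
    """Map each barcode to its most frequently observed variant.

    reads: iterable of (barcode, variant) pairs, one pair per long read.
    min_reads: hypothetical threshold; barcodes with fewer reads are dropped.
    """
    counts = defaultdict(Counter)
    for barcode, variant in reads:
        counts[barcode][variant] += 1

    lut = {}
    for barcode, variant_counts in counts.items():
        if sum(variant_counts.values()) < min_reads:
            continue  # too few reads to trust this barcode
        variant, _ = variant_counts.most_common(1)[0]
        lut[barcode] = variant
    return lut


# Toy data (made up): one barcode has a conflicting read, one has a single read.
toy_reads = [
    ("AAAC", "L29F"), ("AAAC", "L29F"), ("AAAC", "H64Y"),
    ("GGTA", "WT"), ("GGTA", "WT"),
    ("TTTT", "V68W"),
]
lut = build_lookup_table(toy_reads)
print(lut)
```

A majority vote is only one way to resolve conflicting reads; the notebook may apply different filters.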

Notebook to reproduce DMS scores

The raw Illumina reads are stored in Illumina_raw.zip. Processing them with the scripts in C_Scripts.zip produces the files stored in Illumina_processed bins.zip.

ill_tag1_bins.ipynb: Merges all outputs from the RIB script into one file, 230807_PB4_ML_2R.tsv.

Activity_Script.ipynb: Computes the fitness scores for all variants from their distribution among the bins.
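To make the idea of scoring a variant from its distribution among bins concrete, here is a minimal sketch. It assumes a common sort-seq style calculation: read counts are normalized by the sequencing depth of each bin, and the score is the frequency-weighted mean of the bin values. The actual formula is the one in Activity_Script.ipynb and the Supplementary Information; the function name, the bin values 1 to 4, and the counts are hypothetical.

```python
def bin_weighted_score(variant_counts, bin_totals, bin_values):
    """Frequency-weighted mean bin value for one variant.

    variant_counts: reads of this variant in each bin.
    bin_totals: total reads sequenced in each bin (depth normalization).
    bin_values: numeric value assigned to each bin (hypothetical, e.g. 1..4).
    """
    # Convert raw counts to within-bin frequencies so that bins sequenced
    # more deeply do not dominate the score.
    freqs = [count / total for count, total in zip(variant_counts, bin_totals)]
    freq_sum = sum(freqs)
    if freq_sum == 0:
        return None  # variant not observed in any bin
    return sum(f * v for f, v in zip(freqs, bin_values)) / freq_sum


# Made-up counts for one variant across four bins of equal depth.
score = bin_weighted_score([10, 30, 40, 20], [1000, 1000, 1000, 1000], [1, 2, 3, 4])
print(score)
```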

ML File Descriptions

Scripts for Reproducing Original Results

predict_fitness_with_mlp_embeddings.py: Trains and evaluates a Multi-Layer Perceptron (MLP) model to predict fitness using pre-computed protein embeddings.

predict_fitness_with_cnn_onehot.py: Trains and evaluates a Convolutional Neural Network (CNN) to predict fitness using one-hot encoded protein sequences.
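For readers unfamiliar with one-hot encoding, the sketch below shows the general idea using only the standard library. The alphabet ordering, the matrix orientation (positions × residues), and the function name are assumptions; the script may encode sequences differently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues, assumed ordering
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a len(sequence) x 20 matrix (list of lists) of 0/1 values."""
    matrix = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1  # raises KeyError for non-canonical letters
        matrix.append(row)
    return matrix


encoded = one_hot_encode("MGLSD")  # toy fragment
print(len(encoded), len(encoded[0]))
```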

Scripts for Predicting and Analyzing Fitness of New Sequences

create_embeddings.py: Generates protein embeddings for a list of sequences from a CSV file. It supports both ESM-3 (ESM3_OPEN_SMALL) and ProtTrans (Prot_T5_XL) models.

get_predictions_for_embeddings.py: Loads the 5 pre-trained MLP models (one for each cross-validation fold) and predicts fitness scores for new sequences using their pre-computed embeddings.

analyze_prediction_consensus.py: Analyzes the prediction results from the 5 folds. It identifies the top-performing sequences that are common across all folds, calculates their average fitness score and standard deviation, and identifies the specific mutations relative to the wild-type sequence.
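Identifying mutations relative to the wild type amounts to a position-by-position comparison. The sketch below assumes substitution-only variants of equal length and 1-based numbering; the function name and the toy sequences are illustrative, and the numbering convention used by the script is not confirmed here.

```python
def list_mutations(wild_type, variant):
    """Return substitutions as strings such as 'L3F' (1-based positions)."""
    if len(wild_type) != len(variant):
        raise ValueError("sequences must have equal length (substitutions only)")
    return [
        f"{wt}{pos}{mut}"
        for pos, (wt, mut) in enumerate(zip(wild_type, variant), start=1)
        if wt != mut
    ]


wild_type = "MGLSDGEWQL"  # toy reference, not the construct used in the study
variant = "MGFSDGEWQW"    # toy double mutant
mutations = list_mutations(wild_type, variant)
print(mutations)
```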

Data and Model Files

filtered_fitness_data.csv: The fitness data used for model training and evaluation. fitness_rep1 and fitness_rep2 are replicate measurements from the wet lab experiments.

protein_embeddings_esm3.csv / protein_embeddings_prottrans.csv: Pre-computed protein embeddings for the sequences in filtered_fitness_data.csv.

double_mutations_subset_1000.csv: An example input file with 1,000 sequences to demonstrate the prediction and analysis pipeline.

combined_predictions.csv: This large file contains the final prediction results for ~4.2 million double mutant variants. The scores in this file are an aggregation of predictions from 20 (4 seeds × 5 folds) different models.
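The description above does not state how the 20 predictions are aggregated. One plausible scheme, shown here purely as an assumption, is to take the mean and standard deviation of the per-model scores for each sequence. The dictionary layout keyed by (seed, fold) and the scores are made up.

```python
from statistics import mean, stdev


def aggregate_predictions(per_model_scores):
    """Combine per-model predictions for each sequence.

    per_model_scores: dict mapping a model id, here (seed, fold), to a list of
    predicted scores, one per sequence, all in the same sequence order.
    Returns a list of (mean, standard deviation) tuples, one per sequence.
    """
    columns = zip(*per_model_scores.values())  # regroup scores by sequence
    return [(mean(col), stdev(col)) for col in columns]


# Made-up scores for 3 sequences from 4 seeds x 5 folds = 20 models.
toy = {
    (seed, fold): [0.1 * seed + 0.01 * fold, 1.0, 2.0 + 0.1 * fold]
    for seed in range(4)
    for fold in range(5)
}
summary = aggregate_predictions(toy)
print(len(toy), len(summary))
```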

models/ directory: This directory contains the pre-trained model weights (.pt files) for each fold of both the ESM-based and ProtTrans-based MLP models (e.g., ProtTrans_model_fold1.pt, ESM_model_fold1.pt, etc.).

Dependencies and Environment

Installation

It is recommended to use the specified versions of the following Python libraries. You can install them using pip:

For the main training/evaluation and analysis scripts

pip install torch==2.4.1 scikit-learn==1.5.1 pandas==2.2.2 numpy==1.26.4 verstack==4.1.4

For the embedding generation scripts

pip install transformers sentencepiece accelerate "huggingface_hub[cli]" esm

Hardware Requirements

A GPU is recommended for all operations, especially for generating embeddings and training models. All scripts also run on a CPU, but significantly more slowly.

How to Use This Repository

There are two main use cases for this repository: reproducing our original performance results and generating and analyzing fitness predictions for new myoglobin variants.

1. Reproducing the Original Performance Results

To reproduce the R² values in Figure 3C, use the predict_fitness_with_mlp_embeddings.py and predict_fitness_with_cnn_onehot.py scripts. See the inline comments in the scripts for configuration.

python predict_fitness_with_mlp_embeddings.py
python predict_fitness_with_cnn_onehot.py
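For reference, R² is sketched below as the coefficient of determination, 1 − SS_res / SS_tot, which is the quantity that scikit-learn's r2_score computes. Whether the scripts use that function or another definition (for example the squared Pearson correlation) should be checked in the scripts themselves; the measured and predicted values here are made up.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


measured = [0.2, 0.5, 0.9, 1.4]   # toy fitness values
predicted = [0.3, 0.4, 1.0, 1.3]  # toy model outputs
r2 = r_squared(measured, predicted)
print(round(r2, 3))
```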

2. Generating and Analyzing Predictions for New Sequences

This complete workflow allows you to predict the fitness of new myoglobin sequences and identify the most promising candidates.

Step 1: Prepare Input Sequences

Create a CSV file containing a single column named sequence with your protein sequences. An example is provided in double_mutations_subset_1000.csv.
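An input file in this layout can be written with Python's csv module. In the sketch below the file name my_sequences.csv, the temporary directory, and the two toy sequences are placeholders; only the single sequence column header comes from the description above.

```python
import csv
import os
import tempfile


def write_sequence_csv(sequences, path):
    """Write sequences to a CSV file with a single 'sequence' column."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence"])  # header name required by Step 1
        for seq in sequences:
            writer.writerow([seq])


def read_sequence_csv(path):
    """Read the 'sequence' column back as a list of strings."""
    with open(path, newline="") as handle:
        return [row["sequence"] for row in csv.DictReader(handle)]


# Hypothetical output location and toy sequences.
path = os.path.join(tempfile.gettempdir(), "my_sequences.csv")
write_sequence_csv(["MGLSDGEWQL", "MGFSDGEWQL"], path)
loaded = read_sequence_csv(path)
print(loaded)
```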

Step 2: Generate Protein Embeddings

Use create_embeddings.py to generate embeddings for your sequences. Set the embedding model type and the input_csv file path inside the script.

python create_embeddings.py

Step 3: Get Fitness Predictions

Use get_predictions_for_embeddings.py to predict fitness scores from the generated embeddings. Ensure that the pre-trained models are in the models/ directory, and set the embedding type inside the script.

python get_predictions_for_embeddings.py

This will create 5 prediction files (one for each model fold) in a Predictions/ directory.

Step 4: Analyze Predictions and Identify Consensus Hits

Use the analyze_prediction_consensus.py script to find the most robust candidates among the predictions generated in the previous step. This step identifies sequences that are consistently ranked highly by all models.

Configure the script: Open analyze_prediction_consensus.py and set the embedding type to match the previous steps. You can also adjust the top_n parameter to control how many of the best sequences from each fold are considered for the consensus analysis.

Run the analysis:

python analyze_prediction_consensus.py

This will produce a final summary file (e.g., Predictions/common_top_100_sequences_across_folds_with_ESM_mutations.csv) containing only the sequences found in the top N of all five folds, along with their average predicted score, standard deviation, and specific mutations.
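The consensus logic described in this step can be sketched as a set intersection over each fold's top-N list, followed by per-sequence statistics. This is an illustration under assumptions, not the code in analyze_prediction_consensus.py: predictions are represented as plain dictionaries, the toy example uses three folds instead of five, and the sequence labels and scores are made up.

```python
from statistics import mean, stdev


def consensus_top_n(fold_predictions, top_n):
    """Find sequences ranked in the top N of every fold.

    fold_predictions: list of dicts (one per fold) mapping sequence -> score.
    Returns {sequence: (mean score, standard deviation)} for the consensus set.
    """
    top_sets = []
    for scores in fold_predictions:
        ranked = sorted(scores, key=scores.get, reverse=True)
        top_sets.append(set(ranked[:top_n]))
    common = set.intersection(*top_sets)
    return {
        seq: (
            mean(fold[seq] for fold in fold_predictions),
            stdev(fold[seq] for fold in fold_predictions),
        )
        for seq in common
    }


# Toy predictions from three folds for three labelled sequences.
folds = [
    {"A": 0.9, "B": 0.8, "C": 0.1},
    {"A": 0.7, "B": 0.2, "C": 0.6},
    {"A": 0.8, "B": 0.9, "C": 0.3},
]
hits = consensus_top_n(folds, top_n=2)
print(sorted(hits))
```

Raising top_n enlarges each fold's candidate list and therefore tends to enlarge the consensus set, at the cost of admitting lower-ranked sequences.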

Citation

If you use the data or code from this repository in your work, please cite:

[XXX]

Files (13.9 GB)