Published September 24, 2025 | Version v1

Code and data for "Optimization of regulatory DNA with active learning"

  • 1. University of Edinburgh
  • 2. The University of Edinburgh

Contributors

Contact person:

  • 1. The University of Edinburgh

Description

Code and data for the paper “Optimization of regulatory DNA with active learning” by Shen, Kudla and Oyarzún.

data.zip - all NK landscapes in CSV format.
 
code.zip - Python code for reproducing the results of the paper.
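
To check what an archive contains before running the notebooks, a minimal sketch like the following may help. It only assumes that the archive is a regular zip file containing CSV members; the function name `preview_csvs` is hypothetical and not part of the released code, and no column names are assumed.

```python
import csv
import io
import zipfile


def preview_csvs(zip_path, n_rows=3):
    """Return {member name: first n_rows rows} for every CSV file in the archive."""
    previews = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip folders and non-CSV members
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text)
                # zip() stops after n_rows, so large files are not read in full.
                previews[name] = [row for _, row in zip(range(n_rows), reader)]
    return previews


# Example call (requires the archive to be downloaded first):
# preview_csvs("data.zip")
```

The file is streamed row by row, so previewing a large landscape does not load it fully into memory.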

 

1. Code overview.
code.zip contains two subfolders, one for the NK landscape and one for the promoter landscape, plus one environment file.
- `AL.yml`: the environment file for all the code

 

AL on NK landscape
2. NK genotype-phenotype landscapes (Figure 1)
- `nk_landscape.ipynb`: Generates the NK0-NK3 landscapes and saves them as CSV files that serve as ground-truth landscapes. The NK model is adapted from the NK simulation in [1], available at https://github.com/acmater/NK_Benchmarking/blob/master/utils/nk_utils/NK_landscape.py.
- `nk_local_landscape.ipynb`: Generates the NK1-NK3 local landscapes.
- `nk_tsne.ipynb`: Plots the 2D t-SNE embedding of the genotype space and labels the sequences according to their phenotype (Figure 1C).
- `nk_mlp.ipynb`: Trains MLP models on the four NK landscapes (Figure 1D).
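
For readers unfamiliar with the NK model, the following is a minimal, standard-library sketch of how such a landscape can be generated. It is not the code from `nk_landscape.ipynb` or from [1]: the function name, the DNA alphabet, the random choice of interacting sites and the uniform contribution values are all assumptions made for illustration.

```python
import itertools
import random


def make_nk_landscape(n_sites, k, alphabet="ACGT", seed=0):
    """Return {sequence: fitness} for every sequence of length n_sites.

    Each site contributes a random value that depends on its own state and
    on the states of k other (randomly chosen) sites; fitness is the mean
    contribution over sites. All settings here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Site i interacts with itself plus k other sites.
    neighbours = [
        [i] + rng.sample([j for j in range(n_sites) if j != i], k)
        for i in range(n_sites)
    ]
    # One lookup table per site, filled lazily with uniform random values.
    tables = [dict() for _ in range(n_sites)]

    def fitness(seq):
        total = 0.0
        for i, nbrs in enumerate(neighbours):
            state = tuple(seq[j] for j in nbrs)
            if state not in tables[i]:
                tables[i][state] = rng.random()
            total += tables[i][state]
        return total / n_sites

    return {
        "".join(seq): fitness(seq)
        for seq in itertools.product(alphabet, repeat=n_sites)
    }


# k = 0 gives an additive (smooth) landscape; larger k adds epistasis.
landscape = make_nk_landscape(n_sites=4, k=1)
```

The sketch enumerates all sequences, so it is only practical for short sequences; the number of genotypes grows as the alphabet size raised to the power `n_sites`.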

 

3. AL on NK genotype-phenotype landscapes (Figure 2)
- `AL_NK_pipeline.ipynb`: The active learning pipeline on the NK landscapes. Conditions such as AL with random sampling or ALDE can be set in the pipeline.
- `NK_benchmarking_ho.ipynb`: One-shot model performance on the NK landscapes with hyperparameter optimization, for comparison with AL performance. Three optimization methods for one-shot modelling are implemented: random screening (RS), strong-selection weak-mutation (SSWM) and gradient descent (GD).
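
The general shape of a batch active learning loop can be sketched as below. This is an illustration under stated assumptions, not the pipeline in `AL_NK_pipeline.ipynb`: `toy_oracle` is a hypothetical stand-in for a ground-truth landscape lookup, the surrogate is a simple per-site average rather than an MLP, and selection is greedy on the predicted phenotype.

```python
import itertools
import random

ALPHABET = "ACGT"


def toy_oracle(seq):
    """Hypothetical ground truth standing in for a landscape lookup."""
    score = sum(1.0 for base in seq if base in "GC")
    if seq[0] == seq[-1]:  # small epistatic term
        score += 0.5
    return score


def fit_site_means(measured):
    """Toy surrogate: mean phenotype observed for each (site, base) pair."""
    sums, counts = {}, {}
    for seq, y in measured.items():
        for key in enumerate(seq):  # key is (position, base)
            sums[key] = sums.get(key, 0.0) + y
            counts[key] = counts.get(key, 0) + 1
    overall = sum(measured.values()) / len(measured)
    means = {key: sums[key] / counts[key] for key in sums}
    # Unseen (site, base) pairs fall back to the overall mean.
    return lambda seq: sum(means.get(key, overall) for key in enumerate(seq))


def active_learning(pool, oracle, n_rounds=5, batch_size=10, seed=0):
    """Measure a random batch, then repeatedly refit and pick the top predictions."""
    rng = random.Random(seed)
    measured = {s: oracle(s) for s in rng.sample(pool, batch_size)}
    best_per_round = [max(measured.values())]
    for _ in range(n_rounds):
        predict = fit_site_means(measured)
        candidates = [s for s in pool if s not in measured]
        batch = sorted(candidates, key=predict, reverse=True)[:batch_size]
        measured.update((s, oracle(s)) for s in batch)
        best_per_round.append(max(measured.values()))
    return measured, best_per_round


pool = ["".join(s) for s in itertools.product(ALPHABET, repeat=5)]
measured, best_per_round = active_learning(pool, toy_oracle)
```

`best_per_round` tracks the best phenotype found after each round, which is the kind of curve typically compared against one-shot baselines.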

 

AL on Promoter landscape
4. AL on promoter genotype-phenotype landscapes (Figure 3)
- `Glu_model.py`, `Ura_model.py`: Code for querying the pre-trained promoter landscape. The landscape is the transformer model trained on a large-scale characterization of promoter expression in [2], available at https://github.com/1edv/evolution/.
- `AL_loop.py`: The main script for the active learning pipeline on the promoter landscape.
- `AL_sampling_methods.py`: The selection methods for the active learning pipeline on the promoter landscape.
- `AL_selection.py`: The UCB function for the active learning pipeline on the promoter landscape, adapted from [3].
- `promoter_benchmarking_ho.ipynb`: One-shot model performance on the promoter landscape with hyperparameter optimization, for comparison with AL performance. Three optimization methods for one-shot modelling are implemented: random screening (RS), strong-selection weak-mutation (SSWM) and gradient descent (GD).
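
An upper confidence bound (UCB) acquisition scores each candidate as its predicted mean plus a multiple of its predictive uncertainty. The sketch below illustrates the idea and is not the function in `AL_selection.py`: it assumes uncertainty is estimated as the spread of predictions across an ensemble of models, and the names `ucb_scores`, `select_batch` and `beta` are hypothetical.

```python
import statistics


def ucb_scores(ensemble_predictions, beta=1.0):
    """Score each sequence as mean + beta * standard deviation.

    ensemble_predictions maps a sequence to the list of predictions made by
    the members of a model ensemble (an assumption of this sketch).
    """
    return {
        seq: statistics.mean(preds) + beta * statistics.pstdev(preds)
        for seq, preds in ensemble_predictions.items()
    }


def select_batch(ensemble_predictions, batch_size, beta=1.0):
    """Return the batch_size sequences with the highest UCB score."""
    scores = ucb_scores(ensemble_predictions, beta)
    return sorted(scores, key=scores.get, reverse=True)[:batch_size]


# Illustrative predictions from a three-member ensemble.
predictions = {
    "AAA": [1.0, 1.0, 1.0],  # high mean, no disagreement
    "CCC": [0.4, 1.4, 0.9],  # slightly lower mean, high disagreement
    "GGG": [0.2, 0.2, 0.2],  # low mean, no disagreement
}
exploit = select_batch(predictions, batch_size=1, beta=0.0)
explore = select_batch(predictions, batch_size=1, beta=1.0)
```

With `beta=0.0` the selection is purely greedy on the mean; raising `beta` favours candidates on which the ensemble disagrees.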
 
5. Biological sampling and motif information (Figure 4)
- `motif_analysis.ipynb`: Motif analysis of the batches sampled by AL (Figure 4C).
- `AL_PFM.py`: Incorporates the motif information calculation into the UCB function.
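
As background for the motif information step, the sketch below computes a position frequency matrix (PFM) from a batch of equal-length sequences and the per-position information content in bits. It is a generic illustration, not the calculation in `AL_PFM.py`: the function names are hypothetical, a uniform background over the alphabet is assumed, and how the result enters the UCB score is not shown because the description above does not specify it.

```python
import math
from collections import Counter


def position_frequency_matrix(seqs, alphabet="ACGT"):
    """Count how often each base occurs at each position of aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    pfm = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        pfm.append({base: counts.get(base, 0) for base in alphabet})
    return pfm


def information_content(pfm):
    """Per-position information in bits: log2(alphabet size) minus entropy."""
    bits = []
    for column in pfm:
        total = sum(column.values())
        probs = [count / total for count in column.values() if count > 0]
        entropy = -sum(p * math.log2(p) for p in probs)
        bits.append(math.log2(len(column)) - entropy)
    return bits


batch = ["ACGT", "ACGA", "ACCA", "ACTA"]
pfm = position_frequency_matrix(batch)
ic = information_content(pfm)
```

Fully conserved positions reach the maximum of 2 bits for a four-letter alphabet, while positions with uniformly mixed bases approach 0 bits.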

 

References
[1] Sandhu et al., "Investigating the determinants of performance in machine learning for protein fitness prediction," Protein Science (2025).
[2] Vaishnav et al., "The evolution, evolvability and engineering of gene regulatory DNA," Nature (2022).

Files (80.4 MB)

Size     Checksum
12.0 MB  md5:d710a247904e4d22e400e9e0ea23c7c1
68.3 MB  md5:fd3343e51a2007e8aaa511fe9abdab77