Code and data for "Optimization of regulatory DNA with active learning"

Shen, Yuxin; Kudla, Grzegorz; Oyarzún, Diego

doi:10.5281/zenodo.17190718

Published September 24, 2025 | Version v1

Software Open


1. University of Edinburgh
2. The University of Edinburgh

Contributors

Contact person:

Oyarzún, Diego¹

1. The University of Edinburgh

Code and data for the paper “Optimization of regulatory DNA with active learning” by Shen, Kudla and Oyarzún.

data.zip - all NK landscapes in CSV format.

code.zip - Python code for reproducing the results of the paper.
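Since the internal layout of data.zip is not described here, a quick way to get oriented is to list the CSV files in the archive together with their header rows. The following is a minimal sketch using only the Python standard library; the function name `csv_headers` is hypothetical and not part of the released code, and the sketch assumes the CSV files are UTF-8 encoded.

```python
import csv
import io
import zipfile


def csv_headers(zip_path):
    """Map each CSV member of a zip archive to its first row.

    Hypothetical helper for inspection only; it makes no assumption
    about folder names or column names inside the archive.
    """
    headers = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip folders and non-CSV files
            with archive.open(name) as raw:
                # Assumption: the files are UTF-8 text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                headers[name] = next(csv.reader(text), [])
    return headers
```

Calling `csv_headers("data.zip")` after downloading the archive returns a dictionary from member path to the list of column names in that file.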

1. Code overview.

code.zip contains two subfolders, one for the NK landscape and one for the promoter landscape, plus one environment file.

- `AL.yml`: the environment for all the code

Active learning (AL) on NK landscape

2. NK genotype-phenotype landscapes (Figure 1)

- `nk_landscape.ipynb`: Generate the NK0-NK3 landscapes and save them as CSV files to serve as ground-truth landscapes. The NK model is adapted from the NK simulation of paper [1], available at https://github.com/acmater/NK_Benchmarking/blob/master/utils/nk_utils/NK_landscape.py.

- `nk_local_landscape.ipynb`: Generate the NK1-NK3 local landscapes.

- `nk_tsne.ipynb`: Plot 2D t-SNE embeddings of the genotype space, with sequences labelled according to their phenotype (Figure 1C).

- `nk_mlp.ipynb`: Train MLP models on four NK landscapes (Figure 1D).
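For readers unfamiliar with NK models, the sketch below shows one common way to build such a landscape: each of N sites contributes a random fitness value that depends on its own state and on the states of K other sites, and the fitness of a sequence is the mean contribution. This is an illustrative toy, not the code in `nk_landscape.ipynb`; the function name, the DNA alphabet, the random choice of interacting sites and the uniform contribution tables are all assumptions. That the suffix in NK0-NK3 corresponds to K is also an assumption.

```python
import itertools

import numpy as np


def make_nk_landscape(n_sites, k, alphabet="ACGT", seed=0):
    """Return {sequence: fitness} for a toy NK landscape (illustrative only)."""
    rng = np.random.default_rng(seed)
    a = len(alphabet)
    # Each site interacts with itself plus k other randomly chosen sites.
    neighbours = []
    for i in range(n_sites):
        others = [j for j in range(n_sites) if j != i]
        picked = rng.permutation(others)[:k].tolist()
        neighbours.append([i] + sorted(picked))
    # One random contribution table per site, indexed by the states
    # of the k + 1 sites it depends on.
    tables = [rng.random((a,) * (k + 1)) for _ in range(n_sites)]
    landscape = {}
    for genotype in itertools.product(range(a), repeat=n_sites):
        contributions = [
            tables[i][tuple(genotype[j] for j in neighbours[i])]
            for i in range(n_sites)
        ]
        sequence = "".join(alphabet[g] for g in genotype)
        landscape[sequence] = float(np.mean(contributions))
    return landscape
```

With `k=0` every site contributes independently, so the landscape is additive; larger `k` introduces epistasis and makes the landscape more rugged. The full enumeration grows as 4^N, so this sketch is only practical for short sequences.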

3. AL on NK genotype-phenotype landscapes (Figure 2)

- `AL_NK_pipeline.ipynb`: The active learning pipeline on the NK landscapes. Different conditions, such as AL with random sampling and ALDE, can be set in the pipeline.

- `NK_benchmarking_ho.ipynb`: One-shot model performance on the NK landscapes with hyperparameter optimization to compare with AL performance. Three optimization methods on one-shot modelling are implemented: random screening (RS), strong-selection weak-mutation (SSWM) and gradient descent (GD).
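The general shape of a pool-based active learning loop can be sketched as follows: measure an initial random batch, fit a surrogate model, pick the next batch, measure it, and repeat. This is a simplified illustration and not the pipeline in `AL_NK_pipeline.ipynb`. It swaps in a closed-form ridge regression on one-hot encoded sequences as the surrogate and a greedy "highest predicted value" rule for selection; all function names, the `oracle` callable and the default parameters are assumptions made for the example.

```python
import numpy as np


def one_hot(seqs, alphabet="ACGT"):
    """Flattened one-hot encoding of equal-length sequences."""
    index = {c: i for i, c in enumerate(alphabet)}
    x = np.zeros((len(seqs), len(seqs[0]) * len(alphabet)))
    for row, seq in enumerate(seqs):
        for pos, char in enumerate(seq):
            x[row, pos * len(alphabet) + index[char]] = 1.0
    return x


def fit_ridge(x, y, lam=1.0):
    """Closed-form ridge regression; returns a prediction function."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc = x - x_mean
    w = np.linalg.solve(xc.T @ xc + lam * np.eye(x.shape[1]), xc.T @ (y - y_mean))
    return lambda q: (q - x_mean) @ w + y_mean


def active_learning(pool, oracle, rounds=5, batch=8, strategy="greedy", seed=0):
    """Return (best sequence, its measured value, number of measurements)."""
    rng = np.random.default_rng(seed)
    x_pool = one_hot(pool)
    # Initial batch is always sampled at random.
    first = rng.choice(len(pool), size=batch, replace=False)
    measured = {int(i): oracle(pool[int(i)]) for i in first}
    for _ in range(rounds):
        unlabelled = np.array([i for i in range(len(pool)) if i not in measured])
        if len(unlabelled) == 0:
            break
        if strategy == "random":
            size = min(batch, len(unlabelled))
            chosen = rng.choice(unlabelled, size=size, replace=False)
        else:
            # Greedy: refit the surrogate, then take the top predictions.
            idx = list(measured)
            predict = fit_ridge(x_pool[idx], np.array([measured[i] for i in idx]))
            scores = predict(x_pool[unlabelled])
            chosen = unlabelled[np.argsort(scores)[::-1][:batch]]
        for i in chosen:
            measured[int(i)] = oracle(pool[int(i)])
    best = max(measured, key=measured.get)
    return pool[best], measured[best], len(measured)
```

Passing `strategy="random"` gives a random-sampling baseline with the same measurement budget, which is the kind of comparison the pipeline description refers to. On a ground-truth landscape stored as a dictionary, `oracle` can simply be the dictionary's `get` method.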

AL on Promoter landscape

4. AL on promoter landscape (Figure 3)

- `Glu_model.py`, `Ura_model.py`: Code for using the pre-trained promoter landscape. The promoter landscape is derived from the transformer model trained on a large-scale characterization of promoter expression in paper [2], available at https://github.com/1edv/evolution/.

- `AL_loop.py`: The main script for the active learning pipeline on promoter landscape.

- `AL_sampling_methods.py`: The selection methods for the active learning pipeline on promoter landscape.

- `AL_selection.py`: The UCB function for the active learning pipeline on promoter landscape, adapted from the paper [3].

- `promoter_benchmarking_ho.ipynb`: One-shot model performance on promoter landscape with hyperparameter optimization to compare with AL performance. Three optimization methods on one-shot modelling are implemented: random screening (RS), strong-selection weak-mutation (SSWM) and gradient descent (GD).
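An upper confidence bound (UCB) rule scores each candidate by its predicted value plus a multiple of the prediction uncertainty, then selects the highest-scoring candidates. The sketch below shows the generic form of this rule using the spread across an ensemble of models as the uncertainty estimate. It is not the code in `AL_selection.py`; the function name `ucb_select`, the ensemble-based uncertainty and the `beta` parameter are assumptions for illustration.

```python
import numpy as np


def ucb_select(ensemble_predictions, batch_size, beta=1.0):
    """Pick the candidates with the highest mean + beta * std.

    ensemble_predictions: array-like of shape (n_models, n_candidates).
    Returns (indices of the selected candidates, UCB score per candidate).
    """
    preds = np.asarray(ensemble_predictions, dtype=float)
    mean = preds.mean(axis=0)  # exploitation term
    std = preds.std(axis=0)    # exploration term (ensemble disagreement)
    scores = mean + beta * std
    # Stable sort so that ties keep their original candidate order.
    order = np.argsort(-scores, kind="stable")
    return order[:batch_size], scores
```

Setting `beta=0` reduces the rule to greedy selection on the mean prediction, while larger values of `beta` favour candidates on which the models disagree.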

5. Biological sampling and motif information (Figure 4)

- `motif_analysis.ipynb`: Conduct motif analysis for the batches sampled by AL (Figure 4C).

- `AL_PFM.py`: Combine the motif information calculation into the UCB function.
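As background for the motif-related files, a position frequency matrix (PFM) counts how often each nucleotide occurs at each position of a set of aligned sequences, and the per-position information content summarises how conserved each position is. The sketch below illustrates both quantities. It is not the code in `AL_PFM.py`, and it does not attempt to reproduce how motif information enters the UCB function there; the function names and the absence of pseudocounts are assumptions.

```python
import numpy as np


def position_frequency_matrix(seqs, alphabet="ACGT"):
    """Count matrix of shape (len(alphabet), sequence length)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    index = {c: i for i, c in enumerate(alphabet)}
    counts = np.zeros((len(alphabet), length), dtype=int)
    for seq in seqs:
        for pos, char in enumerate(seq):
            counts[index[char], pos] += 1
    return counts


def information_content(pfm):
    """Per-position information content in bits (no pseudocounts)."""
    pfm = np.asarray(pfm, dtype=float)
    p = pfm / pfm.sum(axis=0, keepdims=True)
    # Treat 0 * log2(0) as 0 by only taking the log where p > 0.
    log_p = np.log2(p, where=p > 0, out=np.zeros_like(p))
    return np.log2(pfm.shape[0]) + (p * log_p).sum(axis=0)
```

For a four-letter alphabet the information content ranges from 0 bits (all nucleotides equally frequent) to 2 bits (a fully conserved position).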

References

[1] Sandhu et al. "Investigating the determinants of performance in machine learning for protein fitness prediction." Protein Science (2025).

[2] Vaishnav et al. "The evolution, evolvability and engineering of gene regulatory DNA." Nature (2022).

Files (80.4 MB)

Name	MD5	Size
code.zip	d710a247904e4d22e400e9e0ea23c7c1	12.0 MB
data.zip	fd3343e51a2007e8aaa511fe9abdab77	68.3 MB
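The MD5 checksums listed above can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify` are hypothetical, and the file paths passed to them depend on where the archives were saved.

```python
import hashlib

# Checksums as listed for this record.
EXPECTED_MD5 = {
    "code.zip": "d710a247904e4d22e400e9e0ea23c7c1",
    "data.zip": "fd3343e51a2007e8aaa511fe9abdab77",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at `path` matches the checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("data.zip", "data.zip")` returns `True` only if the local file matches the published checksum.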

