Published July 5, 2024 | Version v2
Journal article Open

Designing realistic regulatory DNA with autoregressive language models

Description

This repository contains code used to perform all experiments reported in the paper "Designing realistic regulatory DNA with autoregressive language models" by Avantika Lal, David Garfield, Tommaso Biancalani, and Gokcen Eraslan, along with trained model weights and synthetic regulatory elements designed by various methods.

The folder structure is:

    - yeast_promoters: Notebooks and models related to the experiments on yeast promoter sequence generation.

    - human_enhancers: Notebooks and models related to the experiments on human enhancer sequence generation.

    - other_human_models: Notebooks related to the additional models used to validate synthetic human enhancers.

    - scripts: Python scripts and functions used in both experiments.

The trained regLM models are:

  • yeast_promoters/04_reglm/yeast_reglm.ckpt : regLM model trained on yeast promoter sequences
  • human_enhancers/04_reglm/human_reglm.ckpt : regLM model trained on human enhancer sequences

The trained regression models are found in the following folders:

  • yeast_promoters/02_regression_paired/ : sequence-to-expression regression models for yeast promoters trained on the same data as the regLM model
  • yeast_promoters/03_regression_separate/ : sequence-to-expression regression models for yeast promoters trained on data separate from that used for the regLM model
  • human_enhancers/02_regression_paired/ : sequence-to-expression regression models for human enhancers trained on the same data as the regLM model
  • human_enhancers/03_regression_separate/ : sequence-to-expression regression models for human enhancers trained on data separate from that used for the regLM model

Code to train, load, and test these models is available in the corresponding experiment folders.
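As a quick sanity check after extracting the archive, the locations listed above can be confirmed with a short script. This is a minimal sketch, not part of the repository: the function name `locate_models` and the dictionary keys are hypothetical, and it assumes the human enhancer folder is named `human_enhancers`, as in the folder structure above.

```python
from pathlib import Path

# Relative paths as listed in the description above.
# Assumption: the human enhancer folder is "human_enhancers" (with an underscore),
# matching the folder structure list.
MODEL_LOCATIONS = {
    "yeast_reglm": "yeast_promoters/04_reglm/yeast_reglm.ckpt",
    "human_reglm": "human_enhancers/04_reglm/human_reglm.ckpt",
    "yeast_regression_paired": "yeast_promoters/02_regression_paired",
    "yeast_regression_separate": "yeast_promoters/03_regression_separate",
    "human_regression_paired": "human_enhancers/02_regression_paired",
    "human_regression_separate": "human_enhancers/03_regression_separate",
}


def locate_models(root):
    """Map each label to its path under `root`, or None if it is missing."""
    root = Path(root)
    found = {}
    for label, relative in MODEL_LOCATIONS.items():
        path = root / relative
        found[label] = path if path.exists() else None
    return found


if __name__ == "__main__":
    # "." is a placeholder for wherever the archive was extracted.
    for label, path in locate_models(".").items():
        print(f"{label}: {path if path is not None else 'NOT FOUND'}")
```

Any label reported as missing suggests either an incomplete extraction or a folder name that differs from the assumption stated above.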

Files (15.6 GB)

  • md5:8ded4a12cc6e99df7057db4245e1450a : 15.4 GB
  • md5:fb54644e7fa0b5ca6ff157bc72cd223e : 10.9 kB
  • md5:04c18bf45a8ef0a72d3404fc45428847 : 7.0 kB
  • md5:527f7fdc351ffca94c2a718b1a05add5 : 270.5 MB
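
Because the largest file is over 15 GB, it may be worth checking a download against its listed MD5 checksum before extracting it. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and the file path in the usage line is a placeholder, since file names are not shown in the listing above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_md5("path/to/downloaded_file", "md5:8ded4a12cc6e99df7057db4245e1450a")` should return `True` only if the downloaded file is the 15.4 GB entry and arrived intact.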