Published September 11, 2025 | Version v1
Dataset Open

Microviridae SFT training datasets for Evo 1 and Evo 2

  • 1. Stanford University
  • 2. Arc Research Institute

Description

  • microviridae_sft_training_data_raw: All Microviridae data collected from NCBI Datasets, PhageScope, and OpenGenome
  • microviridae_sft_training_data_processed: Processed Microviridae data
    • Removed sequences >10 kb
    • Removed sequences with non-nucleotide characters
    • Clustered at 99% sequence identity
    • Prepended with soft prompting tokens
      • "+" indicates Microviridae
      • "+~" indicates 95–100% sequence identity to ΦX174
      • "+^" indicates 70–80% sequence identity to ΦX174
      • "+#" indicates 50–70% sequence identity to ΦX174
      • "+$" indicates <50% sequence identity to ΦX174
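
The filtering and token-prepending steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline. The function names (`passes_filters`, `soft_prompt`, `prepare`) are hypothetical, and several details are assumptions: that "non-nucleotide characters" means anything outside A/C/G/T, that sequences are uppercased first, that an identity of exactly 70% maps to "+^", and that a sequence with an identity outside the listed ranges (for example 80–95%) falls back to the bare "+" token. The 99% identity clustering step is omitted because it would need an external clustering tool.

```python
from typing import Optional

MAX_LENGTH = 10_000      # "Removed sequences >10 kb"
ALLOWED = set("ACGT")    # assumption: only A/C/G/T count as nucleotide characters


def passes_filters(seq: str) -> bool:
    """Return True for non-empty sequences of at most 10 kb with only allowed characters."""
    seq = seq.upper()
    return 0 < len(seq) <= MAX_LENGTH and set(seq) <= ALLOWED


def soft_prompt(identity_to_phix174: Optional[float]) -> str:
    """Map percent sequence identity to ΦX174 onto one of the listed tokens."""
    pid = identity_to_phix174
    if pid is None:
        return "+"       # Microviridae, no identity information
    if 95 <= pid <= 100:
        return "+~"
    if 70 <= pid <= 80:
        return "+^"      # assumption: exactly 70% is assigned here
    if 50 <= pid < 70:
        return "+#"
    if 0 <= pid < 50:
        return "+$"
    # Assumption: ranges the description does not list (e.g. 80-95%)
    # fall back to the generic Microviridae token.
    return "+"


def prepare(seq: str, identity_to_phix174: Optional[float]) -> Optional[str]:
    """Return the token-prefixed sequence, or None if the sequence is filtered out."""
    if not passes_filters(seq):
        return None
    return soft_prompt(identity_to_phix174) + seq.upper()
```

For example, `prepare("acgt", 97.0)` would yield a string beginning with "+~", while a sequence containing an ambiguity code such as "N" would be dropped under the alphabet assumption above.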

 

If you find this dataset useful, please cite:

@article {king2025,
author = {King, Samuel H and Driscoll, Claudia L and Li, David B and Guo, Daniel and Merchant, Aditi T and Brixi, Garyk and Wilkinson, Max E and Hie, Brian L},
title = {Generative design of novel bacteriophages with genome language models},
year = {2025},
doi = {10.1101/2025.09.12.675911},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/10.1101/2025.09.12.675911v1},
journal = {bioRxiv}
}

Files (150.3 MB)

Checksum Size
md5:9cc0906f28fa0b5f0b9aff18adc30126 72.5 MB
md5:28dfc7fa5930313977663c722de3653a 77.8 MB
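
After downloading, the MD5 checksums listed above can be used to check that a file arrived intact. The sketch below uses only the Python standard library. The helper names `md5sum` and `is_listed` are hypothetical, and because the record does not show file names here, any local path you pass in is your own choice.

```python
import hashlib

# Checksums as listed on this record.
EXPECTED_MD5 = {
    "9cc0906f28fa0b5f0b9aff18adc30126",
    "28dfc7fa5930313977663c722de3653a",
}


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed(path: str) -> bool:
    """Return True if the file's MD5 matches one of the listed checksums."""
    return md5sum(path) in EXPECTED_MD5
```

Reading in chunks keeps memory use low for files of this size (roughly 70–80 MB each).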