Published September 11, 2025 | Version v1
Dataset
Open
Microviridae SFT training datasets for Evo 1 and Evo 2
Description
- microviridae_sft_training_data_raw: All Microviridae data collected from NCBI Datasets, PhageScope, and OpenGenome
- microviridae_sft_training_data_processed: Processed Microviridae data
  - Removed sequences >10 kb
  - Removed sequences with non-nucleotide characters
  - Clustered at 99% sequence identity
  - Prepended with soft prompting tokens
    - "+" indicates Microviridae
    - "+~" indicates 95–100% sequence identity to ΦX174
    - "+^" indicates 70–80% sequence identity to ΦX174
    - "+#" indicates 50–70% sequence identity to ΦX174
    - "+$" indicates <50% sequence identity to ΦX174
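
The filtering and tagging steps listed above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the helper names (`identity_token`, `process`), the restriction to A/C/G/T, the handling of range boundaries, and the fallback to the plain "+" token for identities outside the listed ranges are all assumptions made for illustration. The 99% clustering step is omitted because it would normally be delegated to an external clustering tool.

```python
import re

MAX_LEN = 10_000                      # sequences longer than 10 kb are dropped
NUCLEOTIDES = re.compile(r"[ACGT]+")  # assumption: only A, C, G, T are accepted


def identity_token(identity):
    """Pick a soft prompt token from percent identity to PhiX174 (hypothetical helper)."""
    if identity is None:
        return "+"   # assumption: no identity value means the generic Microviridae token
    if 95 <= identity <= 100:
        return "+~"
    if 70 <= identity < 80:
        return "+^"  # assumption: exactly 70% falls in this bin
    if 50 <= identity < 70:
        return "+#"  # assumption: exactly 50% falls in this bin
    if identity < 50:
        return "+$"
    return "+"       # assumption: identities outside the listed ranges get the plain token


def process(records):
    """Filter (sequence, identity) pairs and prepend the matching token."""
    tagged = []
    for seq, identity in records:
        seq = seq.upper()
        # Drop sequences that are too long or contain non-nucleotide characters.
        if len(seq) > MAX_LEN or not NUCLEOTIDES.fullmatch(seq):
            continue
        tagged.append(identity_token(identity) + seq)
    return tagged
```

Calling `process([("acgt", 97.0), ("ACGN", 60.0)])` under these assumptions keeps only the first record and tags it with "+~".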
If you find this dataset useful, please cite:
@article {king2025,
author = {King, Samuel H and Driscoll, Claudia L and Li, David B and Guo, Daniel and Merchant, Aditi T and Brixi, Garyk and Wilkinson, Max E and Hie, Brian L},
title = {Generative design of novel bacteriophages with genome language models},
year = {2025},
doi = {10.1101/2025.09.12.675911},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/10.1101/2025.09.12.675911v1},
journal = {bioRxiv}
}
Files
(150.3 MB)
| Name | Size |
|---|---|
| md5:9cc0906f28fa0b5f0b9aff18adc30126 | 72.5 MB |
| md5:28dfc7fa5930313977663c722de3653a | 77.8 MB |
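
After downloading, the MD5 checksums listed above can be used to confirm that a file arrived intact. The following is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of` and `is_listed` are hypothetical, and because the listing does not show which checksum belongs to which file name, the check only tests membership in the set of listed checksums.

```python
import hashlib

# Checksums copied from the file listing above.
LISTED_MD5 = {
    "9cc0906f28fa0b5f0b9aff18adc30126",
    "28dfc7fa5930313977663c722de3653a",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed(path):
    """True if the file's MD5 matches one of the listed checksums (hypothetical helper)."""
    return md5_of(path) in LISTED_MD5
```

A downloaded archive for which `is_listed(path)` returns `False` was most likely truncated or corrupted in transit and should be fetched again.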