Published March 5, 2026 | Version v1.0
Dataset Open

viraldb-sh curated plant virus and viroid database (v1.0)

  • 1. Queensland University of Technology

Description

This dataset contains curated viral sequence databases generated using viraldb-sh, a lightweight pipeline for building harmonized viral databases from NCBI Virus and ViroidDB. The pipeline integrates sequence retrieval, taxonomy enrichment, sequence filtering, clustering using CD-HIT-EST, and policy-based representative sequence selection.
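As a rough illustration of the clustering step, the sketch below builds CD-HIT-EST command lines for the three identity thresholds used in this dataset. It is not the viraldb-sh implementation. The input file name, output prefix, word size, and thread count are assumptions chosen for the example; only the thresholds come from this record. The commands are printed, not executed.

```python
import shlex

# Identity thresholds taken from the dataset description.
THRESHOLDS = (1.0, 0.995, 0.99)


def cdhit_est_command(fasta_in, out_prefix, identity, threads=4):
    """Build (but do not run) a cd-hit-est argument list for one threshold.

    fasta_in, out_prefix and threads are hypothetical values for illustration.
    """
    tag = f"{identity:.3f}"  # e.g. 0.99 -> "0.990"
    return [
        "cd-hit-est",
        "-i", fasta_in,                 # input FASTA
        "-o", f"{out_prefix}_c{tag}",   # output file; clusters go to <output>.clstr
        "-c", tag,                      # sequence identity threshold
        "-n", "10",                     # word size (assumed; suited to high thresholds)
        "-d", "0",                      # keep names up to first whitespace in .clstr
        "-T", str(threads),             # number of threads (assumed)
    ]


for threshold in THRESHOLDS:
    print(shlex.join(cdhit_est_command("filtered.fasta", "clustered", threshold)))
```

Building the argument list separately from running it makes the parameters easy to inspect or log before any clustering is started.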

The dataset includes representative sequences clustered at three identity thresholds (100%, 99.5%, and 99%), together with summary tables describing cluster composition and associated metadata. These resources provide a non-redundant, traceable viral reference database suitable for downstream analyses such as viral detection, metagenomic classification, and phylogenetic studies.

All sequences retain associated taxonomic lineage information and clustering summaries to ensure transparency and reproducibility.

The database was generated using viraldb-sh v0.1 on 2026-03-05.

Contents

Representative sequence databases:

viraldb-sh_v1.0_representatives_c1.000.fasta.gz

viraldb-sh_v1.0_representatives_c0.995.fasta.gz

viraldb-sh_v1.0_representatives_c0.990.fasta.gz

Cluster summary tables:

viraldb-sh_v1.0_summary_c1.000.tsv

viraldb-sh_v1.0_summary_c0.995.tsv

viraldb-sh_v1.0_summary_c0.990.tsv

Additional documentation:

README_dataset.txt
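The files listed above can be inspected with the Python standard library alone. The following is a minimal sketch, assuming the archive has been unpacked into the current directory and that each summary table starts with a header row; the column names are not documented in this record, so the code only reports whatever it finds. The helper names are hypothetical.

```python
import csv
import gzip
import os


def dataset_files(threshold, version="v1.0"):
    """Return the (FASTA, summary) file names for one identity threshold."""
    tag = f"c{threshold:.3f}"  # 1.0 -> "c1.000", 0.995 -> "c0.995"
    return (
        f"viraldb-sh_{version}_representatives_{tag}.fasta.gz",
        f"viraldb-sh_{version}_summary_{tag}.tsv",
    )


def count_fasta_records(path):
    """Count records in a gzipped FASTA file by counting '>' header lines."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summary_columns(path):
    """Return the first row of a tab-separated table (assumed to be a header)."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"), [])


if __name__ == "__main__":
    for threshold in (1.0, 0.995, 0.99):
        fasta, summary = dataset_files(threshold)
        if os.path.exists(fasta):
            print(fasta, count_fasta_records(fasta), "sequences")
        if os.path.exists(summary):
            print(summary, "columns:", summary_columns(summary))
```

Comparing the record counts across the three thresholds gives a quick check that lower identity thresholds yield fewer representatives.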

Files (182.8 MB)

Name Size
md5:4b1dccca0e25a686a19a59b60a7c63b2
182.8 MB
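After downloading, the archive can be checked against the MD5 checksum listed above. This is a minimal sketch; the file name is not shown in this record, so the path passed to the function is a placeholder that must be replaced with the actual downloaded file.

```python
import hashlib

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "4b1dccca0e25a686a19a59b60a7c63b2"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Example (hypothetical path):
# verify_download("path/to/downloaded_archive")
```

Reading in chunks keeps memory use low for a file of this size (182.8 MB).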

Additional details

Dates

Created
2026-03-05
viraldb-sh curated plant virus and viroid database

Software

Repository URL
https://github.com/robertobarrero/viraldb-sh
Programming language
Python, Shell
Development Status
Active

References

  • Barrero R.A. viraldb-sh: A lightweight pipeline for building curated viral databases from NCBI Virus and ViroidDB.