Scorpio Gene-Taxa Benchmark Dataset

doi:10.5281/zenodo.12964684

Published July 30, 2024 | Version v2

Dataset Open

Refahi, Mohammad Saleh (Researcher)¹

1. Drexel University

Contributors

Supervisor: Rosen, Gail¹

1. Drexel University

We used the Woltka pipeline to compile the complete Basic genome dataset, consisting of 4634 genomes, with each genus represented by a single genome. After downloading all coding sequences (CDS) from the NCBI database, we extracted 8 million distinct CDS, focusing on bacteria and archaea and excluding viruses and fungi due to inadequate gene information.

To maintain accuracy, we excluded hypothetical proteins, uncharacterized proteins, and sequences without gene labels. We addressed issues with gene name inconsistencies in NCBI by keeping only genes with more than 1000 samples and ensuring each phylum had at least 350 sequences. This resulted in a curated dataset of 800,318 gene sequences from 497 genes across 2046 genera.
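The curation steps described above could be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the record layout (dictionaries with hypothetical "gene" and "phylum" keys), the function name curate, and the order in which the two thresholds are applied are all assumptions.

```python
from collections import Counter
from typing import Dict, List

# Labels treated as uninformative, taken from the description above.
UNINFORMATIVE = ("hypothetical protein", "uncharacterized protein")


def curate(records: List[Dict[str, str]],
           min_gene_count: int = 1000,
           min_phylum_count: int = 350) -> List[Dict[str, str]]:
    """Filter CDS records by label quality, gene frequency, and phylum size.

    Each record is assumed to be a dict with hypothetical keys
    "gene" and "phylum".
    """
    # Step 1: drop records with no gene label or an uninformative one.
    labelled = [
        r for r in records
        if r.get("gene")
        and not any(u in r["gene"].lower() for u in UNINFORMATIVE)
    ]
    # Step 2: keep genes with MORE than min_gene_count sequences.
    gene_counts = Counter(r["gene"] for r in labelled)
    frequent = [r for r in labelled if gene_counts[r["gene"]] > min_gene_count]
    # Step 3 (assumed to run after step 2): keep phyla with AT LEAST
    # min_phylum_count remaining sequences.
    phylum_counts = Counter(r["phylum"] for r in frequent)
    return [r for r in frequent
            if phylum_counts[r["phylum"]] >= min_phylum_count]
```

The thresholds default to the values quoted in the text; smaller values are convenient when trying the function on a toy list of records.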

We created four datasets to train and evaluate our model: a training set (Train_set); a test set (Test_set) with different samples but the same genera and genes as the training set; a Taxa_out_set containing sequences from 18 phyla that were excluded from the training set (different phyla); and a Gene_out_set containing sequences from 60 genes that were excluded from the training set (but from the same phyla). We ensured each CDS had only one representation per genome, removing genes with multiple representations within the same species.

Technical info

test.fasta: Contains sequences for model testing.
gene_out.fasta: Includes sequences excluded based on gene criteria for model evaluation.
taxa_out.fasta: Includes sequences excluded based on taxonomic criteria for model evaluation.
val.fasta: Contains sequences for model validation.
train.fasta: Contains sequences for model training.
metadata.csv: Contains metadata information for sequences in the FASTA files.
hierarchical-level.txt: Determines hierarchical levels for triplet training and hierarchical sampling required for Scorpio training.

General Description of FASTA Files:

FASTA files contain sequence data where each sequence entry begins with a header line, starting with ">" followed by a sequence identifier (seqid).
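As a minimal sketch of reading this layout, the generator below yields one (seqid, sequence) pair per entry. It assumes the seqid is the first whitespace-delimited token after ">", which is a common convention but is not stated in this record; the function name read_fasta is illustrative.

```python
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (seqid, sequence) pairs from an open FASTA file."""
    seqid = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if seqid is not None:
                yield seqid, "".join(chunks)
            # Assumption: the seqid is the first token of the header.
            fields = line[1:].split(None, 1)
            seqid = fields[0] if fields else ""
            chunks = []
        elif seqid is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if seqid is not None:
        yield seqid, "".join(chunks)
```

For example, opening test.fasta with open("test.fasta") and iterating over read_fasta(handle) streams the entries without loading the whole file into memory.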

Retrieving Information from Metadata:

To retrieve metadata information based on the sequence ID (seqid) from the FASTA files:

Extract the seqid from the header line of the sequence entry in the FASTA file.
Use the extracted seqid to look up the corresponding row in the metadata.csv file.
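The lookup step can be sketched with Python's standard csv module. The name of the identifier column in metadata.csv is not given in this record, so it is passed in as the hypothetical parameter id_column; check the header row of the file before choosing a value. The function name lookup_metadata is also illustrative.

```python
import csv
from typing import Dict, Iterable, TextIO


def lookup_metadata(metadata: TextIO,
                    seqids: Iterable[str],
                    id_column: str) -> Dict[str, Dict[str, str]]:
    """Return the metadata rows whose id_column value is in seqids.

    id_column is a placeholder: the real column name must be read
    from the header row of metadata.csv.
    """
    wanted = set(seqids)
    found = {}
    # Stream the file row by row; metadata.csv is over 100 MB.
    for row in csv.DictReader(metadata):
        key = row[id_column]
        if key in wanted:
            found[key] = row
    return found
```

A typical call opens the file with open("metadata.csv", newline="") and passes the handle together with the seqids collected from the FASTA headers. Identifiers that do not occur in the file are simply absent from the returned dictionary.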

@article{refahi2024scorpio,
  title={Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA sequences},
  author={Refahi, Mohammadsaleh and Sokhansanj, Bahrad A and Mell, Joshua Chang and Brown, James and Yoo, Hyunwoo and Hearne, Gavin and Rosen, Gail},
  journal={bioRxiv},
  pages={2024--07},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

Files (991.9 MB)

Name	MD5	Size
gene_out.fasta	2af8334b4e2fff8d993bc715f8844b81	49.7 MB
hierarchical-level.txt	40cc96c488644292c0ede67912bbdc3c	29.0 MB
metadata.csv	261c7c2895d76599b5ff175efa364d25	138.2 MB
taxa_out.fasta	14fbfbb4b1f4a612e84d85639e16cf68	12.9 MB
test.fasta	f175b2d3e988ac71251655575568c1a2	155.5 MB
train.fasta	635fb71791f13382a28e475f58da482a	606.5 MB
val.fasta	82e129333a8e58f790dbaacdb84ab7e6	36.2 kB
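The MD5 checksums listed above can be used to confirm that a download is intact. The following is a small sketch using Python's hashlib; the function names md5sum and verify are illustrative, and the directory passed to verify is whatever local folder holds the downloaded files.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "gene_out.fasta": "2af8334b4e2fff8d993bc715f8844b81",
    "hierarchical-level.txt": "40cc96c488644292c0ede67912bbdc3c",
    "metadata.csv": "261c7c2895d76599b5ff175efa364d25",
    "taxa_out.fasta": "14fbfbb4b1f4a612e84d85639e16cf68",
    "test.fasta": "f175b2d3e988ac71251655575568c1a2",
    "train.fasta": "635fb71791f13382a28e475f58da482a",
    "val.fasta": "82e129333a8e58f790dbaacdb84ab7e6",
}


def md5sum(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory) -> Dict[str, bool]:
    """Map each expected file name to True if present and matching."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5sum(path) == expected
    return results
```

Calling verify on the download folder returns a dictionary such as {"train.fasta": True, ...}; a False entry means the file is missing or its checksum differs from the one published here.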

Additional details

Repository URL: https://github.com/MsAlEhR/Scorpio

Statistic	All versions	This version
Views	138	87
Downloads	179	139
Data volume	37.8 GB	24.5 GB
