Scorpio Gene-Taxa Benchmark Dataset
Description
We used the Woltka pipeline to compile the complete Basic genome dataset, consisting of 4634 genomes, with each genus represented by a single genome. After downloading all coding sequences (CDS) from the NCBI database, we extracted 8 million distinct CDS, focusing on bacteria and archaea and excluding viruses and fungi due to inadequate gene information.
To maintain accuracy, we excluded hypothetical proteins, uncharacterized proteins, and sequences without gene labels. We addressed issues with gene name inconsistencies in NCBI by keeping only genes with more than 1000 samples and ensuring each phylum had at least 350 sequences. This resulted in a curated dataset of 800,318 gene sequences from 497 genes across 2046 genera.
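The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record layout (dictionaries with `gene` and `phylum` keys), the substring matching for uninformative labels, the function names, and the order in which the two thresholds are applied are all assumptions made for the example.

```python
from collections import Counter

# Substrings treated as "no usable gene label" (assumed matching rule).
UNINFORMATIVE = ("hypothetical", "uncharacterized")


def has_gene_label(gene):
    """Return True if the gene name is non-empty and not a placeholder label."""
    name = (gene or "").strip().lower()
    return bool(name) and not any(tag in name for tag in UNINFORMATIVE)


def curate(records, min_gene_count=1000, min_phylum_count=350):
    """Apply the three filters from the description to a list of records.

    Each record is a dict with hypothetical keys "gene" and "phylum".
    """
    # 1. Drop hypothetical, uncharacterized, and unlabeled sequences.
    labeled = [r for r in records if has_gene_label(r["gene"])]

    # 2. Keep only genes with more than `min_gene_count` samples.
    gene_counts = Counter(r["gene"] for r in labeled)
    frequent = [r for r in labeled if gene_counts[r["gene"]] > min_gene_count]

    # 3. Keep only phyla with at least `min_phylum_count` remaining sequences.
    phylum_counts = Counter(r["phylum"] for r in frequent)
    return [r for r in frequent if phylum_counts[r["phylum"]] >= min_phylum_count]
```

The defaults mirror the thresholds stated above (more than 1000 samples per gene, at least 350 sequences per phylum); smaller values are convenient for trying the function on toy data.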
We created four datasets to evaluate our model: a training set (Train_set); a test set (Test_set) containing different samples from the same genera and genes as the training set; a Taxa_out_set containing sequences from 18 phyla held out of the training set (genes seen in training, phyla unseen); and a Gene_out_set containing 60 genes held out of the training set (phyla seen in training, genes unseen). We ensured each CDS had only one representation per genome, removing genes with multiple representations within the same species.
Technical info
- test.fasta: Contains sequences for model testing.
- gene_out.fasta: Includes sequences excluded based on gene criteria for model evaluation.
- taxa_out.fasta: Includes sequences excluded based on taxonomic criteria for model evaluation.
- val.fasta: Contains sequences for model validation.
- train.fasta: Contains sequences for model training.
- metadata.csv: Contains metadata information for sequences in the FASTA files.
- hierarchical-level.txt: Determines hierarchical levels for triplet training and hierarchical sampling required for Scorpio training.
General Description of FASTA Files:
FASTA files contain sequence data where each sequence entry begins with a header line, starting with ">" followed by a sequence identifier (seqid).
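As an illustration of this layout, a minimal FASTA reader could look like the sketch below. It assumes the seqid is the first whitespace-delimited token after ">" and that a sequence may span several lines; the function name `read_fasta` is hypothetical and not part of the Scorpio code base.

```python
import io


def read_fasta(handle):
    """Yield (seqid, sequence) pairs from an open text handle."""
    seqid, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if seqid is not None:
                yield seqid, "".join(chunks)
            fields = line[1:].split()
            # Assumption: the seqid is the first token of the header.
            seqid = fields[0] if fields else ""
            chunks = []
        elif seqid is not None:
            chunks.append(line)
    if seqid is not None:
        yield seqid, "".join(chunks)


# Toy input; real use would pass open("test.fasta") instead.
example = io.StringIO(">seq1 some description\nATGC\nGGTA\n>seq2\nTTAA\n")
entries = list(read_fasta(example))
```

Because the function is a generator, it can stream through the larger files in this dataset without loading them fully into memory.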
Retrieving Information from Metadata:
To retrieve metadata information based on the sequence ID (seqid) from the FASTA files:
- Extract the seqid from the header of the FASTA file.
- Use the extracted seqid to look up the corresponding row in the metadata.csv file.
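The two steps above can be sketched with the standard library alone. The column name `seqid` and the other columns in the toy CSV (`gene`, `phylum`) are assumptions for illustration; check the actual header row of metadata.csv before relying on them. The helper names are likewise hypothetical.

```python
import csv
import io


def header_to_seqid(header):
    """Extract the seqid from a FASTA header line such as '>seq1 description'."""
    return header.lstrip(">").split()[0]


def load_metadata(handle, id_column="seqid"):
    """Index metadata rows by sequence ID.

    `id_column` is an assumed column name; adjust it to match metadata.csv.
    """
    return {row[id_column]: row for row in csv.DictReader(handle)}


# Toy metadata; real use would pass open("metadata.csv", newline="").
toy_csv = io.StringIO("seqid,gene,phylum\nseq1,rpoB,Firmicutes\nseq2,recA,Proteobacteria\n")
metadata = load_metadata(toy_csv)

# Look up the metadata row for one FASTA header.
row = metadata[header_to_seqid(">seq2 example header")]
```

Building the dictionary once and reusing it keeps each lookup constant-time, which matters when resolving every header in a large FASTA file.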
@article{refahi2024scorpio,
  title={Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA sequences},
  author={Refahi, Mohammadsaleh and Sokhansanj, Bahrad A and Mell, Joshua Chang and Brown, James and Yoo, Hyunwoo and Hearne, Gavin and Rosen, Gail},
  journal={bioRxiv},
  pages={2024--07},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
Files
(991.9 MB)
MD5 checksum | Size |
---|---|
md5:2af8334b4e2fff8d993bc715f8844b81 | 49.7 MB |
md5:40cc96c488644292c0ede67912bbdc3c | 29.0 MB |
md5:261c7c2895d76599b5ff175efa364d25 | 138.2 MB |
md5:14fbfbb4b1f4a612e84d85639e16cf68 | 12.9 MB |
md5:f175b2d3e988ac71251655575568c1a2 | 155.5 MB |
md5:635fb71791f13382a28e475f58da482a | 606.5 MB |
md5:82e129333a8e58f790dbaacdb84ab7e6 | 36.2 kB |
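After downloading, a file can be checked against the md5 values listed above. The following is a minimal sketch using Python's `hashlib`; the helper names are hypothetical, and which checksum belongs to which file has to be taken from the record page itself.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a binary stream, reading in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Compute the MD5 hex digest of a file on disk."""
    with open(path, "rb") as fh:
        return md5_of_stream(fh)


def matches_listing(actual_hex, listed):
    """Compare a digest with a listed value such as 'md5:2af8...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return actual_hex.lower() == expected


# Self-check on in-memory data; real use: md5_of_file("train.fasta").
demo_digest = md5_of_stream(io.BytesIO(b"hello"))
```

Reading in fixed-size chunks keeps memory use flat even for the largest file in the listing.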
Additional details
Software
- Repository URL: https://github.com/MsAlEhR/Scorpio