Published July 30, 2024 | Version v2
Dataset Open

Scorpio Gene-Taxa Benchmark Dataset

  • 1. Drexel University

Contributors

Supervisor:

  • 1. Drexel University

Description

 

We used the Woltka pipeline to compile the complete Basic genome dataset, which consists of 4,634 genomes, with each genus represented by a single genome. After downloading all coding sequences (CDS) from the NCBI database, we extracted 8 million distinct CDS. We focused on bacteria and archaea and excluded viruses and fungi because their gene information was inadequate.

To maintain accuracy, we excluded hypothetical proteins, uncharacterized proteins, and sequences without gene labels. We addressed gene-name inconsistencies in NCBI by keeping only genes with more than 1,000 samples and by ensuring that each phylum had at least 350 sequences. This resulted in a curated dataset of 800,318 gene sequences from 497 genes across 2,046 genera.
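The curation rules above can be sketched as a small filter. This is only an illustration under stated assumptions: the record layout (gene name, phylum, sequence), the function name filter_records, and the order in which the thresholds are applied are hypothetical and are not taken from the actual Scorpio pipeline.

```python
from collections import Counter

# Labels treated as uninformative, following the description above.
EXCLUDED_LABELS = ("hypothetical protein", "uncharacterized protein")


def filter_records(records, min_per_gene=1000, min_per_phylum=350):
    """Filter (gene_name, phylum, sequence) tuples (hypothetical layout)."""
    # 1. Drop unlabeled, hypothetical, and uncharacterized entries.
    labeled = [
        r for r in records
        if r[0] and not any(tag in r[0].lower() for tag in EXCLUDED_LABELS)
    ]
    # 2. Keep genes with MORE THAN min_per_gene sequences.
    gene_counts = Counter(r[0] for r in labeled)
    by_gene = [r for r in labeled if gene_counts[r[0]] > min_per_gene]
    # 3. Keep phyla with AT LEAST min_per_phylum sequences.
    #    Assumption: this check runs after the gene filter.
    phylum_counts = Counter(r[1] for r in by_gene)
    return [r for r in by_gene if phylum_counts[r[1]] >= min_per_phylum]
```

The thresholds are parameters so the same sketch can be tried on a small sample with lower cutoffs.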

We created four datasets to evaluate our model: a training set (Train_set); a test set (Test_set) with different samples but the same genera and genes as the training set; a Taxa_out_set containing sequences from 18 phyla that were excluded from the training set; and a Gene_out_set containing sequences from 60 genes that were excluded from the training set but come from the same phyla. We ensured each CDS had only one representation per genome by removing genes with multiple representations within the same species.
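One way to picture the held-out sets is as a rule that assigns each sequence to a split from its gene and phylum. The sketch below is an assumption-laden illustration, not the authors' procedure: the function name assign_split and the "in_distribution" label are hypothetical, and the description does not say what happens to a sequence whose gene and phylum are both held out, so that case is left unassigned here.

```python
def assign_split(gene, phylum, held_out_genes, held_out_phyla):
    """Return the split a sequence would fall into under the assumed rule."""
    gene_held_out = gene in held_out_genes
    phylum_held_out = phylum in held_out_phyla
    if gene_held_out and phylum_held_out:
        # Not specified in the description; left unassigned (assumption).
        return None
    if phylum_held_out:
        return "Taxa_out_set"
    if gene_held_out:
        # Held-out gene from a phylum that is also seen in training.
        return "Gene_out_set"
    # Remaining sequences are divided between Train_set and Test_set.
    return "in_distribution"
```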

Technical info

  1. test.fasta: Contains sequences for model testing.

  2. gene_out.fasta: Includes sequences excluded based on gene criteria for model evaluation.

  3. taxa_out.fasta: Includes sequences excluded based on taxonomic criteria for model evaluation.

  4. val.fasta: Contains sequences for model validation.

  5. train.fasta: Contains sequences for model training.

  6. metadata.csv: Contains metadata information for sequences in the FASTA files.

  7. hierarchical-level.txt: Determines hierarchical levels for triplet training and hierarchical sampling required for Scorpio training.

General Description of FASTA Files:

FASTA files contain sequence data where each sequence entry begins with a header line, starting with ">" followed by a sequence identifier (seqid). 
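As an illustration of this layout, a minimal FASTA reader might look like the sketch below. The function name read_fasta is hypothetical, and it assumes the seqid is the first whitespace-delimited token after ">", which may not match how headers are written in these files.

```python
def read_fasta(lines):
    """Yield (seqid, sequence) pairs from an iterable of FASTA lines."""
    seqid, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seqid is not None:
                yield seqid, "".join(chunks)
            # Assumption: the seqid is the first token of the header.
            fields = line[1:].split()
            seqid, chunks = (fields[0] if fields else ""), []
        else:
            # Sequence data may be wrapped across several lines.
            chunks.append(line)
    if seqid is not None:
        yield seqid, "".join(chunks)
```

Because it accepts any iterable of lines, it can be applied to an open file handle, for example `with open("test.fasta") as handle: records = list(read_fasta(handle))`.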

Retrieving Information from Metadata:

To retrieve metadata information based on the sequence ID (seqid) from the FASTA files:

  1. Extract the seqid from the header line of each entry in the FASTA file.
  2. Use the extracted seqid to look up the corresponding row in the metadata.csv file.
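The two steps above can be sketched with the standard csv module. The column name "seqid" is an assumption, since the header of metadata.csv is not documented here; check the file and pass the actual name through id_column. The helper names load_metadata, seqids_from_fasta, and annotate are hypothetical.

```python
import csv


def load_metadata(csv_file, id_column="seqid"):
    """Index metadata rows by sequence ID (id_column is an assumed name)."""
    return {row[id_column]: row for row in csv.DictReader(csv_file)}


def seqids_from_fasta(lines):
    """Step 1: yield the seqid from each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def annotate(fasta_lines, metadata):
    """Step 2: pair each seqid with its metadata row, or None if absent."""
    for seqid in seqids_from_fasta(fasta_lines):
        yield seqid, metadata.get(seqid)
```

A possible use, assuming both files are in the working directory: open metadata.csv with `open("metadata.csv", newline="")`, build the index with load_metadata, then iterate over `annotate(open("test.fasta"), metadata)`.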

 

 

@article{refahi2024scorpio,
  title={Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA sequences},
  author={Refahi, Mohammadsaleh and Sokhansanj, Bahrad A and Mell, Joshua Chang and Brown, James and Yoo, Hyunwoo and Hearne, Gavin and Rosen, Gail},
  journal={bioRxiv},
  pages={2024--07},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

Files

hierarchical-level.txt

Files (991.9 MB)

Checksum Size
md5:2af8334b4e2fff8d993bc715f8844b81 49.7 MB
md5:40cc96c488644292c0ede67912bbdc3c 29.0 MB
md5:261c7c2895d76599b5ff175efa364d25 138.2 MB
md5:14fbfbb4b1f4a612e84d85639e16cf68 12.9 MB
md5:f175b2d3e988ac71251655575568c1a2 155.5 MB
md5:635fb71791f13382a28e475f58da482a 606.5 MB
md5:82e129333a8e58f790dbaacdb84ab7e6 36.2 kB
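
A downloaded file can be checked against its listed MD5 checksum with Python's hashlib. This is a generic sketch; the helper names md5_of_file and matches_listing are hypothetical, and the expected value must be the checksum that the record lists for that specific file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use small even for the largest file in the record.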

Additional details

Software