Published March 15, 2023 | Version Themisto v3
Dataset Open

Species-colored Themisto v3 index with 640k bacterial genomes

  • 1. University of Helsinki

Description

This is a Themisto v3 [1] index containing the 639,981 high-quality genomes from the 661k bacterial genomes dataset of Blackwell et al. [2]. The index contains all distinct 31-mers of the dataset, counting both strands: 71 billion distinct 31-mers, which form 35.5 billion reverse-complement pairs. Each 31-mer is annotated with the set of identifiers of the species whose genomes contain it. These species identifiers are called colors. There are 2340 distinct colors in the dataset, so the color identifiers range from 0 to 2339.
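The two k-mer counts agree because k = 31 is odd. The middle base of an odd-length k-mer would have to be its own complement for the k-mer to equal its reverse complement, so every 31-mer is paired with a different 31-mer, and 2 × 35.5 billion = 71 billion. The following Python sketch checks this parity argument by brute force for small k. The function names are illustrative and are not part of Themisto.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]

def count_self_reverse_complements(k: int) -> int:
    """Count the k-mers that equal their own reverse complement (brute force)."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            count += 1
    return count

# For odd k, no k-mer is its own reverse complement, so the number of
# distinct k-mers over both strands is exactly twice the number of pairs.
odd_k_count = count_self_reverse_complements(3)
# For even k, such k-mers do exist (for example "AT" or "CG").
even_k_count = count_self_reverse_complements(2)
```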

To pseudoalign reads.fastq against the index using 16 threads, install Themisto v3 and run the following command:

themisto pseudoalign -q reads.fastq -i themisto_640k/index -t 16 --temp-dir .

This will output one line of space-separated integers per read in the input. The first integer on a line is the zero-based rank of the read in the fastq file, and the rest of the integers are the identifiers of colors that are compatible with the read. The file color_names.csv lists the species name and the taxid for each color.
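The output format described above can be read with a few lines of Python. This is a minimal sketch, not part of Themisto: the function name and the example lines are made up for illustration, and the dictionary keyed by read rank does not assume that lines arrive in any particular order.

```python
def parse_pseudoalignment(lines):
    """Map each read rank to the list of color identifiers compatible with it.

    Each line holds space-separated integers: the zero-based rank of the
    read, followed by zero or more color identifiers.
    """
    hits = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rank = int(fields[0])
        hits[rank] = [int(color) for color in fields[1:]]
    return hits

# Made-up example lines in the documented format.
example_lines = ["0 12 57", "2", "1 57"]
hits = parse_pseudoalignment(example_lines)
```

With a real output file, the same function can be applied to an open file handle, since iterating over a file yields its lines.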

The pseudoalignment counts should not be directly used as abundance estimates because they only describe which reads are compatible with which species, and a single read may be compatible with many. We recommend using mSWEEP to estimate abundances based on the pseudoalignment data: https://github.com/PROBIC/mSWEEP.
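A small illustration of why raw counts mislead, using made-up data: when each read is tallied once for every color it is compatible with, the tallies add up to more than the number of reads, so they cannot be read as shares of the sample. The names below are hypothetical, and the sketch does not replace a proper abundance estimator.

```python
from collections import Counter

def color_hit_counts(hits):
    """Count, for each color, the number of reads compatible with it."""
    counts = Counter()
    for colors in hits.values():
        counts.update(colors)
    return counts

# Made-up pseudoalignment result: read rank -> compatible colors.
hits = {0: [12, 57], 1: [57], 2: [12, 57, 301]}
counts = color_hit_counts(hits)

# Three reads produce six (read, color) hits, because reads 0 and 2
# are each compatible with more than one species.
total_hits = sum(counts.values())
number_of_reads = len(hits)
```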

--

The index was constructed with Themisto v3.0.0 using the following command line parameters:

themisto build -i input_file_list.txt --file-colors --reverse-complements -o 640k_bacteria -m 512000 -t 48 -k 31 --temp-dir temp --verbose -d 20

The file source_accessions.txt lists the accession numbers of assemblies included in the database.

[1] Alanko, J. N., Vuohtoniemi, J., Maklin, T., & Puglisi, S. J. (2023). Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. bioRxiv, 2023-02.

[2] Blackwell, G. A., Hunt, M., Malone, K. M., Lima, L., Horesh, G., Alako, B. T., ... & Iqbal, Z. (2021). Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS biology, 19(11), e3001421.

 

Files

Total size: 72.0 GB
md5: 5eda574a7d5a923c1b25ed79f33eb2b1
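Because the download is large, it can be worth checking it against the md5 checksum listed above before use. The sketch below uses Python's standard hashlib module and reads the file in chunks so that it never has to fit in memory. The file path is a placeholder, since the archive name is not given here.

```python
import hashlib

# Checksum listed for the download on this page.
EXPECTED_MD5 = "5eda574a7d5a923c1b25ed79f33eb2b1"

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest equals the expected value."""
    return md5_of_file(path) == expected
```

For example, `matches_expected("path/to/downloaded_file")` returns True for an intact download, where the path is whatever name the file was saved under.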

Additional details

Funding

  • Academy of Finland, grant 339070: Dynamic Succinct Data Structures
  • Academy of Finland, grant 351150: Massively Parallel Algorithms and Analysis for Metagenomics and Pangenomics (MAPAMEPA) / Consortium: MAPAMEPA
  • Academy of Finland, grant 336092: Design and Verification Methods for Massively Parallel Distributed Systems (DeVeMaPa)
  • Academy of Finland, grant 351145: Massively Parallel Algorithms and Analysis for Metagenomics and Pangenomics (MAPAMEPA) / Consortium: MAPAMEPA