Species-colored Themisto v3 index with 640k bacterial genomes

doi:10.5281/zenodo.7736981

Published March 15, 2023 | Version Themisto v3

Jarno N. Alanko¹

1. University of Helsinki

This is a Themisto v3 [1] index containing the 639,981 high-quality genomes from the 661k bacterial genomes dataset of Blackwell et al. [2]. The index contains all distinct 31-mers of the dataset, counting both strands: 71 billion distinct 31-mers in total, or 35.5 billion reverse-complement pairs. Each 31-mer is annotated with the set of identifiers of the species whose genomes contain it. These species identifiers are called colors. There are 2340 distinct colors in the dataset, so the color identifiers range from 0 to 2339.

To pseudoalign reads.fastq against the index using 16 threads, install Themisto v3 and run the following command:

themisto pseudoalign -q reads.fastq -i themisto_640k/index -t 16 --temp-dir .

This will output one line of space-separated integers per read in the input. The first integer on a line is the zero-based rank of the read in the fastq file, and the rest of the integers are the identifiers of colors that are compatible with the read. The file color_names.csv lists the species name and the taxid for each color.
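The output format described above can be read with a few lines of Python. This is a minimal sketch, not part of Themisto: the function name parse_pseudoalignment and the example lines are hypothetical, and the only assumption about the format is the one stated above (first integer is the read rank, the remaining integers are color identifiers).

```python
from typing import Iterable, Iterator, List, Tuple


def parse_pseudoalignment(lines: Iterable[str]) -> Iterator[Tuple[int, List[int]]]:
    """Yield (read_rank, color_ids) for each non-empty output line."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        read_rank = int(fields[0])                # zero-based rank of the read
        color_ids = [int(x) for x in fields[1:]]  # may be empty: no compatible color
        yield read_rank, color_ids


# Made-up example lines in the described format (not real output).
example = ["0 12 57 2339", "1", "2 4"]
for rank, colors in parse_pseudoalignment(example):
    print(rank, colors)
```

Because every line carries its own read rank, the parser does not rely on the lines arriving in any particular order. To read a real result, pass an open file handle instead of the example list.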

The pseudoalignment counts should not be directly used as abundance estimates because they only describe which reads are compatible with which species, and a single read may be compatible with many. We recommend using mSWEEP to estimate abundances based on the pseudoalignment data: https://github.com/PROBIC/mSWEEP.
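A small made-up example illustrates why raw counts overstate abundance. The dictionary hits below is hypothetical data, and reads_per_color is an illustrative helper, not part of Themisto or mSWEEP.

```python
from collections import Counter

# Hypothetical parsed output: read rank -> compatible color ids.
hits = {
    0: [1, 2, 3],  # one read compatible with three species
    1: [2],
    2: [],         # compatible with no species in the index
}


def reads_per_color(hits):
    """Count, for each color, how many reads are compatible with it."""
    counts = Counter()
    for colors in hits.values():
        counts.update(set(colors))  # count each read at most once per color
    return counts


counts = reads_per_color(hits)
# Read 0 is counted once for each of its three colors, so the per-color
# counts add up to more than the number of reads.
print(sum(counts.values()), "compatibility hits from", len(hits), "reads")
```

Here three reads produce four compatibility hits, so the per-color counts cannot be read as shares of the sample.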

--

The index was constructed with Themisto v3.0.0 using the following command-line parameters:

themisto build -i input_file_list.txt --file-colors --reverse-complements -o 640k_bacteria -m 512000 -t 48 -k 31 --temp-dir temp --verbose -d 20

The file source_accessions.txt lists the accession numbers of assemblies included in the database.

[1] Alanko, J. N., Vuohtoniemi, J., Mäklin, T., & Puglisi, S. J. (2023). Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. bioRxiv, 2023-02.

[2] Blackwell, G. A., Hunt, M., Malone, K. M., Lima, L., Horesh, G., Alako, B. T., ... & Iqbal, Z. (2021). Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS Biology, 19(11), e3001421.

Files (72.0 GB)

Name	MD5	Size
themisto_640k.tar	5eda574a7d5a923c1b25ed79f33eb2b1	72.0 GB
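With a 72 GB archive, it may be worth checking the download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using Python's standard hashlib module. The helper name md5_of_file and the chunk size are illustrative choices. The file is read in chunks so that it never has to fit in memory.

```python
import hashlib

# Checksum listed for themisto_640k.tar in the file table above.
EXPECTED_MD5 = "5eda574a7d5a923c1b25ed79f33eb2b1"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str) -> bool:
    """True if the file at path has the checksum listed for the archive."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling matches_expected("themisto_640k.tar") on the downloaded file should return True for an intact download. The local path is an assumption about where the archive was saved.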

Additional details

Funding

Academy of Finland, grant 339070: Dynamic Succinct Data Structures
Academy of Finland, grant 351150: Massively Parallel Algorithms and Analysis for Metagenomics and Pangenomics (MAPAMEPA) / Consortium: MAPAMEPA
Academy of Finland, grant 336092: Design and Verification Methods for Massively Parallel Distributed Systems (DeVeMaPa)
Academy of Finland, grant 351145: Massively Parallel Algorithms and Analysis for Metagenomics and Pangenomics (MAPAMEPA) / Consortium: MAPAMEPA
