Kraken2 index of Homo sapiens genome assembly T2T-CHM13v2.0

Mathur, Arpit; Terse, Vishram; Gawde, Vaibhav; Gokarn, Anant; Patkar, Nikhil

doi:10.5281/zenodo.16459107

Published July 26, 2025 | Version v1

Dataset Open


1. Advanced Centre for Treatment, Research and Education in Cancer
2. Indraprastha Institute of Information Technology Delhi

This dataset contains the Kraken2 index for GCF_009914755.1, the Homo sapiens genome assembly T2T-CHM13v2.0.

With the advent of new human reference genomes, the field has progressed beyond the traditional hg19 and hg38 assemblies. The Telomere-to-Telomere (T2T) CHM13v2.0 assembly represents the first complete human genome, providing a gapless and highly accurate reference that spans all chromosomes end-to-end. It overcomes limitations of earlier references by resolving centromeric regions, telomeres, segmental duplications, and other repetitive regions that were previously unresolved or misassembled. As a result, CHM13v2.0 offers improved read alignment, variant calling, and annotation accuracy, especially in previously inaccessible genomic regions, which makes it a valuable resource for high-resolution genomic studies.

K-mer–based classification tools, which are significantly faster than traditional read aligners such as BWA, Bowtie2, Minimap2, or BWA-MEM2, often face a major limitation: the difficulty of building a customized index from user-defined FASTA files. These tools rely on additional taxonomy files such as nodes.dmp and names.dmp, which store hierarchical relationships and taxonomic names.

However, for custom references, such as specific human assemblies (e.g., CHM13v2.0) or genomes of unknown or artificial evolutionary origin, this taxonomic information is often unavailable or not applicable, which makes it difficult or impossible to construct a proper index for k-mer–based classification.
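To make the taxonomy requirement concrete, the following Python sketch writes a minimal nodes.dmp and names.dmp (a root node plus one custom taxon) and tags a FASTA header with a `kraken:taxid` label, the explicit taxonomy assignment described in the Kraken2 manual. It is an illustration only and is not taken from FastaKrakenizer: the function names are hypothetical, the files use the NCBI dump layout with `\t|\t` separators, and only the leading columns are filled, so whether a given tool accepts such truncated rows is an assumption that should be checked.

```python
from pathlib import Path

SEP = "\t|\t"  # field separator used in NCBI taxonomy dump files
END = "\t|\n"  # row terminator used in NCBI taxonomy dump files


def write_minimal_taxonomy(out_dir, taxid, name, rank="species"):
    """Write nodes.dmp and names.dmp with a root node and one custom taxon.

    Hypothetical helper for illustration; not part of Kraken2 or FastaKrakenizer.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # nodes.dmp columns written here: tax_id, parent tax_id, rank.
    # The remaining NCBI columns are omitted (assumption: not required).
    nodes = [(1, 1, "no rank"), (taxid, 1, rank)]
    with open(out / "nodes.dmp", "w") as fh:
        for row in nodes:
            fh.write(SEP.join(str(x) for x in row) + END)

    # names.dmp columns: tax_id, name, unique name (left empty), name class.
    names = [
        (1, "root", "", "scientific name"),
        (taxid, name, "", "scientific name"),
    ]
    with open(out / "names.dmp", "w") as fh:
        for row in names:
            fh.write(SEP.join(str(x) for x in row) + END)


def tag_header(header, taxid):
    """Append '|kraken:taxid|<taxid>' to the sequence ID of a FASTA header.

    Expects a header line starting with '>' and without a trailing newline.
    """
    seq_id, _, rest = header[1:].partition(" ")
    tagged = f">{seq_id}|kraken:taxid|{taxid}"
    return f"{tagged} {rest}" if rest else tagged
```

Every header in the custom FASTA file would be passed through `tag_header` with the same taxon ID that was written to the two taxonomy files.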

To overcome this challenge, we developed a tool named FastaKrakenizer (https://github.com/arpit20328/FastaKrakenizer), a simple Bash script that enables the creation of functional Kraken2-compatible k-mer indices from arbitrary FASTA files without requiring external taxonomy files. This broadens the applicability of k-mer–based classification to a wider range of genomic references.

The tool allows rapid identification and quantification of sequencing reads that are classified against a custom FASTA reference, facilitating downstream analysis for researchers in genomics and computational biology.
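As a rough illustration of the quantification step, the Python sketch below parses a report in the standard six-column, tab-separated Kraken2 report layout (percentage, reads in clade, reads assigned directly, rank code, taxonomy ID, indented name) and looks up the clade read count for one taxon. The function names are hypothetical, and the sketch assumes the default report without the extra minimizer columns.

```python
def parse_report(text):
    """Parse standard six-column Kraken2 report text into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
        pct, clade, direct, rank, taxid, name = fields
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),    # reads in the clade rooted here
            "direct_reads": int(direct),  # reads assigned directly to this taxon
            "rank": rank,
            "taxid": int(taxid),
            "name": name.strip(),         # names are indented by tree depth
        })
    return rows


def clade_reads_for(rows, taxid):
    """Return the clade read count for a taxon ID, or 0 if it is absent."""
    for row in rows:
        if row["taxid"] == taxid:
            return row["clade_reads"]
    return 0
```

With a single-taxon index such as the one provided here, the count for the custom taxon and the count for the unclassified row (taxonomy ID 0) together account for all reads in the report.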

As an example, the Kraken2 index for the T2T-CHM13v2.0 human genome assembly (GCF_009914755.1) is provided here as a compressed .tar.gz archive.

For questions, suggestions, or issues, please visit our GitHub repository and open an issue in the Issues section: https://github.com/arpit20328/FastaKrakenizer

Files (4.1 GB)

Name	MD5	Size
GCF_009914755.1_kraken2_index.tar.gz	2212b9291cdeb0d64f2b8ffc6c4f7468	4.1 GB
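
Before using the archive, it is worth checking the download against the MD5 checksum listed above. The following Python sketch computes the checksum in chunks and extracts the archive only if it matches. The function names and the destination directory are illustrative assumptions, and the layout of the files inside the archive is not described in this record.

```python
import hashlib
import tarfile

# MD5 checksum listed for GCF_009914755.1_kraken2_index.tar.gz in this record.
EXPECTED_MD5 = "2212b9291cdeb0d64f2b8ffc6c4f7468"


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive, dest, expected=EXPECTED_MD5):
    """Extract a .tar.gz archive into dest only if its MD5 matches expected."""
    actual = md5sum(archive)
    if actual != expected:
        raise ValueError(f"MD5 mismatch: expected {expected}, got {actual}")
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

A call such as `verify_and_extract("GCF_009914755.1_kraken2_index.tar.gz", "kraken2_index")` would unpack the index into a hypothetical `kraken2_index` directory, or raise `ValueError` if the download is corrupted.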

Additional details

Programming language: Python, Linux Kernel Module

