Published July 26, 2025 | Version v1
Dataset Open

Kraken2 index of Homo sapiens genome assembly T2T-CHM13v2.0

  • 1. Advanced Centre for Treatment, Research and Education in Cancer
  • 2. Indraprastha Institute of Information Technology Delhi

Description

This dataset contains the Kraken2 index for GCF_009914755.1, the Homo sapiens genome assembly T2T-CHM13v2.0.

With the advent of new human reference genomes, the field has progressed beyond the traditional hg19 and hg38 assemblies. The Telomere-to-Telomere (T2T) CHM13v2.0 assembly is the first complete human genome assembly: a gapless, highly accurate reference that spans every chromosome end to end. It overcomes limitations of earlier references by resolving centromeric regions, telomeres, segmental duplications, and other repetitive regions that were previously unresolved or misassembled. As a result, CHM13v2.0 improves read alignment, variant calling, and annotation accuracy, especially in previously inaccessible genomic regions, which makes it a valuable resource for high-resolution genomic studies.

K-mer–based classification tools such as Kraken2 are significantly faster than traditional read aligners such as BWA, Bowtie2, Minimap2, or BWA-MEM2, but they share a major limitation: building a customized index from user-defined FASTA files is difficult. These tools rely on additional taxonomy files such as nodes.dmp, which stores the hierarchical relationships between taxa, and names.dmp, which stores the taxonomic names.

However, for custom references, such as specific human assemblies (e.g., CHM13v2.0) or genomes of unknown or artificial evolutionary origin, such taxonomic information is often unavailable or not applicable, which prevents a standard index build for k-mer–based classification.
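For orientation, the two taxonomy files named above follow the NCBI taxonomy dump layout, in which fields are separated by a tab, a pipe, and a tab. The sketch below writes a minimal two-node taxonomy (the root plus one custom node) in that layout. It is an illustration only: the function names are hypothetical, and it assumes that a classifier needs just the first three columns of nodes.dmp (the full NCBI file carries more columns) and the "scientific name" rows of names.dmp.

```python
from pathlib import Path

# NCBI taxonomy dump files separate fields with "\t|\t" and end each row with "\t|".
FIELD_SEP = "\t|\t"
ROW_END = "\t|\n"


def dump_row(*fields):
    """Format one row in the NCBI taxonomy dump layout."""
    return FIELD_SEP.join(str(field) for field in fields) + ROW_END


def write_minimal_taxonomy(out_dir, taxid, name, rank="species"):
    """Write a two-node taxonomy: the root (taxid 1) and one child node.

    Hypothetical helper for illustration; only the first three nodes.dmp
    columns are written, which is an assumption about what the indexer reads.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # nodes.dmp columns: tax_id, parent tax_id, rank. The root is its own parent.
    nodes = dump_row(1, 1, "no rank") + dump_row(taxid, 1, rank)

    # names.dmp columns: tax_id, name, unique name (left empty), name class.
    names = dump_row(1, "root", "", "scientific name")
    names += dump_row(taxid, name, "", "scientific name")

    (out_dir / "nodes.dmp").write_text(nodes)
    (out_dir / "names.dmp").write_text(names)
    return out_dir
```

Calling write_minimal_taxonomy("taxonomy", 9606, "Homo sapiens") would produce both files in a directory named taxonomy; 9606 is the NCBI taxonomy identifier for Homo sapiens, and any unused identifier could stand in for a reference with no natural place in the NCBI tree.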

To overcome this challenge, we developed a tool named FastaKrakenizer (https://github.com/arpit20328/FastaKrakenizer), a simple Bash script that enables the creation of functional Kraken2-compatible k-mer indices from arbitrary FASTA files without requiring external taxonomy files. This broadens the applicability of k-mer–based classification to a wider range of genomic references.
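One general way to attach arbitrary sequences to a taxonomy node, described in the Kraken2 manual for custom databases, is to embed a kraken:taxid|N tag in each FASTA sequence ID. The sketch below applies that convention to a stream of FASTA lines. It is not taken from FastaKrakenizer and may differ from what the script does; the function name and the example record are hypothetical.

```python
def tag_fasta_headers(lines, taxid):
    """Yield FASTA lines with '|kraken:taxid|<taxid>' appended to each sequence ID.

    Hypothetical helper for illustration. Headers that already carry a
    kraken:taxid tag, and all sequence lines, are passed through unchanged.
    """
    for line in lines:
        if not line.startswith(">"):
            yield line
            continue

        header = line[1:].rstrip("\n")
        # The sequence ID is everything up to the first whitespace.
        parts = header.split(None, 1)
        seq_id = parts[0] if parts else ""
        description = parts[1] if len(parts) > 1 else ""

        if "kraken:taxid|" in seq_id:
            yield line
            continue

        tagged = f">{seq_id}|kraken:taxid|{taxid}"
        if description:
            tagged += f" {description}"
        yield tagged + "\n"


if __name__ == "__main__":
    example = [">seq1 example record\n", "ACGT\n"]
    for out_line in tag_fasta_headers(example, 9606):
        print(out_line, end="")
```

A FASTA file tagged this way can then be handed to kraken2-build with --add-to-library, followed by a kraken2-build --build run, provided a taxonomy directory containing the referenced node is present in the database folder.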

The resulting index allows rapid identification and quantification of sequencing reads that match a custom FASTA reference, facilitating downstream analysis for researchers in genomics and computational biology.
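As an illustration of the quantification step, the sketch below tallies reads from Kraken2's standard per-read output, which is tab-delimited with a C or U status, the read ID, the assigned taxonomy ID, the read length, and the k-mer mapping string. It assumes the default output (without --use-names, which changes the third column), and the function name and sample records are hypothetical.

```python
from collections import Counter


def summarize_kraken_output(lines):
    """Count reads per assigned taxid from Kraken2 per-read output lines.

    Hypothetical helper for illustration. Reads flagged 'U' are counted
    under the key 'unclassified'; malformed lines are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        status, _read_id, taxid = fields[:3]
        counts[taxid if status == "C" else "unclassified"] += 1
    return counts


if __name__ == "__main__":
    # Made-up records in the five-column layout, for demonstration only.
    sample = [
        "C\tread1\t9606\t150\t9606:116\n",
        "U\tread2\t0\t150\t0:116\n",
        "C\tread3\t9606\t150\t9606:116\n",
    ]
    summary = summarize_kraken_output(sample)
    total = sum(summary.values())
    for key, count in summary.most_common():
        print(f"{key}\t{count}\t{count / total:.1%}")
```

For a single-reference index such as this one, the share of reads counted under the reference's taxonomy ID gives a quick estimate of how much of a sample matches the human assembly.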

As an example, the Kraken2 index for the T2T-CHM13v2.0 human genome assembly (GCF_009914755.1) is provided here as a compressed .tar.gz archive.
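Because the archive is several gigabytes, it may be worth checking the download against the MD5 checksum listed in the Files section before unpacking it. The sketch below does that with the Python standard library. The archive path is a placeholder, since the file name is not given on this page, and the function names are hypothetical.

```python
import hashlib
import tarfile

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "2212b9291cdeb0d64f2b8ffc6c4f7468"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive_path, dest_dir, expected_md5=EXPECTED_MD5):
    """Extract a .tar.gz archive only if its MD5 digest matches the expected value."""
    actual = md5_of_file(archive_path)
    if actual != expected_md5:
        raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual}")
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
    return dest_dir
```

A call such as verify_and_extract("downloaded_index.tar.gz", "chm13_kraken2_db"), with both names standing in for local paths, would raise an error on a corrupted download instead of unpacking it.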

For questions, suggestions, or issues, please visit our GitHub repository and open an issue in the Issues section: https://github.com/arpit20328/FastaKrakenizer

Files

Files (4.1 GB)

Name Size
md5:2212b9291cdeb0d64f2b8ffc6c4f7468
4.1 GB

Additional details

Software

Programming language
Python, Linux Kernel Module