Kraken2 index of Homo sapiens genome assembly T2T-CHM13v2.0
Description
This dataset contains the Kraken2 index for GCF_009914755.1, the Homo sapiens genome assembly T2T-CHM13v2.0.
With the advent of new human reference genomes, the field has progressed beyond the traditional hg19 and hg38 assemblies. The Telomere-to-Telomere (T2T) CHM13v2.0 assembly represents the first complete human genome, providing a gapless and highly accurate reference that spans all chromosomes end-to-end. This new reference overcomes limitations of previous versions by resolving centromeric regions, telomeres, segmental duplications, and other repetitive regions that were previously unresolved or misassembled. As a result, CHM13v2.0 offers improved read alignment, variant calling, and annotation accuracy, especially in previously inaccessible genomic regions—making it an invaluable resource for high-resolution genomic studies.
K-mer–based classification tools such as Kraken2 are significantly faster than traditional read aligners such as BWA, Bowtie2, Minimap2, or BWA-MEM2, but they often face a major limitation: the difficulty of building a customized index from user-defined FASTA files. These tools rely on additional taxonomy files such as nodes.dmp and names.dmp, which store the hierarchical relationships between taxa and the taxonomic names, respectively.
However, for custom references, such as specific human assemblies (e.g., CHM13v2.0) or genomes of unknown or artificial evolutionary origin, such taxonomic information is often unavailable or not applicable, which prevents the construction of a proper index for k-mer–based classification through the standard workflow.
To overcome this challenge, we developed a tool named FastaKrakenizer (https://github.com/arpit20328/FastaKrakenizer), a simple Bash script that enables the creation of functional Kraken2-compatible k-mer indices from arbitrary FASTA files without requiring external taxonomy files. This broadens the applicability of k-mer–based classification to a wider range of genomic references.
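To illustrate the general idea, the following minimal Python sketch writes a two-node taxonomy in the NCBI taxdump layout and tags FASTA headers with the `kraken:taxid|<id>` convention that Kraken2 recognizes in sequence IDs. It is not the FastaKrakenizer implementation (which is a Bash script). The function names, the placeholder taxon ID `100`, and the name `custom_reference` are assumptions made for this example, and whether a reduced three-field nodes.dmp is sufficient for a given Kraken2 version should be checked before relying on it.

```python
# Sketch only: not FastaKrakenizer's actual implementation.
FIELD_SEP = "\t|\t"   # field separator used by NCBI taxdump files
LINE_END = "\t|\n"    # row terminator used by NCBI taxdump files


def dmp_line(*fields):
    """Join fields into one taxdump-style row."""
    return FIELD_SEP.join(str(f) for f in fields) + LINE_END


def minimal_taxonomy(taxid, name):
    """Return (nodes, names) text for a root node plus one child taxon."""
    # nodes rows: tax_id, parent tax_id, rank (reduced from the full NCBI layout)
    nodes = dmp_line(1, 1, "no rank") + dmp_line(taxid, 1, "species")
    # names rows: tax_id, name, unique name (left empty), name class
    names = (dmp_line(1, "root", "", "scientific name")
             + dmp_line(taxid, name, "", "scientific name"))
    return nodes, names


def tag_fasta(lines, taxid):
    """Yield FASTA lines with '|kraken:taxid|<taxid>' appended to each sequence ID."""
    for line in lines:
        if line.startswith(">"):
            seq_id, _, description = line[1:].rstrip("\n").partition(" ")
            header = f">{seq_id}|kraken:taxid|{taxid}"
            if description:
                header += " " + description
            yield header + "\n"
        else:
            yield line


CUSTOM_TAXID = 100  # hypothetical placeholder ID, not a real assignment
nodes_text, names_text = minimal_taxonomy(CUSTOM_TAXID, "custom_reference")
tagged = list(tag_fasta([">chr1 example sequence\n", "ACGTACGT\n"], CUSTOM_TAXID))
```

In this sketch every sequence in the FASTA file is assigned to the same single taxon, so any classified read simply reports that one ID.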
The resulting index allows rapid identification and quantification of sequencing reads that are classified against a custom FASTA reference, facilitating downstream analysis for researchers in genomics and computational biology.
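As a rough illustration of the quantification step, the sketch below counts classified and unclassified reads in Kraken2's standard per-read output, which is tab-separated with the classification status (`C` or `U`) in the first column, the read ID in the second, and the assigned taxonomy ID in the third. The function name and the sample records are assumptions made for this example, and the parser assumes the default output format (without `--use-names`).

```python
from collections import Counter


def summarize_kraken_output(lines):
    """Count classified/unclassified reads and reads per taxonomy ID."""
    classified = 0
    unclassified = 0
    per_taxid = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, taxid = fields[0], fields[2]
        if status == "C":
            classified += 1
            per_taxid[taxid] += 1
        elif status == "U":
            unclassified += 1
    total = classified + unclassified
    return {
        "classified": classified,
        "unclassified": unclassified,
        "fraction_classified": classified / total if total else 0.0,
        "per_taxid": dict(per_taxid),
    }


# Hypothetical per-read records, made up for illustration only.
sample = [
    "C\tread1\t100\t150\t100:116\n",
    "U\tread2\t0\t150\t0:116\n",
    "C\tread3\t100\t150\t100:116\n",
    "C\tread4\t100\t150\t100:116\n",
]
summary = summarize_kraken_output(sample)
```

In practice the same function could be fed an open file handle for the output file written by Kraken2 instead of the in-memory list.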
As an example, the Kraken2 index for the T2T-CHM13v2.0 human genome assembly (GCF_009914755.1) is provided here as a compressed .tar.gz archive.
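Before using the archive, it is worth confirming that the download is intact. The sketch below compares a file's MD5 digest with the checksum listed for this record and, if they match, unpacks the archive. The function names are assumptions made for this example, and the archive's file name and internal directory layout are not stated in this record.

```python
import hashlib
import tarfile

# Checksum listed in the Files section of this record.
EXPECTED_MD5 = "2212b9291cdeb0d64f2b8ffc6c4f7468"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive_path, dest_dir, expected_md5=EXPECTED_MD5):
    """Extract a .tar.gz archive only if its MD5 digest matches."""
    actual = md5_of_file(archive_path)
    if actual != expected_md5:
        raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual}")
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
```

A call such as `verify_and_extract("downloaded_index.tar.gz", "kraken2_db")` would then unpack the index into `kraken2_db`; both names here are hypothetical placeholders.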
For questions, suggestions, or issues, please visit our GitHub repository and open an issue in the Issues section: https://github.com/arpit20328/FastaKrakenizer
Files
| Name / checksum | Size |
|---|---|
| md5:2212b9291cdeb0d64f2b8ffc6c4f7468 | 4.1 GB |
Additional details
Software
- Programming language: Python, Linux Kernel Module