Data from: Memory-bound k-mer selection for large and evolutionary diverse reference libraries
Description
Using k-mers to find sequence matches is increasingly used in many bioinformatic applications, including metagenomic sequence classification. The accuracy of these down-stream applications relies on the density of the reference databases, which, luckily, are rapidly growing. While the increased density provides hope for dramatic improvements in accuracy, scalability is a concern. Reference k-mers are kept in the memory during the query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we explore approaches for selecting a fixed-size subset of k-mers present in an ultra-large dataset to include in a library such that the classification of reads suffers the least. Our experiments demonstrate the limitations of existing approaches, especially for novel and poorly sampled groups. We propose a library construction algorithm called KRANK (K-mer RANKer) that combines several components, including a hierarchical selection strategy with adaptive size restrictions and an equitable coverage strategy. We implement KRANK in highly optimized code and combine it with the locality-sensitive-hashing classifier CONSULT-II to build a taxonomic classification and profiling method. On several benchmarks, KRANK k-mer selection dramatically reduces memory consumption with minimal loss in classification accuracy. We show in extensive analyses based on CAMI benchmarks that KRANK outperforms k-mer-based alternatives in terms of taxonomic profiling and comes close to the best marker-based methods in terms of accuracy.
Notes
Files
Files
(216.8 kB)
Name | Size | Download all |
---|---|---|
md5:2e8f9d97dd7c154a00ed731ab15aee8a
|
216.8 kB | Download |
Additional details
Related works
- Is cited by
- 10.1101/2024.02.12.580015 (DOI)
- 10.1007/978-1-0716-3989-4_26 (DOI)
- Is source of
- 10.5061/dryad.0000000c2 (DOI)