Published January 23, 2025 | Version v2
Dataset Open

Genomic datasets used for evaluation of k-mer representations and indexes

  • 1. Charles University
  • 2. Inria Rennes - Bretagne Atlantique Research Centre
  • 3. Eidgenössische Technische Hochschule Zürich

Description

This record contains genomic datasets, including subsampled k-mer sets for some datasets (files with names containing _subsampled_). Namely, it provides the following datasets:

  • Two E. coli pan-genomes, obtained as the union of the E. coli genomes from the 661k collection. One contains all genomes (without quality filtering); the other (HQ) contains only the genomes that passed high-quality filtering.
  • S. pneumoniae pan-genome: 616 genomes, as provided in RASE DB S. pneumoniae https://github.com/c2-d2/rase-db-spneumoniae-sparc/
  • SARS-CoV-2 pan-genome, downloaded from GISAID https://gisaid.org/ (access upon registration) on May 20, 2024 (16,729,549 genomes).
  • Metagenomic sample SRS063932 (Illumina raw reads) of human microbiome with accession SRX023459, downloaded from https://www.hmpdacc.org/hmp/HMASM/. The FASTQ files were converted to FASTA files using `seqtk seq -A -C`.
  • Human RNA-seq Illumina raw reads with accession SRX348811, downloaded using the prefetch tool from the SRA toolkit and then converted into the FASTA format by
    `fastq-dump --split-3 --fasta`.
  • Human genome Illumina raw reads with accession SRX016231, downloaded using the prefetch tool from the SRA toolkit and then converted into the FASTA format by
    `fastq-dump --split-3 --fasta`.
  • Human genome assembly chm13v2.0 (T2T), downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz.
  • Two MiniKraken datasets (4GB and 8GB), downloaded from https://ccb.jhu.edu/software/kraken/, with the 31-mers dumped using Jellyfish 1.1.12.

The resulting FASTA files (apart from the human genome assembly chm13v2.0 and the MiniKraken datasets) were converted to unitigs with GGCAT v1.1.0 using
`ggcat build -k {kmer-size} -m 200 -j 5 -s {min-freq} -o {preprocessed_unitigs} {input_FASTA}`, where we used $k=128$ and `{min-freq}`=1 for the pan-genomes and $k=32$ and `{min-freq}`=2 for the datasets from raw reads.
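The two parameter settings above can be written down as a small helper that assembles the GGCAT command line. This is only a sketch: the flags are copied from the command above, while the preset names (`pan-genome`, `raw-reads`), the function name, and the file names in the usage note are hypothetical and not part of this record.

```python
import shlex

# (k, min-freq) pairs as stated in the description above.
# The preset names are hypothetical labels chosen for this sketch.
PRESETS = {
    "pan-genome": (128, 1),
    "raw-reads": (32, 2),
}


def ggcat_build_command(input_fasta, output_unitigs, dataset_kind):
    """Return the argv list for the unitig construction step."""
    k, min_freq = PRESETS[dataset_kind]
    return [
        "ggcat", "build",
        "-k", str(k),
        "-m", "200",
        "-j", "5",
        "-s", str(min_freq),
        "-o", output_unitigs,
        input_fasta,
    ]


def as_shell_line(argv):
    """Render the argv list as a single shell-quoted string."""
    return shlex.join(argv)
```

For example, `as_shell_line(ggcat_build_command("input.fa", "unitigs.fa", "raw-reads"))` yields a command with `-k 32` and `-s 2`; passing the list to `subprocess.run` would execute it, assuming `ggcat` is installed and on the `PATH`.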

Finally, the subsampled files `{dataset}_subsampled_k{$k$}_r0.1.fa.xz` contain a random 10% sample of the distinct canonical $k$-mers from the whole $k$-mer set of the given dataset. Each FASTA file contains one subsampled $k$-mer per sequence.
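To make the terms concrete, the following sketch shows one way to collect distinct canonical $k$-mers, draw a fixed fraction of them, and store them one per FASTA record in an xz-compressed file. It is not the procedure used to produce the files in this record, which is not specified here. It assumes that "canonical" means the lexicographically smaller of a $k$-mer and its reverse complement (a common convention, but tools differ), that windows containing non-ACGT characters are skipped, and that each sequence occupies a single line; all function names and the record naming are hypothetical.

```python
import lzma
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement (assumed convention)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def distinct_canonical_kmers(sequences, k):
    """Set of canonical k-mers over all sequences, skipping windows with non-ACGT symbols."""
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if set(window) <= set("ACGT"):
                kmers.add(canonical(window))
    return kmers


def subsample(kmers, rate, seed=0):
    """Draw round(rate * n) k-mers without replacement, reproducibly for a given seed."""
    ordered = sorted(kmers)  # fixed order so the seed determines the sample
    return random.Random(seed).sample(ordered, round(rate * len(ordered)))


def write_fasta_xz(kmers, path):
    """Write one k-mer per FASTA record; the numeric record names are arbitrary."""
    with lzma.open(path, "wt") as out:
        for index, kmer in enumerate(kmers):
            out.write(f">{index}\n{kmer}\n")


def read_fasta_xz(path):
    """Read sequences back, assuming each record's sequence sits on a single line."""
    with lzma.open(path, "rt") as handle:
        return [line.strip() for line in handle
                if line.strip() and not line.startswith(">")]
```

A call such as `read_fasta_xz("some_subsampled_file.fa.xz")` (hypothetical file name) returns the $k$-mers as a list of strings, which can then be checked with `canonical(x) == x` under the convention assumed here.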

Files (22.0 GB)

Checksum  Size
md5:d52598dabc822c070c3bbfb9879fa9e2  1.0 GB
md5:5e75bf536d1582b2a6384485a4d4d0a0  184.9 MB
md5:0df24fcce483a54ae769bb3f9894f5a7  898.1 MB
md5:7eb3216c668fbc94f421dccfddd19fc4  1.3 GB
md5:869f308dc8b07085837f86ccfd8f2369  528.7 MB
md5:25923d36e426ee468b8b681dc48778e8  106.5 MB
md5:5fdab0d7f0d4ea6dd97f39a3210bbd2d  342.4 MB
md5:e7ca8921517dc9ba636cbd729b836671  533.7 MB
md5:ef6f5ec79bf3e9d2aa0f3862da124333  745.7 MB
md5:d92b5ffdd9cc740943599ce3225d4a9e  1.8 GB
md5:760dc63fb8897e6362824386abb3af3c  2.9 GB
md5:9f9ea406301a577354b66e5523bc585f  211.3 MB
md5:75ac6933600155e6e2b5ccbd7beb6625  102.8 MB
md5:f94eba47b5e0624ba5e703a29abd1f9d  268.8 MB
md5:3816c57188bfcfad9bf7e93f865cee94  351.1 MB
md5:77095d7dce0d654d471d99c5f69e33c3  903.9 MB
md5:4630ca81f1f2fc1eb632e4005f37fbfe  115.5 MB
md5:1f8cbe13ba6fc351f8fe5636effac2d8  438.3 MB
md5:6421840630a6896ed68c0f1a31f54498  622.4 MB
md5:445eef4f7ccb208a5a7ec8ff287cbc0d  2.8 GB
md5:482939ebd947c1a7a3d52155a733457b  5.7 GB
md5:fdf50773755a0010454fdbb234181e2e  1.5 MB
md5:adb5253940d42ef856f37969da40dff7  2.5 MB
md5:6da8d1306929dd8c210150f6633a51d5  3.2 MB
md5:cefeff230e728c17574449bba02c6c0c  23.4 MB
md5:8b5637e44c7e5567c2369d9c667320e8  2.5 MB
md5:f01b036a235765707e1c6e884f117175  5.3 MB
md5:4ce52b2cd21ac5192c0e7ed791bf6fc9  8.4 MB
md5:d2db15aca5d8c3af9609316d4c6d894a  6.4 MB
md5:363f023ada8e45318751516750e747ab  3.1 MB
md5:5682762947217ec65f894bebbb236459  5.4 MB
md5:6b405d2c9b32c59057941d18ec48598b  7.3 MB
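
The MD5 checksums listed above can be used to check a downloaded file for corruption. A minimal sketch using only the Python standard library follows; the function names are hypothetical, and the file is read in chunks so that multi-gigabyte files do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks by default."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:<32 hex digits>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A result of `False` from `matches_listed_checksum(path, "md5:...")` suggests an incomplete or corrupted download, or that the checksum belongs to a different file in the list.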