Genomic datasets used for evaluation of k-mer representations and indexes
Description
This record contains genomic datasets, including subsampled k-mer sets for some datasets (files with names containing _subsampled_). Namely, it provides the following datasets:
- Two E. coli pan-genomes, obtained as the union of the E. coli genomes from the 661k collection. One contains all genomes (without quality filtering); the other (HQ) contains only the genomes that passed high-quality filtering.
- S. pneumoniae pan-genome: 616 genomes, as provided in RASE DB S. pneumoniae https://github.com/c2-d2/rase-db-spneumoniae-sparc/
- SARS-CoV-2 pan-genome, downloaded from GISAID https://gisaid.org/ (access upon registration) on May 20, 2024 (16,729,549 genomes).
- Metagenomic sample SRS063932 (Illumina raw reads) of the human microbiome with accession SRX023459, downloaded from https://www.hmpdacc.org/hmp/HMASM/. The FASTQ files were converted to FASTA files using `seqtk seq -A -C`.
- Human RNA-seq Illumina raw reads with accession SRX348811, downloaded using the prefetch tool from the SRA toolkit and then converted into the FASTA format by `fastq-dump --split-3 --fasta`.
- Human genome Illumina raw reads with accession SRX016231, downloaded using the prefetch tool from the SRA toolkit and then converted into the FASTA format by `fastq-dump --split-3 --fasta`.
- Human genome assembly chm13v2.0 (T2T), downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz.
- Two MiniKraken datasets (4GB and 8GB), downloaded from https://ccb.jhu.edu/software/kraken/, with the 31-mers dumped using Jellyfish 1.1.12.
The resulting FASTA files (apart from the human genome assembly chm13v2.0 and the MiniKraken datasets) were converted to unitigs with GGCAT v1.1.0 using `ggcat build -k {kmer-size} -m 200 -j 5 -s {min-freq} -o {preprocessed_unitigs} {input_FASTA}`, with $k=128$ and `{min-freq}`$=1$ for the pan-genomes, and $k=32$ and `{min-freq}`$=2$ for the datasets from raw reads.
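The two parameter settings above can be captured in a small helper that assembles the GGCAT command line. This is only a sketch: the flags and values are copied from the command template in this description, while the dataset labels (`"pan-genome"`, `"raw-reads"`), the function name, and the file names in the usage line are hypothetical.

```python
import shlex

# (k, min-freq) pairs as stated in the description; the dictionary keys are
# hypothetical labels introduced for this sketch.
GGCAT_PARAMS = {
    "pan-genome": (128, 1),
    "raw-reads": (32, 2),
}


def ggcat_command(input_fasta, output_unitigs, dataset_kind):
    """Build the argument list for the `ggcat build` call described above."""
    k, min_freq = GGCAT_PARAMS[dataset_kind]
    return [
        "ggcat", "build",
        "-k", str(k),
        "-m", "200",
        "-j", "5",
        "-s", str(min_freq),
        "-o", output_unitigs,
        input_fasta,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(ggcat_command("reads.fa", "reads.unitigs.fa", "raw-reads")))
```

The returned list can be passed directly to `subprocess.run(..., check=True)` on a machine where GGCAT is installed.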
Finally, the subsampled files `{dataset}_subsampled_k{$k$}_r0.1.fa.xz` contain a random 10% of the distinct canonical $k$-mers from the whole $k$-mer set of the given dataset. Each FASTA file contains one subsampled $k$-mer per sequence.
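To make the notion of subsampling distinct canonical $k$-mers concrete, here is a minimal sketch of one way such a subsample could be produced. It is not the procedure used to create the files in this record. It assumes that the canonical form of a $k$-mer is the lexicographically smaller of the $k$-mer and its reverse complement (a common convention that the description does not spell out), that exactly `round(rate * n)` $k$-mers are drawn without replacement, and that windows containing characters other than A, C, G, T are skipped. The function names, the seed, and the numeric FASTA headers are hypothetical.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_DNA = frozenset("ACGT")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement (assumed convention)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(sequences, k):
    """Collect the set of distinct canonical k-mers from an iterable of DNA strings."""
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if _DNA.issuperset(window):  # skip windows containing N etc.
                kmers.add(canonical(window))
    return kmers


def subsample(kmers, rate=0.1, seed=0):
    """Draw round(rate * n) distinct items uniformly at random, without replacement."""
    ordered = sorted(kmers)  # fixed order so that the seed makes the draw reproducible
    return random.Random(seed).sample(ordered, round(rate * len(ordered)))


def to_fasta(kmers):
    """Format k-mers as FASTA text with one k-mer per record."""
    return "".join(f">{i}\n{kmer}\n" for i, kmer in enumerate(kmers, start=1))


if __name__ == "__main__":
    unitigs = ["ACGTACGTTGCA", "TTGCAACGGA"]  # toy input, not from the datasets
    all_kmers = canonical_kmers(unitigs, k=5)
    print(to_fasta(subsample(all_kmers, rate=0.5)), end="")
```

Sorting before sampling costs extra time but keeps the result independent of set iteration order, so the same seed always yields the same subsample.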
Files (22.0 GB)

| MD5 checksum | Size |
|---|---|
| md5:d52598dabc822c070c3bbfb9879fa9e2 | 1.0 GB |
| md5:5e75bf536d1582b2a6384485a4d4d0a0 | 184.9 MB |
| md5:0df24fcce483a54ae769bb3f9894f5a7 | 898.1 MB |
| md5:7eb3216c668fbc94f421dccfddd19fc4 | 1.3 GB |
| md5:869f308dc8b07085837f86ccfd8f2369 | 528.7 MB |
| md5:25923d36e426ee468b8b681dc48778e8 | 106.5 MB |
| md5:5fdab0d7f0d4ea6dd97f39a3210bbd2d | 342.4 MB |
| md5:e7ca8921517dc9ba636cbd729b836671 | 533.7 MB |
| md5:ef6f5ec79bf3e9d2aa0f3862da124333 | 745.7 MB |
| md5:d92b5ffdd9cc740943599ce3225d4a9e | 1.8 GB |
| md5:760dc63fb8897e6362824386abb3af3c | 2.9 GB |
| md5:9f9ea406301a577354b66e5523bc585f | 211.3 MB |
| md5:75ac6933600155e6e2b5ccbd7beb6625 | 102.8 MB |
| md5:f94eba47b5e0624ba5e703a29abd1f9d | 268.8 MB |
| md5:3816c57188bfcfad9bf7e93f865cee94 | 351.1 MB |
| md5:77095d7dce0d654d471d99c5f69e33c3 | 903.9 MB |
| md5:4630ca81f1f2fc1eb632e4005f37fbfe | 115.5 MB |
| md5:1f8cbe13ba6fc351f8fe5636effac2d8 | 438.3 MB |
| md5:6421840630a6896ed68c0f1a31f54498 | 622.4 MB |
| md5:445eef4f7ccb208a5a7ec8ff287cbc0d | 2.8 GB |
| md5:482939ebd947c1a7a3d52155a733457b | 5.7 GB |
| md5:fdf50773755a0010454fdbb234181e2e | 1.5 MB |
| md5:adb5253940d42ef856f37969da40dff7 | 2.5 MB |
| md5:6da8d1306929dd8c210150f6633a51d5 | 3.2 MB |
| md5:cefeff230e728c17574449bba02c6c0c | 23.4 MB |
| md5:8b5637e44c7e5567c2369d9c667320e8 | 2.5 MB |
| md5:f01b036a235765707e1c6e884f117175 | 5.3 MB |
| md5:4ce52b2cd21ac5192c0e7ed791bf6fc9 | 8.4 MB |
| md5:d2db15aca5d8c3af9609316d4c6d894a | 6.4 MB |
| md5:363f023ada8e45318751516750e747ab | 3.1 MB |
| md5:5682762947217ec65f894bebbb236459 | 5.4 MB |
| md5:6b405d2c9b32c59057941d18ec48598b | 7.3 MB |
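Each file in the listing is accompanied by an MD5 checksum, which can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the expected value is assumed to be given either as a bare hex digest or with the `md5:` prefix used in the listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected):
    """Return True if the file's MD5 matches `expected` (with or without 'md5:' prefix)."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `verify_download(path, "md5:d52598dabc822c070c3bbfb9879fa9e2")` checks a local copy of the first listed file, where `path` is wherever that file was saved.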