Published October 21, 2023 | Version v1
Dataset
Open
SequenceLab Datasets
Description
These are the datasets included in the SequenceLab evaluation framework. They were generated from human genome read sets with accession numbers SRR10035390 (Illumina), SRR12519035 (PacBio HiFi), and SRR12564436 (Oxford Nanopore Technologies).
The files are in TSS (Tab Separated Sequences) format. Each line contains a pair of nucleotide sequences separated by a tab character. This simplified format makes it possible to evaluate genomic tools on real datasets with little overhead.
TSS Specification
- Each line consists of a pair of nucleotide sequences, separated by a tab character.
- Each line is terminated by a single newline character, i.e. in UNIX style. Windows-style line breaks (carriage return + newline) are not permitted.
- Sequences may consist of uppercase and lowercase nucleic acid or amino acid codes, as allowed in the FASTA format.
- If the dataset is for a read-mapping use case, the first sequence is the read (query) and the second is the reference (target).
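The rules above can be checked with a small reader. The following is a minimal sketch, not part of SequenceLab: the function name `read_tss` is hypothetical, and the accepted alphabet (ASCII letters plus the FASTA gap `-` and stop `*` symbols) and the rejection of empty sequences are assumptions about what "as allowed in the FASTA format" covers.

```python
import io
import string
from typing import BinaryIO, Iterator, Tuple

# Assumption: ASCII letters plus the FASTA gap ("-") and stop ("*") symbols.
ALLOWED = frozenset((string.ascii_letters + "-*").encode("ascii"))


def read_tss(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (first, second) sequence pairs from a binary TSS stream."""
    for line_no, raw in enumerate(stream, start=1):
        # Windows-style line breaks are not permitted by the specification.
        if b"\r" in raw:
            raise ValueError(f"line {line_no}: carriage return is not permitted")
        # Every line, including the last one, must end with a newline.
        if not raw.endswith(b"\n"):
            raise ValueError(f"line {line_no}: missing terminating newline")
        fields = raw[:-1].split(b"\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, found {len(fields)}")
        for field in fields:
            # Assumption: empty sequences are treated as invalid.
            if not field or not ALLOWED.issuperset(field):
                raise ValueError(f"line {line_no}: empty or invalid sequence")
        yield fields[0].decode("ascii"), fields[1].decode("ascii")


# Usage with an in-memory example; for a real file, open it with mode "rb".
example = io.BytesIO(b"ACGT\tACGGT\nttga\tTTGA\n")
pairs = list(read_tss(example))
```

Reading in binary mode keeps carriage returns visible, so a file with Windows-style line breaks is rejected instead of being silently normalized.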
Methodology
The datasets were generated in three steps:
- Each read set was mapped to the T2T CHM13 reference genome using minimap2 twice, once with alignment disabled and once with alignment enabled, resulting in the *_chained and *_mapped datasets, respectively.
- For each candidate pair reported in the resulting .paf files, the query and target sequences were extracted from the reads and the reference, respectively, and written to a .tss file.
- For each .tss file, the shortest 90% and longest 10% of candidate locations were split into separate .tss files named _bottom and _top, respectively.
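The third step can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the helper names `split_by_length` and `write_tss` are hypothetical, the length of a candidate location is assumed to be the length of the second (target) sequence, and the handling of ties and rounding at the 90%/10% boundary is a guess.

```python
import io
from typing import Iterable, List, Sequence, TextIO, Tuple

Pair = Tuple[str, str]


def split_by_length(
    pairs: Sequence[Pair], top_fraction: float = 0.10
) -> Tuple[List[Pair], List[Pair]]:
    """Return (bottom, top): the shortest pairs and the longest top_fraction."""
    # Assumption: a pair's length is the length of its second (target) sequence.
    ordered = sorted(pairs, key=lambda pair: len(pair[1]))
    n_top = round(len(ordered) * top_fraction)
    cut = len(ordered) - n_top
    return ordered[:cut], ordered[cut:]


def write_tss(pairs: Iterable[Pair], stream: TextIO) -> None:
    """Write pairs as tab-separated lines, each ending in a UNIX newline."""
    for first, second in pairs:
        stream.write(f"{first}\t{second}\n")


# Usage: ten synthetic pairs whose target lengths run from 1 to 10.
pairs = [("A" * n, "C" * n) for n in range(1, 11)]
bottom, top = split_by_length(pairs)
out = io.StringIO()
write_tss(top, out)
```

When writing to a real file, opening it with `newline="\n"` prevents Python from emitting Windows-style line breaks on platforms where that is the default.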
Files
(9.3 GB)
MD5 checksum | Size |
---|---|
md5:af04ea844155bd3e4cb1fc85a17c352c | 445.6 MB |
md5:aa33128aaaaa2791d77d40a1c6434b7c | 6.8 GB |
md5:d5868244543db51f0884370cd9535ad3 | 2.1 GB |
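The listing gives an MD5 checksum for each file, which can be compared against a downloaded copy. A minimal sketch using Python's standard `hashlib` module is shown here; the function names and the example path in the closing comment are hypothetical, since the file names are not part of the listing.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, listed: str) -> bool:
    """Compare a file against a listed checksum such as 'md5:af04...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Example call (hypothetical local path):
# matches_checksum("downloaded_file", "md5:af04ea844155bd3e4cb1fc85a17c352c")
```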