SequenceLab Datasets

Rumpf, Maximilian-David; Lindegger, Joel

doi:10.5281/zenodo.10028978

Published October 21, 2023 | Version v1

Dataset | Open

1. ETH Zurich

These are the datasets included in the SequenceLab evaluation framework. They were generated based on read sets of the human genome with accession numbers SRR10035390 (Illumina), SRR12519035 (PacBio HiFi), SRR12564436 (Oxford Nanopore Technologies).

The files are in TSS (Tab Separated Sequences) format. Each line contains a pair of nucleotide sequences separated by a tab character. This simplified format makes it possible to evaluate genomic tools on real datasets with little parsing overhead.

TSS Specification

Each line consists of a pair of nucleotide sequences, separated by a tab character.
Each line is terminated by a single newline character (UNIX style). Windows-style line breaks (carriage return + newline) are not permitted.
Basepair sequences may consist of uppercase and lowercase nucleic or amino acid codes, as allowed in the FASTA format.
If the dataset is for a read mapping use case, the first sequence is the read (query) and the second is the reference (target).
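
The rules above can be checked mechanically. The following is a minimal sketch of a strict TSS reader, not part of SequenceLab itself; the function name `read_tss` and the exact set of accepted characters (ASCII letters plus `-` and `*`, as commonly seen in FASTA files) are assumptions made for illustration.

```python
import string
from typing import BinaryIO, Iterator, Tuple

# Assumption: FASTA-style codes are ASCII letters plus '-' (gap) and '*' (stop).
ALLOWED = frozenset((string.ascii_letters + "-*").encode("ascii"))


def read_tss(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (first, second) sequence pairs from a binary TSS stream.

    Raises ValueError on any line that violates the rules listed above.
    """
    for lineno, raw in enumerate(stream, start=1):
        # Every line, including the last one, must end with '\n'.
        if not raw.endswith(b"\n"):
            raise ValueError(f"line {lineno}: missing terminating newline")
        body = raw[:-1]
        # Carriage returns indicate Windows-style line breaks.
        if b"\r" in body:
            raise ValueError(f"line {lineno}: carriage return not permitted")
        fields = body.split(b"\t")
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 fields, got {len(fields)}")
        for field in fields:
            if not ALLOWED.issuperset(field):
                raise ValueError(f"line {lineno}: unexpected character in sequence")
        yield fields[0].decode("ascii"), fields[1].decode("ascii")
```

The file is opened in binary mode (for example `open(path, "rb")`) so that Python's newline translation cannot hide a carriage return. In a read mapping dataset, the first element of each yielded pair is the read and the second is the reference.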

Methodology

The datasets were generated in three steps:

Each read set was mapped to the T2T CHM13 reference genome using minimap2, once with alignment disabled and once with alignment enabled, resulting in the *_chained and *_mapped datasets, respectively.
For each candidate pair reported in the resulting .paf files, the corresponding sequences were extracted from the reads and the reference and written to a .tss file.
For each .tss file, the shortest 90% and the longest 10% of candidate locations were split into separate .tss files with the suffixes _bottom and _top, respectively.
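
The last two steps can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the script used to build the datasets: the function names are hypothetical, reads and reference are assumed to be already loaded into dictionaries, the whole read is paired with the target interval given in PAF columns 8 and 9, reverse-strand reads are reverse-complemented, and "length" is taken to be the length of the reference-side sequence.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (read, reference location)

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a nucleotide sequence (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def paf_to_pairs(paf_lines: Iterable[str],
                 reads: Dict[str, str],
                 reference: Dict[str, str]) -> List[Pair]:
    """Turn PAF records into (read, reference location) pairs.

    PAF columns used: 1 = query name, 5 = strand, 6 = target name,
    8/9 = target start/end (0-based, end-exclusive).
    """
    pairs = []
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        qname, strand, tname = cols[0], cols[4], cols[5]
        tstart, tend = int(cols[7]), int(cols[8])
        read = reads[qname]
        # Assumption: reverse-strand hits are stored reverse-complemented.
        if strand == "-":
            read = reverse_complement(read)
        pairs.append((read, reference[tname][tstart:tend]))
    return pairs


def split_bottom_top(pairs: Iterable[Pair],
                     bottom_percent: int = 90) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into the shortest `bottom_percent`% and the remainder.

    Assumption: pairs are ranked by the length of the reference location.
    """
    ordered = sorted(pairs, key=lambda pair: len(pair[1]))
    cut = len(ordered) * bottom_percent // 100  # integer math avoids float rounding
    return ordered[:cut], ordered[cut:]


def write_tss(pairs: Iterable[Pair], path: str) -> None:
    """Write pairs as TSS: one tab-separated pair per '\\n'-terminated line."""
    with open(path, "w", newline="\n") as handle:
        for first, second in pairs:
            handle.write(f"{first}\t{second}\n")
```

Calling `split_bottom_top` on the pairs of one .tss file and passing the two results to `write_tss` would produce files analogous to the _bottom and _top variants.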

Files

Files (9.3 GB)

Name	MD5	Size
hifi.tar.gz	af04ea844155bd3e4cb1fc85a17c352c	445.6 MB
illumina.tar.gz	aa33128aaaaa2791d77d40a1c6434b7c	6.8 GB
ont.tar.gz	d5868244543db51f0884370cd9535ad3	2.1 GB
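
The MD5 checksums listed above can be used to confirm that a download is intact. The snippet below is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify` are hypothetical, and the archives are assumed to have been saved under their original file names.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "hifi.tar.gz": "af04ea844155bd3e4cb1fc85a17c352c",
    "illumina.tar.gz": "aa33128aaaaa2791d77d40a1c6434b7c",
    "ont.tar.gz": "d5868244543db51f0884370cd9535ad3",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str) -> bool:
    """Return True if the file's MD5 matches the listed checksum for its name."""
    return md5_of_file(path) == EXPECTED_MD5[Path(path).name]
```

For example, `verify("hifi.tar.gz")` returns `True` only if the local file matches the published checksum; an unknown file name raises `KeyError`.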

