DNAformer Datasets

Sabary, Omer

doi:10.5281/zenodo.13896773

Published October 7, 2024 | Version v1

Dataset Open

Bar-Lev, D., Orr, I., Sabary, O., Etzion, T., & Yaakobi, E. Scalable and robust DNA-based storage via coding theory and deep learning. 2024.

Datasets description

This document provides an overview of the five datasets introduced in this work. For each dataset, we provide both the raw .fastq files with the sequenced reads and a file with the processed, binned reads obtained by the binning step described in the paper.

The dataset is provided under a license similar to that of the code repository, which also contains scripts for loading and processing the data: https://github.com/itaiorr/Deep-DNA-based-storage.git

The datasets

The data were synthesized by Twist Bioscience, and the datasets are differentiated by the sequencing technology used. There are two Illumina datasets, both generated on an Illumina MiSeq. The reads in these two datasets were sequenced with paired-end sequencing, and the paired reads were merged (stitched) with the PEAR software. We include both the raw reads and the stitched reads in our repository under the following names:

Pilot Illumina dataset
1. BinnedPilotIllumina.txt - includes the pilot dataset in binned format.
2. P-Pilot_S2_L001_R1_001.fastq.gz - includes read 1 (R1) of the pilot paired-end reads, pre-stitching, as obtained from the Illumina MiSeq.
3. P-Pilot_S2_L001_R2_001.fastq.gz - includes read 2 (R2) of the pilot paired-end reads, pre-stitching, as obtained from the Illumina MiSeq.
4. Pilot_Illumina_raw_reads.fastq - includes the pilot reads post-stitching.
Test Illumina dataset
1. BinnedTestIllumina.txt - includes the test dataset in binned format.
2. F1-Full-Pool_S1_L001_R1_001.fastq.gz - includes read 1 (R1) of the test paired-end reads, pre-stitching, as obtained from the Illumina MiSeq.
3. F1-Full-Pool_S1_L001_R2_001.fastq.gz - includes read 2 (R2) of the test paired-end reads, pre-stitching, as obtained from the Illumina MiSeq.
4. test_illumina_raw_reads.fastq - includes the test reads post-stitching.
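For readers who want to inspect the raw or stitched reads directly, the following is a minimal sketch of a FASTQ reader. It is not part of the authors' repository; the function name `read_fastq` is hypothetical, and the sketch assumes the common four-line FASTQ record layout (header, sequence, `+` separator, quality string) with no wrapped sequence lines.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a .fastq or .fastq.gz file.

    Assumes four lines per record; wrapped (multi-line) records are not handled.
    """
    # Pick a gzip-aware opener for compressed files, plain open otherwise.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, not needed
            quality = handle.readline().rstrip("\n")
            # Drop the leading '@' from the header line.
            yield header[1:], sequence, quality
```

Assuming a file such as Pilot_Illumina_raw_reads.fastq has been downloaded to the working directory, iterating over `read_fastq("Pilot_Illumina_raw_reads.fastq")` would yield one tuple per read.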

There are three Nanopore datasets, all generated on an Oxford Nanopore Technologies (ONT) MinION, under the following names:

Pilot Nanopore dataset
1. BinnedPilotNanopore.txt - reads in binned format.
2. raw_reads_pilot_nanopore.zip - original basecalled reads as obtained from ONT MinION.
3. Pilot_RawSignals_1_5.zip , Pilot_RawSignals_6_10.zip , Pilot_RawSignals_11_13.zip - raw nanopore signals as obtained from ONT MinION.
Test Nanopore first flowcell dataset (termed “Nanopore single flowcell” in the paper)
1. BinnedNanoporeFirstFlowcell.txt - reads in binned format.
2. test_pool_nanopore_single.zip - original basecalled reads as obtained from ONT MinION.
3. NanoporeFirstFlowcellRawSignals.zip - raw nanopore signals as obtained from ONT MinION.
Test Nanopore second flowcell dataset
1. BinnedNanoporeSecondFlowcell.txt - reads in binned format.
2. test_nanopore_second_flowcell_part001.zip, test_nanopore_second_flowcell_part002.zip - original basecalled reads as obtained from ONT MinION.
3. NanoproeSecondFlowcellRawSignals_1_5.zip, NanoproeSecondFlowcellRawSignals_6_10.zip, NanoproeSecondFlowcellRawSignals_11_15.zip - raw nanopore signals as obtained from ONT MinION.

Additionally, for completeness, we include a file with the processed and binned reads of the test Nanopore dataset that combines the two flowcells (termed “Nanopore two flowcells” in the paper). This can be found in the file BinnedNanoporeTwoFlowcells.txt.

Detailed description

The binned format was created using the binning step described in the paper. Each cluster of reads appears in the file with a header followed by the reads. More specifically:

The header consists of two lines: the first is the encoded sequence of the cluster, and the second is a line of 18 “*” characters that should be ignored.
The reads in the cluster are provided after the header, with each read on a separate line.
Each cluster ends with two empty lines.
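The layout above can be read with a short loop. The following is a minimal sketch based only on the format description here; it is not the repository's own parser, and the function name `iter_clusters` is hypothetical. It treats any line made up solely of “*” characters as the separator line and any blank line as the end of a cluster.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_clusters(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (encoded_sequence, reads) for each cluster in a binned file."""
    encoded: Optional[str] = None  # encoded sequence of the current cluster
    reads: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current cluster; the second blank line
            # of the pair is skipped because no cluster is open any more.
            if encoded is not None:
                yield encoded, reads
                encoded, reads = None, []
        elif encoded is None:
            encoded = line  # first header line: the encoded sequence
        elif set(line) == {"*"}:
            continue  # second header line: the asterisk separator, ignored
        else:
            reads.append(line)  # one read per line
    if encoded is not None:
        # Handle a file that ends without the trailing blank lines.
        yield encoded, reads
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, which keeps memory use low for the larger binned files.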

Data processing

To ease the processing of our datasets, we also provide the following Python scripts (see https://github.com/itaiorr/Deep-DNA-based-storage):

Preprocessor.py - includes our preprocessing procedure for the raw reads. The procedure detects and truncates the primers.
Parser.py - parses the file of binned reads and creates two Python dictionaries. In the first dictionary, each key is an encoded sequence and the value is the list of reads in its cluster. In the second dictionary, each key is an index and the value is the list of reads in the corresponding cluster.
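To illustrate the shape of the two dictionaries, here is a small sketch that builds them from already-parsed clusters. It is not the repository's Parser.py: the function name `build_dictionaries` is hypothetical, and the index is assumed to be the zero-based position of the cluster in the file, which the description above does not specify.

```python
from typing import Dict, Iterable, List, Tuple


def build_dictionaries(
    clusters: Iterable[Tuple[str, List[str]]],
) -> Tuple[Dict[str, List[str]], Dict[int, List[str]]]:
    """Build (by_sequence, by_index) from (encoded_sequence, reads) pairs."""
    by_sequence: Dict[str, List[str]] = {}
    by_index: Dict[int, List[str]] = {}
    for index, (encoded, reads) in enumerate(clusters):
        # Key 1: the encoded sequence. If a sequence appeared twice, the
        # reads are pooled rather than overwritten (an assumption).
        by_sequence.setdefault(encoded, []).extend(reads)
        # Key 2: position of the cluster in the input (assumed zero-based).
        by_index[index] = list(reads)
    return by_sequence, by_index


# Tiny in-memory example with made-up sequences.
example = [("ACGT", ["ACGA", "ACGT"]), ("TTGG", ["TTGC"])]
by_sequence, by_index = build_dictionaries(example)
```

Keying by encoded sequence suits evaluation against the ground truth, while keying by index is convenient for sampling or batching clusters.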

Files

Files (38.5 GB)

Name	MD5	Size
BinnedNanoporeFirstFlowcell.txt	c95a5adffd1cedb04b55303d51e14706	196.8 MB
BinnedNanoporeSecondFlowcell.txt	6c9a930cbc9d2c18aa2c59e39d377979	146.6 MB
BinnedNanoporeTwoFlowcells.txt	9cf53ddcd5679ede42299c948690d84f	349.9 MB
BinnedPilotIllumina.txt	02b16f63adf5ac85e302c151eaa8ad48	74.2 MB
BinnedPilotNanopore.txt	893e1db9d80b210d3368f96d61f0ef78	106.0 MB
BinnedTestIllumina.txt	d90d4bdb52f340e74a1ac3d0e07bc4f0	467.9 MB
F1-Full-Pool_S1_L001_R1_001.fastq.gz	ec04f17756381a88dabf8dff5e66f044	357.1 MB
F1-Full-Pool_S1_L001_R2_001.fastq.gz	46b65266332b7b60e54c864535247c6c	387.3 MB
NanoporeFirstFlowcellRawSignals.zip	0970a0487d15216d7376b458fdc7f948	7.2 GB
NanoproeSecondFlowcellRawSignals_11_15.zip	31567755b644a3e3770fe6ec46871512	6.2 GB
NanoproeSecondFlowcellRawSignals_1_5.zip	31c3cd76800966c29105e38a16947f67	3.3 GB
NanoproeSecondFlowcellRawSignals_6_10.zip	1b91dfd8649a5a8d3470c9303d0539b5	4.2 GB
P-Pilot_S2_L001_R1_001.fastq.gz	fed474c46fdb06a1f809703d2974753a	56.5 MB
P-Pilot_S2_L001_R2_001.fastq.gz	342cfde132cc9de492a3ab6399ae2226	60.2 MB
Pilot_Illumina_raw_reads.fastq	a46920f4d9c1bd82bb1a8fd53dae158d	190.6 MB
Pilot_RawSignals_11_13.zip	11ff24d068ee21ae26bebed29adf266d	3.4 GB
Pilot_RawSignals_1_5.zip	b90d8c1959d5ae2cccee9ddddfde4c83	3.4 GB
Pilot_RawSignals_6_10.zip	0da64d53d6b6cf194b0cdb71f3b4c813	4.1 GB
raw_reads_pilot_nanopore.zip	6aa769792b14fd667eba5642a1b5ea2c	883.2 MB
test_illumina_raw_reads.fastq	fecf0255fe06ebcf3624b97388b560a7	1.2 GB
test_nanopore_second_flowcell_part001.zip	054d9255cd664cdfeb24ef36449039b1	674.5 MB
test_nanopore_second_flowcell_part002.zip	866759daf3ecf8b51310bd8e39d02c49	462.0 MB
test_pool_nanopore_single.zip	10b97ebaff4de1f222e5c9ced06a39e4	1.1 GB

