DNAformer Datasets
Description
Bar-Lev, D., Orr, I., Sabary, O., Etzion, T., & Yaakobi, E. Scalable and robust DNA-based storage via coding theory and deep learning. 2024.
Datasets description
This document provides an overview of the 5 datasets introduced in this work. For each dataset we provide the raw .fastq files with the sequenced reads, as well as a file with the processed, binned reads obtained by the binning step described in the paper.
The datasets are provided under a license similar to that of the code repository, which also contains scripts for loading and processing the data: https://github.com/itaiorr/Deep-DNA-based-storage.git
The datasets
The data were synthesized by Twist Bioscience, and the datasets are differentiated by the sequencing technology used. There are two Illumina datasets, both generated on an Illumina MiSeq. The reads in these two datasets were obtained with paired-end sequencing, and the merging (stitching) was done with the PEAR software. We include both the raw reads and the stitched reads in our repository under the following names:
- Pilot Illumina dataset
  - BinnedPilotIllumina.txt - includes the pilot dataset in binned format.
  - P-Pilot_S2_L001_R1_001.fastq.gz - includes the pilot reads (R1) before stitching, as obtained from the Illumina MiSeq.
  - P-Pilot_S2_L001_R2_001.fastq.gz - includes the pilot reads (R2) before stitching, as obtained from the Illumina MiSeq.
  - Pilot_Illumina_raw_reads.fastq - includes the reads after stitching.
- Test Illumina dataset
  - BinnedTestIllumina.txt - includes the test dataset in binned format.
  - F1-Full-Pool_S1_L001_R1_001.fastq.gz - includes the Illumina reads (R1) before stitching, as obtained from the Illumina MiSeq.
  - F1-Full-Pool_S1_L001_R2_001.fastq.gz - includes the Illumina reads (R2) before stitching, as obtained from the Illumina MiSeq.
  - test_illumina_raw_reads.fastq - includes the reads after stitching.
There are three Nanopore datasets, all generated on an Oxford Nanopore Technologies (ONT) MinION. They are provided under the following names:
- Pilot Nanopore dataset
  - BinnedPilotNanopore.txt - reads in binned format.
  - raw_reads_pilot_nanopore.zip - original basecalled reads as obtained from the ONT MinION.
  - Pilot_RawSignals_1_5.zip, Pilot_RawSignals_6_10.zip, Pilot_RawSignals_11_13.zip - raw nanopore signals as obtained from the ONT MinION.
- Test Nanopore first flowcell dataset (termed in the paper “Nanopore single flowcell”)
  - BinnedNanoporeFirstFlowcell.txt - reads in binned format.
  - test_pool_nanopore_single.zip - original basecalled reads as obtained from the ONT MinION.
  - NanoporeFirstFlowcellRawSignals.zip - raw nanopore signals as obtained from the ONT MinION.
- Test Nanopore second flowcell dataset
  - BinnedNanoporeSecondFlowcell.txt - reads in binned format.
  - test_nanopore_second_flowcell_part001.zip, test_nanopore_second_flowcell_part002.zip - original basecalled reads as obtained from the ONT MinION.
  - NanoproeSecondFlowcellRawSignals_1_5.zip, NanoproeSecondFlowcellRawSignals_6_10.zip, NanoproeSecondFlowcellRawSignals_11_15.zip - raw nanopore signals as obtained from the ONT MinION.
Additionally, for completeness, we include a file with the processed and binned reads of the test Nanopore dataset that combines the two flowcells (termed in the paper “Nanopore two flowcells”). This can be found in the file BinnedNanoporeTwoFlowcells.txt.
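The raw .fastq and .fastq.gz files listed above can be inspected with a few lines of Python. The following is a minimal sketch, not part of the repository's scripts: the function name `read_fastq` is hypothetical, and it assumes the common four-line FASTQ layout (header, sequence, `+` separator, quality) with no wrapped sequences.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a plain or gzipped FASTQ file.

    Assumption: every record spans exactly four lines.
    """
    # Choose the opener from the file extension; both are used in text mode.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().strip()
            if not header:
                break  # end of file (or a blank line): stop reading
            sequence = handle.readline().strip()
            handle.readline()  # '+' separator line, not needed
            quality = handle.readline().strip()
            # Drop the leading '@' from the header line.
            yield header[1:], sequence, quality
```

For example, `sum(1 for _ in read_fastq("Pilot_Illumina_raw_reads.fastq"))` would count the stitched pilot reads once the file has been downloaded.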
Detailed description
The binned format was created using the binning step described in the paper. Each cluster of reads appears in the file with a header followed by the reads. More specifically:
- The header consists of 2 lines: the first is the encoded sequence of the cluster, and the second is a line of 18 “*” characters that should be ignored.
- The reads in the cluster are provided after the header, with each read on a separate line.
- Each cluster ends with two empty lines.
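The layout described above can be read with a small streaming loop. This is a minimal sketch based only on the description in this section, not the repository's Parser.py; the name `iter_clusters` is hypothetical, and the sketch assumes that no read consists solely of “*” characters.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_clusters(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (encoded_sequence, reads) for each cluster of a binned file."""
    encoded: Optional[str] = None
    reads: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # An empty line closes the current cluster; the second empty
            # line of the pair finds no open cluster and is skipped.
            if encoded is not None:
                yield encoded, reads
                encoded, reads = None, []
            continue
        if encoded is None:
            encoded = line  # first header line: the encoded sequence
        elif set(line) == {"*"}:
            continue  # second header line: row of asterisks, ignored
        else:
            reads.append(line)  # every other line is one read
    if encoded is not None:
        # Tolerate a file that ends without the trailing empty lines.
        yield encoded, reads


if __name__ == "__main__":
    demo = ["ACGT", "*" * 18, "ACGA", "ACGT", "", ""]
    for sequence, cluster_reads in iter_clusters(demo):
        print(sequence, len(cluster_reads))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large binned file does not have to be loaded into memory at once.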
Data processing
To ease the processing of our datasets, we also provide the following Python scripts (see https://github.com/itaiorr/Deep-DNA-based-storage):
- Preprocessor.py - includes our preprocessing procedure for the raw reads. The procedure detects and truncates the primers.
- Parser.py - parses the file of binned reads and creates two Python dictionaries. In the first dictionary, each key is an encoded sequence and the value is the list of reads in its cluster. In the second dictionary, each key is a cluster index and the value is the list of reads in that cluster.
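To make the two dictionaries concrete, here is a minimal sketch of how they could be built from already-parsed clusters. It is an illustration, not the code of Parser.py: the function name `build_cluster_dicts` is hypothetical, and the index is assumed to be the zero-based position of the cluster in the file, which may differ from the indexing the script actually uses.

```python
from typing import Dict, Iterable, List, Tuple


def build_cluster_dicts(
    clusters: Iterable[Tuple[str, List[str]]],
) -> Tuple[Dict[str, List[str]], Dict[int, List[str]]]:
    """Build the sequence-keyed and index-keyed views of the clusters.

    Assumption: the index is the zero-based order of appearance.
    """
    by_sequence: Dict[str, List[str]] = {}
    by_index: Dict[int, List[str]] = {}
    for index, (encoded, reads) in enumerate(clusters):
        # If an encoded sequence appears more than once, merge its reads.
        by_sequence.setdefault(encoded, []).extend(reads)
        by_index[index] = list(reads)
    return by_sequence, by_index


if __name__ == "__main__":
    demo = [("ACGT", ["ACGA", "ACGT"]), ("TTTT", ["TTTA"])]
    by_sequence, by_index = build_cluster_dicts(demo)
    print(by_sequence["ACGT"], by_index[1])
```

The sequence-keyed view is convenient for comparing reads against the encoded sequence, while the index-keyed view gives a stable ordering for batching clusters.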
Files
(38.5 GB)
| Checksum | Size |
|---|---|
| md5:c95a5adffd1cedb04b55303d51e14706 | 196.8 MB |
| md5:6c9a930cbc9d2c18aa2c59e39d377979 | 146.6 MB |
| md5:9cf53ddcd5679ede42299c948690d84f | 349.9 MB |
| md5:02b16f63adf5ac85e302c151eaa8ad48 | 74.2 MB |
| md5:893e1db9d80b210d3368f96d61f0ef78 | 106.0 MB |
| md5:d90d4bdb52f340e74a1ac3d0e07bc4f0 | 467.9 MB |
| md5:ec04f17756381a88dabf8dff5e66f044 | 357.1 MB |
| md5:46b65266332b7b60e54c864535247c6c | 387.3 MB |
| md5:0970a0487d15216d7376b458fdc7f948 | 7.2 GB |
| md5:31567755b644a3e3770fe6ec46871512 | 6.2 GB |
| md5:31c3cd76800966c29105e38a16947f67 | 3.3 GB |
| md5:1b91dfd8649a5a8d3470c9303d0539b5 | 4.2 GB |
| md5:fed474c46fdb06a1f809703d2974753a | 56.5 MB |
| md5:342cfde132cc9de492a3ab6399ae2226 | 60.2 MB |
| md5:a46920f4d9c1bd82bb1a8fd53dae158d | 190.6 MB |
| md5:11ff24d068ee21ae26bebed29adf266d | 3.4 GB |
| md5:b90d8c1959d5ae2cccee9ddddfde4c83 | 3.4 GB |
| md5:0da64d53d6b6cf194b0cdb71f3b4c813 | 4.1 GB |
| md5:6aa769792b14fd667eba5642a1b5ea2c | 883.2 MB |
| md5:fecf0255fe06ebcf3624b97388b560a7 | 1.2 GB |
| md5:054d9255cd664cdfeb24ef36449039b1 | 674.5 MB |
| md5:866759daf3ecf8b51310bd8e39d02c49 | 462.0 MB |
| md5:10b97ebaff4de1f222e5c9ced06a39e4 | 1.1 GB |
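Since several of the files are multiple gigabytes, it can be worth checking a download against the md5 values listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical and not part of the repository.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so multi-gigabyte files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

A call such as `matches_checksum(local_path, "md5:...")`, with the checksum copied from the row of the downloaded file, returns `True` when the file arrived intact.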