Published October 7, 2024 | Version v1
Dataset Open

DNAformer Datasets

Description

Bar-Lev, D., Orr, I., Sabary, O., Etzion, T., & Yaakobi, E. Scalable and robust DNA-based storage via coding theory and deep learning. 2024.

 

Datasets description

 

This document provides an overview of the five datasets introduced in this work. For each dataset, we provide the raw .fastq files with the sequenced reads and a file with the processed, binned reads produced by the binning step described in the paper.

 

The dataset is provided under a license similar to that of the code repository, which also contains scripts for loading and processing the data: https://github.com/itaiorr/Deep-DNA-based-storage.git



The datasets

The data was synthesized by Twist Bioscience, and the datasets are differentiated by the sequencing technology used. There are two Illumina datasets, both generated on an Illumina MiSeq. The reads in these two datasets were sequenced with paired-end sequencing, and the paired reads were merged (stitched) with the PEAR software. We include both the raw reads and the stitched reads in our repository under the following names:

  1. Pilot Illumina dataset 

    1. BinnedPilotIllumina.txt - includes the pilot dataset in binned format.
    2. P-Pilot_S2_L001_R1_001.fastq.gz - includes the pilot reads pre-stitching (read 1 of each pair) as obtained from Illumina MiSeq.
    3. P-Pilot_S2_L001_R2_001.fastq.gz - includes the pilot reads pre-stitching (read 2 of each pair) as obtained from Illumina MiSeq.
    4. Pilot_Illumina_raw_reads.fastq - includes the reads post-stitching.
       

     

  2. Test Illumina dataset

    1. BinnedTestIllumina.txt - includes the test dataset in binned format.
    2. F1-Full-Pool_S1_L001_R1_001.fastq.gz - includes the Illumina reads pre-stitching (read 1 of each pair) as obtained from Illumina MiSeq.
    3. F1-Full-Pool_S1_L001_R2_001.fastq.gz - includes the Illumina reads pre-stitching (read 2 of each pair) as obtained from Illumina MiSeq.
    4. test_illumina_raw_reads.fastq - includes the reads post-stitching.
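
The raw and stitched read files listed above are in FASTQ format, some of them gzip-compressed. As a minimal sketch (not part of the repository scripts), the reads could be iterated as follows. The function name read_fastq is hypothetical, and the code assumes the common four-line-per-record FASTQ layout with unwrapped sequence lines.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) for each four-line FASTQ record.

    Assumption: every record spans exactly four lines (header, sequence,
    '+' separator, quality), i.e. sequences are not wrapped.
    """
    # Open gzip-compressed files transparently, plain text otherwise.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate stray blank lines between records
            sequence = handle.readline().strip()
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().strip()
            # Drop the leading '@' of the header line.
            yield header.strip()[1:], sequence, quality
```

For example, `for read_id, seq, qual in read_fastq("P-Pilot_S2_L001_R1_001.fastq.gz"): ...` would stream the pilot reads without loading the whole file into memory.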

 

There are three Nanopore datasets, all generated on an Oxford Nanopore Technologies (ONT) MinION, under the following names:

  1. Pilot Nanopore dataset

    1. BinnedPilotNanopore.txt - reads in binned format.
    2. raw_reads_pilot_nanopore.zip - original basecalled reads as obtained from ONT MinION.
    3. Pilot_RawSignals_1_5.zip, Pilot_RawSignals_6_10.zip, Pilot_RawSignals_11_13.zip - raw nanopore signals as obtained from ONT MinION.
  2. Test Nanopore first flowcell dataset (referred to in the paper as “Nanopore single flowcell”)

    1. BinnedNanoporeFirstFlowcell.txt - reads in binned format.
    2. test_pool_nanopore_single.zip - original basecalled reads as obtained from ONT MinION.
    3. NanoporeFirstFlowcellRawSignals.zip - raw nanopore signals as obtained from ONT MinION.
  3. Test Nanopore second flowcell dataset

    1. BinnedNanoporeSecondFlowcell.txt - reads in binned format.
    2. test_nanopore_second_flowcell_part001.zip, test_nanopore_second_flowcell_part002.zip - original basecalled reads as obtained from ONT MinION.
    3. NanoproeSecondFlowcellRawSignals_1_5.zip, NanoproeSecondFlowcellRawSignals_6_10.zip, NanoproeSecondFlowcellRawSignals_11_15.zip - raw nanopore signals as obtained from ONT MinION.

 

Additionally, for completeness, we include a file with the processed and binned reads of the test Nanopore dataset obtained by combining the two flowcells (referred to in the paper as “Nanopore two flowcells”). This can be found in the file BinnedNanoporeTwoFlowcells.txt.



Detailed description

The binned format was created using the binning step described in the paper. Each cluster of reads appears in the file with a header followed by the reads. More specifically:

  1. The header consists of 2 lines: the first is the encoded sequence of the cluster, and the second is a line of 18 “*” characters that should be ignored.

  2. The reads in the cluster are provided after the header, with each read on a separate line.

  3. Each cluster ends with two empty lines.
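
The format above can be read with a few lines of Python. The following is a minimal sketch based only on the three rules listed here, not the Parser.py script from the repository. The name iter_clusters is hypothetical, and the code assumes that no read consists solely of “*” characters.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_clusters(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (encoded_sequence, reads) for each cluster of a binned file."""
    encoded: Optional[str] = None  # encoded sequence of the current cluster
    reads: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Empty lines terminate the current cluster, if one is open.
            if encoded is not None:
                yield encoded, reads
                encoded, reads = None, []
        elif encoded is None:
            encoded = line  # first header line: the encoded sequence
        elif set(line) == {"*"}:
            continue  # second header line: separator of "*", ignored
        else:
            reads.append(line)  # one read per line
    if encoded is not None:
        # The last cluster may not be followed by empty lines.
        yield encoded, reads
```

A typical use would be `with open("BinnedPilotIllumina.txt") as f: for encoded, reads in iter_clusters(f): ...`, which processes one cluster at a time rather than holding the whole file in memory.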





Data processing

To ease the processing of our datasets, we also provide the following Python scripts (see https://github.com/itaiorr/Deep-DNA-based-storage):

  1. Preprocessor.py - includes our preprocessing procedure for the raw reads. The procedure detects and truncates the primers.

  2. Parser.py - parses the file of binned reads and creates two Python dictionaries. In the first dictionary, each key is an encoded sequence and the value is the list of reads in its cluster. In the second dictionary, each key is a cluster index and the value is the list of reads in that cluster.
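
To illustrate the two dictionaries described for Parser.py, here is a small sketch of the data structure. It is not the repository code: the function build_dictionaries is hypothetical, the toy clusters are made up, and the use of a 0-based position in the file as the cluster index is an assumption.

```python
from typing import Dict, Iterable, List, Tuple


def build_dictionaries(
    clusters: Iterable[Tuple[str, List[str]]],
) -> Tuple[Dict[str, List[str]], Dict[int, List[str]]]:
    """Build the sequence-keyed and index-keyed views of the clusters."""
    by_sequence: Dict[str, List[str]] = {}
    by_index: Dict[int, List[str]] = {}
    # Assumption: the index is the 0-based position of the cluster in the file.
    for index, (encoded, reads) in enumerate(clusters):
        by_sequence[encoded] = reads
        by_index[index] = reads
    return by_sequence, by_index


# Toy clusters (made-up sequences), given as (encoded_sequence, reads) pairs.
toy_clusters = [("ACGT", ["ACGA", "ACGT"]), ("TTTT", ["TTTA"])]
by_sequence, by_index = build_dictionaries(toy_clusters)
```

Both dictionaries reference the same read lists, so the second view adds little memory overhead on top of the first.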

Files (38.5 GB)

md5:c95a5adffd1cedb04b55303d51e14706 196.8 MB
md5:6c9a930cbc9d2c18aa2c59e39d377979 146.6 MB
md5:9cf53ddcd5679ede42299c948690d84f 349.9 MB
md5:02b16f63adf5ac85e302c151eaa8ad48 74.2 MB
md5:893e1db9d80b210d3368f96d61f0ef78 106.0 MB
md5:d90d4bdb52f340e74a1ac3d0e07bc4f0 467.9 MB
md5:ec04f17756381a88dabf8dff5e66f044 357.1 MB
md5:46b65266332b7b60e54c864535247c6c 387.3 MB
md5:0970a0487d15216d7376b458fdc7f948 7.2 GB
md5:31567755b644a3e3770fe6ec46871512 6.2 GB
md5:31c3cd76800966c29105e38a16947f67 3.3 GB
md5:1b91dfd8649a5a8d3470c9303d0539b5 4.2 GB
md5:fed474c46fdb06a1f809703d2974753a 56.5 MB
md5:342cfde132cc9de492a3ab6399ae2226 60.2 MB
md5:a46920f4d9c1bd82bb1a8fd53dae158d 190.6 MB
md5:11ff24d068ee21ae26bebed29adf266d 3.4 GB
md5:b90d8c1959d5ae2cccee9ddddfde4c83 3.4 GB
md5:0da64d53d6b6cf194b0cdb71f3b4c813 4.1 GB
md5:6aa769792b14fd667eba5642a1b5ea2c 883.2 MB
md5:fecf0255fe06ebcf3624b97388b560a7 1.2 GB
md5:054d9255cd664cdfeb24ef36449039b1 674.5 MB
md5:866759daf3ecf8b51310bd8e39d02c49 462.0 MB
md5:10b97ebaff4de1f222e5c9ced06a39e4 1.1 GB
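
The MD5 checksums listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using only the Python standard library. The function names md5_of_file and matches_checksum are hypothetical, and each checksum has to be paired with its file name as shown on the record page.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum(local_path, "md5:c95a5adffd1cedb04b55303d51e14706")` returns True only if the file at local_path has exactly that digest.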