Published August 26, 2021 | Version v1
Software Open

PacBio sequencing output increased through uniform and directional 5-fold concatenation

  • 1. University of California, Los Angeles
  • 2. University of Minnesota

Description

Advances in sequencing technology have allowed researchers to sequence DNA with greater ease and at decreasing cost. Most developments have focused on sequencing either many short sequences or fewer long sequences. Methods for sequencing mid-sized sequences of 600-5,000 bp are currently less efficient. For example, the PacBio Sequel I system yields ~100,000-300,000 reads with an accuracy per base pair of 90-99%. We sought to sequence several DNA populations of ~870 bp in length with a sequencing accuracy of 99% and to the greatest depth possible. We optimised a simple, robust method to concatenate genes of ~870 bp five times and then sequenced the resulting DNA of ~5,000 bp by PacBio SMRT long-read sequencing. Our method improved upon previously published concatenation attempts, leading to greater sequencing depth, high-quality reads and limited sample preparation at little expense. We applied this efficient concatenation protocol to sequence nine DNA populations from a protein engineering study. The improved method is accompanied by a simple and user-friendly analysis pipeline, DeCatCounter, to sequence medium-length sequences efficiently at one-fifth of the cost.

Notes

Sequencing data: Dataset with 12,505 amplicons, corresponding to a subset of the raw dataset analyzed in the publication (124,715 amplicons): dataset.fasta

Barcodes: Barcode sequences used for demultiplexing: barcodes.txt

Constant regions: Constant region sequences used for deconcatenating: constant.txt

DeCatCounter: Python script to demultiplex the amplicons, deconcatenate them into individual sequences and count the reads for each sequence: DeCatCounter.py
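The three steps named above (demultiplexing, deconcatenation, counting) can be illustrated with a minimal sketch. This is not the DeCatCounter.py implementation: the function names, the toy barcodes, the toy constant region and the exact-match logic are all assumptions made for illustration, and real long-read data would normally need error-tolerant matching.

```python
from collections import Counter

def demultiplex(reads, barcodes):
    """Group reads by the barcode found at the start of each read (exact match, assumed)."""
    groups = {name: [] for name in barcodes}
    for read in reads:
        for name, barcode in barcodes.items():
            if read.startswith(barcode):
                # Strip the barcode and keep the rest of the amplicon.
                groups[name].append(read[len(barcode):])
                break
    return groups

def deconcatenate(read, constant):
    """Split a concatenated read on the constant region and drop empty pieces."""
    return [piece for piece in read.split(constant) if piece]

def count_sequences(reads, constant):
    """Count every deconcatenated sequence across a list of reads."""
    counts = Counter()
    for read in reads:
        counts.update(deconcatenate(read, constant))
    return counts

# Toy data: hypothetical barcodes and constant region, not those in barcodes.txt or constant.txt.
barcodes = {"round_1": "AAAA", "round_2": "CCCC"}
constant = "GGTACC"
reads = [
    "AAAA" + "ATGCAT" + constant + "ATGCAT" + constant + "TTGACA",
    "CCCC" + "TTGACA" + constant + "TTGACA",
    "AAAA" + "TTGACA",
]

groups = demultiplex(reads, barcodes)
counts = {name: count_sequences(group, constant) for name, group in groups.items()}
```

In this toy run, `counts["round_1"]` holds two copies each of `ATGCAT` and `TTGACA`, and `counts["round_2"]` holds two copies of `TTGACA`.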

Count files: Text files with amino acid sequence read counts for the nine rounds of selection. Each file contains round information (number of unique sequences and total number of molecules) in its header, followed by the sequences and the number of times each appeared in that round. Sequences are labelled "sequence_X", where X corresponds to the order in which the sequence appears in the files: count_files.zip
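As a rough illustration of the count file structure described above, the sketch below builds a header with the number of unique sequences and the total number of molecules, followed by labelled sequences and their counts. The exact header wording, the tab-separated layout, the ordering by descending count and the function name `format_count_file` are assumptions; the actual files in count_files.zip may be laid out differently.

```python
from collections import Counter

def format_count_file(counts):
    """Return count-file text: two header lines, then one line per sequence."""
    total = sum(counts.values())
    lines = [
        f"number of unique sequences = {len(counts)}",  # assumed header wording
        f"total number of molecules = {total}",         # assumed header wording
    ]
    # Label sequences sequence_1, sequence_2, ... in order of appearance,
    # here assumed to be most abundant first.
    for index, (sequence, n) in enumerate(counts.most_common(), start=1):
        lines.append(f"sequence_{index}\t{sequence}\t{n}")
    return "\n".join(lines)

# Toy amino acid counts, invented for illustration only.
example_counts = Counter({"MKV": 3, "MLA": 1})
text = format_count_file(example_counts)
```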

Funding provided by: National Institutes of Health
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000002
Award Number: 5R01GM108703-04

Funding provided by: National Institutes of Health
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000002
Award Number: 7DP2GM123457-02

Funding provided by: Simons Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000893
Award Number: 340762

Funding provided by: Simons Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000893
Award Number: 290356

Funding provided by: Minnesota Medical Foundation*
Crossref Funder Registry ID:
Award Number: 4036–9663-10

Funding provided by: National Aeronautics and Space Administration
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000104
Award Number: 17-EXO17_2-0044

Files

Files (18.1 kB)

md5:5aef76c4fd1a3557d4b9d492482fbf87 (18.1 kB)
