Published September 29, 2021 | Version v1
Software Open

Deep sequencing datasets from: Witnessing the structural evolution of an RNA enzyme

  • 1. Yale University
  • 2. Salk Institute for Biological Studies

Description

An RNA polymerase ribozyme that has been the subject of extensive directed evolution efforts has attained the ability to synthesize complex functional RNAs, including a full-length copy of its own evolutionary ancestor. During the course of evolution, the catalytic core of the ribozyme has undergone a major structural rearrangement, resulting in a novel tertiary structural element that lies in close proximity to the active site. Through a combination of site-directed mutagenesis, structural probing, and deep sequencing analysis, the trajectory of evolution was seen to involve the progressive stabilization of the new structure, which provides the basis for improved catalytic activity of the ribozyme. Multiple paths to the new structure were explored by the evolving population, converging upon a common solution. Tertiary structural remodeling of RNA is known to occur in nature, as evidenced by the phylogenetic analysis of extant organisms, but this type of structural innovation had not previously been observed in an experimental setting. Despite prior speculation that the catalytic core of the ribozyme had become trapped in a narrow local fitness optimum, the evolving population has broken through to a new fitness locale, raising the possibility that further improvement of polymerase activity may be achievable.

 

Notes

Directory organization of files

Each selection round has its own directory (e.g., eLife21_R#) containing five files: the aligned reads (eLife21_R#_aligned), two cluster files (eLife21_R#_Clusters_a/b), and two files of raw Illumina reads (eLife21_R#_R1/2.fastq.gz). A source data spreadsheet (eLife '21 suppfile2 source data.xlsx) and the Python script used to enumerate the sequence reads (eLife '21 suppfileA.py) are provided as separate files. Fidelity data are provided as separate files: raw Illumina reads of polymerase product RNA (eLife21_###_###_R1/2.fastq.gz), an aligned bam file (eLife21_c1l_200_align_sort.bam), and the final source data spreadsheet (eLife21_52-2_fidelity_table.xlsx) for polymerase fidelity calculations.
 

Deep Sequencing of 19 rounds of evolution

Classification of P8 stem variants - Microsoft Excel table

The Excel spreadsheet parses the cd-hit cluster output (eLife21_R#_Clusters_a/b) to determine the total number of reads associated with each cluster and identifies clusters with >1% representation in the round (sheets R6-52). Clusters are manually classified and cross-checked against the aligned full sequences (eLife21_R#_aligned) to generate the table for Supplementary file 2 (sheet "prevalent clusters"). This information is used to build the cluster representation-by-round heatmap (sheet "summary") from which Figure 5B was generated.
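The read-counting step described above can be sketched in Python. This is an illustration only, not the spreadsheet logic itself. It assumes the cluster files follow the standard cd-hit .clstr layout (a ">Cluster N" line followed by member lines containing ">name...") and that each sequence name is the integer read count assigned by the enumeration script. The function names cluster_read_totals and prevalent_clusters are hypothetical.

```python
import re

# Assumption: each member line contains ">NAME..." and NAME is an integer read count.
MEMBER = re.compile(r">(\d+)\.\.\.")


def cluster_read_totals(lines):
    """Sum the read counts of all members of each cluster in .clstr-style text."""
    totals = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]          # e.g. "Cluster 0"
            totals[current] = 0
        elif current is not None:
            match = MEMBER.search(line)
            if match:
                totals[current] += int(match.group(1))
    return totals


def prevalent_clusters(totals, threshold=0.01):
    """Return {cluster: fraction of all reads} for clusters above the threshold."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {
        name: count / grand_total
        for name, count in totals.items()
        if count / grand_total > threshold
    }
```

Calling prevalent_clusters(cluster_read_totals(open(path))) on one cluster file would list the clusters exceeding 1% of reads in that round, provided the naming assumption holds.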

"eLife '21 suppfile2 source data.xlsx"
 

P8 region clusters - text files

Output of cd-hit clustering of aligned regions in nucleotides 9–17 and 83–95 (as referenced to the 52-2 polymerase sequence). Used to generate high-representation cluster tables in "eLife '21 suppfile2 source data.xlsx".

"eLife21_R#_Clusters_a.txt";  "eLife21_R#_Clusters_b.txt"
 

Aligned polymerase sequences by round - fastq file

Gapped alignment of all sequences, by round, generated by MUSCLE. These files were trimmed in AliView to the region encompassing the P7 and P8 stems (nucleotides 9–17 and 83–95) and clustered using cd-hit-est.
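The trimming above was done interactively in AliView. For readers who want to reproduce it programmatically, the following is a minimal sketch of one way to map reference nucleotide numbers onto columns of a gapped alignment. It assumes the gapped 52-2 reference sequence is available as a row of the alignment and that gaps are written as "-". The function names column_span and trim_alignment are hypothetical.

```python
def column_span(aligned_ref, start, end):
    """Return (first_col, stop_col) covering reference nucleotides start..end.

    Positions are 1-based and inclusive, counted on the ungapped reference.
    The returned pair is a half-open Python slice over alignment columns.
    """
    pos = 0
    first = last = None
    for col, ch in enumerate(aligned_ref):
        if ch == "-":
            continue                    # gap columns do not advance the position
        pos += 1
        if pos == start:
            first = col
        if pos == end:
            last = col
            break
    if first is None or last is None:
        raise ValueError("region lies outside the reference sequence")
    return first, last + 1


def trim_alignment(aligned_ref, aligned_seqs, regions):
    """Keep only the columns of each region and join them, for every sequence."""
    spans = [column_span(aligned_ref, s, e) for s, e in regions]
    return ["".join(seq[a:b] for a, b in spans) for seq in aligned_seqs]
```

With regions=[(9, 17), (83, 95)] this would extract the two segments named in the description. Columns holding insertions relative to the reference inside a region are kept.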

"eLife21_R#_aligned.fastq"
 

Enumeration of individual sequences - script

Python script that enumerates the individual sequences in a fastq file and generates a new fastq file of unique sequences, each named by the number of reads of that sequence in the input file. It was used on merged, filtered, and trimmed reads, and its output was aligned by MUSCLE to generate "eLife21_R#_aligned.fastq".
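The enumeration idea can be illustrated with a short sketch. This is not the deposited script ("eLife '21 suppfileA.py"), whose details may differ. It assumes plain four-line fastq records and writes a placeholder quality string, since per-read qualities are not carried over once identical reads are collapsed. The function names are hypothetical.

```python
from collections import Counter
from itertools import islice


def read_fastq_sequences(lines):
    """Yield the sequence line of each four-line fastq record."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return
        yield record[1].strip()


def enumerate_unique(lines):
    """Return fastq text with one record per unique sequence, named by read count."""
    counts = Counter(read_fastq_sequences(lines))
    out = []
    for seq, n in counts.most_common():     # most abundant sequences first
        # Assumption: a constant placeholder quality ("I") for every base.
        out.append(f"@{n}\n{seq}\n+\n{'I' * len(seq)}\n")
    return "".join(out)
```

Because records are named only by their read count, two different sequences with the same count receive the same name in this sketch.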

"eLife '21 suppfileA.py"
 

Raw Fastq files from individual rounds of directed in vitro evolution

Fastq files pertaining to sequences present in selected rounds of polymerase evolution. Select rounds (24, 38, 43, and 49) were sequenced as multiple sample sets: R24 as 3 sample sets and R38/43/49 as 2 sample sets each. Paired-end reads for each round were merged, quality filtered (phred >33), and trimmed (>150 nucleotides) using PEAR (v 0.9.11), followed by enumeration of individual sequences using the "eLife '21 suppfileA.py" script.

"eLife21_R#_raw_R1.fastq.gz"; "eLife21_R#_raw_R2.fastq.gz"
 

Determination of polymerase fidelity by deep sequencing.

Raw fastq files from ribozyme product libraries 

Fastq files were generated by Illumina MiniSeq runs from dsDNA libraries of ribozyme products synthesized by the 52-2 polymerase, as described in the methods section. 75-cycle paired-end runs were used for the hammerhead products, and a 150-cycle paired-end run was used for the ligase products. Files are labeled by ribozyme (hhead or c1l) and Mg++ concentration (200 or 50). Sequences of both sets were trimmed using cutadapt, merged using FLASH, and filtered for high-quality reads using FASTX Toolkit, with parameters described in the methods section. These trimmed, merged, and filtered reads were then aligned to the full-length sequence (hammerhead and, separately, class I ligase) using bowtie2. The sam files generated by bowtie2 were subsequently converted into bam files and sorted using SAMtools.

"eLife21_c1l_200_R1/2.fastq.gz"

"eLife21_hhead_200_R1/2.fastq.gz"

"eLife21_hhead_50_R1/2.fastq.gz"
 

Compressed binary sequence alignment/map file of class I ligase products (*.bam)

Edit distances to the reference sequence are listed by bowtie2 in field 17 (NM:i:<N>), and were extracted using SAMtools view and the following unix command: | awk '{print $17}' | sort | uniq -c | awk '{print $2"\t"$1}'. A gapped alignment file was generated from the sorted bam file using breseq (v0.35.5) bamtoaln. Finally, the resulting gapped alignment table was processed using a custom java script (Tjhung et al., 2020) to tabulate the number of substitution, deletion, and insertion mutations for each aligned read at each position.
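As an illustration of the tally performed by the unix command above, the following Python sketch counts reads per edit distance from SAM-format text (for example, the output of SAMtools view). It looks up the NM:i tag by name rather than by field number, since the tag's position can shift when other optional fields are present. This is a sketch, not part of the deposited workflow, and the function name is hypothetical.

```python
from collections import Counter


def edit_distance_histogram(sam_lines):
    """Return {edit distance: number of reads} from SAM-format text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue                    # skip header lines
        fields = line.rstrip("\n").split("\t")
        # The first 11 fields are mandatory; optional tags start at field 12.
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                counts[int(tag[5:])] += 1
                break
    return dict(sorted(counts.items()))
```

Reads that carry no NM tag (such as unaligned reads) are left out of the histogram.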

"eLife21_c1l_200_align_sort.bam"
 

Polymerase fidelity tables and distribution of errors between products – Microsoft Excel table

Tabulated mutations by position for hammerhead and ligase products were added as separate tabs to an Excel file and were then used to calculate the frequency of polymerase errors by base identity, as well as the average fidelity, taken as the geometric mean of the fidelity for the four bases. The distribution of edit distances determined by the bowtie2 alignment of ligase product reads was used to determine the distribution of errors between ligase products, and is shown below the ligase fidelity table. Observed errors are compared to expected errors for an average fidelity of 84.1%, using the binomial distribution. Results are presented in two charts to the right of the data.
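The two calculations named above can be sketched as follows. This is a minimal illustration, not the spreadsheet formulas. It assumes the 84.1% figure is a per-nucleotide fidelity and that errors at different positions are independent, so that the number of errors in a product of a given length follows a binomial distribution. The product length is left as a parameter because it is not stated here, and the function names are hypothetical.

```python
from math import comb, prod


def geometric_mean(values):
    """Geometric mean, e.g. of the per-base fidelities for the four bases."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


def expected_error_distribution(fidelity, length):
    """Binomial probability of k errors, k = 0..length, in a product of that length.

    fidelity is the per-nucleotide probability of a correct incorporation.
    """
    p_error = 1.0 - fidelity
    return [
        comb(length, k) * p_error**k * fidelity ** (length - k)
        for k in range(length + 1)
    ]
```

Multiplying each probability by the number of aligned reads gives expected read counts that can be compared with the observed edit-distance distribution.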

"eLife21_52-2_fidelity_table.xlsx"

Funding provided by: National Science Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000001
Award Number: DGE1752134

Funding provided by: National Aeronautics and Space Administration
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000104
Award Number: NSSC19K0481

Funding provided by: Simons Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000893
Award Number: 287624

Funding provided by: National Institutes of Health
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000002
Award Number: P01GM022778

Files

Files (1.8 kB)

md5:852d12f2f9c1338e99b5bb746bb1c31c (1.8 kB)
