Load libraries
library(ggplot2)
library(dplyr)
library(tidyr)
Open hyperlink in Table_S2_tRNApredictions.xlsx > click Genome Info > click Download Assembly
GtRNAdb homepage lists 217 Archaea
Downloaded a list from GtRNAdb with Archaea as the only search term > output is in gtrnadb-search15280.out
Add genome ID and accession numbers manually
In GtRNAdb, click genome info and copy assembly name, genbank accession, refseq accession
cat gtrnadb-search15280.out | grep -v "GtRNAdbID" | grep -v "gtrnadb_id" | awk -F"\t" '{print $1"\t"$2}' | sort | uniq > Archaea_genome_list.txt
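List GenBank accessions for the assemblies used in GtRNAdb (column 4 of Archaea_genome_list.txt once the assembly name and accessions have been added manually):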
cat Archaea_genome_list.txt | cut -f 4 > genbank_accessions
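A malformed line in genbank_accessions would only surface later as a failed download, so a quick format check may be worthwhile. The sketch below is not part of the original workflow: the function name `check_accessions` and the demo file are made up, and the pattern assumes versioned GenBank assembly accessions of the form GCA_ plus nine digits, a dot, and a version number.

```shell
#!/usr/bin/env bash
# Print every line that does NOT look like a versioned GenBank assembly
# accession (GCA_ + nine digits + "." + version), e.g. GCA_000789255.1.
check_accessions() {
  grep -v -E '^GCA_[0-9]{9}\.[0-9]+$' "$1"
}

# Demo on a throwaway file; the real input would be genbank_accessions.
demo_file=$(mktemp)
printf 'GCA_000789255.1\nGCA_000016605.1\nnot_an_accession\n' > "$demo_file"
bad_lines=$(check_accessions "$demo_file")
echo "Malformed lines: ${bad_lines:-none}"
rm -f "$demo_file"
```

An empty result means every line matched. A leftover header line or a RefSeq (GCF_) accession in the wrong column would be reported.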
Assemblies that are still the latest version can be downloaded using NCBI’s datasets tool.
# datasets code in a loop. Untested sketch, not the exact command that was run (the datasets CLI writes a zip archive):
# while read -r acc; do
#   datasets download genome accession "$acc" --filename "$acc".zip
# done < genbank_accessions
The 8 assemblies below are used by GtRNAdb but are no longer the current version and therefore cannot be downloaded with the datasets tool.
Downloaded these manually from NCBI:
| Genbank_accession | Strain | Anomaly |
|---|---|---|
| GCA_000789255.1 | Geoglobus acetivorans SBH | unverified source organism |
| GCA_000016605.1 | Metallosphaera sedula DSM 5348 | unverified source organism |
| GCA_000953115.1 | Methanobacterium formicicum | contaminated |
| GCA_000306725.1 | Methanolobus psychrophilus R15 | contaminated |
| GCA_000968395.1 | Sulfolobus solfataricus 98/2 SULC | Assembly replaced |
| GCA_000968435.1 | Sulfolobus solfataricus SULA | Assembly replaced |
| GCA_000968355.1 | Sulfolobus solfataricus SULB | Assembly replaced |
| GCA_000955905.2 | Thaumarchaeota archaeon SAT1 | Assembly replaced |
The 7 assemblies below are used by GtRNAdb but no longer available from NCBI:
| Genbank_accession | Strain | Anomaly |
|---|---|---|
| GCA_000327505.1 | Aciduliprofundum sp. MAR08-339 | Assembly removed |
| GCA_000018485.1 | Methanococcus maripaludis C6 | Assembly removed |
| GCA_000017225.1 | Methanococcus maripaludis C7 | Assembly removed |
| GCA_000006175.2 | Methanococcus voltae A3 | Assembly removed |
| GCA_000235685.3 | Methanolinea tarda NOBI-1 | Assembly removed |
| GCA_000195895.1 | Methanosarcina barkeri str. Fusaro | Assembly removed |
| GCA_000024745.1 | Sulfolobus solfataricus 98/2 | Assembly removed |
For each of the 20 genomes, I downloaded the tRNAscan-SE predictions from GtRNAdb
Click link in Table_S2_tRNApredictions.xlsx > click Download tRNAscan-SE results
Combine the tRNA predictions from GtRNAdb into one file
for i in *tRNAs; do awk '{print FILENAME"\t"$0}' $i/$i.out | cut -d '/' -f 2 | sed 's/-tRNAs.out//g' | grep -v "Sequence" | grep -v "Name" | grep -v "\--------"; done > thermo_trnas_GtRNAdb.txt
The resulting tab-delimited file (thermo_trnas_GtRNAdb.txt) has no header row, because the grep filters above remove the tRNAscan-SE header lines.
Added column names manually.
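Adding the header by hand is easy to get wrong when the file is regenerated. A possible scripted alternative is sketched below; it is not what was done here. The helper `add_header` is hypothetical, and the column names are assumptions: they follow the names the R code relies on (Anticodon, tRNA_type, Isotype_score, Note, and so on) and what I understand to be the column order of detailed tRNAscan-SE output, so they should be checked against the actual file.

```shell
#!/usr/bin/env bash
# Write a header line followed by the contents of a headerless table.
# $1 = header line, $2 = input file, $3 = output file
add_header() {
  { printf '%s\n' "$1"; cat "$2"; } > "$3"
}

# Assumed column names, joined with tabs (verify against the real file).
header=$(printf '%s\t' Genome Sequence_name tRNA_number Begin End tRNA_type \
  Anticodon Intron_begin Intron_end Inf_score HMM_score "2'str_score" \
  Isotype_CM Isotype_score)Note

# Demo on a made-up single-row table in a temporary directory.
workdir=$(mktemp -d)
printf 'genomeA\tchr1\t1\t100\t172\tAla\tTGC\t0\t0\t70.1\t50.2\t19.9\tAla\t110.5\t\n' \
  > "$workdir/body.txt"
add_header "$header" "$workdir/body.txt" "$workdir/with_header.txt"
head -n 1 "$workdir/with_header.txt"
```

Writing to a new file rather than editing in place keeps the headerless version available for row counts and reruns.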
Import file in R
thermo<-read.csv("~/Documents/tRNA/thermococcus/thermo_trnas_GtRNAdb.txt", sep="")
Add categories
# Order matters: TCA is assigned "seleno" first, so "supp" only needs to test CTA and TTA
thermo$cat <- ifelse(thermo$Anticodon == "TCA", "seleno",
  ifelse(thermo$Anticodon == "CTA" | thermo$Anticodon == "TTA", "supp",
  ifelse(thermo$tRNA_type == "Undet", "undet",
  ifelse(grepl("pseudo", thermo$Note, fixed = TRUE), "pseudo", "standard"))))
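The nested ifelse() applies its tests in a fixed order, so a tRNA that satisfies several conditions gets the first matching label. To make that precedence explicit, here is the same rule set restated as a small awk filter run on made-up rows. This is an illustration only, not part of the analysis; the function name `classify_trna` and the three-column input (anticodon, tRNA type, note) are assumptions.

```shell
#!/usr/bin/env bash
# Read tab-separated rows (anticodon, tRNA type, note) and print one category
# per row, testing the conditions in the same order as the R code.
classify_trna() {
  awk -F'\t' '{
    if ($1 == "TCA")                     category = "seleno"
    else if ($1 == "CTA" || $1 == "TTA") category = "supp"
    else if ($2 == "Undet")              category = "undet"
    else if (index($3, "pseudo") > 0)    category = "pseudo"
    else                                 category = "standard"
    print category
  }'
}

# Five made-up rows, three fields each (printf reuses the format per triple).
categories=$(printf '%s\t%s\t%s\n' \
  TCA SeC "" \
  CTA Sup "" \
  NNN Undet pseudo \
  GGC Ala pseudo \
  GGC Ala "" | classify_trna)
echo "$categories"
```

The third demo row is both Undet and flagged pseudo; it comes out as undet because that test runs first.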
Extract the .fna files and run tRNAscan-SE
for i in *fna; do \
tRNAscan-SE \
-H \
--detail \
-A \
-o ../tRNA_pred_from_genome/$i.trnascan.out \
-f ../tRNA_pred_from_genome/$i.trnascan.struct \
-m ../tRNA_pred_from_genome/$i.trnascan.summ \
-s ../tRNA_pred_from_genome/$i.trnascan.isotype \
$i; done
Combine the tRNA predictions from genomes into one file
Add column names manually
for i in *out; do awk '{print FILENAME"\t"$0}' $i | sed 's/_genomic.fna.trnascan.out//g' | grep -v "Sequence" | grep -v "Name" | grep -v "\--------"; done > ../thermo_tRNAs_genome.txt
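Once the per-genome files are combined, counting rows per genome is a cheap way to spot a genome that contributed no predictions or an implausible number of them. A minimal sketch on made-up data follows; the function name `count_per_genome` is hypothetical, and it assumes the genome identifier is the first tab-separated column and contains no whitespace.

```shell
#!/usr/bin/env bash
# Print "genome<TAB>number_of_rows" for each distinct value in column 1.
count_per_genome() {
  cut -f 1 "$1" | sort | uniq -c | awk '{print $2 "\t" $1}'
}

# Demo on a throwaway file with two genomes.
demo_file=$(mktemp)
printf 'genomeA\tchr1\t1\ngenomeA\tchr1\t2\ngenomeB\tchr1\t1\n' > "$demo_file"
counts=$(count_per_genome "$demo_file")
echo "$counts"
rm -f "$demo_file"
```

On the real data this would be run as `count_per_genome ../thermo_tRNAs_genome.txt`, before the header line is added.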
Import file in R
thermo2<-read.csv("~/Documents/tRNA/thermococcus/thermo_tRNAs_genome.txt", sep="")
Add categories
# Order matters: TCA is assigned "seleno" first, so "supp" only needs to test CTA and TTA
thermo2$cat <- ifelse(thermo2$Anticodon == "TCA", "seleno",
  ifelse(thermo2$Anticodon == "CTA" | thermo2$Anticodon == "TTA", "supp",
  ifelse(thermo2$tRNA_type == "Undet", "undet",
  ifelse(grepl("pseudo", thermo2$Note, fixed = TRUE), "pseudo", "standard"))))
# Categories missing from levels (here "seleno" and "supp") become NA
thermo2$cat <- factor(thermo2$cat, levels=c("standard","pseudo","undet"))
gtrnadb-search15280.out has all the tRNAs, but not all the model scores.
It only reports “score”, which I think is the Inf score. **confirm this**
The other scores are available from GtRNAdb at the strain level, but these cannot be downloaded in batch.
Downloaded these manually
Combine the tRNA predictions from GtRNAdb into one file
for i in *out; do awk '{print FILENAME"\t"$0}' $i | sed 's/-tRNAs.out//g' | grep -v "Sequence" | grep -v "Name" | grep -v "\--------"; done > Archaea_210_GtRNAdb_tRNAs.txt
The resulting tab-delimited file (Archaea_210_GtRNAdb_tRNAs.txt) has no header row, because the grep filters above remove the tRNAscan-SE header lines.
Added column names manually.
Import file in R
arch210GtRNAdb <- read.csv("~/Documents/tRNA/thermococcus/Archaea_210_GtRNAdb_tRNAs.txt", sep="")
Add categories
# Order matters: TCA is assigned "seleno" first, so "supp" only needs to test CTA and TTA
arch210GtRNAdb$cat <- ifelse(arch210GtRNAdb$Anticodon == "TCA", "seleno",
  ifelse(arch210GtRNAdb$Anticodon == "CTA" | arch210GtRNAdb$Anticodon == "TTA", "supp",
  ifelse(arch210GtRNAdb$tRNA_type == "Undet", "undet",
  ifelse(grepl("pseudo", arch210GtRNAdb$Note, fixed = TRUE), "pseudo", "standard"))))
# Categories missing from levels (here "supp") become NA
arch210GtRNAdb$cat <- factor(arch210GtRNAdb$cat, levels=c("standard","seleno","pseudo","undet"))
NOTE that pseudogene checking was disabled for (some?) genomes
tRNAscan-SE run options used by GtRNAdb for Thermoproteus uzoniensis 768-20 (as a random example):
------------------------------------------------------------
Search Mode: Archaeal
Searching with: Infernal First Pass->Infernal
Isotype-specific model scan: Yes
Scan for noncanonical introns
Covariance model: TRNAinf-arch.cm
TRNAinf-arch-SeC.cm
Infernal first pass cutoff score: 10
Pseudogene checking disabled
Reporting HMM/2' structure score breakdown
------------------------------------------------------------
for i in *.fna; do \
tRNAscan-SE \
-H \
--detail \
-A \
-o ../tRNA_pred_from_genomes/$i.trnascan.out \
-f ../tRNA_pred_from_genomes/$i.trnascan.struct \
-m ../tRNA_pred_from_genomes/$i.trnascan.summ \
-s ../tRNA_pred_from_genomes/$i.trnascan.isotype \
$i; done
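Four output files are generated for each genome: the tRNA predictions (.trnascan.out, from -o), the secondary structures (.trnascan.struct, from -f), the run statistics (.trnascan.summ, from -m), and the isotype-specific model results (.trnascan.isotype, from -s).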
The lists of predicted tRNAs for each genome (.out files) needed to be concatenated into one file. Added the genome accession as the first column.
for i in *.out; do awk '{print FILENAME"\t"$0}' $i | tail -n+4 | sed 's/_genomic.fna.trnascan.out//g'; done >> ../../Archaea_210genomes_tRNAs.txt
The resulting tab-delimited file (Archaea_210genomes_tRNAs.txt) has no header row, because tail -n+4 skips the tRNAscan-SE header lines.
Added column names manually.
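The concatenation loop for the 210 genomes relies on each tRNAscan-SE .out file starting with exactly three header lines, which `tail -n +4` drops. The sketch below replays that loop on a made-up file in a temporary directory to show the effect; the file contents and the accession GCA_000000000.1 are invented for the demo.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
combined="$workdir/combined.txt"

# Fake tRNAscan-SE output: three header lines followed by two predictions.
printf 'Sequence\ttRNA\tBounds\nName\ttRNA #\tBegin\n--------\t------\t-----\nchr1\t1\t100\nchr1\t2\t500\n' \
  > "$workdir/GCA_000000000.1_genomic.fna.trnascan.out"

# Same pipeline as in the notes, run in a subshell so the caller's
# working directory is left untouched.
(
  cd "$workdir" || exit 1
  for i in *.out; do
    awk '{print FILENAME"\t"$0}' "$i" | tail -n +4 | sed 's/_genomic.fna.trnascan.out//g'
  done
) > "$combined"
cat "$combined"
```

If a file had a different number of header lines, header text would leak into the combined table or a prediction would be lost, so filtering on the header patterns (as done for the GtRNAdb files) may be the safer choice.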
Import into R.
arch210<-read.csv("~/Documents/tRNA/thermococcus/Archaea_210genomes_tRNAs.txt", sep="")
Add categories and reorder to make them look better in plot.
# Order matters: TCA is assigned "seleno" first, so "supp" only needs to test CTA and TTA
arch210$cat <- ifelse(arch210$Anticodon == "TCA", "seleno",
  ifelse(arch210$Anticodon == "CTA" | arch210$Anticodon == "TTA", "supp",
  ifelse(arch210$tRNA_type == "Undet", "undet",
  ifelse(grepl("pseudo", arch210$Note, fixed = TRUE), "pseudo", "standard"))))
# Categories missing from levels (here "supp") become NA
arch210$cat <- factor(arch210$cat, levels=c("standard","seleno","pseudo","undet"))
Histogram of isotype scores for Thermococcus genomes, GtRNAdb predictions (thermo). FIG. A
ggplot(thermo, aes(Isotype_score, fill=cat)) +
geom_histogram(binwidth=1) +
theme_classic(base_size = 18) +
coord_cartesian(xlim = c(-20,160)) +
scale_fill_manual(values=c("#ced4da","#d90429")) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.2,0.6)) +
theme(axis.text.x = element_text(colour = "black")) +
theme(axis.text.y = element_text(colour = "black")) +
labs(y="Count", x="Isotype score")
Same plot, but zoomed in to isotype score < 90 FIG. B
ggplot(thermo, aes(Isotype_score, fill=cat)) +
geom_histogram(binwidth=1) +
theme_classic(base_size = 18) +
coord_cartesian(xlim = c(0,90), ylim=c(0,2)) +
scale_fill_manual(values=c("#ced4da","#d90429")) +
scale_y_continuous(breaks=c(0,1,2)) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.2,0.8)) +
theme(axis.text.x = element_text(colour = "black")) +
theme(axis.text.y = element_text(colour = "black")) +
labs(y="Count", x="Isotype score") +
annotate("text", x=65, y=1.5, label="Val-CAC", size=6) +
annotate("text", x=75, y=2, label="Arg-GCG", size=6) +
annotate("segment", x=70, y=1.4, xend=82, yend=1.05, arrow=arrow(length = unit(0.2, "cm"))) +
annotate("segment", x=80, y=1.9, xend=88, yend=1.05, arrow=arrow(length = unit(0.2, "cm")))
Histogram of isotype scores for Thermococcus genomes, tRNAscan-SE run on the downloaded genomes (thermo2). FIG. C
ggplot(thermo2, aes(Isotype_score, fill=cat)) +
geom_histogram(binwidth=1) +
theme_classic(base_size = 18) +
coord_cartesian(xlim = c(-20,160)) +
scale_fill_manual(values=c("#ced4da","#2b2d42","#d90429")) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.2,0.6)) +
theme(axis.text.x = element_text(colour = "black")) +
theme(axis.text.y = element_text(colour = "black")) +
labs(y="Count", x="Isotype score")
Same plot, but zoomed in to isotype score < 90 FIG. D
ggplot(thermo2, aes(Isotype_score, fill=cat)) +
geom_histogram(binwidth=1) +
theme_classic(base_size = 18) +
coord_cartesian(xlim = c(0,90), ylim=c(0,3)) +
scale_fill_manual(values=c("#ced4da","#2b2d42","#d90429")) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.2,0.9)) +
theme(axis.text.x = element_text(colour = "black")) +
theme(axis.text.y = element_text(colour = "black")) +
labs(y="Count", x="Isotype score")
Histogram of isotype scores for the 210 archaeal genomes, GtRNAdb predictions (arch210GtRNAdb). FIG. E
ggplot(arch210GtRNAdb, aes(Isotype_score, fill=cat)) +
geom_histogram(binwidth=1) +
coord_cartesian(xlim = c(-20,160), ylim=c(0,600)) +
scale_fill_manual(values=c("#ced4da","#03a1fc","#d90429")) +
theme_classic(base_size = 18) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.2,0.6)) +
theme(axis.text.x = element_text(colour = "black")) +
theme(axis.text.y = element_text(colour = "black")) +
labs(y="Count", x="Isotype score") +
geom_vline(xintercept=59.5, linetype="dashed") +
geom_vline(xintercept=92, linetype="dashed") +
annotate("text", x=20, y=600, label="Non-canonical tRNA", size=5) +
annotate("text", x=75.5, y=600, label="Uncertain", size=5) +
annotate("text", x=120, y=600, label="Canonical tRNA", size=5)
Same figure, but zoomed in to region isotype score < 90. FIG. F
ggplot(arch210GtRNAdb, aes(Isotype_score, fill=cat)) +
geom_histogram(binwidth=1) +
coord_cartesian(xlim = c(-20,90), ylim=c(0,50)) +
scale_fill_manual(values=c("#ced4da","#03a1fc","#d90429")) +
theme_classic(base_size = 18) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.2,0.6)) +
theme(axis.text.x = element_text(colour = "black")) +
theme(axis.text.y = element_text(colour = "black")) +
labs(y="Count", x="Isotype score") +
geom_vline(xintercept=60, linetype="dashed") +
geom_vline(xintercept=85, linetype="dashed") +
annotate("text", x=25, y=50, label="Non-canonical tRNA", size=5) +
annotate("text", x=73, y=50, label="Uncertain", size=5) +
annotate("text", x=110, y=50, label="Canonical tRNA", size=5)
Histogram of isotype scores for the 210 archaeal genomes, tRNAscan-SE run on the downloaded genomes (arch210). FIG. G
ggplot(arch210, aes(Isotype_score, fill=cat)) +
geom_histogram(binwidth=1) +
coord_cartesian(xlim = c(-20,160), ylim=c(0,600)) +
scale_fill_manual(values=c("#ced4da","#03a1fc","#2b2d42","#d90429")) +
theme_classic(base_size = 18) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.2,0.6)) +
theme(axis.text.x = element_text(colour = "black")) +
theme(axis.text.y = element_text(colour = "black")) +
labs(y="Count", x="Isotype score") +
geom_vline(xintercept=59.5, linetype="dashed") +
geom_vline(xintercept=92, linetype="dashed") +
annotate("text", x=20, y=600, label="Non-canonical tRNA", size=5) +
annotate("text", x=75.5, y=600, label="Uncertain", size=5) +
annotate("text", x=120, y=600, label="Canonical tRNA", size=5)
Same figure, but zoomed in to region isotype score < 90. FIG. H
ggplot(arch210, aes(Isotype_score, fill=cat)) +
geom_histogram(binwidth=1) +
coord_cartesian(xlim = c(-20,90), ylim=c(0,50)) +
scale_fill_manual(values=c("#ced4da","#03a1fc","#2b2d42","#d90429")) +
theme_classic(base_size = 18) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.2,0.6)) +
theme(axis.text.x = element_text(colour = "black")) +
theme(axis.text.y = element_text(colour = "black")) +
labs(y="Count", x="Isotype score") +
geom_vline(xintercept=60, linetype="dashed") +
geom_vline(xintercept=85, linetype="dashed") +
annotate("text", x=25, y=50, label="Non-canonical tRNA", size=5) +
annotate("text", x=73, y=50, label="Uncertain", size=5) +
annotate("text", x=110, y=50, label="Canonical tRNA", size=5)
Histogram of Inf_score
ggplot(arch210, aes(Inf_score, fill=cat)) +
geom_histogram(binwidth=1) +
theme_classic(base_size = 18) +
coord_cartesian(xlim = c(0,160)) +
scale_fill_manual(values=c("#ced4da","#03a1fc","#2b2d42","#d90429")) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.15,0.6)) +
theme(axis.text.x = element_text(colour = "black")) +
theme(axis.text.y = element_text(colour = "black")) +
labs(y="Count", x="Inf score")
Histogram of HMM_score
ggplot(arch210, aes(HMM_score, fill=cat)) +
geom_histogram(binwidth=1) +
theme_classic(base_size = 18) +
coord_cartesian(xlim = c(0,100)) +
scale_fill_manual(values=c("#ced4da","#03a1fc","#2b2d42","#d90429")) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.15,0.6)) +
theme(axis.text.x = element_text(colour = "black")) +
theme(axis.text.y = element_text(colour = "black")) +
labs(y="Count", x="HMM score")
Histogram of 2’str_score
ggplot(arch210, aes(X2.str_score, fill=cat)) +
geom_histogram(binwidth=1) +
theme_classic(base_size = 18) +
coord_cartesian(xlim = c(-20,50)) +
scale_fill_manual(values=c("#ced4da","#03a1fc","#2b2d42","#d90429")) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.15,0.6)) +
theme(axis.text.x = element_text(colour = "black")) +
theme(axis.text.y = element_text(colour = "black")) +
labs(y="Count", x="2'str score")