Load libraries

library(ggplot2)
library(dplyr)
library(tidyr)

1. Input genomes

1a. 20 Thermococcus genomes

Open hyperlink in Table_S2_tRNApredictions.xlsx > click Genome Info > click Download Assembly

1b. All Archaeal genomes used in GtRNAdb

GtRNAdb homepage lists 217 Archaea
Downloaded a list from GtRNAdb with Archaea as the only search term > output is in gtrnadb-search15280.out
Add genome ID and accession numbers manually
In GtRNAdb, click genome info and copy assembly name, genbank accession, refseq accession

cat gtrnadb-search15280.out | grep -v "GtRNAdbID" | grep -v "gtrnadb_id" | awk -F"\t" '{print $1"\t"$2}' | sort | uniq > Archaea_genome_list.txt

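List GenBank accessions for the assemblies used in GtRNAdb (column 4 of Archaea_genome_list.txt once the assembly name, GenBank accession and RefSeq accession have been added by hand):

cut -f 4 Archaea_genome_list.txt > genbank_accessions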
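
Before the accession list is passed to a download loop, it may be worth checking that every line looks like a versioned GenBank assembly accession. A minimal sketch, assuming the GenBank accession really is in column 4 as the cut command above implies; the function name check_accessions is hypothetical:

```shell
# Print every line of accession file $1 that does not look like a
# versioned GenBank assembly accession (GCA_ + 9 digits + .version).
# Prints nothing when all lines are well formed.
check_accessions() {
    grep -v -E '^GCA_[0-9]{9}\.[0-9]+$' "$1"
    return 0   # grep exits non-zero when nothing is printed; not an error here
}

# Usage:
# check_accessions genbank_accessions
```

Any line printed by this check (an empty field, a RefSeq GCF_ accession, a header) would point to a problem in the manually edited columns.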

The ones that are the latest version can be downloaded using NCBI’s datasets tool.

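# The exact datasets command was not recorded. Something like the loop below
# should work with a recent datasets release (it writes zip archives, not tar.gz):
while read -r acc; do
  datasets download genome accession "$acc" --filename "$acc.zip"
done < genbank_accessions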

The 8 assemblies below are used by GtRNAdb but are not the current version and therefore cannot be retrieved with the datasets tool.
Downloaded these manually from NCBI:

Genbank_accession  Strain                             Anomaly
GCA_000789255.1    Geoglobus acetivorans SBH          unverified source organism
GCA_000016605.1    Metallosphaera sedula DSM 5348     unverified source organism
GCA_000953115.1    Methanobacterium formicicum        contaminated
GCA_000306725.1    Methanolobus psychrophilus R15     contaminated
GCA_000968395.1    Sulfolobus solfataricus 98/2 SULC  Assembly replaced
GCA_000968435.1    Sulfolobus solfataricus SULA       Assembly replaced
GCA_000968355.1    Sulfolobus solfataricus SULB       Assembly replaced
GCA_000955905.2    Thaumarchaeota archaeon SAT1       Assembly replaced

The 7 assemblies below are used by GtRNAdb but no longer available from NCBI:

Genbank_accession  Strain                              Anomaly
GCA_000327505.1    Aciduliprofundum sp. MAR08-339      Assembly removed
GCA_000018485.1    Methanococcus maripaludis C6        Assembly removed
GCA_000017225.1    Methanococcus maripaludis C7        Assembly removed
GCA_000006175.2    Methanococcus voltae A3             Assembly removed
GCA_000235685.3    Methanolinea tarda NOBI-1           Assembly removed
GCA_000195895.1    Methanosarcina barkeri str. Fusaro  Assembly removed
GCA_000024745.1    Sulfolobus solfataricus 98/2        Assembly removed
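
The removed assemblies can be excluded from the accession list before any batch download. A minimal sketch, assuming the seven accessions above are saved one per line in a file (called removed_accessions here, a hypothetical name, as is the function name). This would presumably leave the 210 genomes analysed below (217 listed minus 7 removed), although that count is inferred here rather than stated in these notes.

```shell
# Print the accessions in list file $1 that do not appear in exclude file $2.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: read patterns from file.
exclude_accessions() {
    grep -v -x -F -f "$2" "$1"
    return 0   # grep exits non-zero when nothing is printed; not an error here
}

# Usage (both output and exclude file names are hypothetical):
# exclude_accessions genbank_accessions removed_accessions > available_accessions
```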

2. tRNA predictions

2a. GtRNAdb predictions for 20 Thermococcus genomes

For each of the 20 genomes, I downloaded the tRNAscan-SE predictions from GtRNAdb
Click link in Table_S2_tRNApredictions.xlsx > click Download tRNAscan-SE results

Combine the tRNA predictions from GtRNAdb into one file

for i in *tRNAs; do awk '{print FILENAME"\t"$0}' $i/$i.out | cut -d '/' -f 2 | sed 's/-tRNAs.out//g' | grep -v "Sequence" | grep -v "Name" | grep -v "\--------"; done > thermo_trnas_GtRNAdb.txt

The columns of the resulting tab-delimited file (thermo_trnas_GtRNAdb.txt) are:

  1. Assembly accession
  2. Sequence name
  3. tRNA #
  4. tRNA bounds begin
  5. tRNA bounds end
  6. tRNA type
  7. Anticodon
  8. Intron bounds begin
  9. Intron bounds end
  10. Inf score
  11. HMM score
  12. 2’str score
  13. Isotype CM
  14. Isotype score
  15. Note
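
The column names are added by hand in this workflow; the header line could also be written on the command line. A minimal sketch: the names tRNA_type, Anticodon, Inf_score, HMM_score, 2'str_score, Isotype_score and Note match the ones the R code in this document relies on (R reads 2'str_score as X2.str_score), while the remaining names and the function name add_header are placeholders.

```shell
# Write a 15-column tab-delimited header line, then the contents of file $1.
add_header() {
    # printf repeats the format for each argument, so every name below
    # is followed by a tab; the last column is printed separately.
    printf '%s\t' Assembly Sequence_name tRNA_num Begin End tRNA_type \
        Anticodon Intron_begin Intron_end Inf_score HMM_score "2'str_score" \
        Isotype_CM Isotype_score
    printf '%s\n' Note
    cat "$1"
}

# Usage (output file name is hypothetical):
# add_header thermo_trnas_GtRNAdb.txt > thermo_trnas_GtRNAdb.header.txt
```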

Added column names manually.
Import file in R

thermo<-read.csv("~/Documents/tRNA/thermococcus/thermo_trnas_GtRNAdb.txt", sep="")

Add categories

# TCA is assigned to "seleno" first, so it is not repeated in the "supp" test
thermo$cat <- ifelse(thermo$Anticodon=="TCA", "seleno",
              ifelse(thermo$Anticodon=="CTA" | thermo$Anticodon=="TTA", "supp",
              ifelse(thermo$tRNA_type=="Undet", "undet",
              ifelse(grepl("pseudo", thermo$Note, fixed=TRUE), "pseudo", "standard"))))

2b. Local tRNAscan-SE predictions for 20 Thermococcus genomes

Extract the .fna files and run tRNAscan-SE

for i in *fna; do \
tRNAscan-SE \
-H \
--detail \
-A \
-o ../tRNA_pred_from_genome/$i.trnascan.out \
-f ../tRNA_pred_from_genome/$i.trnascan.struct \
-m ../tRNA_pred_from_genome/$i.trnascan.summ \
-s ../tRNA_pred_from_genome/$i.trnascan.isotype \
$i; done

Combine the tRNA predictions from genomes into one file
Add column names manually

for i in *out; do awk '{print FILENAME"\t"$0}' $i | sed 's/_genomic.fna.trnascan.out//g' | grep -v "Sequence" | grep -v "Name" | grep -v "\--------"; done > ../thermo_tRNAs_genome.txt

Import file in R

thermo2<-read.csv("~/Documents/tRNA/thermococcus/thermo_tRNAs_genome.txt", sep="")

Add categories

# TCA is assigned to "seleno" first, so it is not repeated in the "supp" test
thermo2$cat <- ifelse(thermo2$Anticodon=="TCA", "seleno",
               ifelse(thermo2$Anticodon=="CTA" | thermo2$Anticodon=="TTA", "supp",
               ifelse(thermo2$tRNA_type=="Undet", "undet",
               ifelse(grepl("pseudo", thermo2$Note, fixed=TRUE), "pseudo", "standard"))))
# Categories not listed in levels (here "seleno" and "supp") become NA
thermo2$cat <- factor(thermo2$cat, levels=c("standard","pseudo","undet"))

2c. GtRNAdb predictions for all 210 Archaeal genomes

gtrnadb-search15280.out has all the tRNAs, but not all the model scores.
It only reports “score”, which I think is the Inf score. **confirm this**
The other scores are available from GtRNAdb at the strain level, but these cannot be downloaded in batch.
Downloaded these manually

Combine the tRNA predictions from GtRNAdb into one file

for i in *out; do awk '{print FILENAME"\t"$0}' $i | sed 's/-tRNAs.out//g' | grep -v "Sequence" | grep -v "Name" | grep -v "\--------"; done > Archaea_210_GtRNAdb_tRNAs.txt

The columns of the resulting tab-delimited file (Archaea_210_GtRNAdb_tRNAs.txt) are:

  1. Assembly accession (acronym used in GtRNAdb)
  2. Sequence name
  3. tRNA #
  4. tRNA bounds begin
  5. tRNA bounds end
  6. tRNA type
  7. Anticodon
  8. Intron bounds begin
  9. Intron bounds end
  10. Inf score
  11. HMM score
  12. 2’str score
  13. Isotype CM
  14. Isotype score
  15. Note

Added column names manually.
Import file in R

arch210GtRNAdb <- read.csv("~/Documents/tRNA/thermococcus/Archaea_210_GtRNAdb_tRNAs.txt", sep="")

Add categories

# TCA is assigned to "seleno" first, so it is not repeated in the "supp" test
arch210GtRNAdb$cat <- ifelse(arch210GtRNAdb$Anticodon=="TCA", "seleno",
                      ifelse(arch210GtRNAdb$Anticodon=="CTA" | arch210GtRNAdb$Anticodon=="TTA", "supp",
                      ifelse(arch210GtRNAdb$tRNA_type=="Undet", "undet",
                      ifelse(grepl("pseudo", arch210GtRNAdb$Note, fixed=TRUE), "pseudo", "standard"))))
# Categories not listed in levels (here "supp") become NA
arch210GtRNAdb$cat <- factor(arch210GtRNAdb$cat, levels=c("standard","seleno","pseudo","undet"))

NOTE that pseudogene checking was disabled for (some?) genomes.
tRNAscan-SE run options used by GtRNAdb for Thermoproteus uzoniensis 768-20 (as a random example):

------------------------------------------------------------
Search Mode:                       Archaeal
Searching with:                    Infernal First Pass->Infernal
Isotype-specific model scan:       Yes
Scan for noncanonical introns
Covariance model:                  TRNAinf-arch.cm
                                   TRNAinf-arch-SeC.cm
Infernal first pass cutoff score:  10


Pseudogene checking disabled
Reporting HMM/2' structure score breakdown
------------------------------------------------------------

2d. Run tRNAscan-SE on 210 genome assemblies

for i in *.fna; do \
tRNAscan-SE \
-H \
--detail \
-A \
-o ../tRNA_pred_from_genomes/$i.trnascan.out \
-f ../tRNA_pred_from_genomes/$i.trnascan.struct \
-m ../tRNA_pred_from_genomes/$i.trnascan.summ \
-s ../tRNA_pred_from_genomes/$i.trnascan.isotype \
$i; done

Four output files are generated for each genome:

  • .out - List of tRNAs predicted
  • .struct - Structure of each predicted tRNA
  • .summ - Summary information
  • .isotype - Results of the isotype-specific model scan (the file written by the -s option)
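
Since each genome should yield its own .out file, a quick check for genomes without one can catch failed or interrupted runs. A minimal sketch, assuming the output naming used in the loop above (<genome>.fna.trnascan.out); the function name missing_outputs is hypothetical:

```shell
# Print the name of every .fna file in directory $1 that has no
# corresponding <name>.trnascan.out file in directory $2.
missing_outputs() {
    for fna in "$1"/*.fna; do
        [ -e "$fna" ] || continue   # glob matched nothing
        if [ ! -e "$2/$(basename "$fna").trnascan.out" ]; then
            basename "$fna"
        fi
    done
}

# Usage, run from the directory holding the .fna files:
# missing_outputs . ../tRNA_pred_from_genomes
```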

The lists of predicted tRNAs for each genome (.out files) needed to be concatenated into one file. Added the genome accession as the first column.

for i in *.out; do awk '{print FILENAME"\t"$0}' $i | tail -n+4 | sed 's/_genomic.fna.trnascan.out//g'; done >> ../../Archaea_210genomes_tRNAs.txt
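
The same concatenation can be written as a single awk call that overwrites its target, so a rerun does not append duplicate rows. A sketch, assuming each .out file starts with exactly three header lines (as the tail -n+4 above implies); the function name concat_trnascan is hypothetical:

```shell
# Concatenate the tRNAscan-SE .out files given as arguments, skipping the
# first three (header) lines of each file and prefixing every row with the
# file name stripped of its "_genomic.fna.trnascan.out" suffix.
concat_trnascan() {
    awk 'FNR > 3 {
        name = FILENAME
        sub(/_genomic\.fna\.trnascan\.out$/, "", name)
        print name "\t" $0
    }' "$@"
}

# Usage, run from the directory holding the .out files:
# concat_trnascan *.out > ../../Archaea_210genomes_tRNAs.txt
```

FNR restarts at 1 for every input file, which is what lets one awk process drop the header of each file in turn.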

The columns of the resulting tab-delimited file (Archaea_210genomes_tRNAs.txt) are:

  1. Assembly accession
  2. Sequence name (i.e. contig name)
  3. tRNA #
  4. tRNA bounds begin
  5. tRNA bounds end
  6. tRNA type
  7. Anticodon
  8. Intron bounds begin
  9. Intron bounds end
  10. Inf score
  11. HMM score
  12. 2’str score
  13. Isotype CM
  14. Isotype score
  15. Note
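
Before importing, a count of tab-separated fields per line can reveal rows that gained or lost a column during concatenation. A minimal sketch (field_counts is a hypothetical name); rows with an empty Note may legitimately report one field fewer, so more than one count is not necessarily an error:

```shell
# Tabulate the number of tab-separated fields per line of file $1.
# Output: one "<field count> <number of lines>" pair per distinct count,
# sorted by field count.
field_counts() {
    awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$1" | sort -n
}

# Usage:
# field_counts Archaea_210genomes_tRNAs.txt
```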

Added column names manually.
Import into R.

arch210<-read.csv("~/Documents/tRNA/thermococcus/Archaea_210genomes_tRNAs.txt", sep="")

Add categories and reorder to make them look better in plot.

# TCA is assigned to "seleno" first, so it is not repeated in the "supp" test
arch210$cat <- ifelse(arch210$Anticodon=="TCA", "seleno",
               ifelse(arch210$Anticodon=="CTA" | arch210$Anticodon=="TTA", "supp",
               ifelse(arch210$tRNA_type=="Undet", "undet",
               ifelse(grepl("pseudo", arch210$Note, fixed=TRUE), "pseudo", "standard"))))
# Categories not listed in levels (here "supp") become NA
arch210$cat <- factor(arch210$cat, levels=c("standard","seleno","pseudo","undet"))

3. Histograms

3a. Histograms for 20 Thermococcus genomes (GtRNAdb predictions)

Histogram of Isotype scores for Thermococcus genomes FIG. A

ggplot(thermo, aes(Isotype_score, fill=cat)) +
  geom_histogram(binwidth=1) +
  theme_classic(base_size = 18) +
  coord_cartesian(xlim = c(-20,160)) +
  scale_fill_manual(values=c("#ced4da","#d90429")) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.2,0.6)) +
  theme(axis.text.x = element_text(colour = "black")) +
  theme(axis.text.y = element_text(colour = "black")) +
  labs(y="Count", x="Isotype score")

Same plot, but zoomed in to isotype score < 90 FIG. B

ggplot(thermo, aes(Isotype_score, fill=cat)) +
  geom_histogram(binwidth=1) +
  theme_classic(base_size = 18) +
  coord_cartesian(xlim = c(0,90), ylim=c(0,2)) +
  scale_fill_manual(values=c("#ced4da","#d90429")) +
  scale_y_continuous(breaks=c(0,1,2)) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.2,0.8)) +
  theme(axis.text.x = element_text(colour = "black")) +
  theme(axis.text.y = element_text(colour = "black")) +
  labs(y="Count", x="Isotype score") +
  annotate("text", x=65, y=1.5, label="Val-CAC", size=6) +
  annotate("text", x=75, y=2, label="Arg-GCG", size=6) +
  annotate("segment", x=70, y=1.4, xend=82, yend=1.05, arrow=arrow(length = unit(0.2, "cm"))) +
  annotate("segment", x=80, y=1.9, xend=88, yend=1.05, arrow=arrow(length = unit(0.2, "cm")))

3b. Histograms for 20 Thermococcus genomes (local tRNAscan-SE run)

Histogram of Isotype scores for Thermococcus genomes FIG. C

ggplot(thermo2, aes(Isotype_score, fill=cat)) +
  geom_histogram(binwidth=1) +
  theme_classic(base_size = 18) +
  coord_cartesian(xlim = c(-20,160)) +
  scale_fill_manual(values=c("#ced4da","#2b2d42","#d90429")) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.2,0.6)) +
  theme(axis.text.x = element_text(colour = "black")) +
  theme(axis.text.y = element_text(colour = "black")) +
  labs(y="Count", x="Isotype score")

Same plot, but zoomed in to isotype score < 90 FIG. D

ggplot(thermo2, aes(Isotype_score, fill=cat)) +
  geom_histogram(binwidth=1) +
  theme_classic(base_size = 18) +
  coord_cartesian(xlim = c(0,90), ylim=c(0,3)) +
  scale_fill_manual(values=c("#ced4da","#2b2d42","#d90429")) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.2,0.9)) +
  theme(axis.text.x = element_text(colour = "black")) +
  theme(axis.text.y = element_text(colour = "black")) +
  labs(y="Count", x="Isotype score")

3c. Histograms for all 210 genomes (GtRNAdb predictions)

Histogram of Isotype scores. FIG. E

ggplot(arch210GtRNAdb, aes(Isotype_score, fill=cat)) +
  geom_histogram(binwidth=1) +
  coord_cartesian(xlim = c(-20,160), ylim=c(0,600)) +
  scale_fill_manual(values=c("#ced4da","#03a1fc","#d90429")) +
  theme_classic(base_size = 18) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.2,0.6)) +
  theme(axis.text.x = element_text(colour = "black")) +
  theme(axis.text.y = element_text(colour = "black")) +
  labs(y="Count", x="Isotype score") +
  geom_vline(xintercept=59.5, linetype="dashed") +
  geom_vline(xintercept=92, linetype="dashed") +
  annotate("text", x=20, y=600, label="Non-canonical tRNA", size=5) +
  annotate("text", x=75.5, y=600, label="Uncertain", size=5) +
  annotate("text", x=120, y=600, label="Canonical tRNA", size=5)

Same figure, but zoomed in to the region with isotype score < 90. FIG. F

ggplot(arch210GtRNAdb, aes(Isotype_score, fill=cat)) +
  geom_histogram(binwidth=1) +
  coord_cartesian(xlim = c(-20,90), ylim=c(0,50)) +
  scale_fill_manual(values=c("#ced4da","#03a1fc","#d90429")) +
  theme_classic(base_size = 18) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.2,0.6)) +
  theme(axis.text.x = element_text(colour = "black")) +
  theme(axis.text.y = element_text(colour = "black")) +
  labs(y="Count", x="Isotype score") +
  geom_vline(xintercept=60, linetype="dashed") +
  geom_vline(xintercept=85, linetype="dashed") +
  annotate("text", x=25, y=50, label="Non-canonical tRNA", size=5) +
  annotate("text", x=73, y=50, label="Uncertain", size=5) +
  annotate("text", x=110, y=50, label="Canonical tRNA", size=5)

3d. Histograms for all 210 genomes (local run)

Histogram of Isotype scores. FIG. G

ggplot(arch210, aes(Isotype_score, fill=cat)) +
  geom_histogram(binwidth=1) +
  coord_cartesian(xlim = c(-20,160), ylim=c(0,600)) +
  scale_fill_manual(values=c("#ced4da","#03a1fc","#2b2d42","#d90429")) +
  theme_classic(base_size = 18) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.2,0.6)) +
  theme(axis.text.x = element_text(colour = "black")) +
  theme(axis.text.y = element_text(colour = "black")) +
  labs(y="Count", x="Isotype score") +
  geom_vline(xintercept=59.5, linetype="dashed") +
  geom_vline(xintercept=92, linetype="dashed") +
  annotate("text", x=20, y=600, label="Non-canonical tRNA", size=5) +
  annotate("text", x=75.5, y=600, label="Uncertain", size=5) +
  annotate("text", x=120, y=600, label="Canonical tRNA", size=5)

Same figure, but zoomed in to the region with isotype score < 90. FIG. H

ggplot(arch210, aes(Isotype_score, fill=cat)) +
  geom_histogram(binwidth=1) +
  coord_cartesian(xlim = c(-20,90), ylim=c(0,50)) +
  scale_fill_manual(values=c("#ced4da","#03a1fc","#2b2d42","#d90429")) +
  theme_classic(base_size = 18) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.2,0.6)) +
  theme(axis.text.x = element_text(colour = "black")) +
  theme(axis.text.y = element_text(colour = "black")) +
  labs(y="Count", x="Isotype score") +
  geom_vline(xintercept=60, linetype="dashed") +
  geom_vline(xintercept=85, linetype="dashed") +
  annotate("text", x=25, y=50, label="Non-canonical tRNA", size=5) +
  annotate("text", x=73, y=50, label="Uncertain", size=5) +
  annotate("text", x=110, y=50, label="Canonical tRNA", size=5)

Histogram of Inf_score

ggplot(arch210, aes(Inf_score, fill=cat)) +
  geom_histogram(binwidth=1) +
  theme_classic(base_size = 18) +
  coord_cartesian(xlim = c(0,160)) +
  scale_fill_manual(values=c("#ced4da","#03a1fc","#2b2d42","#d90429")) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.15,0.6)) +
  theme(axis.text.x = element_text(colour = "black")) +
  theme(axis.text.y = element_text(colour = "black")) +
  labs(y="Count", x="Inf score")

Histogram of HMM_score

ggplot(arch210, aes(HMM_score, fill=cat)) +
  geom_histogram(binwidth=1) +
  theme_classic(base_size = 18) +
  coord_cartesian(xlim = c(0,100)) +
  scale_fill_manual(values=c("#ced4da","#03a1fc","#2b2d42","#d90429")) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.15,0.6)) +
  theme(axis.text.x = element_text(colour = "black")) +
  theme(axis.text.y = element_text(colour = "black")) +
  labs(y="Count", x="HMM score")

Histogram of 2’str_score

ggplot(arch210, aes(X2.str_score, fill=cat)) +
  geom_histogram(binwidth=1) +
  theme_classic(base_size = 18) +
  coord_cartesian(xlim = c(-20,50)) +
  scale_fill_manual(values=c("#ced4da","#03a1fc","#2b2d42","#d90429")) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.15,0.6)) +
  theme(axis.text.x = element_text(colour = "black")) +
  theme(axis.text.y = element_text(colour = "black")) +
  labs(y="Count", x="2'str score")