Evidence for shared ancestry between Actinobacteria and Firmicutes bacteriophages

Bacteriophages are known to display a broad range of host spectra, typically infecting a small set of related bacterial species. The transfer of bacteriophages between more distant clades of bacteria has often been postulated, but remains mostly unaddressed. In this work we leverage the sequencing of a novel cluster of phages infecting Streptomyces bacteria and the availability of large numbers of complete phage genomes in public repositories to address this question. Using phylogenetic and comparative genomics methods, we show that several clusters of Actinobacteria-infecting phages are more closely related to each other, and to a small group of Firmicutes phages, than to any other Actinobacteriophage lineage. These data indicate that this heterogeneous group of phages shares a common ancestor with a well-defined genome structure. Analysis of genomic %GC content shows that these Actinobacteriophages are poorly adapted to their Actinobacteria hosts, suggesting that this phage lineage originated in an ancestor of the Firmicutes, adapted to high %GC content members of this phylum and later migrated to the Actinobacteria.


Introduction
Frequently referred to as phages, bacteriophages are viruses capable of infecting bacteria. It has been estimated that phages are the most abundant entities in the biosphere [1] and, through their regulation of bacterial populations, bacteriophages play an essential role in many global processes of the biosphere, such as carbon and nitrogen cycling [2]. In the last decade, decreasing sequencing costs have dramatically increased the number and diversity of bacteriophage genome sequences [3]. This influx of phage genomic data has reinforced the notion that phages are not only key players in geobiological processes, but also the largest reservoirs of genetic diversity in the biosphere [4]. The Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science (SEA-PHAGES) has undertaken a sustained effort to isolate and sequence phages infecting Actinobacteria species [3]. Among these, Mycobacteria-infecting phages have been studied the most, providing a remarkably deep sample of bacteriophages infecting a given bacterial genus [3]. Studies of genetic diversity in over 600 Mycobacteria-infecting phage genomes have revealed extensive mosaicism, and genetic exchange among relatively distant groups of Mycobacteriophages. Rarefaction analyses suggest that the Mycobacteriophage gene pool is not an isolated environment, and that it is enriched by an influx of genetic material from outside sources [5]. Here we report on the genomic characterization of a new cluster of Streptomyces phages (Cluster BI). Gene content and protein sequence phylogenies indicate that members of BI and related Actinobacteriophage clusters share a common ancestor with Lactococcus and Faecalibacterium phages [6,7].
Analysis of genomic %GC content indicates that these Actinobacteriophages are still undergoing amelioration, suggesting that they may have originated as a result of an interphylum migration event from related Firmicutes phages.

Genome data
Genomes for relevant Streptomyces phages and for reference Actinobacteria and Firmicutes bacteriophages were retrieved in GenBank format from the NCBI GenBank database [8] using custom Python scripts. These scripts also derived nucleotide and amino acid FASTA-formatted files from the GenBank records, and autonumerically reassigned locus_tag and gene GenBank identifiers for consistent pham annotation with PhamDB. For phages without a public GenBank record, nucleotide FASTA files were downloaded from PhagesDB [3] and auto-annotated with DNA Master [9] to generate a GenBank-formatted file. %GC content data was obtained from the corresponding NCBI assembly records. Group %GC content was compared using a Mann-Whitney U test with α=0.05 using a custom Python script and the scipy.stats module.
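The group comparison described above can be sketched with scipy.stats as follows. This is a minimal illustration, not the authors' script: the function name compare_gc is hypothetical, the two-sided alternative is an assumption (the text only states the test and α), and the %GC values are placeholders rather than data from this study.

```python
from scipy.stats import mannwhitneyu

ALPHA = 0.05  # significance level stated in the text


def compare_gc(group_a, group_b, alpha=ALPHA):
    """Mann-Whitney U test on two lists of %GC values.

    Assumption: a two-sided alternative; the text does not specify it.
    Returns the U statistic, the p-value and whether p < alpha.
    """
    result = mannwhitneyu(group_a, group_b, alternative="two-sided")
    return result.statistic, result.pvalue, result.pvalue < alpha


# Placeholder values for illustration only; NOT data from this study.
phage_gc = [58.1, 59.4, 60.2, 57.8, 59.0]
host_gc = [71.5, 72.3, 70.9, 72.0, 71.1]

u_stat, p_value, significant = compare_gc(phage_gc, host_gc)
print(f"U={u_stat:.1f}, p={p_value:.4f}, significant={significant}")
```

A rank-based test is a reasonable choice here because it makes no assumption that %GC values are normally distributed within a group.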

Gene content phylogeny
PhamDB was used to compute protein families, or phams, for the bacteriophage genomes under analysis [10]. The PhamDB-generated database was then imported into Phamerator [11] and the resulting pham table was exported as a comma-separated file and processed with spreadsheet software and the Janus program (Lawrence Lab) to obtain a Nexus-format file with presence/absence of each pham in each genome as a binary character. This Nexus file was used as input for SplitsTree [12]. Network and tree phylogenies were inferred with the NeighborNet and BioNJ algorithms using a gene content distance [13] and branch support for the resulting phylogeny was estimated from 1,000 bootstrap pseudoreplicates. A genome-based phylogeny was generated with the VICTOR webservice [14]. Intergenomic protein sequence distances were computed with 100 pseudo-bootstrap replicates using the Genome-BLAST Distance Phylogeny (GBDP) method optimized (distance formula d6) for prokaryotic viruses [14,15] and a minimum evolution tree was computed with FASTME on the resulting intergenomic distances [16].
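The presence/absence encoding described above can be illustrated with a short Python sketch. It is not the Janus program: the function name pham_table_to_nexus, the input layout (a mapping from genome name to the set of phams it contains) and the generic Nexus DATA block are assumptions, since the text does not describe the exact file layout.

```python
def pham_table_to_nexus(genome_phams):
    """Build a Nexus DATA block with one binary character per pham.

    genome_phams: dict mapping genome name -> set of pham identifiers.
    Assumption: genome names contain no whitespace.
    """
    taxa = sorted(genome_phams)
    # Every pham seen in any genome becomes one column of the matrix.
    phams = sorted({p for members in genome_phams.values() for p in members})
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(taxa)} NCHAR={len(phams)};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;',
        "  MATRIX",
    ]
    for taxon in taxa:
        # 1 = pham present in this genome, 0 = absent.
        row = "".join("1" if p in genome_phams[taxon] else "0" for p in phams)
        lines.append(f"    {taxon} {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Hypothetical genomes and pham identifiers, for illustration only.
example = {"PhageA": {1, 2}, "PhageB": {2, 3}}
nexus_text = pham_table_to_nexus(example)
print(nexus_text)
```

Sorting both taxa and phams keeps the output deterministic, so the same pham table always yields the same character matrix.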

Protein sequence phylogeny
A profile Hidden Markov Model (HMM) of terminase protein sequences was built with HMMER (hmmbuild) using a ClustalW multiple sequence alignment of all annotated terminase, TerL or terminase large subunit sequences in the genomes under analysis [17,18]. This profile HMM was used to search (hmmsearch) the protein FASTA file derived from each genome with a cutoff e-value of 10^-3. Putative terminase sequences identified by the profile HMM were aligned with ClustalW using default parameters. Tree inference was performed on the resulting multiple sequence alignment using the BioNJ algorithm with a Gamma distribution parameter of 1 and the Jones-Taylor-Thornton substitution model, and branch supports were estimated from 1,000 bootstrap pseudoreplicates [19].
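The e-value cutoff can be applied, for example, when parsing hmmsearch tabular output. The sketch below assumes that hits were saved with the --tblout option (the text does not say how the output was processed) and that the cutoff is inclusive; the function name filter_tblout and the sample lines are hypothetical.

```python
def filter_tblout(lines, max_evalue=1e-3):
    """Return (target, e-value) pairs passing the full-sequence cutoff.

    Assumes HMMER --tblout layout: whitespace-separated fields where
    field 0 is the target name and field 4 is the full-sequence E-value.
    """
    hits = []
    for line in lines:
        # Skip blank lines and the '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue:
            hits.append((target, evalue))
    return hits


# Made-up records in --tblout layout, for illustration only.
sample = [
    "# target name accession query name accession E-value score bias ...",
    "gp5 - terminase - 1.2e-50 170.3 0.1 1.5e-50 170.0 0.1 1.0 1 0 0 1 1 1 1 -",
    "gp9 - terminase - 0.5 3.1 0.0 0.7 2.7 0.0 1.2 1 0 0 1 1 0 0 -",
]
hits = filter_tblout(sample)
print(hits)
```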

Conserved architecture of BI cluster Streptomyces phage genomes
In the last few years, our group has characterized and sequenced several Siphoviridae Comparative analysis of these bacteriophage genomes (Figure 1) reveals nucleotide sequence conservation to be predominant only in the virion structure and assembly genes module, which presents a genetic arrangement consistent with that observed in other Siphoviridae [21]. Within this module, the terminase gene shows the highest degree of sequence conservation, followed by segments of the portal, capsid maturation and tape measure protein coding genes (Figure 1).
Beyond the structure and assembly module, moderate nucleotide sequence conservation is only observed for the genes coding for a predicted hydrolase in the lysis module, and for the DNA primase/polymerase and a helix-turn-helix (HTH) domain-containing protein in the replication module.

Interphylum conservation of structure and replication proteins
Functional annotation of BI cluster genomes was performed using BLASTP searches against both the NCBI GenBank and the PhamDB databases, as well as the HHpred service [22,3,8,20].
During the annotation process, BI cluster protein sequences frequently elicited significant hits

members) with other
Actinobacteriophages.
Graphical analysis of the genomic distribution of orthologs spanning both the Actinobacteriophage supercluster and the Lactococcus and Faecalibacterium phages (Figure 2) revealed that most of the orthologous genes were contained within two conserved regions at opposite ends of the genome. The first conserved region encompasses a sizable fraction of the virion structure and assembly genes module seen in BI cluster phages, containing an HNH endonuclease, a head-to-tail connector, the terminase large subunit, the portal protein and a capsid maturation protease (Figure 2A). The second conserved region corresponds to the end of the replication module observed in cluster BI phages and contains a DNA helicase, an HNH endonuclease, a RecB exonuclease, the HTH domain-containing protein and a conserved hypothetical protein (Figure 2B). Pairwise amino acid identity and alignment coverage for conserved orthologs among Actinobacteriophages were moderately high (56% ± SD 12 and 91% ± SD 7), and remained surprisingly high between Gordonia phage Gravy and Faecalibacterium phage FP_oengus (49% ± SD 11 and 90% ± SD 7), suggesting a relatively close evolutionary relationship.

Shared ancestry between Actinobacteria and Firmicutes phages
The presence of two genomic regions showing substantial numbers of orthologous genes across a group of Actinobacteriophages infecting multiple hosts and a small set of Firmicutes phages strongly pointed to an evolutionary relationship among these phages. To validate and examine this hypothesis, we used SplitsTree to infer the neighbor tree and estimate bootstrap support for the splits. The results (Figure 3, Figure S1, Data S1) show consistent branching (99.9% bootstrap support) of the Actinobacteriophage supercluster with both Lactococcus and Faecalibacterium phages, clearly establishing that these Firmicutes phages and the Actinobacteriophage supercluster phages share more gene content with each other than with reference Actinobacteriophages and Firmicutes phages. To further validate and support this result, we performed phylogenetic inference on the protein sequence of the large terminase subunit (Figure 4), a very common marker for bacteriophage phylogenetic analysis [23][24][25][26]. The inferred tree also shows solid support (100% bootstrap support) for a joint branching of the Actinobacteriophage supercluster phages and Lactococcus and Faecalibacterium phages, giving further credence to the notion that these phages share a common ancestor. Identical support for the joint branching of the Actinobacteriophage supercluster phages and Lactococcus and Faecalibacterium phages was obtained through independent phylogenetic inference using a bootstrapped minimum evolution algorithm operating on intergenomic protein sequence distances inferred from pairwise genome-wide reciprocal tBLASTX (Figure SX).

Divergence in %GC content between bacteriophages and their hosts
We analyzed the %GC content of bacteriophage genomes to assess their alignment with the genomic %GC content of their hosts. The results (Figure 5, also present %GC content that is significantly lower than the one observed in their natural hosts and than the average for phage clusters infecting the respective genera. For comparison, the mean ± standard deviation for %GC content estimated on the host genomes is overlaid with a filled box on the phage cluster bars and the mean ± standard deviation for %GC content among phages infecting each genus is overlaid as an empty box. The %GC content for Faecalibacterium hosts was estimated from one representative per species with whole genome shotgun assemblies.

Discussion
It is well known that bacteriophages will often infect several different hosts within the same bacterial genus, and that this host range can vary widely among phages within a given genus. As a consequence, it has been postulated that the intragenera host-phage interaction network is nested, with generalist phages infecting multiple hosts and specialist phages infecting particularly susceptible strains [27]. In contrast, relatively little is known about the ability of bacteriophages to infect across genera or broader taxonomic spans. Phage systems have been engineered to transcend genus boundaries [28] and effective transfer of virus-like particles has been documented across phyla [29],

Table S1 - List of phage genomes analyzed in this work.

Table S2 - Groups of orthologous proteins (phams) in the set of analyzed phage genomes.

Data S1 - Nexus-formatted SplitsTree file for the pham-based tree.