ATP synthase evolution on a cross-braced dated tree of life

Mahendrarajah, Tara A; Moody, Edmund RR; Schrempf, Dominik; Szántho, Lénárd L; Dombrowski, Nina; Davín, Adrián A; Pisani, Davide; Donoghue, Philip CJ; Szöllősi, Gergely J; Williams, Tom A; Spang, Anja

doi:10.5281/zenodo.8232759

Published August 14, 2023 | Version v2

Dataset Open

Abstract

The timing of early cellular evolution, from the divergence of Archaea and Bacteria to the origin of eukaryotes, remains poorly constrained. The ATP synthase complex is thought to have originated prior to the Last Universal Common Ancestor (LUCA) and analyses of ATP synthase genes, together with ribosomes, have played a key role in inferring and rooting the tree of life. Here we reconstruct the evolutionary history of ATP synthases using an expanded sampling of Archaea, Bacteria, and eukaryotes. We developed a phylogenetic cross-bracing approach, thereby constraining equivalent speciation nodes to be contemporaneous, based on the phylogenetic imprint of endosymbioses and ancient gene duplications of the major ATP synthase subunits. This approach resulted in a highly resolved, dated species tree and established an absolute timeline for ATP synthase evolution. Our analyses show that the divergence of ATP synthase into F- and A/V-type lineages was a very early event in cellular evolution dating back to more than 4 Ga, potentially predating the diversification of Archaea and Bacteria. Our cross-braced, dated tree of life also provides insight into more recent evolutionary transitions including eukaryogenesis, showing that the eukaryotic nuclear and mitochondrial lineages diverged from their closest archaeal (2.67-2.19 Ga) and bacterial (2.58-2.12 Ga) relatives at approximately the same time, with a slightly longer nuclear stem-lineage.

Repository Contents

1_100Eukaryote_genomes.tar.gz: includes all protein sequence files for the 100 Eukaryotes sampled in this study.

2_Phylogenies.tar.gz: includes all files used for phylogenetic analyses. Folders are organized as follows:

1_ATPsynthase_gene_trees: this folder contains all sequence, alignment, and tree files for the ATP synthase gene trees. Files are organized as follows and are associated with the corresponding parts of the manuscript: Figure 3, Figure 5B, Supplementary Figures 5-10, Supplementary Figures 18-19
- Folder '1_sequences' includes all unaligned fasta sequence files for each ATP synthase gene tree (see Methods)
- Folder ‘2_alignments’ includes all alignments generated using MAFFT L-INS-i (subdirectory: 1_untrimmed) and trimmed with BMGE (subdirectory: 2_trimmed)
- Folder '3_treefiles' includes all IQ-TREE2 output files for all ATP synthase gene phylogenies. Any files with suffix *taxa.treefile contain the full taxonomic string for each accession.
- Folder '4_pdfs' includes PDF files for each ATP synthase gene tree
2_Eukaryotic_subsets: this folder contains all sequence, alignment, and tree files for ATP synthase Eukaryotic subset gene trees. Files are organized as follows and are associated with the corresponding parts of the manuscript: Supplementary Figure 11
- Folder ‘1_sequences’ includes all unaligned fasta sequence files for the eukaryotic subsets.
- Folder ‘2_alignments’ includes all alignments generated using MAFFT L-INS-i (subdirectory: 1_untrimmed) and trimmed with BMGE (subdirectory: 2_trimmed).
- Folder ‘3_treefiles’ includes all Bayesian trees inferred for eukaryotic subsets.
- Folder '4_pdfs' includes PDF files for each eukaryotic subset tree
3_21eLife_concatenated_species_tree: this folder contains all sequence, alignment, and tree files for the single gene tree and concatenated phylogeny analyses (inferred using 21 single-copy marker genes, see Methods). Files are organized as follows and are associated with the following parts of the manuscript: Figure 1, Supplementary Figure 20
- Folder ‘1_inspection_start’ corresponds to the initial manual inspection of the single gene trees and includes the following subdirectories:
  - Folder ‘1_sequences’ includes all protein sequence fasta files corresponding to the 27 original single-copy marker genes
  - Folder ‘2_alignments’ includes all alignment files generated using MAFFT L-INS-i (subdirectory: 1_untrimmed) and trimmed with BMGE (subdirectory: 2_trimmed)
  - Folder ‘3_treefiles’ includes all IQ-TREE2 output files for all phylogenies (27 single-copy marker genes)
  - Folder '4_pdfs' includes PDF files for each single gene tree
- Folder ‘2_inspection_final’ corresponds to the final manual inspection of the single gene trees and includes the following subdirectories:
  - Folder ‘1_sequences’ includes all protein sequence fasta files corresponding to the final 21 single-copy marker genes
  - Folder ‘2_alignments’ includes all alignment files generated using MAFFT L-INS-i (subdirectory: 1_untrimmed) and trimmed with BMGE (subdirectory: 2_trimmed)
  - Folder ‘3_treefiles’ includes all IQ-TREE2 output files for all phylogenies (21 single-copy marker genes)
  - Folder '4_pdfs' includes PDF files for each single gene tree
- Folder ‘3_concatenated_phylogeny’ contains the concatenated alignment generated from the final 21 single-copy marker gene alignments and the trees inferred from it
  - Folder ‘1_alignment’ includes the concatenated alignment generated from the 21 trimmed alignments from the final inspection
  - Folder ‘2_treefiles’ includes all IQ-TREE2 output files for trees inferred using the two different models (subdirectories: LG+C20+R+F and LG+C60+R+F)
- Folder '4_Eukaryote_only_phylogeny' contains sequence, alignment, and tree files for 21 single-copy marker genes used to infer a Eukaryote-only phylogeny. Folder is organized as follows and files correspond to Supplementary Figure 3:
  - Folder ‘1_sequences’ includes all protein sequence fasta files corresponding to the 21 single-copy marker genes with only Eukaryotes
  - Folder ‘2_alignments’ includes all alignment files generated using MAFFT L-INS-i (subdirectory: 1_untrimmed) and trimmed with BMGE (subdirectory: 2_trimmed)
  - Folder ‘3_concatenated_phylogeny’ includes concatenated alignment generated from 21 single-copy markers with only Eukaryotes (subdirectory: 1_alignment) and all IQ-TREE2 output files for the concatenated phylogeny (subdirectory: 2_treefiles)
  - Folder '4_pdfs' includes PDF files for the concatenated Eukaryote tree
4_Ribosomal_species_tree: this folder contains all sequence, alignment, and tree files for the single gene tree and concatenated phylogeny analyses (inferred using 12 ribosomal marker genes, see Methods). Files are organized as follows and are associated with the corresponding parts of the manuscript: Figure 5A, Figure 5C, Supplementary Figures 12-16, Supplementary Figure 21
- Folder ‘1_sequences’ includes all protein sequence fasta files for the original 15 ribosomal proteins. Sequence sets include the best-hit Archaea and Bacteria, and nuclear, mitochondrial, and plastid eukaryotic homologs
- Folder ‘2_alignments’ includes all alignment files generated using MAFFT L-INS-i (subdirectory: 1_untrimmed) and trimmed with TRIMAL (gappy-out) (subdirectory: 2_trimmed)
- Folder ‘3_treefiles’ includes all original FastTree tree files and tree files in which the sequences to remove are highlighted (*blue-to-rem = eukaryotic nuclear homolog only; *colored-to-rem = eukaryotic nuclear, mitochondrial, and plastid homologs). PDFs of each marker gene tree, showing which sequences to keep and/or remove, are also included.
- Folder ‘4_concatenated_phylogeny’ contains the concatenated alignment generated from the final 12 ribosomal marker genes and the species tree inferred from it
  - Folder ‘1_alignment’ includes the concatenated alignment generated with 12 ribosomal marker proteins in MAFFT L-INS-i and trimmed with TRIMAL (gappy-out)
  - Folder ‘2_phylogeny’ includes all IQ-TREE2 output files for the species tree inferred using the LG+C60+R+F model
5_Dating_analysis: includes all Mcmcdate output files for the dating analyses (species tree and ATP synthase gene tree, see Methods).
- Folder '1_Edited1_dating' includes all dated tree files and monitor files for braced and unbraced analyses of the Edited1 species tree topology. Data corresponds to Supplementary Figure 12 and Supplementary Figures 14-15.
- Folder '2_Edited2_dating' includes all dated tree files and monitor files for braced and unbraced analyses of the Edited2 (focal) species tree topology. Data corresponds to Figure 5A, Figure 5C, Supplementary Figure 13, Supplementary Figure 16.
- Folder '3_ATP_synthase_dating' includes all dated tree files and monitor files for braced and unbraced analyses of the ATP synthase gene tree. Data corresponds to Figure 5B and Supplementary Figures 18-19.
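
The concatenated alignments described above join the per-marker trimmed alignments taxon by taxon. The following Python sketch illustrates that general idea only; it is not the authors' workflow, which is documented in the workflows in 3_Scripts.tar.gz. It assumes that each trimmed alignment is a FASTA file, that the first word of each header is a taxon label shared across markers, and that taxa missing from a marker are padded with gaps. The function names read_fasta and concatenate are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into {label: sequence}, joining wrapped lines.

    Assumption: the first whitespace-delimited word of each header is the
    taxon label and is identical across marker files.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {label: "".join(chunks) for label, chunks in records.items()}


def concatenate(alignments):
    """Join a list of alignments ({label: sequence}) column block by block.

    Taxa absent from a marker receive a run of gap characters of that
    marker's width, so every output sequence has the same length.
    """
    taxa = sorted({label for aln in alignments for label in aln})
    blocks = {label: [] for label in taxa}
    for aln in alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("all sequences in one alignment must have equal length")
        width = widths.pop()
        for label in taxa:
            blocks[label].append(aln.get(label, "-" * width))
    return {label: "".join(parts) for label, parts in blocks.items()}
```

In use, each file under a '2_trimmed' subdirectory would be read with read_fasta and the resulting dictionaries passed to concatenate in a fixed marker order. If the headers in the archive carry accession numbers rather than shared taxon labels, a mapping step would be needed first.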

3_Scripts.tar.gz: includes all workflows and scripts used for phylogenetic analyses.

1_workflows: includes bash workflows for phylogenetic analyses (details on software versions are included in each workflow summary):
- Workflow_ATPsynthase_gene_trees.sh: generation of the ATP synthase phylogenies
- Workflow_21eLife_marker_phylogeny.sh: inferring the 21 marker-gene species tree
- Workflow_Ribosomal_species_tree.sh: inferring the 12 ribosomal marker-gene species tree
- Workflow_Database_annotations.sh: workflow for gene annotation for 800 sampled Archaea, Bacteria, and Eukaryota
2_R_scripts: includes R scripts used for the Eukaryote sequence contamination screening (Figure 1, Figure 2, Supplementary Figure 2, Supplementary Figures 4, 5, 8-10), presence-absence analyses (Figure 1, Figure 2, Supplementary Figure 2), and plotting tree figures (Supplementary Figures 4-10). Input mapping files and R output files are included.
- Folder '1_Euk_contamination_screen' contains workflow 'Eukaryote_contamination_screen.Rmd' used to inspect Eukaryotic ATP synthase sequences for bacterial contamination
- Folder '2_Presence_absence' includes sub-directories:
  - Folder '1_Species_tree' includes the treefile(s) used for ordering the plots in Figure 1 and Supplementary Figure 2 (‘1_tree’), the taxonomic and COG mapping files and the list of putative contamination to remove (‘2_input_files’), the raw count table for all 800 taxa ('3_Output_files'), R output plot(s) ('4_Plotting'), and the script to generate presence-absence plots 'Presence-absence.R'.
  - Folder '2_Eukaryotes_only' includes organelle information, protein mapping files, taxonomic mapping files, and list of putative contamination to remove ('1_Input_files'); raw count table of ATP synthase subunits ('2_Output_files'); and R output plots ('3_Output_files').
    Please see 'Eukaryote_contamination_screen.Rmd' in parent directory '2_R_scripts' for more information on how Eukaryotic sequences were screened, how the list of contaminating sequences was curated, and how the plot for Figure 2 was generated.
- Folder ‘3_Plotting_trees’ includes the rectangular and radial trees generated for each ATP synthase tree (see Supplementary Figures 5-10). Trees were generated from the treefiles for the ATP synthase gene trees (see above) using the script 'Plotting_trees.Rmd'
- 'Marker_gene_counts.R' script used to count marker genes per genome (see Methods)
3_TimeTree: includes Python scripts used to generate the time-trees (Figure 5C, Supplementary Figures 15 and 19)
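
Because the archives are large, it can help to inspect their contents before unpacking. The sketch below uses Python's standard tarfile module to count files with a given suffix, grouped by their leading folders. It assumes the archive paths start with the folder names listed above (for example '2_Phylogenies/1_ATPsynthase_gene_trees/...'); the actual internal layout has not been checked here, and the function name count_members_by_folder is hypothetical.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_members_by_folder(archive, suffix=".treefile", depth=2):
    """Count regular files ending in `suffix`, grouped by the first
    `depth` path components inside the tar archive.

    Assumption: member paths begin with the folder names given in the
    repository description; adjust `depth` if the layout differs.
    """
    counts = Counter()
    # mode "r:*" lets tarfile detect the compression (gzip here).
    with tarfile.open(archive, mode="r:*") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                parts = PurePosixPath(member.name).parts
                counts["/".join(parts[:depth])] += 1
    return counts
```

For example, count_members_by_folder("2_Phylogenies.tar.gz") would report how many '.treefile' files sit under each analysis folder, and tarfile's extractall or extract methods can then be used to unpack only what is needed.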

Notes

This work was supported by the Gordon and Betty Moore Foundation through grant GBMF9741 to TAW, AS, and GJSz. Furthermore, AS has received funding from the Swedish Research Council (VR starting grant 2016-03559), the NWO-I foundation of the Netherlands Organisation for Scientific Research (WISE fellowship), the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 947317, ASymbEL), the Moore–Simons Project on the Origin of the Eukaryotic Cell, Simons Foundation 735929LPI (https://doi.org/10.46714/735929LPI), and the Gordon and Betty Moore Foundation's Symbiosis in Aquatic Systems Initiative (GBMF9346). Further, this work was supported by a Royal Society University Research Fellowship to TAW. ERRM, DP, PCJD and TAW were supported by the John Templeton Foundation (62220). The opinions expressed in this publication are those of the author(s) and do not necessarily reflect the views of the John Templeton Foundation. PCJD was also funded by the Leverhulme Trust (RF-2022-167) and the Biotechnology and Biological Sciences Research Council (BB/T012773/1). GJSz, DS and LLSz received funding from the European Union's Horizon 2020 research and innovation programme (grant agreement No. 714774, GENECLOCKS). We thank Gertraud Burger, Julius Lukes, Takeshi Nara, and other members of the Diplonema papillatum sequencing consortium for sharing data. We also want to thank Courtney Stairs, Andrew Roger and Georg Hochberg for helpful discussions and/or feedback regarding eukaryotic metabolism and ancestral sequence reconstructions, respectively.

Files (3.9 GB)

Name	MD5	Size
100Eukaryote_genomes.tar.gz	12ca113389404c073f1a4a749e4cffee	2.6 GB
2_Phylogenies.tar.gz	7ca9fd53f5a37e1489006452348cde05	1.3 GB
3_Scripts.tar.gz	be94776f44a7c2ec1a1d14eee606cff4	5.2 MB
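
The MD5 checksums in the file listing can be used to confirm that a download completed without corruption. A minimal Python sketch using the standard hashlib module follows. The file names and checksums are copied from the listing; the function names md5_of_file and verify are hypothetical, and the files are assumed to be in the current working directory.

```python
import hashlib

# File names and MD5 values as given in the Files listing of this record.
EXPECTED_MD5 = {
    "100Eukaryote_genomes.tar.gz": "12ca113389404c073f1a4a749e4cffee",
    "2_Phylogenies.tar.gz": "7ca9fd53f5a37e1489006452348cde05",
    "3_Scripts.tar.gz": "be94776f44a7c2ec1a1d14eee606cff4",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks so that
    multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

A download could then be checked with, for example, verify("3_Scripts.tar.gz", EXPECTED_MD5["3_Scripts.tar.gz"]). MD5 is suitable here for detecting transfer errors only, not for guarding against deliberate tampering.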
