Published February 18, 2020 | Version 2
Dataset Open

Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution

  • 1. NIOZ, Royal Netherlands Institute for Sea Research, Department of Marine Microbiology and Biogeochemistry, and Utrecht University, Netherlands
  • 2. School of Biological Sciences, University of Bristol, UK
  • 3. Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Australia
  • 4. Department of Cell- and Molecular Biology, Science for Life Laboratory, Uppsala University, SE-75123, Uppsala, Sweden
  • 5. Research School of Computer Science and Research School of Biology, Australian National University, Australia


General Description

Repository with all analyses described in our paper: Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution.

If you find this work useful for your own analyses, please cite this work.



The evolution and diversification of Archaea is central to the history of life on Earth. Cultivation-independent approaches have revealed the existence of the DPANN archaea: a radiation of organisms with small cell and genome sizes. Currently, the placement of the various DPANN lineages and in turn the early evolution of metabolism and symbiosis are debated. Here, we reconstructed genomes of a thus far uncharacterized archaeal phylum-level lineage UAP2 (Candidatus Undinarchaeota). Comparative genomics revealed that members of the Undinarchaeota have small estimated genome sizes and, while potentially being able to conserve energy through fermentation, likely depend on partner organisms for the acquisition of vitamins, amino acids and other metabolites. In contrast to previous indications, our phylogenomic analyses robustly placed the Undinarchaeota as an independent lineage between two major and highly supported clans of ‘DPANN’. Furthermore, our work suggests that DPANN archaea have exchanged core genes with their hosts by horizontal gene transfer (HGT), adding to the difficulty of placing DPANN in the tree of life (ToL). In several cases, this pattern is sufficiently dominant that known symbiont-host clades can be identified by inferring routes of HGT across the ToL. Together, our findings provide crucial insights into the origins and evolution of DPANN archaea and their hosts.

The annotation workflow for archaeal/bacterial genomes that was used for this paper is also available on GitHub (here) and an updated version that includes the COG search is available on:


Repository Contents

1_Genome_files.tar.gz includes all Undinarchaeota (original name UAP2) metagenome-assembled genomes (MAGs). This includes: 

  1. The original contigs for each UAP2 MAG (fna files)
  2. The prokka output for each UAP2 MAG (faa files)
  3. A concatenated file of all proteins from each UAP2 MAG and all archaeal reference genomes (364 genomes in total). This folder also includes a list of archaeal genomes investigated.

2_Phylogenies.tar.gz includes all files for the phylogenetic analyses. This includes the following folders:

1. Files for the concatenated species trees for different taxa sets. These files are related to the following parts of the manuscript: Supplementary Table 6; Figure 1 and Supplementary Figures S8-S58. The folder includes the following:

  • Folder '1_unaligned_sequences' includes individual protein sequences extracted from the different taxa sets.
  • Folder '2_alignments' includes the alignment files generated by MAFFT.
  • Folder '3_alignments_trimmed' includes the alignments trimmed with BMGE.
  • Folder '4_phylogenies' includes the IQ-TREE output for all phylogenies as well as a color-annotation file for FigTree. Additionally, files rooted with minimal ancestor deviation (MAD) rooting (*.rooted) are provided. Note that for the final figures the *treefile_renamed files (i.e. the IQ-TREE tree files with the full taxa string) were artificially rooted using the DPANN archaea. The numbering corresponds to Supplementary Table S6 of the main manuscript.
  • Folder '5_pdfs' includes the PDFs for each tree.
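
The folder layout above follows an align, trim and infer sequence. The following is a minimal sketch of how the three steps could be chained for one marker protein. It only assembles the command lines and does not run anything. The file names and extensions, the location of the BMGE jar, the `--auto` alignment mode, the `LG+G` model and the bootstrap count are all placeholders chosen for illustration, not the settings used in the paper; the actual settings are documented in the scripts provided in 3_Scripts.tar.gz.

```python
from pathlib import Path


def build_commands(marker: str, bmge_jar: str = "BMGE.jar"):
    """Return (argv, stdout_path) pairs for one marker protein.

    Nothing is executed here; the commands are only assembled.
    File names, the jar path and all tool settings are hypothetical.
    """
    unaligned = Path("1_unaligned_sequences") / f"{marker}.faa"
    aligned = Path("2_alignments") / f"{marker}.aln"
    trimmed = Path("3_alignments_trimmed") / f"{marker}.trimmed.aln"
    return [
        # MAFFT writes the alignment to stdout, so it is redirected to a file.
        (["mafft", "--auto", str(unaligned)], aligned),
        # BMGE trims the amino-acid alignment and writes FASTA output.
        (["java", "-jar", bmge_jar, "-i", str(aligned), "-t", "AA",
          "-of", str(trimmed)], None),
        # IQ-TREE infers the tree; model and bootstrap count are placeholders.
        (["iqtree", "-s", str(trimmed), "-m", "LG+G", "-bb", "1000"], None),
    ]


# Print the commands as they could be typed in a shell.
for argv, stdout_path in build_commands("marker_001"):
    redirect = f" > {stdout_path}" if stdout_path else ""
    print(" ".join(argv) + redirect)
```

Keeping command construction separate from execution makes it easy to inspect what would run before submitting jobs to a cluster.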

2. Files for single gene trees, which include:

  • The folder '1_arcogs' includes the unaligned proteins, alignments, trimmed alignments, trees and pdfs for the single gene trees based on the arCOG identifiers. The arCOGs were extracted from 12 UAP2 MAGs + 352 archaeal + 3020 bacterial + 100 eukaryotic genomes. ArCOGs were only considered if they occurred in at least 3 UAP2 genomes. Note that these files were used to investigate UAP2 for HGT events and correspond to the following parts of the manuscript: Figure 4 and Supplementary Tables 4, 5, 20-22. Additionally, the folder 0_parsing includes some information on how to generate count tables for each marker gene.
  • The folder '151_markers' includes the proteins, alignments, trimmed alignments, trees and pdfs for evaluating the 151 marker set used for the concatenated species tree. Files are provided for the 127 and 364 taxa sets. These files were used as a basis for the concatenated species trees that were used to generate Supplementary Figures S8-S58. Additionally, the trees were used for ranking marker proteins and generating Supplementary Tables 4-5. For the 364 taxa set, the folder also includes a subfolder 0_parsing that provides scripts to compute some statistics for each marker protein, including the average protein length, average alignment length and average bootstrap support.
  • The folder '3_other_individual_trees' includes the proteins, alignments and phylogenies for the 16S_23S, RubisCO and primase analyses. The data was used to generate the following parts of the manuscript: Supplementary Table 11, Supplementary Figures 3-5, 57 and 59.
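
The occurrence criterion described for the arCOG trees (an arCOG is only considered if it occurs in at least 3 UAP2 genomes) can be illustrated with a short sketch. The function name, the input format (pairs of genome and arCOG identifiers) and the toy identifiers are assumptions made for this example; they do not reflect the format of the files in the 0_parsing folder.

```python
from collections import defaultdict


def arcogs_in_min_genomes(hits, target_genomes, min_genomes=3):
    """Keep arCOGs found in at least `min_genomes` of the target genomes.

    hits: iterable of (genome_id, arcog_id) pairs, one per annotated protein.
    Several proteins from one genome count only once for that genome.
    """
    genomes_per_arcog = defaultdict(set)
    for genome_id, arcog_id in hits:
        if genome_id in target_genomes:
            genomes_per_arcog[arcog_id].add(genome_id)
    return sorted(
        arcog for arcog, genomes in genomes_per_arcog.items()
        if len(genomes) >= min_genomes
    )


# Toy input with made-up identifiers.
hits = [
    ("UAP2_a", "arCOG00001"), ("UAP2_b", "arCOG00001"), ("UAP2_c", "arCOG00001"),
    ("UAP2_a", "arCOG00002"), ("UAP2_a", "arCOG00002"), ("REF_x", "arCOG00002"),
]
uap2 = {"UAP2_a", "UAP2_b", "UAP2_c"}
print(arcogs_in_min_genomes(hits, uap2))  # ['arCOG00001']
```

Counting distinct genomes rather than proteins means that paralogs within a single MAG do not inflate the occurrence of an arCOG.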

3_Scripts.tar.gz includes the workflows and scripts used for the annotations and phylogenetic analyses. This includes the following folders:

1. The files for the main workflow for the annotations and phylogenies.

  • This folder includes the workflow to generate annotations for archaeal genomes as well as an example script that was used to generate phylogenies. These analyses were typically run on an in-house bioinformatics cluster with 4x Xeon Gold 6140 2.3 GHz processors using bash, python and perl. The system runs a Linux operating system (Red Hat Enterprise Linux 7.5).

2. A folder providing the required dependencies, which include:

  • any python or perl scripts that were used during this study and/or that are mentioned in the methods section
  • databases used for the annotations, especially if these were slightly modified. Note that changes typically include parsing of the mapping files or modifications of the sequence headers for easier parsing.
  • mapping files needed to link the genome accession IDs to the taxonomy string as well as lists of protein IDs used for different phylogenies (i.e. 14 + 48 arCOGs used for protein phylogenies)

3. R scripts (including all needed input files) used to: 

  • generate tables and figures for the annotations, i.e. Figures 2 and 3, Supplementary Tables 7, 8, 9, 12, 13-15 and Supplementary Figures 60, 62-64. The input folder includes the raw output from the annotation workflow and includes annotations for the 12 UAP2 MAGs as well as 352 archaeal reference genomes.
  • generate tables and figures for the HGT analyses, i.e. Figure 4 and Supplementary Tables S20-22. Here, proteins based on arCOGs were extracted from 364 archaeal, 3020 bacterial and 98 eukaryotic genomes and used to generate single protein phylogenies. The resulting trees were used to investigate horizontal gene transfer events and the necessary scripts are provided in this folder.
  • generate tables and figures for the amino acid identity (AAI) comparisons, i.e. Supplementary Table S3 and Supplementary Figure S2.
  • rank the marker genes for concatenated species trees for the 127 and 364 taxa set. These were used to generate Supplementary Tables S4 and S5.
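
Per-marker statistics of the kind mentioned for the 0_parsing folder, such as average protein length and average bootstrap support, can be computed along the following lines. This is a Python sketch written for illustration, not one of the scripts in the archive. It assumes that support values appear in the Newick string as plain numeric labels directly after a closing bracket, which may not hold for every tree file, and both function names are hypothetical.

```python
import re
from statistics import mean


def mean_protein_length(fasta_text: str) -> float:
    """Average sequence length over all records in a FASTA string."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return mean(lengths)


def mean_support(newick: str) -> float:
    """Average of the numeric labels that directly follow a closing bracket."""
    values = [float(v) for v in re.findall(r"\)(\d+(?:\.\d+)?)", newick)]
    return mean(values)


# Toy inputs with made-up sequences and support values.
fasta = ">a\nMKV\nLL\n>b\nMK\n"
tree = "((A:0.1,B:0.2)90:0.05,(C:0.1,D:0.1)70:0.05);"
print(mean_protein_length(fasta))  # 3.5
print(mean_support(tree))          # 80.0
```

Both functions raise `statistics.StatisticsError` on input without any sequences or support values, which makes empty or malformed files visible instead of silently returning zero.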

General comment:

In contrast to the previous version, this dataset includes some small additional scripts generated during the revision process of the corresponding manuscript.



This work was supported by a grant of the Swedish Research Council (VR starting grant 2016-03559 to Anja Spang) and the NWO-I foundation of the Netherlands Organisation for Scientific Research (WISE fellowship to AS). Tom Williams was supported by a Royal Society University Research Fellowship. Benjamin Woodcroft was supported by an Australian Research Council Discovery Early Career Researcher Award (DE160100248). Chris Rinke was supported by an Australian Research Council (ARC) Future Fellowship (FT170100213).


Files (3.4 GB)

Size
137.1 MB
1.0 GB
2.2 GB

Additional details

Related works

Is supplement to
10.1101/2020.03.05.976373 (DOI)


Discovery Early Career Researcher Award - Grant ID: DE160100248
Australian Research Council
ARC Future Fellowships - Grant ID: FT170100213
Australian Research Council