Published March 18, 2025 | Version 03
Dataset Open

Dominant contribution of Asgard archaea to eukaryogenesis (2024) Tobiasson, V., Koonin, E. PROCESSED DATA AND METADATA

  • 1. United States National Library of Medicine
  • 2. National Institutes of Health

Contributors

Contact person:

Data collector:

Related person:

  • 1. National Institutes of Health
  • 2. United States National Library of Medicine

Description

Main data deposit for "Dominant contribution of Asgard archaea to eukaryogenesis". 

Victor Tobiasson, Jacob Luo, Yuri I Wolf, Eugene V Koonin

Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

The origin of eukaryotes is one of the key problems in evolutionary biology. The demonstration that the Last Eukaryotic Common Ancestor (LECA) already contained the mitochondrion, an endosymbiotic organelle derived from an alphaproteobacterium, and the discovery of Asgard archaea, the closest archaeal relatives of eukaryotes, inform and constrain evolutionary scenarios of eukaryogenesis. We undertook a comprehensive analysis of the origins of the core eukaryotic genes tracing to the LECA within a rigorous statistical framework centered around evolutionary hypothesis testing using constrained phylogenetic trees. The results reveal dominant contributions of Asgard archaea to the origin of most of the conserved eukaryotic functional systems and pathways. A limited contribution from Alphaproteobacteria was identified, primarily relating to the energy transformation systems and Fe-S cluster biogenesis, whereas ancestry from other bacterial phyla was scattered across the eukaryotic functional landscape, without consistent trends. These findings suggest a model of eukaryogenesis in which key features of eukaryotic cell organization evolved in the Asgard ancestor, followed by the capture of the alphaproteobacterial endosymbiont, and augmented by numerous but sporadic horizontal acquisitions of genes from other bacteria both before and after endosymbiosis.

Version 0.3, updated 180325
 
 
Main data repository for:
Dominant contribution of Asgard archaea to eukaryogenesis (2024) 
Tobiasson, V., Koonin, E.
 
Contains all final parsed data from the main Eukaryogenesis project 
investigating the evolutionary ancestries of eukaryotic protein families. 
 
Currently (non-static) available at: 
https://www.biorxiv.org/content/10.1101/2024.10.14.618318v2
https://assets-eu.researchsquare.com/files/rs-5352492/v1/2f9c68ae-cf3e-420a-8d29-867b6fb1a878.pdf
 
All code used to generate the data in this repository is available at: 
https://github.com/VictorTobiasson/eukgen 
 
 
### General information
 
To identify associations between prokaryotic and eukaryotic protein families, separate
hidden Markov model (HMM) databases for prokaryotes and eukaryotes were constructed 
using a custom, cascaded, sequence-to-profile clustering pipeline, implemented using 
mmseqs2, followed by a multistep data-reduction and multiple sequence alignment (MSA) 
procedure to generate HMM profiles using hhsuite. 
 
A prokaryotic database of 37 million protein sequences was curated from prokaryotic 
genomes obtained from the NCBI GenBank in November 2023 and supplemented with proteins 
extracted from 146 Asgard genome assemblies. To avoid inclusion of genes present only 
within a narrow subset of species, possibly resulting from horizontal transfer from 
eukaryotes post LECA, we reconstructed the “soft-core” pangenome for each of the 26 
curated prokaryotic taxonomic classes. These pangenomes include only those genes that 
are present in at least 67% of the families within each class of Bacteria and Archaea. 
The initial eukaryotic database consisted of 30 million protein sequences from 993 
species taken from EukprotV3 and cleaned using mmseqs2 to remove likely prokaryotic 
contaminants. 
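
The "soft-core" filter described above can be illustrated with a short sketch. This is not the eukgen implementation: the function name `soft_core`, the input layout (a dict mapping each family in one class to the set of gene clusters found in it) and the toy identifiers are all assumptions made for illustration. Only the 67% threshold comes from the text.

```python
from collections import Counter

def soft_core(presence_by_family, min_fraction=0.67):
    """Return gene clusters present in at least min_fraction of the families.

    presence_by_family: hypothetical layout, {family name: set of cluster IDs}
    for the families of a single taxonomic class.
    """
    n_families = len(presence_by_family)
    if n_families == 0:
        return set()
    counts = Counter()
    for clusters in presence_by_family.values():
        counts.update(set(clusters))  # count each cluster once per family
    return {c for c, n in counts.items() if n / n_families >= min_fraction}

# Made-up example: four families in one class.
toy_class = {
    "family_A": {"g1", "g2", "g3"},
    "family_B": {"g1", "g2", "g3"},
    "family_C": {"g1", "g2"},
    "family_D": {"g1", "g4"},
}
core = soft_core(toy_class)  # g1 is in 4/4 families, g2 in 3/4, g3 in 2/4, g4 in 1/4
```

In this toy case only g1 and g2 pass the threshold. Lowering `min_fraction` gives a more permissive pangenome.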
 
Both databases were clustered, MSAs were constructed for all non-singleton clusters, 
and HMM profiles were created. The resulting eukaryotic HMM dataset was queried against 
the prokaryotic dataset using hhblits to identify sets of homologous protein sequences. 
Each eukaryotic cluster and all its significant prokaryotic hits constituted an individual 
sequence set, hereinafter referred to as a Eukaryotic/Prokaryotic Orthologous Cluster 
(EPOC). The EPOCs constitute groups of homologous proteins from eukaryotes and prokaryotes 
(each EPOC contains a unique set of eukaryotic proteins, but some clusters of prokaryotic 
proteins can be present in multiple EPOCs) that were used for phylogenetic tree 
construction, annotation, and evolutionary hypothesis testing. 
 
To infer the most likely prokaryotic ancestry of the eukaryotic proteins in each EPOC, 
rather than relying on the tree topology directly, we employed a probabilistic approach 
for evolutionary hypothesis testing using constraint trees. We exhaustively sampled all 
arrangements of likely sister clades and obtained Expected Likelihood Weights (ELW) for 
the set of possible sister clade models. As the ELW metric is analogous to model selection 
confidence, here we take it to be proportional to the probability of a sampled prokaryotic 
clade to be the true sister group of the given eukaryotic clade among a set of competing 
sister clades. For each EPOC, our analysis dynamically accounts for long branch outliers 
and is robust to phylogenetically non-homogeneous clades. This analysis is further capable 
of resolving eukaryotic paraphyly, treating each eukaryotic clade within an EPOC as a 
single datapoint for downstream analysis. Our resulting data contains EPOCs annotated 
using profiles generated from KEGG Orthology Groups (KOGs), each with an MSA generated 
using muscle5, a maximum likelihood tree inferred using IQtree2 and associated ELW values 
for all candidate prokaryotic sister phyla. The analysis of prokaryotic ancestry was 
performed only for those eukaryotic clades that included more than 5 distinct taxonomic 
labels, with at least one coming from Amorphea and one from Diaphoretickes, the two 
expansive eukaryotic clades considered to represent either the first or the second 
bifurcation in the evolution of eukaryotes. Thus, these clades likely represent genes 
mapping back to the LECA.
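
For readers unfamiliar with ELW, the following sketch shows the general definition of expected likelihood weights: within each bootstrap replicate the likelihoods of the competing trees are normalized to sum to one, and these weights are then averaged over replicates. The values deposited here were produced by IQtree2; the function below is only an illustration of the definition, and its name, input layout and toy numbers are assumptions.

```python
import math

def expected_likelihood_weights(replicate_logls):
    """Average normalized likelihood weights over bootstrap replicates.

    replicate_logls: list of replicates, each a list of log-likelihoods,
    one per candidate constraint tree (same order in every replicate).
    """
    n_models = len(replicate_logls[0])
    totals = [0.0] * n_models
    for logls in replicate_logls:
        best = max(logls)  # subtract the maximum for numerical stability
        weights = [math.exp(value - best) for value in logls]
        norm = sum(weights)
        for i, weight in enumerate(weights):
            totals[i] += weight / norm
    return [total / len(replicate_logls) for total in totals]

# Made-up log-likelihoods for three candidate sister-clade trees, three replicates.
toy = [
    [-1000.0, -1002.0, -1010.0],
    [-1001.0, -1000.5, -1009.0],
    [-999.0, -1003.0, -1011.0],
]
elw = expected_likelihood_weights(toy)
```

The resulting weights sum to one across the candidate trees, which is what allows them to be read as relative support for each candidate sister clade.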
 
For further details please see the main publication or contact
victor.tobiasson@nih.gov
eugene.koonin@nih.gov
 
 
### Included files
 
Unless otherwise stated, all files are tab-separated and UTF-8 encoded, 
with the first row containing header information. 
All data entries encoding lists are “|” (pipe) separated. 
Fields without data values are filled with string entries of “none”.
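
These conventions can be handled with the Python standard library alone. The sketch below is one minimal way to do so; the function name `read_table`, the choice to turn "none" into `None` (or an empty list for list columns), and the toy values are illustrative assumptions and not part of the deposit. Which columns hold lists must be supplied by the caller.

```python
import csv
import io

def read_table(handle, list_columns=()):
    """Yield one dict per row of a tab-separated file with a header row.

    Assumptions (illustrative only): the string "none" marks a missing value;
    columns named in list_columns are split on the pipe character.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        parsed = {}
        for key, value in row.items():
            if value == "none":
                parsed[key] = [] if key in list_columns else None
            elif key in list_columns:
                parsed[key] = value.split("|")
            else:
                parsed[key] = value
        yield parsed

# Tiny in-memory example with made-up values (not real data from the deposit).
example = io.StringIO(
    "kogid\tname\tEC\tGO\n"
    "K_EXAMPLE\texample enzyme\t1.1.1.1|1.1.1.2\tnone\n"
)
rows = list(read_table(example, list_columns={"EC", "GO"}))
```

For a file on disk, the handle would be opened with `open(path, encoding="utf-8", newline="")`.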
 
--- Databases ---
euk72_ep.tar.gz
prok2311_as.tar.gz
Prok2311As_final_clusters.tsv
Euk72Ep_final_clusters.tsv
prok2311_as.hmmDB.tar.gz
euk72_ep.hmmDB.tar.gz
 
--- Annotation and Curation ---
NCBI_taxonomy_species_addendum.tsv
NCBI_taxonomy_class_addendum.tsv
Euk72Ep_Prok2311As_final_classes.tsv
Euk72Ep_Prok2311As_final_classes.GTDB.tsv
KEGG_category_mapping.tsv
KEGG_metadata.tsv
 
--- EPOC data ---
EPOC_data.tar.gz
EPOC_annotation_KEGG.tsv
EPOC_data.tsv
EPOC_data.pangenomes_s10.tsv
EPOC_data.pangenomes_s25.tsv
EPOC_data.pangenomes_s67.tsv
EPOC_data.GTDB.tsv
EPOC_data.alpha_replicates.tsv
 
# euk72_ep.tar.gz
Gzip-compressed .tar archive containing a single directory with 10 files 
constituting the initial eukaryotic mmseqs2 database with taxonomy annotation. 
Constructed from a pre-selected list of 72 eukaryotic proteomes downloaded from 
NCBI as well as a “clean” version of Eukprot, lacking highly prokaryotic-like 
contaminant sequences. 
 
# prok2311_as.tar.gz
Gzip-compressed .tar archive containing a single directory with 10 files constituting the 
initial prokaryotic mmseqs2 database with taxonomy annotation. Constructed from 
47545 complete genomes retrieved from NCBI in November 2023. 
 
# prok2311_as.hmmDB.tar.gz
Gzip-compressed .tar archive containing 6 files. Comprises an HH-suite database formatted 
from prok2311_as non-singleton clusters; contains 26286 profiles.
 
# euk72_ep.hmmDB.tar.gz
Gzip-compressed .tar archive containing 6 files. Comprises an HH-suite database formatted 
from euk72_ep non-singleton clusters; contains 1631704 profiles.
 
# NCBI_taxonomy_species_addendum.tsv
Taxonomy mapping file with manually curated ‘class’ level annotation for poorly 
annotated species. 
 
taxid: NCBI taxid
proposed_class_id: Manually assigned NCBI taxid
proposed_class_label: NCBI class name
org_name: NCBI organism name
 
# NCBI_taxonomy_class_addendum.tsv
Class revision file mapping poorly populated class level entries to higher order 
manually curated labels. Also includes information for small classes with shallow 
taxonomy which are deleted from the EPOC analysis at the level of tree construction.
 
taxid: NCBI taxid
ncbi_class: NCBI taxid of rank corresponding to ‘class’ following manual 
amendment as per NCBI_taxonomy_species_addendum.tsv
revised_class_id: Manually assigned NCBI taxid of rank corresponding to ‘class’
revised_class_label: Proposed cleartext name of manually revised revised_class_id 
 
# Euk72Ep_Prok2311As_final_classes.tsv
Final taxonomy at NCBI rank ‘class’ following revisions for all sequences in Euk72Ep or 
Prok2311As. These taxonomic labels are used for EPOC tree annotation. 
 
acc: mmseqs database header in either prok2311_as or euk72_ep databases
taxid: NCBI taxid for organism
superkingdom: Top level NCBI taxonomy classification Bacteria, Archaea or Eukarya, 
used to define Eukaryotic outgroups in EPOC analysis
class: Cleartext name of manually revised NCBI rank ‘class’ identifier for annotation
 
# Euk72Ep_Prok2311As_final_classes.GTDB.tsv
Final taxonomy at GTDB rank ‘phylum’ transferred using marker genes from GTDB release 220
 
acc: mmseqs database header in either prok2311_as or euk72_ep databases
taxid: NCBI taxid for organism
superkingdom: Top level NCBI taxonomy classification Bacteria, Archaea or Eukarya, 
used to define Eukaryotic outgroups in EPOC analysis
class: Cleartext name of assigned GTDB phylum
 
# Prok2311As_final_clusters.tsv
Cluster mapping file for accessions within the initial Prok2311As database to the 
final clusters used for HMM creation.
 
cluster_acc: cluster representative
acc: cluster member
 
# Euk72Ep_final_clusters.tsv
Cluster mapping file for accessions within the initial Euk72Ep database to the 
final clusters used for HMM creation.
 
cluster_acc: cluster representative
acc: cluster member
 
# EPOC_data.tar.gz
Gzip-compressed archive of a directory containing 16035 EPOC folders. Each folder is 
named after the eukaryotic cluster representative that generated its profile; this ID 
matches the tree_name field in EPOC_data_prok2311As.tsv. Each folder 
contains the following files:
 
<EPOC_ID>.merged.fasta: sequences for all members of the EPOC
<EPOC_ID>.merged.fasta.leaf_mapping: tab-separated file containing taxonomy and tree reduction data
<EPOC_ID>.merged.fasta.muscle: main cropped MSA for tree generation
<EPOC_ID>.merged.fasta.muscle.iqtree: IQtree2 output from tree generation
<EPOC_ID>.merged.fasta.muscle.treefile.annot: annotated newick tree file with final tree
<EPOC_ID>.merged.tree_data.tsv: final parsed tree data with columns matching EPOC_data_prok2311As.tsv
 
EPOCs with more than one possible eukaryotic sister phylum also contain 
a folder "constraint_analysis" with constraint tree information used for 
ELW value calculation. 
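
The naming scheme above makes it straightforward to construct the expected file paths for a given EPOC. The sketch below assumes the files sit directly inside a folder named after the EPOC ID; the root directory name "EPOC_data", the function name `epoc_paths` and the ID "EXAMPLE_ID" are placeholders, not values taken from the archive.

```python
from pathlib import PurePosixPath

# Suffixes copied from the file list above.
EPOC_SUFFIXES = (
    ".merged.fasta",
    ".merged.fasta.leaf_mapping",
    ".merged.fasta.muscle",
    ".merged.fasta.muscle.iqtree",
    ".merged.fasta.muscle.treefile.annot",
    ".merged.tree_data.tsv",
)

def epoc_paths(root, epoc_id):
    """Return the expected per-EPOC file paths under root/<EPOC_ID>/."""
    folder = PurePosixPath(root) / epoc_id
    return [folder / f"{epoc_id}{suffix}" for suffix in EPOC_SUFFIXES]

# "EPOC_data" and "EXAMPLE_ID" are hypothetical placeholders.
paths = epoc_paths("EPOC_data", "EXAMPLE_ID")
```

The optional "constraint_analysis" folder is not covered here, since its contents are not itemized in this README.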
 
# EPOC_data.tsv
Main resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) 
based on pangenomes defined as including 10% of species per class. This is the main 
data to be used for generating the core dataset and for data visualisation. 
Contains information regarding tree breakdown, LCA membership and phylogenetic 
distances between all detected LCAs. Equivalent to the stacked dataframes from all 
EPOC directories in EPOC_data. 
 
tree_name: unique index for each EPOC 
euk_clade_rep: unique index for each annotated eukaryotic clade within each tree_name
euk_clade_size: number of original sequences represented by euk_clade_rep
euk_clade_weight: metric for taxonomic purity for each euk_clade_rep
euk_leaf_clade: boolean indicating whether euk_clade_rep contains a single leaf
euk_LCA: lowest taxon spanning all members in euk_clade_rep
euk_scope: list of all taxonomic classes in euk_clade_rep
euk_scope_len: length of euk_scope list
prok_clade_rep: unique index for each annotated prokaryotic clade for each euk_clade_rep
prok_clade_size: number of original sequences represented by prok_clade_rep
prok_clade_weight: metric for taxonomic purity for each prok_clade_rep
prok_leaf_clade: boolean indicating whether prok_clade_rep contains a single leaf
prok_taxa: lowest taxon spanning all members in prok_clade_rep
dist: tree-distance from lowest tree node containing all members of prok_clade_rep to lowest tree node containing all members of euk_clade_rep
top_dist: graph-distance (node-distance) from lowest tree node containing all members of prok_clade_rep to lowest tree node containing all members of euk_clade_rep
raw_stem_length: tree-distance from lowest tree node containing the union of all members of prok_clade_rep and euk_clade_rep to the tree node containing all members of euk_clade_rep
median_euk_leaf_dist: median value for all tree distances from the tree node containing all members of euk_clade_rep to the individual leaves
stem_length: raw_stem_length/median_euk_leaf_dist
logL: log likelihood of best constraint tree constructed
deltaL: log likelihood difference between constraint tree for prok_clade_rep and best constraint tree constructed
bp-RELL: validation metric from IQtree -trees, see iqtree.org
bp-RELL_accept: as above
p-KH: as above
p-KH_accept: as above
p-SH: as above
p-SH_accept: as above
c-ELW: as above
c-ELW_accept: as above
p-AU: as above
p-AU_accept: as above
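
As an example of how these columns might be combined, the sketch below picks, for each eukaryotic clade, the candidate prokaryotic sister with the highest c-ELW. It assumes that each row describes one (euk_clade_rep, prok_clade_rep) pair, that c-ELW parses as a float, and that rows lacking a value carry the string "none"; the function name `best_sister` and the toy rows are illustrative only and do not reproduce the analysis in the publication.

```python
def best_sister(rows):
    """Map (tree_name, euk_clade_rep) to the (prok_taxa, c-ELW) pair with the highest c-ELW."""
    best = {}
    for row in rows:
        if row["c-ELW"] == "none":  # no constraint test recorded for this row
            continue
        key = (row["tree_name"], row["euk_clade_rep"])
        score = float(row["c-ELW"])
        if key not in best or score > best[key][1]:
            best[key] = (row["prok_taxa"], score)
    return best

# Made-up rows using the column names described above.
toy_rows = [
    {"tree_name": "T1", "euk_clade_rep": "E1", "prok_taxa": "clade_X", "c-ELW": "0.70"},
    {"tree_name": "T1", "euk_clade_rep": "E1", "prok_taxa": "clade_Y", "c-ELW": "0.30"},
    {"tree_name": "T2", "euk_clade_rep": "E2", "prok_taxa": "clade_Z", "c-ELW": "none"},
]
best = best_sister(toy_rows)
```

Taking only the top candidate discards the remaining weight; an analysis that treats the ELW values as probabilities would instead keep all candidates with their weights.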
 
# EPOC_data.pangenomes_s10.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated 
based on pangenomes defined as including 10% of species per class.
Identical file structure to EPOC_data.tsv
 
# EPOC_data.pangenomes_s25.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated 
based on pangenomes defined as including 25% of species per class.
Identical file structure to EPOC_data.tsv
 
# EPOC_data.pangenomes_s67.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated 
based on pangenomes defined as including 67% of species per class.
Identical file structure to EPOC_data.tsv
 
# EPOC_data.GTDB.tsv
Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated 
under revised taxonomy from GTDB based on data from Euk72Ep_Prok2311As_final_classes.GTDB.tsv
Identical file structure to EPOC_data.tsv
 
# EPOC_data.alpha_replicates.tsv
Resulting data from 20 repetitions of Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated 
from a subset of Alphaproteobacterial-derived EPOCs. 
Identical file structure to EPOC_data.tsv with the addition of:
 
rep: indicating technical replicate number, 0-19
 
# EPOC_annotation_KEGG.tsv
Parsed HHblits output of HMM profiles generated from KEGG KOGs (KEGG Orthologous Groups) 
against eukaryotic profiles constituting each EPOC
 
Query: query name equal to tree_name from EPOC_data
Target: target name equal to kogid in KEGG_category_mapping and KEGG_metadata
Prob: data from HHblits, see https://github.com/soedinglab/hh-suite/wiki
E-value: as above
P-value: as above
Score: as above
SS: as above
Cols: as above
Identities: as above
Similarity: as above
Sum_probs: as above
Query-HMM-start: as above
Query-HMM-end: as above
Template-HMM-start: as above
Template-HMM-end: as above
Template_columns: as above
Template_Neff: as above
Pairwise_cov: calculated pairwise coverage from Query and Target start and end
Description: category_name from KEGG_category_mapping
 
# KEGG_category_mapping.tsv
Mapping of relevant KOG identifiers to their higher order categories as 
"Maps", "Modules" or "Reactions" as per KEGG; see https://www.kegg.jp/kegg/pathway.html
 
kogid: unique KOG identifier
category_id: KEGG map, module, or reaction number
category_name: cleartext name for KOG identifier
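
Because Target in EPOC_annotation_KEGG.tsv equals kogid in this file, and Query equals tree_name, the two tables can be joined to list KEGG categories per EPOC. The sketch below shows one way to do this on rows already parsed into dicts; the function name `categories_by_epoc` and all toy identifiers are assumptions, and no filtering on hit quality (for example on Prob) is applied.

```python
from collections import defaultdict

def categories_by_epoc(annotation_rows, mapping_rows):
    """Collect category_name values per EPOC via Target -> kogid."""
    by_kog = defaultdict(set)
    for row in mapping_rows:
        by_kog[row["kogid"]].add(row["category_name"])
    by_epoc = defaultdict(set)
    for row in annotation_rows:
        by_epoc[row["Query"]] |= by_kog.get(row["Target"], set())
    return {epoc: sorted(names) for epoc, names in by_epoc.items()}

# Made-up identifiers for illustration only.
toy_mapping = [
    {"kogid": "K_A", "category_id": "map_1", "category_name": "Pathway one"},
    {"kogid": "K_A", "category_id": "M_1", "category_name": "Module one"},
    {"kogid": "K_B", "category_id": "map_2", "category_name": "Pathway two"},
]
toy_annotation = [
    {"Query": "EPOC_1", "Target": "K_A"},
    {"Query": "EPOC_2", "Target": "K_B"},
    {"Query": "EPOC_2", "Target": "K_missing"},
]
epoc_categories = categories_by_epoc(toy_annotation, toy_mapping)
```

A KOG can belong to several categories, so each EPOC maps to a list of category names and not to a single value.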
 
# KEGG_metadata.tsv
File mapping KOGs to BRITE classification and to additional databases of chemical properties.
 
kogid: unique KOG identifier
name: cleartext name for KOG identifier
brite_A: list of BRITE-A sets including KOG
brite_B: list of BRITE-B sets including KOG
brite_C: list of BRITE-C sets including KOG
EC: list of Enzyme commission numbers associated with KOG, see https://enzyme.expasy.org/
TC: list of transporter classification numbers associated with KOG, see https://www.tcdb.org/
RN: list of KEGG reaction numbers associated with KOG
CA: list of CAZY numbers associated with KOG, see http://www.cazy.org/
GO: list of GO terms associated with KOG, see https://geneontology.org/
 

Files

READMEv03.txt

Files (34.2 GB)

Checksum	Size
md5:6ec9edd4a6fc43e0f1751f2a2c9c5760	281.6 MB
md5:60280afec39f7ba5f869f75afc25876f	22.4 MB
md5:74edb3e9248fb5d1c06ff59cafc87b7d	11.7 MB
md5:188ba087967a08c906d81d2a288491c4	20.6 MB
md5:597307ea46cdd39d99b6f47889ed7ed4	55.3 MB
md5:677489abf609f8605212c7abc170a5f5	54.7 MB
md5:135ccf30043a14e96a31f41701468010	46.1 MB
md5:059badacd7238a4ef553122fe5feda76	1.2 GB
md5:60981b0d9f8047722833e65c61771656	43.0 MB
md5:99ec4522b94275be54c2acf32aa49df2	12.2 GB
md5:e97552772099ef94dab6a3a450a687cc	6.8 GB
md5:5fe86c7197aa01ee47e26f524cff354d	753.4 MB
md5:4aba52b0119b9c201b25851bb674190b	4.8 GB
md5:79af5d151cfd8e3a5f070acefad6e051	2.2 GB
md5:75c274bb2895f4b7c828aeb6f7f06c77	2.9 MB
md5:b424fb0de853c8e921cef070a231aca3	3.3 MB
md5:60f5c488822748e9bc23fdab4ed89960	14.0 kB
md5:ea874f07d7f51457597d720df52717c6	25.2 kB
md5:ec3614a6959deae31b58da423a78ab2e	363.8 MB
md5:1ce3e9549ef53f78df2cd600b8b91fcf	5.2 GB
md5:4f20e3356ea35c5d8d8ae3eedf367890	185.9 MB
md5:79ed3a7d1bbcb38b67eeb828c552faca	15.3 kB
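
The md5 checksums listed above can be used to confirm that a download completed intact. The sketch below streams a file through `hashlib.md5` so that multi-gigabyte archives need not fit in memory; the function names and the throwaway demonstration file are assumptions for illustration, and the listing does not state which checksum belongs to which file name.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listing(path, listed):
    """Compare a file against a listed checksum such as 'md5:6ec9...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected

# Demonstration on a throwaway file, not on a file from the deposit.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
    demo_path = tmp.name
demo_listing = "md5:" + hashlib.md5(b"example bytes").hexdigest()
demo_ok = matches_listing(demo_path, demo_listing)
os.remove(demo_path)
```

For a real download, `matches_listing` would be called with the local file path and the checksum string shown next to that file on the record page.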

Additional details

Related works

Is supplement to
Publication: 10.1101/2024.10.14.618318 (DOI)

Dates

Updated
2024-03-22
Revised submission

Software

Repository URL
https://github.com/VictorTobiasson/eukgen
Programming language
Python
Development Status
Active