Dominant contribution of Asgard archaea to eukaryogenesis (2024) Tobiasson, V., Koonin, E. PROCESSED DATA AND METADATA

Tobiasson, Victor

doi:10.5281/zenodo.15048010

Published March 18, 2025 | Version 03

Dataset Open

Tobiasson, Victor (Contact person)^{1, 2}

1. United States National Library of Medicine
2. National Institutes of Health

Contributors

1. National Institutes of Health
2. United States National Library of Medicine

Main data deposit for "Dominant contribution of Asgard archaea to eukaryogenesis".

Victor Tobiasson, Jacob Luo, Yuri I Wolf, Eugene V Koonin

Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

The origin of eukaryotes is one of the key problems in evolutionary biology. The demonstration that the Last Eukaryotic Common Ancestor (LECA) already contained the mitochondrion, an endosymbiotic organelle derived from an alphaproteobacterium, and the discovery of Asgard archaea, the closest archaeal relatives of eukaryotes, inform and constrain evolutionary scenarios of eukaryogenesis. We undertook a comprehensive analysis of the origins of the core eukaryotic genes tracing to the LECA within a rigorous statistical framework centered around evolutionary hypothesis testing using constrained phylogenetic trees. The results reveal dominant contributions of Asgard archaea to the origin of most of the conserved eukaryotic functional systems and pathways. A limited contribution from Alphaproteobacteria was identified, primarily relating to the energy transformation systems and Fe-S cluster biogenesis, whereas ancestry from other bacterial phyla was scattered across the eukaryotic functional landscape, without consistent trends. These findings suggest a model of eukaryogenesis in which key features of eukaryotic cell organization evolved in the Asgard ancestor, followed by the capture of the alphaproteobacterial endosymbiont, and augmented by numerous but sporadic horizontal acquisitions of genes from other bacteria both before and after endosymbiosis.

Version 0.3, updated 180325

Main data repository for:

Dominant contribution of Asgard archaea to eukaryogenesis (2024)

Tobiasson, V., Koonin, E.

Contains all final parsed data from the main Eukaryogenesis project

investigating the evolutionary ancestries of eukaryotic protein families.

Currently (non-static) available at:

https://www.biorxiv.org/content/10.1101/2024.10.14.618318v2

https://assets-eu.researchsquare.com/files/rs-5352492/v1/2f9c68ae-cf3e-420a-8d29-867b6fb1a878.pdf

All code used to generate the data present within this repository available at:

https://github.com/VictorTobiasson/eukgen

### General information

To identify associations between prokaryotic and eukaryotic protein families, separate

hidden Markov model (HMM) databases for prokaryotes and eukaryotes were constructed

using a custom, cascaded, sequence-to-profile clustering pipeline, implemented using

mmseqs2, followed by a multistep data-reduction and multiple sequence alignment (MSA)

procedure to generate HMM profiles using hhsuite.

A prokaryotic database of 37 million protein sequences was curated from prokaryotic

genomes obtained from the NCBI GenBank in November 2023 and supplemented with proteins

extracted from 146 Asgard genome assemblies. To avoid inclusion of genes present only

within a narrow subset of species, possibly resulting from horizontal transfer from

eukaryotes post LECA, we reconstructed the “soft-core” pangenome for each of the 26

curated prokaryotic taxonomic classes. These pangenomes include only those genes that

are present in at least 67% of the families within each class of Bacteria and Archaea.
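
The soft-core criterion described above can be illustrated with a minimal Python sketch. The input structures (`presence`, `class_of_family`) and the function name are hypothetical and are not part of this deposit or of the eukgen code; only the 67% threshold comes from the text.

```python
from collections import defaultdict


def soft_core(presence, class_of_family, threshold=0.67):
    """Return {class: set of gene clusters} passing the soft-core threshold.

    presence: dict, gene cluster id -> set of family names containing it
              (hypothetical input, not a file from this deposit).
    class_of_family: dict, family name -> taxonomic class name (hypothetical).
    """
    # Collect the families belonging to each class.
    families_per_class = defaultdict(set)
    for family, cls in class_of_family.items():
        families_per_class[cls].add(family)

    core = defaultdict(set)
    for gene, families in presence.items():
        # Count, per class, the families in which this gene cluster occurs.
        hits = defaultdict(int)
        for family in families:
            if family in class_of_family:
                hits[class_of_family[family]] += 1
        # Keep the gene for every class where it reaches the threshold.
        for cls, n in hits.items():
            if n / len(families_per_class[cls]) >= threshold:
                core[cls].add(gene)
    return dict(core)
```

Lowering `threshold` to 0.25 or 0.10 would give the looser pangenome definitions used for the alternative EPOC data files, under the same assumptions.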

The initial eukaryotic database consisted of 30 million protein sequences from 993

species taken from EukProt v3 and cleaned using mmseqs2 to remove likely prokaryotic

contaminants.

Both databases were clustered and MSAs constructed for all non-singleton clusters

and HMM profiles created. The resulting eukaryotic HMM dataset was queried against

the prokaryotic dataset using hhblits to identify sets of homologous protein sequences.

Each eukaryotic cluster and all its significant prokaryotic hits constituted an individual

sequence set, hereinafter referred to as a Eukaryotic/Prokaryotic Orthologous Cluster

(EPOC). The EPOCs constitute groups of homologous proteins from eukaryotes and prokaryotes

(each EPOC contains a unique set of eukaryotic proteins, but some clusters of prokaryotic

proteins can be present in multiple EPOCs) that were used for phylogenetic tree

construction, annotation, and evolutionary hypothesis testing.

To infer the most likely prokaryotic ancestry of the eukaryotic proteins in each EPOC,

rather than relying on the tree topology directly, we employed a probabilistic approach

for evolutionary hypothesis testing using constraint trees. We exhaustively sampled all

arrangements of likely sister clades and obtained Expected Likelihood Weights (ELW) for

the set of possible sister clade models. As the ELW metric is analogous to model selection

confidence, here we take it to be proportional to the probability of a sampled prokaryotic

clade to be the true sister group of the given eukaryotic clade among a set of competing

sister clades. For each EPOC, our analysis dynamically accounts for long branch outliers

and is robust to phylogenetically non-homogenous clades. This analysis is further capable

of resolving eukaryotic paraphyly, treating each eukaryotic clade within an EPOC as a

single datapoint for downstream analysis. Our resulting data contains EPOCs annotated

using profiles generated from KEGG Orthology Groups (KOGs), each with an MSA generated

using muscle5, a maximum likelihood tree inferred using IQtree2 and associated ELW values

for all candidate prokaryotic sister phyla. The analysis of prokaryotic ancestry was

performed only for those eukaryotic clades that included more than 5 distinct taxonomic

labels, with at least one coming from Amorphea and one from Diaphoretickes, the two

expansive eukaryotic clades considered to represent either the first or the second

bifurcation in the evolution of eukaryotes. Thus, these clades likely represent genes

mapping back to the LECA.
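
For readers unfamiliar with the ELW metric, the following Python sketch shows the general idea: per-site log-likelihoods are resampled with replacement (RELL bootstrap), each replicate is converted to normalized likelihood weights across the competing trees, and the weights are averaged over replicates. This is a simplified illustration, not the IQ-TREE implementation used to produce the deposited values; the function name, the input layout and the default number of replicates are assumptions.

```python
import math
import random


def expected_likelihood_weights(site_logl, n_boot=1000, seed=0):
    """Approximate ELW for competing trees from per-site log-likelihoods.

    site_logl: list with one entry per tree; each entry is a list of
               per-site log-likelihoods over the same alignment columns.
    Returns one weight per tree; the weights sum to 1.
    """
    rng = random.Random(seed)
    n_trees = len(site_logl)
    n_sites = len(site_logl[0])
    totals = [0.0] * n_trees
    for _ in range(n_boot):
        # RELL bootstrap: resample alignment columns with replacement.
        idx = [rng.randrange(n_sites) for _ in range(n_sites)]
        logl = [sum(tree[i] for i in idx) for tree in site_logl]
        # Normalized likelihood weights, shifted by the maximum for stability.
        best = max(logl)
        weights = [math.exp(value - best) for value in logl]
        norm = sum(weights)
        for t in range(n_trees):
            totals[t] += weights[t] / norm
    # Average the per-replicate weights over all bootstrap replicates.
    return [total / n_boot for total in totals]
```

In the deposited tables the corresponding values are found in the c-ELW column, one per candidate prokaryotic sister clade.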

For further details please see main publication or contact

victor.tobiasson@nih.gov

eugene.koonin@nih.gov

### Included files

Unless otherwise stated, all files are tab-separated and UTF-8 encoded,

with the first row containing header information.

All data entries encoding lists are “|” (pipe) separated.

Fields without data values are filled with string entries of “none”.
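
A minimal sketch of reading a file that follows these conventions, using only the Python standard library. The function name and the choice of which columns to treat as lists are assumptions; the reader should pass the list-valued columns relevant to the file at hand.

```python
import csv
import io


def read_deposit_tsv(handle, list_columns=()):
    """Yield one dict per row from a tab-separated file with a header row.

    Fields equal to "none" become None; columns named in list_columns
    are split on "|" into Python lists.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        parsed = {}
        for key, value in row.items():
            if value is None or value == "none":
                parsed[key] = None
            elif key in list_columns:
                parsed[key] = value.split("|")
            else:
                parsed[key] = value
        yield parsed


# Small in-memory example with made-up values.
example = "tree_name\teuk_scope\tprok_taxa\nT1\tA|B\tnone\n"
for record in read_deposit_tsv(io.StringIO(example), list_columns={"euk_scope"}):
    print(record)
```

For the real files, open them with `open(path, encoding="utf-8", newline="")` and pass the handle to the function; the generator reads row by row, which matters for the multi-gigabyte tables.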

--- Databases ---

euk72_ep.tar.gz

prok2311_as.tar.gz

Prok2311As_final_clusters.tsv

Euk72Ep_final_clusters.tsv

prok2311_as.hmmDB.tar.gz

euk72_ep.hmmDB.tar.gz

--- Annotation and Curation ---

NCBI_taxonomy_species_addendum.tsv

NCBI_taxonomy_class_addendum.tsv

Euk72Ep_Prok2311As_final_classes.tsv

Euk72Ep_Prok2311As_final_classes.GTDB.tsv

KEGG_category_mapping.tsv

KEGG_metadata.tsv

--- EPOC data ---

EPOC_data.tar.gz

EPOC_annotation_KEGG.tsv

EPOC_data.tsv

EPOC_data.pangenome_s10.tsv

EPOC_data.pangenome_s25.tsv

EPOC_data.pangenome_s67.tsv

EPOC_data.GTDB.tsv

EPOC_data.alpha_replicates.tsv

# euk72_ep.tar.gz

Gzip-compressed .tar archive containing a single directory with 10 files

constituting the initial eukaryotic mmseqs2 database with taxonomy annotation.

Constructed from a pre-selected list of 72 eukaryotic proteomes downloaded from

NCBI as well as a “clean” version of EukProt, lacking highly prokaryotic-like

contaminant sequences.

# prok2311_as.tar.gz

Gzip-compressed .tar archive containing a single directory with 10 files constituting the

initial prokaryotic mmseqs2 database with taxonomy annotation. Constructed from

47545 complete genomes retrieved from NCBI in November 2023.

# prok2311_as.hmmDB.tar.gz

Gzip-compressed .tar archive containing 6 files. Comprises an HHSuite database formatted

from prok2311_as non-singleton clusters, contains 26286 profiles.

# euk72_ep.hmmDB.tar.gz

Gzip-compressed .tar archive containing 6 files. Comprises an HHSuite database formatted

from euk72_ep non-singleton clusters, contains 1631704 profiles.

# NCBI_taxonomy_species_addendum.tsv

Taxonomy mapping file with manually curated ‘class’ level annotation for poorly

annotated species.

taxid: NCBI taxid

proposed_class_id: Manually assigned NCBI taxid

proposed_class_label: NCBI class name

org_name: NCBI organism name

# NCBI_taxonomy_class_addendum.tsv

Class revision file mapping poorly populated class level entries to higher order

manually curated labels. Also includes information for small classes with shallow

taxonomy which are deleted from the EPOC analysis at the level of tree construction.

taxid: NCBI taxid

ncbi_class: NCBI taxid of rank corresponding to ‘class’ following manual

amendment as per NCBI_taxonomy_species_addendum.tsv

revised_class_id: Manually assigned NCBI taxid of rank corresponding to ‘class’

revised_class_label: Proposed cleartext name of manually revised revised_class_id

# Euk72Ep_Prok2311As_final_classes.tsv

Final taxonomy at NCBI rank ‘class’ following revisions for all sequences in Euk72Ep or

Prok2311As. These taxonomic labels are used for EPOC tree annotation.

acc: mmseqs database header in either prok2311_as or euk72_ep databases

taxid: NCBI taxid for organism

superkingdom: Top level NCBI taxonomy classification Bacteria, Archaea or Eukarya,

used to define Eukaryotic outgroups in EPOC analysis

class: Cleartext name of manually revised NCBI rank ‘class’ identifier for annotation

# Euk72Ep_Prok2311As_final_classes.GTDB.tsv

Final taxonomy at GTDB rank ‘phylum’ transferred using marker genes from GTDB release 220

acc: mmseqs database header in either prok2311_as or euk72_ep databases

taxid: NCBI taxid for organism

superkingdom: Top level NCBI taxonomy classification Bacteria, Archaea or Eukarya,

used to define Eukaryotic outgroups in EPOC analysis

class: Cleartext name of assigned GTDB phylum

# Prok2311As_final_clusters.tsv

Cluster mapping file for accessions within the initial Prok2311As database to the

final clusters used for HMM creation

cluster_acc: cluster representative

acc: cluster member

# Euk72Ep_final_clusters.tsv

Cluster mapping file for accessions within the initial Euk72Ep database to the

final clusters used for HMM creation

cluster_acc: cluster representative

acc: cluster member
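
The two cluster mapping files share the same two-column layout, so a single loader can serve both. The sketch below groups members under their cluster representative; the function name is hypothetical and the example rows are made up.

```python
import csv
import io
from collections import defaultdict


def load_clusters(handle):
    """Map each cluster_acc (representative) to the list of its member accs."""
    clusters = defaultdict(list)
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        clusters[row["cluster_acc"]].append(row["acc"])
    return dict(clusters)


# In-memory example with made-up accessions.
example = "cluster_acc\tacc\nrepA\trepA\nrepA\tseq2\nrepB\trepB\n"
clusters = load_clusters(io.StringIO(example))
# Clusters with more than one member, i.e. the non-singleton clusters.
non_singletons = {rep: members for rep, members in clusters.items() if len(members) > 1}
print(non_singletons)
```

Note that Euk72Ep_final_clusters.tsv is several hundred megabytes, so holding the full mapping in memory may require a few gigabytes of RAM.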

# EPOC_data.tar.gz

Gzip-compressed .tar archive of a directory containing 16035 EPOC folders. Each folder is

named after the eukaryotic cluster representative that generated its profile. This ID

matches the tree_name field in EPOC_data_prok2311As.tsv

Each folder contains the following files:

<EPOC_ID>.merged.fasta: sequences for all members of the EPOC

<EPOC_ID>.merged.fasta.leaf_mapping: tab-separated file containing taxonomy and tree reduction data

<EPOC_ID>.merged.fasta.muscle: main cropped MSA for tree generation

<EPOC_ID>.merged.fasta.muscle.iqtree: IQtree2 output from tree generation

<EPOC_ID>.merged.fasta.muscle.treefile.annot: annotated newick tree file with final tree

<EPOC_ID>.merged.tree_data.tsv: final parsed tree data with columns matching EPOC_data_prok2311As.tsv

EPOCs with more than one possible eukaryotic sister phylum also contain

a folder "constraint_analysis" with constraint tree information used for

ELW value calculation.
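
The per-EPOC file layout listed above can be expressed as a small helper that builds the expected paths for a given EPOC ID and reports which files are absent. This is a sketch under the assumption that each EPOC folder sits directly below the extracted root directory; the root directory name and the function names are hypothetical.

```python
from pathlib import Path

# File suffixes as listed in this README for each EPOC folder.
EPOC_SUFFIXES = [
    ".merged.fasta",
    ".merged.fasta.leaf_mapping",
    ".merged.fasta.muscle",
    ".merged.fasta.muscle.iqtree",
    ".merged.fasta.muscle.treefile.annot",
    ".merged.tree_data.tsv",
]


def epoc_files(root, epoc_id):
    """Return {suffix: expected path} for one EPOC folder under root."""
    folder = Path(root) / epoc_id
    return {suffix: folder / f"{epoc_id}{suffix}" for suffix in EPOC_SUFFIXES}


def missing_files(root, epoc_id):
    """List the suffixes whose expected file does not exist on disk."""
    return [s for s, path in epoc_files(root, epoc_id).items() if not path.exists()]
```

The optional "constraint_analysis" folder is deliberately not checked here, since it is only present for a subset of EPOCs.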

# EPOC_data.tsv

Main resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs)

based on pangenomes defined as including 10% of species per class. This is the main

data to be used for generating the core dataset and for data visualisation.

Contains information regarding tree breakdown, LCA membership and phylogenetic

distances between all detected LCAs. Equivalent to the stacked dataframes from all

EPOC directories in EPOC_data

tree_name: unique index for each EPOC

euk_clade_rep: unique index for each annotated eukaryotic clade within each tree_name

euk_clade_size: number of original sequences represented by euk_clade_rep

euk_clade_weight: metric for taxonomic purity for each euk_clade_rep

euk_leaf_clade: boolean indicating whether euk_clade_rep contains a single leaf

euk_LCA: lowest taxon spanning all members in euk_clade_rep

euk_scope: list of all taxonomic classes in euk_clade_rep

euk_scope_len: length of euk_scope list

prok_clade_rep: unique index for each annotated prokaryotic clade for each euk_clade_rep

prok_clade_size: number of original sequences represented by prok_clade_rep

prok_clade_weight: metric for taxonomic purity for each prok_clade_rep

prok_leaf_clade: boolean indicating whether prok_clade_rep contains a single leaf

prok_taxa: lowest taxon spanning all members in prok_clade_rep

dist: tree-distance from lowest tree node containing all members of prok_clade_rep to lowest tree node containing all members of euk_clade_rep

top_dist: graph-distance (node-distance) from lowest tree node containing all members of prok_clade_rep to lowest tree node containing all members of euk_clade_rep

raw_stem_length: tree-distance from lowest tree node containing the union of all members of prok_clade_rep and euk_clade_rep to the tree node containing all members of euk_clade_rep

median_euk_leaf_dist: median value for all tree distances from the tree node containing all members of euk_clade_rep to the individual leaves

stem_length: raw_stem_length/median_euk_leaf_dist

logL: log likelihood of best constraint tree constructed

deltaL: log likelihood difference between constraint tree for prok_clade_rep and best constraint tree constructed

bp-RELL: validation metric from IQtree -trees, see iqtree.org

bp-RELL_accept: as above

p-KH: as above

p-KH_accept: as above

p-SH: as above

p-SH_accept: as above

c-ELW: as above

c-ELW_accept: as above

p-AU: as above

p-AU_accept: as above
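
As an illustration of how these columns fit together, the sketch below picks, for every eukaryotic clade, the candidate prokaryotic sister with the highest c-ELW. It assumes each row is a dict keyed by the column names above (for example from `csv.DictReader`), and that rows without constraint analysis carry "none" in c-ELW and should be skipped. Both the function name and the skipping rule are assumptions, not the authors' procedure.

```python
def best_sister(rows):
    """Return {(tree_name, euk_clade_rep): (prok_taxa, c-ELW)} with the top ELW."""
    best = {}
    for row in rows:
        raw = row["c-ELW"]
        # Skip rows with no ELW value (assumed to be encoded as "none").
        if raw is None or raw == "none":
            continue
        elw = float(raw)
        key = (row["tree_name"], row["euk_clade_rep"])
        if key not in best or elw > best[key][1]:
            best[key] = (row["prok_taxa"], elw)
    return best


# Made-up rows for illustration only.
rows = [
    {"tree_name": "T1", "euk_clade_rep": "E1", "prok_taxa": "taxonA", "c-ELW": "0.7"},
    {"tree_name": "T1", "euk_clade_rep": "E1", "prok_taxa": "taxonB", "c-ELW": "0.3"},
    {"tree_name": "T2", "euk_clade_rep": "E2", "prok_taxa": "taxonC", "c-ELW": "none"},
]
print(best_sister(rows))
```

Taking only the maximum discards the probabilistic information in the remaining ELW values, so analyses that weight every candidate sister by its c-ELW may be preferable depending on the question.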

# EPOC_data.pangenome_s10.tsv

Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated

based on pangenomes defined as including 10% of species per class.

Identical file structure to EPOC_data.tsv

# EPOC_data.pangenome_s25.tsv

Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated

based on pangenomes defined as including 25% of species per class.

Identical file structure to EPOC_data.tsv

# EPOC_data.pangenome_s67.tsv

Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated

based on pangenomes defined as including 67% of species per class.

Identical file structure to EPOC_data.tsv

# EPOC_data.GTDB.tsv

Resulting data from all Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated

under revised taxonomy from GTDB based on data from Euk72Ep_Prok2311As_final_classes.GTDB.tsv

Identical file structure to EPOC_data.tsv

# EPOC_data.alpha_replicates.tsv

Resulting data from 20 repetitions of Eukaryotic/Prokaryotic Orthologous Clusters (EPOCs) calculated

from a subset of Alphaproteobacterial-derived EPOCs.

Identical file structure to EPOC_data.tsv with the addition of:

rep: indicating technical replicate number, 0-19

# EPOC_annotation_KEGG.tsv

Parsed HHblits output of HMM profiles generated from KEGG KOGs (KEGG Orthologous Groups)

against eukaryotic profiles constituting each EPOC

Query: query name equal to tree_name from EPOC_data

Target: target name equal to kogid in KEGG_category_mapping and KEGG_metadata

Prob: data from HHblits, see https://github.com/soedinglab/hh-suite/wiki

E-value : as above

P-value : as above

Score: as above

SS: as above

Cols: as above

Identities: as above

Similarity: as above

Sum_probs: as above

Query-HMM-start: as above

Query-HMM-end: as above

Template-HMM-start: as above

Template-HMM-end: as above

Template_columns: as above

Template_Neff : as above

Pairwise_cov: calculated pairwise coverage from Query and Target start and end

Description: category_name from KEGG_category_mapping
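
One common use of this table is to assign a single KEGG annotation to each EPOC. The sketch below keeps, per Query, the Target with the highest Prob above a cutoff. It assumes Prob is on the HHsuite 0 to 100 scale; the cutoff of 90 and the function name are arbitrary illustrative choices, not values taken from the publication.

```python
def top_kegg_hits(rows, min_prob=90.0):
    """Return {Query: (Target, Prob)} keeping the highest-probability hit."""
    best = {}
    for row in rows:
        prob = float(row["Prob"])
        # Discard hits below the (assumed) probability cutoff.
        if prob < min_prob:
            continue
        query = row["Query"]
        if query not in best or prob > best[query][1]:
            best[query] = (row["Target"], prob)
    return best


# Made-up hits for illustration only.
hits = [
    {"Query": "T1", "Target": "K00001", "Prob": "99.5"},
    {"Query": "T1", "Target": "K00002", "Prob": "95.0"},
    {"Query": "T2", "Target": "K00003", "Prob": "40.0"},
]
print(top_kegg_hits(hits))
```

The selected Target can then be joined to KEGG_category_mapping.tsv and KEGG_metadata.tsv through the kogid column.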

# KEGG_category_mapping.tsv

Mapping of relevant KOG identifiers to their higher order categories

("Maps", "Modules" or "Reactions") as per KEGG, see https://www.kegg.jp/kegg/pathway.html

kogid: unique KOG identifier

category_id: KEGG map, module, or reaction number

category_name: cleartext name for category_id

# KEGG_metadata.tsv

File mapping KOGs to BRITE classification and to additional databases of chemical properties.

kogid: unique KOG identifier

name: cleartext name for KOG identifier

brite_A: list of BRITE-A sets including KOG

brite_B: list of BRITE-B sets including KOG

brite_C: list of BRITE-C sets including KOG

EC: list of Enzyme Commission numbers associated with KOG, see https://enzyme.expasy.org/

TC: list of transporter classification numbers associated with KOG, see https://www.tcdb.org/

RN: list of KEGG reaction numbers associated with KOG

CA: list of CAZy numbers associated with KOG, see http://www.cazy.org/

GO: list of GO terms associated with KOG, see https://geneontology.org/

Files (34.2 GB)

Name	MD5	Size
EPOC_annotation_KEGG.tsv	6ec9edd4a6fc43e0f1751f2a2c9c5760	281.6 MB
EPOC_data.alpha_replicates.tsv	60280afec39f7ba5f869f75afc25876f	22.4 MB
EPOC_data.GTDB.tsv	74edb3e9248fb5d1c06ff59cafc87b7d	11.7 MB
EPOC_data.pangenome_s0.tsv	188ba087967a08c906d81d2a288491c4	20.6 MB
EPOC_data.pangenome_s10.tsv	597307ea46cdd39d99b6f47889ed7ed4	55.3 MB
EPOC_data.pangenome_s25.tsv	677489abf609f8605212c7abc170a5f5	54.7 MB
EPOC_data.pangenome_s67.tsv	135ccf30043a14e96a31f41701468010	46.1 MB
EPOC_data.tar.gz	059badacd7238a4ef553122fe5feda76	1.2 GB
EPOC_data.tsv	60981b0d9f8047722833e65c61771656	43.0 MB
euk72_ep.hmmDB.tar.gz	99ec4522b94275be54c2acf32aa49df2	12.2 GB
euk72_ep.tar.gz	e97552772099ef94dab6a3a450a687cc	6.8 GB
Euk72Ep_final_clusters.tsv	5fe86c7197aa01ee47e26f524cff354d	753.4 MB
Euk72Ep_Prok2311As_final_classes.GTDB.tsv	4aba52b0119b9c201b25851bb674190b	4.8 GB
Euk72Ep_Prok2311As_final_classes.tsv	79af5d151cfd8e3a5f070acefad6e051	2.2 GB
KEGG_category_mapping.tsv	75c274bb2895f4b7c828aeb6f7f06c77	2.9 MB
KEGG_metadata.tsv	b424fb0de853c8e921cef070a231aca3	3.3 MB
NCBI_taxonomy_class_addendum.tsv	60f5c488822748e9bc23fdab4ed89960	14.0 kB
NCBI_taxonomy_species_addendum.tsv	ea874f07d7f51457597d720df52717c6	25.2 kB
prok2311_as.hmmDB.tar.gz	ec3614a6959deae31b58da423a78ab2e	363.8 MB
prok2311_as.tar.gz	1ce3e9549ef53f78df2cd600b8b91fcf	5.2 GB
Prok2311As_final_clusters.tsv	4f20e3356ea35c5d8d8ae3eedf367890	185.9 MB
READMEv03.txt	79ed3a7d1bbcb38b67eeb828c552faca	15.3 kB
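
Given the size of several archives, it may be worth checking each download against the md5 checksums listed above. A minimal sketch using Python's hashlib; the function names are hypothetical, and files are read in chunks so that multi-gigabyte archives do not need to fit in memory.

```python
import hashlib
import io


def md5_of_stream(handle, chunk_size=1 << 20):
    """Return the hex md5 digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Return the hex md5 digest of the file at path."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


# Self-check on in-memory data; for a real download compare, for example,
# md5_of_file("NCBI_taxonomy_class_addendum.tsv") with
# "60f5c488822748e9bc23fdab4ed89960" from the file list.
print(md5_of_stream(io.BytesIO(b"abc")))
```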

Additional details

Is supplement to: Publication: 10.1101/2024.10.14.618318 (DOI)

Updated: 2024-03-22

Revised submission

Repository URL: https://github.com/VictorTobiasson/eukgen
Programming language: Python
Development Status: Active
