Inferring and comparing metabolisms across heterogeneous sets of annotated genomes using AuCoMe

Belcour, Arnaud; Got, Jeanne; Aite, Méziane; Delage, Ludovic; Collen, Jonas; Frioux, Clémence; Leblanc, Catherine; Dittami, Simon M.; Blanquart, Samuel; Markov, Gabriel V.; Siegel, Anne

doi:10.5281/zenodo.7387234

Published December 1, 2022 | Version 3.0

Dataset Open

1. Univ Rennes, Inria, CNRS, IRISA, F-35000 Rennes, France
2. Sorbonne Université, CNRS, Integrative Biology of Marine Models (LBI2M), Station Biologique de Roscoff (SBR), 29680 Roscoff, France
3. Inria, INRAE, Université de Bordeaux, France

CONTENT OF THIS ARCHIVE

The Zenodo archive is composed of one file and five main directories:
* analyses this directory contains all tabulated files used to create the figures and results of the paper.

* aucome_v0.5.1 this directory contains the code of AuCoMe used to run the three datasets.

* datasets this directory gathers all datasets on which AuCoMe was run: the bacterial, fungal, and algal datasets, and the 32 synthetic datasets, which contain an E. coli K–12 MG1655 genome to which various degradations were applied, together with 28 other bacterial genomes.

* metacyc_23.5.padmet version 23.5 of the MetaCyc database (https://metacyc.org/) in the PADMET format. It was used by AuCoMe to reconstruct all the metabolic networks, so metacyc_23.5.padmet is required to reproduce the results of the article.

* padmet_v5.0.1 this directory contains the code of PADMET used to run AuCoMe.

* scripts this directory contains several scripts to generate figures and a script to degrade the E. coli K–12 MG1655 genome.

1/ Content of the analyses subdirectory
* figure_2_bacterial_nb_reactions.tsv for each species of the bacterial dataset, this file gives the number of reactions at each AuCoMe step. It was used to create figure 2B.

* figure_2_fungal_nb_reactions.tsv for each species of the fungal dataset, this file gives the number of reactions at each AuCoMe step. It was used to create figure 2C.

* figure_2_algal_nb_reactions.tsv for each species of the algal dataset, this file gives the number of reactions at each AuCoMe step. It was used to create figure 2D.

* figure_3_nb_reactions_step.tsv for each dataset of the 32 synthetic bacterial datasets, this file enumerates the number of reactions at each AuCoMe step. It was used to create figure 3A.

* figure_3_fmeasure_steps.tsv for each dataset of the 32 synthetic bacterial datasets, this file gives the F-measures resulting from the comparison of the GSMNs recovered for each E. coli K–12 MG1655 genome replicate with the gold-standard network EcoCyc. It was used to create figure 3B.

* figure_S4_Deepec_fungal.tsv for each species of the fungal dataset, at each AuCoMe step (robust orthology, non-robust orthology, and annotation or orthology), several measures were computed: the number of reactions, the number of ECs, the number of ECs validated by DeepEC, and the ratio of the number of ECs validated by DeepEC to the number of ECs. It was used to design figure S4(a).

* figure_S4_Deepec_algal.tsv for each species of the algal dataset, at each AuCoMe step (robust orthology, non-robust orthology, and annotation or orthology), several measures were computed: the number of reactions, the number of ECs, the number of ECs validated by DeepEC, and the ratio of the number of ECs validated by DeepEC to the number of ECs. It was used to design figure S4(b).

* SuplFile_o-Aminophenol_reactions_tables_S10_S11_S12.ods comprises three tables: S10, S11, and S12, the latter being S10 with more detail (such as the amino acid sequences).
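
The F-measures stored in figure_3_fmeasure_steps.tsv compare a recovered network with a gold standard. As a minimal sketch, assuming the comparison is made on sets of reaction identifiers and that the F-measure is the usual harmonic mean of precision and recall (the exact procedure is not detailed in this archive description), such a score could be computed as follows. The function name and the reaction identifiers are hypothetical.

```python
def f_measure(predicted, gold):
    """Return (precision, recall, F-measure) for two collections of reaction IDs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # No shared reaction: all three scores are zero by convention.
        return 0.0, 0.0, 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reaction identifiers, for illustration only.
recovered = {"RXN-1", "RXN-2", "RXN-3", "RXN-4"}
reference = {"RXN-2", "RXN-3", "RXN-4", "RXN-5", "RXN-6"}
precision, recall, f1 = f_measure(recovered, reference)
print(f"precision={precision:.2f} recall={recall:.2f} F={f1:.2f}")
```

Here three of the four recovered reactions belong to the reference (precision 0.75) and three of the five reference reactions are recovered (recall 0.60).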

2/ Content of the aucome_v0.5.1 subdirectory
This directory contains a copy of the AuCoMe project from GitHub: https://github.com/AuReMe/aucome (downloaded on 15/11/2022). It is composed of two subdirectories and five files:
* LICENCE licence of the AuCoMe software.

* README.rst README of the AuCoMe software.

* requirements.txt contains the list of required Python packages.

* setup.cfg contains metadata about the AuCoMe package and is used with setup.py to distribute AuCoMe.

* setup.py contains various information relevant to the AuCoMe package, including options and metadata. It is used to distribute AuCoMe on PyPI and to create an entry point when installing it with pip.

* recipes this subdirectory contains two files:
    – Dockerfile contains instructions to run AuCoMe in a Docker environment.

    – Singularity contains instructions to run AuCoMe in a Singularity container.

* aucome this directory contains 11 Python files:
    – __init__.py marks the directory as a Python module.

    – __main__.py contains the functions implementing the command-line interface of AuCoMe.

    – analysis.py contains the functions to analyse the AuCoMe results.

    – check.py contains the functions to check the input files.

    – compare.py contains the functions to compare the AuCoMe results between two distinct subgroups.

    – orthology.py contains the functions to propagate reactions through orthology.

    – reconstruction.py contains the functions to reconstruct draft GSMNs with Pathway Tools, run in parallel.

    – spontaneous.py contains the functions to add spontaneous reactions to a GSMN when they complete a MetaCyc metabolic pathway.

    – structural.py contains the functions to check that no reactions are missing because of missing gene structures. A genomic search is performed for all reactions present in one organism but not in another.

    – utils.py contains a function to parse the configuration file.

    – workflow.py contains functions to run all the steps of AuCoMe.

3/ Content of the datasets subdirectory
3.1/ Content of the algal, bacterial, and fungal directories
These three directories are composed of 8 subdirectories:
* FASTA contains the proteome of each species as a FASTA file.

* cleaned_GBKs for each species, it contains the annotated genome, with the protein sequences in a GenBank format file.

* dictionaries for some species, genes needed to be renamed for compatibility reasons. This folder contains CSV files with the mapping between the old names of genes and the new ones.

* annotated_DATs contains a subdirectory per species with all the output files from Pathway Tools v23.5, without any post-treatment, in the DAT format.

* annotated_PADMETs for each species, it contains a metabolic network of the draft reconstruction step of AuCoMe, in the PADMET format.

* final_PADMETs for each species, it contains a metabolic network generated by the AuCoMe workflow, in the PADMET format.

* final_SBMLs for each species, it contains a metabolic network generated by the AuCoMe workflow, in the SBML format.

* panmetabolism is composed of 7 files describing the final metabolic networks:
    – genes.tsv contains, for each organism, the list of genes and the associated reactions.

    – metabolites.tsv contains the list of metabolites present in the panmetabolism. For each metabolite and each organism, it lists the reactions that produce this compound and the reactions that consume it.

    – pathways.tsv contains the list of pathways present in the panmetabolism. For each pathway and each organism, it indicates the number of reactions present in this pathway, and the names of these reactions.

    – reactions.tsv contains the list of reactions present in the panmetabolism. For each reaction, it indicates whether or not it is present in each organism. If a reaction is found in a species, the genes associated with the reaction are also listed.

    – pvclust_reaction_dendrogram.png a dendrogram of the species of the dataset, provided as a PNG picture. Jaccard distances between species are computed from the presence/absence matrix of reactions, and hierarchical clustering with complete linkage is applied to them. The dendrogram is created with the R package pvclust, using bootstrap resampling. For each node, a p-value indicates how strongly the cluster is supported by the data.
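
The Jaccard distance underlying the dendrogram can be illustrated with a short Python sketch. This is not the pvclust procedure used for the archive (which runs in R and adds clustering and bootstrap resampling); it only shows how a pairwise distance is derived from reaction presence/absence. The species names, reaction identifiers, and function names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(reactions_a, reactions_b):
    """Jaccard distance between two sets of reaction identifiers."""
    union = reactions_a | reactions_b
    if not union:
        return 0.0  # two empty networks are treated as identical
    return 1.0 - len(reactions_a & reactions_b) / len(union)


def distance_table(presence):
    """Pairwise Jaccard distances for a {species: set of reactions} mapping."""
    return {
        (a, b): jaccard_distance(presence[a], presence[b])
        for a, b in combinations(sorted(presence), 2)
    }


# Hypothetical presence data: species -> reactions found in its network.
presence = {
    "species_A": {"R1", "R2", "R3"},
    "species_B": {"R2", "R3", "R4"},
    "species_C": {"R5"},
}
distances = distance_table(presence)
for pair, distance in distances.items():
    print(pair, round(distance, 3))
```

Species sharing many reactions get a distance close to 0, while species with no reaction in common get a distance of 1.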

3.2/ Content of the synthetic_bacterial directory
The synthetic_bacterial directory contains 32 subdirectories named Run_00, Run_01, ..., Run_31. Each subdirectory is composed of 9 files:
* K_12_MG1655.gbk the annotated genome of E. coli K–12 MG1655 to which degradation of the functional and/or structural annotations was applied.

* annotated_K_12_MG1655.sbml the metabolic network of E. coli K–12 MG1655 produced by the draft reconstruction step of AuCoMe, in the SBML format.

* annotated_K_12_MG1655.padmet the metabolic network of E. coli K–12 MG1655 produced by the draft reconstruction step of AuCoMe, in the PADMET format.

* orthology_K_12_MG1655.sbml the metabolic network of E. coli K–12 MG1655 produced by the orthology propagation step of AuCoMe, in the SBML format.

* orthology_K_12_MG1655.padmet the metabolic network of E. coli K–12 MG1655 produced by the orthology propagation step of AuCoMe, in the PADMET format.

* structural_K_12_MG1655.sbml the metabolic network of E. coli K–12 MG1655 produced by the structural verification step of AuCoMe, in the SBML format.

* structural_K_12_MG1655.padmet the metabolic network of E. coli K–12 MG1655 produced by the structural verification step of AuCoMe, in the PADMET format.

* final_K_12_MG1655.sbml the metabolic network of E. coli K–12 MG1655 produced by the AuCoMe workflow, in the SBML format.

* final_K_12_MG1655.padmet the metabolic network of E. coli K–12 MG1655 produced by the AuCoMe workflow, in the PADMET format.
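
After unpacking the archive, the layout described above can be checked with a few lines of Python. This is only a sketch: the file names come from the list above, but the location datasets/synthetic_bacterial is an assumption about where the directory sits after extraction, and the function name is hypothetical.

```python
from pathlib import Path

# File names listed above for each synthetic run.
STEPS = ("annotated", "orthology", "structural", "final")
EXPECTED_FILES = ["K_12_MG1655.gbk"] + [
    f"{step}_K_12_MG1655.{ext}" for step in STEPS for ext in ("sbml", "padmet")
]


def missing_files(synthetic_root):
    """Return the expected files that are absent under synthetic_root."""
    root = Path(synthetic_root)
    missing = []
    for index in range(32):  # Run_00 ... Run_31
        run_dir = root / f"Run_{index:02d}"
        for name in EXPECTED_FILES:
            if not (run_dir / name).is_file():
                missing.append(run_dir / name)
    return missing


if __name__ == "__main__":
    # Assumed location of the unpacked directory; adjust to the local path.
    absent = missing_files("datasets/synthetic_bacterial")
    print(f"{len(absent)} expected file(s) missing")
```

A complete extraction should report no missing file.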

4/ Content of the padmet_v5.0.1 subdirectory
This directory contains a copy of the PADMET project from GitHub: https://github.com/AuReMe/padmet/ (downloaded on 15/11/2022). It is composed of two subdirectories and six files:
* CHANGELOG.md records all notable changes made to the PADMET project.

* docs this directory contains all the documentation files of the PADMET package in the RST format.

* LICENCE licence of the PADMET package.

* README.md manual of the PADMET package.

* requirements.txt contains the list of required Python packages.

* setup.cfg contains metadata about the PADMET package and is used with setup.py to distribute PADMET.

* setup.py contains various information relevant to the PADMET package, including options and metadata. It is used to distribute PADMET on PyPI and to create an entry point when installing it with pip.

* padmet this directory gathers two files and two subdirectories:
    – __init__.py indicates the version of PADMET.

    – __main__.py contains the functions implementing the command-line interface of PADMET.

    – classes contains 7 files.

    – utils contains 4 files and 3 subdirectories.

4.1/ Content of the classes subdirectory
The classes directory contains 7 files.
* __init__.py marks the directory as a Python module.

* instantiation.py contains a function to instantiate a padmet object.

* node.py contains a class defining a Node object, which represents an element of a metabolic network (e.g., a compound or a reaction).

* padmetRef.py contains a class defining a PadmetRef object, which represents a metabolic network database.

* padmetSpec.py defines a PadmetSpec object, which represents the metabolic network of a species/organism based on a reference database (PadmetRef).

* policy.py contains a class defining a Policy object, which defines the types of Relations and Nodes of a network.

* relation.py contains a class defining a Relation object, which represents a link between two elements (Node) of a metabolic network.
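
The node/relation model described above can be pictured with a small, self-contained Python sketch. The classes below are illustrative stand-ins written for this example, not the PADMET classes themselves; the attribute names, the relation types, and the toy identifiers are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An element of a metabolic network, e.g. a compound or a reaction."""
    type: str
    id: str
    misc: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Relation:
    """A typed link between two nodes, referenced by their identifiers."""
    id_in: str
    type: str
    id_out: str


# Hypothetical toy network: reaction R1 consumes M1 and produces M2.
nodes = {
    "R1": Node("reaction", "R1"),
    "M1": Node("compound", "M1"),
    "M2": Node("compound", "M2"),
}
relations = [
    Relation("R1", "consumes", "M1"),
    Relation("R1", "produces", "M2"),
]

# Follow the "produces" relations starting from R1 to list its products.
products = [r.id_out for r in relations if r.id_in == "R1" and r.type == "produces"]
print(products)
```

Storing the network as typed links between identified elements makes queries such as "which compounds does R1 produce" a simple filter over the relations.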

4.2/ Content of the utils subdirectory
The utils directory contains 4 files and 3 subdirectories.
* __init__.py marks the directory as a Python module.

* gbr.py implements a lexical analysis to handle the gene relationships associated with a reaction: either a complex (AND relation between genes) or isozymes (OR relation between genes).

* sbmlPlugin.py contains functions to handle SBML elements (e.g., species or reactions) and to return all the sections named notes in a dictionary.

* utils.py contains a function that checks file paths.

* connection this subdirectory contains 22 files:
    – __init__.py marks the directory as a Python module.

    – biggAPI_to_padmet.py extracts the BiGG database through its API to create a PADMET file. Internet access is required.

    – check_orthology_input.py checks whether the metabolic network and the proteome of the model organism use the same gene identifiers (or at least more than a given cutoff) before running orthology-based reconstruction.

    – enhanced_meneco_output.py extracts the results of Meneco gap-filling and returns a PADMET file with more information for each gap-filled reaction.

    – extract_orthofinder.py reads the output file ’Orthogroups.tsv’, produced by running OrthoFinder on n FASTA files, to identify the orthologous genes. It is used by AuCoMe to extract the orthologous genes.

    – extract_rxn_with_gene_assoc.py from a given SBML file, it creates an SBML file containing only the reactions associated with a gene.

    – gbk_to_faa.py extracts protein sequences from a GenBank file into a FASTA file with the Biopython package.

    – gene_to_targets.py from a list of genes, it gets the products of the reactions linked to these genes. For example, if R1 is linked to G1 and R1 produces M1 and M2, this script outputs M1 and M2.

    – get_metacyc_ontology.py from the PadmetRef of MetaCyc, it creates the MetaCyc ontology.

    – metexploreviz_export.py converts a PADMET object representing a metabolic network into a JSON file compatible with MetExplore.

    – modelSeed_to_padmet.py from ModelSEED reactions and pathways files, it creates a PADMET file.

    – network_to_gnn.py creates input for GNNs (Graph Neural Networks) from PADMET or SBML files.

    – padmet_to_asp.py converts a PADMET file to an Answer Set Programming representation.

    – padmet_to_matrix.py creates a stoichiometric matrix from a PADMET file, in which columns represent reactions and rows represent metabolites.

    – padmet_to_padmet.py merges 1 to n PADMET files.

    – padmet_to_tsv.py converts a PADMET representing a database (PadmetRef) and/or a PADMET representing a model (PadmetSpec) to TSV files.

    – pgdb_to_padmet.py reads a PGDB folder (from BioCyc/Pathway Tools) and creates a PADMET file. It is used by AuCoMe to create PADMET files from PGDBs in the annotation-based step.

    – sbmlGenerator.py contains functions to generate SBML files from PADMET and TXT files using the libsbml package. It is used by AuCoMe to create SBML files at the annotation-based, orthology, and final steps.

    – sbml_to_curation_form.py extracts one or several reactions from an SBML file into the form used in AuReMe for curation.

    – sbml_to_padmet.py converts an SBML file into a PADMET file (with or without a reference database).

    – sbml_to_sbml.py creates an SBML file from another one. It can be used to change the SBML level.

    – wikiGenerator.py contains all the functions needed to generate wiki pages from a PADMET file and to update an online wiki. It requires the WikiManager module (with wikiMate, Vendor).

* exploration this subdirectory contains 15 files:
    – __init__.py marks the directory as a Python module.

    – compare_padmet.py compares 1 to n PADMET files and creates a folder with 4 output files (compounds.tsv, genes.tsv, pathways.tsv, and reactions.tsv). It is used by AuCoMe to create these files for the analysis of the metabolic networks.

    – compare_sbml.py compares two or more SBML files and creates two output files, reactions.tsv and metabolites.tsv, listing the reactions/metabolites found in each SBML file.

    – compare_sbml_padmet.py compares the reaction identifiers of an SBML file with those of a PADMET file, then returns the number of reactions in both and the reaction identifiers missing from the SBML file or from the PADMET file.

    – convert_sbml_db.py uses the MetaNetX database to check or convert an SBML file. Flat files from MetaNetX are required to run this script; they can be found in the AuReMe workflow or on the MetaNetX website.

    – dendrogram_reactions_distance.py uses the reactions.tsv file from compare_padmet.py to create a dendrogram with the R package pvclust. It was used in the article to create the metabolic dendrogram.

    – flux_analysis.py runs a flux balance analysis with the cobra package on an already defined reaction. The value ’objective_coefficient’ must be set to 1 in the SBML file.

    – get_pwy_from_rxn.py from a file containing a list of reactions, it returns the pathways in which these reactions are involved.

    – padmet_stats.py creates a statistics file (named padmet_stats.tsv) containing the number of pathways, reactions, genes, and compounds in one or several PADMET files.

    – pathway_production.py compares 1 to n PADMET objects to show their pathway inputs/outputs.

    – prot2genome.py contains functions to search a genome using protein sequences and Gene-Protein-Reaction associations. It is used in the structural search step of AuCoMe.

    – report_network.py creates reports on a PADMET file and writes three TSV files (all_metabolites.tsv, all_pathways.tsv, and all_reactions.tsv).

    – visu_network.py visualizes a metabolic network from the perspective of compounds.

    – visu_path.py visualizes a pathway in a PADMET network.

    – visu_similarity_gsmn.py visualizes the similarity between metabolic networks using MDS.

* management this subdirectory contains 5 files:
    – __init__.py marks the directory as a Python module.

    – manual_curation.py updates a PadmetSpec object by filling in specific forms. It either creates new reaction(s) in the PADMET file, or adds/removes reaction(s) from a PadmetRef.

    – padmet_compart.py for a given PADMET file, it checks and updates compartments.

    – padmet_medium.py for a given set of compounds representing the growth medium (or seeds), it creates two reactions in order to maintain consistency of the network for flux analysis.

    – relation_curation.py for a given PADMET file, it adds or removes relations between nodes.
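
The matrix layout described for padmet_to_matrix.py (rows for metabolites, columns for reactions) can be illustrated with a minimal pure-Python sketch. It does not reproduce the PADMET implementation: the input format, the function name, and the sign convention (negative for consumed metabolites, positive for produced ones) are assumptions made for this example.

```python
def stoichiometric_matrix(reactions):
    """Build a stoichiometric matrix from {reaction: {metabolite: coefficient}}.

    Rows are metabolites and columns are reactions; consumed metabolites
    carry negative coefficients and produced ones positive coefficients.
    """
    reaction_ids = sorted(reactions)
    metabolite_ids = sorted({m for coeffs in reactions.values() for m in coeffs})
    matrix = [
        [reactions[r].get(m, 0) for r in reaction_ids]
        for m in metabolite_ids
    ]
    return metabolite_ids, reaction_ids, matrix


# Hypothetical toy network: R1: M1 -> M2, R2: M2 -> 2 M3.
toy = {
    "R1": {"M1": -1, "M2": 1},
    "R2": {"M2": -1, "M3": 2},
}
rows, cols, matrix = stoichiometric_matrix(toy)
for metabolite, row in zip(rows, matrix):
    print(metabolite, row)
```

Each column then summarises one reaction, and each row shows in which reactions a metabolite is consumed or produced.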

5/ Content of the scripts subdirectory
The scripts directory contains 9 files:
* bacteria_random_degradation.py was used to degrade the E. coli K–12 MG1655 genome. The genome degradation procedure is described in Algorithm 1.

* Figure_2_Algal_dataset.py generates figure 2D, which shows the number of reactions for each species of the algal dataset at each AuCoMe step.

* Figure_2_Bacterial_dataset.py generates figure 2B, which shows the number of reactions for each species of the bacterial dataset at each AuCoMe step.

* Figure_2_Fungal_dataset.py generates figure 2C, which shows the number of reactions for each species of the fungal dataset at each AuCoMe step.

* Figure_3_degradation.py generates figure 3B from the figure_3_fmeasure_steps.tsv file (described above).

* Figure_6_MDS.py generates figure 6A from two reactions.tsv files of the algal dataset (annotation-based and final).

* Sup_Figure_5_comparison_bacteria.py generates figure S5 of the paper.

* Sup_Figure_6_comparison_pathway_Fungi.py generates figure S6 of the paper.

* Sup_Figure_7_Supervenn.py generates figure S7. It reads the reactions.tsv file of the algal dataset at the final AuCoMe step, and another tabular file that contains abbreviated species names.

Files

Name	Size
AuCoMe_Supplementary_data_v3.zip (md5:bf6a24e01c1d11e7ab34b7b5e6fa7dfb)	5.5 GB
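
Since the archive is large (5.5 GB), it can be worth checking the download against the MD5 checksum listed above. The following is a minimal sketch using the Python standard library; the function names are hypothetical, and the file is read in chunks so that it does not need to fit in memory.

```python
import hashlib

# Checksum published on the Zenodo record for AuCoMe_Supplementary_data_v3.zip.
EXPECTED_MD5 = "bf6a24e01c1d11e7ab34b7b5e6fa7dfb"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """True if the file at path has the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, archive_is_intact("AuCoMe_Supplementary_data_v3.zip") should return True for an uncorrupted download, assuming the file was saved under that name in the current directory.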
