EsMeCaTa article dataset

Belcour, Arnaud; Hamon-Giraud, Pauline; Mataigne, Alice; Ruiz, Baptiste; Le Cunff, Yann; Got, Jeanne; Awhangbo, Lorraine; Lebreton, Mégane; Frioux, Clémence; Dittami, Simon; Dabert, Patrick; Siegel, Anne; Blanquart, Samuel

doi:10.5281/zenodo.14502342

Published January 24, 2025 | Version v1

Dataset Open

1. Univ. Grenoble Alpes, Inria, 38000 Grenoble, France
2. Université Grenoble Alpes, CNRS, LIPhy, Grenoble, France
3. University Rennes, Inria, CNRS, IRISA, Rennes, France
4. INRAE, UR1466 OPAALE, F-35044 Rennes, France
5. Inria, INRAE, Université de Bordeaux, 33400 Talence, France
6. Sorbonne University, CNRS, Integrative Biology of Marine Models (LBI2M, UMR 8227), Station Biologique de Roscoff (SBR), 29680 Roscoff, France

This repository contains the archived files associated with the EsMeCaTa article.

It is divided into several archives:

archive_figure.zip: this archive contains different sub-folders associated with each figure of the article. It contains scripts (either R or Python) and the source files used to create the figures.
- figure_1_workflow: svg source file for the creation of the workflow figure.
- figure_2_toy_example: intermediary input files, Python script and svg source files that were used to create Figure 2 of the article.
- figure_3_validation:
  - svg source files that have been used to create the merged Figure 3 of the article.
  - figure_fmeasures_dataset: F-measures computed from the comparison of EsMeCaTa predictions against MGnify. It contains the intermediary input and R scripts to create this subplot. It also contains txt files describing the results of the statistical analysis.
  - figure_picrust: F-measures computed from the comparison of EsMeCaTa predictions against MGnify and PICRUSt against MGnify. It contains the intermediary input and R scripts to create this subplot. It also contains txt files describing the results of the statistical analysis.
  - figure_pocp: POCP metrics computed from the comparison of EsMeCaTa consensus proteomes alignment to MAG/isolates of MGnify. It contains the intermediary input and R scripts to create this subplot. It also contains txt files describing the results of the statistical analysis.
  - figure_threshold: F-measures computed between EC number predictions from EsMeCaTa and EC numbers from genomes and metagenomes for the algal microbiota dataset, for different values of the threshold Tr (0, 0.25, 0.5, 0.75 and 0.95). It contains the intermediary input and R scripts to create this subplot. It also contains txt files describing the results of the statistical analysis.
- figure_4_biogas_reactor: svg source file to create Figure 4 of the article about methanogenic reactor microbial community.
  - figure_gseapy_orsum: result of enrichment analysis with GSEApy and Orsum on the predictions of EsMeCaTa for this community.
  - sup_figure_html_report_biogas_reactor: EsMeCaTa report for this community.
- figure_5_methanogenesis: intermediary input files, Python scripts and svg source files for the creation of the methanogenic pathway figure.
  - diamond_output_ec: alignment result file against Swissprot reference sequences.
  - diamond_output_ko: alignment result file against KEGG orthologs reference sequences.
  - reference_uniprot_data: reference file for UniProt.
  - several Python scripts to (1) search EC numbers in EsMeCaTa predictions, (2) download reference data (from Swissprot and KEGG orthologs) and (3) align these to EsMeCaTa consensus proteomes.
  - svg source files of the figure.
- figure_6_abundance: intermediary input files, Python script, R script to create the linear model, and the svg source file for the figure.
- sup_figure_cellulosome: Python script to align reference proteins (dockerin and cohesin) to EsMeCaTa consensus proteomes, Diamond resulting alignment files and svg source for the Figure.
- sup_figure_measures: figure showing the measures of methane and OTU abundances.
- sup_figure_toy_example: EsMeCaTa HTML reports and associated figures.
input_file.zip: this archive contains the input files for EsMeCaTa for each dataset presented in the article.
- toy_example.tsv: input file for EsMeCaTa for the toy example dataset (corresponding to Sup_File_1).
- algal_microbiota.tsv: input file for EsMeCaTa for the algal microbiota dataset (corresponding to Sup_File_2).
- MGnify dataset:
  - mgnify_honeybee_esmecata.tsv: input file for EsMeCaTa for the honeybee microbiota dataset.
  - mgnify_human_oral_esmecata.tsv: input file for EsMeCaTa for the human oral microbiota dataset.
  - mgnify_marine_esmecata.tsv: input file for EsMeCaTa for the marine microbiota dataset.
  - mgnify_pig_gut_esmecata.tsv: input file for EsMeCaTa for the pig gut microbiota dataset.
  - mgnify_merged_dataset.tsv: merged taxonomic affiliations of the four subdatasets for MAGs with a completeness of 90% or more (corresponding to Sup_File_3).
- methanogenic_reactor.tsv: input file for EsMeCaTa for the methanogenic reactor dataset (corresponding to Sup_File_4). It also contains the 16S rRNA sequences reconstructed by FROGs and the abundance of each OTU in the different samples (i.e. time points).
- methanogenic_reactor_measures.xlsx: measurements of several metabolites in the methanogenic reactor at different time points.
methanogenic_reactor_reads.zip: contains the fastq files of the sequenced community of the methanogenic reactor experiments. One file per time point.
ncbi_taxonomy_database.zip: contains the NCBI Taxonomy files corresponding to the database versions used in the article.
- taxdmp_2023-04.tar.gz: NCBI Taxonomy database used for the algal microbiota dataset.
- taxdmp_2023-09.tar.gz: NCBI Taxonomy database used for the toy example dataset.
- taxdmp_2023-12.tar.gz: NCBI Taxonomy database used for the MGnify dataset (honeybee, human oral, marine and pig gut microbiota sub-datasets).
- taxdmp_2024-01.tar.gz: NCBI Taxonomy database used for the methanogenic reactor dataset.
esmecata_bash_script.zip: examples of bash scripts used on a computer cluster (based on SLURM) to run EsMeCaTa for the different datasets.
- 0_esmecata_proteomes.sh: activate conda environment containing EsMeCaTa (with its dependencies) and run the first proteomes step on the input file.
- 1_esmecata_clustering_annotation.sh: specify the use of 10 CPUs and 60 G of RAM. Activate conda environment containing EsMeCaTa (with its dependencies) and run the clustering and annotation steps with 10 cores.

The output folders of the EsMeCaTa runs for the different datasets of the article are also provided. The format of these output folders, which is also presented in the EsMeCaTa Readme, is as follows:

0_proteomes: contains the result of the proteomes step of EsMeCaTa. Relevant outputs are:
- proteomes: a folder containing all the downloaded (compressed) proteomes from UniProt associated with the taxa selected by EsMeCaTa.
- proteomes_description: a folder containing one tabulated file for each input of EsMeCaTa, describing the proteomes found for the taxa associated with this input.
- proteome_tax_id.tsv: a tabulated file indicating, for each taxonomic affiliation given as input to EsMeCaTa, which taxon had at least 5 proteomes on UniProt. It lists the proteomes associated with this taxon.
- several statistics and metadata files about the run of EsMeCaTa.
1_clustering: contains the result of the clustering step of EsMeCaTa. Relevant outputs are:
- cluster_founds: a folder containing one tabulated file per taxon selected by EsMeCaTa. These tabulated files list the protein clusters identified by MMseqs2. Each row corresponds to a protein cluster and all the protein IDs contained in it.
- computed_threshold: a folder containing one tabulated file per taxon selected by EsMeCaTa. These tabulated files indicate, for each protein cluster, the proteome representativeness ratio Rp associated with it and list the proteomes present.
- reference_proteins: a folder containing one tabulated file per taxon selected by EsMeCaTa. These tabulated files show the protein clusters kept after filtering them according to the Tr threshold on the proteome representativeness ratio Rp.
- reference_proteins_consensus_fasta: a folder containing one fasta file per taxon selected by EsMeCaTa. It contains all the consensus sequences created by MMseqs2 from the filtered protein clusters.
- reference_proteins_representative_fasta: a folder containing one fasta file per taxon selected by EsMeCaTa. It contains all the representative sequences for the filtered protein clusters.
- several statistics and metadata files about the run of EsMeCaTa.
2_annotation: contains the result of the annotation step of EsMeCaTa. Relevant outputs are:
- annotation_reference: a folder containing one tabulated file per taxonomic affiliation selected by EsMeCaTa. These tabulated files display the annotations (GO Terms, EC numbers) associated with each protein cluster according to eggNOG-mapper.
- eggnog_output: a folder containing result files ('*.emapper.annotations', '*.emapper.hits' and '*.emapper.seed_orthologs') created by eggNOG-mapper from the consensus sequences file.
- pathologic: a folder containing one sub-folder for each taxonomic affiliation selected by EsMeCaTa. Each sub-folder contains PathoLogic files that can be processed by Pathway Tools to reconstruct draft metabolic networks. Because of the Zenodo size limit, PathoLogic files for the validation datasets (algal microbiota and MGnify datasets) have been removed.
- function_table.tsv: a tabulated file listing, for all taxonomic affiliations processed by EsMeCaTa, the associated annotations predicted by eggNOG-mapper and the number of protein clusters linked to them.
- several statistics and metadata files about the run of EsMeCaTa.
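
The filtering described for computed_threshold and reference_proteins can be illustrated with a minimal Python sketch. It assumes that Rp is the fraction of a taxon's proteomes contributing at least one protein to a cluster, and that a cluster is kept when Rp is greater than or equal to Tr; both points are assumptions to check against the EsMeCaTa Readme. The function name and the in-memory dictionary are hypothetical and do not reflect the actual file formats.

```python
def filter_clusters(cluster_proteomes, total_proteomes, tr):
    """Return {cluster_id: Rp} for the clusters whose Rp reaches the threshold Tr.

    cluster_proteomes: hypothetical mapping of cluster ID -> set of proteome IDs
    total_proteomes: number of proteomes selected for the taxon
    tr: threshold Tr, between 0 and 1
    """
    if total_proteomes <= 0:
        raise ValueError("total_proteomes must be positive")
    kept = {}
    for cluster_id, proteomes in cluster_proteomes.items():
        rp = len(proteomes) / total_proteomes  # assumed definition of Rp
        if rp >= tr:  # assumed comparison (inclusive)
            kept[cluster_id] = rp
    return kept


# Made-up clusters for a taxon with 4 proteomes.
clusters = {
    "cluster_a": {"UP1", "UP2", "UP3", "UP4"},
    "cluster_b": {"UP1", "UP2"},
    "cluster_c": {"UP3"},
}
print(filter_clusters(clusters, total_proteomes=4, tr=0.5))
```

With Tr set to 0, every cluster passes under these assumptions. That is consistent with precomputed_db_algae_R_0.zip being the largest of the algal microbiota databases listed on this page.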

The available output folders are listed below:

toy example dataset:
- result_esmecata_toy_example.zip: EsMeCaTa output folder for the toy example dataset.
case-study dataset:
- result_esmecata_methanogenic_reactor.zip: EsMeCaTa output folder for the methanogenic reactor dataset.
validation dataset:
- algal microbiota dataset:
  - result_esmecata_burgunter_thresholds.zip: EsMeCaTa output folder for the algal microbiota dataset associated with the article from Burgunter-Delamare et al. (2020). Results for the five runs with different values of the threshold Tr (0, 0.25, 0.5, 0.75 and 0.95).
  - result_esmecata_kleinjan_thresholds.zip: EsMeCaTa output folder for the algal microbiota dataset associated with the article from KleinJan et al. (2023). Results for the five runs with different values of the threshold Tr (0, 0.25, 0.5, 0.75 and 0.95).
- MGnify dataset:
  - result_esmecata_honeybee.zip: EsMeCaTa output folder for the honeybee microbiota subdataset of the MGnify dataset.
  - result_esmecata_human_oral.zip: EsMeCaTa output folder for the human oral microbiota subdataset of the MGnify dataset.
  - result_esmecata_marine.zip: EsMeCaTa output folder for the marine microbiota subdataset of the MGnify dataset.
  - result_esmecata_pig_gut.zip: EsMeCaTa output folder for the pig gut microbiota subdataset of the MGnify dataset.
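
After uncompressing one of these archives, the layout described earlier can be checked with a short Python sketch. The folder and file names come from the description above. The helper name, the choice to treat pathologic as optional (it was removed for the validation datasets) and the example path are assumptions. The threshold archives hold five runs, so the check would need to be pointed at each run folder.

```python
from pathlib import Path

# Entries named in the description of the EsMeCaTa output folders.
EXPECTED_LAYOUT = {
    "0_proteomes": ["proteomes", "proteomes_description", "proteome_tax_id.tsv"],
    "1_clustering": [
        "cluster_founds",
        "computed_threshold",
        "reference_proteins",
        "reference_proteins_consensus_fasta",
        "reference_proteins_representative_fasta",
    ],
    "2_annotation": [
        "annotation_reference",
        "eggnog_output",
        "pathologic",
        "function_table.tsv",
    ],
}


def missing_entries(output_folder, optional=("2_annotation/pathologic",)):
    """List the expected entries that are absent from an output folder."""
    root = Path(output_folder)
    missing = []
    for step, entries in EXPECTED_LAYOUT.items():
        for entry in entries:
            relative = f"{step}/{entry}"
            if relative in optional:
                continue  # removed from some archives because of size limits
            if not (root / step / entry).exists():
                missing.append(relative)
    return missing


if __name__ == "__main__":
    folder = Path("result_esmecata_toy_example")  # hypothetical extraction path
    if folder.is_dir():
        print(missing_entries(folder))
```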

To reproduce the results presented in the article, several precomputed databases compatible with EsMeCaTa version 0.5.* have been generated. In short, each precomputed database contains the annotations and consensus proteomes predicted for each taxon of the dataset. EsMeCaTa can query it to produce an output similar to that of a standard run, without performing the whole workflow, which can take several hours. Each database can be used with its corresponding input file using the following command: "esmecata precomputed -i input_file.tsv -d precomputed_database.zip -o output_folder". For example, for the toy example: "esmecata precomputed -i toy_example.tsv -d precomputed_db_toy_example.zip -o output_folder". More information can be found in the GitHub repository of EsMeCaTa, both in the Readme and in a subfolder associated with the article data. The precomputed databases are listed below:

toy example dataset:
- precomputed_db_toy_example.zip: EsMeCaTa precomputed database for the toy example dataset.
case-study dataset:
- precomputed_db_methanogenic_reactor.zip: EsMeCaTa precomputed database for the methanogenic reactor dataset.
validation dataset:
- algal microbiota dataset:
  - precomputed_db_algae_R_0_25.zip: EsMeCaTa precomputed database for the experiment with the Tr threshold of 0.25 for the algal microbiota dataset.
  - precomputed_db_algae_R_0_5.zip: EsMeCaTa precomputed database for the experiment with the Tr threshold of 0.5 for the algal microbiota dataset.
  - precomputed_db_algae_R_0_75.zip: EsMeCaTa precomputed database for the experiment with the Tr threshold of 0.75 for the algal microbiota dataset.
  - precomputed_db_algae_R_0_95.zip: EsMeCaTa precomputed database for the experiment with the Tr threshold of 0.95 for the algal microbiota dataset.
  - precomputed_db_algae_R_0.zip: EsMeCaTa precomputed database for the experiment with the Tr threshold of 0 for the algal microbiota dataset.
- MGnify dataset:
  - precomputed_db_honeybee.zip: EsMeCaTa precomputed database for the honeybee microbiota subdataset of the MGnify dataset.
  - precomputed_db_human_oral.zip: EsMeCaTa precomputed database for the human oral microbiota subdataset of the MGnify dataset.
  - precomputed_db_marine.zip: EsMeCaTa precomputed database for the marine microbiota subdataset of the MGnify dataset.
  - precomputed_db_pig_gut.zip: EsMeCaTa precomputed database for the pig gut microbiota subdataset of the MGnify dataset.
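
The "esmecata precomputed" command quoted above can be scripted when several databases have to be queried. The following Python sketch only assembles the command line given in this description. The helper names and the dry-run default are assumptions, and EsMeCaTa must be installed for an actual run.

```python
import shlex
import subprocess


def precomputed_command(input_file, database, output_folder):
    """Build the 'esmecata precomputed' command quoted in the description."""
    return [
        "esmecata", "precomputed",
        "-i", str(input_file),
        "-d", str(database),
        "-o", str(output_folder),
    ]


def run_precomputed(input_file, database, output_folder, dry_run=True):
    """Print the command (dry run) or execute it with subprocess."""
    command = precomputed_command(input_file, database, output_folder)
    if dry_run:
        print(shlex.join(command))
        return command
    subprocess.run(command, check=True)  # requires EsMeCaTa in the environment
    return command


# Toy example pairing taken from the description; nothing is executed here.
run_precomputed("toy_example.tsv", "precomputed_db_toy_example.zip", "output_folder")
```

Passing dry_run=False executes the command. Pairings other than the toy example have to be inferred from the file names, so they are left out of the sketch.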

Furthermore, several scripts and intermediary data used to create figures of the article are available:

threshold_comparison.zip: reference data (genomes and MAGs annotated by eggNOG-mapper) that were used as ground truth for the experiment on the impact of the Tr threshold for the algal microbiota dataset. It contains:
- expected_data: reference data for the algal microbiota dataset (genome sequence and eggNOG-mapper annotation files):
  - 1_genome: nucleic fasta files for the algal microbiota dataset.
  - 2_annotation: resulting annotation files by eggNOG-mapper.
  - eggnog.sh: bash script to launch eggNOG-mapper on the genome folder.
- picrust2: results of PICRUSt2 on the algal microbiota dataset.
- compare_ecs.py: Python script to compute the F-measure for each Tr threshold used for the algal microbiota dataset. It requires the uncompressed EsMeCaTa output folders for the algal microbiota datasets (result_esmecata_burgunter_thresholds and result_esmecata_kleinjan_thresholds) and the genome annotations.
mgnify_validation.zip: files and scripts used to perform the validation against the MGnify datasets:
- ec_picrust: a subfolder containing the comparison of EC numbers, including the comparison with PICRUSt:
  - 16s_rrna_sequence: a subfolder containing four fasta files with the 16S rRNA sequences for each dataset. These files were used as input to PICRUSt.
  - picrust_results_dataset: a subfolder containing four folders. Each folder is the result of the run of PICRUSt on the 16S rRNA sequence files.
  - run_picrust: folder containing Python scripts used to create the input files present in 16s_rrna_sequence:
    - 0_download_ref_data.py: script to download protein sequences, rRNA sequences and eggNOG annotations for each MAG/isolate of the MGnify dataset.
    - 1_create_dataset_16srrna.py: script to extract the 16S rRNA sequences from the rRNA fasta files.
    - 2_run_picrust2.py: script to run PICRUSt2 on the 16S rRNA fasta files. It requires PICRUSt2 to be installed in the environment.
  - compare_ec_dataset.py: Python script to compute the F-measure for the EC numbers between EsMeCaTa predictions and the MAG/isolates, and between PICRUSt2 and the MAG/isolates.
  - create_table_mag.py: Python script to compute the table indicating the number of MAGs/isolates processed by EsMeCaTa and PICRUSt2. It requires MGnify input files (such as honeybee_esmecata_metdata.tsv), 16S rRNA sequence files (from 16s_rrna_sequence folder), PICRUSt result folder (from run_picrust folder) and proteome_tax_id.tsv for the dataset from EsMeCaTa output folder.
  - table_mgnify_data.tsv: tabulated file showing the table computed by create_table_mag.py.
- protein_sequences_pocp: folder containing comparison of the sequences between EsMeCaTa consensus proteomes and MAG/isolate sequences.
  - 0_run_diamond_comparison.py: Python script to run Diamond on EsMeCaTa consensus proteomes (from the EsMeCaTa output folder) and MAG/isolate sequences (contained in the archive expected_data_mgnify.zip). It outputs the alignment between these sequences.
  - 1_compute_pocp.py: Python script to compute F-measure and POCP using Diamond resulting files.
  - comparison_diamond_pocp_honeybee.zip: Resulting files from Diamond alignment between EsMeCaTa consensus proteomes and MAG/isolate sequences for the honeybee microbiota dataset.
  - comparison_diamond_pocp_human_oral.zip: Resulting files from Diamond alignment between EsMeCaTa consensus proteomes and MAG/isolate sequences for the human oral microbiota dataset.
  - comparison_diamond_pocp_marine.zip: Resulting files from Diamond alignment between EsMeCaTa consensus proteomes and MAG/isolate sequences for the marine microbiota dataset.
  - comparison_diamond_pocp_pig_gut.zip: Resulting files from Diamond alignment between EsMeCaTa consensus proteomes and MAG/isolate sequences for the pig gut microbiota dataset.
- compare_go_dataset.py: a Python script that computes the F-measure on GO Term predictions between EsMeCaTa predictions and annotations of the MAG/isolates.
- expected_data_mgnify.zip: an archive containing, for the four datasets: (1) a fasta file containing protein sequences associated with each MAG/isolate, (2) a tabulated file containing predictions by eggNOG-mapper and (3) a fasta file containing rRNAs associated with each MAG/isolate.
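
Several of the scripts above compare predicted annotations with reference annotations using an F-measure. As an illustration only, the sketch below computes the usual F1 score (harmonic mean of precision and recall) between two sets of EC numbers. It is not taken from compare_ecs.py or compare_ec_dataset.py, whose exact definition may differ, and the EC numbers are made-up inputs.

```python
def f_measure(predicted, expected):
    """F1 score between a set of predicted and a set of expected annotations."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0  # also covers empty inputs
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 2 shared EC numbers, 3 predicted, 4 expected.
predicted_ecs = {"1.1.1.1", "2.7.1.2", "4.2.1.11"}
expected_ecs = {"1.1.1.1", "2.7.1.2", "3.1.3.11", "5.3.1.9"}
print(round(f_measure(predicted_ecs, expected_ecs), 3))
```

Here precision is 2/3 and recall is 1/2, which gives an F-measure of 4/7.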

Finally, an archived version of the EsMeCaTa code (version 0.5.0) is provided:

esmecata-0.5.0.zip: archive of the GitHub repository of EsMeCaTa at version 0.5.0. It is advised to use the current version of EsMeCaTa, available in its GitHub repository.

Files (41.5 GB)

Name	MD5	Size
archive_figure.zip	16398aca34c0f7df80ace02eb56e8bf8	1.4 GB
esmecata-0.5.0.zip	da40ba64f906f91fed09aa8fca81c0e7	2.2 MB
esmecata_bash_script.zip	f04bac92e9d5a62bc431fb437f9667bf	818 Bytes
input_file.zip	7c4c040ea3c84d230e66cae89c392dad	1.5 MB
methanogenic_reactor_reads.zip	72d9892dc0edbc1b98325d9cb8834de3	525.7 MB
mgnify_validation.zip	436596bb45c13269406be9244f838a02	5.8 GB
ncbi_taxonomy_database.zip	f263d3f87b06cc95bee601e36d5314c6	238.9 MB
precomputed_db_algae_R_0.zip	ff2db1aa26192ac78634b3395fb4b595	439.3 MB
precomputed_db_algae_R_0_25.zip	c6531e25fcfa5dd87ada9274a36c0407	96.4 MB
precomputed_db_algae_R_0_5.zip	0efca51629885c08f68bd199ab7e0b91	74.3 MB
precomputed_db_algae_R_0_75.zip	b8144557041caad7987a7aeb70151095	60.2 MB
precomputed_db_algae_R_0_95.zip	ecdbe9a8b353fc64ed2394f87a45f79f	44.7 MB
precomputed_db_honeybee.zip	70a32fb3ac48e798f4021eaa292d8d72	67.4 MB
precomputed_db_human_oral.zip	867f2d18af35245748f6d12fef1da2b8	64.8 MB
precomputed_db_marine.zip	8683ba8d6a702e3090b0434161926bef	217.4 MB
precomputed_db_methanogenic_reactor.zip	83a92b914e08db3d886d1647a7bd24e3	79.1 MB
precomputed_db_pig_gut.zip	1e65f1987e05457cece3390c6c3aa024	120.1 MB
precomputed_db_toy_example.zip	53e00403d30fe3e197d3126575bb4e74	20.2 MB
result_esmecata_burgunter_thresholds.zip	083eb78824b1a17bc26a148c99d966b0	8.0 GB
result_esmecata_honeybee.zip	b6e038e864662f57a5f68e456c96a4e6	1.8 GB
result_esmecata_human_oral.zip	5442c22e29f9e8fd6b91e7c6301f1d2f	1.8 GB
result_esmecata_kleinjan_thresholds.zip	144dd2f17218d4a3794998f07687aa95	6.8 GB
result_esmecata_marine.zip	d16addd5797bc812e04313d1edb6905e	6.2 GB
result_esmecata_methanogenic_reactor.zip	183db971518a11dfec0d4ec4b06b4ca5	2.4 GB
result_esmecata_pig_gut.zip	f7f65070fdbc9e276e2aa202b4d1f582	4.3 GB
result_esmecata_toy_example.zip	bfd1d99e3ddc04567e2c6c670850f18d	608.7 MB
threshold_comparison.zip	ae91115bde5d54b3b2fb9d6dfd038236	286.2 MB
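
The md5 values listed above can be used to confirm that a downloaded archive is intact, which is worth doing for the multi-gigabyte files. The following Python sketch uses only the standard library. The function names and the local file path are assumptions, and the expected checksum is the one listed for esmecata_bash_script.zip.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_md5):
    """Compare the checksum of a local file with the value listed on the page."""
    return md5sum(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    archive = Path("esmecata_bash_script.zip")  # hypothetical download location
    if archive.is_file():
        print(is_intact(archive, "f04bac92e9d5a62bc431fb437f9667bf"))
```

Reading in chunks keeps memory use constant, so the same function can be applied to the 8.0 GB archive.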

Additional details

Repository URL: https://github.com/AuReMe/esmecata
Programming language: Python
