Published August 4, 2025 | Version v2
Data paper Open

MICROSTORE : 'omic approaches to decipher MICrobial eukaRyOteS funcTiOn in fReshwater lake Ecosystems

Description

Changes

The following changes were made between v1 and v2:

  • SAG:
    • contigs have been renamed and only those longer than 2.5 kb were kept
    • gene prediction was done with GeneMark-EP+
    • genes are available in GFF3 format, and transcripts and proteins as FASTA sequences
  • The DNA-dependent RNA polymerase phylogenomic tree was updated with two SAGs from this dataset
  • Mapping public metagenomes
    • rows corresponding to metagenomes with fewer than 100 mapped reads were removed from the table
    • column removed:
      • "used_for_UMAP": after the removal of some rows, this column contained only 'yes' values
    • columns added:
      • "clean_bases"
      • "Nb_MAG_detected_sup_10pct"
      • "cluster"
  • A metadata table for MAGs and SAGs has been added. It contains the assembly statistics, BUSCO results and taxonomic affiliation

Archive content

This Zenodo record contains the following items:

Single Amplified Genomes (SAGs)

The files present in the Zip archive SAG.zip are:

  • genome/*.fa.gz: SAG genomes
  • gene_prediction/<SAG>.gff3.gz: gene predictions
  • gene_prediction/<SAG>.transcript.fa.gz: transcript sequences
  • gene_prediction/<SAG>.protein.fa.gz: protein sequences

A summary for each SAG is available in the spreadsheet MAG_SAG_metadata.ods.

Metagenome-Assembled Genomes (MAGs)

The files present in the Zip archive MAG.zip are:

  • euk_mag/*.fa.gz: MAG genomes
  • gene_prediction_final/<MAG>.<GeneCaller>.gff3.gz: gene predictions
  • gene_prediction_final/<MAG>.<GeneCaller>.transcript.fa.gz: transcript sequences
  • gene_prediction_final/<MAG>.<GeneCaller>.prot.fa.gz: protein sequences

Some MAGs for which gene prediction failed (sequence too short, or another reason), and which were not used in the analysis based on the mapping of public metagenomes, are available in the directory not_used/.

A summary for each MAG is available in the spreadsheet MAG_SAG_metadata.ods.

Metagenomic assemblies and bins

The files present in the Zip archive metagenomic_assemblies.zip are:

  • contigs_db/: the 4 anvi'o contigs databases, which contain the contig sequences
  • merged_profiles_db: the 4 anvi'o profile databases, which contain the mapping results. The bins (MetaBAT2) are stored in these databases
  • README_metagenomic_assemblies.md: a text file with more details

WARNING: the files are compressed in .bz2 format and must be decompressed (bunzip2) before use. They use about 110 GB of disk space.
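Where bunzip2 is not available, the decompression step can also be done from Python. The following is a minimal sketch using only the standard library; the function names are illustrative and not part of the archive, and unlike bunzip2 it keeps the compressed originals, so enough free disk space for both copies is assumed.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Decompress one .bz2 file next to itself and return the new path."""
    if src.suffix != ".bz2":
        raise ValueError(f"not a .bz2 file: {src}")
    dest = src.with_suffix("")  # drops only the trailing ".bz2"
    # Stream in chunks so large databases are never loaded fully in memory.
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dest


def decompress_tree(root: Path) -> list:
    """Decompress every .bz2 file found under root (originals are kept)."""
    return [decompress_bz2(path) for path in sorted(root.rglob("*.bz2"))]
```

For example, `decompress_tree(Path("metagenomic_assemblies"))` would process the whole directory, assuming the Zip archive was extracted under that (hypothetical) name.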

Although these artifacts were generated with an earlier version of anvi'o, the anvi-migrate script (see its documentation) allows them to be used with an up-to-date installation.

Phylogenetic tree

All files are provided in the Zip archive phylogenetic_tree.zip.

The DNA-dependent RNA polymerase

The files present in the directory phylogenetic_tree/DNA_dependent_RNA_pol are:

  • 00_hmm_profiles/ : the HMM profiles that target the DNA-dependent RNA polymerase sub-units
  • 01_sequences/ : best hit for each sub-unit
  • 02_alignments_raw/: raw alignments produced by MAFFT v7.526
  • 03_alignments_cleaned/: alignments after goalign clean sites -c 0.5
  • merge_alignments.py: Python script to concatenate the 6 sub-units
  • RNAP_aln_v6_concat.fa: the concatenated alignment
  • 04_tree/: output of the IQ-Tree v2.4.0 run
  • metadata_RNAP_tree.tsv: metadata to decorate the tree

The Python script merge_alignments.py requires Python 3 and Biopython.
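To illustrate what concatenating per-sub-unit alignments generally involves, here is a minimal standard-library sketch. It is not the archived merge_alignments.py (which relies on BioPython), and it assumes that a genome missing from one sub-unit alignment is padded with gaps; the archived script may handle that case differently. All function names are illustrative.

```python
def parse_fasta(text):
    """Return {sequence_id: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the identifier, drop the description
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def concatenate_alignments(alignments, gap="-"):
    """Concatenate aligned blocks; genomes absent from a block get gaps."""
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    merged = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must share the same length")
        width = lengths.pop()
        for taxon in taxa:
            merged[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(parts) for taxon, parts in merged.items()}
```

Each element of `alignments` would be one cleaned sub-unit alignment, read for instance with `parse_fasta(open(path).read())`.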

The file metadata_RNAP_tree.tsv lists the reference genomes. The "source" column corresponds to:

Phylosift tree

The files present in the directory phylogenetic_tree/phylosift_tree/ are:

  • 01_marker_present/: markers identified and aligned by Phylosift
  • 02_marker_selected/: markers selected for the tree, i.e. those present in at least 50% of the genomes
  • 03_marker_alignment_cleaned/: alignments cleaned by trimAl v1.5 with the -automated1 option
  • 04_phylosift_concatenated_alignment.fa: concatenation of the 50 markers
  • 05_tree/: phylogeny built by IQ-Tree v2.2.3
  • metadata_phylosift_tree.csv: list of reference genomes and their taxonomy. This file can be used directly to decorate the tree visualised with TreeViewer.
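The marker selection rule listed above (markers present in at least 50% of the genomes) can be sketched as follows. This is an illustration only, not the code used to build the archive; the input structure (a dictionary mapping each marker to the set of genomes in which it was found) and the function name are assumptions.

```python
def select_markers(presence, n_genomes, min_fraction=0.5):
    """Return the sorted markers found in at least min_fraction of the genomes.

    presence: {marker_name: set of genome identifiers carrying the marker}
    n_genomes: total number of genomes considered for the tree
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return sorted(
        marker
        for marker, genomes in presence.items()
        if len(genomes) / n_genomes >= min_fraction
    )
```

With four genomes, a marker found in two of them sits exactly at the 50% threshold and is kept, while a marker found in only one is dropped.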

Taxonomy affiliation

The file phylogenetic_tree/taxonomy_affiliation_MAG_SAG.tsv summarises the taxonomic affiliation proposed for the MAGs and SAGs in which a sufficient number of markers were present.

The final taxonomic affiliation is also available in the spreadsheet MAG_SAG_metadata.ods.

Mapping public metagenomes

The files present in the Zip archive mapping_public_metagenomes.zip are:

  • public_metagenomes_metadata.tsv : the metadata table of the metagenomes. The columns correspond to:
    • accession : Identifier in the public databases, except for datasets of the project
    • origin : where the metagenome was collected
    • origin_simplified : simplified version, as some names were long
    • country : country in which the sample was taken
    • broad_geo_region : the UN geoscheme code corresponding to the country (https://en.wikipedia.org/wiki/United_Nations_geoscheme)
    • dataset : this project or public data
    • DLATITUDE : latitude, in decimal degrees
    • DLONGITUDE : longitude, in decimal degrees
    • salinity : relation of the sample to the salinity
    • ECOSYSTEM.TYPE : type of ecosystem sampled
    • MINIMUM.SIZE.FRACTION : when available, the pore size on which the genetic material was collected
    • MAXIMUM.SIZE.FRACTION : when available, the pore size used to prefilter the sample
    • SAMPLE.MATERIAL : nature of the sample, mostly water
    • clean_reads : number of reads used for the mapping
    • mapped_reads : number of reads that mapped on the MAGs and SAGs of this project
    • filtered_reads : number of reads that passed the filters from msamtools
    • Nb_MAG_detected_sup_10pct : number of MAGs and SAGs detected at more than 10 % breadth of coverage in the metagenome
    • cluster : the cluster the metagenome belongs to
  • public_metagenomes_read_count_on_MAGs_and_SAGs.tsv : the table summarising the read counts. The first column contains the MAG and SAG identifiers, and each of the other 3097 columns represents one public metagenome. The values are the number of reads mapped per MAG/SAG per public metagenome, after filtering the mapping with msamtools v1.1.0 and the parameters filter -b -l 50 -p 95 -z 80.
  • public_metagenomes_detection_of_MAGs_and_SAGs.tsv : the breadth of coverage, in percent, of each MAG/SAG in each public metagenome. A value of "100" means that all positions of a given genome are covered by at least one read from the given metagenome, and "0" means that no read from that metagenome mapped on that MAG/SAG.
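As an illustration of how the detection table relates to the Nb_MAG_detected_sup_10pct column, the sketch below counts, for each metagenome, the genomes whose breadth of coverage is strictly above 10 %. It assumes the detection table has the same layout as the read count table (first column: MAG/SAG identifier, one column per metagenome, tab-separated); that layout and the function name are assumptions, and this is not the code used to produce the metadata table.

```python
import csv


def count_detected(handle, threshold=10.0):
    """Return {metagenome: number of genomes with breadth > threshold}.

    handle: an open text file (or any iterable of lines) holding a
    tab-separated table with genome identifiers in the first column.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    metagenomes = header[1:]
    counts = dict.fromkeys(metagenomes, 0)
    for row in reader:
        if not row:
            continue  # skip blank lines
        for name, value in zip(metagenomes, row[1:]):
            if float(value) > threshold:  # "more than 10 %": strict comparison
                counts[name] += 1
    return counts
```

A typical call would be `count_detected(open("public_metagenomes_detection_of_MAGs_and_SAGs.tsv", newline=""))`; rows are streamed, so the full table is never held in memory.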

Unigenes

The files stored in the Zip archive unigenes.zip are:

  • unigenes_sequences.fa.gz : unigene sequences, cleaned of contamination (human, metazoans, bacteria, archaea and viruses)
  • table_readCount.noHuman.noConta.noMetazoa.annot.tsv.gz: counts of reads mapped on the unigenes, plus functional annotations: KEGG KO, Pfam and GO (derived from Pfam)
  • table_taxonomy.perUnigene.allUnigenes.tsv.gz: taxonomic annotation of the unigenes

See also the work of Monjot et al., 2023.

Files


Files (49.2 GB)

md5:b67e89f0310fa32e212ed15dac220e57 (2.1 GB)
md5:3b6a2a864f23924ba686053900e82d4c (50.7 kB)
md5:cf72e32fa9d2190c010806161705a7c3 (2.3 MB)
md5:bbe3cd9fbe644fbe1fba3adfc8dce15f (44.8 GB)
md5:894bdd22263ec9335ba602021d866ced (41.2 MB)
md5:804a55ec3b9933b06fc24033ba0025d8 (8.2 kB)
md5:ba91e6659574b1f7809f6dfaf9523d7e (45.3 MB)
md5:b01750258915da99a111419a035465d9 (2.3 GB)

Additional details

Funding

France Génomique
ANR-10-INBS-09-08