MICROSTORE : 'omic approaches to decipher MICrobial eukaRyOteS funcTiOn in fReshwater lake Ecosystems
Authors/Creators
- Courtine, Damien (Project member)1
- Lepere, Cecile (Project member)1
- Wawrzyniak, Ivan (Project member)1
- Moné, Anne (Project member)1
- Billard, Hermine (Project member)1
- Colombet, Jonathan (Project member)1
- Monjot, Arthur (Project member)1
- Cruaud, Corinne (Project member)2
- DA SILVA, Corinne (Project member)3
- AURY, Jean-Marc (Project member)3
- DEBROAS, Didier (Project member)1
- Bronner, Gisele (Project member)1
Description
Changes
These changes were made between v1 and v2:
- SAG:
  - contigs have been renamed and only those longer than 2.5 kb were kept
  - gene prediction was done with GeneMark-EP+
  - genes are available in GFF3 format, transcripts and proteins as FASTA sequences
- The DNA-dependent RNA polymerase phylogenomic tree was updated with two SAGs from this dataset
- Mapping public metagenomes:
  - rows corresponding to metagenomes that had fewer than 100 mapped reads were removed from the table
  - column removed:
    - "used_for_UMAP": after the removal of some rows, this column contained only 'yes' values
  - columns added:
    - "clean_bases"
    - "Nb_MAG_detected_sup_10pct"
    - "cluster"
- A metadata table for MAGs and SAGs has been added; it contains the assembly statistics, BUSCO results and taxonomic affiliation
Archive content
This Zenodo record contains the following items:
Single Amplified Genomes (SAGs)
The files present in the Zip archive SAG.zip are:
- genome/*.fa.gz: SAG genomes
- gene_prediction/<SAG>.gff3.gz: gene predictions
- gene_prediction/<SAG>.transcript.fa.gz: transcript sequences
- gene_prediction/<SAG>.protein.fa.gz: protein sequences
A summary for each SAG is available in the spreadsheet MAG_SAG_metadata.ods.
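As a quick sanity check after extracting SAG.zip, the gzip-compressed FASTA files can be read with the Python standard library alone. The sketch below is only an illustration: the function name fasta_lengths and the extraction path in the usage comment are assumptions, not part of the archive.

```python
import gzip


def fasta_lengths(path):
    """Return {sequence_id: length} for a gzip-compressed FASTA file."""
    lengths = {}
    current = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Keep only the identifier: the text before the first whitespace.
                fields = line[1:].split()
                current = fields[0] if fields else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths


# Hypothetical usage, assuming SAG.zip was extracted into ./SAG/:
#   from pathlib import Path
#   for fasta in sorted(Path("SAG/genome").glob("*.fa.gz")):
#       sizes = fasta_lengths(fasta)
#       print(fasta.name, len(sizes), "contigs", sum(sizes.values()), "bp")
```

Since only contigs longer than 2.5 kb were kept in v2, the smallest value returned for any SAG should be above 2500; a shorter contig would suggest a truncated download or a file from v1.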
Metagenome-Assembled Genomes (MAGs)
The files present in the Zip archive MAG.zip are:
- euk_mag/*.fa.gz: MAG genomes
- gene_prediction_final/<MAG>.<GeneCaller>.gff3.gz: gene predictions
- gene_prediction_final/<MAG>.<GeneCaller>.transcript.fa.gz: transcript sequences
- gene_prediction_final/<MAG>.<GeneCaller>.prot.fa.gz: protein sequences
Some MAGs for which gene prediction failed (sequence too short or another reason), and which were not used in the analysis based on the mapping of public metagenomes, are available in the directory not_used/.
A summary for each MAG is available in the spreadsheet MAG_SAG_metadata.ods.
Metagenomic assemblies and bins
The files present in the Zip archive metagenomic_assemblies.zip are:
- contigs_db/: the 4 anvi'o contigs databases that contain the contig sequences
- merged_profiles_db: the 4 anvi'o profile databases that contain the mapping results. The bins (MetaBAT2) are stored in these databases
- README_metagenomic_assemblies.md: a text file with more details
WARNING: the files have been compressed with bzip2 (.bz2) and must be decompressed (bunzip2) before use. They use about 110 GB of disk space.
Even though these artifacts were generated for a previous version of anvi'o, a script is available to continue using them with an up-to-date installation, anvi-migrate (documentation).
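Where bunzip2 is not available, the decompression step mentioned in the warning above can be approximated with the Python standard library. This is a minimal sketch, not a tool shipped with the archive: the function names decompress_bz2 and decompress_tree are hypothetical, and unlike bunzip2 the sketch keeps the compressed source file by default.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, keep_source=True):
    """Decompress a .bz2 file next to itself and return the new path."""
    src = Path(src)
    if src.suffix != ".bz2":
        raise ValueError(f"not a .bz2 file: {src}")
    dest = src.with_suffix("")  # drops the trailing ".bz2"
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        # Stream in chunks so that large databases are never held in memory.
        shutil.copyfileobj(fin, fout, length=16 * 1024 * 1024)
    if not keep_source:
        src.unlink()
    return dest


def decompress_tree(root):
    """Decompress every .bz2 file found under `root`, recursively."""
    return [decompress_bz2(path) for path in sorted(Path(root).rglob("*.bz2"))]
```

Keeping the source files doubles the disk usage during the operation, so passing keep_source=False may be preferable given the size quoted above.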
Phylogenetic tree
All files are provided in the Zip archive phylogenetic_tree.zip.
The DNA-dependent RNA polymerase
The files present in the directory phylogenetic_tree/DNA_dependent_RNA_pol are:
- 00_hmm_profiles/: the HMM profiles that target the DNA-dependent RNA polymerase sub-units
- 01_sequences/: best hit for each sub-unit
- 02_alignments_raw/: raw alignment, by MAFFT v7.526
- 03_alignments_cleaned/: alignment after goalign clean sites -c 0.5
- merge_alignments.py: Python script to concatenate the 6 sub-units
- RNAP_aln_v6_concat.fa: the concatenated alignment
- 04_tree/: the run of IQ-Tree v2.4.0
- metadata_RNAP_tree.tsv: metadata to decorate the tree
The Python script merge_alignments.py requires Python version 3 and BioPython to work.
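To illustrate what concatenating sub-unit alignments involves, here is a dependency-free sketch. It is not the merge_alignments.py script from the archive and does not use BioPython. It assumes (hypothetically) that the same genome identifier is used in every sub-unit alignment and that a genome missing from one alignment should be padded with gaps; the function names are made up for this example.

```python
def read_fasta(path):
    """Parse a FASTA file into an ordered {identifier: sequence} dict."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                records[name] = []
            elif line and name is not None:
                records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def concatenate_alignments(alignments):
    """Concatenate a list of {taxon: aligned_sequence} dicts column-wise."""
    # Collect taxa in order of first appearance.
    taxa = []
    for aln in alignments:
        for taxon in aln:
            if taxon not in taxa:
                taxa.append(taxon)
    merged = {taxon: [] for taxon in taxa}
    for aln in alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences in one alignment must share a length")
        width = widths.pop()
        for taxon in taxa:
            # Assumption: a taxon absent from this alignment is filled with gaps.
            merged[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in merged.items()}
```

With six cleaned sub-unit alignments loaded through read_fasta, a call such as concatenate_alignments([aln1, ..., aln6]) would return one sequence per genome whose length is the sum of the six alignment widths.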
The file metadata_RNAP_tree.tsv lists the reference genomes. The "source" column corresponds to:
- Mendota: Krinos et al., 2024, Microbiome. MAGs are available at https://osf.io/9epa8/?view_only=152af26e11894ac0bcdfe542e02c6ab1
- public_database: EBI / NCBI / DDBJ
- METDB: https://metdb.sb-roscoff.fr/metdb/ (DNA-dependent RNA polymerase sequences are available at https://www.genoscope.cns.fr/tara/ in the section "Curated DNA-dependent RNA polymerase")
- Tara: Delmont et al., 2022, Cell Genomics. DNA-dependent RNA polymerase sequences are available at https://www.genoscope.cns.fr/tara/ in the section "Curated DNA-dependent RNA polymerase"
Phylosift tree
The files present in the directory phylogenetic_tree/phylosift_tree/ are:
- 01_marker_present/: markers identified and aligned by Phylosift
- 02_marker_selected/: markers selected for the tree, i.e. markers present in at least 50% of the genomes
- 03_marker_alignment_cleaned/: alignments cleaned by Trimal v1.5 with the parameter -automated1
- 04_phylosift_concatenated_alignment.fa: concatenation of the 50 markers
- 05_tree/: phylogeny built by IQTree v2.2.3
- metadata_phylosift_tree.csv: list of reference genomes and their taxonomy. This file can be directly used to decorate the tree visualised with TreeViewer.
Taxonomy affiliation
The file phylogenetic_tree/taxonomy_affiliation_MAG_SAG.tsv summarises the taxonomic affiliation proposed for MAGs and SAGs for which markers were present in a sufficient number.
The final taxonomic affiliation is also available in the spreadsheet MAG_SAG_metadata.ods.
Mapping public metagenomes
The files present in the Zip archive mapping_public_metagenomes.zip are:
- public_metagenomes_metadata.tsv: the metagenome metadata table. The columns correspond to:
  - accession: identifier in the public databases, except for datasets of the project
  - origin: where the metagenome was collected
  - origin_simplified: simplified version, as some names were long
  - country: country in which the sample was taken
  - broad_geo_region: the UN geoscheme code corresponding to the country (https://en.wikipedia.org/wiki/United_Nations_geoscheme)
  - dataset: this project or public data
  - DLATITUDE: latitude, in decimal degrees
  - DLONGITUDE: longitude, in decimal degrees
  - salinity: relation of the sample to salinity
  - ECOSYSTEM.TYPE: type of ecosystem sampled
  - MINIMUM.SIZE.FRACTION: when available, the pore size on which the genetic material was collected
  - MAXIMUM.SIZE.FRACTION: when available, the pore size used to prefilter the sample
  - SAMPLE.MATERIAL: nature of the sample, mostly water
  - clean_reads: number of reads used for the mapping
  - mapped_reads: number of reads that mapped on the MAGs and SAGs of this project
  - filtered_reads: number of reads that passed the msamtools filters
  - Nb_MAG_detected_sup_10pct: number of MAGs and SAGs that were detected at more than 10 % (breadth of coverage) in the metagenome
  - cluster: the cluster the metagenome belongs to
- public_metagenomes_read_count_on_MAGs_and_SAGs.tsv: the table that summarises the read counts. The first column holds the MAG and SAG identifiers, and each of the other 3097 columns represents one public metagenome. The values are the number of reads mapped per MAG/SAG per public metagenome, after filtering the mapping with msamtools v1.1.0 and the parameters filter -b -l 50 -p 95 -z 80.
- public_metagenomes_detection_of_MAGs_and_SAGs.tsv: the table that summarises the breadth of coverage, in percent, of each MAG/SAG in each public metagenome. A value of "100" means that every position of a given genome is covered by at least one read from the given metagenome, and "0" means that no read from the metagenome mapped on the MAG/SAG.
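The relation between the detection table and the Nb_MAG_detected_sup_10pct column can be sketched as follows. This is an illustration under assumptions, not the authors' code: it supposes that the detection file is tab-separated with a header row, that the first column holds the MAG/SAG identifier, and that every other cell is a plain number; the function name count_detected is hypothetical.

```python
import csv


def count_detected(path, threshold=10.0):
    """Return {metagenome: number of genomes with breadth of coverage > threshold}."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        metagenomes = header[1:]  # first column is assumed to be the genome identifier
        counts = dict.fromkeys(metagenomes, 0)
        for row in reader:
            for metagenome, value in zip(metagenomes, row[1:]):
                # Empty cells are skipped; any other non-numeric value raises ValueError.
                if value and float(value) > threshold:
                    counts[metagenome] += 1
    return counts
```

Calling count_detected("public_metagenomes_detection_of_MAGs_and_SAGs.tsv") would give one count per metagenome, which could then be compared with the values stored in the metadata table. The file is read row by row, so memory use stays small despite the 3097 metagenome columns.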
Unigenes
The files stored in the Zip archive unigenes.zip are:
- unigenes_sequences.fa.gz: unigene sequences, cleaned of contamination (human, metazoans, bacteria, archaea and viruses)
- table_readCount.noHuman.noConta.noMetazoa.annot.tsv.gz: counts of mapped reads on the unigenes plus functional annotations KEGG K0, Pfam and GO (derived from Pfam)
- table_taxonomy.perUnigene.allUnigenes.tsv.gz: unigene taxonomic annotation
See also the work of Monjot et al., 2023.
Files
(49.2 GB)
| MD5 checksum | Size |
|---|---|
| md5:b67e89f0310fa32e212ed15dac220e57 | 2.1 GB |
| md5:3b6a2a864f23924ba686053900e82d4c | 50.7 kB |
| md5:cf72e32fa9d2190c010806161705a7c3 | 2.3 MB |
| md5:bbe3cd9fbe644fbe1fba3adfc8dce15f | 44.8 GB |
| md5:894bdd22263ec9335ba602021d866ced | 41.2 MB |
| md5:804a55ec3b9933b06fc24033ba0025d8 | 8.2 kB |
| md5:ba91e6659574b1f7809f6dfaf9523d7e | 45.3 MB |
| md5:b01750258915da99a111419a035465d9 | 2.3 GB |
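A downloaded archive can be checked against the MD5 checksums listed above with a few lines of standard-library Python. The sketch below is only a suggestion; the function names are hypothetical, and which checksum belongs to which file has to be read from the Zenodo page, since the file names are not reproduced in this listing.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file matches `expected` (with or without the 'md5:' prefix)."""
    return md5sum(path) == expected.strip().lower().split(":")[-1]
```

For example, verify(path_to_download, "md5:b67e89f0310fa32e212ed15dac220e57") returns True only if the file at path_to_download is the 2.1 GB item of the record and was transferred intact.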
Additional details
Funding
- France Génomique
- ANR-10-INBS-09-08