Prokaryotic Gene Catalogs v1 (years 2009–2015) from the Blanes and Banyuls Bay Microbial Observatories (BBMO–SOLA)
Creators
- 1. Instituto de Ciencias del Mar (ICM-CSIC)
- 2. CNRS Délégation Languedoc-Roussillon
- 3. University of La Laguna
- 4. Observatoire Océanologique de Banyuls
Description
---------------------------------------------------------------------
Introduction
---------------------------------------------------------------------
This repository contains the first public release (v1) of the BBMO–SOLA prokaryotic gene catalog and derived abundance tables for genes and functions from two long-term coastal observatories in the NW Mediterranean: Blanes Bay Microbial Observatory (BBMO, http://bbmo.icm.csic.es) and Banyuls Bay Microbial Observatory (SOLA, https://www.obs-banyuls.fr/en/oob.html). The dataset includes a length-filtered non-redundant gene catalog, functional profiles (COG, Pfam, KEGG KO, eggNOG, CAZy), an OTU table (miTags), taxonomic assignments (GTDB), and per-sample environmental metadata.
The catalog was built from 174 metagenomes (monthly sampling from January 2009 to December 2015; 84 BBMO and 90 SOLA). Reads were cleaned and individually assembled with MEGAHIT v1.1.3, genes were predicted with MetaGeneMark v3.38 and Prodigal v2.6.3, then clustered at 95% identity / 80% coverage with MMseqs2 v.9-d36de linclust to generate a non-redundant catalog. Genes <250 bp were removed, yielding 209,195,684 genes in the filtered catalog. Functional annotation used COG (rpsblast v2.7.1), Pfam (HMMER3), KEGG (DIAMOND blastp v0.9.22), eggNOG (HMMER3), and CAZy (DIAMOND blastp v0.9.22). Taxonomic assignment used MMseqs2 (release v11-e1a1c) against GTDB r89 (unclustered 54k reference). Abundance profiles were obtained by mapping cleaned reads to the catalog and counting with HTSeq v0.10.0, followed by successive gene-length (within-sample) and metagenome size (between-sample) normalizations.
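The two successive normalizations can be illustrated with a minimal sketch. The record does not state the exact formulas, so the version below assumes that raw counts are first divided by gene length in kb (within-sample) and then by metagenome size in Gb (between-sample), as the `metaGsizeGbNorm` file suffix suggests. The function name, units, and toy values are hypothetical and are not taken from the authors' pipeline.

```python
def normalize_counts(raw_counts, gene_lengths_bp, metagenome_size_gb):
    """Length- then metagenome-size-normalize raw read counts for one sample.

    Assumed formula (not confirmed by the record):
        normalized = (count / gene_length_kb) / metagenome_size_gb
    """
    if metagenome_size_gb <= 0:
        raise ValueError("metagenome size must be positive")
    normalized = {}
    for gene_id, count in raw_counts.items():
        # Within-sample step: correct for gene length (bp -> kb).
        length_kb = gene_lengths_bp[gene_id] / 1000.0
        # Between-sample step: correct for sequencing depth (size in Gb).
        normalized[gene_id] = (count / length_kb) / metagenome_size_gb
    return normalized


# Toy example with made-up values for a single sample.
raw = {"gene_A": 100, "gene_B": 50}
lengths = {"gene_A": 1000, "gene_B": 500}
result = normalize_counts(raw, lengths, metagenome_size_gb=5.0)
print(result)
```

Under these assumptions, a gene that is half as long and recruits half as many reads ends up with the same normalized abundance, which is the purpose of the length correction.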
In addition, for comparative purposes, this repository includes a customised prokaryotic gene catalog and associated functional profiles from the TARA Oceans expedition (248,627,885 genes >250 bp), derived from a subset of 179 samples representing the small-size fractions from surface, deep chlorophyll maximum (DCM) and mesopelagic waters across tropical, subtropical, and polar regions. The catalog was processed following the same pipeline and methodological criteria as BBMO–SOLA (assembly, ORF prediction, clustering at 95% identity, filtering of genes <250 bp, functional annotation against Pfam and KEGG KO, and taxonomic assignment with MMseqs2/GTDB r89), ensuring that both datasets are directly comparable. Gene abundances are reported as length- and metagenome-size–normalized counts, with functional profiles including KEGG KO and Pfam domain abundances.
---------------------------------------------------------------------
Location
---------------------------------------------------------------------
This repository provides the BBMO–SOLA prokaryotic gene catalog together with functional annotations, taxonomic assignments, OTU table, and environmental metadata. It also includes the corresponding processed versions of the TARA prokaryotic gene catalog with functional and taxonomic annotations. All resources are available exclusively through this Zenodo record: 10.5281/zenodo.17183573.
---------------------------------------------------------------------
Directory Descriptions
---------------------------------------------------------------------
BBMO - SOLA:
- Gene catalog (filtered, non-redundant):
  - BLSO_95id_min250.fasta.gz – 209,195,684 representative genes at 95% ANI, >250 bp. (Catalog)
- Abundance tables (tab-delimited):
  - BBMOSOLA-GC_250bp_gene.lengthNorm.metaGsizeGbNorm.counts.tbl.gz – gene abundances (length- & metagenome-size–normalized).
  - BBMOSOLA-GC_250bp_COG.lengthNorm.metaGsizeNorm.counts.txt – COG functional abundances.
  - BBMOSOLA-GC_250bp_pfam.lengthNorm.metaGsizeNorm.counts.txt – Pfam domain abundances.
  - BBMOSOLA-GC_250bp_KEGG.ko.lengthNorm.metaGsizeNorm.counts.txt – KEGG KO abundances.
  - BBMOSOLA-GC_250bp_eggNOG.lengthNorm.metaGsizeNorm.counts.txt – eggNOG abundances.
  - BBMOSOLA-GC_250bp_CAZy.lengthNorm.metaGsizeNorm.counts.txt – CAZy family abundances.
- Taxonomy & OTUs:
  - BBMOSOLA.gene.taxonomy.tsv – MMseqs2/GTDB r89 assignments for catalog genes (Kraken/Krona-style consolidated output summarized).
  - BBMOSOLA.otu_table97.txt – miTags OTU table (97% identity).
- Environmental metadata:
  - BBMOSOLA.environmental_metadata.txt – per-sample environmental variables (e.g., temperature, salinity, nutrients, chlorophyll, day length).
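As a hedged illustration of working with the tab-delimited abundance tables, the sketch below streams a table row by row and sums each sample column. The column layout is not documented in this record: the code assumes a header row whose first field labels the feature (gene, KO, Pfam, etc.) and whose remaining fields are sample names, with purely numeric values. The helper names `open_text` and `column_sums` are hypothetical.

```python
import csv
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def column_sums(handle):
    """Sum each sample column of a tab-delimited abundance table.

    Assumes a header row (feature label, then sample names) and numeric
    values in every sample column (hypothetical layout).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    totals = [0.0] * len(samples)
    for row in reader:
        if not row:
            continue  # skip blank lines
        for i, value in enumerate(row[1:]):
            totals[i] += float(value)
    return dict(zip(samples, totals))


# Toy table standing in for one of the *.counts.txt files (made-up values).
toy = io.StringIO("KO\tsample1\tsample2\nK00001\t1.5\t2.0\nK00002\t0.5\t3.0\n")
totals = column_sums(toy)
print(totals)
```

With a downloaded file, the same function could be called as `with open_text("BBMOSOLA-GC_250bp_KEGG.ko.lengthNorm.metaGsizeNorm.counts.txt") as fh: column_sums(fh)`. Streaming is chosen here because the gene-level table covers more than 200 million genes and may not fit in memory.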
TARA:
- Gene catalog (filtered, non-redundant):
  - TARA.GC_95id_min250.fasta.gz – 248,627,885 representative genes at 95% ANI, >250 bp. (Catalog)
- Abundance tables (tab-delimited):
  - TARA-GC-ICMv_250bp_gene.lengthNorm.metaGsizeGbNorm.counts.tbl – gene abundances (length- & metagenome-size–normalized).
  - TARA-GC-ICMv_250bp_KEGG.ko.lengthNorm.metaGsizeNorm.counts.tbl – KEGG KO abundances.
  - TARA-GC-ICMv_250bp_pfam.lengthNorm.metaGsizeNorm.counts.tbl – Pfam domain abundances.
- Taxonomy:
  - TARA_taxonomyResults.tsv – MMseqs2/GTDB r89 assignments for catalog genes (Kraken/Krona-style consolidated output summarized).
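Both gene catalogs are large gzip-compressed FASTA files, so a streaming reader is a practical way to inspect them, for example to check sequence lengths against the 250 bp cutoff. The sketch below is a minimal example that assumes standard nucleotide FASTA formatting (header lines starting with `>`, sequences possibly wrapped over several lines). The function name `fasta_lengths` is hypothetical, and the toy records are made up.

```python
import gzip
import io


def fasta_lengths(handle):
    """Yield (record_id, sequence_length) for each FASTA record in a text handle."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, length
            fields = line[1:].split()
            # The record ID is the first whitespace-delimited header token.
            name, length = (fields[0] if fields else ""), 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


# Toy FASTA standing in for a catalog file (made-up sequences).
toy = io.StringIO(">g1 some description\nATGC\nATG\n>g2\nAT\n")
lengths = list(fasta_lengths(toy))
print(lengths)

# Records shorter than the 250 bp cutoff named in this record.
short = [rec_id for rec_id, n in lengths if n < 250]
print(short)
```

For a real catalog, the handle could be opened with `gzip.open("BLSO_95id_min250.fasta.gz", "rt")` and passed to `fasta_lengths` in the same way, consuming the generator lazily rather than building a list.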
---------------------------------------------------------------------
Data availability and accession numbers
---------------------------------------------------------------------
The metagenomic datasets used in this work correspond to publicly available and newly released data:
- SOLA (Banyuls Bay Microbial Observatory): Raw metagenomic sequences are available in the NCBI Sequence Read Archive under accession numbers PRJEB66489 and PRJEB26919 (Beauvais et al., 2023, Environmental Microbiology).
- TARA Oceans: Metagenomic data from the Tara Oceans expedition are available in the European Nucleotide Archive under project PRJEB402.
- BBMO (Blanes Bay Microbial Observatory): Metagenomic data from the BBMO are available in the European Nucleotide Archive under project PRJEB48035.
---------------------------------------------------------------------
Contact information
---------------------------------------------------------------------
For any questions, please contact Dr. Ramiro Logares (logares@icm.csic.es), Lidia Montiel (montiel@icm.csic.es) or Sergio González-Motos (sgonzalez@icm.csic.es).
---------------------------------------------------------------------
Acknowledgements
---------------------------------------------------------------------
This work is an effort of the log-lab (https://log-lab.barcelona), the Ecology of Marine Microbes (EMM; https://emm.icm.csic.es) group at the Institut de Ciències del Mar, Barcelona, Spain (ICM - CSIC), and the Laboratoire d’Ecogéochimie des Environnements Benthiques (LECOB) at the Observatoire Océanologique de Banyuls, France. All the bioinformatics analyses were performed at the Marine Bioinformatics Core Service ICM - CSIC (MARBITS; http://marbits.icm.csic.es/) and the Finisterrae III supercomputer at the Centro de Supercomputación de Galicia (CESGA; https://www.cesga.es/). We thank all members of the BBMO and SOLA time-series teams for their sustained efforts and contributions over the years to the generation of this dataset and the Tara Oceans Consortium for providing publicly available metagenomic resources.
---------------------------------------------------------------------
Copyright notice
---------------------------------------------------------------------
Prokaryotic Gene Catalogs v1 (years 2009–2015) from the Blanes and Banyuls Bay Microbial Observatories (BBMO–SOLA) (C).
This catalog is provided “as is”, without warranty of any kind, and is openly available for non-commercial purposes. You can redistribute and/or modify it under the terms of the Creative Commons Attribution Share Alike 4.0 International license.
For commercial purposes, please contact us.
Additional details
Funding
- Agencia Estatal de Investigación: MINIME PID2019-105775RB-I00
- Agencia Estatal de Investigación: MAORI PID2022-136281NB-I00
- Agencia Estatal de Investigación: INTERACTOMICS CTM2015-69936-P