Published October 16, 2025 | Version v1
Dataset Embargoed

Prokaryotic Gene Catalogs v1 (years 2009–2015) from the Blanes and Banyuls Bay Microbial Observatories (BBMO–SOLA)

Description

 

---------------------------------------------------------------------
Introduction
--------------------------------------------------------------------- 

This repository contains the first public release (v1) of the BBMO–SOLA prokaryotic gene catalog and derived abundance tables for genes and functions from two long-term coastal observatories in the NW Mediterranean: Blanes Bay Microbial Observatory (BBMO, http://bbmo.icm.csic.es) and Banyuls Bay Microbial Observatory (SOLA, https://www.obs-banyuls.fr/en/oob.html). The dataset includes a length-filtered non-redundant gene catalog, functional profiles (COG, Pfam, KEGG KO, eggNOG, CAZy), an OTU table (miTags), taxonomic assignments (GTDB), and per-sample environmental metadata.

The catalog was built from 174 metagenomes (monthly sampling from January 2009 to December 2015; 84 BBMO and 90 SOLA). Reads were cleaned and individually assembled with MEGAHIT v1.1.3, genes were predicted with MetaGeneMark v3.38 and Prodigal v2.6.3, then clustered at 95% identity / 80% coverage with MMseqs2 v.9-d36de linclust to generate a non-redundant catalog. Genes <250 bp were removed, yielding 209,195,684 genes in the filtered catalog. Functional annotation used COG (rpsblast v2.7.1), Pfam (HMMER3), KEGG (DIAMOND blastp v0.9.22), eggNOG (HMMER3), and CAZy (DIAMOND blastp v0.9.22). Taxonomic assignment used MMseqs2 (release v11-e1a1c) against GTDB r89 (unclustered 54k reference). Abundance profiles were obtained by mapping cleaned reads to the catalog and counting with HTSeq v0.10.0, followed by successive gene-length (within-sample) and metagenome size (between-sample) normalizations.
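
The two successive normalizations mentioned above can be illustrated with a minimal sketch. The exact formula used to produce the released tables is not given in this description, so the following is an assumption: raw read counts are divided by gene length in kb (within-sample step) and then by metagenome size in Gb (between-sample step, as the "metaGsizeGbNorm" file suffix suggests). The function name, the units, and the toy values are hypothetical.

```python
def normalize_counts(raw_counts, gene_lengths_bp, metagenome_sizes_bp):
    """Length- and metagenome-size-normalize raw counts (illustrative only).

    raw_counts:          {gene: {sample: raw read count}}
    gene_lengths_bp:     {gene: gene length in bp}
    metagenome_sizes_bp: {sample: total sequenced bases in that metagenome}
    Returns {gene: {sample: normalized count}}.
    """
    normalized = {}
    for gene, per_sample in raw_counts.items():
        # Within-sample step: express counts per kb of gene (assumed unit).
        length_kb = gene_lengths_bp[gene] / 1_000
        normalized[gene] = {}
        for sample, count in per_sample.items():
            # Between-sample step: express counts per Gb sequenced (assumed unit).
            size_gb = metagenome_sizes_bp[sample] / 1e9
            normalized[gene][sample] = count / length_kb / size_gb
    return normalized


# Toy input (made-up values, not taken from the dataset).
raw = {"geneA": {"S1": 100, "S2": 50}, "geneB": {"S1": 30, "S2": 30}}
lengths = {"geneA": 1000, "geneB": 500}
sizes = {"S1": 2_000_000_000, "S2": 1_000_000_000}

norm = normalize_counts(raw, lengths, sizes)
```

In this toy case geneA has the same normalized abundance in both samples, because S1 holds twice the raw counts but was sequenced twice as deeply.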

In addition, for comparative purposes, this repository includes a customised prokaryotic gene catalog and associated functional profiles from the TARA Oceans expedition (248,627,885 genes >250 bp), derived from a subset of 179 samples representing the small-size fractions from surface, DCM and mesopelagic waters across tropical, subtropical, and polar regions. The catalog was processed with the same pipeline and methodological criteria as BBMO–SOLA (assembly, ORF prediction, clustering at 95% identity, removal of genes <250 bp, functional annotation against Pfam and KEGG KO, and taxonomic assignment with MMseqs2/GTDB r89), so that both datasets are directly comparable. Gene abundances are reported as length- and metagenome-size–normalized counts, and the functional profiles comprise KEGG KO and Pfam domain abundances.


---------------------------------------------------------------------
Location
---------------------------------------------------------------------

This repository provides the BBMO–SOLA prokaryotic gene catalog together with functional annotations, taxonomic assignments, OTU table, and environmental metadata. It also includes the corresponding processed versions of the TARA prokaryotic gene catalog with functional and taxonomic annotations. All resources are available exclusively through this Zenodo record: 10.5281/zenodo.17183573.

 

---------------------------------------------------------------------
Directory Descriptions
---------------------------------------------------------------------

BBMO - SOLA:

  • Gene catalog (filtered, non-redundant):

    • BLSO_95id_min250.fasta.gz – 209,195,684 representative genes clustered at 95% identity, >250 bp. (Catalog)

  • Abundance tables (tab-delimited):

    • BBMOSOLA-GC_250bp_gene.lengthNorm.metaGsizeGbNorm.counts.tbl.gz – gene abundances (length- & metagenome-size–normalized).

    • BBMOSOLA-GC_250bp_COG.lengthNorm.metaGsizeNorm.counts.txt – COG functional abundances.

    • BBMOSOLA-GC_250bp_pfam.lengthNorm.metaGsizeNorm.counts.txt – Pfam domain abundances.

    • BBMOSOLA-GC_250bp_KEGG.ko.lengthNorm.metaGsizeNorm.counts.txt – KEGG KO abundances.

    • BBMOSOLA-GC_250bp_eggNOG.lengthNorm.metaGsizeNorm.counts.txt – eggNOG abundances.

    • BBMOSOLA-GC_250bp_CAZy.lengthNorm.metaGsizeNorm.counts.txt – CAZy family abundances.

  • Taxonomy & OTUs:

    • BBMOSOLA.gene.taxonomy.tsv – MMseqs2/GTDB r89 assignments for catalog genes (Kraken/Krona-style consolidated output summarized).

    • BBMOSOLA.otu_table97.txt – miTags OTU table (97% identity).

  • Environmental metadata:

    • BBMOSOLA.environmental_metadata.txt – per-sample environmental variables (e.g., temperature, salinity, nutrients, chlorophyll, day length).

TARA:

  • Gene catalog (filtered, non-redundant):

    • TARA.GC_95id_min250.fasta.gz – 248,627,885 representative genes clustered at 95% identity, >250 bp. (Catalog)

  • Abundance tables (tab-delimited):

    • TARA-GC-ICMv_250bp_gene.lengthNorm.metaGsizeGbNorm.counts.tbl – gene abundances (length- & metagenome-size–normalized).

    • TARA-GC-ICMv_250bp_KEGG.ko.lengthNorm.metaGsizeNorm.counts.tbl – KEGG KO abundances.

    • TARA-GC-ICMv_250bp_pfam.lengthNorm.metaGsizeNorm.counts.tbl – Pfam domain abundances.

  • Taxonomy:

    • TARA_taxonomyResults.tsv – MMseqs2/GTDB r89 assignments for catalog genes (Kraken/Krona-style consolidated output summarized).
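
As a starting point for working with the tab-delimited abundance tables listed above, the following minimal sketch streams a table row by row, so that large files need not be loaded into memory at once. The internal layout of the tables is not documented in this description. The sketch assumes a header line whose first field labels the feature column and whose remaining fields are sample names, followed by one row per gene or function with numeric values. The function name and the demo file are hypothetical.

```python
import csv
import gzip
import os
import tempfile


def iter_abundance_rows(path):
    """Yield (feature_id, {sample: value}) from a tab-delimited table.

    Files ending in ".gz" are decompressed on the fly.
    Assumed layout: header = feature label + sample names; rows = ID + numbers.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]
        for row in reader:
            if not row:  # skip blank lines
                continue
            values = [float(v) for v in row[1:]]
            yield row[0], dict(zip(samples, values))


# Demo on a tiny made-up table (not real data from this record).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.counts.tbl.gz")
    with gzip.open(demo_path, "wt", newline="") as out:
        out.write("feature\tS1\tS2\n")
        out.write("K00001\t1.5\t0\n")
        out.write("K00002\t0.25\t3\n")
    rows = list(iter_abundance_rows(demo_path))
```

If the released tables turn out to use a different layout (for example, samples as rows), only the header handling in the function would need to change.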

 

---------------------------------------------------------------------
Data availability and accession numbers
---------------------------------------------------------------------

The metagenomic datasets used in this work correspond to publicly available and newly released data:

  • SOLA (Banyuls Bay Microbial Observatory): Raw metagenomic sequences are available in the NCBI Sequence Read Archive under accession numbers PRJEB66489 and PRJEB26919 (Beauvais et al., 2023, Environmental Microbiology).

  • TARA Oceans: Metagenomic data from the Tara Oceans expedition are available in the European Nucleotide Archive under project PRJEB402.

  • BBMO (Blanes Bay Microbial Observatory): Metagenomic data from the BBMO are available in the European Nucleotide Archive under project PRJEB48035.


---------------------------------------------------------------------
Contact information
---------------------------------------------------------------------

For any questions, please contact Dr. Ramiro Logares (logares@icm.csic.es), Lidia Montiel (montiel@icm.csic.es), or Sergio González-Motos (sgonzalez@icm.csic.es).


---------------------------------------------------------------------
Acknowledgements
---------------------------------------------------------------------

This work is an effort of the log-lab (https://log-lab.barcelona), the Ecology of Marine Microbes (EMM; https://emm.icm.csic.es) group at the Institut de Ciències del Mar, Barcelona, Spain (ICM - CSIC), and the Laboratoire d’Ecogéochimie des Environnements Benthiques (LECOB) at the Observatoire Océanologique de Banyuls, France. All the bioinformatics analyses were performed at the Marine Bioinformatics Core Service ICM - CSIC (MARBITS; http://marbits.icm.csic.es/) and the Finisterrae III supercomputer at the Centro de Supercomputación de Galicia (CESGA; https://www.cesga.es/). We thank all members of the BBMO and SOLA time-series teams for their sustained efforts and contributions over the years to the generation of this dataset and the Tara Oceans Consortium for providing publicly available metagenomic resources.


---------------------------------------------------------------------
Copyright notice
---------------------------------------------------------------------

Prokaryotic Gene Catalogs v1 (years 2009–2015) from the Blanes and Banyuls Bay Microbial Observatories (BBMO–SOLA) (C).

This catalog is provided “as is”, without warranty of any kind, and is openly available for non-commercial purposes. You can redistribute and/or modify it under the terms of the Creative Commons Attribution Share Alike 4.0 International license.

For commercial purposes, please contact us.

      

 

Files

Embargoed

The files will be made publicly available on April 1, 2026.

Reason: The dataset is part of an ongoing publication and will remain under embargo until the corresponding paper is published.

Additional details

Funding

Agencia Estatal de Investigación
MINIME PID2019-105775RB-I00
Agencia Estatal de Investigación
MAORI PID2022-136281NB-I00
Agencia Estatal de Investigación
INTERACTOMICS CTM2015-69936-P