A catalog of genes and species of the human oral microbiota

Le Chatelier, Emmanuelle; Almeida, Mathieu; Plaza Onate, Florian; Pons, Nicolas; Gauthier, Franck; Ghozlane, Amine; Ehrlich, Stanislav Dusko; Witherden, Elizabeth; Gomez-Cabrero, David

doi:10.5281/zenodo.16983006

Published August 29, 2025 | Version v9

Dataset Open

1. Université Paris-Saclay, INRAE
2. Université Paris-Cité, Institut Pasteur
3. King's College London
4. King's College London, NavarraBiomed

Data sources

The oral gene catalog was built using three primary sources:

Bacterial Genomes from the Human Oral Microbiome Database (HOMD).
Fungal Genomes from the NCBI RefSeq database.
Metagenomic Sequencing Data from multiple oral microbiome studies.

The creation of the oral gene catalog was a multi-step process, combining and refining genes from each source.

Bacterial Genes

A total of 1,505 bacterial genomes were downloaded from HOMD (version 20170215, accessed in December 2017). Genes shorter than 60 nucleotides or containing ambiguous bases were filtered out. Redundancy was removed using CD-HIT-EST (v4.6; parameters: -aS 0.9 -c 0.95 -T 0 -M 0 -t 0 -d 0 -G 0). This process yielded 1,459,394 unique HOMD genes for the catalog.
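The length and ambiguity filter described above can be sketched in a few lines of Python. This is only an illustration, not the authors' actual script: the function names (read_fasta, keep_gene, filter_genes) are hypothetical, and treating any character other than A, C, G or T as an ambiguous base is an assumption.

```python
from typing import Iterable, Iterator, Tuple

MIN_GENE_LENGTH = 60          # nucleotides, threshold stated in the text
UNAMBIGUOUS = set("ACGT")     # assumption: anything else counts as ambiguous


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_gene(sequence: str, min_length: int = MIN_GENE_LENGTH) -> bool:
    """True if the gene is long enough and contains only A, C, G, T."""
    seq = sequence.upper()
    return len(seq) >= min_length and set(seq) <= UNAMBIGUOUS


def filter_genes(records: Iterable[Tuple[str, str]],
                 min_length: int = MIN_GENE_LENGTH) -> Iterator[Tuple[str, str]]:
    """Keep only the records that pass keep_gene."""
    for header, seq in records:
        if keep_gene(seq, min_length):
            yield header, seq
```

The filtered records would then be written back to FASTA and passed to CD-HIT-EST with the parameters listed above.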

Fungal Genes

A total of 1,017 fungal genomes were downloaded from NCBI RefSeq (May 2017). For the 492 genomes lacking existing annotations, gene calling was performed using GeneMark-ES in fungi mode. After initial redundancy removal with CD-HIT-EST (v4.6; parameters: -aS 0.9 -c 0.95 -T 0 -M 0 -t 0 -d 0 -G 0), genes were retained only if their genome was present in at least 20% of the samples of at least one metagenomic cohort, as determined by mapping reads with Bowtie2 (v2.2.3). This led to the selection of 2,440,644 fungal genes.
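The 20% prevalence rule can be illustrated as follows. The sketch assumes that read mapping has already been reduced to a per-sample set of detected genomes. The record does not state the criterion for calling a genome "present" in a sample, so that step is left out, and the function name and input layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def prevalent_genomes(detected: Dict[str, Dict[str, Set[str]]],
                      min_prevalence: float = 0.20) -> Set[str]:
    """Return genomes present in at least min_prevalence of the samples
    of at least one cohort.

    detected maps cohort name -> sample name -> set of genome identifiers
    called present in that sample (the detection rule is an assumption
    made upstream of this function).
    """
    selected: Set[str] = set()
    for samples in detected.values():
        n_samples = len(samples)
        if n_samples == 0:
            continue
        # Each sample contributes at most one count per genome.
        counts = Counter(g for genomes in samples.values() for g in genomes)
        selected.update(g for g, c in counts.items()
                        if c / n_samples >= min_prevalence)
    return selected
```

Genes belonging to the returned genomes would be the ones carried forward into the catalog.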

Metagenomic Sequencing Data

The gene catalog was supplemented with data from 689 oral metagenomes, including newly sequenced samples, from the following studies:

Human Microbiome Project (HMP): 382 samples (bioproject PRJNA255439).
Chinese Cohort: 212 samples (bioproject PRJEB6997).
TwinsUK Cohort: 48 newly sequenced samples (bioproject PRJEB38483).

Raw reads were subjected to quality control and trimmed using AlienTrimmer 0.4.0 (parameters: -k 10 -l 45 -m 5 -p 40 -q 20). Human sequences were removed by mapping against the human reference genome (GRCh38.p11) using Bowtie2 2.2.3. Metagenomic assembly was performed using SPAdes 3.9.0 (parameters: "-k 21,33,55 --only-assembler --meta" for Illumina paired-end data, or "--iontorrent -t 24 -m 300 -k 21,33,55 --only-assembler" for Ion Torrent single-end data). Contigs shorter than 500 bp or with coverage less than 2x were discarded. Gene calling was conducted with Prodigal (parameters: -m -p meta). Genes shorter than 60 bp were filtered out, and redundancy was removed with CD-HIT-EST (v4.6; parameters: -aS 0.9 -c 0.95 -T 0 -M 0 -t 0 -d 0 -G 0).
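As an illustration of the contig filter, the sketch below reads length and coverage from SPAdes-style contig names such as NODE_1_length_1500_cov_12.5. Whether the authors took coverage from these headers or recomputed it from read alignments is not stated, so this is an assumption. The function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# SPAdes names contigs like "NODE_1_length_1500_cov_12.5".
HEADER_RE = re.compile(r"_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_header(header: str) -> Optional[Tuple[int, float]]:
    """Extract (length, coverage) from a contig name, or None if absent."""
    match = HEADER_RE.search(header)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def keep_contig(header: str,
                min_length: int = 500,
                min_coverage: float = 2.0) -> bool:
    """True for contigs of at least 500 bp with coverage of at least 2x."""
    parsed = parse_header(header)
    if parsed is None:
        # Assumption: contigs without parsable metadata are discarded.
        return False
    length, coverage = parsed
    return length >= min_length and coverage >= min_coverage
```

Contigs passing this check would then go to Prodigal for gene calling.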

Final Gene Catalog

The final gene catalog was assembled by sequentially adding non-redundant genes from each data source. Genes from HOMD and fungal genomes were combined first using cd-hit-est-2d. Then, non-redundant genes from the HMP, Chinese, and TwinsUK cohorts were sequentially added using cd-hit-est-2d (same parameters as cd-hit-est). A final redundancy removal step was performed. This process resulted in a catalog of 8.4 million non-redundant genes.
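The sequential merge can be sketched as a plan of cd-hit-est-2d invocations. cd-hit-est-2d compares a second FASTA file (-i2) against a first one (-i) and writes to -o the sequences of the second file that are not similar to the first. The sketch assumes that each step's output is then concatenated with the current catalog before the next source is added. All file names and the function name are hypothetical, and the code only builds the command lines without running them.

```python
from typing import Dict, List, Sequence

# Clustering parameters quoted in the text for cd-hit-est.
CDHIT_PARAMS = ["-aS", "0.9", "-c", "0.95", "-T", "0", "-M", "0",
                "-t", "0", "-d", "0", "-G", "0"]


def plan_incremental_merge(base: str,
                           additions: Sequence[str],
                           prefix: str = "catalog_step") -> List[Dict]:
    """Build one cd-hit-est-2d step per added source.

    Each step records the command, the two files to concatenate afterwards
    (current catalog + novel genes), and the name of the merged result.
    """
    steps: List[Dict] = []
    current = base
    for index, new_source in enumerate(additions, start=1):
        novel = f"{prefix}{index}.novel.fa"    # hypothetical file name
        merged = f"{prefix}{index}.merged.fa"  # hypothetical file name
        command = ["cd-hit-est-2d", "-i", current, "-i2", new_source,
                   "-o", novel, *CDHIT_PARAMS]
        steps.append({"command": command,
                      "concatenate": [current, novel],
                      "merged": merged})
        current = merged
    return steps
```

For example, plan_incremental_merge("homd.fa", ["fungi.fa", "hmp.fa", "chinese.fa", "twinsuk.fa"]) yields four steps in the order given in the text. Each command list could be passed to subprocess.run once the previous merged file exists.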

MSPs Recovery

The 689 metagenomic samples were aligned against the final gene catalog using the Meteor software suite to produce a gene abundance table. Then, co-abundant genes were binned into 853 Metagenomic Species Pan-genomes (MSPs) using MSPminer.

MSPs Taxonomic Annotation

Taxonomic annotation for the MSPs was performed by aligning all core and accessory genes against representative genomes from the GTDB database (release r214) using blastn (task: megablast, word_size: 16).

A species-level assignment was given if over 50% of the genes matched a representative genome with a mean nucleotide identity of at least 95% and a mean gene length coverage of at least 90%.
The remaining MSPs were assigned to a higher taxonomic level (genus to superkingdom) if more than 50% of their genes shared the same annotation.
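The two assignment rules can be expressed as a small decision function. This is a sketch under stated assumptions rather than the authors' implementation: it assumes one best blastn hit per gene (None for genes without a hit), takes fractions over all genes of the MSP, and averages identity and coverage over the genes hitting the majority genome. The Hit class, the rank names and the function name are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Ranks tried in order when no species-level assignment is possible.
HIGHER_RANKS = ["genus", "family", "order", "class", "phylum", "superkingdom"]


@dataclass
class Hit:
    genome: str               # representative genome identifier
    identity: float           # percent nucleotide identity
    coverage: float           # percent of the gene length covered
    lineage: Dict[str, str]   # rank -> taxon name, including "species"


def assign_taxonomy(hits: List[Optional[Hit]]) -> Tuple[str, Optional[str]]:
    """Return (rank, taxon) for one MSP from its per-gene best hits."""
    n_genes = len(hits)
    if n_genes == 0:
        return ("unclassified", None)

    by_genome = defaultdict(list)
    for hit in hits:
        if hit is not None:
            by_genome[hit.genome].append(hit)

    # Species rule: >50% of genes hit one genome, mean identity >= 95%,
    # mean gene length coverage >= 90%.
    for genome_hits in by_genome.values():
        if len(genome_hits) / n_genes > 0.5:
            mean_id = sum(h.identity for h in genome_hits) / len(genome_hits)
            mean_cov = sum(h.coverage for h in genome_hits) / len(genome_hits)
            if mean_id >= 95.0 and mean_cov >= 90.0:
                return ("species", genome_hits[0].lineage.get("species"))

    # Fallback: lowest rank at which >50% of genes share one annotation.
    for rank in HIGHER_RANKS:
        counts = Counter(h.lineage[rank] for h in hits
                         if h is not None and rank in h.lineage)
        if counts:
            taxon, count = counts.most_common(1)[0]
            if count / n_genes > 0.5:
                return (rank, taxon)
    return ("unclassified", None)
```

An MSP whose genes split evenly between two genomes of the same genus would therefore receive a genus-level label rather than a species-level one.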

Mapping rate distribution across public cohorts

We generated mapping rate distribution plots using Meteor2 (default parameters), comparing cohorts used in catalog assembly (PRJEB6997, PRJNA255439, PRJNA48479) with independent cohorts not used in assembly (PRJEB24090, PRJEB28422, PRJEB45799).

Files

Total size: 14.5 GB

Name	MD5	Size
catalogue_mapping_rate_hs_8_4_oral.pdf	eb4194f49df53b4d39c3aeb576f0666b	16.2 kB
hs_8_4_oral.tar.xz	8b4ac6627fbe382a05a9c9a24adea158	13.9 GB
hs_8_4_oral_taxo.tar.xz	1028078a8cc15262bad7fac6a66329ae	666.9 MB
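Because the main archive is large, it may be worth checking a download against the MD5 checksums listed above before unpacking. The following is a minimal sketch using Python's standard hashlib module. The helper names are hypothetical, and the files are assumed to keep their original names after download.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "catalogue_mapping_rate_hs_8_4_oral.pdf": "eb4194f49df53b4d39c3aeb576f0666b",
    "hs_8_4_oral.tar.xz": "8b4ac6627fbe382a05a9c9a24adea158",
    "hs_8_4_oral_taxo.tar.xz": "1028078a8cc15262bad7fac6a66329ae",
}


def md5sum(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5) -> bool:
    """True if the file's MD5 matches the checksum listed for its name."""
    path = Path(path)
    return md5sum(path) == expected[path.name]
```

Calling verify("hs_8_4_oral.tar.xz") in the download directory returns True only if the archive is intact. A file name that is not in the table raises a KeyError.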
