Published August 29, 2025
| Version v9
Dataset
Open
A catalog of genes and species of the human oral microbiota
Creators
- 1. Université Paris-Saclay, INRAE
- 2. Université Paris-Cité, Institut Pasteur
- 3. King's College London
- 4. King's College London, NavarraBiomed
Description
Data sources
The oral gene catalog was built using three primary sources:
Bacterial Genomes from the Human Oral Microbiome Database (HOMD).
Fungal Genomes from the NCBI RefSeq database.
Metagenomic Sequencing Data from multiple oral microbiome studies.
The creation of the oral gene catalog was a multi-step process, combining and refining genes from each source.
Bacterial Genes
A total of 1,505 bacterial genomes were downloaded from HOMD (version 20170215, accessed in December 2017). Genes shorter than 60 nucleotides or containing ambiguous bases were filtered out. Redundancy was removed using CD-HIT-EST (v4.6; parameters: -aS 0.9 -c 0.95 -T 0 -M 0 -t 0 -d 0 -G 0). This process yielded 1,459,394 unique HOMD genes for the catalog.
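The length and ambiguity filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function names and the in-memory FASTA parsing are assumptions, and "ambiguous bases" is interpreted here as any character other than A, C, G, or T.

```python
VALID_BASES = set("ACGT")   # assumption: anything else counts as ambiguous
MIN_GENE_LENGTH = 60        # nucleotides, as stated in the description


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_gene(sequence, min_length=MIN_GENE_LENGTH):
    """Keep a gene only if it is long enough and has no ambiguous base."""
    seq = sequence.upper()
    return len(seq) >= min_length and set(seq) <= VALID_BASES


def filter_genes(lines):
    """Return the (header, sequence) pairs that pass both filters."""
    return [(h, s) for h, s in read_fasta(lines) if keep_gene(s)]
```

The surviving sequences would then be passed to CD-HIT-EST for redundancy removal with the parameters listed above.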
Fungal Genes
A total of 1,017 fungal genomes were downloaded from NCBI RefSeq (May 2017). For the 492 genomes lacking existing annotations, gene calling was performed using GeneMark-ES in fungi mode. After initial redundancy removal with CD-HIT-EST (v4.6; parameters: -aS 0.9 -c 0.95 -T 0 -M 0 -t 0 -d 0 -G 0), genes were selected for inclusion only if their corresponding genome was present in at least 20% of the samples in at least one of the metagenomic cohorts, as determined by mapping reads with Bowtie2 (v2.2.3). This led to the selection of 2,440,644 fungal genes.
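The prevalence rule above (a genome is kept if it is present in at least 20% of the samples of at least one cohort) can be illustrated as follows. This is a sketch under assumptions: the description does not state how "presence" was derived from the Bowtie2 mappings, so the input here is a hypothetical table of booleans per genome, cohort, and sample, and all names are illustrative.

```python
def prevalence(detected_by_sample):
    """Fraction of samples in which the genome was detected."""
    if not detected_by_sample:
        return 0.0
    return sum(detected_by_sample.values()) / len(detected_by_sample)


def select_genomes(detection, min_prevalence=0.20):
    """Keep genomes reaching the prevalence threshold in any single cohort.

    detection: {genome_id: {cohort_name: {sample_id: bool}}}
    (hypothetical layout; presence calls are assumed to be precomputed).
    """
    selected = []
    for genome, cohorts in detection.items():
        if any(prevalence(samples) >= min_prevalence
               for samples in cohorts.values()):
            selected.append(genome)
    return selected
```

Only genes belonging to the selected genomes would then be carried forward into the catalog.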
Metagenomic Sequencing Data
The gene catalog was supplemented with data from 689 oral metagenomes, including newly sequenced samples, from the following studies:
Human Microbiome Project (HMP): 382 samples (bioproject PRJNA255439).
Chinese Cohort: 212 samples (bioproject PRJEB6997).
TwinsUK Cohort: 48 newly sequenced samples (bioproject: PRJEB38483).
Raw reads were subjected to quality control and trimmed using AlienTrimmer 0.4.0 (parameters: -k 10 -l 45 -m 5 -p 40 -q 20). Human sequences were removed by mapping against the human reference genome (GRCh38.p11) using Bowtie2 2.2.3. Metagenomic assembly was performed using SPAdes 3.9.0 (parameters: "-k 21,33,55 --only-assembler --meta" for Illumina paired-end data, or "--iontorrent -t 24 -m 300 -k 21,33,55 --only-assembler" for Ion Torrent single-end data). Contigs shorter than 500 bp or with coverage less than 2x were discarded. Gene calling was conducted with Prodigal (parameters: -m -p meta). Genes shorter than 60 bp were filtered out, and redundancy was removed with CD-HIT-EST (v4.6; parameters: -aS 0.9 -c 0.95 -T 0 -M 0 -t 0 -d 0 -G 0).
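As an illustration of the contig filter (minimum length 500 bp, minimum coverage 2x), the sketch below reads length and coverage from SPAdes-style contig headers such as `NODE_1_length_5000_cov_12.3`. This is an assumption: the description does not say how coverage was measured, and the value in a SPAdes header is a k-mer coverage estimate, so the authors may have computed coverage differently.

```python
import re

# SPAdes names contigs like "NODE_1_length_5000_cov_12.345"
HEADER_RE = re.compile(r"^NODE_\d+_length_(\d+)_cov_([0-9.]+)")


def keep_contig(header, min_length=500, min_coverage=2.0):
    """Return True if the contig passes the length and coverage thresholds."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError("not a SPAdes-style contig header: %r" % header)
    length = int(match.group(1))
    coverage = float(match.group(2))
    return length >= min_length and coverage >= min_coverage
```

Contigs that pass would then go to Prodigal for gene calling.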
Final Gene Catalog
The final gene catalog was assembled by sequentially adding non-redundant genes from each data source. Genes from HOMD and fungal genomes were combined first using cd-hit-est-2d. Then, non-redundant genes from the HMP, Chinese, and TwinsUK cohorts were sequentially added using cd-hit-est-2d (same parameters as cd-hit-est). A final redundancy removal step was performed. This process resulted in a catalogue of 8.4 million non-redundant genes.
MSPs Recovery
The 689 metagenomic samples were aligned against the final gene catalog using the Meteor software suite to produce a gene abundance table. Then, co-abundant genes were binned into 853 Metagenomic Species Pan-genomes (MSPs) using MSPminer.
MSPs Taxonomic Annotation
Taxonomic annotation for the MSPs was performed by aligning all core and accessory genes against representative genomes from the GTDB database (release r214) using blastn (task: megablast, word_size: 16).
A species-level assignment was given if over 50% of the genes matched a representative genome with a mean nucleotide identity of at least 95% and a mean gene length coverage of at least 90%.
The remaining MSPs were assigned to a higher taxonomic level (genus to superkingdom) if more than 50% of their genes shared the same annotation.
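The two-stage decision rule above can be expressed as a short function. This is a sketch under assumptions, not the authors' implementation: each gene is represented by a hypothetical dictionary holding its best-hit representative genome, percent identity, percent gene-length coverage, and lineage, and the intermediate ranks between genus and superkingdom are assumed to be the usual ones.

```python
from collections import Counter, defaultdict

# Assumed rank order for the fallback, from most to least specific
RANKS = ["genus", "family", "order", "class", "phylum", "superkingdom"]


def assign_taxonomy(genes):
    """Return (rank, name) for one MSP.

    genes: list of dicts with keys 'genome' (str or None), 'identity' and
    'coverage' (percent), and 'lineage' (dict mapping rank -> name).
    """
    n = len(genes)
    if n == 0:
        return ("unclassified", None)

    # Stage 1: species level. Over 50% of genes must hit the same
    # representative genome with mean identity >= 95% and coverage >= 90%.
    by_genome = defaultdict(list)
    for gene in genes:
        if gene["genome"] is not None:
            by_genome[gene["genome"]].append(gene)
    for genome, hits in by_genome.items():
        if len(hits) / n > 0.5:
            mean_identity = sum(h["identity"] for h in hits) / len(hits)
            mean_coverage = sum(h["coverage"] for h in hits) / len(hits)
            if mean_identity >= 95 and mean_coverage >= 90:
                return ("species", genome)

    # Stage 2: most specific higher rank shared by more than 50% of genes.
    for rank in RANKS:
        counts = Counter(g["lineage"][rank] for g in genes
                         if g.get("lineage") and rank in g["lineage"])
        if counts:
            name, count = counts.most_common(1)[0]
            if count / n > 0.5:
                return (rank, name)
    return ("unclassified", None)
```

In this sketch the species-level result is reported as the identifier of the matched representative genome; mapping that identifier to a GTDB species name is left out.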
Mapping rate distribution across public cohorts
We generated mapping rate distribution plots using Meteor2 (default parameters), comparing PRJEB6997, PRJNA255439, and PRJNA48479 (cohorts used in catalogue assembly) with PRJEB24090, PRJEB28422, and PRJEB45799 (independent cohorts not used in assembly).
catalogue_mapping_rate_hs_8_4_oral.pdf
Files (14.5 GB)
Checksum | Size |
---|---|
md5:eb4194f49df53b4d39c3aeb576f0666b | 16.2 kB |
md5:8b4ac6627fbe382a05a9c9a24adea158 | 13.9 GB |
md5:1028078a8cc15262bad7fac6a66329ae | 666.9 MB |
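Because the largest file is close to 14 GB, it may be worth checking each download against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib`; the function names are illustrative, and the local file path is whatever name the file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `verify_checksum(local_path, "md5:8b4ac6627fbe382a05a9c9a24adea158")` should return `True` for an intact copy of the 13.9 GB file. Reading in chunks keeps memory use constant regardless of file size.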