A catalog of genes and species of the human skin microbiota

Plaza Onate, Florian

doi:10.5281/zenodo.16982908

Published August 29, 2025 | Version v9

Dataset Open

Plaza Onate, Florian¹

1. Université Paris-Saclay, INRAE

Dataset overview

This dataset provides:
a non-redundant high-quality catalog of 2.9 million genes
392 Metagenomic Species Pangenomes (MSPs)
This dataset can be used to analyze shotgun sequencing data of the human skin microbiota.

How to use this dataset

Create a gene abundance table by aligning reads from each sample against the catalog. For this purpose, you can use Meteor or NGLess. Then, normalize raw counts by gene length.
Taxonomic profiling: estimate the abundance of each species as the average abundance of its first 100 core genes. To reduce the false positive rate, consider a species present only if at least 10 of these 100 marker genes are detected.
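The two steps above can be sketched in Python as follows. This is a minimal illustration, not the Meteor or NGLess implementation. It assumes that a gene counts as "detected" when its normalized abundance is above zero, that the average is taken over all 100 marker genes (undetected ones included as zeros), and that the list of core genes is already in the catalog's order. The function and parameter names are hypothetical.

```python
from statistics import mean


def normalize_by_length(raw_counts, gene_lengths):
    """Divide each raw read count by the length (in bp) of its gene."""
    return {gene: count / gene_lengths[gene] for gene, count in raw_counts.items()}


def species_abundance(norm_abundance, core_genes, n_markers=100, min_detected=10):
    """Estimate the abundance of one species in one sample.

    core_genes is assumed to be ordered as in the catalog, so that its first
    n_markers entries are the marker genes of the species.
    """
    markers = core_genes[:n_markers]
    # Genes absent from the abundance table are treated as undetected (0.0).
    values = [norm_abundance.get(gene, 0.0) for gene in markers]
    detected = sum(value > 0 for value in values)
    if detected < min_detected:
        # Too few marker genes detected: report the species as absent.
        return 0.0
    return mean(values)
```

For example, with `norm = normalize_by_length(counts, lengths)`, calling `species_abundance(norm, core_genes_of_one_msp)` returns 0.0 whenever fewer than 10 of the first 100 core genes have a non-zero abundance.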

Methods

Data sources

This dataset was built using the following data sources:
118 isolate-derived genomes from the HMRGD
246 isolate-derived genomes from the Skin Microbial Genome Collection (SMGC)
1,407 skin metagenome assemblies from the Skin Microbial Genome Collection (SMGC)

Non-redundant gene catalog

Contigs shorter than 1,500 bp were discarded. Genes were then predicted with Prodigal on genomes (mode: single) and on metagenome assemblies (mode: meta). Complete genes (partial=00) were pooled and clustered with cd-hit-est (parameters: -c 0.95 -aS 0.90 -G 0 -d 0 -M 0 -T 0), choosing genes from the longest contigs as cluster representatives.
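As an illustration of the "complete genes" filter, the sketch below keeps only the FASTA records whose Prodigal header carries partial=00. It is not the pipeline used to build the catalog. It assumes that the gene FASTA was written by Prodigal with its usual semicolon-separated attributes in the header. The function names are hypothetical.

```python
import re

# Prodigal headers carry a two-digit "partial" flag; "00" means that both
# ends of the gene are complete.
PARTIAL_RE = re.compile(r"partial=(\d\d)")


def is_complete_gene(header):
    """Return True if the header line reports partial=00."""
    match = PARTIAL_RE.search(header)
    return match is not None and match.group(1) == "00"


def complete_genes(fasta_lines):
    """Yield (header, sequence) for each complete gene in a FASTA stream."""
    header, seq = None, []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one if it is complete.
            if header is not None and is_complete_gene(header):
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line.strip())
    # Emit the last record of the stream.
    if header is not None and is_complete_gene(header):
        yield header, "".join(seq)
```

The generator accepts any iterable of lines, so an open file handle can be passed directly.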

Functional annotation

KO assignments were obtained with KofamScan using the KEGG database (release 107).

MSPs recovery

Reads from the 1,120 skin metagenomes available in BioProject PRJNA46333 were aligned against the non-redundant gene catalog with the Meteor software suite to produce a raw gene abundance table (2.9M genes quantified in 1,120 samples). Co-abundant genes were then binned into 392 Metagenomic Species Pangenomes (MSPs, i.e. clusters of co-abundant genes that likely belong to the same microbial species) using MSPminer.

MSPs taxonomic annotation

Taxonomic annotation was performed by aligning all core and accessory genes against the representative genomes of the GTDB database (release r214) using blastn (version 2.7.1, task = megablast, word_size = 16). An MSP received a species-level assignment if more than 50% of its genes matched the representative genome of a given species, with a mean nucleotide identity ≥ 95% and a mean gene length coverage ≥ 90%. The remaining MSPs were assigned to a higher taxonomic level (genus to superkingdom) if more than 50% of their genes had the same annotation.
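The species-level decision rule can be sketched as follows. This is one possible reading of the rule, not the authors' code. It assumes that the blastn output has already been reduced to one best hit per matched gene, and that the mean identity and mean coverage are computed over the genes matching the candidate species. The function name and input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def assign_species(hits, n_genes, min_fraction=0.5,
                   min_identity=95.0, min_coverage=90.0):
    """Return the species assigned to one MSP, or None.

    hits: list of (species, percent_identity, percent_gene_coverage),
          one best hit per matched gene (assumed input layout).
    n_genes: total number of core and accessory genes in the MSP.
    """
    by_species = defaultdict(list)
    for species, identity, coverage in hits:
        by_species[species].append((identity, coverage))

    for species, values in by_species.items():
        fraction = len(values) / n_genes
        # Since the fraction must exceed 50%, at most one species can qualify.
        if (fraction > min_fraction
                and mean(i for i, _ in values) >= min_identity
                and mean(c for _, c in values) >= min_coverage):
            return species
    return None
```

An MSP for which the function returns None would then go through the same majority test at higher ranks, which this sketch does not cover.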

Construction of the phylogenetic tree

39 universal phylogenetic marker genes were extracted from the MSPs (or the corresponding genome if available) with fetchMGs. Each marker was then aligned separately with MUSCLE. The resulting alignments were merged and trimmed with trimAl (parameters: -automated1). Finally, the phylogenetic tree was computed with FastTreeMP (parameters: -gamma -pseudo -spr -mlacc 3 -slownni).
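The merging step can be illustrated with the sketch below, which concatenates per-marker alignments into a single alignment. The source does not state how the merge was done, so this is an assumption: the alignments are joined side by side, and a species lacking a marker is padded with gaps for that marker. The function name and the dictionary layout are hypothetical.

```python
def concatenate_alignments(alignments):
    """Concatenate per-marker alignments into one alignment.

    alignments: list of dicts mapping a species name to its aligned sequence,
                one dict per marker gene (assumed input layout).
    Returns a dict mapping each species to its concatenated sequence.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    merged = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Species without this marker get a run of gap characters.
            merged[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in merged.items()}
```

Every sequence in the returned dictionary has the same length, so the result can be written out as a single FASTA alignment before trimming.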

Mapping rate distribution across public cohorts

We plotted the mapping rate distributions obtained with Meteor2 (default parameters) for two cohorts: PRJNA46333 (used to build the catalog) and PRJEB80549 (an independent cohort not used to build it).

Files

Files (4.2 GB)

Name	MD5	Size
catalogue_mapping_rate_hs_2_9_skin.pdf	cc2728d90aba38d35636d97cdbc0bbd1	5.0 kB
hs_2_9_skin.tar.xz	8891e2d14458de8e94b84f816d71ca78	3.9 GB
hs_2_9_skin_taxo.tar.xz	e75ccba8fa2c003d8fd3f7594a08ce8c	265.5 MB
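
The md5 values listed with each file can be used to check a download before unpacking it. The sketch below is one way to do this with the Python standard library. It assumes the archives were saved locally under their original names. The helper names are hypothetical.

```python
import hashlib


def file_chunks(path, chunk_size=1 << 20):
    """Yield the content of a file in chunks, so that multi-GB archives
    are never loaded into memory at once."""
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk


def md5_hex(chunks):
    """Return the hexadecimal MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare the MD5 digest of a local file with the published value."""
    return md5_hex(file_chunks(path)) == expected_md5.lower()
```

For instance, `matches_checksum("hs_2_9_skin_taxo.tar.xz", "e75ccba8fa2c003d8fd3f7594a08ce8c")` should return True for an intact copy of that archive.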

