Published August 29, 2025 | Version v9
Dataset
Open
A catalog of genes and species of the human skin microbiota
Description
Dataset overview
This dataset provides:
a non-redundant high-quality catalog of 2.9 million genes
392 Metagenomic Species Pangenomes (MSPs)
This dataset can be used to analyze shotgun sequencing data of the human skin microbiota.
How to use this dataset
Create a gene abundance table by aligning reads from each sample against the catalog, for example with Meteor or NGLess. Then normalize the raw counts by gene length.
Taxonomic profiling: the abundance of each species can be estimated as the average abundance of its first 100 core genes. To reduce the false positive rate, consider a species present only if at least 10 of these 100 marker genes are detected.
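The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Meteor or NGLess implementation: the function names, the dictionary-based inputs, and the ordered `core_genes` list are hypothetical, and it assumes that "detected" means a non-zero normalized abundance and that undetected marker genes count as zeros in the average.

```python
from statistics import mean


def normalize_by_length(raw_counts, gene_lengths):
    """Divide each raw read count by the length (bp) of its gene."""
    return {gene: raw_counts[gene] / gene_lengths[gene] for gene in raw_counts}


def species_abundance(abundance, core_genes, n_markers=100, min_detected=10):
    """Average abundance over the first n_markers core genes of one species.

    Returns 0.0 when fewer than min_detected marker genes are detected.
    Assumption: a gene is "detected" when its abundance is above zero,
    and undetected marker genes contribute zeros to the average.
    """
    markers = core_genes[:n_markers]
    values = [abundance.get(gene, 0.0) for gene in markers]
    detected = sum(value > 0 for value in values)
    if not values or detected < min_detected:
        return 0.0
    return mean(values)


if __name__ == "__main__":
    # Toy data with hypothetical gene identifiers.
    raw = {"g1": 30, "g2": 0, "g3": 10}
    lengths = {"g1": 300, "g2": 600, "g3": 1000}
    normalized = normalize_by_length(raw, lengths)
    # Thresholds are lowered here only because the toy species has 3 genes.
    print(species_abundance(normalized, ["g1", "g2", "g3"], n_markers=3, min_detected=2))
```

With the default arguments, the function applies the 10-out-of-100 rule described above; a species with fewer than 100 core genes would need the threshold adjusted, which this sketch leaves to the caller.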
Methods
Data sources
This dataset was built using the following data sources:
118 isolate-derived genomes from the HMRGD
246 isolate-derived genomes from the Skin Microbial Genome Collection (SMGC)
1,407 skin metagenome assemblies from the Skin Microbial Genome Collection (SMGC)
Non-redundant gene catalog
After filtering out short contigs (< 1,500 bp), genes were predicted with Prodigal on genomes (mode: single) and metagenome assemblies (mode: meta). Complete genes (partial=00) were pooled and clustered with cd-hit-est (parameters -c 0.95 -aS 0.90 -G 0 -d 0 -M 0 -T 0), choosing genes from the longest contigs as cluster representatives.
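As an illustration of the "complete genes" filter, the sketch below keeps only FASTA records whose Prodigal header carries partial=00. It assumes the usual Prodigal header layout, in which the last " # "-separated field is a semicolon-separated list of key=value attributes; the function names are hypothetical and this is not the authors' pipeline.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_complete(header):
    """True when the Prodigal attribute field contains partial=00.

    Assumed header layout: 'id # start # end # strand # ID=...;partial=00;...'
    """
    attribute_field = header.split(" # ")[-1]
    attributes = dict(
        item.split("=", 1) for item in attribute_field.split(";") if "=" in item
    )
    return attributes.get("partial") == "00"


def complete_genes(lines):
    """Yield only the records predicted as complete genes."""
    for header, sequence in iter_fasta(lines):
        if is_complete(header):
            yield header, sequence
```

A typical use would be `complete_genes(open(path))` on the nucleotide FASTA written by Prodigal, where `path` is whatever file name was chosen for that output.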
Functional annotation
KO assignments were obtained with KofamScan using the KEGG 107 database.
MSPs recovery
Reads from the 1,120 skin metagenomes available in the BioProject PRJNA46333 were aligned against the non-redundant gene catalog with the Meteor software suite to produce a raw gene abundance table (2.9M genes quantified in 1,120 samples). Then, co-abundant genes were binned into 392 Metagenomic Species Pangenomes (MSPs, i.e. clusters of co-abundant genes that likely belong to the same microbial species) using MSPminer.
MSPs taxonomic annotation
Taxonomic annotation was performed by alignment of all core and accessory genes against representative genomes of the GTDB database (release r214) using blastn (version 2.7.1, task = megablast, word_size = 16). A species-level assignment was given if > 50% of the genes matched the representative genome of a given species, with a mean nucleotide identity ≥ 95% and mean gene length coverage ≥ 90%. The remaining MSPs were assigned to a higher taxonomic level (genus to superkingdom), if more than 50% of their genes had the same annotation.
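The species-level rule can be written down as a small decision function. The sketch below is one possible reading of the criteria, not the authors' code: it assumes one best blastn hit per gene, computes the means over the genes that matched a given species, and uses hypothetical names and input layout.

```python
from collections import defaultdict


def assign_species(n_genes, hits, min_fraction=0.5, min_identity=95.0, min_coverage=90.0):
    """Return the species meeting the thresholds, or None.

    n_genes: total number of genes (core and accessory) in the MSP.
    hits: iterable of (gene_id, species, identity_pct, coverage_pct),
          assumed to hold at most one best hit per gene.
    """
    if n_genes <= 0:
        return None
    per_species = defaultdict(list)
    for _gene_id, species, identity, coverage in hits:
        per_species[species].append((identity, coverage))
    for species, values in per_species.items():
        fraction = len(values) / n_genes
        mean_identity = sum(identity for identity, _ in values) / len(values)
        mean_coverage = sum(coverage for _, coverage in values) / len(values)
        # With one hit per gene, at most one species can exceed 50% of the genes.
        if (fraction > min_fraction
                and mean_identity >= min_identity
                and mean_coverage >= min_coverage):
            return species
    return None
```

An MSP for which this returns None would then fall through to the higher-rank rule described above (genus to superkingdom), which is not covered by this sketch.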
Construction of the phylogenetic tree
39 universal phylogenetic marker genes were extracted from the MSPs (or the corresponding genome if available) with fetchMGs. Then, the markers were separately aligned with MUSCLE. The resulting alignments were merged and trimmed with trimAl (parameters: -automated1). Finally, the phylogenetic tree was computed with FastTreeMP (parameters: -gamma -pseudo -spr -mlacc 3 -slownni).
Mapping rate distribution across public cohorts
We generated mapping rate distribution plots using Meteor2 (default parameters), comparing PRJNA46333 (the cohort used in catalogue assembly) with PRJEB80549 (an independent cohort not used in assembly).
catalogue_mapping_rate_hs_2_9_skin.pdf
Files (4.2 GB)
MD5 checksum | Size |
---|---|
md5:cc2728d90aba38d35636d97cdbc0bbd1 | 5.0 kB |
md5:8891e2d14458de8e94b84f816d71ca78 | 3.9 GB |
md5:e75ccba8fa2c003d8fd3f7594a08ce8c | 265.5 MB |
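After downloading, a file can be checked against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the path passed in is whatever local name the downloaded file was given.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file with a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use low, which matters for the multi-gigabyte file in this record.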