MicrobiomeHD: the human gut microbiome in health and disease

Duvallet, Claire; Gibbons, Sean; Gurry, Thomas; Irizarry, Rafael; Alm, Eric

doi:10.5281/zenodo.840333

Published August 8, 2017 | Version v2

Dataset Open

1. Department of Biological Engineering, MIT
2. Department of Applied Statistics, Harvard University
3. The Center for Microbiome Informatics and Therapeutics

Overview

MicrobiomeHD is a standardized database of human gut microbiome studies in health and disease. This database includes publicly available 16S data from published case-control studies and their associated patient metadata. Raw sequencing data for each study was downloaded and processed through a standardized pipeline.

To be included in MicrobiomeHD, datasets have:

publicly available raw sequencing data (fastq or fasta)
publicly available metadata with at least case and control labels for each patient
at least 15 case patients
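The three criteria above can be read as a simple filter. The sketch below is illustrative only: the `CandidateDataset` record and its field names are hypothetical and are not part of MicrobiomeHD or its processing pipeline.

```python
from dataclasses import dataclass

MIN_CASE_PATIENTS = 15  # threshold stated in the inclusion criteria


@dataclass
class CandidateDataset:
    # Hypothetical record; the field names are assumptions for illustration.
    dataset_id: str
    has_public_raw_data: bool      # fastq or fasta is publicly available
    has_case_control_labels: bool  # metadata labels each patient as case or control
    n_cases: int                   # number of case patients


def meets_inclusion_criteria(ds: CandidateDataset) -> bool:
    """Return True if the candidate satisfies all three listed criteria."""
    return (
        ds.has_public_raw_data
        and ds.has_case_control_labels
        and ds.n_cases >= MIN_CASE_PATIENTS
    )
```

Note that the threshold applies to case patients only; the criteria listed here do not set a minimum number of controls.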

Currently, MicrobiomeHD is focused on stool samples. Additional samples may be included in certain datasets, as indicated in the metadata.

Files

Additional information about the datasets included in this MicrobiomeHD release is in the MicrobiomeHD GitHub repo https://github.com/cduvallet/microbiomeHD, in the file db/dataset_info.yaml. Top-level identifiers correspond to the dataset IDs used in Duvallet et al. 2017. Sample sizes in the yaml file are those described in the papers, and may not exactly match the actual data (due to missing or extra data, samples that did not pass quality control, etc.).

Each dataset was downloaded and processed through a standardized pipeline. The raw processing results are available in the *.tar.gz files here. Each file has the same directory structure and files, as described in the pipeline documentation: http://amplicon-sequencing-pipeline.readthedocs.io/en/latest/output.html.
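As a minimal sketch of how one of the *.tar.gz archives could be inspected without unpacking it to disk, the following uses only Python's standard `tarfile` module. The function names are hypothetical, and the exact directory layout inside each archive is an assumption to be checked against the pipeline documentation linked above.

```python
import tarfile


def list_results(archive_path):
    """Return the sorted names of the regular files inside a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def read_text_member(archive_path, suffix):
    """Return the decoded text of the first file whose name ends with `suffix`."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(suffix):
                with archive.extractfile(member) as handle:
                    return handle.read().decode("utf-8", errors="replace")
    raise FileNotFoundError(f"no member ending with {suffix!r} in {archive_path}")
```

For example, `list_results("asd_son_results.tar.gz")` would list the files in a downloaded copy of that archive. Matching by suffix avoids hard-coding the name of the top-level directory inside the archive.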

Specific files of interest include:

summary_file.txt: this file contains a summary of all parameters used to process the data
datasetID.metadata.txt: the metadata associated with the samples. Note that some samples in the metadata may not have sequencing data, and vice versa.
RDP/datasetID.otu_table.100.denovo.rdp_assigned: the 100% OTU tables with Latin taxonomic names assigned using the RDP classifier (c = 0.5).
datasetID.otu_seqs.100.fasta: representative sequences for each OTU in the 100% OTU table. OTU labels in the OTU table end with d__denovoID - these denovoIDs correspond to the sequences in this file.
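The link between OTU labels and representative sequences can be sketched as follows. This is a hedged illustration: it assumes only what is stated above (labels end with d__denovoID), plus the additional assumption that each FASTA header begins with the bare denovoID. The function names and the example label in the usage note are hypothetical.

```python
def denovo_id(otu_label):
    """Return the denovo ID that follows the final 'd__' in an OTU label."""
    _, marker, tail = otu_label.rpartition("d__")
    if not marker or not tail:
        raise ValueError(f"no 'd__' suffix in OTU label: {otu_label!r}")
    return tail


def read_fasta(lines):
    """Parse FASTA lines into a dict of {sequence ID: sequence}.

    The sequence ID is taken as the first whitespace-delimited token of the
    header line (an assumption about the header format).
    """
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def sequences_for_otus(otu_labels, fasta_records):
    """Map each OTU label to its representative sequence, or None if absent."""
    return {label: fasta_records.get(denovo_id(label)) for label in otu_labels}
```

For a made-up label such as `k__Bacteria;p__Firmicutes;d__denovo12`, `denovo_id` returns `denovo12`, which is then looked up among the parsed FASTA records.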

The raw data was acquired as described in the supplementary materials of Duvallet et al.'s "Meta analysis of microbiome studies identifies shared and disease-specific patterns".

Raw sequencing data was processed with the Alm lab's in-house 16S processing pipeline: https://github.com/thomasgurry/amplicon_sequencing_pipeline

Pipeline documentation is available at: http://amplicon-sequencing-pipeline.readthedocs.io/

Metadata was extracted from the original papers and/or data sources, and formatted manually.

Contributing

MicrobiomeHD is a resource that can be used to extract disease-specific microbiome signals in individual case-control studies. Many microbes respond non-specifically to health and disease, and the majority of bacterial associations within individual studies overlap with this "core" response. Researchers should cross-check their results with the data presented here to ensure that their identified microbial associations are specific to their disease under study.

We provide an updated list of "core" microbes here, as well as the raw OTU tables for anyone who wishes to reproduce and adapt this analysis to their study question.

If you would like to include your case-control dataset in MicrobiomeHD, please email duvallet[at]mit.edu.

For us to process your data through our standard pipeline, you will need to provide the following files and information about your data:

raw sequencing data in fastq or fasta format (preferably fastq)
information about which processing steps will be required (e.g. removing primers or barcodes, merging paired-end reads, etc)
sample IDs associated with the sequencing data (either mapped to barcodes still in the sequences, or to each de-multiplexed sequencing file)
case/control metadata of each sample
other relevant metadata (e.g. sampling site, if not all samples are stool; sampling time point, if multiple samples per patient were taken; etc)

By using MicrobiomeHD in your own analyses, you agree to contribute your dataset to this database and to make your raw sequencing data (i.e. fastq files) publicly available.

Citing MicrobiomeHD

The MicrobiomeHD database and original publications for each of these datasets are described in Duvallet et al. (2017): http://biorxiv.org/content/early/2017/05/08/134031

If you use any of these datasets in your analysis, please cite both MicrobiomeHD (Duvallet et al. (2017)) and the original publication for each dataset that you use.

The code used to process and analyze this data in Duvallet et al. (2017) is available on GitHub: https://github.com/cduvallet/microbiomeHD

Files

Data files

file-S3.core_genera.txt: Supplemental Table 3 from Duvallet et al. (2017), listing the core health- and disease-associated microbes.
dataset_info.yaml: yaml file with additional dataset metadata.

Datasets

Note that MicrobiomeHD contains all 28 datasets from Duvallet et al. (2017), as well as additional datasets which did not meet the inclusion criteria for the meta-analysis presented in the paper. Additional information about the datasets included in this MicrobiomeHD release is available in the original publications, in the MicrobiomeHD GitHub repo https://github.com/cduvallet/microbiomeHD, and in the file dataset_info.yaml.

The sample sizes listed here reflect what was reported in the original publications. Some may have discrepancies between what is reported and what is in the actual data due to missing data, quality issues, barcode mismatches, etc.

asd_son_results.tar.gz (asd_son): NT: 44, ASD: 59
- http://dx.doi.org/10.1371/journal.pone.0137725
autism_kb_results.tar.gz (asd_kang): H: 20, ASD: 20
- http://dx.doi.org/10.1371/journal.pone.0068322
cdi_schubert_results.tar.gz (noncdi_schubert): H: 155, nonCDI: 89, CDI: 94
- http://dx.doi.org/10.1128/mBio.01021-14
cdi_vincent_v3v5_results.tar.gz (cdi_vincent): H: 25, CDI: 25
- http://dx.doi.org/10.1186/2049-2618-1-18
cdi_youngster_results.tar.gz (cdi_youngster): H: 4, CDI: 19
- http://dx.doi.org/10.1093/cid/ciu135
crc_baxter_results.tar.gz (crc_baxter): adenoma: 198, H: 172, CRC: 120
- http://dx.doi.org/10.1186/s13073-016-0290-3
crc_xiang_results.tar.gz (crc_chen): H: 22, CRC: 21
- http://dx.doi.org/10.1371/journal.pone.0039743
crc_zackular_results.tar.gz (crc_zackular): adenoma: 30, H: 30, CRC: 30
- http://dx.doi.org/10.1158/1940-6207.CAPR-14-0129
crc_zeller_results.tar.gz (crc_zeller): H: 75, CRC: 41
- http://dx.doi.org/10.15252/msb.20145645
crc_zhao_results.tar.gz (crc_wang): H: 56, CRC: 46
- http://dx.doi.org/10.1038/ismej.2011.109
edd_singh_results.tar.gz (edd_singh): STEC: 28, CAMP: 71, SALM: 66, SHIG: 34, H: 75
- http://dx.doi.org/10.1186/s40168-015-0109-2
hiv_dinh_results.tar.gz (hiv_dinh): H: 16, HIV: 21
- http://dx.doi.org/10.1093/infdis/jiu409
hiv_lozupone_results.tar.gz (hiv_lozupone): H: 13, HIV: 25
- http://dx.doi.org/10.1016/j.chom.2013.08.006
hiv_noguerajulian_results.tar.gz (hiv_noguerajulian): H: 34, HIV: 206
- https://doi.org/10.1016/j.ebiom.2016.01.032
ibd_alm_results.tar.gz (ibd_papa): IBDundef: 1, nonIBD: 24, UC: 43, CD: 23
- http://dx.doi.org/10.1371/journal.pone.0039242
ibd_engstrand_maxee_results.tar.gz (ibd_willing): CCD: 12, H: 35, ICD: 15, UC: 16, ICCD: 2
- http://dx.doi.org/10.1053/j.gastro.2010.08.049
ibd_gevers_2014_results.tar.gz (ibd_gevers): H: 31, CD: 224
- http://dx.doi.org/10.1016/j.chom.2014.02.005
ibd_huttenhower_results.tar.gz (ibd_morgan): H: 18, UC: 48, CD: 62
- http://dx.doi.org/10.1186/gb-2012-13-9-r79
mhe_zhang_results.tar.gz (liv_zhang): CIRR: 25, H: 26, MHE: 26
- http://dx.doi.org/10.1038/ajg.2013.221
nash_chan_results.tar.gz (nash_wong): H: 22, NASH: 16
- http://dx.doi.org/10.1371/journal.pone.0062885
nash_ob_baker_results.tar.gz (nash_zhu): H: 16, NASH: 22, OB: 25
- http://dx.doi.org/10.1002/hep.26093
ob_goodrich_results.tar.gz (ob_goodrich): OW: 322, H: 433, OB: 183
- http://dx.doi.org/10.1016/j.cell.2014.09.053
ob_gordon_2008_v2_results.tar.gz (ob_turnbaugh): H: 61, OB: 219
- http://dx.doi.org/10.1038/nature07540
ob_ross_results.tar.gz (ob_ross): H: 26, OB: 37
- http://dx.doi.org/10.1186/s40168-015-0072-y
ob_zupancic_results.tar.gz (ob_zupancic): H: 167, OB: 117
- http://dx.doi.org/10.1371/journal.pone.0043052
par_scheperjans_results.tar.gz (par_scheperjans): H: 72, PAR: 72
- http://dx.doi.org/10.1002/mds.26069
ra_littman_results.tar.gz (art_scher): H: 28, NORA: 44, CRA: 26, PSA: 16
- http://dx.doi.org/10.7554/eLife.01202
t1d_alkanani_results.tar.gz (t1d_alkanani): T1D: 21, H: 55, T1D_new-onset: 35
- http://dx.doi.org/10.2337/db14-1847
t1d_mejialeon_results.tar.gz (t1d_mejialeon): T1D: 21, H: 8
- http://dx.doi.org/10.1038/srep03814
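The dataset entries above follow a regular pattern: archive name, dataset ID in parentheses, then comma-separated group labels with reported sample sizes. A minimal sketch of parsing one such entry is shown below. The function name and regular expression are hypothetical and assume the layout used in this listing.

```python
import re

# Matches lines such as "asd_son_results.tar.gz (asd_son): NT: 44, ASD: 59".
ENTRY = re.compile(
    r"^(?P<archive>\S+\.tar\.gz) \((?P<dataset_id>[^)]+)\): (?P<groups>.+)$"
)


def parse_entry(line):
    """Split a listing line into (archive name, dataset ID, {group: reported size})."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a dataset entry: {line!r}")
    sizes = {}
    for item in match["groups"].split(","):
        # Split on the last colon so the label itself is kept intact.
        label, count = item.rsplit(":", 1)
        sizes[label.strip()] = int(count)
    return match["archive"], match["dataset_id"], sizes
```

Summing the returned sizes gives the reported total for a dataset, which, as noted above, may differ from the number of samples actually present in the data.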

Version changes

Changes in Version 2: added crc_zhu and ob_escobar datasets, as well as list of core genera and dataset_info.yaml.

Files (143.7 MB)

Name	MD5	Size
asd_son_results.tar.gz	a858454465462d80fe8426838bf337b8	2.4 MB
autism_kb_results.tar.gz	b2e01280f6ed12a834f417e109d3260b	358.5 kB
cdi_schubert_results.tar.gz	26a2589c448d8425766db6b86155694f	2.4 MB
cdi_vincent_v3v5_results.tar.gz	f666c4c75c634c419ee1d70c21ae7905	388.1 kB
cdi_youngster_results.tar.gz	a44465a772073306944fd578f4ed714b	7.5 MB
crc_baxter_results.tar.gz	3604d25fe3ea0e48379d1f78a3aeb231	17.8 MB
crc_xiang_results.tar.gz	19aa45e48707c757a08b7da2f4c4f226	200.4 kB
crc_zackular_results.tar.gz	c3c329ddb64552b87baefe3afc888a2b	7.5 MB
crc_zeller_results.tar.gz	f3acbff6066f997082fd0c7e6010bcd5	11.8 MB
crc_zhao_results.tar.gz	64fdb268528a0aba6fb70e12d9987a2d	190.2 kB
crc_zhu_results.tar.gz	fe4fa15c6da0bbfcc1dea61dd0c13b5b	285.4 kB
dataset_info.yaml	dccf3ab65b17260f1d83ca96b4e3ed9f	29.1 kB
edd_singh_results.tar.gz	547ea3a2514788a277d9018633ff96e4	685.6 kB
file-S3.core_genera.txt	a759c8296545e26ecfd43cb959b28ec6	15.2 kB
hiv_dinh_results.tar.gz	2828c8a8ea89999ccc85997ff67d6962	534.7 kB
hiv_lozupone_results.tar.gz	4142fc9769176f2ac39701b77206f775	944.7 kB
hiv_noguerajulian_results.tar.gz	76a6f7fee55a162497e209723f3ffb87	12.9 MB
ibd_alm_results.tar.gz	e2d6037596ba692cb971e50a84f64290	5.0 MB
ibd_engstrand_maxee_results.tar.gz	51f791417d399ac9325d2a071e5ebfb5	1.7 MB
ibd_gevers_2014_results.tar.gz	8fccf6a82c628a5f5d414e98bc750aa1	13.3 MB
ibd_huttenhower_results.tar.gz	cde252b92f88c7070828653a91d223f5	867.7 kB
mhe_zhang_results.tar.gz	de544698abe1e080a1b8aadf894370b2	345.2 kB
nash_chan_results.tar.gz	1a46a9306693955e9f145feac003a91d	887.0 kB
nash_ob_baker_results.tar.gz	1c765176e467dddc7acb6f9bda6147d8	3.0 MB
ob_escobar_results.tar.gz	fa70d6096f416f43260bd9cb1112bd15	258.9 kB
ob_goodrich_results.tar.gz	7aa4ee253d1c4ac51cbadd6ef366dea7	22.2 MB
ob_gordon_2008_v2_results.tar.gz	4990722f64490f5791fca728dae4eab5	6.5 MB
ob_ross_results.tar.gz	c845d366ee57b921ea044e49e126e5d2	638.8 kB
ob_zupancic_results.tar.gz	e0e57cf2f00af8b938b8cc46db9c61ac	12.5 MB
par_scheperjans_results.tar.gz	f8a689586d46884c0a7b1678678224c3	1.5 MB
ra_littman_results.tar.gz	9ffb272523af8aa93e0aa655a35a98ad	1.3 MB
t1d_alkanani_results.tar.gz	b879397ca81065a74eb8726fee102436	7.4 MB
t1d_mejialeon_results.tar.gz	ef9f9b298aa4353c1bb7e21ee6e9eb18	299.3 kB
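The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. The function names are hypothetical, and the local file path passed in is whatever location the file was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a listed checksum, with or without an 'md5:' prefix."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("asd_son_results.tar.gz", "md5:a858454465462d80fe8426838bf337b8")` should return True for an uncorrupted copy of that file. MD5 is adequate for detecting transfer errors, though it is not a defense against deliberate tampering.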

Additional details

Is supplement to: https://github.com/cduvallet/microbiomeHD

	All versions	This version
Views	18,695	6,965
Downloads	16,178	6,715
Data volume	155.5 GB	83.4 GB
