Dataset Open Access

MicrobiomeHD: the human gut microbiome in health and disease

Duvallet, Claire; Gibbons, Sean; Gurry, Thomas; Irizarry, Rafael; Alm, Eric

Overview

MicrobiomeHD is a standardized database of human gut microbiome studies in health and disease. This database includes publicly available 16S data from published case-control studies and their associated patient metadata. Raw sequencing data for each study was downloaded and processed through a standardized pipeline.

To be included in MicrobiomeHD, datasets have:

  • publicly available raw sequencing data (fastq or fasta)
  • publicly available metadata with at least case and control labels for each patient

Currently, MicrobiomeHD is focused on stool samples. Additional samples may be included in certain datasets, as indicated in the metadata.

Files

Additional information about the datasets included in this MicrobiomeHD release is in the MicrobiomeHD GitHub repo https://github.com/cduvallet/microbiomeHD, in the file db/dataset_info.yaml. Top-level identifiers correspond to dataset IDs, labeled as disease_first-author. For the most part, sample sizes in the yaml file are those described in the papers, and may not exactly reflect the actual data (due to missing or extra data, samples which didn't pass quality control, etc).

Each dataset was downloaded and processed through a standardized pipeline. The raw processing results are available in the *.tar.gz files here. Each file has the same directory structure and files, as described in the pipeline documentation: http://amplicon-sequencing-pipeline.readthedocs.io/en/latest/output.html.
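A minimal sketch of inspecting one of the downloaded archives without unpacking it to disk, using only the Python standard library. The function names are illustrative and not part of the pipeline; the sketch assumes nothing about the internal layout beyond the archive being a gzip-compressed tar file.

```python
import tarfile


def list_archive_files(archive_path):
    """Return the sorted names of all regular files inside a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


def find_member(archive_path, suffix):
    """Return the first file whose name ends with `suffix`, or None if absent."""
    for name in list_archive_files(archive_path):
        if name.endswith(suffix):
            return name
    return None


def read_member_text(archive_path, member_name, encoding="utf-8"):
    """Read one file from the archive into a string without extracting it."""
    with tarfile.open(archive_path, "r:gz") as tar:
        handle = tar.extractfile(member_name)
        if handle is None:
            raise ValueError(f"{member_name} is not a regular file")
        return handle.read().decode(encoding)
```

For example, `find_member("cdi_schubert_results.tar.gz", "summary_file.txt")` would return the path of the summary file inside that archive, if the archive has been downloaded to the working directory.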

Specific files of interest in each *.tar.gz folder include:

  • summary_file.txt: this file contains a summary of all parameters used to process the data
  • datasetID.metadata.txt: the metadata associated with the samples. Note that some samples in the metadata may not have sequencing data, and vice versa.
  • RDP/datasetID.otu_table.100.denovo.rdp_assigned: the 100% OTU tables with Latin taxonomic names assigned using the RDP classifier (c = 0.5).
  • datasetID.otu_seqs.100.fasta: representative sequences for each OTU in the 100% OTU table. OTU labels in the OTU table end with d__denovoID - these denovoIDs correspond to the sequences in this file.
  • README.txt: additional information about steps taken to download and process each dataset, as needed.
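The link between OTU labels and representative sequences described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's own code: it relies only on the fact that labels end with d__denovoID, and it assumes that the FASTA header of each representative sequence begins with that same denovoID. Both function names are hypothetical.

```python
def split_otu_label(label):
    """Split an OTU label into (taxonomy, denovo_id).

    Only the documented convention is used: the label ends with
    'd__' followed by the denovo ID.
    """
    taxonomy, marker, denovo_id = label.rpartition("d__")
    if not marker or not denovo_id:
        raise ValueError(f"no d__denovoID suffix in label: {label!r}")
    # Drop the separator left between the taxonomy and the d__ field.
    return taxonomy.rstrip(";"), denovo_id


def read_fasta(lines):
    """Parse FASTA lines into a dict mapping record ID to sequence.

    Assumption: the record ID is the first whitespace-delimited token
    of each header line.
    """
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

With the sequences loaded by `read_fasta`, the representative sequence for an OTU row can be looked up using the second element returned by `split_otu_label`.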

The raw data was acquired as described in the supplementary materials of Duvallet et al.'s "Meta analysis of microbiome studies identifies shared and disease-specific patterns" and, when available, the respective dataset README files.

Raw sequencing data was processed with the Alm lab's in-house 16S processing pipeline: https://github.com/thomasgurry/amplicon_sequencing_pipeline

Pipeline documentation is available at: http://amplicon-sequencing-pipeline.readthedocs.io/

Metadata was extracted from the original papers and/or data sources, and formatted manually. When possible, these steps are documented in each dataset's associated README.txt file.

Contributing

MicrobiomeHD is a resource that can be used to extract disease-specific microbiome signals in individual case-control studies. Many microbes respond non-specifically to health and disease, and the majority of bacterial associations within individual studies overlap with this non-specific response. Researchers should cross-check their results with the data presented here to ensure that their identified microbial associations are specific to their disease under study.

We provide an updated list of non-specific microbes here, as well as the raw OTU tables for anyone who wishes to reproduce and adapt this analysis to their study question.
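The cross-check suggested above might look like the following sketch. It assumes you have already read the genus names from the non-specific list into a Python collection (the layout of that file is not described here, so loading it is left to the reader) and that your own significant genera use comparable names. The function name and the case-insensitive matching are illustrative choices.

```python
def split_by_specificity(significant_genera, nonspecific_genera):
    """Partition genera into (shared, specific).

    shared:   genera that also appear in the non-specific list
    specific: genera that do not, and so are candidates for a
              disease-specific signal
    Matching ignores case and surrounding whitespace (an assumption).
    """
    nonspecific = {g.strip().lower() for g in nonspecific_genera}
    shared, specific = [], []
    for genus in significant_genera:
        if genus.strip().lower() in nonspecific:
            shared.append(genus)
        else:
            specific.append(genus)
    return shared, specific
```

Genera that land in the first list overlap with the non-specific response and warrant caution before being reported as specific to the disease under study.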

If you would like to include your case-control dataset in MicrobiomeHD, please email ejalm[at]mit.edu and duvallet[at]mit.edu.

For us to process your data through our standard pipeline, you will need to provide the following files and information about your data:

  • raw sequencing data in fastq or fasta format (preferably fastq)
  • information about which processing steps will be required (e.g. removing primers or barcodes, merging paired-end reads, etc)
  • sample IDs associated with the sequencing data (either mapped to barcodes still in the sequences, or to each de-multiplexed sequencing file)
  • case/control metadata of each sample
  • other relevant metadata (e.g. sampling site, if not all samples are stool; sampling time point, if multiple samples per patient were taken; etc)

By using MicrobiomeHD in your own analyses, you agree to contribute your dataset to this database and to make your raw sequencing data (i.e. fastq files) publicly available.

Citing MicrobiomeHD

The MicrobiomeHD database and original publications for each of these datasets are described in Duvallet et al. (2017): http://dx.doi.org/10.1038/s41467-017-01973-8

Duvallet, C., Gibbons, S. M., Gurry, T., Irizarry, R. A., & Alm, E. J. (2017). Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nature Communications, 8(1), 1784.

If you use any of these datasets in your analysis, please cite both MicrobiomeHD (Duvallet et al., 2017) and the original publication for each dataset that you use.

The code used to process and analyze these data in the paper is available on GitHub: https://github.com/cduvallet/microbiomeHD

Files

Data files

file-S3.nonspecific_genera.txt: Supplemental Table 3 from Duvallet et al. (2017), listing the non-specific health- and disease-associated microbes.
dataset_info.yaml: yaml file with additional dataset metadata.

Datasets

Note that MicrobiomeHD contains all 28 datasets from Duvallet et al. (2017), as well as additional datasets which did not meet the inclusion criteria for the meta-analysis presented in the paper. Additional information about the datasets included in this MicrobiomeHD release is in the original publications, in the MicrobiomeHD GitHub repo https://github.com/cduvallet/microbiomeHD, and in the file dataset_info.yaml.

The sample sizes listed here reflect what was reported in the original publications. Some may have discrepancies between what is reported and what is in the actual data due to missing data, quality issues, barcode mismatches, etc.

  • asd_son_results.tar.gz (asd_son): NT: 44, ASD: 59
    • http://dx.doi.org/10.1371/journal.pone.0137725
  • autism_kb_results.tar.gz (asd_kang): H: 20, ASD: 20
    • http://dx.doi.org/10.1371/journal.pone.0068322
  • cdi_schubert_results.tar.gz (cdi_schubert): H: 155, nonCDI: 89, CDI: 94
    • http://dx.doi.org/10.1128/mBio.01021-14
  • cdi_vincent_v3v5_results.tar.gz (cdi_vincent): H: 25, CDI: 25
    • http://dx.doi.org/10.1186/2049-2618-1-18
  • cdi_youngster_results.tar.gz (cdi_youngster): H: 4, CDI: 19
    • http://dx.doi.org/10.1093/cid/ciu135
  • crc_baxter_results.tar.gz (crc_baxter): adenoma: 198, H: 172, CRC: 120
    • http://dx.doi.org/10.1186/s13073-016-0290-3
  • crc_xiang_results.tar.gz (crc_chen): H: 22, CRC: 21
    • http://dx.doi.org/10.1371/journal.pone.0039743
  • crc_zackular_results.tar.gz (crc_zackular): adenoma: 30, H: 30, CRC: 30
    • http://dx.doi.org/10.1158/1940-6207.CAPR-14-0129
  • crc_zeller_results.tar.gz (crc_zeller): H: 75, CRC: 41
    • http://dx.doi.org/10.15252/msb.20145645
  • crc_zhao_results.tar.gz (crc_wang): H: 56, CRC: 46
    • http://dx.doi.org/10.1038/ismej.2011.109
  • edd_singh_results.tar.gz (edd_singh): STEC: 28, CAMP: 71, SALM: 66, SHIG: 34, H: 75
    • http://dx.doi.org/10.1186/s40168-015-0109-2
  • hiv_dinh_results.tar.gz (hiv_dinh): H: 16, HIV: 21
    • http://dx.doi.org/10.1093/infdis/jiu409
  • hiv_lozupone_results.tar.gz (hiv_lozupone): H: 13, HIV: 25
    • http://dx.doi.org/10.1016/j.chom.2013.08.006
  • hiv_noguerajulian_results.tar.gz (hiv_noguerajulian): H: 34, HIV: 206
    • https://doi.org/10.1016/j.ebiom.2016.01.032
  • ibd_alm_results.tar.gz (ibd_papa): IBDundef: 1, nonIBD: 24, UC: 43, CD: 23
    • http://dx.doi.org/10.1371/journal.pone.0039242
  • ibd_engstrand_maxee_results.tar.gz (ibd_willing): CCD: 12, H: 35, ICD: 15, UC: 16, ICCD: 2
    • http://dx.doi.org/10.1053/j.gastro.2010.08.049
  • ibd_gevers_2014_results.tar.gz (ibd_gevers): H: 31, CD: 224
    • http://dx.doi.org/10.1016/j.chom.2014.02.005
  • ibd_huttenhower_results.tar.gz (ibd_morgan): H: 18, UC: 48, CD: 62
    • http://dx.doi.org/10.1186/gb-2012-13-9-r79
  • mhe_zhang_results.tar.gz (liv_zhang): CIRR: 25, H: 26, MHE: 26
    • http://dx.doi.org/10.1038/ajg.2013.221
  • nash_chan_results.tar.gz (nash_wong): H: 22, NASH: 16
    • http://dx.doi.org/10.1371/journal.pone.0062885
  • nash_ob_baker_results.tar.gz (nash_ob_zhu): H: 16, NASH: 22, OB: 25
    • http://dx.doi.org/10.1002/hep.26093
  • ob_escobar_results.tar.gz (ob_escobar): OW: 10, H: 10, OB: 10
    • https://doi.org/10.1186/s12866-014-0311-6
  • ob_goodrich_results.tar.gz (ob_goodrich): OW: 322, H: 433, OB: 183
    • http://dx.doi.org/10.1016/j.cell.2014.09.053
  • ob_gordon_2008_v2_results.tar.gz (ob_turnbaugh): H: 61, OB: 219
    • http://dx.doi.org/10.1038/nature07540
  • ob_jumpertz_results.tar.gz (ob_jumpertz): H: 12, OB: 9
    • http://ajcn.nutrition.org/content/early/2011/05/03/ajcn.110.010132
  • ob_ross_results.tar.gz (ob_ross): H: 26, OB: 37
    • http://dx.doi.org/10.1186/s40168-015-0072-y
  • ob_wu_results.tar.gz (ob_wu): bmi_data: 101
    • http://dx.doi.org/10.1126/science.1208344
  • ob_zeevi_results.tar.gz (ob_zeevi): bmi_data: 870
    • http://dx.doi.org/10.1016/j.cell.2015.11.001
  • ob_zupancic_results.tar.gz (ob_zupancic): H: 167, OB: 117
    • http://dx.doi.org/10.1371/journal.pone.0043052
  • par_scheperjans_results.tar.gz (par_scheperjans): H: 72, PAR: 72
    • http://dx.doi.org/10.1002/mds.26069
  • ra_littman_results.tar.gz (art_scher): H: 28, NORA: 44, CRA: 26, PSA: 16
    • http://dx.doi.org/10.7554/eLife.01202
  • t1d_alkanani_results.tar.gz (t1d_alkanani): T1D: 21, H: 55, T1D_new-onset: 35
    • http://dx.doi.org/10.2337/db14-1847
  • t1d_mejialeon_results.tar.gz (t1d_mejialeon): T1D: 21, H: 8
    • http://dx.doi.org/10.1038/srep03814

Version changes

Version 3

  • added missing ob_escobar metadata
  • added ob_jumpertz, ob_zeevi, and ob_wu
  • added README.txt files to all folders, with info about data downloading and processing steps
  • removed deprecated quality_control folders from all dataset results
  • updated Supplemental File S3 to the latest version of the non-specific genera (as published in Duvallet et al. (2017))

Version 2

  • added crc_zhu and ob_escobar datasets
  • added list of core genera and dataset_info.yaml

Files (165.1 MB)
Name, size, and MD5 checksum of each file:

  • asd_son_results.tar.gz: 2.0 MB, md5:267c109cab9d1c8f8d54712d53825d87
  • autism_kb_results.tar.gz: 251.3 kB, md5:f358c440ec806c3bc8a83f7454941f29
  • cdi_schubert_results.tar.gz: 2.2 MB, md5:4655b22631158c23cf1d0fb702823c56
  • cdi_vincent_v3v5_results.tar.gz: 334.1 kB, md5:b259451bece278d95970a229e35fbef7
  • cdi_youngster_results.tar.gz: 7.5 MB, md5:d12b7cf6de7c7a0a0717086e937d75bf
  • crc_baxter_results.tar.gz: 17.7 MB, md5:8e26fd3c3219807a63546265cea0bf63
  • crc_xiang_results.tar.gz: 152.5 kB, md5:bb1f4fc37c5d3de372cc8ac745c146c3
  • crc_zackular_results.tar.gz: 7.5 MB, md5:acba967fe9304b306dd9fd22ab15e648
  • crc_zeller_results.tar.gz: 11.8 MB, md5:20b8e24c4d7afdf73c45ad9c581e133a
  • crc_zhao_results.tar.gz: 96.8 kB, md5:050784d6a65ee7e6651da99352e4fe16
  • dataset_info.yaml: 28.2 kB, md5:bce52d64570fa2c100d5780077461ce1
  • edd_singh_results.tar.gz: 616.7 kB, md5:9e369e1e62cb87a53f56059194e74b99
  • file-S3.nonspecific_genera.txt: 14.9 kB, md5:193a602bc4b1292c3ce4a5589480552d
  • hiv_dinh_results.tar.gz: 473.1 kB, md5:40052d2c92203684ac95893e65b792da
  • hiv_lozupone_results.tar.gz: 887.9 kB, md5:c5fb9498d3c7fe894da3b1c57010aaa2
  • hiv_noguerajulian_results.tar.gz: 12.8 MB, md5:f3b20839a201ed6d6a92eeb7ada95e20
  • ibd_alm_results.tar.gz: 5.0 MB, md5:51692965a2ca22ff413c891785fbe011
  • ibd_engstrand_maxee_results.tar.gz: 1.5 MB, md5:e0e9d2f3bfe2a60579f3da9710e0f849
  • ibd_gevers_2014_results.tar.gz: 13.3 MB, md5:10ba9aeeb1894bddc8f712fa8ad4578c
  • ibd_huttenhower_results.tar.gz: 794.5 kB, md5:dd760c6519f76d22c7105d878d699673
  • mhe_zhang_results.tar.gz: 292.6 kB, md5:b5bfdd6683e9a18c32a29aea550f1df7
  • nash_chan_results.tar.gz: 837.3 kB, md5:ca799881665ffd13a612a07a90f44924
  • nash_ob_baker_results.tar.gz: 2.9 MB, md5:671fb14cde25766f93672f71ecceff22
  • ob_escobar_results.tar.gz: 225.7 kB, md5:480e54ef35844246cbf0d2053d4dfbdb
  • ob_goodrich_results.tar.gz: 22.2 MB, md5:9f3fc0fa2a3e2fcfc6cbf24c99e9762d
  • ob_gordon_2008_v2_results.tar.gz: 6.5 MB, md5:23ebfb3e6c78d2e9794dbe0659813314
  • ob_jumpertz_results.tar.gz: 1.2 MB, md5:b71b4ebe52b9f668a31c19b6a5b8207b
  • ob_ross_results.tar.gz: 584.3 kB, md5:f2949aac1d8342b3978ec63a55cd02ad
  • ob_wu_results.tar.gz: 3.0 MB, md5:0a5fd3811024399df8cae8cf2d3d4c35
  • ob_zeevi_results.tar.gz: 20.2 MB, md5:07029ea6518b7f970dd3ba4ae13691d2
  • ob_zupancic_results.tar.gz: 12.4 MB, md5:e3de99636c5f811765f3adade2c01ce6
  • par_scheperjans_results.tar.gz: 1.4 MB, md5:b0789854ef54157f4f41255340c880de
  • ra_littman_results.tar.gz: 1.1 MB, md5:3198a0e16f32026cee847bdd3e3d052c
  • t1d_alkanani_results.tar.gz: 7.3 MB, md5:556eb450ee0d4c4a435f9167308cf0d0
  • t1d_mejialeon_results.tar.gz: 245.8 kB, md5:99a828d6a7d506013468ec8e20467fc6
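
The MD5 checksums listed for each file can be used to confirm that a download is complete and uncorrupted. A minimal sketch using Python's standard hashlib module follows; the function names are illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed checksum such as 'md5:267c...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listed_md5("asd_son_results.tar.gz", "md5:267c109cab9d1c8f8d54712d53825d87")` should return True for an intact copy of that archive.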