There is a newer version of this record available.

Dataset Open Access

MicrobiomeHD: the human gut microbiome in health and disease

Duvallet, Claire; Gibbons, Sean; Gurry, Thomas; Irizarry, Rafael; Alm, Eric

Citation Style Language JSON Export

  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.569601", 
  "author": [
      "family": "Duvallet, Claire"
      "family": "Gibbons, Sean"
      "family": "Gurry, Thomas"
      "family": "Irizarry, Rafael"
      "family": "Alm, Eric"
  "issued": {
    "date-parts": [
  "abstract": "<p><strong>Overview</strong></p>\n\n<p>MicrobiomeHD is a standardized database of human gut microbiome studies in health and disease. This database includes publicly available 16S data from published case-control studies and their associated patient metadata. Raw sequencing data for each study was downloaded and processed through a standardized pipeline.</p>\n\n<p>To be included in MicrobiomeHD, datasets have:</p>\n\n<ul>\n\t<li>publicly available raw sequencing data (fastq or fasta)</li>\n\t<li>publicly available metadata with at least case and control labels for each patient</li>\n\t<li>at least 15 case patients</li>\n</ul>\n\n<p>Currently, MicrobiomeHD is focused on stool samples. Additional samples may be included in certain datasets, as indicated in the metadata.</p>\n\n<p><strong>Files</strong></p>\n\n<p>Additional information about the datasets included in this MicrobiomeHD release is in the MicrobiomeHD GitHub repo, in the file <em>db/dataset_info.yaml</em>. Top-level identifiers correspond to the dataset IDs used in Duvallet et al. 2017. Sample sizes in the yaml file are those that were described in the papers, and may not exactly reflect the actual data (due to missing/extra data, samples which didn't pass quality control, etc.).</p>\n\n<p>Each dataset was downloaded and processed through a standardized pipeline. The raw processing results are available in the *.tar.gz files here. Each file has the same directory structure and files, as described in the pipeline documentation:</p>\n\n<p>Specific files of interest include:</p>\n\n<ul>\n\t<li><strong>summary_file.txt</strong>: this file contains a summary of all parameters used to process the data</li>\n\t<li><strong>datasetID.metadata.txt</strong>: the metadata associated with the samples. 
Note that some samples in the metadata may not have sequencing data, and vice versa.</li>\n\t<li><strong>RDP/datasetID.otu_table.100.denovo.rdp_assigned</strong>: the 100% OTU tables with Latin taxonomic names assigned using the RDP classifier.</li>\n\t<li><strong>datasetID.otu_seqs.100.fasta</strong>: representative sequences for each OTU in the 100% OTU table. OTU labels in the OTU table end with d__denovoID - these denovoIDs correspond to the sequences in this file.</li>\n</ul>\n\n<p><strong>Processing</strong></p>\n\n<p>The raw data was acquired as described in the supplementary materials of Duvallet et al.'s \"Meta analysis of microbiome studies identifies shared and disease-specific patterns\".</p>\n\n<p>Raw sequencing data was processed with the Alm lab's in-house 16S processing pipeline:</p>\n\n<p>Pipeline documentation is available at:</p>\n\n<p>Metadata was extracted from the original papers and/or data sources, and formatted manually.</p>\n\n<p><strong>Contributing</strong></p>\n\n<p>MicrobiomeHD is a resource that can be used to extract disease-specific microbiome signals in individual case-control studies. Many microbes respond non-specifically to health and disease, and the majority of bacterial associations within individual studies overlap with this \"core\" response. 
Researchers should cross-check their results with the data presented here to ensure that their identified microbial associations are specific to their disease under study.</p>\n\n<p>We provide an updated list of \"core\" microbes here, as well as the raw OTU tables for anyone who wishes to reproduce and adapt this analysis to their study question.</p>\n\n<p>If you would like to include your case-control dataset in MicrobiomeHD, please email duvallet[at]</p>\n\n<p>For us to process your data through our standard pipeline, you will need to provide the following files and information about your data:</p>\n\n<ul>\n\t<li>raw sequencing data in fastq or fasta format (preferably fastq)</li>\n\t<li>information about which processing steps will be required (e.g. removing primers or barcodes, merging paired-end reads, etc)</li>\n\t<li>sample IDs associated with the sequencing data (either mapped to barcodes still in the sequences, or to each de-multiplexed sequencing file)</li>\n\t<li>case/control metadata of each sample</li>\n\t<li>other relevant metadata (e.g. sampling site, if not all samples are stool; sampling time point, if multiple samples per patient were taken; etc)</li>\n</ul>\n\n<p>By using MicrobiomeHD in your own analyses, you agree to contribute your dataset to this database and to make your raw sequencing data (i.e. fastq files) publicly available.</p>\n\n<p><strong>Citing MicrobiomeHD</strong></p>\n\n<p>The MicrobiomeHD database and original publications for each of these datasets are described in Duvallet et al. (2017):</p>\n\n<p>If you use any of these datasets in your analysis, please cite both MicrobiomeHD (Duvallet et al. (2017)) and the original publication for each dataset that you use.</p>\n\n<p>The code used to process and analyze this data in Duvallet et al. (2017) is available on github:</p>\n\n<p><strong>Files</strong></p>\n\n<p><em>Core genera</em></p>\n\n<p><strong>file-S3.core_genera.txt</strong>: Supplemental Table 3 from Duvallet et al. 
(2017), listing the core health- and disease-associated microbes.</p>\n\n<p><em>Datasets</em></p>\n\n<p>Note that MicrobiomeHD contains all 28 datasets from Duvallet et al. (2017), as well as additional datasets which did not meet the inclusion criteria for the meta-analysis presented in the paper. Additional information about the datasets included in this MicrobiomeHD release is in the original publications and the MicrobiomeHD GitHub repo, in the file <em>db/dataset_info.yaml</em>.</p>\n\n<p>The sample sizes listed here reflect what was reported in the original publications. Some may have discrepancies between what is reported and what is in the actual data due to missing data, quality issues, barcode mismatches, etc.</p>\n\n<ul>\n\t<li><strong>asd_son_results.tar.gz</strong> (<em>asd_son</em>): NT: 44, ASD: 59</li>\n\t<li><strong>autism_kb_results.tar.gz</strong> (<em>asd_kang</em>): H: 20, ASD: 20</li>\n\t<li><strong>cdi_schubert_results.tar.gz</strong> (<em>noncdi_schubert</em>): H: 155, nonCDI: 89, CDI: 94</li>\n\t<li><strong>cdi_vincent_v3v5_results.tar.gz</strong> (<em>cdi_vincent</em>): H: 25, CDI: 25</li>\n\t<li><strong>cdi_youngster_results.tar.gz</strong> (<em>cdi_youngster</em>): H: 4, CDI: 19</li>\n\t<li><strong>crc_baxter_results.tar.gz</strong> (<em>crc_baxter</em>): adenoma: 198, H: 172, CRC: 120</li>\n\t<li><strong>crc_xiang_results.tar.gz</strong> (<em>crc_chen</em>): H: 22, CRC: 21</li>\n\t<li><strong>crc_zackular_results.tar.gz</strong> (<em>crc_zackular</em>): adenoma: 30, H: 30, CRC: 30</li>\n\t<li><strong>crc_zeller_results.tar.gz</strong> (<em>crc_zeller</em>): H: 75, CRC: 
41</li>\n\t<li><strong>crc_zhao_results.tar.gz</strong> (<em>crc_wang</em>): H: 56, CRC: 46</li>\n\t<li><strong>edd_singh_results.tar.gz</strong> (<em>edd_singh</em>): STEC: 28, CAMP: 71, SALM: 66, SHIG: 34, H: 75</li>\n\t<li><strong>hiv_dinh_results.tar.gz</strong> (<em>hiv_dinh</em>): H: 16, HIV: 21</li>\n\t<li><strong>hiv_lozupone_results.tar.gz</strong> (<em>hiv_lozupone</em>): H: 13, HIV: 25</li>\n\t<li><strong>hiv_noguerajulian_results.tar.gz</strong> (<em>hiv_noguerajulian</em>): H: 34, HIV: 206</li>\n\t<li><strong>ibd_alm_results.tar.gz</strong> (<em>ibd_papa</em>): IBDundef: 1, nonIBD: 24, UC: 43, CD: 23</li>\n\t<li><strong>ibd_engstrand_maxee_results.tar.gz</strong> (<em>ibd_willing</em>): CCD: 12, H: 35, ICD: 15, UC: 16, ICCD: 2</li>\n\t<li><strong>ibd_gevers_2014_results.tar.gz</strong> (<em>ibd_gevers</em>): H: 31, CD: 224</li>\n\t<li><strong>ibd_huttenhower_results.tar.gz</strong> (<em>ibd_morgan</em>): H: 18, UC: 48, CD: 62</li>\n\t<li><strong>mhe_zhang_results.tar.gz</strong> (<em>liv_zhang</em>): CIRR: 25, H: 26, MHE: 26</li>\n\t<li><strong>nash_chan_results.tar.gz</strong> (<em>nash_wong</em>): H: 22, NASH: 16</li>\n\t<li><strong>nash_ob_baker_results.tar.gz</strong> (<em>nash_zhu</em>): H: 16, NASH: 22, OB: 25</li>\n\t<li><strong>ob_goodrich_results.tar.gz</strong> (<em>ob_goodrich</em>): OW: 322, H: 433, OB: 183</li>\n\t<li><strong>ob_gordon_2008_v2_results.tar.gz</strong> (<em>ob_turnbaugh</em>): H: 61, OB: 
219</li>\n\t<li><strong>ob_ross_results.tar.gz</strong> (<em>ob_ross</em>): H: 26, OB: 37</li>\n\t<li><strong>ob_zupancic_results.tar.gz</strong> (<em>ob_zupancic</em>): H: 167, OB: 117</li>\n\t<li><strong>par_scheperjans_results.tar.gz</strong> (<em>par_scheperjans</em>): H: 72, PAR: 72</li>\n\t<li><strong>ra_littman_results.tar.gz</strong> (<em>art_scher</em>): H: 28, NORA: 44, CRA: 26, PSA: 16</li>\n\t<li><strong>t1d_alkanani_results.tar.gz</strong> (<em>t1d_alkanani</em>): T1D: 21, H: 55, T1D_new-onset: 35</li>\n\t<li><strong>t1d_mejialeon_results.tar.gz</strong> (<em>t1d_mejialeon</em>): T1D: 21, H: 8</li>\n</ul>\n\n<p><strong>Version changes</strong></p>\n\n<p>Changes in Version 2: added crc_zhu and ob_escobar datasets, as well as a list of core genera.</p>", 
  "title": "MicrobiomeHD: the human gut microbiome in health and disease", 
  "type": "dataset", 
  "id": "569601"
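
The file layout and label conventions described in the abstract can be turned into a few small helpers. The following is a minimal sketch, not part of the MicrobiomeHD code base: the function names are hypothetical, the root directory name is illustrative, and it assumes that "datasetID" in the file names is the same prefix used in the tarball name (for example cdi_schubert for cdi_schubert_results.tar.gz), which should be checked against the extracted archive.

```python
from pathlib import PurePosixPath


def files_of_interest(dataset_id, root="."):
    """Build the relative paths of the files named in the record description.

    Assumption: `dataset_id` is the prefix used inside the archive; the
    description only writes it as the placeholder "datasetID".
    """
    base = PurePosixPath(root)
    return {
        "summary": base / "summary_file.txt",
        "metadata": base / f"{dataset_id}.metadata.txt",
        "otu_table": base / "RDP" / f"{dataset_id}.otu_table.100.denovo.rdp_assigned",
        "otu_seqs": base / f"{dataset_id}.otu_seqs.100.fasta",
    }


def denovo_id(otu_label):
    """Return the text after the final 'd__' of an OTU label.

    The description says OTU labels end with d__denovoID and that these IDs
    match the sequence names in the representative-sequence fasta file.
    """
    _, sep, tail = otu_label.rpartition("d__")
    if not sep or not tail:
        raise ValueError(f"no d__ field in OTU label: {otu_label!r}")
    return tail


def parse_sample_sizes(text):
    """Turn a listing such as 'H: 155, nonCDI: 89' into {'H': 155, 'nonCDI': 89}."""
    sizes = {}
    for part in text.split(","):
        label, _, count = part.partition(":")
        sizes[label.strip()] = int(count)
    return sizes
```

For example, parse_sample_sizes("H: 155, nonCDI: 89, CDI: 94") gives the reported group sizes for the cdi_schubert archive as a dictionary. These are the sizes reported in the original publication, so they may differ from the number of samples actually present in the OTU table.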