Software Open Access

CCBR/MicroArrayPipeline: Release v2.0.0

Kristin Valdez; Skyler Kuhn; felloumi

MAAPster (v2.0.0)


MicroArray Analysis Pipeline (MAAPster) is a comprehensive tool designed by the CCR Collaborative Bioinformatics Resource (CCBR) that performs transcriptome analysis of human or mouse Affymetrix gene expression data. Samples may be uploaded locally or pulled from publicly available data by entering a GEO Series identification number. MAAPster is able to analyze data from a single experiment, or GEO Series, and allows for the investigation of multiple contrasts between groups of samples. Raw CEL files are analyzed using several Bioconductor packages in ‘R’, including limma, oligo, and gsva (Carvalho & Irizarry, 2010; Hanzelmann, Castelo, & Guinney, 2013; Ritchie et al., 2015). Output includes quality control plots, normalized gene expression data, differential gene expression analysis, gene expression heatmaps, pathway analysis and single sample GSEA analysis.

Visit https://maap.ccr.cancer.gov/ to run MAAPster now! 

Input

There are 2 input options:

  1. Local File Upload: Analyze expression data by uploading CEL files from a local computer. Data must be expression data in CEL file format, and all data must be from the same type of Affymetrix chip.

  2. GEO: Analyze expression data from the Gene Expression Omnibus (GEO) https://www.ncbi.nlm.nih.gov/geo/ by providing the relevant series ID from the experiment, as noted below. Enter the entire ID, starting with GSE. Samples from different chips cannot be analyzed together. If the series ID contains samples from multiple chips, choose the type of chip from the drop-down menu. Samples analyzed on each chip will populate as the chip is chosen

Once the files are uploaded, choose groups for each sample. At least 2 groups must be created. The control group should be entered last. For example:

Disease
VS
Control

Optional: If batch effects are present in the dataset, assign batch names as shown. Batch effects will be corrected for differential gene expression analysis by incorporating batches into the linear model.

For plots based on normalized gene expression and before ssGSEA analysis, normalized gene expression is corrected for batch effects with the ComBat function from the sva package (Leek, Johnson, Parker, Jaffe, & Storey, 2012). 

Array Probe Quality Control

Two plots, NUSE and RLE, display quality control metrics for the microarray probes. Plots are created with the oligo package:

  • The Normalized Unscaled Standard Error (NUSE) plot shows standard error estimates for each sample. These estimates are normalized to 1 across all samples, so a sample with a significantly higher value may be of lower quality (Carvalho & Irizarry, 2010).

The Relative Log Expression (RLE) plot compares the expression level of one probeset to the median expression of that probeset across all samples.

Normalization

Each sample is normalized to correct for technical artifacts using the Robust Multichip Averaging (RMA) method described in (Irizarry et al., 2003). Optional cyclic loess is also available and can be performed if RMA is insufficient (Ballman, Grill, Oberg, & Therneau, 2004; Bolstad, Irizarry, Astrand, & Speed, 2003). Only selected samples are used for normalization. Pre-normalization and post-normalization histograms, MA plots and box plots display relative expression (represented by probe intensity) before and after normalization. Data for the plots are generated with the oligo package (Carvalho & Irizarry, 2010). Normalized data can be downloaded in the DEG-Enrichments Results tab.

Histogram: The histogram displays the density of probes at each log-intensity. Each sample is represented by a curved line. Before normalization, samples should be distributed randomly. An outlier sample may be due to technical problems, and separate groupings of samples may indicate batch effects. Either of these potential issues should be investigated further before downstream analyses. After normalization, all samples lines should follow the same curve. Again, if samples are clustered into groups, batch effects may be present.

MA plot: Log-ratios (M) are plotted against the average log-intensities (A). Plots are generated that compare each sample to the median of all samples. After RMA, the red sample line should be relatively equal to the blue median line.

Boxplot: Displays the distribution of log-intensities for each sample. After RMA, distributions across all samples should be approximately the same.

Sample Quality Control

The sample similarity heatmap and 3D PCA provide information about the quality of replicates in the groups.

3D PCA: In general, samples in the same group should cluster together, and groups of samples should cluster separately from other groups. If batch effects are present, re-run the analysis and assign samples to respective batches for correction. For example, the dataset below has batch effects. The first principal component, PC1, separates replicates 1 and 2 (rep1 and rep2) from replicates 3 and 4 (rep3 and rep4):

Differential Gene Expression

The DEG-Enrichments Results tab displays differentially expressed genes between groups. MAAPster runs analysis in the background with the limma package. Linear modeling is performed using limma’s lmFit function, and differential gene expression is determined using the contrasts.fit and eBayes functions (Ritchie et al, 2015). Using the toptable function, false discovery rates are calculated to adjust p-values for multiple testing (Ritchie et al., 2015).

Pathway Analysis

The top 500 significantly upregulated and top 500 significantly downregulated genes are extracted for pathway analysis (significance is determined as unadjusted p-value < 0.05), Pathway analysis is performed with CCBR’s l2p R package (https://github.com/CCBR/l2p).

Single-sample GSEA (ssGSEA)

ssGSEA is performed using the gsva package as previously described (Hanzelmann et al., 2013). Pathway enrichment scores are then analyzed to determine fold changes and p-values between groups of samples, similar to the differential gene expression analysis described above. Differential pathway enrichment is ranked by p-value, and the top 50 pathways are displayed in a heatmap. Human gene set modules were downloaded from the BROAD Institute’s MSigDB, and mouse gene set modules were downloaded from the Gene Set Knowledgebase.

Release Notes

  • Ported key functions of the shiny application to an R package: mpstr
  • Batch correction added
  • Support for multi-chip GEO experiments
  • Upgraded L2P to latest R package version
  • Supports uploading group and/or batch information from an excel file
  • Supports one-to-many relationships between sample and group metadata
  • GSE upload error catching
    • When the user enters an invalid GSE ID: ie 1234
    • When the user enters a valid GSE ID for a non-Affymetrix chip: ie GSE141545
    • When the user enters a valid GSE ID for an Affymetrix chip that is not supported: ie for rat GSE86329
The following chips are supported, and additional Affymetrix mouse or human chips may be added upon request: "Human Genome U133 Plus 2.0 Array", "GeneChip™ Human Genome U133A Array", "GeneChip™ Human Genome U133A 2.0 Array", "GeneChip™ Human Genome U133B Array", "GeneChip™ Human Gene 1.0 ST Array (v1 and v2)", "GeneChip™ Human Gene 1.1 ST Array Version 1", "GeneChip™ Human Gene 2.0 ST Array", "GeneChip™ Human Gene 2.1 ST", "Clariom™ S Assay HT, human", "Clariom™ S Assay, human", "Human Genome U219 Array", "GeneChip Human Genome U95 Version 2", "GeneChip™ Human Transcriptome Array 2.0", "Human Exon 1.0 ST Array", "HT Human Genome U133 Array", "Clariom™ D Assay, human", "GeneChip™ Mouse Gene 1.0 ST Array", "GeneChip™ Mouse Gene 1.1 ST Array", "GeneChip™ Mouse Gene 2.0 ST Array", "Clariom™ S Assay HT, mouse", "Clariom™ S Assay, mouse", "GeneChip™ Mouse Genome 430 2.0 Array", "GeneChip™ Mouse Genome 430A 2.0 Array", "GeneChip® Mouse Expression Set 430", "GeneChip® Murine Genome U74v2 Set", "Mouse Exon 1.0 ST Array"
Files (17.4 MB)
Name Size
CCBR/MicroArrayPipeline-v2.0.0.zip
md5:bf343e5eb5b25f5efa25b4a9355f426a
17.4 MB Download
  • Ballman, K. V., Grill, D. E., Oberg, A. L., & Therneau, T. M. (2004). Faster cyclic loess: normalizing RNA arrays via linear models. Bioinformatics, 20(16), 2778-2786. doi:10.1093/bioinformatics/bth327

  • Bolstad, B. M., Irizarry, R. A., Astrand, M., & Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185-193.

  • Carvalho, B. S., & Irizarry, R. A. (2010). A framework for oligonucleotide microarray preprocessing. Bioinformatics, 26(19), 2363-2367. doi:10.1093/bioinformatics/btq431

  • Hanzelmann, S., Castelo, R., & Guinney, J. (2013). GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics, 14, 7. doi:10.1186/1471-2105-14-7

  • Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., & Speed, T. P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2), 249-264. doi:10.1093/biostatistics/4.2.249

  • Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E., & Storey, J. D. (2012). The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics, 28(6), 882-883. doi:10.1093/bioinformatics/bts034

  • Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res, 43(7), e47. doi:10.1093/nar/gkv007

119
4
views
downloads
All versions This version
Views 119119
Downloads 44
Data volume 69.8 MB69.8 MB
Unique views 9696
Unique downloads 33

Share

Cite as