Software Open Access
Kristin Valdez; Skyler Kuhn; felloumi
MicroArray Analysis Pipeline (MAAPster) is a comprehensive tool designed by the CCR Collaborative Bioinformatics Resource (CCBR) that performs transcriptome analysis of human or mouse Affymetrix gene expression data. Samples may be uploaded locally or pulled from publicly available data by entering a GEO Series identification number. MAAPster is able to analyze data from a single experiment, or GEO Series, and allows for the investigation of multiple contrasts between groups of samples. Raw CEL files are analyzed using several Bioconductor packages in ‘R’, including limma, oligo, and gsva (Carvalho & Irizarry, 2010; Hanzelmann, Castelo, & Guinney, 2013; Ritchie et al., 2015). Output includes quality control plots, normalized gene expression data, differential gene expression analysis, gene expression heatmaps, pathway analysis and single sample GSEA analysis.
Visit https://maap.ccr.cancer.gov/ to run MAAPster now!
There are 2 input options:
Local File Upload: Analyze expression data by uploading CEL files from a local computer. Data must be expression data in CEL file format, and all data must be from the same type of Affymetrix chip.
GEO: Analyze expression data from the Gene Expression Omnibus (GEO) https://www.ncbi.nlm.nih.gov/geo/ by providing the relevant series ID from the experiment, as noted below. Enter the entire ID, starting with GSE. Samples from different chips cannot be analyzed together. If the series ID contains samples from multiple chips, choose the type of chip from the drop-down menu. Samples analyzed on each chip will populate as the chip is chosen
Once the files are uploaded, choose groups for each sample. At least 2 groups must be created. The control group should be entered last. For example:
Disease VS Control
Optional: If batch effects are present in the dataset, assign batch names as shown. Batch effects will be corrected for differential gene expression analysis by incorporating batches into the linear model.
For plots based on normalized gene expression and before ssGSEA analysis, normalized gene expression is corrected for batch effects with the ComBat function from the sva package (Leek, Johnson, Parker, Jaffe, & Storey, 2012).
Array Probe Quality Control
Two plots, NUSE and RLE, display quality control metrics for the microarray probes. Plots are created with the oligo package:
The Normalized Unscaled Standard Error (NUSE) plot shows standard error estimates for each sample. These estimates are normalized to 1 across all samples, so a sample with a significantly higher value may be of lower quality (Carvalho & Irizarry, 2010).
The Relative Log Expression (RLE) plot compares the expression level of one probeset to the median expression of that probeset across all samples.
Each sample is normalized to correct for technical artifacts using the Robust Multichip Averaging (RMA) method described in (Irizarry et al., 2003). Optional cyclic loess is also available and can be performed if RMA is insufficient (Ballman, Grill, Oberg, & Therneau, 2004; Bolstad, Irizarry, Astrand, & Speed, 2003). Only selected samples are used for normalization. Pre-normalization and post-normalization histograms, MA plots and box plots display relative expression (represented by probe intensity) before and after normalization. Data for the plots are generated with the oligo package (Carvalho & Irizarry, 2010). Normalized data can be downloaded in the DEG-Enrichments Results tab.
Histogram: The histogram displays the density of probes at each log-intensity. Each sample is represented by a curved line. Before normalization, samples should be distributed randomly. An outlier sample may be due to technical problems, and separate groupings of samples may indicate batch effects. Either of these potential issues should be investigated further before downstream analyses. After normalization, all samples lines should follow the same curve. Again, if samples are clustered into groups, batch effects may be present.
MA plot: Log-ratios (M) are plotted against the average log-intensities (A). Plots are generated that compare each sample to the median of all samples. After RMA, the red sample line should be relatively equal to the blue median line.
Boxplot: Displays the distribution of log-intensities for each sample. After RMA, distributions across all samples should be approximately the same.
Sample Quality Control
The sample similarity heatmap and 3D PCA provide information about the quality of replicates in the groups.
3D PCA: In general, samples in the same group should cluster together, and groups of samples should cluster separately from other groups. If batch effects are present, re-run the analysis and assign samples to respective batches for correction. For example, the dataset below has batch effects. The first principal component, PC1, separates replicates 1 and 2 (rep1 and rep2) from replicates 3 and 4 (rep3 and rep4):
Differential Gene Expression
The DEG-Enrichments Results tab displays differentially expressed genes between groups. MAAPster runs analysis in the background with the limma package. Linear modeling is performed using limma’s lmFit function, and differential gene expression is determined using the contrasts.fit and eBayes functions (Ritchie et al, 2015). Using the toptable function, false discovery rates are calculated to adjust p-values for multiple testing (Ritchie et al., 2015).
The top 500 significantly upregulated and top 500 significantly downregulated genes are extracted for pathway analysis (significance is determined as unadjusted p-value < 0.05), Pathway analysis is performed with CCBR’s l2p R package (https://github.com/CCBR/l2p).
Single-sample GSEA (ssGSEA)
ssGSEA is performed using the gsva package as previously described (Hanzelmann et al., 2013). Pathway enrichment scores are then analyzed to determine fold changes and p-values between groups of samples, similar to the differential gene expression analysis described above. Differential pathway enrichment is ranked by p-value, and the top 50 pathways are displayed in a heatmap. Human gene set modules were downloaded from the BROAD Institute’s MSigDB, and mouse gene set modules were downloaded from the Gene Set Knowledgebase.
L2Pto latest R package version
Ballman, K. V., Grill, D. E., Oberg, A. L., & Therneau, T. M. (2004). Faster cyclic loess: normalizing RNA arrays via linear models. Bioinformatics, 20(16), 2778-2786. doi:10.1093/bioinformatics/bth327
Bolstad, B. M., Irizarry, R. A., Astrand, M., & Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185-193.
Carvalho, B. S., & Irizarry, R. A. (2010). A framework for oligonucleotide microarray preprocessing. Bioinformatics, 26(19), 2363-2367. doi:10.1093/bioinformatics/btq431
Hanzelmann, S., Castelo, R., & Guinney, J. (2013). GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics, 14, 7. doi:10.1186/1471-2105-14-7
Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., & Speed, T. P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2), 249-264. doi:10.1093/biostatistics/4.2.249
Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E., & Storey, J. D. (2012). The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics, 28(6), 882-883. doi:10.1093/bioinformatics/bts034
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res, 43(7), e47. doi:10.1093/nar/gkv007