Background

This document contains the code used for the analyses reported in the manuscript:

Systematic comparison of transcriptomes of Caco-2 cells cultured under different cellular and physiological conditions.
By: Janneke Elzinga, Menno Grouls, Guido Hooiveld et al. Published in: Archives of Toxicology (2023), DOI: 10.1007/s00204-022-03430-y

In essence this workflow is based on the work reported in:
Rohart F, Mason EA, Matigian N, Mosbergen R, Korn O, Chen T, Butcher S, Patel J, Atkinson K, Khosrotehrani K, Fisk NM, Lê Cao K, Wells CA. A molecular classification of human mesenchymal stromal cells. PeerJ 2016;4:e1845. https://doi.org/10.7717/peerj.1845

Angel PW, Rajab N, Deng Y, Pacheco CM, Chen T, Lê Cao K, Choi J, Wells CA. A simple, scalable approach to building a cross-platform transcriptome atlas. PLoS Comput Biol 2020;16(9):e1008219. https://doi.org/10.1371/journal.pcbi.1008219

Total run-time is approximately 50 minutes.

Part 1 - Download datasets, combine and transform expression values.

Screening of literature and databases resulted in the identification of 13 series (experiments) comprising 100 relevant samples.

All these samples were run on an Affymetrix platform.

An overview of selected series and samples is given in file “Supplementary_file_1d.xlsx”

Load required libraries for all analyses, and set seed for reproducible results.

The seed applies only to the variance partition analysis, since this analysis involves solving linear mixed models with iterative algorithms. If not set, between runs this may result in small differences in the 5th decimal place.

library(openxlsx)
library(oligo)
library(GEOquery)
library(affycoretools)
library(org.Hs.eg.db)
library(data.table)
library(stringr)
library(dplyr)
library(PCAtools)
library(ggplot2)
library(gridExtra)
library(cowplot)
library(VGAM)
library(RColorBrewer)
library(scales)
library(pheatmap)
library(variancePartition)
library(BiocParallel)
library(clusterProfiler)
library(enrichplot)
library(GO.db)

set.seed(4231)

Start by importing the Excel file with selected experiments, samples and all metadata-information.

samples.overview <- "Supplementary_file_1d.xlsx"
exp.info <- read.xlsx(samples.overview, sheet=1, colNames=TRUE, detectDates = TRUE,  check.names = TRUE)

Extract all GEO experiment (GSE) and platform (GPL) ids, corresponding platform abbreviations, and PdInfo and annotation packages.

  GEO.datasets <- exp.info[, c("GSE.no.", "Platform.abbr.", "GPL.no.", "PdInfo.package", "Annotation.package")]
  GEO.datasets <- GEO.datasets[!duplicated(GEO.datasets[,"GSE.no."])   , ]

If needed, install required PdInfo and annotation packages.

#BiocManager::install( c(unique(GEO.datasets$"PdInfo.package"), unique(GEO.datasets$"Annotation.package")) )

Check how many samples and experiments were analysed on each array platform.

# number of experiments (GSE) per array platform
table(GEO.datasets$Platform.abbr.)
## 
##       hgu133a2    hgu133plus2          hta20       hugene10       hugene11 
##              1              4              1              2              1 
##       hugene21 nugohs1a520180 
##              3              1
# number of samples per array platform
table(exp.info$Platform.abbr.)
## 
##       hgu133a2    hgu133plus2          hta20       hugene10       hugene11 
##              4             29              9             12              8 
##       hugene21 nugohs1a520180 
##             30              8

Download archive with raw data (CEL files) and extract archive. Per array perform bg-correction and summarize expression on the level of probesets, and add annotation. Do not normalize the expression values.
Save all results in separate subdirectories.
Please note that some series contain additional data, such as miRNA expression data. Therefore, delete all micro-RNA samples that were jointly submitted with the samples of interest, as well as the corresponding GSEMatrix files, since these are not of interest.
Lastly, series GSE30292 contains a corrupt CEL file (GSM750879_01_E3.CEL.gz), so remove it before processing.

Custom function downloadProcessGEO does all of this.

Source all helpFunctions.

source("helpFunctions.R")
downloadProcessGEO(GEO.datasets, remove.cel=c("GSM750879_01_E3.CEL.gz"), normalize=FALSE )
## Extracting TAR archive containing raw data for series GSE115022 
## 
## 
## Identified a single GEO metadata file (GSEMatrix)
##  for the required platform entry: GPL11532 .
## This metadata file is: GSE115022_series_matrix.txt.gz 
## 
## Reading in : GSE115022/GSM3163140_G146_35_1_CTRL_PROBIOTICS_G10_82.CEL.gz
## Reading in : GSE115022/GSM3163141_G146_37_1_STRAIN_2_G12_84.CEL.gz
## Reading in : GSE115022/GSM3163142_G146_36_1_STRAIN_1_G11_83.CEL.gz
## Reading in : GSE115022/GSM3163143_G146_38_1_STRAIN_3_H01_85.CEL.gz
## Reading in : GSE115022/GSM3163144_G146_30_1_CTRL_FIBER_F07_67.CEL.gz
## Reading in : GSE115022/GSM3163145_G146_31_1_FIBER_1_F08_68.CEL.gz
## Reading in : GSE115022/GSM3163146_G146_32_1_FIBER_2_F09_69.CEL.gz
## Reading in : GSE115022/GSM3163147_G146_33_1_FIBER_3_F10_70.CEL.gz
## Reading in : GSE115022/GSM3163148_G146_34_1_FIBER_4_F11_71.CEL.gz
## Reading in : GSE115022/GSM3163149_G146_35_2_CTRL_PROBIOTICS_H02_86.CEL.gz
## Reading in : GSE115022/GSM3163150_G146_37_2_STRAIN_2_H04_88.CEL.gz
## Reading in : GSE115022/GSM3163151_G146_36_2_STRAIN_1_H03_87_2.CEL.gz
## Reading in : GSE115022/GSM3163152_G146_38_2_STRAIN_3_H05_89.CEL.gz
## Reading in : GSE115022/GSM3163153_G146_30_2_CTRL_FIBER_F12_72_2.CEL.gz
## Reading in : GSE115022/GSM3163154_G146_31_2_fiber_1_G01_73.CEL.gz
## Reading in : GSE115022/GSM3163155_G146_32_2_FIBER_2_G02_74.CEL.gz
## Reading in : GSE115022/GSM3163156_G146_33_2_FIBER_3_G03_75.CEL.gz
## Reading in : GSE115022/GSM3163157_G146_34_2_FIBER_4_G04_76.CEL.gz
## Reading in : GSE115022/GSM3163158_G146_30_3_CTRL_FIBER_G05_77.CEL.gz
## Reading in : GSE115022/GSM3163159_G146_31_3_FIBER_1_G06_78.CEL.gz
## Reading in : GSE115022/GSM3163160_G146_32_3_FIBER_2_G07_79.CEL.gz
## Reading in : GSE115022/GSM3163161_G146_33_3_FIBER_3_G08_80.CEL.gz
## Reading in : GSE115022/GSM3163162_G146_34_3_FIBER_4_G09_81_2.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE156269 
## 
## 
## Identified a single GEO metadata file (GSEMatrix)
##  for the required platform entry: GPL17692 .
## This metadata file is: GSE156269_series_matrix.txt.gz 
## 
## Reading in : GSE156269/GSM4728034_G322_F07_T1.CEL.gz
## Reading in : GSE156269/GSM4728035_G322_F09_T2.CEL.gz
## Reading in : GSE156269/GSM4728036_G322_G05_T3.CEL.gz
## Reading in : GSE156269/GSM4728037_G322_G07_T4.CEL.gz
## Reading in : GSE156269/GSM4728038_G322_G09_C1.CEL.gz
## Reading in : GSE156269/GSM4728039_G322_H05_C2_2.CEL.gz
## Reading in : GSE156269/GSM4728040_G322_H07_C3.CEL.gz
## Reading in : GSE156269/GSM4728041_G322_H09_C4.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE15636 
## 
## 
## Identified a single GEO metadata file (GSEMatrix)
##  for the required platform entry: GPL570 .
## This metadata file is: GSE15636_series_matrix.txt.gz 
## 
## Reading in : GSE15636/GSM391349.CEL.gz
## Reading in : GSE15636/GSM391350.CEL.gz
## Reading in : GSE15636/GSM391351.CEL.gz
## Reading in : GSE15636/GSM391352.CEL.gz
## Reading in : GSE15636/GSM391353.CEL.gz
## Reading in : GSE15636/GSM391354.CEL.gz
## Reading in : GSE15636/GSM391355.CEL.gz
## Reading in : GSE15636/GSM391356.CEL.gz
## Reading in : GSE15636/GSM391357.CEL.gz
## Reading in : GSE15636/GSM391358.CEL.gz
## Reading in : GSE15636/GSM391359.CEL.gz
## Reading in : GSE15636/GSM391360.CEL.gz
## Reading in : GSE15636/GSM391361.CEL.gz
## Reading in : GSE15636/GSM391362.CEL.gz
## Reading in : GSE15636/GSM391363.CEL.gz
## Reading in : GSE15636/GSM391364.CEL.gz
## Reading in : GSE15636/GSM391365.CEL.gz
## Reading in : GSE15636/GSM391366.CEL.gz
## Reading in : GSE15636/GSM391367.CEL.gz
## Reading in : GSE15636/GSM391368.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE158620 
## 
## 
## Identified a single GEO metadata file (GSEMatrix)
##  for the required platform entry: GPL17692 .
## This metadata file is: GSE158620_series_matrix.txt.gz 
## 
## Reading in : GSE158620/GSM4804472_G327_A05_02_TW2_CtrlTiO2.CEL.gz
## Reading in : GSE158620/GSM4804473_G327_A07_03_TW3_CtrlTiO2.CEL.gz
## Reading in : GSE158620/GSM4804474_G327_A09_04_TW4_CtrlTiO2.CEL.gz
## Reading in : GSE158620/GSM4804475_G327_B05_05_TW5_TiO2.CEL.gz
## Reading in : GSE158620/GSM4804476_G327_B07_07_TW7_TiO2.CEL.gz
## Reading in : GSE158620/GSM4804477_G327_B09_08_TW8_TiO2.CEL.gz
## Reading in : GSE158620/GSM4804478_G327_C05_09_Chip1_CtrlTiO2.CEL.gz
## Reading in : GSE158620/GSM4804479_G327_C07_10_Chip2_CtrlTiO2.CEL.gz
## Reading in : GSE158620/GSM4804480_G327_C09_12_Chip4_CtrlTiO2_3.CEL.gz
## Reading in : GSE158620/GSM4804481_G327_D05_13_Chip5_TiO2.CEL.gz
## Reading in : GSE158620/GSM4804482_G327_D07_14_Chip6_TiO2.CEL.gz
## Reading in : GSE158620/GSM4804483_G327_D09_15_Chip7_TiO2_3.CEL.gz
## Reading in : GSE158620/GSM4804484_G327_E05_17_TW9_CtrlZnO.CEL.gz
## Reading in : GSE158620/GSM4804485_G327_E07_18_TW10_CtrlZnO.CEL.gz
## Reading in : GSE158620/GSM4804486_G327_E09_19_TW11_CtrlZnO.CEL.gz
## Reading in : GSE158620/GSM4804487_G327_F05_21_TW13_ZnO.CEL.gz
## Reading in : GSE158620/GSM4804488_G327_F07_22_TW14_ZnO.CEL.gz
## Reading in : GSE158620/GSM4804489_G327_F09_24_TW16_ZnO.CEL.gz
## Reading in : GSE158620/GSM4804490_G327_G05_25_Chip9_CtrlZnO.CEL.gz
## Reading in : GSE158620/GSM4804491_G327_G07_27_Chip11_CtrlZnO.CEL.gz
## Reading in : GSE158620/GSM4804492_G327_G09_28_Chip12_CtrlZnO.CEL.gz
## Reading in : GSE158620/GSM4804493_G327_H05_30_Chip14_ZnO.CEL.gz
## Reading in : GSE158620/GSM4804494_G327_H07_31_Chip15_ZnO.CEL.gz
## Reading in : GSE158620/GSM4804495_G327_H09_32_Chip16_ZnO.CEL.gz
## Reading in : GSE158620/GSM4804496_G328_A05_01_TW1_CtrlTiO2.CEL.gz
## Reading in : GSE158620/GSM4804497_G328_A09_11_Chip3_CtrlTiO2.CEL.gz
## Reading in : GSE158620/GSM4804498_G328_B05_16_Chip8_TiO2.CEL.gz
## Reading in : GSE158620/GSM4804499_G328_B07_20_TW12_CtrlZnO.CEL.gz
## Reading in : GSE158620/GSM4804500_G328_B09_23_TW15_ZnO.CEL.gz
## Reading in : GSE158620/GSM4804501_G328_C05_26_Chip10_CtrlZnO.CEL.gz
## Reading in : GSE158620/GSM4804502_G328_C07_29_Chip16_1_ZnO.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE173729 
## 
## 
## Identified a single GEO metadata file (GSEMatrix)
##  for the required platform entry: GPL17692 .
## This metadata file is: GSE173729_series_matrix.txt.gz 
## 
## Reading in : GSE173729/GSM5277496_G336_A05_01_3_CacoStat3.CEL.gz
## Reading in : GSE173729/GSM5277497_G336_A07_02_5_CacoSWMS2.CEL.gz
## Reading in : GSE173729/GSM5277498_G336_A09_03_1_CacoStat1.CEL.gz
## Reading in : GSE173729/GSM5277499_G336_B05_04_5_CacoSWMS2.CEL.gz
## Reading in : GSE173729/GSM5277500_G336_B07_05_2_CacoStat2.CEL.gz
## Reading in : GSE173729/GSM5277501_G336_B09_06_4_CacoSWMS1.CEL.gz
## Reading in : GSE173729/GSM5277502_G336_C05_07_9_HT29MTXStat3.CEL.gz
## Reading in : GSE173729/GSM5277503_G336_C07_08_12_HT29MTXSWMS3.CEL.gz
## Reading in : GSE173729/GSM5277504_G336_C09_09_8_HT29MTXStat2.CEL.gz
## Reading in : GSE173729/GSM5277505_G336_D05_10_11_HT29MTXSWMS2.CEL.gz
## Reading in : GSE173729/GSM5277506_G336_D07_11_7_HT29MTXStat1.CEL.gz
## Reading in : GSE173729/GSM5277507_G336_D09_12_11_HT29MTXSWMS2_2.CEL.gz
## Reading in : GSE173729/GSM5277508_G336_E05_13_15_LS174T3.CEL.gz
## Reading in : GSE173729/GSM5277509_G336_E07_14_14_LS174T2.CEL.gz
## Reading in : GSE173729/GSM5277510_G336_E09_15_15_LS174T3.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE17625 
## 
## 
## Identified a single GEO metadata file (GSEMatrix)
##  for the required platform entry: GPL570 .
## This metadata file is: GSE17625_series_matrix.txt.gz 
## 
## Reading in : GSE17625/GSM440049.CEL.gz
## Reading in : GSE17625/GSM440050.CEL.gz
## Reading in : GSE17625/GSM440051.CEL.gz
## Reading in : GSE17625/GSM440052.CEL.gz
## Reading in : GSE17625/GSM440053.CEL.gz
## Reading in : GSE17625/GSM440054.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE21976 
## 
## 
## Identified a single GEO metadata file (GSEMatrix)
##  for the required platform entry: GPL7020 .
## This metadata file is: GSE21976_series_matrix.txt.gz 
## 
## Reading in : GSE21976/GSM546502.CEL.gz
## Reading in : GSE21976/GSM546503.CEL.gz
## Reading in : GSE21976/GSM546504.CEL.gz
## Reading in : GSE21976/GSM546505.CEL.gz
## Reading in : GSE21976/GSM546506.CEL.gz
## Reading in : GSE21976/GSM546507.CEL.gz
## Reading in : GSE21976/GSM546508.CEL.gz
## Reading in : GSE21976/GSM546509.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE30292 
## 
## 
## Identified a single GEO metadata file (GSEMatrix)
##  for the required platform entry: GPL570 .
## This metadata file is: GSE30292_series_matrix.txt.gz 
## 
## 
## Data serie includes user-listed CEL file(s) that are removed: GSE30292/GSM750879_01_E3.CEL.gz 
## Reading in : GSE30292/GSM750842_15_Caco2.CEL.gz
## Reading in : GSE30292/GSM750843_16_Caco2.CEL.gz
## Reading in : GSE30292/GSM750844_17_Caco2.CEL.gz
## Reading in : GSE30292/GSM750845_18_HT29.CEL.gz
## Reading in : GSE30292/GSM750846_19_HT29.CEL.gz
## Reading in : GSE30292/GSM750847_20_HT29.CEL.gz
## Reading in : GSE30292/GSM750848_21_T84.CEL.gz
## Reading in : GSE30292/GSM750849_22_T84.CEL.gz
## Reading in : GSE30292/GSM750850_23_T84.CEL.gz
## Reading in : GSE30292/GSM750851_24_SW480.CEL.gz
## Reading in : GSE30292/GSM750852_25_SW480.CEL.gz
## Reading in : GSE30292/GSM750853_26_SW480.CEL.gz
## Reading in : GSE30292/GSM750854_27_LS174T.CEL.gz
## Reading in : GSE30292/GSM750855_28_LS174T.CEL.gz
## Reading in : GSE30292/GSM750856_29_LS174T.CEL.gz
## Reading in : GSE30292/GSM750857_63_cacoF.CEL.gz
## Reading in : GSE30292/GSM750858_64_cacoF.CEL.gz
## Reading in : GSE30292/GSM750859_65_cacoF.CEL.gz
## Reading in : GSE30292/GSM750860_66_cacoJU.CEL.gz
## Reading in : GSE30292/GSM750861_67_cacoJD.CEL.gz
## Reading in : GSE30292/GSM750862_68_cacoB.CEL.gz
## Reading in : GSE30292/GSM750863_69_cacoB.CEL.gz
## Reading in : GSE30292/GSM750864_70_cacoB.CEL.gz
## Reading in : GSE30292/GSM750865_71_cacoR.CEL.gz
## Reading in : GSE30292/GSM750866_72_cacoR.CEL.gz
## Reading in : GSE30292/GSM750867_73_cacoR.CEL.gz
## Reading in : GSE30292/GSM750868_N01_cacoCDX1.CEL.gz
## Reading in : GSE30292/GSM750869_N02_cacoCDX2.CEL.gz
## Reading in : GSE30292/GSM750870_N03_cacoGFP.CEL.gz
## Reading in : GSE30292/GSM750880_02_E8.CEL.gz
## Reading in : GSE30292/GSM750881_03_E9.CEL.gz
## Reading in : GSE30292/GSM750882_04_N3.CEL.gz
## Reading in : GSE30292/GSM750883_05_N8.CEL.gz
## Reading in : GSE30292/GSM750884_06_N14.CEL.gz
## Reading in : GSE30292/GSM750885_07_CAF8.CEL.gz
## Reading in : GSE30292/GSM750886_08_CAF14.CEL.gz
## Reading in : GSE30292/GSM750887_09_CAF17.CEL.gz
## Reading in : GSE30292/GSM750888_10_IL_Biopredict.CEL.gz
## Reading in : GSE30292/GSM750889_11_IL_PAT31.CEL.gz
## Reading in : GSE30292/GSM750890_12_IL_Spain.CEL.gz
## Reading in : GSE30292/GSM750891_13_JE_PAT25.CEL.gz
## Reading in : GSE30292/GSM750892_14_JE_PAT26.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE30364 
## 
## 
## Identified a single GEO metadata file (GSEMatrix)
##  for the required platform entry: GPL570 .
## This metadata file is: GSE30364_series_matrix.txt.gz 
## 
## Reading in : GSE30364/GSM753507_Caco2-2.CEL.gz
## Reading in : GSE30364/GSM753508_Caco2-1.CEL.gz
## Reading in : GSE30364/GSM753509_Caco2-3.CEL.gz
## Reading in : GSE30364/GSM753510_Caco2-IL8-2.CEL.gz
## Reading in : GSE30364/GSM753511_Caco2-IL8-3.CEL.gz
## Reading in : GSE30364/GSM753512_Caco2-IL8-1.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE65790 
## 
## 
## Identified a single GEO metadata file (GSEMatrix)
##  for the required platform entry: GPL6244 .
## This metadata file is: GSE65790_series_matrix.txt.gz 
## 
## Reading in : GSE65790/GSM1606385_Transwell-1.CEL.gz
## Reading in : GSE65790/GSM1606386_Transwell-2.CEL.gz
## Reading in : GSE65790/GSM1606387_Gut_Chip-1.CEL.gz
## Reading in : GSE65790/GSM1606388_Gut_Chip-2.CEL.gz
## Reading in : GSE65790/GSM1606389_Gut_Chip_+_VSL_3-1.CEL.gz
## Reading in : GSE65790/GSM1606390_Gut_Chip_+_VSL_3-2.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE7259 
## 
## 
## Identified a single GEO metadata file (GSEMatrix)
##  for the required platform entry: GPL571 .
## This metadata file is: GSE7259_series_matrix.txt.gz 
## 
## Reading in : GSE7259/GSM174819.CEL.gz
## Reading in : GSE7259/GSM174944.CEL.gz
## Reading in : GSE7259/GSM174945.CEL.gz
## Reading in : GSE7259/GSM174946.CEL.gz
## Reading in : GSE7259/GSM174947.CEL.gz
## Reading in : GSE7259/GSM174948.CEL.gz
## Reading in : GSE7259/GSM174949.CEL.gz
## Reading in : GSE7259/GSM174950.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE79383 
## 
## 
## GEO data serie includes file(s) with pattern: ' miRNA ' that are removed
## before processing:
##  GSM2094175_383_PS_RNA_87_miRNA-4_0_.CEL.gz GSM2094176_379_PS_RNA_88_miRNA-4_0_.CEL.gz GSM2094177_385_PS_RNA_94_miRNA-4_0_.CEL.gz GSM2094178_382_PS_RNA_79_miRNA-4_0_.CEL.gz GSM2094179_386_PS_RNA_91_miRNA-4_0_.CEL.gz GSM2094180_380_PS_RNA_108_miRNA-4_0_.CEL.gz GSM2094181_388_PS_RNA_12_miRNA-4_0_.CEL.gz GSM2094182_389_PS_RNA_13_miRNA-4_0_.CEL.gz GSM2094183_378_PS_RNA_15_miRNA-4_0_.CEL.gz 
## 
## 
## 
## Identified one or more GEO metadata files (GSEMatrix)
##  for non-relevant platform entries!
## 
## Removing the non-relevant GSEMatrix from the download:
##  GSE79383-GPL19117_series_matrix.txt.gz 
## because only GEO metadata for GEO platform GPL17586 should be retained.
## Retained metadata file is: GSE79383-GPL17586_series_matrix.txt.gz 
## 
## Reading in : GSE79383/GSM2094166_1006.PS_RNA_87_HTA-2_0_.CEL.gz
## Reading in : GSE79383/GSM2094167_1003.PS_RNA_88_HTA-2_0_.CEL.gz
## Reading in : GSE79383/GSM2094168_1007.PS_RNA_94_HTA-2_0_.CEL.gz
## Reading in : GSE79383/GSM2094169_999.PS_RNA_79_HTA-2_0_.CEL.gz
## Reading in : GSE79383/GSM2094170_1005.PS_RNA_91_HTA-2_0_.CEL.gz
## Reading in : GSE79383/GSM2094171_998.PS_RNA_108_HTA-2_0_.CEL.gz
## Reading in : GSE79383/GSM2094172_997.PS_RNA_12_HTA-2_0_.CEL.gz
## Reading in : GSE79383/GSM2094173_1008.PS_RNA_13_HTA-2_0_.CEL.gz
## Reading in : GSE79383/GSM2094174_1004.PS_RNA_15_HTA-2_0_.CEL.gz
## Background correcting
## Calculating Expression
## Extracting TAR archive containing raw data for series GSE81867 
## 
## 
## GEO data serie includes file(s) with pattern: ' miRNA ' that are removed
## before processing:
##  GSM2176813_mi_chip_1_cells_miRNA-4_0_.CEL.gz GSM2176813_mi_chip_1_cells_miRNA-4_0_.rma-dabg.chp.gz GSM2176814_mi_chip_2_cells_miRNA-4_0_.CEL.gz GSM2176814_mi_chip_2_cells_miRNA-4_0_.rma-dabg.chp.gz GSM2176815_mi_chip_3_cells_miRNA-4_0_.CEL.gz GSM2176815_mi_chip_3_cells_miRNA-4_0_.rma-dabg.chp.gz GSM2176816_mi_static_4_cells_miRNA-4_0_.CEL.gz GSM2176816_mi_static_4_cells_miRNA-4_0_.rma-dabg.chp.gz GSM2176817_mi_static_5_cells_miRNA-4_0_.CEL.gz GSM2176817_mi_static_5_cells_miRNA-4_0_.rma-dabg.chp.gz GSM2176818_mi_static_6_cells_miRNA-4_0_.CEL.gz GSM2176818_mi_static_6_cells_miRNA-4_0_.rma-dabg.chp.gz 
## 
## 
## 
## Identified one or more GEO metadata files (GSEMatrix)
##  for non-relevant platform entries!
## 
## Removing the non-relevant GSEMatrix from the download:
##  GSE81867-GPL21572_series_matrix.txt.gz 
## because only GEO metadata for GEO platform GPL6244 should be retained.
## Retained metadata file is: GSE81867-GPL6244_series_matrix.txt.gz 
## 
## Reading in : GSE81867/GSM2176819_m_Chip_1_cells.CEL.gz
## Reading in : GSE81867/GSM2176820_m_Chip_2_cells.CEL.gz
## Reading in : GSE81867/GSM2176821_m_Chip_3_cells.CEL.gz
## Reading in : GSE81867/GSM2176822_m_Transwell_4_cells.CEL.gz
## Reading in : GSE81867/GSM2176823_m_Transwell_5_cells.CEL.gz
## Reading in : GSE81867/GSM2176824_m_Transwell_6_cells.CEL.gz
## Background correcting
## Calculating Expression

Part 2 - Import data from all experiments, filter probesets/genes and samples, combine into a single ExpressionSet, and for each sample perform rank percentile transformation.

Import expression data from all downloaded experiments in R session

all.Rdata.files <- dir(pattern="Rdata", recursive = TRUE)

for (i in seq_along(all.Rdata.files)) {load(all.Rdata.files[i])}

Filter probesets: for each dataset keep only probesets that measure the expression of a gene (entrezids). In case genes are represented multiple times, keep the probeset with highest expression level.

input.datasetnames <- paste("data.bgexpr", GEO.datasets[,"GSE.no."], sep=".")
filtered.datasetnames <- paste(GEO.datasets[,"GSE.no."], 'filtered', sep=".")

for(i in seq_along(filtered.datasetnames)){assign(filtered.datasetnames[i], filterDataset(get(input.datasetnames[i]), ctrls="AFFX"))}
## 
## Array design: pd.hugene.1.1.st.v1
## Number of features in dataset: 33297 
## 
## Removed 0 control features.
## 
## Continued with 33297 features.
## 
## Of these 33297 features, 22039 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 2035 duplicate features (based on ENTREZID).
## 
## Kept 20004 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.hugene.2.1.st
## Number of features in dataset: 53617 
## 
## Removed 0 control features.
## 
## Continued with 53617 features.
## 
## Of these 53617 features, 30923 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 3501 duplicate features (based on ENTREZID).
## 
## Kept 27422 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.hg.u133.plus.2
## Number of features in dataset: 54675 
## 
## Removed 62 control features.
## 
## Continued with 54613 features.
## 
## Of these 54613 features, 44690 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 23289 duplicate features (based on ENTREZID).
## 
## Kept 21401 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.hugene.2.1.st
## Number of features in dataset: 53617 
## 
## Removed 0 control features.
## 
## Continued with 53617 features.
## 
## Of these 53617 features, 30923 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 3501 duplicate features (based on ENTREZID).
## 
## Kept 27422 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.hugene.2.1.st
## Number of features in dataset: 53617 
## 
## Removed 0 control features.
## 
## Continued with 53617 features.
## 
## Of these 53617 features, 30923 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 3501 duplicate features (based on ENTREZID).
## 
## Kept 27422 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.hg.u133.plus.2
## Number of features in dataset: 54675 
## 
## Removed 62 control features.
## 
## Continued with 54613 features.
## 
## Of these 54613 features, 44690 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 23289 duplicate features (based on ENTREZID).
## 
## Kept 21401 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.nugo.hs1a520180
## Number of features in dataset: 23941 
## 
## Removed 71 control features.
## 
## Continued with 23870 features.
## 
## Of these 23870 features, 19019 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 2305 duplicate features (based on ENTREZID).
## 
## Kept 16714 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.hg.u133.plus.2
## Number of features in dataset: 54675 
## 
## Removed 62 control features.
## 
## Continued with 54613 features.
## 
## Of these 54613 features, 44690 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 23289 duplicate features (based on ENTREZID).
## 
## Kept 21401 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.hg.u133.plus.2
## Number of features in dataset: 54675 
## 
## Removed 62 control features.
## 
## Continued with 54613 features.
## 
## Of these 54613 features, 44690 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 23289 duplicate features (based on ENTREZID).
## 
## Kept 21401 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.hugene.1.0.st.v1
## Number of features in dataset: 33297 
## 
## Removed 0 control features.
## 
## Continued with 33297 features.
## 
## Of these 33297 features, 22039 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 2035 duplicate features (based on ENTREZID).
## 
## Kept 20004 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.hg.u133a.2
## Number of features in dataset: 22277 
## 
## Removed 62 control features.
## 
## Continued with 22215 features.
## 
## Of these 22215 features, 21046 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 8004 duplicate features (based on ENTREZID).
## 
## Kept 13042 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.hta.2.0
## Number of features in dataset: 70523 
## 
## Removed 31 control features.
## 
## Continued with 70492 features.
## 
## Of these 70492 features, 30293 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 6363 duplicate features (based on ENTREZID).
## 
## Kept 23930 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.
## 
## Array design: pd.hugene.1.0.st.v1
## Number of features in dataset: 33297 
## 
## Removed 0 control features.
## 
## Continued with 33297 features.
## 
## Of these 33297 features, 22039 features have been annotated with an ENTREZID.
## 
## Sorted expression data from high to low average signal.
## 
## Removed 2035 duplicate features (based on ENTREZID).
## 
## Kept 20004 unique features.
## 
## Renamed features to ENTREZID.
## 
## Set annotation slot to 'none'.

Combine the filtered datasets, keeping only the genes that are common (intersect) in all datasets.

common.genes <- intersectGenes(filtered.datasetnames)
combined.datasets <- combineDataset(datasets=filtered.datasetnames, common.features=common.genes)
## Processed dataset 1 : GSE115022.filtered 
## Processed dataset 2 : GSE156269.filtered 
## Processed dataset 3 : GSE15636.filtered 
## Processed dataset 4 : GSE158620.filtered 
## Processed dataset 5 : GSE173729.filtered 
## Processed dataset 6 : GSE17625.filtered 
## Processed dataset 7 : GSE21976.filtered 
## Processed dataset 8 : GSE30292.filtered 
## Processed dataset 9 : GSE30364.filtered 
## Processed dataset 10 : GSE65790.filtered 
## Processed dataset 11 : GSE7259.filtered 
## Processed dataset 12 : GSE79383.filtered 
## Processed dataset 13 : GSE81867.filtered 
## 
## Removed 0 duplicate samples (based on GSM identifier).
## 
## Kept 188 unique samples.
## 
## Included 11203 features.

Keep only the samples (100) that are listed in the Excel file, since some datasets contain more samples than those of interest.

sampleNames(combined.datasets) <- sub(".CEL", "", sampleNames(combined.datasets))
combined.dataset <- combined.datasets[, as.character(exp.info[,"GSM.no."]) ]

Add all metadata to the pData slot of the ExpressionSet.

# Alphabetically order levels of GSEid and Treatment (for nice plotting).
m1 <- match( rownames(pData(combined.dataset)), exp.info$GSM.no.)
pData(combined.dataset)$"Title Dataset" <-  exp.info[m1, "Dataset.title"]
pData(combined.dataset)$GSEid <-  factor( exp.info[m1, "GSE.no."], levels= str_sort( unique(exp.info[m1, "GSE.no."] ), numeric = TRUE) )
pData(combined.dataset)$GSMid <-  exp.info[m1, "GSM.no."]
pData(combined.dataset)$PlatformGEO <-  exp.info[m1, "Platform"]
pData(combined.dataset)$Platform <- factor( exp.info[m1, "Platform.abbr."], levels= c("hgu133a2","hgu133plus2","nugohs1a520180","hugene10","hugene11", "hugene21","hta20") )
pData(combined.dataset)$Device <-  factor( exp.info[m1, "Device"] , levels = c("Insert", "Chip") )
pData(combined.dataset)$"Cell System"<-  factor( exp.info[m1, "Cell.cystem..Caco.2.or.co.culture."] , levels = c("Caco-2", "Co-culture") )
pData(combined.dataset)$Flow <-  factor( exp.info[m1, "Flow..static..dynamic.or.part..Dynamic."] )
pData(combined.dataset)$Oxygen <-  factor( exp.info[m1, "Oxygen..Oxic.or.Anoxic.apical."] , levels = c("Oxic", "Anoxic apical") )
# Note that NR gives 2x NA in numeric!!
pData(combined.dataset)$"Culture Time Continue" <-  as.numeric( exp.info[m1, "Total.culture.time"] )
pData(combined.dataset)$"Culture Time" <-  factor( exp.info[m1, "Time..cateogorical."] , levels=c("Short","Medium", "Long", "NR") )
# Manually set order of treatments
pData(combined.dataset)$Microbiome <-  factor( exp.info[m1, "Microbiome"] , levels= c("Control", "BC1", "B1, strain A","B1, strain B", "SN of B1, strain B", "B2","B3", "B4", "SN of B4","B5", "SN of B5", "B6", "B7", "B7 and B8")  )

Remove datasets from experiments that only consist of a single sample (if present). This turns out not to be the case.

table(pData(combined.dataset)$EXPERIMENT)  #check number of samples per experiment.
## 
## GSE115022 GSE156269  GSE15636 GSE158620 GSE173729  GSE17625  GSE21976  GSE30292 
##         8         8        17        16         6         6         8         3 
##  GSE30364  GSE65790   GSE7259  GSE79383  GSE81867 
##         3         6         4         9         6
combined.dataset.final <- removeSingletExperiment(combined.dataset)
## 
## Note: no experiments consisting of only a single sample (based on EXPERIMENT) are present!

Perform rank percentile transformation (rpt) of each sample; expression values for genes will then range between 0 and 1, and create new ExpressionSet. Save result.

rpt <- apply( exprs(combined.dataset.final), 2, function(x) dplyr::percent_rank(x) )
combined.dataset.rpt <- ExpressionSet(assayData=rpt, phenoData=phenoData(combined.dataset.final), featureData=featureData(combined.dataset.final), annotation=annotation(combined.dataset.final) )

save(combined.dataset.rpt, file="combined.dataset.rpt.100samples.Rdata")


# check object
combined.dataset.rpt
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 11203 features, 100 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM3163149 GSM3163141 ... GSM2176824 (100 total)
##   varLabels: title geo_accession ... Microbiome (83 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 2597 3312 ... 374286 (11203 total)
##   fvarLabels: ENTREZID SYMBOL GENENAME
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: none

Part 3 - Principal component analysis (PCA).

Perform PCA on the combined dataset as it is.

metadata <- pData(combined.dataset.rpt)
p.r <- PCAtools::pca(exprs(combined.dataset.rpt), metadata = metadata)

# Determine optimum number of PCs to retain
horn <- parallelPCA( exprs(combined.dataset.rpt) )
horn$n
## [1] 9
elbow <- findElbowPoint(p.r$variance)
elbow
## PC10 
##   10
# Screeplot
screeplot(p.r,
    title = "Scree plot",
    subtitle = "regular PCA",
    components = getComponents(p.r, 1:15),
    hline = 80,
    axisLabSize=10,
    vline = c(horn$n, elbow)) +
    geom_label(aes(x = horn$n - 1, y = 60,
    label = 'Horn\'s', vjust = -1, size = 8)) +
    geom_label(aes(x = elbow + 2, y = 35,
    label = 'Elbow method', vjust = -1, size = 8)
    )

ggsave(filename = "screeplot.not.corrected.PCA.pdf", units="in", width=8, height=8)

Explore Spearman’s rank correlation coefficients of metadata variables with the principal components in a so-called eigencorplot. Such plot highlights the main (known) sources of variation in a dataset.

First plot Spearman’s rank correlation coefficients, or Spearman’s rho, between each metadata variable and principle component. Ten principle components are selected because this is the largest number of components identified by Horn’s and the Elbow method.

eigenplot.p.r <- eigencorplot(p.r,
    components = getComponents(p.r, 1:10), #10 = largest number from horn$n vs elbow
    metavars = c('GSEid', 'Platform', 'Device', 'Cell System', 'Culture Time', 'Flow', 'Oxygen', 'Microbiome'),     
    col=c("red", "white", "royalblue"),
    cexCorval = 0.7,
    colCorval = 'white',
    fontCorval = 2,
    rotLabX = 45,
    scale = TRUE,
    main = bquote(atop(Spearman ~ rho ~ between ~ each ~ metadata ~ variable ~ and ~ principle ~ component, uncorrected ~ dataset)),
    cexMain = 1.25,
    plotRsquared = FALSE,
    corFUN = 'spearman',
    corMultipleTestCorrection = 'BH',
    signifSymbols = c('****', '***', '**', '*', ''),
    signifCutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1)
    )

print(eigenplot.p.r)

pdf(file="spearman.rho.not.corrected.PCA.pdf", width=8, height=8)
print(eigenplot.p.r)
dev.off()
## png 
##   2

Also plot squared Spearman’s Rho between each metadata variable and principle component. The squared Spearman’s Rho represents the proportion of shared variance explained by each variable and principle component.

eigenplot.p.r.squared <- eigencorplot(p.r,
    components = getComponents(p.r, 1:10), #10 = largest number horn$n vs elbow
    metavars = c('GSEid', 'Platform', 'Device', 'Cell System', 'Culture Time', 'Flow', 'Oxygen', 'Microbiome'),     
    col=(c("lightyellow", "darkgreen") ),
    cexCorval = 0.7,
    fontCorval = 2,
    rotLabX = 45,
    scale = TRUE,
    main = bquote(atop(Spearman ~ rho^2 ~ between ~ each ~ metadata ~ variable ~ and ~ principle ~ component, uncorrected ~ dataset)),
    cexMain = 1.25,
    plotRsquared = TRUE,
    corFUN = 'spearman',
    corMultipleTestCorrection = 'BH',
    signifSymbols = c('****', '***', '**', '*', ''),
    signifCutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1)
    )

print(eigenplot.p.r.squared)

pdf(file="spearman.rho.squared.not.corrected.PCA.pdf", width=8, height=8)
print(eigenplot.p.r.squared)
dev.off()
## png 
##   2

Perform multilevel PCA, blocking on the variable Platform.

The eigencorplot shows that the (technical) variable Platform contributes second most variance to the combined dataset. This variable Platform represents technical rather than biological variance, so it will be corrected for by decomposing the variance for Platform in the expression matrix, analogous to multilevel PCA.

platform  <- as.data.frame(pData(combined.dataset.rpt)$Platform)
Xw <- t (withinVariation(X = t(exprs(combined.dataset.rpt)), design = platform ) )
combined.dataset.rpt.platform.corrected <- ExpressionSet(assayData=Xw, phenoData=phenoData(combined.dataset.rpt),featureData=featureData(combined.dataset.rpt), annotation=annotation(combined.dataset.rpt) )
metadata <- pData(combined.dataset.rpt.platform.corrected)
p.ml <- PCAtools::pca(exprs(combined.dataset.rpt.platform.corrected), metadata = metadata)

# Determine optimum number of PCs to retain
horn <- parallelPCA( exprs(combined.dataset.rpt.platform.corrected) )
horn$n
## [1] 13
elbow <- findElbowPoint(p.ml$variance)
elbow
## PC11 
##   11
# Screeplot
screeplot(p.ml,
    title = "Scree plot",
    subtitle = "PCA, corrected for platform",
    components = getComponents(p.ml, 1:15),
    hline = 80,
    axisLabSize=10,
    vline = c(horn$n, elbow)) +
    geom_label(aes(x = horn$n - 1, y = 60,
    label = 'Horn\'s', vjust = -1, size = 8)) +
    geom_label(aes(x = elbow + 2, y = 35,
    label = 'Elbow method', vjust = -1, size = 8)
    )

ggsave(filename = "screeplot.platform.corrected.PCA.pdf", units="in", width=8, height=8)

Also for the multilevel PCA, explore Spearman’s rho between each metadata variable and principle component in an eigencroplot. Thirteen principle components are selected because this is the largest number of components identified by Horn’s and the Elbow method.

eigenplot.p.ml <- eigencorplot(p.ml,
    components = getComponents(p.ml, 1:13), #13 = largest number horn$n vs elbow
    metavars = c('GSEid', 'Platform', 'Device', 'Cell System', 'Culture Time', 'Flow', 'Oxygen', 'Microbiome'),     
    col=c("red", "white", "royalblue"),
    colCorval="black",
    cexCorval = 0.55,
    fontCorval = 2,
    rotLabX = 45,
    scale = TRUE,
    main = bquote(atop(Spearman ~ rho ~ between ~ each ~ metadata ~ variable ~ and ~ principle ~ component, corrected ~ dataset)),
    cexMain = 1.25,
    plotRsquared = FALSE,
    corFUN = 'spearman',
    corMultipleTestCorrection = 'BH',
    signifSymbols = c('****', '***', '**', '*', ''),
    signifCutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1)
    )

print(eigenplot.p.ml )

pdf(file="spearman.rho.platform.corrected.PCA.pdf", width=8, height=8)
print(eigenplot.p.ml)
dev.off()
## png 
##   2

Also for the multilevel PCA, plot the squared Spearman’s Rho between each metadata variable and principle component.

eigenplot.p.ml.squared <- eigencorplot(p.ml,
    components = getComponents(p.ml, 1:13), #13 = largest number horn$n vs elbow
    metavars = c('GSEid', 'Platform', 'Device', 'Cell System', 'Culture Time', 'Flow', 'Oxygen', 'Microbiome'),     
    col=(c("lightyellow", "darkgreen") ),
    colCorval="black",
    cexCorval = 0.55,
    fontCorval = 2,
    rotLabX = 45,
    scale = TRUE,
    main = bquote(atop(Spearman ~ rho^2 ~ between ~ each ~ metadata ~ variable ~ and ~ principle ~ component, corrected ~ dataset)),
    cexMain = 1.25,
    plotRsquared = TRUE,
    corFUN = 'spearman',
    corMultipleTestCorrection = 'BH',
    signifSymbols = c('****', '***', '**', '*', ''),
    signifCutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1)
    )

print(eigenplot.p.ml.squared)

pdf(file="spearman.rho.squared.platform.corrected.PCA.pdf", width=8, height=8)
print(eigenplot.p.ml.squared)
dev.off()
## png 
##   2

Since decomposing the variance for Platform drastically reduces its impact on the dataset, continue only with results of the decomposed, multilevel PCA. Generate biplots in which all metadata is overlaid.

color.vector <- c("#2fddce", "#94ea5b", "#208eb7", "#c2287d", "#069668", "#c9dd87", "#4c3e76", "#beb1e7", "#154e56", "#77d1fd", "#1932bf", "#f89ade", "#21a708", "#d3d3d3")

color.by <- c("GSEid", "Platform", "Device", "Cell System", "Flow", "Oxygen", "Culture Time", "Microbiome")
all.biplots.all.genes <- run.all.biplots(p.ml, colby=color.by, colvec=color.vector)

ggsave(
   filename = "PCA.biplots.multilevel.pdf", 
   plot = marrangeGrob(all.biplots.all.genes, nrow=1, ncol=1, top=NULL),
   units="in", width = 8, height =8,
   device = "pdf"
   )

all.biplots.all.genes
## $plot.GSEid

## 
## $plot.Platform

## 
## $plot.Device

## 
## $`plot.Cell System`

## 
## $plot.Flow

## 
## $plot.Oxygen

## 
## $`plot.Culture Time`

## 
## $plot.Microbiome

Part 4 - Visualization in heatmaps.

For generation of the heatmap (and variance partition analysis), the rpt values are probit transformed to fit the normality assumption of a linear mixed model.

Perform probit transformation of each sample, create new ExpressionSet, and store.

Note that the value zero (0) cannot be transformed. Therefore, zeros are replaced by a b-value, equalling 0.5 * smallest non-zero value.

smallest.nonzero <- min( exprs(combined.dataset.rpt) [exprs(combined.dataset.rpt) > 0] )
smallest.nonzero
## [1] 8.926977e-05
bvalue <- 0.5*smallest.nonzero
bvalue
## [1] 4.463489e-05
probit.tr <- apply( exprs(combined.dataset.rpt), 2, function(x) VGAM::probitlink(x, bvalue = bvalue) )
combined.dataset.probit <- ExpressionSet(assayData=probit.tr, phenoData=phenoData(combined.dataset.rpt), featureData=featureData(combined.dataset.rpt), annotation=annotation(combined.dataset.rpt) )

save(combined.dataset.probit, file="combined.dataset.probit.100samples.Rdata")

Create heatmap of 10% most variable genes, for the non-corrected expression data.

# An easy way of finding the 10% most variable genes is to make use of PCATools.
p.r <- PCAtools::pca( exprs(combined.dataset.probit), removeVar = 0.90)
selected.genes <- rownames(p.r$loadings)
expr.selected <- exprs(combined.dataset.probit)[selected.genes,]

# Factors to show in legend of heatmap
Platform  <-pData(combined.dataset.probit)[,"Platform"]
GSEid <- pData(combined.dataset.probit)[,"GSEid"]
Device <- pData(combined.dataset.probit)[,"Device"]
CellSystem <- pData(combined.dataset.probit)[,"Cell System"]
Flow <- pData(combined.dataset.probit)[,"Flow"]
Oxygen <- pData(combined.dataset.probit)[,"Oxygen"]
CultureTime <- pData(combined.dataset.probit)[,"Culture Time"]
Microbiome <- pData(combined.dataset.probit)[,"Microbiome"]

# Create legend
legend.df <- data.frame(Platform = Platform,
                            GSEid=GSEid,
                            Device=Device,
                            CellSystem=CellSystem,
                            Flow=Flow,
                            Oxygen=Oxygen,
                            CultureTime=CultureTime,
                            Microbiome=Microbiome)

colnames(legend.df)[4] <- "Cell System"
colnames(legend.df)[7] <- "Culture Time"

rownames(legend.df) <- colnames(expr.selected)

# define and set legend colors
color.vector <- c("#2fddce", "#94ea5b", "#208eb7", "#c2287d", "#069668", "#c9dd87", "#4c3e76", "#beb1e7", "#154e56", "#77d1fd", "#1932bf", "#f89ade", "#21a708", "#d3d3d3")

c.platform <- color.vector[ 1:length( levels(Platform) ) ]; names(c.platform) <- levels(Platform)
c.gseid <- color.vector[ 1:length( levels(GSEid) ) ]; names(c.gseid) <- levels(GSEid)
c.device <- color.vector[ 1:length( levels(Device) ) ]; names(c.device) <- levels(Device)
c.cellsystem <- color.vector[ 1:length( levels(CellSystem) ) ]; names(c.cellsystem) <- levels(CellSystem)
c.flow <- color.vector[ 1:length( levels(Flow) ) ]; names(c.flow) <- levels(Flow)
c.oxygen <- color.vector[ 1:length( levels(Oxygen) ) ]; names(c.oxygen) <- levels(Oxygen)
c.culturetime <- color.vector[ 1:length( levels(CultureTime) ) ]; names(c.culturetime) <- levels(CultureTime)
c.microbiome <- color.vector[ 1:length( levels(Microbiome) ) ]; names(c.microbiome) <- levels(Microbiome)

color.list <- list("Platform" = c.platform, "GSEid" = c.gseid, "Device" = c.device, "Cell System" = c.cellsystem, "Flow"=c.flow, "Oxygen"=c.oxygen, "Culture Time"=c.culturetime, "Microbiome"=c.microbiome)


# change default colors used for heatmap to use blue-white-red scale
clrs.heatmap <- colorRampPalette(c("#0070c0", "#ffffff", "#ff0000"))(200)   

p <- pheatmap(expr.selected,
  filename="Heatmap.10perc.most.variable.genes.non.corrected.pdf",
  main = c("Heatmap of 10% most variable genes, non-corrected data"),
  color=clrs.heatmap,
  scale="row",
  annotation_col = legend.df,
  annotation_colors = color.list,
  cellheight= NA,
  fontsize=8,
  fontsize_row=3,
  fontsize_col=3,
  border_color=NA,
  cluster_rows=TRUE,
  cluster_cols=TRUE,
  clustering_distance_rows = "euclidean",
  clustering_distance_cols = "euclidean",
  treeheight_row = 20,
  treeheight_col = 20, 
  show_rownames =FALSE,
  show_colnames=FALSE,
  angle_col=45,
  fontsize_number=0.8,
  width = 8.27,
  height = 11.69
  )

x11(); p

Do same for Platform-corrected expression data.

platform  <- as.data.frame(pData(combined.dataset.probit)$Platform)
Xw <- t (withinVariation(X = t(exprs(combined.dataset.probit)), design = platform ) )
combined.dataset.probit.platform.corrected <- ExpressionSet(assayData=Xw, phenoData=phenoData(combined.dataset.probit), featureData=featureData(combined.dataset.probit), annotation=annotation(combined.dataset.probit) )

p.ml <- PCAtools::pca( exprs(combined.dataset.probit.platform.corrected), removeVar = 0.90)
selected.genes <- rownames(p.ml$loadings)
expr.selected.ml <- exprs(combined.dataset.probit.platform.corrected)[selected.genes,]

p <- pheatmap(expr.selected.ml,
  filename="Heatmap.10perc.most.variable.genes.platform.corrected.pdf",
  main = c("Heatmap of 10% most variable genes, platform-corrected data"),
  color=clrs.heatmap,
  scale="row",
  annotation_col = legend.df,
  annotation_colors = color.list,
  cellheight= NA,
  fontsize=8,
  fontsize_row=3,
  fontsize_col=3,
  border_color=NA,
  cluster_rows=TRUE,
  cluster_cols=TRUE,
  clustering_distance_rows = "euclidean",
  clustering_distance_cols = "euclidean",
  treeheight_row = 20,
  treeheight_col = 20, 
  show_rownames =FALSE,
  show_colnames=FALSE,
  angle_col=45,
  fontsize_number=0.8,
  width = 8.27,
  height = 11.69
  )

x11(); p

Part 5 - variance partition analysis.

For variance partition analysis, the input is probit transformed rpt values in order to fit the normality assumption of a linear mixed model.

Perform variance partition analysis for non-corrected, probit-transformed data.

# Initialize parallel back-end with 6 cores
register(SnowParam(6))

# extract probit-transformed expression data
expr.data <- exprs(combined.dataset.probit)

# make colnames pData confirm to R-standards, otherwise variancePartition functions will fail
pheno.data <- pData(combined.dataset.probit)
colnames(pheno.data) <- make.names(colnames(pheno.data))

# remove 12th column (= "Culture.Time.Continue"), since only categorical variables will be considered
pheno.data <- pheno.data[, -12]

# Specify variables to consider
form <- ~ (1|GSEid) + (1|Platform) + (1|Device) + (1|Culture.Time) + (1|Microbiome) + (1|Cell.System) + (1|Flow)+(1|Oxygen)

# Run function and save output
varPart <- fitExtractVarPartModel(exprObj=expr.data, formula=form, data=pheno.data )
save(varPart, file="varPart.results.probit.non.corrected.RData")

Do the same for the Platform-corrected, probit-transformed data. Note that pheno.data remains the same.

expr.data.ml <- exprs(combined.dataset.probit.platform.corrected)

# Run function and save output
varPart.ml <- fitExtractVarPartModel(exprObj=expr.data.ml, formula=form, data=pheno.data )
save(varPart.ml, file="varPart.results.probit.platform.corrected.RData")

Export results from variancePartition analysis.

# Process row- and column names for nice visualization
colnames(varPart) <- trimws( gsub("\\.|Cat", " ", colnames(varPart) ) )
colnames(varPart.ml) <- trimws( gsub("\\.|Cat", " ", colnames(varPart.ml) ) )

colnames(pheno.data) <- trimws( gsub("\\.|Cat", " ", colnames(pheno.data) ) )

# export results as Excel file
anno.result <- AnnotationDbi:::select(org.Hs.eg.db, keys = rownames(varPart), keytype = "ENTREZID", columns = c("ENTREZID", "SYMBOL", "GENENAME") )
anno.result <- anno.result[!duplicated(anno.result[,1]),] # Get rid of duplicates; only keep 1st hit
rownames(anno.result) <- anno.result[,1]

write.xlsx(x=cbind(varPart, anno.result), file="explained.variances.all.genes.xlsx", sheetName="explained.variances", rowNames = TRUE)
write.xlsx(x=cbind(varPart.ml, anno.result), file="explained.variances.all.genes.platform.corrected.xlsx", sheetName="explained.variances", rowNames = TRUE)

Part 6 - Visualization and biological interpretation of results of variance partition analysis. Present output from non-corrected and platform-corrected dataset next to each other.

First make some quick comparisons between the non-corrected and platform-corrected datasets.

#For plots, use gene symbols as names of features
rownames(varPart) <- fData(combined.dataset.probit)$SYMBOL
rownames(varPart.ml) <- fData(combined.dataset.probit.platform.corrected)$SYMBOL

# Sort columns variance; for both datasets use order of features as in vp 
vp <- sortCols( varPart )
head(vp)
##            Platform      GSEid      Device Culture Time  Microbiome        Flow
## GAPDH  9.717632e-01 0.01074593 0.000000000 6.118381e-03 0.002422884 0.000000000
## HSPA8  8.360156e-01 0.02057758 0.003441859 0.000000e+00 0.054016600 0.000000000
## RACK1  9.311995e-01 0.05178194 0.000000000 9.061932e-03 0.002310462 0.000000000
## RPL10  6.213844e-01 0.02714387 0.003558667 1.699035e-11 0.009537405 0.001702495
## RPS28  1.997215e-01 0.57345841 0.001545247 1.886248e-01 0.013021288 0.014988221
## EEF1A1 9.599731e-10 0.54380828 0.000000000 6.499120e-10 0.178913046 0.026317309
##              Oxygen  Cell System   Residuals
## GAPDH  2.000738e-09 2.483467e-12 0.008949648
## HSPA8  5.017388e-09 4.508073e-02 0.040867605
## RACK1  2.204962e-11 8.287582e-04 0.004817419
## RPL10  2.840365e-01 4.770812e-02 0.004928528
## RPS28  3.918573e-08 1.375742e-08 0.008640489
## EEF1A1 9.034779e-10 1.841386e-01 0.066822804
vp.ml <- varPart.ml[, colnames(vp)]
head(vp.ml)
##            Platform      GSEid       Device Culture Time  Microbiome       Flow
## GAPDH  7.571109e-31 0.38477582 1.056989e-02 1.777526e-01 0.004681138 0.02072364
## HSPA8  2.957706e-31 0.04876958 1.424170e-02 1.950091e-05 0.442678821 0.00000000
## RACK1  3.700085e-09 0.76099264 5.333527e-11 6.231753e-02 0.054214754 0.00000000
## RPL10  3.926924e-10 0.26976509 5.414603e-02 2.056727e-10 0.144956743 0.02239052
## RPS28  0.000000e+00 0.65707184 2.652349e-03 2.757780e-01 0.022823266 0.02603685
## EEF1A1 4.613500e-10 0.42398180 0.000000e+00 0.000000e+00 0.219027876 0.02799231
##              Oxygen  Cell System  Residuals
## GAPDH  1.012744e-31 1.091298e-07 0.40149677
## HSPA8  7.462820e-32 1.875055e-01 0.30678493
## RACK1  5.043417e-09 0.000000e+00 0.12247507
## RPL10  4.691048e-09 4.323635e-01 0.07637806
## RPS28  2.949040e-10 0.000000e+00 0.01563772
## EEF1A1 4.276402e-09 2.508658e-01 0.07813219
# bar plot of variance fractions for the first 10 genes
color.vector <- c("#2fddce", "#94ea5b", "#208eb7", "#c2287d", "#069668", "#c9dd87", "#4c3e76", "#beb1e7", "#154e56", "#77d1fd", "#1932bf", "#f89ade", "#21a708", "#d3d3d3")
colors <-  color.vector[ 1:dim(vp)[2] ]

plotPercentBars( vp[1:10,], col=colors )

ggsave( filename = "variance.fractions.10.genes.pdf")


plotPercentBars( vp.ml[1:10,], col=colors )

ggsave(filename = "variance.fractions.platform.corrected.10.genes.pdf")


# violin plots of contribution of each variable to total variance
plotVarPart( vp, col=colors)

ggsave(filename = "violin.plot.contribution.variances.all.genes.pdf")

plotVarPart( vp.ml, col=colors )

ggsave(filename = "violin.plot.contribution.variances.platform.corrected.all.genes.pdf")

Continue with plotting the expression (rpt-values) of highly varying genes for all metadata variables. Highly varying is defined when proportion of variance explained by a metadata variable for a gene is > 0.4.

Next use these highly varying genes for functional overrepresentation analysis (ORA) using KEGG pathways, and all 3 subcategories of the Gene Ontology; the Biological Process (BP), Molecular Function (MF) and Cellular Component (CC) subcategories.

Do this all first for the highly varying genes from the non-corrected dataset, and then for the platform-corrected dataset.

Per metadata variable, plot the expression of all highly varying genes in a PDF, but also plot the expression for 2 genes in the output below.

rpt.expr <- exprs(combined.dataset.rpt)
rownames(rpt.expr) <- fData(combined.dataset.rpt)$SYMBOL

cutoff <- 0.40

# identify high varying genes for each variable for the non-corrected dataset
cellsystem.genes <-varPart[order(varPart$"Cell System", decreasing=TRUE),]
cellsystem.genes <- rownames( cellsystem.genes[cellsystem.genes$"Cell System" > cutoff,] )
length(cellsystem.genes)
## [1] 21
culturetime.genes <-varPart[order(varPart$"Culture Time", decreasing=TRUE),]
culturetime.genes <- rownames( culturetime.genes[culturetime.genes$"Culture Time" > cutoff,] )
length(culturetime.genes)
## [1] 172
flow.genes <-varPart[order(varPart$"Flow", decreasing=TRUE),]
flow.genes <- rownames( flow.genes[flow.genes$"Flow" > cutoff,] )
length(flow.genes)
## [1] 12
gseid.genes <-varPart[order(varPart$"GSEid", decreasing=TRUE),]
gseid.genes <- rownames( gseid.genes[gseid.genes$"GSEid" > cutoff,] )
length(gseid.genes)
## [1] 2021
device.genes <-varPart[order(varPart$"Device", decreasing=TRUE),]
device.genes <- rownames( device.genes[device.genes$"Device" > cutoff,] )
length(device.genes)
## [1] 128
oxygen.genes <-varPart[order(varPart$Oxygen, decreasing=TRUE),]
oxygen.genes <- rownames( oxygen.genes[oxygen.genes$Oxygen > cutoff,] )
length(oxygen.genes)
## [1] 525
platform.genes <-varPart[order(varPart$Platform, decreasing=TRUE),]
platform.genes <- rownames( platform.genes[platform.genes$Platform > cutoff,] )
length(platform.genes)
## [1] 6034
microbiome.genes <-varPart[order(varPart$Microbiome, decreasing=TRUE),]
microbiome.genes <- rownames( microbiome.genes[microbiome.genes$Microbiome > cutoff,] )
length(microbiome.genes)
## [1] 19
# Per variable, generate plots for all highly varying genes
# Cell System
p <- lapply(cellsystem.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Cell System") ) ]; names(c.vec) <- levels(pheno.data$"Cell System")
    plotStratify( Expression ~ Cell.System, data.frame( Expression = rpt.expr [x,], "Cell System" = pheno.data$"Cell System"), main=x, sort=FALSE, colorBy = c.vec) + theme(legend.position = "right")
    })
# save the expression of all highly varying genes in a PDF.
ggsave(filename = "expr.cellsystem.variance.genes.not.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)

# Plot the expression for 2 genes in the R session.
invisible( lapply(head(p, n=2L), print) )

# Culture Time
p <- lapply(culturetime.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Culture Time") ) ]; names(c.vec) <- levels(pheno.data$"Culture Time")
    plotStratify( Expression ~ Culture.Time, data.frame( Expression = rpt.expr [x,], "Culture Time" = pheno.data$"Culture Time"), main=x, sort=FALSE, colorBy = c.vec) + theme(legend.position = "right")
        })
ggsave(filename = "expr.culture.time.variance.genes.not.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)

invisible( lapply(head(p, n=2L), print) )

# Flow
p <- lapply(flow.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Flow") ) ]; names(c.vec) <- levels(pheno.data$"Flow")
    plotStratify( Expression ~ Flow, data.frame( Expression = rpt.expr [x,], "Flow" = pheno.data$"Flow"), main=x, sort=FALSE, colorBy = c.vec ) + theme(legend.position = "right")
        })
ggsave(filename = "expr.flow.variance.genes.not.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)

invisible( lapply(head(p, n=2L), print) )

# GSEid (genes not plotted because there are too many highly varying genes!)
# p <- lapply(gseid.genes, function(x) {
#    c.vec <- color.vector[ 1:length( levels(pheno.data$"GSEid") ) ]; names(c.vec) <- levels(pheno.data$"GSEid")
#    plotStratify( Expression ~ GSEid, data.frame( Expression = rpt.expr [x,], "GSEid" = pheno.data$"GSEid"), main=x, sort=FALSE, colorBy = c.vec) + theme(legend.position = "right")
#       })
# ggsave( filename = "expr.gseid.variance.not.corrected.genes.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)
#
# invisible( lapply(head(p, n=2L), print) )


# Device
p <- lapply(device.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Device") ) ]; names(c.vec) <- levels(pheno.data$"Device")
    plotStratify( Expression ~ Device, data.frame( Expression = rpt.expr [x,], "Device" = pheno.data$"Device"), main=x, sort=FALSE, colorBy = c.vec) + theme(legend.position = "right")
        })
ggsave(filename = "expr.device.variance.genes.not.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1),width = 15, height = 9)

invisible( lapply(head(p, n=2L), print) )

# Oxygen
p <- lapply( oxygen.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Oxygen") ) ]; names(c.vec) <- levels(pheno.data$"Oxygen")
    plotStratify( Expression ~ Oxygen, data.frame( Expression = rpt.expr [x,], "Oxygen" = pheno.data$"Oxygen"), main=x, sort=FALSE, colorBy = c.vec) + theme(legend.position = "right")
        })
ggsave(filename = "expr.oxygen.variance.genes.not.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)

invisible( lapply(head(p, n=2L), print) )

# Microbiome
p <- lapply(microbiome.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Microbiome") ) ]; names(c.vec) <- levels(pheno.data$"Microbiome")
    plotStratify( Expression ~ Microbiome, data.frame( Expression = rpt.expr [x,], "Microbiome" = pheno.data$"Microbiome"), main=x, sort=FALSE, colorBy = c.vec) + theme(legend.position = "right")
        })
ggsave(filename = "expr.microbiome.variance.genes.not.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)

invisible( lapply(head(p, n=2L), print) )

Continue with functional overrepresentation analysis (ORA) of the lists of varying genes.

For functional interpretation use the KEGG pathways, and Gene Ontology categories.

# rename rownames(varPart) to entrezid
rownames(varPart) <- fData(combined.dataset.probit)$ENTREZID

# select per variable again the highly varying genes, but now these genes are identified by entrezid
plat.var <- rownames(varPart[varPart$Platform > cutoff,] )
GSEid.var <- rownames(varPart[varPart$GSEid > cutoff,] )
device.var <- rownames(varPart[varPart$Device > cutoff,] )
cultime.var <- rownames(varPart[varPart$"Culture Time" > cutoff,] )
micro.var <- rownames(varPart[varPart$Microbiome > cutoff,] )
flow.var <- rownames(varPart[varPart$Flow > cutoff,] )
oxy.var <- rownames(varPart[varPart$Oxygen > cutoff,] )
cellsys.var <- rownames(varPart[varPart$"Cell System" > cutoff,] )

# combine into a list, and check
cp.input <- list("Platform"=plat.var, "GSEid"=GSEid.var, "Device"=device.var, "Culture time"=cultime.var, "Microbiome"=micro.var, "Flow"=flow.var, "Oxygen"=oxy.var, "Cell System"=cellsys.var)
str(cp.input)
## List of 8
##  $ Platform    : chr [1:6034] "2597" "3312" "10399" "6134" ...
##  $ GSEid       : chr [1:2021] "6234" "1915" "6141" "2950" ...
##  $ Device      : chr [1:128] "1152" "378" "374" "328" ...
##  $ Culture time: chr [1:172] "4224" "1811" "7429" "462" ...
##  $ Microbiome  : chr [1:19] "183" "4946" "1615" "10276" ...
##  $ Flow        : chr [1:12] "481" "6713" "39" "2224" ...
##  $ Oxygen      : chr [1:525] "6125" "6165" "10209" "5683" ...
##  $ Cell System : chr [1:21] "2665" "2029" "3915" "1153" ...
# Perform KEGG pathway ORA for all lists of varying genes
c.k <- compareCluster(geneCluster = cp.input, fun = enrichKEGG, pvalueCutoff = 0.05, pAdjustMethod = "BH", minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2)
c.k <- pairwise_termsim(c.k)
c.k <- setReadable(c.k, OrgDb = org.Hs.eg.db, keyType="ENTREZID")

# Plot the ORA results in a dotplot; per variable, only show the 10 most significant overrepresented KEGG pathway 
n.char <- max(nchar ( as.data.frame(c.k)$Description )  )
p <- dotplot(c.k, font.size=8, showCategory = 10, title = "enriched KEGG pathways\nvariance cutoff > 0.4, max 10 pathways\nnot corrected data", label_format=n.char)
ggsave(file="KEGG.not.corrected.ora.var0.4.pdf", width = 8.3, height = 11.7, units = "in")
print(p)

# added: plot ORA results in dotplot, but also include the empty clusters
KEGG.incl.empty <- dotplotInclEmptyClusters(cp.input, fun="enrichKEGG", pvalueCutoff = 0.05, pAdjustMethod = "BH", qvalueCutoff = 0.2, minGSSize = 10, maxGSSize = 500, showCategory=10, font.size=8, angle=30, title="enriched KEGG pathways\nvariance cutoff > 0.4, max 10 pathways\nnot corrected data")
ggsave(plot=KEGG.incl.empty$dotplot.out, file="KEGG.not.corrected.ora.var0.4.all.clusters.pdf", width = 8.3, height = 11.7, units = "in")
print(KEGG.incl.empty$dotplot.out)

# Perform GO overrepresentation analysis; do this for all 3 sub-categories
c.go.bp <- compareCluster(geneCluster = cp.input, fun = enrichGO, ont = "BP", OrgDb="org.Hs.eg.db")
c.go.bp <- pairwise_termsim(c.go.bp)
c.go.bp <- setReadable(c.go.bp, OrgDb = org.Hs.eg.db, keyType="ENTREZID")

c.go.mf <- compareCluster(geneCluster = cp.input, fun = enrichGO, ont = "MF", OrgDb="org.Hs.eg.db")
c.go.mf <- pairwise_termsim(c.go.mf)
c.go.mf <- setReadable(c.go.mf, OrgDb = org.Hs.eg.db, keyType="ENTREZID")

c.go.cc <- compareCluster(geneCluster = cp.input, fun = enrichGO, ont = "CC", OrgDb="org.Hs.eg.db")
c.go.cc <- pairwise_termsim(c.go.cc)
c.go.cc <- setReadable(c.go.cc, OrgDb = org.Hs.eg.db, keyType="ENTREZID")


# Biological Process
n.char <- max(nchar ( as.data.frame(c.go.bp)$Description )  )
p <- dotplot(c.go.bp, font.size=8, showCategory = 10, title = "enriched GO-BP categories\nvariance cutoff > 0.4, max 10 categories\nnot corrected data", label_format=n.char)
ggsave(file="GOBP.not.corrected.ora.var0.4.pdf", width = 8.3, height = 11.7, units = "in")
print(p)

# Molecular Function
n.char <- max(nchar ( as.data.frame(c.go.mf)$Description )  )
p <- dotplot(c.go.mf, font.size=8, showCategory = 10, title = "enriched GO-MF categories\nvariance cutoff > 0.4, max 10 categories\nnot corrected data", label_format=n.char)
ggsave(file="GOMF.not.corrected.ora.var0.4.pdf", width = 8.3, height = 11.7, units = "in")
print(p)

# Cellular Compartment
n.char <- max(nchar ( as.data.frame(c.go.cc)$Description )  )
p <- dotplot(c.go.cc, font.size=8, showCategory = 10, title = "enriched GO-CC categories\nvariance cutoff > 0.4, max 10 categories\nnot corrected data", label_format=n.char)
ggsave(file="GOCC.not.corrected.ora.var0.4.pdf", width = 8.3, height = 11.7, units = "in")
print(p)

# plot ORA results in dotplot, but also include the empty clusters
GOBP.incl.empty <- dotplotInclEmptyClusters(cp.input, fun="enrichGO", ont = "BP", keyTypeGO = "ENTREZID", OrgDb="org.Hs.eg.db", showCategory=10, font.size=8, angle=30, label_format= 40, title="enriched GO-BP categories\nvariance cutoff > 0.4, max 10 categories\nnot corrected data")
ggsave(plot=GOBP.incl.empty$dotplot.out, file="GOBP.not.corrected.ora.var0.4.all.clusters.pdf", width = 8.3, height = 11.7, units = "in")
print(GOBP.incl.empty$dotplot.out)

# plot ORA results in dotplot, but also include the empty clusters
GOMF.incl.empty <- dotplotInclEmptyClusters(cp.input, fun="enrichGO", ont = "MF", keyTypeGO = "ENTREZID", OrgDb="org.Hs.eg.db", showCategory=10, font.size=8, angle=30, label_format= 40, title="enriched GO-MF categories\nvariance cutoff > 0.4, max 10 categories\nnot corrected data")
ggsave(plot=GOMF.incl.empty$dotplot.out, file="GOMF.not.corrected.ora.var0.4.all.clusters.pdf", width = 8.3, height = 11.7, units = "in")
print(GOMF.incl.empty$dotplot.out)

# plot ORA results in dotplot, but also include the empty clusters
GOCC.incl.empty <- dotplotInclEmptyClusters(cp.input, fun="enrichGO", ont = "CC", keyTypeGO = "ENTREZID", OrgDb="org.Hs.eg.db", showCategory=10, font.size=8, angle=30, label_format= 40, title="enriched GO-CC categories\nvariance cutoff > 0.4, max 10 categories\nnot corrected data")
ggsave(plot=GOCC.incl.empty$dotplot.out, file="GOCC.not.corrected.ora.var0.4.all.clusters.pdf", width = 8.3, height = 11.7, units = "in")
print(GOCC.incl.empty$dotplot.out)

Repeat the visualization of the variance partition analysis for the platform-corrected results.

cutoff <- 0.40

cellsystem.genes <-varPart.ml[order(varPart.ml$"Cell System", decreasing=TRUE),]
cellsystem.genes <- rownames( cellsystem.genes[cellsystem.genes$"Cell System" > cutoff,] )
length(cellsystem.genes)
## [1] 322
culturetime.genes <-varPart.ml[order(varPart.ml$"Culture Time", decreasing=TRUE),]
culturetime.genes <- rownames( culturetime.genes[culturetime.genes$"Culture Time" > cutoff,] )
length(culturetime.genes)
## [1] 463
device.genes <-varPart.ml[order(varPart.ml$"Device", decreasing=TRUE),]
device.genes <- rownames( device.genes[device.genes$"Device" > cutoff,] )
length(device.genes)
## [1] 370
flow.genes <-varPart.ml[order(varPart.ml$Flow, decreasing=TRUE),]
flow.genes <- rownames( flow.genes[flow.genes$Flow > cutoff,] )
length(flow.genes)
## [1] 81
gseid.genes <-varPart.ml[order(varPart.ml$"GSEid", decreasing=TRUE),]
gseid.genes <- rownames( gseid.genes[gseid.genes$"GSEid" > cutoff,] )
length(gseid.genes)
## [1] 5198
microbiome.genes <-varPart.ml[order(varPart.ml$Microbiome, decreasing=TRUE),]
microbiome.genes <- rownames( microbiome.genes[microbiome.genes$Microbiome > cutoff,] )
length(microbiome.genes)
## [1] 293
oxygen.genes <-varPart.ml[order(varPart.ml$Oxygen, decreasing=TRUE),]
oxygen.genes <- rownames( oxygen.genes[oxygen.genes$Oxygen > cutoff,] )
length(oxygen.genes)
## [1] 0
platform.genes <-varPart.ml[order(varPart.ml$"Platform", decreasing=TRUE),]
platform.genes <- rownames( platform.genes[platform.genes$"Platform" > cutoff,] )
length(platform.genes)
## [1] 0
# Cell System
p <- lapply(cellsystem.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Cell System") ) ]; names(c.vec) <- levels(pheno.data$"Cell System")
    plotStratify( Expression ~ Cell.System, data.frame( Expression = rpt.expr [x,], "Cell System" = pheno.data$"Cell System"), main=x, sort=FALSE, colorBy = c.vec)  + theme(legend.position = "right")
    })
ggsave(filename = "expr.cellsystem.variance.genes.platform.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)

invisible( lapply(head(p, n=2L), print) )

# Culture Time
p <- lapply(culturetime.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Culture Time") ) ]; names(c.vec) <- levels(pheno.data$"Culture Time")
    plotStratify( Expression ~ Culture.Time, data.frame( Expression = rpt.expr [x,], "Culture Time" = pheno.data$"Culture Time"), main=x, sort=FALSE, colorBy = c.vec) + theme(legend.position = "right")
        })
ggsave(filename = "expr.culture.time.variance.genes.platform.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)

invisible( lapply(head(p, n=2L), print) )

# Flow
p <- lapply(flow.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Flow") ) ]; names(c.vec) <- levels(pheno.data$"Flow")
    plotStratify( Expression ~ Flow, data.frame( Expression = rpt.expr [x,], "Flow" = pheno.data$"Flow"), main=x, sort=FALSE, colorBy = c.vec ) + theme(legend.position = "right")
        })
ggsave(filename = "expr.flow.variance.genes.platform.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)

invisible( lapply(head(p, n=2L), print) )

# GSEid (genes not plotted because there are too many highly varying genes!)
#p <- lapply(gseid.genes, function(x) {
#    c.vec <- color.vector[ 1:length( levels(pheno.data$"GSEid") ) ]; names(c.vec) <- levels(pheno.data$"GSEid")
#    plotStratify( Expression ~ GSEid, data.frame( Expression = rpt.expr [x,], "GSEid" = pheno.data$"GSEid"), main=x, sort=FALSE, colorBy = c.vec) + theme(legend.position = "right")
#       })
# ggsave(filename = "expr.gseid.variance.genes.platform.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)
#
# invisible( lapply(head(p, n=2L), print) )



# Device
p <- lapply(device.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Device") ) ]; names(c.vec) <- levels(pheno.data$"Device")
    plotStratify( Expression ~ Device, data.frame( Expression = rpt.expr [x,], "Device" = pheno.data$"Device"), main=x, sort=FALSE, colorBy = c.vec) + theme(legend.position = "right")
        })
ggsave(filename = "expr.device.variance.genes.platform.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)

invisible( lapply(head(p, n=2L), print) )

# Oxygen
p <- lapply( oxygen.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Oxygen") ) ]; names(c.vec) <- levels(pheno.data$"Oxygen")
    plotStratify( Expression ~ Oxygen, data.frame( Expression = rpt.expr [x,], "Oxygen" = pheno.data$"Oxygen"), main=x, sort=FALSE, colorBy = c.vec) + theme(legend.position = "right")
        })
ggsave(filename = "expr.oxygen.variance.genes.platform.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)

invisible( lapply(head(p, n=2L), print) )



# Microbiome
p <- lapply(microbiome.genes, function(x) {
    c.vec <- color.vector[ 1:length( levels(pheno.data$"Microbiome") ) ]; names(c.vec) <- levels(pheno.data$"Microbiome")
    plotStratify( Expression ~ Microbiome, data.frame( Expression = rpt.expr [x,], "Microbiome" = pheno.data$"Microbiome"), main=x, sort=FALSE, colorBy = c.vec) + theme(legend.position = "right")
        })
ggsave(
   filename = "expr.microbiome.variance.genes.platform.corrected.0.40.pdf", plot = marrangeGrob(p, nrow=1, ncol=1), width = 15, height = 9)

invisible( lapply(head(p, n=2L), print) )

Continue with functional overrepresentation analysis (ORA) of the lists of varying genes.

# rename rownames(varPart) to entrezid
rownames(varPart.ml) <- fData(combined.dataset.probit.platform.corrected)$ENTREZID


plat.var.ml <- rownames(varPart.ml[varPart.ml$Platform > cutoff,] )
GSEid.var.ml <- rownames(varPart.ml[varPart.ml$GSEid > cutoff,] )
device.var.ml <- rownames(varPart.ml[varPart.ml$Device > cutoff,] )
cultime.var.ml <- rownames(varPart.ml[varPart.ml$"Culture Time" > cutoff,] )
micro.var.ml <- rownames(varPart.ml[varPart.ml$Microbiome > cutoff,] )
flow.var.ml <- rownames(varPart.ml[varPart.ml$Flow > cutoff,] )
oxy.var.ml <- rownames(varPart.ml[varPart.ml$Oxygen > cutoff,] )
cellsys.var.ml <- rownames(varPart.ml[varPart.ml$"Cell System" > cutoff,] )

cp.input.ml <- list("Platform"=plat.var.ml, "GSEid"=GSEid.var.ml, "Device"=device.var.ml, "Culture time"=cultime.var.ml, "Microbiome"=micro.var.ml, "Flow"=flow.var.ml, "Oxygen"=oxy.var.ml, "Cell System"=cellsys.var.ml)

str(cp.input.ml)
## List of 8
##  $ Platform    : chr(0) 
##  $ GSEid       : chr [1:5198] "10399" "6234" "1915" "345" ...
##  $ Device      : chr [1:370] "9168" "6513" "2879" "5660" ...
##  $ Culture time: chr [1:463] "5757" "293" "999" "7323" ...
##  $ Microbiome  : chr [1:293] "3312" "183" "10960" "8763" ...
##  $ Flow        : chr [1:81] "1152" "6307" "1717" "10628" ...
##  $ Oxygen      : chr(0) 
##  $ Cell System : chr [1:322] "6134" "23070" "5336" "23521" ...
# Perform KEGG pathway ORA for all lists of varying genes
c.k.ml <- compareCluster(geneCluster = cp.input.ml, fun = enrichKEGG, pvalueCutoff = 0.05, pAdjustMethod = "BH", minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2)
c.k.ml <- pairwise_termsim(c.k.ml)
c.k.ml <- setReadable(c.k.ml, OrgDb = org.Hs.eg.db, keyType="ENTREZID")

n.char <- max(nchar ( as.data.frame(c.k.ml)$Description )  )
p <- dotplot(c.k.ml, font.size=8, showCategory = 10, title = "enriched KEGG pathways\nvariance cutoff > 0.4, max 10 pathways\nplatform corrected data", label_format=n.char)
ggsave(file="KEGG.platform.corrected.ora.var0.4.pdf", width = 8.3, height = 11.7, units = "in")
print(p)

# plot ORA results in dotplot, but also include the empty clusters
KEGG.ml.incl.empty <- dotplotInclEmptyClusters(cp.input.ml, fun="enrichKEGG", pvalueCutoff = 0.05, pAdjustMethod = "BH", qvalueCutoff = 0.2, minGSSize = 10, maxGSSize = 500, showCategory=10, font.size=8, angle=30, title="enriched KEGG pathways\nvariance cutoff > 0.4, max 10 pathways\nplatform corrected data")
ggsave(plot=KEGG.ml.incl.empty$dotplot.out, file="KEGG.platform.corrected.ora.var0.4.all.clusters.pdf", width = 8.3, height = 11.7, units = "in")
print(KEGG.ml.incl.empty$dotplot.out)

# Perform GO subcategories overrepresentation analysis
c.go.ml.bp <- compareCluster(geneCluster = cp.input.ml, fun = enrichGO, ont = "BP", OrgDb="org.Hs.eg.db")
c.go.ml.bp <- pairwise_termsim(c.go.ml.bp)
c.go.ml.bp <- setReadable(c.go.ml.bp, OrgDb = org.Hs.eg.db, keyType="ENTREZID")

c.go.ml.mf <- compareCluster(geneCluster = cp.input.ml, fun = enrichGO, ont = "MF", OrgDb="org.Hs.eg.db")
c.go.ml.mf <- pairwise_termsim(c.go.ml.mf)
c.go.ml.mf <- setReadable(c.go.ml.mf, OrgDb = org.Hs.eg.db, keyType="ENTREZID")

c.go.ml.cc <- compareCluster(geneCluster = cp.input.ml, fun = enrichGO, ont = "CC", OrgDb="org.Hs.eg.db")
c.go.ml.cc <- pairwise_termsim(c.go.ml.cc)
c.go.ml.cc <- setReadable(c.go.ml.cc, OrgDb = org.Hs.eg.db, keyType="ENTREZID")


# Biological Process
n.char <- max(nchar ( as.data.frame(c.go.ml.bp)$Description )  )
p <- dotplot(c.go.ml.bp, font.size=8, showCategory = 10, title = "enriched GO-BP categories\nvariance cutoff > 0.4, max 10 categories\nplatform corrected data", label_format=n.char)
ggsave(file="GOBP.platform.corrected.ora.var0.4.pdf", width = 8.3, height = 11.7, units = "in")
print(p)

# Molecular Function
n.char <- max(nchar ( as.data.frame(c.go.ml.mf)$Description )  )
p <- dotplot(c.go.ml.mf, font.size=8, showCategory = 10, title = "enriched GO-MF categories\nvariance cutoff > 0.4, max 10 categories\nplatform corrected data", label_format=n.char)
ggsave(file="GOMF.platform.corrected.ora.var0.4.pdf", width = 8.3, height = 11.7, units = "in")
print(p)

# Cellular Compartment
n.char <- max(nchar ( as.data.frame(c.go.ml.cc)$Description )  )
p <- dotplot(c.go.ml.cc, font.size=8, showCategory = 10, title = "enriched GO-CC categories\nvariance cutoff > 0.4, max 10 categories\nplatform corrected data", label_format=n.char)
ggsave(file="GOCC.platform.corrected.ora.var0.4.pdf", width = 8.3, height = 11.7, units = "in")
print(p)

# plot ORA results in dotplot, but also include the empty clusters
GOBP.ml.incl.empty <- dotplotInclEmptyClusters(cp.input.ml, fun="enrichGO", ont = "BP", keyTypeGO = "ENTREZID", OrgDb="org.Hs.eg.db", showCategory=10, font.size=8, angle=30, label_format= 40, title="enriched GO-BP categories\nvariance cutoff > 0.4, max 10 categroies\nplatform corrected data")
ggsave(plot=GOBP.ml.incl.empty$dotplot.out, file="GOBP.platform.corrected.ora.var0.4.all.clusters.pdf", width = 8.3, height = 11.7, units = "in")
print(GOBP.ml.incl.empty$dotplot.out)

# plot ORA results in dotplot, but also include the empty clusters
GOMF.ml.incl.empty <- dotplotInclEmptyClusters(cp.input.ml, fun="enrichGO", ont = "MF", keyTypeGO = "ENTREZID", OrgDb="org.Hs.eg.db", showCategory=10, font.size=8, angle=30, label_format= 40, title="enriched GO-MF categories\nvariance cutoff > 0.4, max 10 categroies\nplatform corrected data")
ggsave(plot=GOMF.ml.incl.empty$dotplot.out, file="GOMF.platform.corrected.ora.var0.4.all.clusters.pdf", width = 8.3, height = 11.7, units = "in")
print(GOMF.ml.incl.empty$dotplot.out)

# plot ORA results in dotplot, but also include the empty clusters
GOCC.ml.incl.empty <- dotplotInclEmptyClusters(cp.input.ml, fun="enrichGO", ont = "CC", keyTypeGO = "ENTREZID", OrgDb="org.Hs.eg.db", showCategory=10, font.size=8, angle=30, label_format= 40, title="enriched GO-CC categories\nvariance cutoff > 0.4, max 10 categroies\nplatform corrected data")
ggsave(plot=GOCC.ml.incl.empty$dotplot.out, file="GOCC.platform.corrected.ora.var0.4.all.clusters.pdf", width = 8.3, height = 11.7, units = "in")
print(GOCC.ml.incl.empty$dotplot.out)

## create data to save in Excel file

# First export all results for the non-corrected dataset

# KEGG results
# significant gene sets only
dataframes.not.corrected.sign.only <- split(as.data.frame(c.k), as.data.frame(c.k)$Cluster)
# create workbook
wb <- createWorkbook()
#Iterate using function inside Map()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.not.corrected.sign.only, names(dataframes.not.corrected.sign.only) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
## Save workbook to excel file
saveWorkbook(wb, file = "clusterProfiler.not.corrected.KEGG.var0.4.sign.sets.xlsx", overwrite = TRUE)


# all gene sets
dataframes.not.corrected.all <- split(as.data.frame(KEGG.incl.empty$compareClusterOut), as.data.frame(KEGG.incl.empty$compareClusterOut)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.not.corrected.all, names(dataframes.not.corrected.all) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.not.corrected.KEGG.var0.4.all.sets.xlsx", overwrite = TRUE)



# Gene Ontology results
# significant gene sets only
dataframes.not.corrected.sign.only <- split(as.data.frame(c.go.bp), as.data.frame(c.go.bp)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.not.corrected.sign.only, names(dataframes.not.corrected.sign.only) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.not.corrected.GOBP.var0.4.sign.sets.xlsx", overwrite = TRUE)

dataframes.not.corrected.sign.only <- split(as.data.frame(c.go.mf), as.data.frame(c.go.mf)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.not.corrected.sign.only, names(dataframes.not.corrected.sign.only) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.not.corrected.GOMF.var0.4.sign.sets.xlsx", overwrite = TRUE)

dataframes.not.corrected.sign.only <- split(as.data.frame(c.go.cc), as.data.frame(c.go.cc)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.not.corrected.sign.only, names(dataframes.not.corrected.sign.only) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.not.corrected.GOCC.var0.4.sign.sets.xlsx", overwrite = TRUE)


# all gene sets
dataframes.not.corrected.all <- split(as.data.frame(GOBP.incl.empty$compareClusterOut), as.data.frame(GOBP.incl.empty$compareClusterOut)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.not.corrected.all, names(dataframes.not.corrected.all) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.not.corrected.GOBP.var0.4.all.sets.xlsx", overwrite = TRUE)

dataframes.not.corrected.all <- split(as.data.frame(GOMF.incl.empty$compareClusterOut), as.data.frame(GOMF.incl.empty$compareClusterOut)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.not.corrected.all, names(dataframes.not.corrected.all) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.not.corrected.GOMF.var0.4.all.sets.xlsx", overwrite = TRUE)

dataframes.not.corrected.all <- split(as.data.frame(GOCC.incl.empty$compareClusterOut), as.data.frame(GOCC.incl.empty$compareClusterOut)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.not.corrected.all, names(dataframes.not.corrected.all) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.not.corrected.GOCC.var0.4.all.sets.xlsx", overwrite = TRUE)




# Repeat for all results for the platform-corrected dataset

# KEGG results
# significant gene sets only
dataframes.platform.corrected.sign.only <- split(as.data.frame(c.k.ml), as.data.frame(c.k.ml)$Cluster)
# create workbook
wb <- createWorkbook()
#Iterate using function inside Map()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.platform.corrected.sign.only, names(dataframes.platform.corrected.sign.only) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
## Save workbook to excel file
saveWorkbook(wb, file = "clusterProfiler.platform.corrected.KEGG.var0.4.sign.sets.xlsx", overwrite = TRUE)


# all gene sets
dataframes.platform.corrected.all <- split(as.data.frame(KEGG.ml.incl.empty$compareClusterOut), as.data.frame(KEGG.ml.incl.empty$compareClusterOut)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.platform.corrected.all, names(dataframes.platform.corrected.all) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.platform.corrected.KEGG.var0.4.all.sets.xlsx", overwrite = TRUE)



# Gene Ontology results
# significant gene sets only
dataframes.platform.corrected.sign.only <- split(as.data.frame(c.go.ml.bp), as.data.frame(c.go.ml.bp)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.platform.corrected.sign.only, names(dataframes.platform.corrected.sign.only) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.platform.corrected.GOBP.var0.4.sign.sets.xlsx", overwrite = TRUE)

dataframes.platform.corrected.sign.only <- split(as.data.frame(c.go.ml.mf), as.data.frame(c.go.ml.mf)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.platform.corrected.sign.only, names(dataframes.platform.corrected.sign.only) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.platform.corrected.GOMF.var0.4.sign.sets.xlsx", overwrite = TRUE)

dataframes.platform.corrected.sign.only <- split(as.data.frame(c.go.ml.cc), as.data.frame(c.go.ml.cc)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.platform.corrected.sign.only, names(dataframes.platform.corrected.sign.only) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.platform.corrected.GOCC.var0.4.sign.sets.xlsx", overwrite = TRUE)


# all gene sets
dataframes.platform.corrected.all <- split(as.data.frame(GOBP.ml.incl.empty$compareClusterOut), as.data.frame(GOBP.ml.incl.empty$compareClusterOut)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.platform.corrected.all, names(dataframes.platform.corrected.all) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.platform.corrected.GOBP.var0.4.all.sets.xlsx", overwrite = TRUE)

dataframes.platform.corrected.all <- split(as.data.frame(GOMF.ml.incl.empty$compareClusterOut), as.data.frame(GOMF.ml.incl.empty$compareClusterOut)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.platform.corrected.all, names(dataframes.platform.corrected.all) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.platform.corrected.GOMF.var0.4.all.sets.xlsx", overwrite = TRUE)

dataframes.platform.corrected.all <- split(as.data.frame(GOCC.ml.incl.empty$compareClusterOut), as.data.frame(GOCC.ml.incl.empty$compareClusterOut)$Cluster)
wb <- createWorkbook()
Map(function(data, nameofsheet){     
    addWorksheet(wb, nameofsheet)
    writeData(wb, nameofsheet, data)},
    dataframes.platform.corrected.all, names(dataframes.platform.corrected.all) )
## $Platform
## [1] 0
## 
## $GSEid
## [1] 0
## 
## $Device
## [1] 0
## 
## $`Culture time`
## [1] 0
## 
## $Microbiome
## [1] 0
## 
## $Flow
## [1] 0
## 
## $Oxygen
## [1] 0
## 
## $`Cell System`
## [1] 0
saveWorkbook(wb, file = "clusterProfiler.platform.corrected.GOCC.var0.4.all.sets.xlsx", overwrite = TRUE)

Part 7 – SessionInfo

sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] splines   stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] hta20transcriptcluster.db_8.8.0      pd.hta.2.0_3.12.2                   
##  [3] hgu133a2.db_3.13.0                   pd.hg.u133a.2_3.12.0                
##  [5] hugene10sttranscriptcluster.db_8.8.0 pd.hugene.1.0.st.v1_3.14.1          
##  [7] nugohs1a520180.db_3.4.0              pd.nugo.hs1a520180_3.4.0            
##  [9] hgu133plus2.db_3.13.0                pd.hg.u133.plus.2_3.12.0            
## [11] hugene21sttranscriptcluster.db_8.8.0 pd.hugene.2.1.st_3.14.1             
## [13] hugene11sttranscriptcluster.db_8.8.0 pd.hugene.1.1.st.v1_3.14.1          
## [15] DBI_1.1.3                            RSQLite_2.2.17                      
## [17] GO.db_3.15.0                         enrichplot_1.16.2                   
## [19] clusterProfiler_4.4.4                variancePartition_1.26.0            
## [21] BiocParallel_1.30.3                  limma_3.52.4                        
## [23] pheatmap_1.0.12                      scales_1.2.1                        
## [25] RColorBrewer_1.1-3                   VGAM_1.1-7                          
## [27] cowplot_1.1.1                        gridExtra_2.3                       
## [29] PCAtools_2.8.0                       ggrepel_0.9.1                       
## [31] ggplot2_3.3.6                        dplyr_1.0.10                        
## [33] stringr_1.4.1                        data.table_1.14.2                   
## [35] org.Hs.eg.db_3.15.0                  AnnotationDbi_1.58.0                
## [37] affycoretools_1.68.1                 GEOquery_2.64.2                     
## [39] oligo_1.60.0                         Biostrings_2.64.1                   
## [41] GenomeInfoDb_1.32.4                  XVector_0.36.0                      
## [43] IRanges_2.30.1                       S4Vectors_0.34.0                    
## [45] Biobase_2.56.0                       oligoClasses_1.58.0                 
## [47] BiocGenerics_0.42.0                  openxlsx_4.2.5                      
## 
## loaded via a namespace (and not attached):
##   [1] rsvd_1.0.5                  Hmisc_4.7-1                
##   [3] Rsamtools_2.12.0            foreach_1.5.2              
##   [5] crayon_1.5.1                rbibutils_2.2.9            
##   [7] MASS_7.3-58.1               nlme_3.1-159               
##   [9] backports_1.4.1             GOSemSim_2.22.0            
##  [11] rlang_1.0.6                 irlba_2.3.5                
##  [13] nloptr_2.0.3                filelock_1.0.2             
##  [15] GOstats_2.62.0              rjson_0.2.21               
##  [17] bit64_4.0.5                 glue_1.6.2                 
##  [19] pbkrtest_0.5.1              parallel_4.2.1             
##  [21] DOSE_3.22.1                 tidyselect_1.1.2           
##  [23] SummarizedExperiment_1.26.1 XML_3.99-0.10              
##  [25] tidyr_1.2.1                 GenomicAlignments_1.32.1   
##  [27] xtable_1.8-4                magrittr_2.0.3             
##  [29] evaluate_0.16               Rdpack_2.4                 
##  [31] cli_3.4.1                   zlibbioc_1.42.0            
##  [33] hwriter_1.3.2.1             rstudioapi_0.14            
##  [35] bslib_0.4.0                 rpart_4.1.16               
##  [37] fastmatch_1.1-3             aod_1.3.2                  
##  [39] ensembldb_2.20.2            treeio_1.20.2              
##  [41] BiocSingular_1.12.0         xfun_0.33                  
##  [43] cluster_2.1.4               caTools_1.18.2             
##  [45] tidygraph_1.2.2             KEGGREST_1.36.3            
##  [47] tibble_3.1.8                ff_4.0.7                   
##  [49] biovizBase_1.44.0           ape_5.6-2                  
##  [51] png_0.1-7                   reshape_0.8.9              
##  [53] withr_2.5.0                 bitops_1.0-7               
##  [55] ggforce_0.3.4               RBGL_1.72.0                
##  [57] plyr_1.8.7                  GSEABase_1.58.0            
##  [59] AnnotationFilter_1.20.0     PFAM.db_3.15.0             
##  [61] dqrng_0.3.0                 pillar_1.8.1               
##  [63] gplots_3.1.3                cachem_1.0.6               
##  [65] GenomicFeatures_1.48.4      DelayedMatrixStats_1.18.1  
##  [67] vctrs_0.4.1                 ellipsis_0.3.2             
##  [69] generics_0.1.3              tools_4.2.1                
##  [71] gcrma_2.68.0                foreign_0.8-82             
##  [73] munsell_0.5.0               tweenr_2.0.2               
##  [75] fgsea_1.22.0                DelayedArray_0.22.0        
##  [77] fastmap_1.1.0               compiler_4.2.1             
##  [79] rtracklayer_1.56.1          GenomeInfoDbData_1.2.8     
##  [81] edgeR_3.38.4                lattice_0.20-45            
##  [83] deldir_1.0-6                AnnotationForge_1.38.1     
##  [85] utf8_1.2.2                  BiocFileCache_2.4.0        
##  [87] jsonlite_1.8.0              affy_1.74.0                
##  [89] GGally_2.1.2                graph_1.74.0               
##  [91] ScaledMatrix_1.4.1          tidytree_0.4.1             
##  [93] sparseMatrixStats_1.8.0     genefilter_1.78.0          
##  [95] lazyeval_0.2.2              doParallel_1.0.17          
##  [97] latticeExtra_0.6-30         R.utils_2.12.0             
##  [99] checkmate_2.1.0             rmarkdown_2.16             
## [101] textshaping_0.3.6           dichromat_2.0-0.1          
## [103] downloader_0.4              BSgenome_1.64.0            
## [105] igraph_1.3.5                survival_3.4-0             
## [107] yaml_2.3.5                  systemfonts_1.0.4          
## [109] Glimma_2.6.0                htmltools_0.5.3            
## [111] memoise_2.0.1               VariantAnnotation_1.42.1   
## [113] BiocIO_1.6.0                locfit_1.5-9.6             
## [115] graphlayouts_0.8.1          viridisLite_0.4.1          
## [117] digest_0.6.29               assertthat_0.2.1           
## [119] RhpcBLASctl_0.21-247.1      rappdirs_0.3.3             
## [121] yulab.utils_0.0.5           blob_1.2.3                 
## [123] R.oo_1.25.0                 ragg_1.2.2                 
## [125] ReportingTools_2.36.0       preprocessCore_1.58.0      
## [127] labeling_0.4.2              Formula_1.2-4              
## [129] OrganismDbi_1.38.1          ProtGenerics_1.28.0        
## [131] RCurl_1.98-1.8              broom_1.0.1                
## [133] hms_1.1.2                   colorspace_2.0-3           
## [135] base64enc_0.1-3             BiocManager_1.30.18        
## [137] GenomicRanges_1.48.0        aplot_0.1.7                
## [139] affxparser_1.68.1           nnet_7.3-17                
## [141] sass_0.4.2                  Rcpp_1.0.9                 
## [143] fansi_1.0.3                 tzdb_0.3.0                 
## [145] R6_2.5.1                    grid_4.2.1                 
## [147] lifecycle_1.0.2             zip_2.2.1                  
## [149] curl_4.3.2                  minqa_1.2.4                
## [151] affyio_1.66.0               jquerylib_0.1.4            
## [153] DO.db_2.9                   Matrix_1.5-1               
## [155] qvalue_2.28.0               ggbio_1.44.1               
## [157] iterators_1.0.14            htmlwidgets_1.5.4          
## [159] beachmat_2.12.0             polyclip_1.10-0            
## [161] biomaRt_2.52.0              purrr_0.3.4                
## [163] shadowtext_0.1.2            gridGraphics_0.5-1         
## [165] htmlTable_2.4.1             patchwork_1.1.2            
## [167] codetools_0.2-18            matrixStats_0.62.0         
## [169] gtools_3.9.3                prettyunits_1.1.1          
## [171] dbplyr_2.2.1                R.methodsS3_1.8.2          
## [173] gtable_0.3.1                highr_0.9                  
## [175] ggfun_0.0.7                 httr_1.4.4                 
## [177] KernSmooth_2.23-20          stringi_1.7.8              
## [179] progress_1.2.2              reshape2_1.4.4             
## [181] farver_2.1.1                annotate_1.74.0            
## [183] viridis_0.6.2               Rgraphviz_2.40.0           
## [185] ggtree_3.4.4                xml2_1.3.3                 
## [187] boot_1.3-28                 lme4_1.1-30                
## [189] restfulr_0.0.15             interp_1.1-3               
## [191] readr_2.1.2                 geneplotter_1.74.0         
## [193] ggplotify_0.1.0             Category_2.62.0            
## [195] DESeq2_1.36.0               bit_4.0.4                  
## [197] scatterpie_0.1.8            jpeg_0.1-9                 
## [199] MatrixGenerics_1.8.1        ggraph_2.0.6               
## [201] pkgconfig_2.0.3             knitr_1.40