COVID19 Blood Biomarker Analysis
BACKGROUND INFO
Pre-processing
From Olink's website:
Pre-processing and quality control of the data is performed using the Olink NPX Manager software. The built-in quality control (QC) allows us to have control over the technical performance of the assays as well as the samples, making sure that we can deliver reliable, high quality data. The basis of the QC depends on four internal controls that are spiked into all samples and external controls in every Olink Analysis.
The quality control is divided into two parts:
Run evaluation
Standard deviations for Incubation Control 1, Incubation Control 2 and the Detection Control are calculated and should be below the predetermined threshold: 0.2 NPX, for the entire 96-well sample plate.
Sample evaluation
A sample plate median value is calculated for the Incubation Control 2 and the Detection Control, respectively. For each sample, the result for each of these internal controls is allowed to deviate no more than 0.3 NPX from the plate median. If either or both of the internal controls exceed the 0.3 NPX limit, the sample will fail the QC. If more than 1/6th of the samples fail the QC, the run is deemed unreliable. The reason for the issues will then be evaluated and (if applicable) samples will be rerun.
The NPX Manager software displays the sample evaluation in two different views: the Plate View, showing NPX values for each sample, and the QC view, showing deviation from the plate median in NPX units. In the example below, all samples but one, “Sample 11” in well B2, pass the QC. For that sample, the Incubation Control 2 and the Detection Control deviate about 0.7 NPX and 1 NPX from the plate median, respectively, and the sample will therefore fail the QC. The NPX values for this sample will be included in the data export file but indicated with red font.
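The sample and run evaluation rules quoted above can be sketched in a few lines of R. This is only an illustration of the thresholds (0.3 NPX per sample, 1/6 of samples per run), not the NPX Manager implementation. The data frame `qc` and its column names are hypothetical; the deviations for "Sample 11" mirror the example described above.

```r
# Hypothetical deviations from the plate median (in NPX) for 12 samples
qc <- data.frame(
  sample     = paste("Sample", 9:20),
  inc_ctrl_2 = c(0.05, -0.10, 0.70, rep(0.02, 9)),
  det_ctrl   = c(-0.08, 0.12, 1.00, rep(-0.05, 9)),
  stringsAsFactors = FALSE
)

# a sample fails if either internal control deviates by more than 0.3 NPX
qc$fail <- abs(qc$inc_ctrl_2) > 0.3 | abs(qc$det_ctrl) > 0.3

# the run is deemed unreliable if more than 1/6 of the samples fail
run_unreliable <- mean(qc$fail) > 1/6

# samples failing QC, and the run-level verdict
qc$sample[qc$fail]
run_unreliable
```

Here one of twelve samples fails, which is below the 1/6 limit, so the toy run would still be accepted.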
NPX
NPX, Normalized Protein eXpression, is Olink’s arbitrary unit which is in Log2 scale. It is calculated from Ct values and data pre-processing (normalization) is performed to minimize both intra- and inter-assay variation. NPX data allows users to identify changes for individual protein levels across their sample set, and then use this data to establish protein signatures.
NPX is a relative quantification unit logarithmically related to protein concentration. Even if two different proteins have the same NPX values, their absolute concentrations may differ. NPX should be compared for each assay separately between samples within a run. NPX should not be compared between runs without proper inter-plate normalization due to the risk of falsely interpreting shifts in median between runs as a biological difference.
NPX - a difference of 1 NPX corresponds to a doubling of protein concentration
Only 1 plate was run in this study, so there is NO need for inter-plate normalisation
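Because NPX is on a log2 scale, a difference in NPX converts to a fold change as 2^(difference). A minimal worked example, using made-up NPX values for a single assay in two samples (the variable names are illustrative only):

```r
# Hypothetical NPX values for the same assay in two samples
npx_a <- 5.5
npx_b <- 4.5

# difference on the log2 scale
delta_npx <- npx_a - npx_b

# corresponding fold change in relative protein level
fold_change <- 2^delta_npx
fold_change
```

As noted in the Olink text on data below LOD, this relationship may not hold in the non-linear part of the assay range.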
Detection threshold
From Olink's website:
Limit of detection (LOD) is calculated separately for each Olink assay and sample plate. The LOD is based on the background, estimated from negative controls included on every plate, plus three standard deviations. The standard deviation is assay specific and estimated during product validation for every panel.
For studies including more than one plate per panel, the maximum observed LOD for each assay is selected as study LOD. Consequently, all plates included in the study receive the same assay specific LOD. The estimated LOD is a conservative measurement especially in large multiplate studies where there is high probability that observed data is in fact above the true background signal.
Consider excluding assays with low detection from analysis
Olink recommends that assays with a large proportion of samples below LOD are excluded from the analysis. The limit for exclusion should be decided on a study basis and consider design including sample size and experimental variables.
Suitable exclusion limits may be in the range of less than 25-50% of samples above LOD.
Characteristics of data below LOD
As with all affinity based assays, data from Olink’s platform have an S-curve (sigmoid) relationship with the true protein concentration in a sample. Data below LOD have a higher risk of being in the non-linear phase of the S-curve, meaning that 1 NPX difference may not correspond to 2x protein concentration in this region. This may bias estimates including data below LOD and should be considered when interpreting any results that are based on data below LOD.
Strategies for handling data below LOD in data analysis
Several strategies exist for handling data below LOD that vary in complexity. Olink delivers data below LOD to allow researchers to choose the strategy that is best for their study and interpret results with the complexity of data below LOD in mind. Some examples of strategies include:
- Replace data below LOD: It is common to replace data below LOD with a specific value. This will left-censor the data, which creates a skewed distribution. Estimates of, for example, the mean will be biased and parametric statistical tests may have lower statistical power. Common values to use for replacement are the value for LOD or LOD/sqrt(2). The latter has been reported in the literature to give less biased estimates of means.
- Use actual data below LOD: As data below LOD may be non-linear, estimates of for example mean may be biased. However, especially in large multiplate studies LOD is a conservative measurement. Using actual data may increase the statistical power and give a less skewed distribution compared to replace data below LOD with a value.
- Impute data below LOD: A more complex approach for handling data below LOD is to impute the true value. Several methods for imputation exist and includes maximum-likelihood estimation (MLE) of the distribution below LOD.
- Set data below LOD to missing: Olink do not recommend that data below LOD is excluded from analysis as the most distinct biomarkers may have a low concentration under specific conditions.
DATA - INFO
Groups:
- 0 = Control
- 1 = Mild symptoms
- 2 = Moderate symptoms
- 3 = Severe symptoms (ICU)
Data:
- Background corrected + log2 transformed + normalised (by OLINKs software)
- Expression measured in NPX
- Rows with NA have failed the assay
- Cells where the value is less than the detection threshold (LOD) still hold a valid expression value, but the expression is too low to quantify accurately
- Data has been reformatted to create two tables:
- Expression data - contains sample information and expression data using OLINK ID. OLINK ID used as they are unique.
- Assay info - contains information on the assay, i.e. panel, OLINK ID, Uniprot ID, etc.
Batch
- one plate - 93 samples + 368 proteins (plate can do 90 samples + 6 controls only according to OLINKS - maybe contains controls?)
ADDITIONAL PROTEIN MEASURES ADDED
- Tau
- NfL
- GFAP
These were measured using the Simoa method and will be analysed separately
Analysis plan:
- Remove failed samples - rows with NA/blank
- Remove proteins where expression is < LOD in more than 50% of samples, calculated per group (0,1,2,3). A protein needs to be below LOD in more than 50% of samples in all groups to be considered for removal
- Expression values below LOD set to LOD/sqrt(2) (as recommended by Olink). Negative expression values left as is
- Linear regression - limma:
- 0 vs 1
- 0 vs 2
- 0 vs 3
- 0 vs 1,2,3 (grouped - case vs control)
- 0 vs 1 vs 2 vs 3
- longitudinal - 6 patients from mild and 6 patients from severe
- Visualisation:
- volcano
- Boxplot
DATA EXPLORATION
Read Data
Set working directory
# pc
output_dir<-"D:/Dropbox/Projects/COVID19_blood_biomarker/Analysis"
#laptop
#output_dir<-"C:/Users/hamel/Dropbox/Projects/COVID19_blood_biomarker/Analysis"
# read - pc
protein_data<-read.csv("D:/Dropbox/Projects/COVID19_blood_biomarker/Data/protein_expression_data.txt", sep="\t", head=T, fill=T)
assay_data<-read.csv("D:/Dropbox/Projects/COVID19_blood_biomarker/Data/protein_info.txt", sep="\t", head=T)
# read - laptop
#protein_data<-read.csv("C:/Users/hamel/Dropbox/Projects/COVID19_blood_biomarker/Data/protein_expression_data.txt", sep="\t", head=T, fill=T)
#assay_data<-read.csv("C:/Users/hamel/Dropbox/Projects/COVID19_blood_biomarker/Data/protein_info.txt", sep="\t", head=T)
# dim
dim(protein_data)
## [1] 93 377
Add sample ID
Adding a sample ID to give each sample a unique ID. ID will be “sample” + a number:
- sample1
- sample2
- sample3
- etc…
# add sample id
rownames(protein_data)<-paste("sample",rownames(protein_data), sep="")
# check
head(protein_data)[1:10]
## Group Pat.Code Days.since.onset Gender Etnicity Age Tau NfL
## sample1 1 101 8 M Cauc 71.56 0.880 14.590
## sample2 1 101 27 M Cauc 71.61 0.990 15.700
## sample3 1 102 12 F Cauc 71.26 0.918 20.472
## sample4 1 103 22 M Cauc 69.09 0.188 11.184
## sample5 1 104 23 M Cauc 66.20 0.808 14.124
## sample6 1 105 17 F Cauc 60.36 1.810 12.360
## GFAP OID00379
## sample1 343.830 4.60307
## sample2 484.040 4.89467
## sample3 600.086 4.75002
## sample4 202.817 5.50362
## sample5 279.862 4.72375
## sample6 120.690 4.29556
Groups
Groups:
- 0 = control
- 1 = Mild symptoms
- 2 = Moderate symptoms
- 3 = Severe symptoms (ICU)
The number of samples per group:
##
## 0 1 2 3
## 33 26 9 25
Number of duplicate samples
Some patients were assayed again at a different time point.
This is how many patients have more than one sample, counted by patient ID:
# count samples per patient (Pat.Code)
duplicate_count<-as.data.frame(table(protein_data$Pat.Code))
# number of patients with more than 1 sample
nrow(subset(duplicate_count, Freq > 1))
## [1] 12
# samples from patients with more than 1 sample
duplicate_samples<-protein_data[protein_data$Pat.Code %in% subset(duplicate_count, Freq > 1)$Var1,]
# freq by group
table(duplicate_samples[1])
##
## 1 3
## 12 12
Calculating time difference between longitudinal data in group 1:
# separate by group and check difference in Days.since.onset
group1_duplicates_temp<-duplicate_samples[duplicate_samples$Group==1,][c(2,3)]
# data frame to store data
group1_duplicates<-as.data.frame(matrix(nrow=6, ncol=3))
# collapse by duplicate sample ID
temp_counter=1
for (x in seq(1,12,2)) {
group1_duplicates[temp_counter,c(1,2)]<-group1_duplicates_temp[c(x),]
group1_duplicates[temp_counter,3]<-group1_duplicates_temp[c(x+1),2]
temp_counter=temp_counter+1
}
# remove temp files
rm(temp_counter, group1_duplicates_temp)
# add colnames
colnames(group1_duplicates)<-c("Pat.Code", "1st_sample", "2nd_sample")
# calculate diff
group1_duplicates$time_diff<-as.numeric(group1_duplicates$`2nd_sample`)- as.numeric(group1_duplicates$`1st_sample`)
# table
datatable(group1_duplicates)
Summary of time_diff (days) for group 1:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 14.25 17.00 16.17 19.00 20.00
Calculating time difference between longitudinal data in group 3:
# separate by group and check difference in Days.since.onset
group3_duplicates_temp<-duplicate_samples[duplicate_samples$Group==3,][c(2,3)]
# data frame to store data
group3_duplicates<-as.data.frame(matrix(nrow=6, ncol=3))
# collapse by duplicate sample ID
temp_counter=1
for (x in seq(1,12,2)) {
group3_duplicates[temp_counter,c(1,2)]<-group3_duplicates_temp[c(x),]
group3_duplicates[temp_counter,3]<-group3_duplicates_temp[c(x+1),2]
temp_counter=temp_counter+1
}
# remove temp files
rm(temp_counter, group3_duplicates_temp)
# add colnames
colnames(group3_duplicates)<-c("Pat.Code", "1st_sample", "2nd_sample")
# calculate diff
group3_duplicates$time_diff<-as.numeric(group3_duplicates$`2nd_sample`)- as.numeric(group3_duplicates$`1st_sample`)
# table
datatable(group3_duplicates)
Summary of time_diff (days) for group 3:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 3.167 4.000 5.000
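The two loops above assume that the rows for each patient sit next to each other and that every patient has exactly two samples. A minimal alternative sketch that does not depend on row order is shown below, using `aggregate()` on a toy data frame. The object `dup_toy` and its values are hypothetical stand-ins for `duplicate_samples`.

```r
# Toy stand-in for duplicate_samples (values are illustrative only)
dup_toy <- data.frame(
  Pat.Code         = c(101, 105, 101, 105),
  Days.since.onset = c(8, 17, 27, 30)
)

# time between first and last sample for each patient, regardless of row order
time_diff <- aggregate(Days.since.onset ~ Pat.Code, data = dup_toy,
                       FUN = function(d) max(d) - min(d))
colnames(time_diff)[2] <- "time_diff"

time_diff
summary(time_diff$time_diff)
```

With more than two samples per patient this gives the span between the earliest and latest sample, which may or may not be what is wanted.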
Summary
12 patients were sampled twice:
- 6 from group 1 (mild)
- 6 from group 3 (severe)
Group 1 average time difference between repeated samples is 16 days
Group 3 average time difference between repeated samples is 3 days
Number of duplicate proteins
Some proteins are repeated on different assays
## Assay Gene.ID Uniprot.ID OLINK.ID LOD
## 1 Olink CARDIOVASCULAR II(v.5006) BMP-6 P22004 OID00379 0.51245
## 2 Olink CARDIOVASCULAR II(v.5006) ANGPT1 Q15389 OID00380 -0.30886
## 3 Olink CARDIOVASCULAR II(v.5006) ADM P35318 OID00381 1.17289
## 4 Olink CARDIOVASCULAR II(v.5006) CD40-L P29965 OID00382 0.26247
## 5 Olink CARDIOVASCULAR II(v.5006) SLAMF7 Q9NQ25 OID00383 1.62544
## 6 Olink CARDIOVASCULAR II(v.5006) PGF P49763 OID00384 0.33637
## [1] 368
## [1] 355
## [1] 355
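The code behind the counts above is not shown. A minimal sketch of how repeated proteins could be counted and listed is given below, assuming the check is done on the `OLINK.ID` and `Uniprot.ID` columns of the assay table. The toy table `assay_toy` and all IDs in it are hypothetical.

```r
# Toy stand-in for assay_data (all identifiers are made up)
assay_toy <- data.frame(
  Assay      = c("Panel A", "Panel B", "Panel A", "Panel B"),
  Gene.ID    = c("GENE1", "GENE1", "GENE2", "GENE3"),
  Uniprot.ID = c("U0001", "U0001", "U0002", "U0003"),
  OLINK.ID   = c("OID_A", "OID_B", "OID_C", "OID_D"),
  stringsAsFactors = FALSE
)

# number of assays (OLINK IDs are unique per assay)
n_assays <- length(unique(assay_toy$OLINK.ID))
# number of distinct proteins by Uniprot ID
n_proteins <- length(unique(assay_toy$Uniprot.ID))

# list the proteins that are measured on more than one assay
repeated_ids <- unique(assay_toy$Uniprot.ID[duplicated(assay_toy$Uniprot.ID)])
repeated <- assay_toy[assay_toy$Uniprot.ID %in% repeated_ids, ]
repeated
```

Listing the repeated rows, rather than only counting them, shows which panels each repeated protein appears on.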
Expression plots
Basic boxplots
# format data
boxplot_data_pre_qc<-stack(as.data.frame(t(protein_data[10:ncol(protein_data)])))
# add group
boxplot_data_pre_qc<-merge(boxplot_data_pre_qc, protein_data[1], by.x="ind", by.y="row.names")
# format
colnames(boxplot_data_pre_qc)<-c("Sample", "NPX", "Group")
# recode group
boxplot_data_pre_qc$Group[boxplot_data_pre_qc$Group=="0"]<-"Control"
boxplot_data_pre_qc$Group[boxplot_data_pre_qc$Group=="1"]<-"Mild"
boxplot_data_pre_qc$Group[boxplot_data_pre_qc$Group=="2"]<-"Moderate"
boxplot_data_pre_qc$Group[boxplot_data_pre_qc$Group=="3"]<-"Severe"
# plot
ggplotly(ggplot(data=boxplot_data_pre_qc, aes(x = Sample, y = NPX, fill=Group)) +
geom_boxplot() +
ggtitle("Distribution of protein expression") +
theme(plot.title = element_text(hjust = 0.5)))
## Warning: Removed 2392 rows containing non-finite values (stat_boxplot).
Density plot
Density plots
# by group
ggplot(boxplot_data_pre_qc, aes(x=NPX, colour=Group)) +
geom_density() +
ggtitle("Expression density by Group") +
theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 2392 rows containing non-finite values (stat_density).
#by sample
ggplotly(ggplot(boxplot_data_pre_qc, aes(x=NPX, colour=Sample)) +
geom_density() +
ggtitle("Expression density by sample") +
theme(plot.title = element_text(hjust = 0.5), legend.position = "none"))
## Warning: Removed 2392 rows containing non-finite values (stat_density).
DATA PROCESSING
Remove samples which are all empty - NA for every protein
There are 4 assays, each with 92 proteins. total = 368
Number of missing values by sample (ordered by missingness):
## sample1 sample3 sample4 sample5 sample6 sample7 sample8 sample9
## 0 0 0 0 0 0 0 0
## sample10 sample11 sample12 sample13 sample14 sample15 sample16 sample17
## 0 0 0 0 0 0 0 0
## sample18 sample19 sample20 sample21 sample22 sample23 sample24 sample25
## 0 0 0 0 0 0 0 0
## sample26 sample27 sample28 sample29 sample30 sample31 sample32 sample33
## 0 0 0 0 0 0 0 0
## sample34 sample35 sample36 sample37 sample38 sample39 sample40 sample41
## 0 0 0 0 0 0 0 0
## sample42 sample43 sample44 sample45 sample46 sample47 sample48 sample49
## 0 0 0 0 0 0 0 0
## sample50 sample52 sample53 sample54 sample55 sample56 sample57 sample58
## 0 0 0 0 0 0 0 0
## sample59 sample60 sample61 sample62 sample63 sample64 sample65 sample66
## 0 0 1 1 1 1 1 1
## sample67 sample68 sample69 sample70 sample71 sample73 sample74 sample75
## 1 1 1 1 1 1 1 1
## sample77 sample78 sample79 sample81 sample82 sample84 sample85 sample86
## 1 1 1 1 1 1 1 1
## sample87 sample89 sample90 sample91 sample92 sample93 sample2 sample72
## 1 1 1 1 1 1 184 369
## sample76 sample80 sample83 sample88 sample51
## 369 369 369 369 371
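Six samples have no expression data at all (sample51, sample72, sample76, sample80, sample83, sample88) and 1 sample is missing 50% of the proteins (sample2, 184 of 368). Counts above 368 show that the tally also includes NAs in the sample information columns.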
Investigating sample 2:
# find missingness by sample 2
sample_2_missingness<-as.data.frame(t(protein_data[2,10:ncol(protein_data)]))
# merge with assay data
sample_2_missingness<-merge(sample_2_missingness, assay_data[c(1,4)], by.x="row.names", by.y="OLINK.ID")
# change colnames
colnames(sample_2_missingness)<-c("OLINK.ID", "NPX", "Assay")
# change all numeric value to 1
sample_2_missingness[!(is.na(sample_2_missingness$NPX)),2]<-1
# contingency table
table(sample_2_missingness$NPX, sample_2_missingness$Assay)
##
## Olink CARDIOVASCULAR II(v.5006) Olink IMMUNE RESPONSE(v.3203)
## 1 92 0
##
## Olink INFLAMMATION(v.3022) Olink NEUROLOGY(v.8012)
## 1 92 0
Summary of sample2:
- Sample2 is from a mild symptom patient and has no data for the immune response and neurology assays (sample kept in)
- It is also a repeat sample from the same patient as sample1 (Pat.Code 101).
- We can remove this sample for the immune response and neurology assays.
Following samples removed (all missing expression values):
- sample51
- sample72
- sample76
- sample80
- sample83
- sample88
# make list of samples to remove
samples_to_remove<-names(tail(sort(rowSums(is.na(protein_data))), 6))
# The samples being removed belong to group:
protein_data[rownames(protein_data) %in% samples_to_remove,1]
## [1] 3 0 0 0 0 0
# remove from expression
protein_data_clean<-protein_data[!(rownames(protein_data) %in% samples_to_remove),]
# check
sort(rowSums(is.na(protein_data_clean)))
## sample1 sample3 sample4 sample5 sample6 sample7 sample8 sample9
## 0 0 0 0 0 0 0 0
## sample10 sample11 sample12 sample13 sample14 sample15 sample16 sample17
## 0 0 0 0 0 0 0 0
## sample18 sample19 sample20 sample21 sample22 sample23 sample24 sample25
## 0 0 0 0 0 0 0 0
## sample26 sample27 sample28 sample29 sample30 sample31 sample32 sample33
## 0 0 0 0 0 0 0 0
## sample34 sample35 sample36 sample37 sample38 sample39 sample40 sample41
## 0 0 0 0 0 0 0 0
## sample42 sample43 sample44 sample45 sample46 sample47 sample48 sample49
## 0 0 0 0 0 0 0 0
## sample50 sample52 sample53 sample54 sample55 sample56 sample57 sample58
## 0 0 0 0 0 0 0 0
## sample59 sample60 sample61 sample62 sample63 sample64 sample65 sample66
## 0 0 1 1 1 1 1 1
## sample67 sample68 sample69 sample70 sample71 sample73 sample74 sample75
## 1 1 1 1 1 1 1 1
## sample77 sample78 sample79 sample81 sample82 sample84 sample85 sample86
## 1 1 1 1 1 1 1 1
## sample87 sample89 sample90 sample91 sample92 sample93 sample2
## 1 1 1 1 1 1 184
List proteins with a high rate of values below LOD (or NA)
Proteins with > 50% of values below LOD in all groups are removed.
Making a list of proteins where more than 50% of values are below the LOD for each group
#create subset of group 0 (exprs only)
group0_exprs<-(subset(protein_data_clean, protein_data_clean$Group==0))[10:ncol(protein_data_clean)]
dim(group0_exprs)
## [1] 28 368
#create subset of group 1 (exprs only)
group1_exprs<-(subset(protein_data_clean, protein_data_clean$Group==1))[10:ncol(protein_data_clean)]
dim(group1_exprs)
## [1] 26 368
#create subset of group 2 (exprs only)
group2_exprs<-(subset(protein_data_clean, protein_data_clean$Group==2))[10:ncol(protein_data_clean)]
dim(group2_exprs)
## [1] 9 368
#create subset of group 3 (exprs only)
group3_exprs<-(subset(protein_data_clean, protein_data_clean$Group==3))[10:ncol(protein_data_clean)]
dim(group3_exprs)
## [1] 24 368
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
#create count file
missingness<-assay_data[4]
#add groups
missingness$Group0<-0
missingness$Group1<-0
missingness$Group2<-0
missingness$Group3<-0
#count group 0
for (x in 1:nrow(missingness)) {
missingness[x,2]<-(length(group0_exprs[x][group0_exprs[x]<assay_data$LOD[x]])/nrow(group0_exprs))*100
}
#count group 1
for (x in 1:nrow(missingness)) {
missingness[x,3]<-(length(group1_exprs[x][group1_exprs[x]<assay_data$LOD[x]])/nrow(group1_exprs))*100
}
#count group 2
for (x in 1:nrow(missingness)) {
missingness[x,4]<-(length(group2_exprs[x][group2_exprs[x]<assay_data$LOD[x]])/nrow(group2_exprs))*100
}
#count group 3
for (x in 1:nrow(missingness)) {
missingness[x,5]<-(length(group3_exprs[x][group3_exprs[x]<assay_data$LOD[x]])/nrow(group3_exprs))*100
}
Count the number of proteins where expression values are < LOD per group
This is the count of proteins with > 50% of values below LOD (or missing) per group.
# move OLINK ID to rowname and remove from file
rownames(missingness)<-missingness$OLINK.ID
missingness$OLINK.ID<-NULL
#count per group
apply(missingness, 2, function(x) {sum(x>50)})
## Group0 Group1 Group2 Group3
## 17 27 20 18
#count the number of groups in which missingness is over 50%, per protein
missingness_across_groups<-as.data.frame(apply(missingness, 1, function(x) {sum(x>50)}))
colnames(missingness_across_groups)<-"missingness_count"
# This is the number of proteins by how many groups exceed 50%.
table(missingness_across_groups)
## missingness_across_groups
## 0 1 2 3 4
## 341 4 4 6 13
Summary of protein missingness
- 13 proteins have >50% missingness in all 4 groups
- 6 proteins have >50% missingness in 3 groups
- 4 proteins have >50% missingness in 2 groups
- 4 proteins have >50% missingness in 1 group
- 341 proteins DON’T have >50% missingness in any group
List of proteins to remove, i.e. missingness over 50% in all groups
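The per-group loops and the across-group count above can also be written without explicit loops. The sketch below runs on toy data and treats NA as missing alongside values below LOD, which is an assumption about the intended behaviour. The objects `exprs_toy`, `lod_toy` and `group_toy` are hypothetical stand-ins for the expression columns, `assay_data$LOD` and `Group`.

```r
# Toy expression matrix (samples in rows, proteins in columns)
exprs_toy <- data.frame(
  P1 = c(1.0, 0.2, 0.1, 2.0),
  P2 = c(3.0, 3.5, NA, 0.5),
  P3 = c(0.1, 0.2, 0.1, NA)
)
lod_toy   <- c(P1 = 0.5, P2 = 1.0, P3 = 0.5)   # one LOD per protein, same order as columns
group_toy <- c(0, 0, 1, 1)                      # group of each sample

# TRUE where a value is below the LOD of its protein
below <- sweep(as.matrix(exprs_toy), 2, lod_toy, FUN = "<")
# count NA as missing too (assumption)
below[is.na(below)] <- TRUE

# percentage below LOD (or NA) per group and protein: groups in rows, proteins in columns
pct_below <- 100 * rowsum(below * 1, group_toy) / as.vector(table(group_toy))

# number of groups in which each protein exceeds 50%
n_groups_over_50 <- colSums(pct_below > 50)

# proteins exceeding 50% in every group
flagged <- names(which(n_groups_over_50 == nrow(pct_below)))
pct_below
flagged
```

As with the loops, this relies on the LOD vector being in the same order as the expression columns, so that order should be checked first.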
Remove proteins with high LOD + NA across all groups
Remove the 13 proteins from expression
## [1] 13
## [1] 368
# remove proteins
protein_data_clean<-protein_data_clean[!(colnames(protein_data_clean) %in% proteins_to_remove)]
# number of proteins
length(colnames(protein_data_clean[10:ncol(protein_data_clean)]))
## [1] 355
Remove proteins from assay data
## [1] 368
# remove proteins
assay_data_clean<-assay_data[!(assay_data$OLINK.ID %in% proteins_to_remove),]
# number of proteins
length(rownames(assay_data_clean))
## [1] 355
Check that the expression data and assay data contain the same proteins
## [1] TRUE
Replace low expression
How many expression values < LOD in data:
# extract expression table only
exprs_only<-protein_data_clean[10:ncol(protein_data_clean)]
# check protein order same
all(colnames(exprs_only)==assay_data_clean$OLINK.ID)
## [1] TRUE
# count total number of cells with value less than its LOD
total_low_LOD_count<-assay_data_clean[4]
total_low_LOD_count$missingness<-0
# individual count
for (x in 1:nrow(total_low_LOD_count)) {
total_low_LOD_count[x,2]<-length(na.omit(exprs_only[x][exprs_only[x] < assay_data_clean$LOD[x]]))
}
# total number of expression values below LOD
sum(total_low_LOD_count$missingness)
## [1] 834
# total number of proteins with an expression value less than LOD
nrow(subset(total_low_LOD_count, total_low_LOD_count$missingness!=0))
## [1] 44
# number of proteins with an expression value less than LOD - what panels are they in
table(assay_data_clean[assay_data_clean$OLINK.ID %in% subset(total_low_LOD_count, total_low_LOD_count$missingness!=0)$OLINK.ID,]$Assay)
##
## Olink CARDIOVASCULAR II(v.5006) Olink IMMUNE RESPONSE(v.3203)
## 2 24
## Olink INFLAMMATION(v.3022) Olink NEUROLOGY(v.8012)
## 13 5
Change these values below LOD to LOD/sqrt(2)
# for each column, replace values lower than LOD with LOD/sqrt(2)
for (x in 1:ncol(exprs_only)) {
exprs_only[x]<-replace(exprs_only[x],
exprs_only[x] < assay_data_clean$LOD[x],
assay_data_clean$LOD[x]/(sqrt(2)))
}
Check again how many values are still low. This time we have to check whether values are less than the newly assigned value, LOD/sqrt(2)
# count less than LOD/sqrt(2)
total_low_LOD_count$missingness2<-0
# individual count
for (x in 1:nrow(total_low_LOD_count)) {
total_low_LOD_count[x,3]<-length(na.omit(exprs_only[x][exprs_only[x] < assay_data_clean$LOD[x]/(sqrt(2))]))
}
# count
sum(total_low_LOD_count$missingness2)
## [1] 20
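Values can only remain below LOD/sqrt(2) after the replacement when the LOD is negative. In that case LOD/sqrt(2) is larger than the LOD itself, so a value that was at or above the LOD (and therefore left alone) can still sit below LOD/sqrt(2). A small worked example, using the LOD listed for OID00380 in the assay table above and made-up NPX values:

```r
# LOD for OID00380 as listed in the assay table; NPX values are hypothetical
lod <- -0.30886
npx <- c(-0.50, -0.25, 0.40)

# for a negative LOD the replacement value is ABOVE the LOD
replacement <- lod / sqrt(2)
replacement > lod

# same replacement rule as the loop above
npx_replaced <- replace(npx, npx < lod, replacement)

# -0.25 was not below the LOD, so it is kept, but it is below LOD/sqrt(2)
n_still_low <- sum(npx_replaced < replacement)
n_still_low
```

For a positive LOD the same check always returns 0, because every value left untouched is at least the LOD and LOD/sqrt(2) is smaller than the LOD.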
Replace expression data in clean data
#replace expression data
protein_data_clean[10:ncol(protein_data_clean)]<-exprs_only
#check columns still in order
all(colnames(protein_data_clean)[10:ncol(protein_data_clean)]==assay_data_clean$OLINK.ID)==T
## [1] TRUE
POST PROCESSING CHECKS
PCA
Performing PCA after data processing. PCA does not handle missing values, so proteins with missing values are removed.
# pca - remove NAs - PCA does not like them
pca_data_na_removed<-na.omit(as.data.frame(t(protein_data_clean[10:ncol(protein_data_clean)])))
# number of proteins remaining
nrow(pca_data_na_removed)
## [1] 173
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 27.672 5.9117 3.69170 1.92258 1.63432 1.46674 1.33791
## Proportion of Variance 0.905 0.0413 0.01611 0.00437 0.00316 0.00254 0.00212
## Cumulative Proportion 0.905 0.9463 0.96239 0.96676 0.96992 0.97246 0.97458
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.27822 1.18915 1.0893 1.03854 0.98079 0.95904 0.9211
## Proportion of Variance 0.00193 0.00167 0.0014 0.00127 0.00114 0.00109 0.0010
## Cumulative Proportion 0.97651 0.97818 0.9796 0.98086 0.98199 0.98308 0.9841
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.88753 0.88417 0.8244 0.78928 0.7684 0.70312 0.69719
## Proportion of Variance 0.00093 0.00092 0.0008 0.00074 0.0007 0.00058 0.00057
## Cumulative Proportion 0.98501 0.98594 0.9867 0.98748 0.9882 0.98876 0.98933
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.69037 0.66786 0.65778 0.62561 0.60703 0.59309 0.59044
## Proportion of Variance 0.00056 0.00053 0.00051 0.00046 0.00044 0.00042 0.00041
## Cumulative Proportion 0.98990 0.99042 0.99093 0.99140 0.99183 0.99225 0.99266
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.56691 0.56314 0.54632 0.53285 0.52758 0.51224 0.49714
## Proportion of Variance 0.00038 0.00037 0.00035 0.00034 0.00033 0.00031 0.00029
## Cumulative Proportion 0.99304 0.99342 0.99377 0.99410 0.99443 0.99474 0.99503
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.48182 0.46702 0.45864 0.43671 0.43017 0.42053 0.41861
## Proportion of Variance 0.00027 0.00026 0.00025 0.00023 0.00022 0.00021 0.00021
## Cumulative Proportion 0.99531 0.99557 0.99582 0.99604 0.99626 0.99647 0.99668
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.40589 0.39303 0.38133 0.36610 0.36004 0.35717 0.35053
## Proportion of Variance 0.00019 0.00018 0.00017 0.00016 0.00015 0.00015 0.00015
## Cumulative Proportion 0.99687 0.99705 0.99722 0.99738 0.99754 0.99769 0.99783
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.34295 0.33212 0.31615 0.31093 0.30086 0.2962 0.2927
## Proportion of Variance 0.00014 0.00013 0.00012 0.00011 0.00011 0.0001 0.0001
## Cumulative Proportion 0.99797 0.99810 0.99822 0.99833 0.99844 0.9985 0.9987
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.27543 0.27354 0.27234 0.25679 0.25171 0.24003 0.23846
## Proportion of Variance 0.00009 0.00009 0.00009 0.00008 0.00007 0.00007 0.00007
## Cumulative Proportion 0.99874 0.99882 0.99891 0.99899 0.99906 0.99913 0.99920
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.23449 0.23147 0.22204 0.21257 0.20800 0.20387 0.19496
## Proportion of Variance 0.00006 0.00006 0.00006 0.00005 0.00005 0.00005 0.00004
## Cumulative Proportion 0.99926 0.99933 0.99939 0.99944 0.99949 0.99954 0.99959
## PC71 PC72 PC73 PC74 PC75 PC76 PC77
## Standard deviation 0.19228 0.18889 0.17939 0.16721 0.16099 0.15834 0.15376
## Proportion of Variance 0.00004 0.00004 0.00004 0.00003 0.00003 0.00003 0.00003
## Cumulative Proportion 0.99963 0.99967 0.99971 0.99974 0.99977 0.99980 0.99983
## PC78 PC79 PC80 PC81 PC82 PC83 PC84
## Standard deviation 0.14866 0.13920 0.13673 0.12891 0.12542 0.11510 0.10475
## Proportion of Variance 0.00003 0.00002 0.00002 0.00002 0.00002 0.00002 0.00001
## Cumulative Proportion 0.99986 0.99988 0.99990 0.99992 0.99994 0.99996 0.99997
## PC85 PC86 PC87
## Standard deviation 0.09805 0.09393 0.09216
## Proportion of Variance 0.00001 0.00001 0.00001
## Cumulative Proportion 0.99998 0.99999 1.00000
# extract PC1 and PC2
PC1_plot<-pca_data$rotation[,1:2]
# add phenotype information to PC information
PC1_plot<-cbind(PC1_plot, protein_data_clean[c(1,3,4,5)])
PC1_plot$Group<-as.character(PC1_plot$Group)
# ggplot plot
ggplotly(ggplot(data=PC1_plot, aes(x=PC1, y=PC2, col=Group)) +
# scatterplot
geom_point() +
# add title to plot
ggtitle("PC1 vs PC2 by Group") +
# centre title
theme(plot.title = element_text(hjust = 0.5)))
plot by Ethnicity
# ggplot plot
ggplotly(ggplot(data=PC1_plot, aes(x=PC1, y=PC2, col=Etnicity)) +
# scatterplot
geom_point() +
# add title to plot
ggtitle("PC1 vs PC2 by Ethnicity") +
# centre title
theme(plot.title = element_text(hjust = 0.5)))
Ethnicity is unknown for all controls
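The call that creates `pca_data` is not shown. Because the code above binds `pca_data$rotation` to the sample information, the PCA appears to have been run with proteins as rows and samples as columns (an inference, not confirmed by the source). A commonly used alternative is to put samples in rows and take the sample coordinates from the scores (`$x`). A minimal sketch of that orientation on random toy data, with all object names hypothetical:

```r
set.seed(1)

# Toy expression matrix: 6 samples (rows) by 4 proteins (columns), no missing values
exprs_toy <- matrix(rnorm(6 * 4), nrow = 6,
                    dimnames = list(paste0("sample", 1:6), paste0("OID", 1:4)))

# PCA with samples as observations; proteins are centred but not scaled
pca_toy <- prcomp(exprs_toy, center = TRUE, scale. = FALSE)

# sample coordinates on the first two components
scores <- as.data.frame(pca_toy$x[, 1:2])
# hypothetical group labels, one per sample
scores$Group <- c("0", "0", "1", "1", "3", "3")

# proportion of variance explained per component
summary(pca_toy)$importance["Proportion of Variance", ]
head(scores)
```

In this orientation each protein is centred across samples, so the first component is not driven by differences in average level between proteins.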
plot by Gender
Expression boxplot
Data expression plot after QC
Basic boxplots
# format data
boxplot_data_post_qc<-stack(as.data.frame(t(protein_data_clean[10:ncol(protein_data_clean)])))
# add group
boxplot_data_post_qc<-merge(boxplot_data_post_qc, protein_data_clean[1], by.x="ind", by.y="row.names")
# format
colnames(boxplot_data_post_qc)<-c("Sample", "NPX", "Group")
# recode group
boxplot_data_post_qc$Group[boxplot_data_post_qc$Group=="0"]<-"Control"
boxplot_data_post_qc$Group[boxplot_data_post_qc$Group=="1"]<-"Mild"
boxplot_data_post_qc$Group[boxplot_data_post_qc$Group=="2"]<-"Moderate"
boxplot_data_post_qc$Group[boxplot_data_post_qc$Group=="3"]<-"Severe"
# plot
ggplotly(ggplot(data=boxplot_data_post_qc, aes(x = Sample, y = NPX, fill=Group)) +
geom_boxplot() +
ggtitle("Distribution of protein expression") +
theme(plot.title = element_text(hjust = 0.5)))
## Warning: Removed 182 rows containing non-finite values (stat_boxplot).
Density plot
Density plots
# by group
ggplot(boxplot_data_post_qc, aes(x=NPX, colour=Group)) +
geom_density() +
ggtitle("Expression density by Group") +
theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 182 rows containing non-finite values (stat_density).
#by sample
ggplotly(ggplot(boxplot_data_post_qc, aes(x=NPX, colour=Sample)) +
geom_density() +
ggtitle("Expression density by sample") +
theme(plot.title = element_text(hjust = 0.5), legend.position = "none"))
## Warning: Removed 182 rows containing non-finite values (stat_density).
POST QC
Create a function to merge differential expression results with protein IDs. It is also used in the correlation analysis, hence it is defined here.
# create protein mapping file to merge with DE results
protein_mapping<-assay_data_clean[c(1,2,3,4)]
# change rownames
rownames(protein_mapping)<-protein_mapping$OLINK.ID
protein_mapping$OLINK.ID<-NULL
# modify assay name
protein_mapping$Assay<-as.character(protein_mapping$Assay)
protein_mapping$Assay[protein_mapping$Assay=="Olink CARDIOVASCULAR II(v.5006)"]<-"Cardiovascular"
protein_mapping$Assay[protein_mapping$Assay=="Olink IMMUNE RESPONSE(v.3203)"]<-"Immune"
protein_mapping$Assay[protein_mapping$Assay=="Olink INFLAMMATION(v.3022)"]<-"Inflammation"
protein_mapping$Assay[protein_mapping$Assay=="Olink NEUROLOGY(v.8012)"]<-"Neurology"
#check table
table(protein_mapping$Assay)
##
## Cardiovascular Immune Inflammation Neurology
## 92 90 81 92
merge_protin_ID<-function(x){
# merge
temp<-merge(protein_mapping, x, by="row.names")
# move rownames
rownames(temp)<-temp$Row.names
temp$Row.names<-NULL
# reorder by most sig
temp<-temp[order(temp$adj.P.Val, -abs(temp$logFC)),]
return(temp)
}
Groups
Groups:
- 0 = control
- 1 = Mild symptoms
- 2 = Moderate symptoms
- 3 = Severe symptoms (ICU)
The number of samples per group:
##
## 0 1 2 3
## 28 26 9 24
Age
- summarise all into a table
Summary of age (all samples):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.52 47.42 59.67 59.18 71.41 87.29
Mean age with upper and lower confidence limits, by group (group labels matched to the means in the t-tests below; the last row matches the mean of groups 1-3 combined):
Group           upper      mean       lower
0 (control)     70.27734   63.10643   55.93552
1 (mild)        56.79125   51.34500   45.89875
2 (moderate)    74.63664   64.82222   55.00780
3 (severe)      65.79818   60.98125   56.16432
1-3 combined    60.84354   57.32068   53.79782
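The code that produced the upper/mean/lower values is not shown. A minimal sketch of one way to get a mean with confidence limits per group is given below, assuming a t-based 95% interval. The helper `mean_ci` and the toy data frame `age_toy` are hypothetical.

```r
# mean with t-based confidence limits, returned as upper / mean / lower
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(upper = m + half_width, mean = m, lower = m - half_width)
}

# Toy stand-in for the Group and Age columns (values are illustrative only)
age_toy <- data.frame(
  Group = c(0, 0, 0, 1, 1, 1),
  Age   = c(60, 65, 70, 45, 50, 55)
)

# one row per group
ci_by_group <- t(sapply(split(age_toy$Age, age_toy$Group), mean_ci))
ci_by_group
```

Collecting the result in one matrix gives the single summary table mentioned in the note above.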
ggplot(data=protein_data_clean, aes(x=as.character(Group), y=Age, fill=Group)) +
geom_boxplot() +
ggtitle("Distribution of age by group") +
theme(plot.title = element_text(hjust = 0.5))
T-test of Group 0 vs 1 (control vs mild)
# group 0 vs 1
t.test(protein_data_clean[protein_data_clean$Group==0,6], protein_data_clean[protein_data_clean$Group==1,6])
##
## Welch Two Sample t-test
##
## data: protein_data_clean[protein_data_clean$Group == 0, 6] and protein_data_clean[protein_data_clean$Group == 1, 6]
## t = 2.6837, df = 49.31, p-value = 0.009886
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.955688 20.567170
## sample estimates:
## mean of x mean of y
## 63.10643 51.34500
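The same Welch test is repeated below for every other pair of groups. A minimal sketch of running all pairwise comparisons in one go and collecting the p-values in a table is shown here on toy data. The data frame `age_toy` and the result object `age_tests` are hypothetical, and no multiple-testing correction is applied.

```r
# Toy stand-in for the Group and Age columns (values are illustrative only)
age_toy <- data.frame(
  Group = rep(c(0, 1, 2), each = 3),
  Age   = c(60, 65, 70, 45, 50, 55, 62, 66, 71)
)

# all pairs of groups, one pair per column
group_pairs <- combn(sort(unique(age_toy$Group)), 2)

# Welch two-sample t-test (the t.test default) for each pair
age_tests <- t(apply(group_pairs, 2, function(p) {
  tt <- t.test(age_toy$Age[age_toy$Group == p[1]],
               age_toy$Age[age_toy$Group == p[2]])
  c(group_a = p[1], group_b = p[2], t = unname(tt$statistic), p_value = tt$p.value)
}))
age_tests
```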
T-test of Group 0 vs 2 (control vs moderate)
# group 0 vs 2
t.test(protein_data_clean[protein_data_clean$Group==0,6], protein_data_clean[protein_data_clean$Group==2,6])
##
## Welch Two Sample t-test
##
## data: protein_data_clean[protein_data_clean$Group == 0, 6] and protein_data_clean[protein_data_clean$Group == 2, 6]
## t = -0.31156, df = 19.764, p-value = 0.7586
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -13.212180 9.780593
## sample estimates:
## mean of x mean of y
## 63.10643 64.82222
T-test of Group 0 vs 3 (control vs severe)
# group 0 vs 3
t.test(protein_data_clean[protein_data_clean$Group==0,6], protein_data_clean[protein_data_clean$Group==3,6])
##
## Welch Two Sample t-test
##
## data: protein_data_clean[protein_data_clean$Group == 0, 6] and protein_data_clean[protein_data_clean$Group == 3, 6]
## t = 0.50605, df = 45.716, p-value = 0.6153
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.329499 10.579856
## sample estimates:
## mean of x mean of y
## 63.10643 60.98125
T-test of Group 1 vs 2 (mild vs moderate)
# group 1 vs 2
t.test(protein_data_clean[protein_data_clean$Group==1,6], protein_data_clean[protein_data_clean$Group==2,6])
##
## Welch Two Sample t-test
##
## data: protein_data_clean[protein_data_clean$Group == 1, 6] and protein_data_clean[protein_data_clean$Group == 2, 6]
## t = -2.6897, df = 14.67, p-value = 0.01705
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -24.178172 -2.776273
## sample estimates:
## mean of x mean of y
## 51.34500 64.82222
T-test of Group 1 vs 3 (mild vs severe)
# group 1 vs 3
t.test(protein_data_clean[protein_data_clean$Group==1,6], protein_data_clean[protein_data_clean$Group==3,6])
##
## Welch Two Sample t-test
##
## data: protein_data_clean[protein_data_clean$Group == 1, 6] and protein_data_clean[protein_data_clean$Group == 3, 6]
## t = -2.7349, df = 47.656, p-value = 0.008736
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -16.722004 -2.550496
## sample estimates:
## mean of x mean of y
## 51.34500 60.98125
T-test of Group 2 vs 3 (moderate vs severe)
# group 2 vs 3
t.test(protein_data_clean[protein_data_clean$Group==2,6], protein_data_clean[protein_data_clean$Group==3,6])
##
## Welch Two Sample t-test
##
## data: protein_data_clean[protein_data_clean$Group == 2, 6] and protein_data_clean[protein_data_clean$Group == 3, 6]
## t = 0.79173, df = 13.098, p-value = 0.4426
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.631821 14.313766
## sample estimates:
## mean of x mean of y
## 64.82222 60.98125
T-test of Group 0 vs 1,2,3 (control vs case)
# group 0 vs 1,2,3
t.test(protein_data_clean[protein_data_clean$Group==0,6], protein_data_clean[protein_data_clean$Group %in% c(1,2,3),6])
##
## Welch Two Sample t-test
##
## data: protein_data_clean[protein_data_clean$Group == 0, 6] and protein_data_clean[protein_data_clean$Group %in% c(1, 2, 3), 6]
## t = 1.4786, df = 41.196, p-value = 0.1469
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.115556 13.687058
## sample estimates:
## mean of x mean of y
## 63.10643 57.32068
Summary of Age
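- Group 1 (mild) is significantly younger than group 0 (p = 0.0099), group 2 (p = 0.017) and group 3 (p = 0.0087); no other pairwise comparison is significant (p-values unadjusted)
- No significant difference when comparing controls against all cases grouped (p = 0.147)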
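The six pairwise comparisons above can also be collected into a single table. The sketch below uses simulated ages in a toy data frame (`toy` is hypothetical and only mimics the group sizes), runs a Welch t-test for every pair of groups and adds a Holm adjustment for the six comparisons. The adjustment is an optional extra and was not part of the analysis above.

```r
# Toy data standing in for protein_data_clean (ages are simulated, not real)
set.seed(1)
toy <- data.frame(
  Group = rep(0:3, times = c(28, 26, 9, 24)),
  Age   = c(rnorm(28, 63, 15), rnorm(26, 51, 13), rnorm(9, 65, 13), rnorm(24, 61, 11))
)

# every unordered pair of groups (one pair per column)
group_pairs <- combn(sort(unique(toy$Group)), 2)

# Welch t-test per pair, one row of results per comparison
age_tests <- do.call(rbind, lapply(seq_len(ncol(group_pairs)), function(i) {
  g1 <- group_pairs[1, i]
  g2 <- group_pairs[2, i]
  tt <- t.test(toy$Age[toy$Group == g1], toy$Age[toy$Group == g2])
  data.frame(comparison = paste(g1, "vs", g2),
             mean_x     = unname(tt$estimate[1]),
             mean_y     = unname(tt$estimate[2]),
             p.value    = tt$p.value)
}))

# optional: adjust for the six comparisons
age_tests$p.adj <- p.adjust(age_tests$p.value, method = "holm")
age_tests
```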
Gender
- summarise all into a table (Fisher is better for smaller numbers)
summary of Gender
##
## F M
## 31 56
##
## 0 1 2 3
## F 13 13 3 2
## M 15 13 6 22
Fisher test of Group 0 vs 1 (control vs mild)
# extract
group_0_vs_1_GG<-protein_data_clean[protein_data_clean$Group==0 | protein_data_clean$Group==1,]
# test
fisher.test(table(group_0_vs_1_GG$Group, group_0_vs_1_GG$Gender))
##
## Fisher's Exact Test for Count Data
##
## data: table(group_0_vs_1_GG$Group, group_0_vs_1_GG$Gender)
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.2609774 2.8733805
## sample estimates:
## odds ratio
## 0.8689712
Fisher test of Group 0 vs 2 (control vs moderate)
# extract
group_0_vs_2_GG<-protein_data_clean[protein_data_clean$Group==0 | protein_data_clean$Group==2,]
# test
fisher.test(table(group_0_vs_2_GG$Group, group_0_vs_2_GG$Gender))
##
## Fisher's Exact Test for Count Data
##
## data: table(group_0_vs_2_GG$Group, group_0_vs_2_GG$Gender)
## p-value = 0.7023
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.2908013 12.7086492
## sample estimates:
## odds ratio
## 1.708209
Fisher test of Group 0 vs 3 (control vs severe)
# extract
group_0_vs_3_GG<-protein_data_clean[protein_data_clean$Group==0 | protein_data_clean$Group==3,]
# test
fisher.test(table(group_0_vs_3_GG$Group, group_0_vs_3_GG$Gender))
##
## Fisher's Exact Test for Count Data
##
## data: table(group_0_vs_3_GG$Group, group_0_vs_3_GG$Gender)
## p-value = 0.004729
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.696152 95.007720
## sample estimates:
## odds ratio
## 9.132736
Fisher test of Group 1 vs 2 (mild vs moderate)
# extract
group_1_vs_2_GG<-protein_data_clean[protein_data_clean$Group==1 | protein_data_clean$Group==2,]
# test
fisher.test(table(group_1_vs_2_GG$Group, group_1_vs_2_GG$Gender))
##
## Fisher's Exact Test for Count Data
##
## data: table(group_1_vs_2_GG$Group, group_1_vs_2_GG$Gender)
## p-value = 0.4605
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.3294235 14.7840643
## sample estimates:
## odds ratio
## 1.961239
Fisher test of Group 1 vs 3 (mild vs severe)
# extract
group_1_vs_3_GG<-protein_data_clean[protein_data_clean$Group==1 | protein_data_clean$Group==3,]
# test
fisher.test(table(group_1_vs_3_GG$Group, group_1_vs_3_GG$Gender))
##
## Fisher's Exact Test for Count Data
##
## data: table(group_1_vs_3_GG$Group, group_1_vs_3_GG$Gender)
## p-value = 0.001765
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.916874 110.135285
## sample estimates:
## odds ratio
## 10.46825
Fisher test of Group 2 vs 3 (moderate vs severe)
# extract
group_2_vs_3_GG<-protein_data_clean[protein_data_clean$Group==2 | protein_data_clean$Group==3,]
# test
fisher.test(table(group_2_vs_3_GG$Group, group_2_vs_3_GG$Gender))
##
## Fisher's Exact Test for Count Data
##
## data: table(group_2_vs_3_GG$Group, group_2_vs_3_GG$Gender)
## p-value = 0.111
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.4776759 75.1620727
## sample estimates:
## odds ratio
## 5.15309
Fisher test of Group 0 vs 1,2,3 (control vs case)
# extract
case_vs_control_GG<-protein_data_clean
case_vs_control_GG$Group<-ifelse(case_vs_control_GG$Group==0, "control", "case")
# test
fisher.test(table(case_vs_control_GG$Group, case_vs_control_GG$Gender))
##
## Fisher's Exact Test for Count Data
##
## data: table(case_vs_control_GG$Group, case_vs_control_GG$Gender)
## p-value = 0.1596
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.1820679 1.4254448
## sample estimates:
## odds ratio
## 0.5107209
Summary of Gender
- Group 3 (severe) has a significantly higher proportion of males than group 0 (p = 0.0047) and group 1 (p = 0.0018)
- No significant difference when comparing controls against all cases grouped (p = 0.16)
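As a cross-check, the pairwise Fisher tests can be run directly on the gender-by-group counts printed in the table above, without subsetting the data frame for each pair. This is only a sketch: `gender_counts`, `group_pairs` and `fisher_p` are names introduced here for illustration.

```r
# Gender counts per group, copied from the gender-by-group table above
gender_counts <- matrix(c(13, 13, 3,  2,
                          15, 13, 6, 22),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(Gender = c("F", "M"),
                                        Group  = c("0", "1", "2", "3")))

# every unordered pair of groups (one pair per column)
group_pairs <- combn(colnames(gender_counts), 2)

# Fisher's exact test on the 2 x 2 sub-table of each pair
fisher_p <- apply(group_pairs, 2, function(p) fisher.test(gender_counts[, p])$p.value)
names(fisher_p) <- apply(group_pairs, 2, paste, collapse = " vs ")
round(fisher_p, 4)
```

Because the counts come from the same table, the p-values should reproduce those of the individual tests above.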
Ethnicity
Ethnicity by group
##
## 0 1 2 3
## 28 0 0 0
## African black 0 0 0 1
## Cauc 0 20 8 23
## Persian 0 6 1 0
Number of duplicate samples
Check if we removed any duplicates
# count duplicate by pat.code
duplicate_count_after_qc<-as.data.frame(table(protein_data_clean$Pat.Code))
# count where more than 1
nrow(subset(duplicate_count_after_qc, Freq > 1))
## [1] 11
# duplicate samples
duplicate_samples_after_qc<-protein_data_clean[protein_data_clean$Pat.Code %in% subset(duplicate_count_after_qc, Freq > 1)$Var1,]
# freq by group
table(duplicate_samples_after_qc[1])
##
## 1 3
## 12 10
Summary of duplicates after QC
- one set of duplicates lost from group3.
- Group1: 12 samples (6 patients)
- Group3: 10 samples (5 patients)
Number of duplicate proteins
Some proteins are measured on more than one assay panel
## Assay Gene.ID Uniprot.ID OLINK.ID LOD
## 1 Olink CARDIOVASCULAR II(v.5006) BMP-6 P22004 OID00379 0.51245
## 2 Olink CARDIOVASCULAR II(v.5006) ANGPT1 Q15389 OID00380 -0.30886
## 3 Olink CARDIOVASCULAR II(v.5006) ADM P35318 OID00381 1.17289
## 4 Olink CARDIOVASCULAR II(v.5006) CD40-L P29965 OID00382 0.26247
## 5 Olink CARDIOVASCULAR II(v.5006) SLAMF7 Q9NQ25 OID00383 1.62544
## 6 Olink CARDIOVASCULAR II(v.5006) PGF P49763 OID00384 0.33637
## [1] 355
## [1] 344
## [1] 344
Time difference for longitudinal data
Calculating time difference between longitudinal data in group 1
# separate by group and check difference in Days.since.onset
group1_duplicates_after_qc_temp2<-duplicate_samples_after_qc[duplicate_samples_after_qc$Group==1,][c(2,3)]
# data frame to store data
group1_duplicates_after_qc<-as.data.frame(matrix(nrow=6, ncol=3))
# collapse by duplicate sample ID
temp_counter=1
for (x in seq(1,12,2)) {
group1_duplicates_after_qc[temp_counter,c(1,2)]<-group1_duplicates_after_qc_temp2[c(x),]
group1_duplicates_after_qc[temp_counter,3]<-group1_duplicates_after_qc_temp2[c(x+1),2]
temp_counter=temp_counter+1
}
# remove temp files
rm(temp_counter, group1_duplicates_after_qc_temp2)
# add colnames
colnames(group1_duplicates_after_qc)<-c("Pat.Code", "1st_sample", "2nd_sample")
# calculate diff
group1_duplicates_after_qc$time_diff<-as.numeric(group1_duplicates_after_qc$`2nd_sample`)- as.numeric(group1_duplicates_after_qc$`1st_sample`)
# table
datatable(group1_duplicates_after_qc)
This is the time difference (days) between repeat samples in group 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 14.25 17.00 16.17 19.00 20.00
Calculating time difference between longitudinal data in group 3
# separate by group and check difference in Days.since.onset
group3_duplicates_after_qc_temp<-duplicate_samples_after_qc[duplicate_samples_after_qc$Group==3,][c(2,3)]
# data frame to store data
group3_duplicates_after_qc<-as.data.frame(matrix(nrow=5, ncol=3))
# collapse by duplicate sample ID
temp_counter=1
for (x in seq(1,10,2)) {
group3_duplicates_after_qc[temp_counter,c(1,2)]<-group3_duplicates_after_qc_temp[c(x),]
group3_duplicates_after_qc[temp_counter,3]<-group3_duplicates_after_qc_temp[c(x+1),2]
temp_counter=temp_counter+1
}
# remove temp files
rm(temp_counter, group3_duplicates_after_qc_temp)
# add colnames
colnames(group3_duplicates_after_qc)<-c("Pat.Code", "1st_sample", "2nd_sample")
# calculate diff
group3_duplicates_after_qc$time_diff<-as.numeric(group3_duplicates_after_qc$`2nd_sample`)- as.numeric(group3_duplicates_after_qc$`1st_sample`)
# table
datatable(group3_duplicates_after_qc)
This is the time difference (days) between repeat samples in group 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 2.0 2.0 2.8 4.0 4.0
Summary
11 patients were sampled twice:
- 6 from group 1 (mild)
- 5 from group 3 (severe)
- Group 1: mean time between repeated samples is 16 days
- Group 3: mean time between repeated samples is 3 days (2.8)
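The loops above assume that each patient's two samples sit in adjacent rows, first sample first. A sketch of an order-independent alternative is shown below on invented data. `toy` is hypothetical, and the earliest and latest `Days.since.onset` per patient are taken to be the first and second sample.

```r
# Toy longitudinal data: two samples per patient, rows deliberately shuffled (values invented)
toy <- data.frame(
  Pat.Code         = c(101, 105, 101, 111, 105, 111),
  Days.since.onset = c(8, 10, 25, 7, 24, 20)
)

# earliest and latest sampling day per patient, regardless of row order
first_day <- tapply(toy$Days.since.onset, toy$Pat.Code, min)
last_day  <- tapply(toy$Days.since.onset, toy$Pat.Code, max)

time_diff <- data.frame(
  Pat.Code      = names(first_day),
  first_sample  = as.numeric(first_day),
  second_sample = as.numeric(last_day),
  time_diff     = as.numeric(last_day - first_day)
)
time_diff
summary(time_diff$time_diff)
```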
Summary of demographics in longitudinal patients
Demographics of the patients with repeated samples (one row per patient)
group1_duplicates_demographics<-subset(protein_data_clean, protein_data_clean$Pat.Code %in% group1_duplicates_after_qc$Pat.Code)[c(2, 4,5,6)]
group1_duplicates_demographics<-group1_duplicates_demographics[!duplicated(group1_duplicates_demographics$Pat.Code),]
#age
CI(group1_duplicates_demographics$Age, ci=0.95)
## upper mean lower
## 66.16550 51.48167 36.79783
## Pat.Code Gender Etnicity Age
## sample1 101 M Cauc 71.56
## sample6 105 F Cauc 60.36
## sample13 111 F Cauc 55.44
## sample15 112 M Cauc 48.02
## sample19 115 M Persian 40.97
## sample24 119 F Persian 32.54
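95% CI of days since onset at 1st sample (group 1)
## upper mean lower
## 13.587328 8.500000 3.412672
95% CI of days since onset at 2nd sample (group 1)
## upper mean lower
## 27.12034 24.66667 22.21299
95% CI of time difference between samples (group 1)
## upper mean lower
## 20.22659 16.16667 12.10674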
group3_duplicates_demographics<-subset(protein_data_clean, protein_data_clean$Pat.Code %in% group3_duplicates_after_qc$Pat.Code)[c(2, 4,5,6)]
group3_duplicates_demographics<-group3_duplicates_demographics[!duplicated(group3_duplicates_demographics$Pat.Code),]
# age
CI(group3_duplicates_demographics$Age, ci=0.95)
## upper mean lower
## 73.40117 59.59000 45.77883
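95% CI of days since onset at 1st sample (group 3)
## upper mean lower
## 11.258525 8.400000 5.541475
95% CI of days since onset at 2nd sample (group 3)
## upper mean lower
## 15.067141 11.200000 7.332859
95% CI of time difference between samples (group 3)
## upper mean lower
## 4.160175 2.800000 1.439825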
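Days since onset (longitudinal samples included)
95% CI of mean days since onset, group 0 (control; no onset date, so NA)
## upper mean lower
## NA NA NA
95% CI of mean days since onset, group 1 (mild)
## upper mean lower
## 24.49607 20.88462 17.27316
95% CI of mean days since onset, group 2 (moderate)
## upper mean lower
## 13.71879 12.00000 10.28121
95% CI of mean days since onset, group 3 (severe)
## upper mean lower
## 13.008426 11.041667 9.074907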
# all cases (controls excluded)
CI(subset(protein_data_clean, !(protein_data_clean$Group==0))$Days.since.onset, ci=0.95)
## upper mean lower
## 17.65654 15.52542 13.39431
Summary
Exploration of days since onset. Controls have no onset date (28 NA's) and are set to 0 below.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 8.50 13.00 15.53 22.00 35.00 28
# set controls to 0
protein_data_clean[is.na(protein_data_clean$Days.since.onset),3]<-0
#plot
ggplot(data=protein_data_clean, aes(x=as.character(Group), y=Days.since.onset, fill=Group)) +
geom_boxplot() +
ggtitle("Distribution of onset_days by group") +
theme(plot.title = element_text(hjust = 0.5))
T-test of Group 1 vs 2 (mild vs moderate)
# group 1 vs 2
t.test(protein_data_clean[protein_data_clean$Group==1,3], protein_data_clean[protein_data_clean$Group==2,3])
##
## Welch Two Sample t-test
##
## data: protein_data_clean[protein_data_clean$Group == 1, 3] and protein_data_clean[protein_data_clean$Group == 2, 3]
## t = 4.663, df = 31.624, p-value = 5.406e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.001707 12.767524
## sample estimates:
## mean of x mean of y
## 20.88462 12.00000
T-test of Group 2 vs 3 (moderate vs severe)
# group 2 vs 3
t.test(protein_data_clean[protein_data_clean$Group==2,3], protein_data_clean[protein_data_clean$Group==3,3])
##
## Welch Two Sample t-test
##
## data: protein_data_clean[protein_data_clean$Group == 2, 3] and protein_data_clean[protein_data_clean$Group == 3, 3]
## t = 0.79327, df = 28.744, p-value = 0.4341
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.513432 3.430098
## sample estimates:
## mean of x mean of y
## 12.00000 11.04167
T-test of Group 1 vs 3 (mild vs severe)
# group 1 vs 3
t.test(protein_data_clean[protein_data_clean$Group==1,3], protein_data_clean[protein_data_clean$Group==3,3])
##
## Welch Two Sample t-test
##
## data: protein_data_clean[protein_data_clean$Group == 1, 3] and protein_data_clean[protein_data_clean$Group == 3, 3]
## t = 4.9346, df = 38.265, p-value = 1.604e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.80584 13.88006
## sample estimates:
## mean of x mean of y
## 20.88462 11.04167
Summary of Days since onset
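- Group 1 (mild) was sampled significantly later after onset (mean 20.9 days) than group 2 (mean 12.0 days, p = 5.4e-05) and group 3 (mean 11.0 days, p = 1.6e-05)
- No significant difference in days since onset between group 2 and group 3 (p = 0.43)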
Correlation days since onset with severity
Is there any correlation between days since onset and protein expression?
First check the correlation between days since onset and disease severity, as the distribution of days since onset differed between severity groups.
Spearman's correlation is used because severity is coded as 0, 1, 2, 3 and is an ordinal variable, whereas days since onset and protein expression are continuous.
days_since_onset_cor<-rcorr(as.numeric(protein_data_clean[,1]), as.numeric(protein_data_clean[,3]), type="spearman")
days_since_onset_cor$P[2]
## [1] 8.055056e-08
## [1] 0.5373206
Days since onset is correlated with disease severity. However, this may be driven by group 1.
correlation of days of onset within group 1
#create subset of group 1
group1_only<-(subset(protein_data_clean, protein_data_clean$Group==1))
# group1 exprs only
group1_exprs_only<-group1_only[10:ncol(group1_only)]
# empty dataframe for results
grp1_cor_days<-as.data.frame(matrix(nrow=nrow(assay_data_clean), ncol=2))
# colnames
colnames(grp1_cor_days)<-c("cor", "p-value")
# for each protein perform correlation analysis against days since onset (column 3) - use expression data only
for (x in 1:ncol(group1_exprs_only)) {
# add protein name
rownames(grp1_cor_days)[x]<-colnames(group1_exprs_only)[x]
# perform test
# NOTE: Pearson is used for group 1, whereas groups 2 and 3 use Spearman
cor_results_temp<-rcorr(group1_only[,3], group1_exprs_only[,x], type="pearson")
#extract cor
grp1_cor_days[x,1]<-cor_results_temp$r[2]
#extract p -value
grp1_cor_days[x,2]<-cor_results_temp$P[2]
}
# check
head(grp1_cor_days)
## cor p-value
## OID00379 0.326065145 0.10402926
## OID00380 0.328589705 0.10122181
## OID00381 0.024365172 0.90595237
## OID00382 0.457958610 0.01864101
## OID00383 -0.285302193 0.15772128
## OID00384 -0.003808262 0.98526918
# sort by p-value, then by absolute correlation
grp1_cor_days<-grp1_cor_days[order(grp1_cor_days$`p-value`, -abs(grp1_cor_days$cor)),]
# plot correlation against p-value
plot(grp1_cor_days[,1], grp1_cor_days[,2])
## [1] 51
#results table
datatable(grp1_cor_days)
How many are positively/negatively correlated
## [1] 24
## [1] 27
Plot within group 1
ggplot(data=group1_only[c(3,(grep("OID01018", colnames(group1_only))))],
aes(x=Days.since.onset, y=OID01018)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
plot in all groups (minus control as they are zero)
# create cases only group
cases_only<-protein_data_clean[protein_data_clean$Group!=0,]
# plot
ggplot(data=cases_only[c(3,(grep("OID01018", colnames(cases_only))))],
aes(x=Days.since.onset, y=OID01018)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
correlation of days of onset within group 2
#create subset of group 2
group2_only<-(subset(protein_data_clean, protein_data_clean$Group==2))
# group2 exprs only
group2_exprs_only<-group2_only[10:ncol(group2_only)]
# empty dataframe for results
grp2_cor_days<-as.data.frame(matrix(nrow=nrow(assay_data_clean), ncol=2))
# colnames
colnames(grp2_cor_days)<-c("cor", "p-value")
# for each protein perform correlation analysis against days since onset (column 3) - use expression data only
for (x in 1:ncol(group2_exprs_only)) {
# add protein name
rownames(grp2_cor_days)[x]<-colnames(group2_exprs_only)[x]
# perform test
cor_results_temp<-rcorr(group2_only[,3], group2_exprs_only[,x], type="spearman")
#extract cor
grp2_cor_days[x,1]<-cor_results_temp$r[2]
#extract p -value
grp2_cor_days[x,2]<-cor_results_temp$P[2]
}
# check
head(grp2_cor_days)
## cor p-value
## OID00379 0.3697610 0.3273605
## OID00380 0.5126231 0.1582071
## OID00381 0.5042195 0.1663127
## OID00382 0.2521097 0.5128372
## OID00383 -0.5378341 0.1352889
## OID00384 0.2689171 0.4841161
# sort by p-value, then by absolute correlation
grp2_cor_days<-grp2_cor_days[order(grp2_cor_days$`p-value`, -abs(grp2_cor_days$cor)),]
# plot correlation against p-value
plot(grp2_cor_days[,1], grp2_cor_days[,2])
## [1] 12
#results table
datatable(grp2_cor_days)
How many are positively/negatively correlated
## [1] 8
## [1] 4
plot
ggplot(data=group2_only[c(3,(grep("OID00968", colnames(group2_only))))],
aes(x=Days.since.onset, y=OID00968)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
plot in all cases
# plot
ggplot(data=cases_only[c(3,(grep("OID00968", colnames(cases_only))))],
aes(x=Days.since.onset, y=OID00968)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
correlation of days of onset within group 3
#create subset of group 3
group3_only<-(subset(protein_data_clean, protein_data_clean$Group==3))
# group3 exprs only
group3_exprs_only<-group3_only[10:ncol(group3_only)]
# empty dataframe for results
grp3_cor_days<-as.data.frame(matrix(nrow=nrow(assay_data_clean), ncol=2))
# colnames
colnames(grp3_cor_days)<-c("cor", "p-value")
# for each protein perform correlation analysis against days since onset (column 3) - use expression data only
for (x in 1:ncol(group3_exprs_only)) {
# add protein name
rownames(grp3_cor_days)[x]<-colnames(group3_exprs_only)[x]
# perform test
cor_results_temp<-rcorr(group3_only[,3], group3_exprs_only[,x], type="spearman")
#extract cor
grp3_cor_days[x,1]<-cor_results_temp$r[2]
#extract p -value
grp3_cor_days[x,2]<-cor_results_temp$P[2]
}
# check
head(grp3_cor_days)
## cor p-value
## OID00379 0.30015918 0.1541338
## OID00380 -0.06213208 0.7730304
## OID00381 0.06957042 0.7466791
## OID00382 -0.15401754 0.4724157
## OID00383 0.25421645 0.2306313
## OID00384 0.08619732 0.6888009
# sort by p-value, then by absolute correlation
grp3_cor_days<-grp3_cor_days[order(grp3_cor_days$`p-value`, -abs(grp3_cor_days$cor)),]
# plot correlation against p-value
plot(grp3_cor_days[,1], grp3_cor_days[,2])
## [1] 49
#results table
datatable(grp3_cor_days)
How many are positively/negatively correlated
## [1] 23
## [1] 26
ggplot(data=group3_only[c(3,(grep("OID01021", colnames(group3_only))))],
aes(x=Days.since.onset, y=OID01021)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
plot in all cases
# plot
ggplot(data=cases_only[c(3,(grep("OID01021", colnames(cases_only))))],
aes(x=Days.since.onset, y=OID01021)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Check overlap of correlated proteins
# move protein name column + add analysis
grp1_cor_days$OLINK.ID<-rownames(grp1_cor_days)
grp2_cor_days$OLINK.ID<-rownames(grp2_cor_days)
grp3_cor_days$OLINK.ID<-rownames(grp3_cor_days)
grp1_cor_days$analysis<-"grp1"
grp2_cor_days$analysis<-"grp2"
grp3_cor_days$analysis<-"grp3"
# merge
grp_cor_merged<-rbind((subset(grp1_cor_days, grp1_cor_days$`p-value`<0.05)),
(subset(grp2_cor_days, grp2_cor_days$`p-value`<0.05)),
(subset(grp3_cor_days, grp3_cor_days$`p-value`<0.05)))
# which grps are they
grp_cor_merged_frq<-(as.data.frame(table(grp_cor_merged$OLINK.ID)))
grp_cor_merged_frq<-grp_cor_merged_frq[grp_cor_merged_frq$Freq>1,]
# subset to duplicates
grp_cor_merged_dups<-subset(grp_cor_merged, grp_cor_merged$OLINK.ID %in% grp_cor_merged_frq$Var1)
grp_cor_merged_dups<-grp_cor_merged_dups[order(grp_cor_merged_dups$`p-value`),]
head(grp_cor_merged_dups)
## cor p-value OLINK.ID analysis
## OID01018 -0.6158719 0.001047063 OID01018 grp1
## OID004511 -0.6121322 0.001477203 OID00451 grp3
## OID003401 -0.6112571 0.001507046 OID00340 grp3
## OID00528 -0.5738006 0.002177467 OID00528 grp1
## OID005281 -0.5941927 0.002201023 OID00528 grp3
## OID00340 -0.5801698 0.002364417 OID00340 grp1
##
## grp1 grp2 grp3
## OID00340 1 0 1
## OID00351 1 0 1
## OID00361 1 0 1
## OID00362 0 1 1
## OID00432 0 1 1
## OID00451 1 0 1
## OID00457 0 1 1
## OID00528 1 0 1
## OID00945 1 0 1
## OID00993 1 0 1
## OID01001 1 1 0
## OID01018 1 0 1
## OID01019 1 0 1
# get gene name - match on the OLINK.ID column, as rbind makes repeated row names unique (e.g. OID004511)
subset(assay_data_clean, assay_data_clean$OLINK.ID %in% grp_cor_merged_dups$OLINK.ID)
## Assay Gene.ID Uniprot.ID OLINK.ID LOD
## 54 Olink CARDIOVASCULAR II(v.5006) HO-1 P09601 OID00432 0.93744
## 73 Olink CARDIOVASCULAR II(v.5006) FABP2 P12104 OID00451 1.04637
## 79 Olink CARDIOVASCULAR II(v.5006) ACE2 Q9BYF1 OID00457 0.78739
## 102 Olink IMMUNE RESPONSE(v.3203) IRF9 Q00978 OID00945 0.83330
## 150 Olink IMMUNE RESPONSE(v.3203) IL10 P22301 OID00993 1.58493
## 158 Olink IMMUNE RESPONSE(v.3203) FAM3B P58499 OID01001 0.79853
## 175 Olink IMMUNE RESPONSE(v.3203) DDX58 O95786 OID01018 0.79252
## 176 Olink IMMUNE RESPONSE(v.3203) IL12RB1 P42701 OID01019 1.12822
## 242 Olink INFLAMMATION(v.3022) IL10 P22301 OID00528 1.48617
## 330 Olink NEUROLOGY(v.8012) G-CSF P09919 OID00340 1.34264
## 341 Olink NEUROLOGY(v.8012) BMP-4 P12644 OID00351 1.72820
## 351 Olink NEUROLOGY(v.8012) N-CDase Q9NR71 OID00361 -1.12374
## 352 Olink NEUROLOGY(v.8012) NAAA Q02083 OID00362 0.72062
Plot the most significant protein found in more than one group
ggplot(data=cases_only[c(3,(grep("OID01018", colnames(cases_only))))],
aes(x=Days.since.onset, y=OID01018)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Summary of days since onset correlation
- 51 proteins significantly correlated with days in group 1 (24 + 27 by direction, per the counts above)
- 12 proteins significantly correlated with days in group 2
- 49 proteins significantly correlated with days in group 3
- 13 proteins are significantly correlated with days in more than one group
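The per-protein loops above can be expressed as one `apply()` over the expression columns. The sketch below uses simulated data and base R `cor.test()` rather than `rcorr()`. `days`, `exprs` and the `PROT1` to `PROT5` IDs are made up, and the Benjamini-Hochberg column is an optional extra that the counts above do not use.

```r
set.seed(42)
n <- 26
# simulated days since onset for one group
days <- sample(5:35, n, replace = TRUE)

# simulated NPX matrix: samples in rows, hypothetical protein IDs in columns
exprs <- matrix(rnorm(n * 5), nrow = n,
                dimnames = list(NULL, paste0("PROT", 1:5)))
# let the first protein fall with time so there is something to detect
exprs[, "PROT1"] <- -0.1 * days + rnorm(n, sd = 0.3)

# Spearman correlation of one protein with days; exact = FALSE avoids tie warnings
cor_one <- function(y) {
  ct <- cor.test(days, y, method = "spearman", exact = FALSE)
  c(cor = unname(ct$estimate), p.value = ct$p.value)
}

# one row per protein: coefficient, raw p-value and BH-adjusted p-value
cor_days <- as.data.frame(t(apply(exprs, 2, cor_one)))
cor_days$p.adj <- p.adjust(cor_days$p.value, method = "BH")
cor_days[order(cor_days$p.value, -abs(cor_days$cor)), ]
```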
days since onset - correlation with controls
MISLEADING: severe and mild cases sit at opposite ends of the Days.since.onset spectrum, so any protein that is differentially expressed (DE) between them will also be correlated with days. It is better to test within groups to see the effect of days on protein expression.
This analysis includes controls, whose Days.since.onset is set to 0.
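A small simulation illustrates the concern. In this sketch, with invented numbers, expression depends only on the group and not on days, yet pooling the groups still produces a clear correlation with days while the within-group correlations stay near zero.

```r
set.seed(7)
n <- 200  # large simulated groups so the pattern is unambiguous

# days and NPX are generated independently within each group
mild   <- data.frame(days = rnorm(n, 21, 3), npx = rnorm(n, 5, 1))
severe <- data.frame(days = rnorm(n, 11, 3), npx = rnorm(n, 7, 1))
pooled <- rbind(mild, severe)

# Spearman correlation between days and expression
rho <- function(d) cor(d$days, d$npx, method = "spearman")
c(pooled = rho(pooled), mild = rho(mild), severe = rho(severe))
```

The pooled coefficient reflects the difference between groups, not a time effect.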
# copy dataset and change to case vs control
control_vs_case_dataset<-protein_data_clean
# change to case and control
control_vs_case_dataset$Group<-ifelse(control_vs_case_dataset$Group==0, 0, 1)
# exprs only
control_vs_case_dataset_exprs_onlly<-control_vs_case_dataset[10:ncol(control_vs_case_dataset)]
# empty dataframe for results
control_vs_case_cor_days<-as.data.frame(matrix(nrow=nrow(assay_data_clean), ncol=2))
# colnames
colnames(control_vs_case_cor_days)<-c("cor", "p-value")
# for each protein perform correlation analysis against days since onset (column 3, controls set to 0) - use expression data only
for (x in 1:ncol(control_vs_case_dataset_exprs_onlly)) {
# add protein name
rownames(control_vs_case_cor_days)[x]<-colnames(control_vs_case_dataset_exprs_onlly)[x]
# perform test
cor_results_temp<-rcorr(control_vs_case_dataset[,3], control_vs_case_dataset_exprs_onlly[,x], type="spearman")
#extract cor
control_vs_case_cor_days[x,1]<-cor_results_temp$r[2]
#extract p -value
control_vs_case_cor_days[x,2]<-cor_results_temp$P[2]
}
# check
head(control_vs_case_cor_days)
## cor p-value
## OID00379 0.1190516 2.720778e-01
## OID00380 -0.3393009 1.304664e-03
## OID00381 0.2374262 2.680820e-02
## OID00382 -0.5174981 2.857371e-07
## OID00383 0.1888725 7.977194e-02
## OID00384 -0.0513918 6.364057e-01
# sort by p-value, then by absolute correlation
control_vs_case_cor_days<-control_vs_case_cor_days[order(control_vs_case_cor_days$`p-value`, -abs(control_vs_case_cor_days$cor)),]
# plot correlation against p-value
plot(control_vs_case_cor_days[,1], control_vs_case_cor_days[,2])
## [1] 151
#results table
datatable(control_vs_case_cor_days)
Plot most significant protein
ggplot(data=control_vs_case_dataset[c(3,(grep("OID00550", colnames(control_vs_case_dataset))))],
aes(x=Days.since.onset, y=OID00550)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
days since onset - correlation without controls
Controls removed (cases only).
# case_expr
case_expr_only<-cases_only[10:ncol(cases_only)]
# empty dataframe for results
ind_group_cor_severity<-as.data.frame(matrix(nrow=ncol(case_expr_only), ncol=2))
# colnames
colnames(ind_group_cor_severity)<-c("cor", "p-value")
# for each protein perform correlation analysis against days since onset (column 3) - use expression data only
for (x in 1:ncol(case_expr_only)) {
# add protein name
rownames(ind_group_cor_severity)[x]<-colnames(case_expr_only)[x]
# perform test
cor_results_temp<-rcorr(as.numeric(cases_only[,3]), case_expr_only[,x], type="spearman")
#extract cor
ind_group_cor_severity[x,1]<-cor_results_temp$r[2]
#extract p -value
ind_group_cor_severity[x,2]<-cor_results_temp$P[2]
}
# check
head(ind_group_cor_severity)
## cor p-value
## OID00379 0.0428449 0.7472977676
## OID00380 0.1210376 0.3611486301
## OID00381 -0.3495242 0.0066583547
## OID00382 0.1352411 0.3071209932
## OID00383 -0.4571489 0.0002727858
## OID00384 -0.3465956 0.0071623229
# sort by p-value, then by absolute correlation
ind_group_cor_severity<-ind_group_cor_severity[order(ind_group_cor_severity$`p-value`, -abs(ind_group_cor_severity$cor)),]
# plot correlation against p-value
plot(ind_group_cor_severity[,1], ind_group_cor_severity[,2])
## [1] 177
# merge with protein ID
ind_group_cor_severity<-merge(protein_mapping, ind_group_cor_severity, by="row.names")
# move rownames
rownames(ind_group_cor_severity)<-ind_group_cor_severity$Row.names
ind_group_cor_severity$Row.names<-NULL
# sort by pvalue
ind_group_cor_severity<-ind_group_cor_severity[order(ind_group_cor_severity$`p-value`, -abs(ind_group_cor_severity$cor)),]
#results table
datatable(ind_group_cor_severity)
DATA ANALYSIS
Using Limma for differential expression analysis
Create differential expression function:
- compares 2 groups at a time (e.g. control vs case)
- adjusts for age, gender and days since onset
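The function below names the design columns by position, so it helps to know how `model.matrix()` orders the group columns. A minimal sketch on invented data (`toy` is hypothetical) shows that with `~ 0 + as.factor(Group)` the first column belongs to the lower-numbered group, whatever the row order.

```r
# Toy two-group data, rows deliberately starting with the higher-numbered group
toy <- data.frame(Group = c(3, 0, 3, 0, 0, 3),
                  Age   = c(60, 55, 71, 48, 66, 59))

design <- model.matrix(~ 0 + as.factor(Group) + Age, data = toy)

# group columns follow the factor levels in ascending order: level 0 first, level 3 second
colnames(design)
# number of samples flagged in each group column
colSums(design[, 1:2])
```

On this reading, renaming the first two columns to `c("case", "control")` attaches "case" to the lower-numbered group, so a contrast of `control - case` estimates the higher-numbered group minus the lower-numbered group.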
run_limma<-function(groups) {
# extract data by group
dataset_group<-protein_data_clean[protein_data_clean$Group %in% groups,]
print(table(dataset_group$Group))
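# design a model - model.matrix orders the group levels ascending, so the lower-numbered group (e.g. control = 0) is the first column
design<-model.matrix(~0 + as.factor(dataset_group$Group) + dataset_group$Age + as.factor(dataset_group$Gender) + dataset_group$Days.since.onset)
# NOTE: names are assigned by position, so "case" labels the lower-numbered group and "control" the higher-numbered group
colnames(design)<-c("case", "control", "age", "Gender", "Days.since.onset")
# make contrast - Diff = higher-numbered group minus lower-numbered group
contrast<- makeContrasts(Diff = control - case, levels=design)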
print(head(contrast))
# apply linear model to each protein
# Robust regression provides an alternative to least squares regression that works with less restrictive assumptions. Specifically, it provides much better regression coefficient estimates when outliers are present in the data
fit<-lmFit(t(dataset_group[10:ncol(dataset_group)]), design=design, method="robust", maxit=1000)
# apply contrast
contrast_fit<-contrasts.fit(fit, contrast)
# apply empirical Bayes smoothing to the SE
ebays_fit<-eBayes(contrast_fit)
# summary
print(summary(decideTests(ebays_fit)))
# extract DE results
DE_results<-topTable(ebays_fit, n=ncol(dataset_group), adjust.method="fdr", confint=TRUE)
return(DE_results)
}
0 vs 1 Differential expression
Group 0 vs 1 (control vs mild)
##
## 0 1
## 28 26
## Contrasts
## Levels Diff
## case -1
## control 1
## age 0
## Gender 0
## Days.since.onset 0
## Diff
## Down 105
## NotSig 208
## Up 42
## [1] 147
# merge with protein ID
group_0_vs_1<-merge_protin_ID(group_0_vs_1)
# results
datatable(format(group_0_vs_1, digits=5))
Significant (adjusted p < 0.05) proteins by assay:
##
## Olink CARDIOVASCULAR II(v.5006) Olink IMMUNE RESPONSE(v.3203)
## 32 54
## Olink INFLAMMATION(v.3022) Olink NEUROLOGY(v.8012)
## 31 30
Volcano plot of DE results
Boxplot of most significant protein
ggplot(data=protein_data_clean[c(1,(grep("OID00981", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00981, fill=as.character(Group))) +
geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
0 vs 2 Differential expression
Group 0 vs 2 (control vs moderate)
##
## 0 2
## 28 9
## Contrasts
## Levels Diff
## case -1
## control 1
## age 0
## Gender 0
## Days.since.onset 0
## Diff
## Down 57
## NotSig 290
## Up 8
## [1] 65
# merge with protein ID
group_0_vs_2<-merge_protin_ID(group_0_vs_2)
# results
datatable(format(group_0_vs_2, digits=5))
Significant (adjusted p < 0.05) proteins by assay:
##
## Olink CARDIOVASCULAR II(v.5006) Olink IMMUNE RESPONSE(v.3203)
## 17 28
## Olink INFLAMMATION(v.3022) Olink NEUROLOGY(v.8012)
## 7 13
Volcano plot of DE results
Boxplot of most significant protein
ggplot(data=protein_data_clean[c(1,(grep("OID00374", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00374, fill=as.character(Group))) +
geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
0 vs 3 Differential expression
Group 0 vs 3 (control vs severe)
##
## 0 3
## 28 24
## Contrasts
## Levels Diff
## case -1
## control 1
## age 0
## Gender 0
## Days.since.onset 0
## Diff
## Down 105
## NotSig 168
## Up 82
## [1] 187
# merge with protein ID
group_0_vs_3<-merge_protin_ID(group_0_vs_3)
# results
datatable(format(group_0_vs_3, digits=5))
significant (adjusted p < 0.05) proteins belong to assay:
##
## Olink CARDIOVASCULAR II(v.5006) Olink IMMUNE RESPONSE(v.3203)
## 41 58
## Olink INFLAMMATION(v.3022) Olink NEUROLOGY(v.8012)
## 44 44
Volcano plot of DE results
Most significant protein is NF2 again.
1 vs 2 Differential expression
Group 1 vs 2 (mild vs moderate)
##
## 1 2
## 26 9
## Contrasts
## Levels Diff
## case -1
## control 1
## age 0
## Gender 0
## Days.since.onset 0
## Diff
## Down 49
## NotSig 209
## Up 97
## [1] 146
# merge with protein ID
group_1_vs_2<-merge_protin_ID(group_1_vs_2)
# results
datatable(format(group_1_vs_2, digits=5))
significant (adjusted p < 0.05) proteins belong to assay:
##
## Olink CARDIOVASCULAR II(v.5006) Olink IMMUNE RESPONSE(v.3203)
## 34 39
## Olink INFLAMMATION(v.3022) Olink NEUROLOGY(v.8012)
## 37 36
Volcano plot of DE results
Plot most significant protein
# plot the most significant protein across all groups
ggplot(data=protein_data_clean[c(1,(grep("OID00541", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00541 , fill=as.character(Group))) +
geom_boxplot()
1 vs 3 Differential expression
Group 1 vs 3 (mild vs severe)
##
## 1 3
## 26 24
## Contrasts
## Levels Diff
## case -1
## control 1
## age 0
## Gender 0
## Days.since.onset 0
## Diff
## Down 39
## NotSig 158
## Up 158
## [1] 197
# merge with protein ID
group_1_vs_3<-merge_protin_ID(group_1_vs_3)
# results
datatable(format(group_1_vs_3, digits=5))
significant (adjusted p < 0.05) proteins belong to assay:
##
## Olink CARDIOVASCULAR II(v.5006) Olink IMMUNE RESPONSE(v.3203)
## 53 48
## Olink INFLAMMATION(v.3022) Olink NEUROLOGY(v.8012)
## 48 48
Volcano plot of DE results
Plot most significant protein
# plot the most significant protein across all groups
ggplot(data=protein_data_clean[c(1,(grep("OID00459", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00459 , fill=as.character(Group))) +
geom_boxplot()
2 vs 3 Differential expression
Group 2 vs 3 (moderate vs severe)
##
## 2 3
## 9 24
## Contrasts
## Levels Diff
## case -1
## control 1
## age 0
## Gender 0
## Days.since.onset 0
## Diff
## Down 1
## NotSig 291
## Up 63
## [1] 64
# merge with protein ID
group_2_vs_3<-merge_protin_ID(group_2_vs_3)
# results
datatable(format(group_2_vs_3, digits=5))
significant (adjusted p < 0.05) proteins belong to assay:
##
## Olink CARDIOVASCULAR II(v.5006) Olink IMMUNE RESPONSE(v.3203)
## 17 11
## Olink INFLAMMATION(v.3022) Olink NEUROLOGY(v.8012)
## 15 21
Volcano plot of DE results
Plot most significant protein
# plot the most significant protein across all groups
ggplot(data=protein_data_clean[c(1,(grep("OID00517", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00517 , fill=as.character(Group))) +
geom_boxplot()
case vs control
Recode all cases (groups 1, 2 and 3) as case and compare to control.
# copy dataset and change to case vs control
control_vs_case_dataset<-protein_data_clean
# change to case and control
control_vs_case_dataset$Group<-ifelse(control_vs_case_dataset$Group==0, 0, 1)
# check
table(control_vs_case_dataset$Group)
##
## 0 1
## 28 59
# design a model - Group 0 (control) is the first factor level, so it forms the first column
# NOTE: the column labels below are assigned by position, so "case" names the Group 0 column
# and "control" names the Group 1 column; Diff = control - case is therefore Group 1 minus Group 0
control_vs_case_design<-model.matrix(~0 + as.factor(control_vs_case_dataset$Group) + control_vs_case_dataset$Age + as.factor(control_vs_case_dataset$Gender) + control_vs_case_dataset$Days.since.onset)
colnames(control_vs_case_design)<-c("case", "control", "age", "Gender", "Days.since.onset")
# make contrast - what to compare
control_vs_case_contrast<- makeContrasts(Diff = control - case, levels=control_vs_case_design)
head(control_vs_case_contrast)
## Contrasts
## Levels Diff
## case -1
## control 1
## age 0
## Gender 0
## Days.since.onset 0
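Because `colnames()` relabels the design matrix by position, it helps to see which factor level each column holds before it is renamed. The sketch below uses a small invented data frame (not the study data) and base R's `model.matrix` only.

```r
# Toy data (hypothetical): Group 0 = control, Group 1 = case
toy <- data.frame(Group = c(0, 0, 1, 1, 1),
                  Age   = c(30, 45, 50, 62, 38))
# ~0 removes the intercept, so each level of the first factor gets its own column
toy_design <- model.matrix(~0 + as.factor(Group) + Age, data = toy)
# the default names show which level each column holds
default_names <- colnames(toy_design)
# the first column is 1 for Group 0 rows, i.e. it is the control column
first_col_is_group0 <- all(toy_design[, 1] == as.numeric(toy$Group == 0))
```

Here `default_names` starts with `as.factor(Group)0`, so whatever label is given first in `colnames()<-` ends up on the Group 0 column. With the labels used in this report, a positive logFC for `Diff = control - case` therefore means higher expression in Group 1 than in Group 0.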
# apply linear model to each protein
# Robust regression provides an alternative to least squares regression that works with less restrictive assumptions. Specifically, it provides much better regression coefficient estimates when outliers are present in the data
control_vs_case_fit<-lmFit(t(control_vs_case_dataset[10:ncol(control_vs_case_dataset)]), design=control_vs_case_design, method="robust", maxit=100)
# apply contrast
control_vs_case_contrast_fit<-contrasts.fit(control_vs_case_fit, control_vs_case_contrast)
# apply empirical Bayes smoothing to the SE
control_vs_case_ebays_fit<-eBayes(control_vs_case_contrast_fit)
# summary
print(summary(decideTests(control_vs_case_ebays_fit)))
## Diff
## Down 149
## NotSig 86
## Up 120
# extract DE results
control_vs_case_DE_results<-topTable(control_vs_case_ebays_fit, n=ncol(control_vs_case_dataset), adjust.method="fdr", confint=TRUE)
# number of DE proteins
nrow(subset(control_vs_case_DE_results, control_vs_case_DE_results$adj.P.Val<0.05))
## [1] 269
# merge with protein ID
control_vs_case_DE_results<-merge_protin_ID(control_vs_case_DE_results)
# results
datatable(format(control_vs_case_DE_results, digits=5))
significant (adjusted p < 0.05) proteins belong to assay:
##
## Olink CARDIOVASCULAR II(v.5006) Olink IMMUNE RESPONSE(v.3203)
## 65 71
## Olink INFLAMMATION(v.3022) Olink NEUROLOGY(v.8012)
## 62 71
Volcano plot of DE results
EnhancedVolcano(control_vs_case_DE_results,
lab = rownames(control_vs_case_DE_results),
x = 'logFC',
y = 'adj.P.Val')
NF2 is the most significant protein again.
Longitudinal group 1
Differential expression between the 12 samples in group 1 that were repeated at different time points
- 6 patients = 12 samples
- baseline and first repeat assigned to samples
- controlling for age + days.since.onset + gender
- paired t-test
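The idea behind the paired comparison can be illustrated for a single protein with base R. This is only a sketch on invented NPX values; the analysis that follows fits all proteins with limma rather than calling `t.test`.

```r
# Hypothetical NPX values for one protein in 6 patients (illustration only)
baseline <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7)
first    <- c(5.9, 5.2, 6.4, 6.1, 5.0, 6.3)
# paired test: each patient's first repeat is compared with their own baseline
paired_res <- t.test(first, baseline, paired = TRUE)
# equivalent one-sample test on the within-patient differences
diff_res <- t.test(first - baseline)
# average within-patient change (first repeat minus baseline)
mean_change <- mean(first - baseline)
```

Both calls give the same t statistic, which shows that pairing removes between-patient differences and tests only the change within each patient.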
# extract group 1
group1_duplicates_data<-duplicate_samples_after_qc[duplicate_samples_after_qc$Group==1,]
# add column for longitudinal analysis - sample.repeat - baseline/first (baseline is always 1st sample)
group1_duplicates_data$sample.repeat<-"baseline"
group1_duplicates_data$sample.repeat[c(2,4,6,8,10,12)]<-"first"
# move sample.repeat to front
ncol(group1_duplicates_data)
## [1] 365
group1_duplicates_data<-group1_duplicates_data[c(365,1:364)]
# check
head(group1_duplicates_data, 12)[1:5]
## sample.repeat Group Pat.Code Days.since.onset Gender
## sample1 baseline 1 101 8 M
## sample2 first 1 101 27 M
## sample6 baseline 1 105 17 F
## sample7 first 1 105 27 F
## sample13 baseline 1 111 8 F
## sample14 first 1 111 22 F
## sample15 baseline 1 112 7 M
## sample16 first 1 112 26 M
## sample19 baseline 1 115 2 M
## sample20 first 1 115 22 M
## sample24 baseline 1 119 9 F
## sample25 first 1 119 24 F
##
## baseline first
## 6 6
# design a model - "baseline" is the first factor level, "first" the second
group1_longitudinal_design<-model.matrix(~0 + as.factor(group1_duplicates_data$sample.repeat) + group1_duplicates_data$Age + group1_duplicates_data$Days.since.onset + as.factor(group1_duplicates_data$Gender))
colnames(group1_longitudinal_design)<-c("baseline", "first", "age", "days.since.onset", "Gender")
# estimate correlation between repeated samples from the same patient
group1_corfit <- duplicateCorrelation(t(group1_duplicates_data[10:ncol(group1_duplicates_data)]),group1_longitudinal_design,block=group1_duplicates_data$Pat.Code)
group1_corfit$consensus
## [1] 0.5824546
# make contrast - what to compare
group1_longitudinal_contrast<- makeContrasts(Diff = first - baseline, levels=group1_longitudinal_design)
head(group1_longitudinal_contrast)
## Contrasts
## Levels Diff
## baseline -1
## first 1
## age 0
## days.since.onset 0
## Gender 0
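# add inter-subject correlation to model
# NOTE: no `block` argument is passed to lmFit, and limma applies `correlation` only together with a blocking factor,
# so the consensus correlation is probably not used in this fit (method="robust" appears to ignore it as well)
# NOTE: sample.repeat was moved to column 1, so 10:ncol now starts at the last metadata column;
# this would explain the 356 rows in the summary below versus 355 proteins elsewhere (11:ncol would select proteins only)
# apply linear model to each protein
# Robust regression provides an alternative to least squares regression that works with less restrictive assumptions. Specifically, it provides much better regression coefficient estimates when outliers are present in the data
group1_longitudinal_fit<-lmFit(t(group1_duplicates_data[10:ncol(group1_duplicates_data)]), design=group1_longitudinal_design, method="robust", maxit=100, correlation = group1_corfit$consensus)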
# apply contrast
group1_longitudinal_contrast_fit<-contrasts.fit(group1_longitudinal_fit, group1_longitudinal_contrast)
# apply empirical Bayes smoothing to the SE
group1_longitudinal_ebays_fit<-eBayes(group1_longitudinal_contrast_fit)
## Warning: Zero sample variances detected, have been offset away from zero
## Diff
## Down 3
## NotSig 343
## Up 10
# extract DE results
group1_longitudinal_DE_results<-topTable(group1_longitudinal_ebays_fit, n=ncol(group1_duplicates_data), adjust.method="fdr", confint=TRUE)
# number of DE proteins
nrow(subset(group1_longitudinal_DE_results, group1_longitudinal_DE_results$adj.P.Val<0.05))
## [1] 13
# merge with protein ID
group1_longitudinal_DE_results<-merge_protin_ID(group1_longitudinal_DE_results)
# results
datatable(format(group1_longitudinal_DE_results, digits=5))
significant (adjusted p < 0.05) proteins belong to assay:
##
## Olink CARDIOVASCULAR II(v.5006) Olink IMMUNE RESPONSE(v.3203)
## 5 4
## Olink INFLAMMATION(v.3022) Olink NEUROLOGY(v.8012)
## 2 2
Boxplots of most significant proteins
# plot by group 1 duplicate
ggplot(data=group1_duplicates_data[c(1,(grep("OID00386", colnames(group1_duplicates_data))))],
aes(x=sample.repeat, y=OID00386, fill=sample.repeat)) +
geom_boxplot()
Longitudinal group 3
Differential expression between the 10 samples in group 3 that were repeated at different time points
- 5 patients = 10 samples
- baseline and first repeat assigned to samples
- controlling for age + days.since.onset
- paired t-test
- All males (gender not modelled)
# extract group 3
group3_duplicates_data<-duplicate_samples_after_qc[duplicate_samples_after_qc$Group==3,]
# add column for longitudinal analysis - sample.repeat - baseline/first (baseline is always 1st sample)
group3_duplicates_data$sample.repeat<-"baseline"
group3_duplicates_data$sample.repeat[c(2,4,6,8,10)]<-"first"
# move sample.repeat to front
ncol(group3_duplicates_data)
## [1] 365
group3_duplicates_data<-group3_duplicates_data[c(365,1:364)]
# check
head(group3_duplicates_data, 12)[1:5]
## sample.repeat Group Pat.Code Days.since.onset Gender
## sample39 baseline 3 304 11 M
## sample40 first 3 304 15 M
## sample43 baseline 3 307 6 M
## sample44 first 3 307 8 M
## sample47 baseline 3 310 10 M
## sample48 first 3 310 12 M
## sample54 baseline 3 315 9 M
## sample55 first 3 315 13 M
## sample58 baseline 3 318 6 M
## sample59 first 3 318 8 M
##
## baseline first
## 5 5
# design a model - "baseline" is the first factor level, "first" the second
group3_longitudinal_design<-model.matrix(~0 + as.factor(group3_duplicates_data$sample.repeat) + group3_duplicates_data$Age + group3_duplicates_data$Days.since.onset)
colnames(group3_longitudinal_design)<-c("baseline", "first", "age", "days.since.onset")
# estimate correlation between repeated samples from the same patient
group3_corfit <- duplicateCorrelation(t(group3_duplicates_data[10:ncol(group3_duplicates_data)]),group3_longitudinal_design,block=group3_duplicates_data$Pat.Code)
## Warning in glmgam.fit(dx, dy, coef.start = start, tol = tol, maxit = maxit, :
## Too much damping - convergence tolerance not achievable
## Warning in glmgam.fit(dx, dy, coef.start = start, tol = tol, maxit = maxit, :
## Too much damping - convergence tolerance not achievable
## [1] 0.42842
# make contrast - what to compare
group3_longitudinal_contrast<- makeContrasts(Diff = first - baseline, levels=group3_longitudinal_design)
head(group3_longitudinal_contrast)
## Contrasts
## Levels Diff
## baseline -1
## first 1
## age 0
## days.since.onset 0
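# add inter-subject correlation to model
# NOTE: no `block` argument is passed to lmFit, and limma applies `correlation` only together with a blocking factor,
# so the consensus correlation is probably not used in this fit (method="robust" appears to ignore it as well)
# NOTE: sample.repeat was moved to column 1, so 10:ncol now starts at the last metadata column;
# this would explain the 356 rows in the summary below versus 355 proteins elsewhere (11:ncol would select proteins only)
# apply linear model to each protein
# Robust regression provides an alternative to least squares regression that works with less restrictive assumptions. Specifically, it provides much better regression coefficient estimates when outliers are present in the data
group3_longitudinal_fit<-lmFit(t(group3_duplicates_data[10:ncol(group3_duplicates_data)]), design=group3_longitudinal_design, method="robust", maxit=1000, correlation = group3_corfit$consensus)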
# apply contrast
group3_longitudinal_contrast_fit<-contrasts.fit(group3_longitudinal_fit, group3_longitudinal_contrast)
# apply empirical Bayes smoothing to the SE
group3_longitudinal_ebays_fit<-eBayes(group3_longitudinal_contrast_fit)
## Warning: Zero sample variances detected, have been offset away from zero
## Diff
## Down 3
## NotSig 350
## Up 3
# extract DE results
group3_longitudinal_DE_results<-topTable(group3_longitudinal_ebays_fit, n=ncol(group3_duplicates_data), adjust.method="fdr", confint=TRUE)
# number of DE proteins
nrow(subset(group3_longitudinal_DE_results, group3_longitudinal_DE_results$adj.P.Val<0.05))
## [1] 6
# merge with protein ID
group3_longitudinal_DE_results<-merge_protin_ID(group3_longitudinal_DE_results)
# results
datatable(format(group3_longitudinal_DE_results, digits=5))
significant (adjusted p < 0.05) proteins belong to assay:
##
## Olink CARDIOVASCULAR II(v.5006) Olink IMMUNE RESPONSE(v.3203)
## 2 1
## Olink INFLAMMATION(v.3022) Olink NEUROLOGY(v.8012)
## 0 3
Boxplot of most significant protein
# plot by group 3 duplicate
ggplot(data=group3_duplicates_data[c(1,(grep("OID00424", colnames(group3_duplicates_data))))],
aes(x=sample.repeat, y=OID00424 , fill=sample.repeat)) +
geom_boxplot()
correlation analysis of severity
Is there any correlation between the severity of symptoms and protein expression?
Severity is ordered control->mild->moderate->severe (groups 0, 1, 2, 3).
# empty dataframe for results
cor_severity<-as.data.frame(matrix(nrow=nrow(assay_data_clean), ncol=2))
# colnames
colnames(cor_severity)<-c("cor", "p-value")
# for each protein perform correlation analysis against disease severity (i.e 0,1,2,3) - use expression data only
for (x in 1:ncol(exprs_only)) {
# add protein name
rownames(cor_severity)[x]<-colnames(exprs_only)[x]
# perform test
cor_results_temp<-rcorr(as.numeric(protein_data_clean[,1]), exprs_only[,x], type="spearman")
#extract cor
cor_severity[x,1]<-cor_results_temp$r[2]
#extract p -value
cor_severity[x,2]<-cor_results_temp$P[2]
}
# check
head(cor_severity)
## cor p-value
## OID00379 0.2246237 3.646928e-02
## OID00380 -0.3972408 1.391701e-04
## OID00381 0.6636468 2.464917e-12
## OID00382 -0.5503791 3.342356e-08
## OID00383 0.6658724 1.961320e-12
## OID00384 0.3863220 2.191736e-04
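For a single protein, the same kind of rank correlation can be reproduced with base R's `cor.test`, which is swapped in here for `rcorr` purely for illustration. The severity codes and NPX values below are invented.

```r
# Toy data (hypothetical): severity group 0-3 and one protein's NPX values
severity <- c(0, 0, 1, 1, 2, 2, 3, 3)
npx      <- c(2.1, 2.4, 2.9, 2.6, 3.5, 3.3, 4.2, 4.0)
# Spearman rank correlation; exact = FALSE because the severity codes contain ties
sp <- cor.test(severity, npx, method = "spearman", exact = FALSE)
# correlation coefficient and its p-value, as stored per protein in cor_severity
rho <- unname(sp$estimate)
p_value <- sp$p.value
```

Spearman's method only uses the ordering of the groups, so it treats severity as ordinal and does not assume equal spacing between control, mild, moderate and severe.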
# sort by p-value, then by decreasing absolute correlation
cor_severity<-cor_severity[order(cor_severity$`p-value`, -abs(cor_severity$cor)),]
# plot correlation against p-value
plot(cor_severity[,1], cor_severity[,2])
## [1] 286
# merge with protein ID
cor_severity<-merge(protein_mapping, cor_severity, by="row.names")
# move rownames
rownames(cor_severity)<-cor_severity$Row.names
cor_severity$Row.names<-NULL
# sort by pvalue
cor_severity<-cor_severity[order(cor_severity$`p-value`, -abs(cor_severity$cor)),]
#results table
datatable(cor_severity)
Plot most correlated
# plot all groups most correlated
ggplot(data=protein_data_clean[c(1,(grep("OID00999", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00999 , fill=as.character(Group))) +
geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
Plot most anti-correlated
# plot most anti-correlated
ggplot(data=protein_data_clean[c(1,(grep("OID00951", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00951, fill=as.character(Group))) +
geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
SIMOA PROTEINS
Tau, NfL and GFAP were additionally measured using Simoa (single molecule array).
- Tau - a microtubule-associated protein
- NfL - neurofilament light polypeptide
- GFAP - glial fibrillary acidic protein, a major intermediate filament protein of mature astrocytes
Pearson's correlation is used here because both variables are continuous.
check correlation of these proteins with other data:
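The correlation function log2-transforms the Simoa values before correlating them with NPX, which is itself on a log2 scale. A small sketch with invented numbers shows why this matters for Pearson's correlation; the concentrations and NPX values are hypothetical.

```r
# Hypothetical Simoa concentrations that double from sample to sample
simoa_conc <- c(2, 4, 8, 16, 32, 64)
# Hypothetical NPX values (log2 scale) rising roughly one unit per doubling
npx <- c(1.0, 2.1, 2.9, 4.2, 5.0, 6.1)
# Pearson correlation on the raw concentrations
r_raw <- cor(simoa_conc, npx)
# Pearson correlation after putting the concentrations on the log2 scale
r_log <- cor(log2(simoa_conc), npx)
```

Pearson's correlation measures linear association, so a relationship that is linear on the log2 scale is understated when raw concentrations are used (`r_raw` is lower than `r_log` in this toy example).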
correlation function
run_simoa_cor<-function(protein){
# protein of interest
sim_prot<-protein_data_clean[grep(protein,colnames(protein_data_clean))]
# extract simoa protein + remain expr data
exprs_only<-protein_data_clean[10:ncol(protein_data_clean)]
# empty dataframe for results
cor_severity<-as.data.frame(matrix(nrow=nrow(assay_data_clean), ncol=2))
# colnames
colnames(cor_severity)<-c("cor", "p-value")
# loop
for (x in 1:ncol(exprs_only)) {
# add protein name
rownames(cor_severity)[x]<-colnames(exprs_only)[x]
# perform test
cor_results_temp<-rcorr(log2(sim_prot[,1]), exprs_only[,x], type="pearson")
#extract cor
cor_severity[x,1]<-cor_results_temp$r[2]
#extract p -value
cor_severity[x,2]<-cor_results_temp$P[2]
}
# sort by p-value
cor_severity<-cor_severity[order(cor_severity$`p-value`, -abs(-cor_severity$cor)),]
# merge with protein ID
cor_severity<-merge(protein_mapping, cor_severity, by="row.names")
# move rownames
rownames(cor_severity)<-cor_severity$Row.names
cor_severity$Row.names<-NULL
# sort by pvalue
cor_severity<-cor_severity[order(cor_severity$`p-value`, -abs(cor_severity$cor)),]
return(cor_severity)
}
differential expression function
Apply function to Tau
## [1] 97
## Assay Gene.ID Uniprot.ID cor p-value
## OID00380 Cardiovascular ANGPT1 Q15389 -0.5989159 8.915899e-10
## OID00401 Cardiovascular PDGF subunit B P01127 -0.4609265 7.033602e-06
## OID00382 Cardiovascular CD40-L P29965 -0.4245190 4.167934e-05
## OID00449 Cardiovascular HB-EGF Q99075 -0.4218897 4.703018e-05
## OID00439 Cardiovascular CCL17 Q92583 -0.4210625 4.884150e-05
## OID00300 Neurology SCARB2 Q14108 0.3940164 1.743611e-04
Plot most correlated
# plot all groups most correlated
ggplot(data=protein_data_clean[c(1,(grep("OID00380", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00380 , fill=as.character(Group))) +
geom_boxplot()
Apply function to NfL
## [1] 233
## Assay Gene.ID Uniprot.ID cor p-value
## OID00370 Neurology EDA2R Q9HAV5 0.6617428 4.014344e-12
## OID00394 Cardiovascular TNFRSF11A Q9Y6Q6 0.6534596 6.847634e-12
## OID00396 Cardiovascular TRAIL-R2 O14763 0.6529903 7.170931e-12
## OID00346 Neurology SKR3 P37023 0.6454287 1.963896e-11
## OID00479 Inflammation OPG O00300 0.6402975 2.423617e-11
## OID00326 Neurology LAYN Q6UX15 0.6422911 2.636380e-11
Plot most correlated
# plot all groups most correlated
ggplot(data=protein_data_clean[c(1,(grep("OID00370", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00370 , fill=as.character(Group))) +
geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
Apply function to GFAP
#GFAP
GFAP_cor<-run_simoa_cor("GFAP")
# how many sig
nrow(subset(GFAP_cor, GFAP_cor$`p-value`<=0.05))
## [1] 165
## Assay Gene.ID Uniprot.ID cor p-value
## OID00455 Cardiovascular BNP P16860 0.5085301 4.937307e-07
## OID00394 Cardiovascular TNFRSF11A Q9Y6Q6 0.4748341 3.371737e-06
## OID00552 Inflammation CX3CL1 P78423 0.4748163 3.374980e-06
## OID01018 Immune DDX58 O95786 0.4678390 5.574783e-06
## OID00300 Neurology SCARB2 Q14108 0.4649629 6.474506e-06
## OID00399 Cardiovascular TF P13726 0.4623677 6.527273e-06
Plot most correlated
# plot all groups for the protein most correlated with GFAP (OID00455, BNP)
ggplot(data=protein_data_clean[c(1,(grep("OID00455", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00455 , fill=as.character(Group))) +
geom_boxplot()
OUTPUT
save data
Clean expression data
Clean protein info
0 vs 1
0 vs 2
0 vs 3
1 vs 2
1 vs 3
2 vs 3
control vs case
Longitudinal group 1
saveRDS(group1_longitudinal_DE_results, file=paste(output_dir, "Result/group1_longitudinal_DE_results.RDS", sep="/"))
write.table(subset(group1_longitudinal_DE_results, group1_longitudinal_DE_results$adj.P.Val<=0.05)[3], file="Result/group1_longitudinal_DE_results.txt", col.names = F, row.names = F, quote=F)
Longitudinal group 3
saveRDS(group3_longitudinal_DE_results, file=paste(output_dir, "Result/group3_longitudinal_DE_results.RDS", sep="/"))
write.table(subset(group3_longitudinal_DE_results, group3_longitudinal_DE_results$adj.P.Val<=0.05)[3], file="Result/group3_longitudinal_DE_results.txt", col.names = F, row.names = F, quote=F)
Infection severity correlation
SIMOA protein cor
write.table(GFAP_cor, file=paste(output_dir, "Result/GFAP_correlation.txt", sep="/"), quote=F, sep="\t")
write.table(Tau_cor, file=paste(output_dir, "Result/Tau_correlation.txt", sep="/"), quote=F, sep="\t")
write.table(NfL_cor, file=paste(output_dir, "Result/NfL_correlation.txt", sep="/"), quote=F, sep="\t")
Merge results for Shiny app
## [1] "Assay" "Gene.ID" "Uniprot.ID" "logFC" "CI.L"
## [6] "CI.R" "AveExpr" "t" "P.Value" "adj.P.Val"
## [11] "B"
# add comparison
group_0_vs_1$Comparison<-"Control_v_Mild"
group_0_vs_2$Comparison<-"Control_v_Severe"
group_0_vs_3$Comparison<-"Control_v_Critical"
group_1_vs_2$Comparison<-"Mild_v_Severe"
group_1_vs_3$Comparison<-"Mild_v_Critical"
group_2_vs_3$Comparison<-"Severe_v_Critical"
control_vs_case_DE_results$Comparison<-"Control_v_Case"
group1_longitudinal_DE_results$Comparison<-"Longitudinal_in_Mild"
group3_longitudinal_DE_results$Comparison<-"Longitudinal_in_Critical"
# move rownames to an OLINK.ID column
group_0_vs_1$OLINK.ID<-rownames(group_0_vs_1)
group_0_vs_2$OLINK.ID<-rownames(group_0_vs_2)
group_0_vs_3$OLINK.ID<-rownames(group_0_vs_3)
group_1_vs_2$OLINK.ID<-rownames(group_1_vs_2)
group_1_vs_3$OLINK.ID<-rownames(group_1_vs_3)
group_2_vs_3$OLINK.ID<-rownames(group_2_vs_3)
control_vs_case_DE_results$OLINK.ID<-rownames(control_vs_case_DE_results)
group1_longitudinal_DE_results$OLINK.ID<-rownames(group1_longitudinal_DE_results)
group3_longitudinal_DE_results$OLINK.ID<-rownames(group3_longitudinal_DE_results)
# merge
Full_results<-rbind(group_0_vs_1,
group_0_vs_2,
group_0_vs_3,
group_1_vs_2,
group_1_vs_3,
group_2_vs_3,
control_vs_case_DE_results,
group1_longitudinal_DE_results,
group3_longitudinal_DE_results)
# Adding correlation within groups
# Full_results$cor.with.days<-ifelse(Full_results$OLINK.ID %in% grp_cor_merged$OLINK.ID, "Yes", "No")
# rearrange columns
Full_results<-Full_results[c(12, 1, 13, 2, 3, 4, 5, 6, 7, 9, 10)]
dim(Full_results)
## [1] 3195 11
## Comparison Assay OLINK.ID Gene.ID Uniprot.ID
## OID00981 Control_v_Mild Immune OID00981 NF2 P35240
## OID00374 Control_v_Mild Neurology OID00374 MANF P55145
## OID00979 Control_v_Mild Immune OID00979 BIRC2 Q13490
## OID00401 Control_v_Mild Cardiovascular OID00401 PDGF subunit B P01127
## OID01011 Control_v_Mild Immune OID01011 DAPP1 Q9UN19
## OID00936 Control_v_Mild Immune OID00936 PPP1R9B Q96SB3
## logFC CI.L CI.R AveExpr P.Value adj.P.Val
## OID00981 -4.621684 -4.802838 -4.440531 1.768762 7.250209e-45 2.573824e-42
## OID00374 -2.854449 -3.076361 -2.632537 7.973475 1.427848e-30 2.534430e-28
## OID00979 -1.735473 -1.901240 -1.569707 1.334477 1.786336e-26 2.113831e-24
## OID00401 -3.228965 -3.559097 -2.898833 10.312451 2.001978e-25 1.776755e-23
## OID01011 -6.619309 -7.314962 -5.923657 5.966673 1.260887e-24 8.952300e-23
## OID00936 -5.279689 -5.864442 -4.694936 4.173545 1.254514e-23 6.528425e-22
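The labelling and stacking steps above repeat the same two assignments for every result table. As a sketch only, the same pattern can be written once over a named list; the table names, IDs and values below are invented and the helper is not part of the original analysis.

```r
# Toy result tables (hypothetical), with rows named by Olink ID
res_a <- data.frame(logFC = c(1.2, -0.3), row.names = c("OID1", "OID2"))
res_b <- data.frame(logFC = c(0.4, 0.9), row.names = c("OID1", "OID2"))
results_list <- list(A_v_B = res_a, B_v_C = res_b)
# add the comparison label and an ID column to each table
labelled <- Map(function(res, name) {
  res$Comparison <- name
  res$OLINK.ID <- rownames(res)
  res
}, results_list, names(results_list))
# stack all tables; row names are made unique, so OLINK.ID keeps the plain protein ID
stacked <- do.call(rbind, labelled)
```

Row names must be unique in a data frame, so they get altered when tables sharing the same proteins are stacked. Keeping the ID in its own `OLINK.ID` column is what allows later steps to match the same protein across comparisons.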
Enrichment test
Output files for the enrichment test, using UniProt IDs.
# list of background
write.table(assay_data_clean$Uniprot.ID, file=paste(output_dir, "/Result/For_enrichment_test/background_list.txt", sep="/"), row.names = F, col.names = F, quote=F)
# group_0_vs_1
write.table(subset(group_0_vs_1, group_0_vs_1$adj.P.Val<=0.05)[3],
file=paste(output_dir, "/Result/For_enrichment_test/group_0_vs_1.txt", sep="/"),
row.names = F,
col.names = F,
quote=F)
# group_0_vs_2
write.table(subset(group_0_vs_2, group_0_vs_2$adj.P.Val<=0.05)[3],
file=paste(output_dir, "/Result/For_enrichment_test/group_0_vs_2.txt", sep="/"),
row.names = F,
col.names = F,
quote=F)
# group_0_vs_3
write.table(subset(group_0_vs_3, group_0_vs_3$adj.P.Val<=0.05)[3],
file=paste(output_dir, "/Result/For_enrichment_test/group_0_vs_3.txt", sep="/"),
row.names = F,
col.names = F,
quote=F)
# group_1_vs_2
write.table(subset(group_1_vs_2, group_1_vs_2$adj.P.Val<=0.05)[3],
file=paste(output_dir, "/Result/For_enrichment_test/group_1_vs_2.txt", sep="/"),
row.names = F,
col.names = F,
quote=F)
# group_1_vs_3
write.table(subset(group_1_vs_3, group_1_vs_3$adj.P.Val<=0.05)[3],
file=paste(output_dir, "/Result/For_enrichment_test/group_1_vs_3.txt", sep="/"),
row.names = F,
col.names = F,
quote=F)
# group_2_vs_3
write.table(subset(group_2_vs_3, group_2_vs_3$adj.P.Val<=0.05)[3],
file=paste(output_dir, "/Result/For_enrichment_test/group_2_vs_3.txt", sep="/"),
row.names = F,
col.names = F,
quote=F)
# control_vs_case_DE_results
write.table(subset(control_vs_case_DE_results, control_vs_case_DE_results$adj.P.Val<=0.05)[3],
file=paste(output_dir, "/Result/For_enrichment_test/control_vs_case_DE_results.txt", sep="/"),
row.names = F,
col.names = F,
quote=F)
# group1_longitudinal_DE_results
write.table(subset(group1_longitudinal_DE_results, group1_longitudinal_DE_results$adj.P.Val<=0.05)[3],
file=paste(output_dir, "/Result/For_enrichment_test/group1_longitudinal_DE_results.txt", sep="/"),
row.names = F,
col.names = F,
quote=F)
# group3_longitudinal_DE_results
write.table(subset(group3_longitudinal_DE_results, group3_longitudinal_DE_results$adj.P.Val<=0.05)[3],
file=paste(output_dir, "/Result/For_enrichment_test/group3_longitudinal_DE_results.txt", sep="/"),
row.names = F,
col.names = F,
quote=F)
INTERPRETATION
number of DE across analyses
# function to count significant proteins by direction
summarise_DE_by_p<-function(ED_results, analysis_name){
# empty data frame - 1 column, 3 rows
results<-as.data.frame(matrix(ncol=1, nrow=3))
# add colname
colnames(results)<-analysis_name
# add rownames
rownames(results)<-c("Down", "NotSig", "Up")
# number of significantly down-regulated proteins (logFC < 0)
results["Down",1]<-nrow(subset(ED_results, ED_results$adj.P.Val<0.05 & ED_results$logFC<0))
# number of non-significant proteins
results["NotSig",1]<-nrow(subset(ED_results, ED_results$adj.P.Val>=0.05))
# number of significantly up-regulated proteins (logFC > 0)
results["Up",1]<-nrow(subset(ED_results, ED_results$adj.P.Val<0.05 & ED_results$logFC>0))
return(results)
}
# apply function
group_0_vs_1_DE_count <- summarise_DE_by_p(group_0_vs_1, "Control_v_Mild")
group_0_vs_2_DE_count <- summarise_DE_by_p(group_0_vs_2, "Control_v_Moderate")
group_0_vs_3_DE_count <- summarise_DE_by_p(group_0_vs_3 , "Control_v_Severe")
group_1_vs_2_DE_count <- summarise_DE_by_p(group_1_vs_2 , "Mild_v_Moderate")
group_1_vs_3_DE_count <- summarise_DE_by_p(group_1_vs_3 , "Mild_v_Severe")
group_2_vs_3_DE_count <- summarise_DE_by_p(group_2_vs_3 , "Moderate_v_Severe")
control_vs_case_DE_count <- summarise_DE_by_p(control_vs_case_DE_results , "Control_v_Case")
group1_longitudinal_DE_count <- summarise_DE_by_p(group1_longitudinal_DE_results , "Longitudinal_in_Mild")
group3_longitudinal_DE_count <- summarise_DE_by_p(group3_longitudinal_DE_results , "Longitudinal_in_Severe")
# merge results
DE_result_count<-cbind(group_0_vs_1_DE_count,
group_0_vs_2_DE_count,
group_0_vs_3_DE_count,
group_1_vs_2_DE_count,
group_1_vs_3_DE_count,
group_2_vs_3_DE_count,
control_vs_case_DE_count,
group1_longitudinal_DE_count,
group3_longitudinal_DE_count)
# check
DE_result_count
## Control_v_Mild Control_v_Moderate Control_v_Severe Mild_v_Moderate
## Down 105 57 105 49
## NotSig 208 290 168 209
## Up 42 8 82 97
## Mild_v_Severe Moderate_v_Severe Control_v_Case Longitudinal_in_Mild
## Down 39 1 149 3
## NotSig 158 291 86 342
## Up 158 63 120 10
## Longitudinal_in_Severe
## Down 3
## NotSig 349
## Up 3
Proteins sig across control->mild->moderate->severe
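These proteins are significant (adjusted p <= 0.05) in all three of the following comparisons, as selected in the code below (Mild, Severe and Critical are the labels given to groups 1, 2 and 3 when the results were merged):
- Control_v_Mild (group 0 vs 1)
- Mild_v_Critical (group 1 vs 3)
- Severe_v_Critical (group 2 vs 3)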
# extract proteins sig diff in group 0->1->2->3
proteins_sig_across_all<-subset(Full_results, Full_results$Comparison=="Control_v_Mild" & Full_results$adj.P.Val<=0.05 |
Full_results$Comparison=="Mild_v_Critical" & Full_results$adj.P.Val<=0.05 |
Full_results$Comparison=="Severe_v_Critical" & Full_results$adj.P.Val<=0.05)
# keep only those that are repeated 3 times
proteins_sig_across_all<-subset(proteins_sig_across_all, proteins_sig_across_all$OLINK.ID %in% subset(as.data.frame(table(proteins_sig_across_all$OLINK.ID)), Freq==3)$Var1)
# sort by OLINK
proteins_sig_across_all<-proteins_sig_across_all[order(proteins_sig_across_all$OLINK.ID),]
# check
head(proteins_sig_across_all)
## Comparison Assay OLINK.ID Gene.ID Uniprot.ID logFC
## OID00300 Control_v_Mild Neurology OID00300 SCARB2 Q14108 0.5942784
## OID003004 Mild_v_Critical Neurology OID00300 SCARB2 Q14108 0.6278560
## OID003005 Severe_v_Critical Neurology OID00300 SCARB2 Q14108 0.7269831
## OID00315 Control_v_Mild Neurology OID00315 SIGLEC1 Q9BZZ2 1.7679780
## OID003154 Mild_v_Critical Neurology OID00315 SIGLEC1 Q9BZZ2 0.4064055
## OID003155 Severe_v_Critical Neurology OID00315 SIGLEC1 Q9BZZ2 0.4096786
## CI.L CI.R AveExpr P.Value adj.P.Val
## OID00300 0.23940756 0.9491492 4.593904 1.483496e-03 4.702154e-03
## OID003004 0.18779937 1.0679127 5.259929 6.167083e-03 1.371865e-02
## OID003005 0.33069805 1.1232681 5.682310 7.712754e-04 1.557798e-02
## OID00315 1.22773304 2.3082230 6.017946 2.799808e-08 1.774878e-07
## OID003154 0.06085954 0.7519514 6.787408 2.219098e-02 4.124502e-02
## OID003155 0.13311006 0.6862472 7.220382 5.081584e-03 3.340671e-02
## [1] 78
# empty list to remove
proteins_sig_across_all_to_remove<-vector()
# keep proteins where logFC is same direction in all analysis
for (x in seq(1, nrow(proteins_sig_across_all),3)) {
# extract fold change by 3
numbers<-proteins_sig_across_all$logFC[x:(x+2)]
if(all(numbers>0)==F & all(numbers<0)==F) {
proteins_sig_across_all_to_remove<-c(proteins_sig_across_all_to_remove, proteins_sig_across_all$OLINK.ID[x])
}
}
# remove
proteins_sig_across_all<-subset(proteins_sig_across_all, !(proteins_sig_across_all$OLINK.ID %in% proteins_sig_across_all_to_remove))
nrow(proteins_sig_across_all)
## [1] 57
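The direction filter above steps through the table three rows at a time, which relies on the rows being sorted by `OLINK.ID` with exactly three rows per protein. A grouped version that does not depend on row order can be sketched with `tapply`; the data frame below is invented and only mimics the two columns that matter.

```r
# Toy results (hypothetical): three comparisons per protein
toy <- data.frame(
  OLINK.ID = rep(c("P1", "P2", "P3"), each = 3),
  logFC    = c(0.5, 0.6, 0.7,   -1.2, -0.4, -0.3,   0.8, -0.2, 0.4)
)
# TRUE when every logFC for a protein has the same sign
same_direction <- tapply(toy$logFC, toy$OLINK.ID,
                         function(fc) all(fc > 0) || all(fc < 0))
# IDs of proteins that change in one direction only
keep_ids <- names(same_direction)[same_direction]
# keep every row belonging to those proteins
toy_consistent <- toy[toy$OLINK.ID %in% keep_ids, ]
```

In this toy table P1 (all positive) and P2 (all negative) are kept and P3 is dropped because its logFC changes sign between comparisons.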
# sort by adjusted p-value
proteins_sig_across_all<-proteins_sig_across_all[order(proteins_sig_across_all$adj.P.Val),]
proteins_sig_across_all
## Comparison Assay OLINK.ID Gene.ID Uniprot.ID
## OID009474 Mild_v_Critical Immune OID00947 IL6 P05231
## OID004824 Mild_v_Critical Inflammation OID00482 IL6 P05231
## OID009854 Mild_v_Critical Immune OID00985 CKAP4 Q07065
## OID003904 Mild_v_Critical Cardiovascular OID00390 IL6 P05231
## OID004064 Mild_v_Critical Cardiovascular OID00406 Gal-9 O00182
## OID003894 Mild_v_Critical Cardiovascular OID00389 IL-1ra P18510
## OID009654 Mild_v_Critical Immune OID00965 LILRB4 Q8NHJ6
## OID00315 Control_v_Mild Neurology OID00315 SIGLEC1 Q9BZZ2
## OID004644 Mild_v_Critical Cardiovascular OID00464 CA5A P35218
## OID005184 Mild_v_Critical Inflammation OID00518 PD-L1 Q9NZQ7
## OID005174 Mild_v_Critical Inflammation OID00517 IL-18R1 Q13478
## OID00441 Control_v_Mild Cardiovascular OID00441 MMP7 P09237
## OID005354 Mild_v_Critical Inflammation OID00535 CXCL10 P02778
## OID01015 Control_v_Mild Immune OID01015 LAMP3 Q9UQV4
## OID00406 Control_v_Mild Cardiovascular OID00406 Gal-9 O00182
## OID00535 Control_v_Mild Inflammation OID00535 CXCL10 P02778
## OID00464 Control_v_Mild Cardiovascular OID00464 CA5A P35218
## OID00389 Control_v_Mild Cardiovascular OID00389 IL-1ra P18510
## OID004844 Mild_v_Critical Inflammation OID00484 MCP-1 P13500
## OID003834 Mild_v_Critical Cardiovascular OID00383 SLAMF7 Q9NQ25
## OID005175 Severe_v_Critical Inflammation OID00517 IL-18R1 Q13478
## OID003895 Severe_v_Critical Cardiovascular OID00389 IL-1ra P18510
## OID00484 Control_v_Mild Inflammation OID00484 MCP-1 P13500
## OID00552 Control_v_Mild Inflammation OID00552 CX3CL1 P78423
## OID005524 Mild_v_Critical Inflammation OID00552 CX3CL1 P78423
## OID00300 Control_v_Mild Neurology OID00300 SCARB2 Q14108
## OID00517 Control_v_Mild Inflammation OID00517 IL-18R1 Q13478
## OID00985 Control_v_Mild Immune OID00985 CKAP4 Q07065
## OID004415 Severe_v_Critical Cardiovascular OID00441 MMP7 P09237
## OID010154 Mild_v_Critical Immune OID01015 LAMP3 Q9UQV4
## OID00482 Control_v_Mild Inflammation OID00482 IL6 P05231
## OID00965 Control_v_Mild Immune OID00965 LILRB4 Q8NHJ6
## OID005185 Severe_v_Critical Inflammation OID00518 PD-L1 Q9NZQ7
## OID00390 Control_v_Mild Cardiovascular OID00390 IL6 P05231
## OID004254 Mild_v_Critical Cardiovascular OID00425 MERTK Q12866
## OID005525 Severe_v_Critical Inflammation OID00552 CX3CL1 P78423
## OID003004 Mild_v_Critical Neurology OID00300 SCARB2 Q14108
## OID003005 Severe_v_Critical Neurology OID00300 SCARB2 Q14108
## OID004645 Severe_v_Critical Cardiovascular OID00464 CA5A P35218
## OID005355 Severe_v_Critical Inflammation OID00535 CXCL10 P02778
## OID003905 Severe_v_Critical Cardiovascular OID00390 IL6 P05231
## OID010155 Severe_v_Critical Immune OID01015 LAMP3 Q9UQV4
## OID009655 Severe_v_Critical Immune OID00965 LILRB4 Q8NHJ6
## OID009855 Severe_v_Critical Immune OID00985 CKAP4 Q07065
## OID004065 Severe_v_Critical Cardiovascular OID00406 Gal-9 O00182
## OID004825 Severe_v_Critical Inflammation OID00482 IL6 P05231
## OID00518 Control_v_Mild Inflammation OID00518 PD-L1 Q9NZQ7
## OID003835 Severe_v_Critical Cardiovascular OID00383 SLAMF7 Q9NQ25
## OID003155 Severe_v_Critical Neurology OID00315 SIGLEC1 Q9BZZ2
## OID00383 Control_v_Mild Cardiovascular OID00383 SLAMF7 Q9NQ25
## OID00947 Control_v_Mild Immune OID00947 IL6 P05231
## OID00425 Control_v_Mild Cardiovascular OID00425 MERTK Q12866
## OID003154 Mild_v_Critical Neurology OID00315 SIGLEC1 Q9BZZ2
## OID009475 Severe_v_Critical Immune OID00947 IL6 P05231
## OID004414 Mild_v_Critical Cardiovascular OID00441 MMP7 P09237
## OID004255 Severe_v_Critical Cardiovascular OID00425 MERTK Q12866
## OID004845 Severe_v_Critical Inflammation OID00484 MCP-1 P13500
## logFC CI.L CI.R AveExpr P.Value adj.P.Val
## OID009474 4.4228712 3.62937566 5.2163667 4.599221 1.140249e-14 1.092434e-12
## OID004824 4.4630021 3.64214137 5.2838627 4.752515 1.995853e-14 1.417056e-12
## OID009854 1.5962310 1.27018936 1.9222726 5.770290 7.610724e-13 4.297903e-11
## OID003904 4.4796139 3.50733004 5.4518977 5.603931 4.026385e-12 1.786708e-10
## OID004064 0.8021345 0.60998283 0.9942861 8.703112 7.297999e-11 2.294098e-09
## OID003894 1.8647427 1.41170028 2.3177852 6.414466 1.085524e-10 2.964316e-09
## OID009654 1.7068637 1.21500199 2.1987254 4.540222 1.030364e-08 1.524080e-07
## OID00315 1.7679780 1.22773304 2.3082230 6.017946 2.799808e-08 1.774878e-07
## OID004644 1.7929475 1.24844215 2.3374529 3.698926 3.228637e-08 3.697310e-07
## OID005184 1.0393479 0.71253313 1.3661627 6.175398 7.088193e-08 6.989746e-07
## OID005174 0.8407626 0.57270612 1.1088191 9.041958 9.619410e-08 9.201371e-07
## OID00441 1.5051666 0.96348315 2.0468501 8.902822 9.320688e-07 4.726921e-06
## OID005354 1.8266501 1.18902831 2.4642720 11.598658 6.382902e-07 5.035400e-06
## OID01015 1.7709154 1.08382405 2.4580068 4.448952 4.028238e-06 1.958938e-05
## OID00406 0.9575807 0.53472696 1.3804345 8.066001 3.388933e-05 1.503839e-04
## OID00535 2.4225698 1.33514026 3.5099994 10.083450 4.344471e-05 1.904058e-04
## OID00464 1.5867562 0.87292894 2.3005834 2.506874 4.490463e-05 1.944042e-04
## OID00389 1.4497846 0.77962760 2.1199416 5.217257 6.690251e-05 2.729930e-04
## OID004844 1.4299050 0.68662672 2.1731832 12.279563 3.377112e-04 1.120444e-03
## OID003834 1.0142716 0.48604464 1.5424986 3.951016 3.454442e-04 1.135488e-03
## OID005175 0.5789307 0.35997899 0.7978824 9.615739 7.829294e-06 1.389700e-03
## OID003895 1.0461594 0.62673486 1.4655839 7.317076 1.846293e-05 2.184780e-03
## OID00484 1.0033769 0.44533591 1.5614179 11.308540 6.988529e-04 2.444943e-03
## OID00552 0.6248476 0.26378658 0.9859087 4.096594 1.054237e-03 3.436315e-03
## OID005524 0.4964143 0.19675514 0.7960735 4.487630 1.692105e-03 4.585476e-03
## OID00300 0.5942784 0.23940756 0.9491492 4.593904 1.483496e-03 4.702154e-03
## OID00517 0.7590735 0.30267341 1.2154735 8.385347 1.577485e-03 4.955816e-03
## OID00985 0.5208326 0.20643949 0.8352257 4.652782 1.649907e-03 5.137868e-03
## OID004415 0.7183900 0.39560657 1.0411734 9.855883 8.578543e-05 6.090766e-03
## OID010154 0.8559165 0.32147663 1.3903563 5.362570 2.340196e-03 6.108601e-03
## OID00482 1.0826512 0.41449025 1.7508122 2.374035 2.029106e-03 6.236336e-03
## OID00965 0.8842353 0.33860331 1.4298673 3.297511 2.037789e-03 6.236336e-03
## OID005185 0.4927042 0.26255432 0.7228542 6.818685 1.389122e-04 7.044832e-03
## OID00390 1.1779535 0.44011449 1.9157924 3.145305 2.331035e-03 7.072797e-03
## OID004254 0.3441327 0.10710131 0.5811640 6.805271 5.364707e-03 1.244752e-02
## OID005525 0.6942100 0.34031318 1.0481069 4.788656 3.810444e-04 1.352708e-02
## OID003004 0.6278560 0.18779937 1.0679127 5.259929 6.167083e-03 1.371865e-02
## OID003005 0.7269831 0.33069805 1.1232681 5.682310 7.712754e-04 1.557798e-02
## OID004645 1.1593538 0.53270509 1.7860025 4.600708 7.078011e-04 1.557798e-02
## OID005355 0.6233459 0.26836309 0.9783287 13.038169 1.185146e-03 1.618180e-02
## OID003905 1.7990829 0.76797313 2.8301927 7.816368 1.259427e-03 1.655913e-02
## OID010155 0.6611489 0.27022973 1.0520681 5.922254 1.682039e-03 1.926206e-02
## OID009655 0.7328342 0.29089556 1.1747728 5.391841 2.008154e-03 2.160287e-02
## OID009855 0.8178668 0.30931307 1.3264206 6.649743 2.620216e-03 2.383006e-02
## OID004065 0.2754089 0.09731255 0.4535052 9.245245 3.626564e-03 3.065310e-02
## OID004825 1.7094165 0.59904232 2.8197907 6.926543 3.759657e-03 3.103903e-02
## OID00518 0.5913870 0.13486657 1.0479075 5.495612 1.214420e-02 3.124052e-02
## OID003835 0.8217724 0.27455457 1.3689902 4.415489 4.573717e-03 3.334589e-02
## OID003155 0.4096786 0.13311006 0.6862472 7.220382 5.081584e-03 3.340671e-02
## OID00383 0.8836114 0.18979563 1.5774272 3.188798 1.358490e-02 3.444742e-02
## OID00947 0.7428272 0.15726051 1.3283939 2.251074 1.395526e-02 3.488815e-02
## OID00425 0.4396680 0.08712706 0.7922090 6.391581 1.553126e-02 3.828886e-02
## OID003154 0.4064055 0.06085954 0.7519514 6.787408 2.219098e-02 4.124502e-02
## OID009475 1.7433857 0.51963356 2.9671379 6.629754 6.784581e-03 4.379139e-02
## OID004414 0.4945356 0.06584914 0.9232220 9.762914 2.470363e-02 4.513650e-02
## OID004255 0.3581592 0.09879963 0.6175187 7.065892 8.451639e-03 4.839245e-02
## OID004845 1.1567006 0.32369174 1.9897094 12.988422 8.134528e-03 4.839245e-02
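# count significant rows per gene name (a gene can be measured by more than one assay)
# unused factor levels are kept, so genes without a significant assay are listed with a count of 0
#proteins_sig_across_all$Gene.ID<-droplevels(proteins_sig_across_all$Gene.ID)
table(proteins_sig_across_all$Gene.ID)
##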
## 4E-BP1 ACE2 ADA
## 0 0 0
## ADAM-TS13 ADAM 22 ADAM 23
## 0 0 0
## ADM AGRP Alpha-2-MRAP
## 0 0 0
## AMBP ANGPT1 AREG
## 0 0 0
## ARNT ARTN AXIN1
## 0 0 0
## BACH1 BCAN Beta-NGF
## 0 0 0
## BIRC2 BMP-4 BMP-6
## 0 0 0
## BNP BOC BTN3A2
## 0 0 0
## CA5A CADM3 CASP-8
## 3 0 0
## CCL11 CCL17 CCL19
## 0 0 0
## CCL20 CCL23 CCL25
## 0 0 0
## CCL28 CCL3 CCL4
## 0 0 0
## CD200 CD200R1 CD244
## 0 0 0
## CD28 CD38 CD4
## 0 0 0
## CD40 CD40-L CD5
## 0 0 0
## CD6 CD83 CD84
## 0 0 0
## CD8A CDCP1 CDH3
## 0 0 0
## CDH6 CDSN CEACAM8
## 0 0 0
## CKAP4 CLEC10A CLEC1B
## 3 0 0
## CLEC4A CLEC4C CLEC4D
## 0 0 0
## CLEC4G CLEC6A CLEC7A
## 0 0 0
## CLM-1 CLM-6 CNTN5
## 0 0 0
## CNTNAP2 CPA2 CPM
## 0 0 0
## CRTAM CSF-1 CST5
## 0 0 0
## CTRC CTSC CTSL1
## 0 0 0
## CTSS CX3CL1 CXADR
## 0 3 0
## CXCL1 CXCL10 CXCL11
## 0 3 0
## CXCL12 CXCL5 CXCL6
## 0 0 0
## CXCL9 DAPP1 DCBLD2
## 0 0 0
## DCN DCTN1 DDR1
## 0 0 0
## DDX58 DECR1 DFFA
## 0 0 0
## DGKZ Dkk-1 Dkk-4
## 0 0 0
## DNER DPP10 DRAXIN
## 0 0 0
## EDA2R EDAR EFNA4
## 0 0 0
## EGLN1 EIF4G1 EIF5A
## 0 0 0
## EN-RAGE EPHB6 EZR
## 0 0 0
## FABP2 FAM3B FcRL2
## 0 0 0
## FCRL3 FCRL6 FGF-19
## 0 0 0
## FGF-21 FGF-23 FGF-5
## 0 0 0
## FGF2 FLRT2 Flt3L
## 0 0 0
## FS FXYD5 G-CSF
## 0 0 0
## gal-8 Gal-9 GALNT3
## 0 3 0
## GCP5 GDF-2 GDF-8
## 0 0 0
## GDNF GDNFR-alpha-3 GFR-alpha-1
## 0 0 0
## GH GIF GLB1
## 0 0 0
## GLO1 GM-CSF-R-alpha GT
## 0 0 0
## GZMA HAGH HAOX1
## 0 0 0
## HB-EGF HCLS1 HEXIM1
## 0 0 0
## HGF HNMT HO-1
## 0 0 0
## hOSCAR HSD11B1 HSP 27
## 0 0 0
## ICA1 IDUA IFN-gamma
## 0 0 0
## IFNLR1 IgG Fc receptor II-b IL-1 alpha
## 0 0 0
## IL-10RA IL-10RB IL-12B
## 0 0 0
## IL-15RA IL-17A IL-17C
## 0 0 0
## IL-17D IL-18R1 IL-1ra
## 0 3 3
## IL-20 IL-20RA IL-22 RA1
## 0 0 0
## IL-24 IL-27 IL-2RB
## 0 0 0
## IL-4RA IL-5R-alpha IL10
## 0 0 0
## IL12 IL12RB1 IL13
## 0 0 0
## IL16 IL18 IL1RL2
## 0 0 0
## IL2 IL33 IL4
## 0 0 0
## IL5 IL6 IL7
## 0 9 0
## IL8 IRAK1 IRAK4
## 0 0 0
## IRF9 ITGA11 ITGA6
## 0 0 0
## ITGB1BP2 ITGB6 ITM2A
## 0 0 0
## JAM-B JUN KIM1
## 0 0 0
## KLRD1 KPNA1 KRT19
## 0 0 0
## KYNU LAG3 LAIR-2
## 0 0 0
## LAMP3 LAP TGF-beta-1 LAT
## 3 0 0
## LAYN LEP LIF
## 0 0 0
## LIF-R LILRB4 LOX-1
## 0 3 0
## LPL LXN LY75
## 0 0 0
## MANF MAPT MARCO
## 0 0 0
## MASP1 MATN3 MCP-1
## 0 0 3
## MCP-2 MCP-3 MCP-4
## 0 0 0
## MDGA1 MERTK MGMT
## 0 3 0
## MILR1 MMP-1 MMP-10
## 0 0 0
## MMP12 MMP7 MSR1
## 0 3 0
## N-CDase N2DL-2 NAAA
## 0 0 0
## NBL1 NCAN NCR1
## 0 0 0
## NEMO NEP NF2
## 0 0 0
## NFATC3 NMNAT1 Nr-CAM
## 0 0 0
## NRP2 NRTN NT-3
## 0 0 0
## NTF4 NTRK2 NTRK3
## 0 0 0
## OPG OSM PADI2
## 0 0 0
## PAPPA PAR-1 PARP-1
## 0 0 0
## PD-L1 PD-L2 PDGF-R-alpha
## 3 0 0
## PDGF subunit B PGF PIgR
## 0 0 0
## PIK3AP1 PLXNA4 PLXNB1
## 0 0 0
## PLXNB3 PPP1R9B PRDX1
## 0 0 0
## PRDX3 PRDX5 PRELP
## 0 0 0
## PRKCQ PRSS27 PRSS8
## 0 0 0
## PRTG PSGL-1 PSIP1
## 0 0 0
## PTH1R PTX3 PVR
## 0 0 0
## RAGE REN RGMA
## 0 0 0
## RGMB ROBO2 RSPO1
## 0 0 0
## SCARA5 SCARB2 SCARF2
## 0 3 0
## SCF SERPINA12 sFRP-3
## 0 0 0
## SH2B3 SH2D1A Siglec-9
## 0 0 0
## SIGLEC1 SIRT2 SIT1
## 3 0 0
## SKR3 SLAMF1 SLAMF7
## 0 0 3
## SMOC2 SMPD1 SOD2
## 0 0 0
## SORT1 SPOCK1 SPON2
## 0 0 0
## SPRY2 SRC SRPK2
## 0 0 0
## ST1A1 STAMBP STC1
## 0 0 0
## STK4 TANK TF
## 0 0 0
## TGF-alpha TGM2 THBS2
## 0 0 0
## THPO THY 1 TIE2
## 0 0 0
## TM TMPRSS5 TN-R
## 0 0 0
## TNF TNFB TNFRSF10A
## 0 0 0
## TNFRSF11A TNFRSF12A TNFRSF13B
## 0 0 0
## TNFRSF21 TNFRSF9 TNFSF14
## 0 0 0
## TPSAB1 TRAF2 TRAIL
## 0 0 0
## TRAIL-R2 TRANCE TREM1
## 0 0 0
## TRIM21 TRIM5 TSLP
## 0 0 0
## TWEAK UNC5C uPA
## 0 0 0
## VEGFA VEGFD VSIG2
## 0 0 0
## VWC2 WFIKKN1 XCL1
## 0 0 0
## ZBTB16
## 0
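The long run of zero counts above comes from `Gene.ID` being a factor that still carries every gene on the panels as a level, even after filtering. A minimal sketch of the effect of the commented-out `droplevels()` call follows; `toy_sig` and its factor levels are made up for illustration and are not the real results table.

```r
# Toy stand-in for proteins_sig_across_all (hypothetical data, not real results)
toy_sig <- data.frame(
  OLINK.ID = c("OID00947", "OID00482", "OID00985"),
  Gene.ID  = factor(c("IL6", "IL6", "CKAP4"),
                    levels = c("ACE2", "CKAP4", "IL6", "TNF"))
)

# With unused levels kept, table() also reports ACE2 and TNF with a count of 0
counts_all <- table(toy_sig$Gene.ID)

# droplevels() discards levels that no longer occur in the data,
# so only genes that are actually present are tabulated
counts_used <- table(droplevels(toy_sig$Gene.ID))

print(counts_all)
print(counts_used)
```

Applied to the real table, this would reduce the output to the genes with non-zero counts only.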
plot
# plot most anti-correlated
ggplot(data=protein_data_clean[c(1,(grep("OID00390", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00390, fill=as.character(Group))) +
geom_boxplot()
ggplot(data=protein_data_clean[c(1,(grep("OID00985", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00985, fill=as.character(Group))) +
geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
ggplot(data=protein_data_clean[c(1,(grep("OID00406", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00406, fill=as.character(Group))) +
geom_boxplot()
ggplot(data=protein_data_clean[c(1,(grep("OID00389", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00389, fill=as.character(Group))) +
geom_boxplot()
ggplot(data=protein_data_clean[c(1,(grep("OID00965", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00965, fill=as.character(Group))) +
geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
ggplot(data=protein_data_clean[c(1,(grep("OID00518", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00518, fill=as.character(Group))) +
geom_boxplot()
Summary:
- 9 proteins significantly different
- 6 unique proteins:
- CKAP4
- Gal-9
- IL-1ra
- IL6
- LILRB4
- PD-L1
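The summary separates the number of significant proteins from the number of unique ones. In the results table, IL6 appears under three OLINK IDs (OID00947, OID00482, OID00390), one each on the Immune, Inflammation and Cardiovascular panels, so assay counts and gene counts can differ. The sketch below shows one way the two counts could be derived; the data frame `toy_sig` is a hypothetical subset built from IDs printed earlier, and whether the summary figures were obtained this way is an assumption.

```r
# Hypothetical subset: OLINK IDs and gene names copied from the printed table,
# everything else about this data frame is illustrative
toy_sig <- data.frame(
  OLINK.ID = c("OID00947", "OID00482", "OID00390", "OID00985", "OID00406"),
  Gene.ID  = c("IL6", "IL6", "IL6", "CKAP4", "Gal-9"),
  stringsAsFactors = FALSE
)

# Number of distinct assays versus number of distinct gene names
n_assays <- length(unique(toy_sig$OLINK.ID))
n_genes  <- length(unique(toy_sig$Gene.ID))

# Genes that are covered by more than one assay
assays_per_gene <- tapply(toy_sig$OLINK.ID, toy_sig$Gene.ID,
                          function(ids) length(unique(ids)))
multi_assay_genes <- names(assays_per_gene)[assays_per_gene > 1]

print(c(assays = n_assays, genes = n_genes))
print(multi_assay_genes)
```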
Manuscript plots
plots
tiff(file=paste(paste(output_dir, "plots/NF2.tiff", sep="/")))
ggplot(data=protein_data_clean[c(1,(grep("OID00981", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00981, fill=as.character(Group))) +
geom_boxplot() +
labs(x = "Disease Group",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("NF2 Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
NF2 <- protein_data_clean[c(1,2,3,4,5,6, (grep("OID00981", colnames(protein_data_clean))))]
# order samples by disease group
NF2<-NF2[order(NF2$Group),]
# add a running sample index to use as the x-axis position
NF2$sample<-1:nrow(NF2)
tiff(file=paste(paste(output_dir, "plots/NF2_individual.tiff", sep="/")), width = 10, height = 10, units = 'in', res = 300)
ggplot(data= NF2, aes(x=sample, y=OID00981, fill=as.character(Group))) +
geom_bar(stat="identity") +
labs(x = "Samples",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("NF2 Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
tiff(file=paste(paste(output_dir, "plots/CKAP4.tiff", sep="/")))
ggplot(data=protein_data_clean[c(1,(grep("OID00985", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00985, fill=as.character(Group))) +
geom_boxplot() +
labs(x = "Disease Group",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("CKAP4 Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
tiff(file=paste(paste(output_dir, "plots/Gal-9.tiff", sep="/")))
ggplot(data=protein_data_clean[c(1,(grep("OID00406", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00406, fill=as.character(Group))) +
geom_boxplot() +
labs(x = "Disease Group",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("Gal-9 Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
tiff(file=paste(paste(output_dir, "plots/IL-1ra.tiff", sep="/")))
ggplot(data=protein_data_clean[c(1,(grep("OID00389", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00389, fill=as.character(Group))) +
geom_boxplot() +
labs(x = "Disease Group",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("IL-1ra Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
tiff(file=paste(paste(output_dir, "plots/PD-L1.tiff", sep="/")))
ggplot(data=protein_data_clean[c(1,(grep("OID00518", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00518, fill=as.character(Group))) +
geom_boxplot() +
labs(x = "Disease Group",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("PD-L1 Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
tiff(file=paste(paste(output_dir, "plots/LILRB4.tiff", sep="/")))
ggplot(data=protein_data_clean[c(1,(grep("OID00965", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00965, fill=as.character(Group))) +
geom_boxplot() +
labs(x = "Disease Group",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("LILRB4 Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
tiff(file=paste(paste(output_dir, "plots/EDA2R.tiff", sep="/")))
ggplot(data=protein_data_clean[c(1,(grep("OID00370", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00370, fill=as.character(Group))) +
geom_boxplot() +
labs(x = "Disease Group",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("EDA2R Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
tiff(file=paste(paste(output_dir, "plots/SCARB2.tiff", sep="/")))
ggplot(data=protein_data_clean[c(1,(grep("OID00300", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00300, fill=as.character(Group))) +
geom_boxplot() +
labs(x = "Disease Group",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("SCARB2 Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
tiff(file=paste(paste(output_dir, "plots/MANF.tiff", sep="/")))
ggplot(data=protein_data_clean[c(1,(grep("OID00374", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00374, fill=as.character(Group))) +
geom_boxplot() +
labs(x = "Disease Group",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("MANF Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
tiff(file=paste(paste(output_dir, "plots/LAT.tiff", sep="/")))
ggplot(data=protein_data_clean[c(1,(grep("OID00371", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00371, fill=as.character(Group))) +
geom_boxplot() +
labs(x = "Disease Group",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("LAT Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
tiff(file=paste(paste(output_dir, "plots/SIGLEC1.tiff", sep="/")))
ggplot(data=protein_data_clean[c(1,(grep("OID00315", colnames(protein_data_clean))))],
aes(x=as.character(Group), y=OID00315, fill=as.character(Group))) +
geom_boxplot() +
labs(x = "Disease Group",
y = "Expression (NPX)") +
scale_fill_discrete(name = "Disease Group", labels = c("Control", "Mild", "Severe", "Critical")) +
scale_x_discrete(labels=c("0" = "Control", "1" = "Mild", "2" = "Severe", "3" = "Critical")) +
ggtitle("SIGLEC1 Expression")+
theme(plot.title = element_text(hjust = 0.5))
dev.off()
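The manuscript boxplots above repeat the same block and change only the OLINK ID, the gene name, and the file name. A minimal sketch of how that pattern could be wrapped in a function is given here. `plot_protein_boxplot`, `save_protein_boxplot`, `toy_data` and `targets` are hypothetical names introduced for illustration, and the sketch assumes the data has a `Group` column coded 0 to 3 as in the plots above.

```r
library(ggplot2)

# Hypothetical helper: builds the same boxplot as the blocks above for one assay
plot_protein_boxplot <- function(data, olink_id, gene_name) {
  ggplot(data = data,
         aes(x = as.character(Group), y = .data[[olink_id]],
             fill = as.character(Group))) +
    geom_boxplot() +
    labs(x = "Disease Group", y = "Expression (NPX)") +
    scale_fill_discrete(name = "Disease Group",
                        labels = c("Control", "Mild", "Severe", "Critical")) +
    scale_x_discrete(labels = c("0" = "Control", "1" = "Mild",
                                "2" = "Severe", "3" = "Critical")) +
    ggtitle(paste(gene_name, "Expression")) +
    theme(plot.title = element_text(hjust = 0.5))
}

# Hypothetical helper: writes one plot to <output_dir>/plots/<gene_name>.tiff
save_protein_boxplot <- function(data, olink_id, gene_name, output_dir) {
  out_file <- file.path(output_dir, "plots", paste0(gene_name, ".tiff"))
  tiff(file = out_file)
  on.exit(dev.off())
  # inside a function a ggplot object is not auto-printed, so print() is required
  print(plot_protein_boxplot(data, olink_id, gene_name))
  invisible(out_file)
}

# Toy data standing in for protein_data_clean (random values, not real NPX data)
set.seed(1)
toy_data <- data.frame(Group = rep(0:3, each = 5), OID00985 = rnorm(20))
p <- plot_protein_boxplot(toy_data, "OID00985", "CKAP4")

# Possible use on the real data, with the ID-to-gene pairs taken from the plots above:
# targets <- c(CKAP4 = "OID00985", "Gal-9" = "OID00406", "IL-1ra" = "OID00389")
# for (gene in names(targets)) {
#   save_protein_boxplot(protein_data_clean, targets[[gene]], gene, output_dir)
# }
```

The `.data[[olink_id]]` pronoun lets the column be chosen by a string, which removes the need to subset the data frame with `grep()` before each plot.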