Environment details

# Working directory structure
dir.create(path = "./RIME/objects", recursive = T, showWarnings = F)
dir.create(path = "./RIME/tables", recursive = T, showWarnings = F)
dir.create(path = "./RIME/plots", recursive = T, showWarnings = F)


# Required libraries
require(DEprot)
require(dplyr)
require(ggplot2)

# Optional attachment
require(data.table)
require(patchwork)
require(ggpubr)
require(purrr)

# Not included in the DEprot dependencies
## To install:
## BiocManager::install("AnnotationDbi")
require(AnnotationDbi)

1 Data description

Data have been obtained from N.Eickhoff et al. (Communications Biology, 2025) and data are accessible at the ProteomeXchange Consortium through the PRIDE partner repository with the identifier PXD058914.

LFQ intensities, and corresponding metadata, for a selection of samples is available in the ./source folder. This dataset involves experiments of Rapid immunoprecipitation mass spectrometry of endogenous proteins (RIME) for the Androgen Receptor (AR) in prostate cancer cells. Notably, these data have been generated in 3 different batches: each batch has been analyzed independently using MaxQuant, later the 3 unimputed intensities tables have been merged using accession number as merging-key.

The data involve LNCaP and LNCaP-abl (resistant to hormone deprivation but still expressing AR) prostate cancer cells. LNCaP are either in the wild-type (WT) version, or monoclonal knock-out for the TRIM33 protein (TRIM33-5#MC-C2) - and respective non-targeting control (NT-3) -. Cells have been cultured using medium supplied with regular Fetal-Bovine Serum (FBS) or hormone deprived for 3 days before being stimulated with 5nM R1881 (synthetic androgen) for 4 hours. For each condition immunoprecipitation with AR and IgG (negative control) was performed in 4 biological replicates. Pulled-down proteins were detected using Label-Free Quantitation (LFQ) mass spectrometry. Peptide mixtures were analyzed by nanoLC-MS/MS on an Orbitrap Exploris 480 Mass Spectrometer equipped with an EASY-NLC 1200 system (Thermo Scientific).

1.1 Data import

## Import metadata
metadata <- data.table::fread("./sources/RIME_metadata_combined.txt", data.table = F)
metadata

## Import combined LFQ-intensities
lfq <- data.table::fread("./sources/RIME_LFQ_intensities_unimputed_combined.txt", data.table = F)
rownames(lfq) <- lfq$gene.id
head(lfq)

NOTE: LFQ-intensities contain NAs because they belong to unimputed data. Further, gene.id does not contain unique values, therfore a the make.unique function was applied: duplicated gene IDs contain a .X index and can be used as rownames/unique identifiers.

1.2 Load `DEprot` object (dpo)

As above mentioned, the data belong to 3 different batches, therefore a batch correction will be required. For this reason, the data must be loaded as “raw” data.

dpo <- load.counts(counts = lfq[,-c(1:2)], # removing IDs
                   metadata = metadata,
                   log.base = 2,
                   imputation = NA,
                   normalization.method = NA,
                   column.id = "column.id")

dpo

## DEprot object:
##            Samples:  32
##           Proteins:  3285
##   Counts available:  raw
## Log transformation:  log2
##   Metadata columns:  column.id, batch, cell, condition, treatment, target, group, replicate

2 Batch correction/normalization

Proteomics is highly susceptible to batch effects. Hereafter we use the HarmonizR package, developed by H.Voß & S.Schlumbohm (Nat.Commun., 2022). This tool besides handling multiple experiments, allows for the batch correction of data derived by the combination of both DIA (Data-Independent Acquisition) and DDA (Data-Dependent Acquisition).

To use the the harmonize.batches function, it is sufficient to provide a DEprot object containing a combined table of multiple experiments and indicate the identifier column in the metadata table which corresponds to the batch annotation (in our example: batch).

HarmonizR is not a required dependency, therefore if this function is used and HarmonizR is not already installed, a warning will indicate the requred installation. The package can be installed via: BiocManager::install("HarmonizR"), or alternatively devtools::install_github("https://github.com/SimonSchlumbohm/HarmonizR/", subdir="HarmonizR").

dpo <- harmonize.batches(DEprot.object = dpo,
                         batch.column = "batch",
                         cores = 5)

dpo

## DEprot object:
##            Samples:  32
##           Proteins:  3153
##   Counts available:  raw, normalized
## Log transformation:  log2
##   Metadata columns:  column.id, batch, cell, condition, treatment, target, group, replicate

The batch-corrected data will be stored in the “norm.counts” slot of the dpo object (notice the “Counts available: raw, normalized”).

It is possible to compare the distribution of the data before and after normalization using the boxplots already created in the dpo object or using the function plot.counts:

patchwork::wrap_plots(dpo@boxplot.raw + xlab(NULL),
                      dpo@boxplot.norm + xlab(NULL),
                      ncol = 1)

3 Imputation

Often many NA/NaN values are present in the LFQ tables due to the technical limitations of the protein detection by Mass Spectrometry (MS) experiments, or due to biological reasons (e.g., IgG negative control in RIMEs).

protein.counts.bySample <- 
  protein.summary(DEprot.object = dpo,
                  group.column = "column.id",
                  n.labels = "counts",
                  show.frequency = T,
                  x.label.angle = 0,
                  title = "**# Proteins identified per _Sample_**",
                  colors = c("#9c9ab5", "navy")) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

protein.counts.bySample

Notably, visualization of the amount of proteins detected per group shows that IgG samples have - as expected - less reproducible protein detection:

protein.counts.byGroup <- 
  protein.summary(DEprot.object = dpo,
                  group.column = "group",
                  n.labels = "percentage",
                  show.frequency = T,
                  x.label.angle = 0,
                  title = "**# Proteins identified per _Group_**") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

protein.counts.byGroup

DEprot allows for the possibility to use 9 different methods for the imputation of the data:

missForest: developed by DJ.Stekhoven & P.Buehlmann (Bioinformatics, 2012), this tool will impute the NaN and assign and estimated value. It also yields an out-of-bag (OOB) imputation error estimate (general, or per each sample). Moreover, it can be run parallel to save computation time (both examples reported here after).
k-Nearest Neighbors (kNN) algorithm
correlation-Nearest Neighbors (corkNN) algorithm
truncated-Nearest Neighbors (tkNN) algorithm
Local Least Squares (LLS) imputation, using the pearson correlation coefficient
Singular Value Decomposition (SVD)
Glmnet ridge regression (RegImpute, from DreamAI)
Probabilistic Principal Component Analysis (PPCA)
Bayesian Principal Component Analysis (BPCA)

Of note, there are cases in which the missing data are not due to random missingness but rather consequential to real biological effects (i.e. knock-down, knock-out, degron systems and PROTACs, etc.). This may lead to the issue that a protein completely missing in all the replicates of an individual condition will be imputed with relatively high values This would ultimately lead to the incorrect definition of the differential status of this protein.

In the specific case of our RIME dataset these “non-random” missing values can be found in two cases: * IgG control: the IgG pull-down “background” proteins that might be detected in a non reproducible manner (e.g. 1 out of 4 replicates); * TRIM33 knock-out: TRIM33 protein is “biologically” not present in the LNCaP_TRIM33-5#MC-C2 cell line.

3.1 Imputation non-random missing values

To overcome this hurdle, it is possible to use the function randomize.missing.values to assign random values from the bottom distribution of the full dataset. The user can define a percentage threshold of minimal missing values within the group; for instance if the threshold is 75% and in each group there are 4 samples (as in our dataset), it means that only proteins with at least 3 out of 4 missing values will be imputed with this method. It will be sufficient to indicate the ID of a column from the metadata to assign the groups/conditions alongside the percentage of bottom distribution from which to select random numbers (“tail.percentage” parameter).

The imputation will overwrite the normalized data adding the imputed values. For this reason we will store the unimputed results in another variable in order to be able to compare the data before and after random-imputation.

## Store old data
dpo_unimputed <- dpo

## Imputation non-random missing values
dpo <- randomize.missing.values(DEprot.object = dpo,
                                group.column = "group",
                                percentage.missing = 75, # 3~4 out of 4 missing
                                tail.percentage = 3,
                                seed = 168,
                                verbose = F)

3.1.1 Comparison unimputed vs random-imputed

Hereafter we will compare the differences upon non-random imputation for the two cases mentioned before.

3.1.1.1 AR vs IgG RIME

In this comparison we will show the differences between an AR-RIME and the IgG for LNCaP cells (batch A). 25 random proteins will be displayed.

## Define random proteins to show
set.seed(168)
proteins.to.show <- sample(rownames(dpo@norm.counts), size = 25)

rime_unimputed <-
  heatmap.counts(DEprot.object = dpo_unimputed,
                 which.data = "normalized",
                 sample.subset = metadata[metadata$batch == "A", "column.id"],
                 protein.subset = proteins.to.show,
                 clust.columns = F,
                 clust.rows = F,
                 show.protein.names = T,
                 na.color = "firebrick",
                 color.limits = c(19,31),
                 title = "Unimputed")


rime_random <-
  heatmap.counts(DEprot.object = dpo,
                 which.data = "normalized",
                 sample.subset = metadata[metadata$batch == "A", "column.id"],
                 protein.subset = proteins.to.show,
                 clust.columns = F,
                 clust.rows = F,
                 show.protein.names = F,
                 na.color = "firebrick",
                 color.limits = c(19,31),
                 title = "Random")


patchwork::wrap_plots(rime_unimputed@heatmap, rime_random@heatmap)

3.1.1.2 AR-RIME NT vs TRIM33-KO

In this comparison we will show the differences of TRIM33 intensities in the AR-RIME for LNCaP_NT-3 (control) vs LNCaP_TRIM33-5#MC-C2 (batch C, AR target).

rime_unimputed_trim33 <-
  heatmap.counts(DEprot.object = dpo_unimputed,
                 which.data = "normalized",
                 sample.subset = metadata[metadata$batch == "C" & metadata$target == "AR", "column.id"],
                 protein.subset = "TRIM33",
                 clust.columns = F,
                 clust.rows = F,
                 show.protein.names = T,
                 na.color = "firebrick",
                 color.limits = c(22,28),
                 title = "Unimputed")


rime_random_trim33 <-
  heatmap.counts(DEprot.object = dpo,
                 which.data = "normalized",
                 sample.subset = metadata[metadata$batch == "C" & metadata$target == "AR", "column.id"],
                 protein.subset = "TRIM33",
                 clust.columns = F,
                 clust.rows = F,
                 show.protein.names = T,
                 na.color = "firebrick",
                 color.limits = c(22,28),
                 title = "Random")


patchwork::wrap_plots(rime_unimputed_trim33@heatmap,
                      rime_random_trim33@heatmap +
                        theme(axis.text.x = element_blank(),
                              legend.position = "none"),
                      ncol = 1)

3.2 Imputation of random missing values

3.2.1 Choice of the imputation method

DEprot provides 9 methods to imputed data: missForest, kNN, corkNN, tkNN, LLS, SVD, RegImpute, bPCA, pPCA.
We implemented the function compare.imp.methods to compare the different imputation methods and choose the best one for your data type. Indeed, your dataset is subset and NAs are introduce simulating distribution and proportion of the original data.

rime_imp_comparison <- compare.imp.methods(DEprot.object = dpo,
                                           percentage.test = 100,
                                           correlation.method = "pearson",
                                           sample.group.column = "group",
                                           seed = 1234,
                                           pcaMethods.nPCs.to.test = 3)

saveRDS(object = rime_imp_comparison,
        file = "./RIME/objects/rime_imputation_comparison.Rds")

rime_imp_comparison

pdf("./RIME/plots/rime_imputation_comparison_radar.plot.pdf",
    width = 10, height = 6)
replayPlot(radar_plot_imp)
invisible(dev.off())

Considering the RMSE and correlation coefficients the best method for these RIME data is RegImpute.

3.2.1.1 Run `RegImpute` imputation

In order to be able to compare the impact of randomize.missing.values function on knock-out proteins, we will impute the data on both the dpo and dpo_unimputed objects.

## Before randomize.missing.values
dpo_unimputed <- impute.counts(DEprot.object = dpo_unimputed,
                               method = "RegImpute",
                               use.normalized.data = TRUE,
                               verbose = TRUE)


## After randomize.missing.values
dpo <- impute.counts(DEprot.object = dpo,
                     method = "RegImpute",
                     use.normalized.data = TRUE,
                     verbose = TRUE)

dpo


### Export objects
saveRDS(object = dpo_unimputed, "./RIME/objects/dpo_imputedRegImpute.wo.random.Rds")
saveRDS(object = dpo, "./RIME/objects/dpo_random.and.imputedRegImpute.Rds")

3.2.2 Data distribution after imputation

Similarly to the post-normalization data, it is now possible to compare the distribution of the data post-imputation using boxplots:

patchwork::wrap_plots(dpo@boxplot.raw + xlab(NULL),
                      dpo@boxplot.norm + xlab(NULL),
                      dpo@boxplot.imputed + xlab(NULL),
                      ncol = 1)

It is appreciable how the IgG negative controls - as expected - display a lower distribution compared to AR-RIME.

Remarkably, comparing the 2-steps imputation (random + RegImpute) and the RegImpute alone, is possible to observe that the double imputation method shows a strikingly different distribution for the bottom tale compare to the RegImpute-alone one. This implies that RegImpute alone is not able to properly impute non-random missing values and tends to “homogenize” the intensities.

comp_density_imputation =
  ggpubr::ggdensity(reshape2::melt(data = dpo@raw.counts)[,3],
                    fill = "gray",
                    title = "Raw data") +
  xlab("LFQ") +
  theme(aspect.ratio = 1) +
  ggpubr::ggdensity(reshape2::melt(data = dpo_unimputed@imputed.counts)[,3],
                    fill = "gray",
                    title = "Imputed RegImpute") +
  xlab("LFQ") +
  theme(aspect.ratio = 1) +
  ggpubr::ggdensity(reshape2::melt(data = dpo@imputed.counts)[,3],
                    fill = "gray",
                    title = "Imputed Random + RegImpute") +
  theme(aspect.ratio = 1) +
  xlab("LFQ")

comp_density_imputation

3.2.3 Examples of post-imputation values

Now it is possible to add a third heatmap (imputed counts) to the previous comparisons.

3.2.3.1 AR vs IgG RIME, RegImpute

rime_random_reg <-
  heatmap.counts(DEprot.object = dpo,
                 which.data = "imputed",
                 sample.subset = metadata[metadata$batch == "A", "column.id"],
                 protein.subset = proteins.to.show,
                 clust.columns = FALSE,
                 clust.rows = FALSE,
                 show.protein.names = FALSE,
                 na.color = "firebrick",
                 color.limits = c(19,31),
                 title = "Random + RegImpute")


patchwork::wrap_plots(rime_unimputed@heatmap, rime_random@heatmap, rime_random_reg@heatmap)

3.2.3.2 AR-RIME NT vs TRIM33-KO, RegImpute

Here we can compare the effects of omitting the randomize.missing.values step for proteins that are missing not at random.

rime_RegImpute.only_trim33 <-
  heatmap.counts(DEprot.object = dpo_unimputed,
                 which.data = "imputed",
                 sample.subset = metadata[metadata$batch == "C" & metadata$target == "AR", "column.id"],
                 protein.subset = "TRIM33",
                 clust.columns = F,
                 clust.rows = F,
                 show.protein.names = T,
                 na.color = "firebrick",
                 color.limits = c(22,28),
                 title = "RegImpute only")

patchwork::wrap_plots(rime_random_trim33@heatmap,
                      rime_RegImpute.only_trim33@heatmap +
                        theme(axis.text.x = element_blank(),
                              legend.position = "none"),
                      ncol = 1)

4 Quality controls

Once the data are normalized and imputed, it is good practice to validate the quality and reproducibility of the data.
DEprot can generate correlations and perform principal component analyses (PCA) between samples.

4.1 Correlations

The correlations can be performed globally as well as within sample subgroups. The clustering distance is equivalent to the \(1 - correlation\).

correlation_all <-
  plot.correlation.heatmap(DEprot.object = dpo,
                           correlation.method = "spearman",
                           which.data = "imputed",
                           display.values = FALSE,
                           plot.subtitle = "All samples")


correlation_all

correlation_AR <- 
   plot.correlation.heatmap(DEprot.object = dpo,
                           correlation.method = "spearman",
                           which.data = "imputed",
                           sample.subset = metadata[metadata$target == "AR", "column.id"],
                           display.values = TRUE,
                           plot.subtitle = "AR RIME")

correlation_IgG <- 
   plot.correlation.heatmap(DEprot.object = dpo,
                           correlation.method = "spearman",
                           which.data = "imputed",
                           sample.subset = metadata[metadata$target == "IgG", "column.id"],
                           display.values = TRUE,
                           plot.subtitle = "IgG RIME")

patchwork::wrap_plots(correlation_AR@heatmap, correlation_IgG@heatmap)

4.2 Principal Component Analyses

Similarly to the correlation, it is possible to asses the quality of the the data analyzing the segregation of the data in principal-component space.

## Compute PCs for all samples
PCA <- perform.PCA(DEprot.object = dpo, which.data = "imputed")

## Plotting PC1-vs-PC2 and PC2-vs-PC3 (all samples)
plot.PC.scatter.123(DEprot.PCA.object = PCA,
                    color.column = "group",
                    shape.column = "batch",
                    label.column = "replicate",
                    plot.zero.line.x.12 = FALSE,
                    dot.colors = rev(RColorBrewer::brewer.pal(n = 8, name = "Paired")))

AR-RIMEs only

## Compute PCs for AR-RIMES
PCA_AR <- perform.PCA(DEprot.object = dpo,
                      which.data = "imputed",
                      sample.subset = metadata[metadata$target == "AR", "column.id"])

## Plotting PC1-vs-PC2 and PC2-vs-PC3 (all samples)
plot.PC.scatter.123(DEprot.PCA.object = PCA_AR,
                    color.column = "group",
                    shape.column = "batch",
                    label.column = "replicate",
                    plot.zero.line.x.12 = FALSE,
                    dot.colors = rev(RColorBrewer::brewer.pal(n = 8, name = "Paired"))[c(1,3,5,7)])

5 Differential expression analyses

Two possible type of differential analysis are available in DEprot:

multiple t-test: per each protein a t-test is performed to compare the groups means
limma: limma is a software originally developed for the analyses of microarray expression data; it fits the data into a negative binomial model in order to define differences between groups.

5.1 Test normality

In the next examples we made use of limma to establish differential protein enrichment. However, limma assumes that the distribution of the intensity is normal. To validate this assumption, the function check.normality can be employed to perform an Anderson-Darling normality test per each sample, and plot Q-Q plots and data distributions.

## Test normality
norm.test <- check.normality(DEprot.object = dpo, p.threshold = 0.5, which.data = "imputed")

## All samples display a normal distribution.

patchwork::wrap_plots(norm.test@qqplots)

patchwork::wrap_plots(norm.test@densities)

5.2 Differential analyses: `limma`

Hereafter, we will compare each AR-RIME with the corresponding IgG control, study the effects of TRIM33-KO on AR-interactome composition comparing LNCaP_NT-3_FBS_AR vs LNCaP_TRIM33-5#MC-C2_FBS_AR, and finally evaluate the AR-interactome when the cells are hormone-deprived LNCaP_WT_R1881_AR vs LNCaP-abl_WT_R1881_AR.

dpo_analyses <-
  diff.analyses.limma(DEprot.object = dpo,
                      contrast.list = list(c("group", "LNCaP_NT-3_FBS_AR", "LNCaP_NT-3_FBS_IgG"),
                                           c("group", "LNCaP_TRIM33-5#MC-C2_FBS_AR", "LNCaP_TRIM33-5#MC-C2_FBS_IgG"),
                                           c("group", "LNCaP_WT_R1881_AR", "LNCaP_WT_R1881_IgG"),
                                           c("group", "LNCaP-abl_WT_R1881_AR", "LNCaP-abl_WT_R1881_IgG"),
                                           c("group", "LNCaP_TRIM33-5#MC-C2_FBS_AR", "LNCaP_NT-3_FBS_AR"),
                                           c("group", "LNCaP_WT_R1881_AR", "LNCaP-abl_WT_R1881_AR")),
                      linear.FC.th = 2,
                      padj.th = 0.05,
                      padj.method = "BH", #Benjamini-Hochberg
                      fitting.method = "ls",
                      include.rep.model = TRUE,
                      replicate.column = "replicate",
                      which.data = "imputed")

saveRDS(dpo_analyses, "./RIME/objects/limma_differential_analyses.Rds")

dpo_analyses

## DEprot.analyses object:
##           Counts used:  imputed
## Fold Change threshold:  2 (linear)
## FC unresponsive range:  [0.9090909,1.1] (linear)
##        padj threshold:  0.05 (linear)
##           padj method:  BH
## 
## 
## Differential results summary:
##                                                           contrast.id
## 1                      group: LNCaP_NT-3_FBS_AR vs LNCaP_NT-3_FBS_IgG
## 2                      group: LNCaP_NT-3_FBS_AR vs LNCaP_NT-3_FBS_IgG
## 3                      group: LNCaP_NT-3_FBS_AR vs LNCaP_NT-3_FBS_IgG
## 4                      group: LNCaP_NT-3_FBS_AR vs LNCaP_NT-3_FBS_IgG
## 5  group: LNCaP_TRIM33-5#MC-C2_FBS_AR vs LNCaP_TRIM33-5#MC-C2_FBS_IgG
## 6  group: LNCaP_TRIM33-5#MC-C2_FBS_AR vs LNCaP_TRIM33-5#MC-C2_FBS_IgG
## 7  group: LNCaP_TRIM33-5#MC-C2_FBS_AR vs LNCaP_TRIM33-5#MC-C2_FBS_IgG
## 8  group: LNCaP_TRIM33-5#MC-C2_FBS_AR vs LNCaP_TRIM33-5#MC-C2_FBS_IgG
## 9                      group: LNCaP_WT_R1881_AR vs LNCaP_WT_R1881_IgG
## 10                     group: LNCaP_WT_R1881_AR vs LNCaP_WT_R1881_IgG
## 11                     group: LNCaP_WT_R1881_AR vs LNCaP_WT_R1881_IgG
## 12                     group: LNCaP_WT_R1881_AR vs LNCaP_WT_R1881_IgG
## 13             group: LNCaP-abl_WT_R1881_AR vs LNCaP-abl_WT_R1881_IgG
## 14             group: LNCaP-abl_WT_R1881_AR vs LNCaP-abl_WT_R1881_IgG
## 15             group: LNCaP-abl_WT_R1881_AR vs LNCaP-abl_WT_R1881_IgG
## 16             group: LNCaP-abl_WT_R1881_AR vs LNCaP-abl_WT_R1881_IgG
## 17            group: LNCaP_TRIM33-5#MC-C2_FBS_AR vs LNCaP_NT-3_FBS_AR
## 18            group: LNCaP_TRIM33-5#MC-C2_FBS_AR vs LNCaP_NT-3_FBS_AR
## 19            group: LNCaP_TRIM33-5#MC-C2_FBS_AR vs LNCaP_NT-3_FBS_AR
## 20            group: LNCaP_TRIM33-5#MC-C2_FBS_AR vs LNCaP_NT-3_FBS_AR
## 21                  group: LNCaP_WT_R1881_AR vs LNCaP-abl_WT_R1881_AR
## 22                  group: LNCaP_WT_R1881_AR vs LNCaP-abl_WT_R1881_AR
## 23                  group: LNCaP_WT_R1881_AR vs LNCaP-abl_WT_R1881_AR
## 24                  group: LNCaP_WT_R1881_AR vs LNCaP-abl_WT_R1881_AR
##    group.factor                      group1                       group2
## 1         group           LNCaP_NT-3_FBS_AR           LNCaP_NT-3_FBS_IgG
## 2         group           LNCaP_NT-3_FBS_AR           LNCaP_NT-3_FBS_IgG
## 3         group           LNCaP_NT-3_FBS_AR           LNCaP_NT-3_FBS_IgG
## 4         group           LNCaP_NT-3_FBS_AR           LNCaP_NT-3_FBS_IgG
## 5         group LNCaP_TRIM33-5#MC-C2_FBS_AR LNCaP_TRIM33-5#MC-C2_FBS_IgG
## 6         group LNCaP_TRIM33-5#MC-C2_FBS_AR LNCaP_TRIM33-5#MC-C2_FBS_IgG
## 7         group LNCaP_TRIM33-5#MC-C2_FBS_AR LNCaP_TRIM33-5#MC-C2_FBS_IgG
## 8         group LNCaP_TRIM33-5#MC-C2_FBS_AR LNCaP_TRIM33-5#MC-C2_FBS_IgG
## 9         group           LNCaP_WT_R1881_AR           LNCaP_WT_R1881_IgG
## 10        group           LNCaP_WT_R1881_AR           LNCaP_WT_R1881_IgG
## 11        group           LNCaP_WT_R1881_AR           LNCaP_WT_R1881_IgG
## 12        group           LNCaP_WT_R1881_AR           LNCaP_WT_R1881_IgG
## 13        group       LNCaP-abl_WT_R1881_AR       LNCaP-abl_WT_R1881_IgG
## 14        group       LNCaP-abl_WT_R1881_AR       LNCaP-abl_WT_R1881_IgG
## 15        group       LNCaP-abl_WT_R1881_AR       LNCaP-abl_WT_R1881_IgG
## 16        group       LNCaP-abl_WT_R1881_AR       LNCaP-abl_WT_R1881_IgG
## 17        group LNCaP_TRIM33-5#MC-C2_FBS_AR            LNCaP_NT-3_FBS_AR
## 18        group LNCaP_TRIM33-5#MC-C2_FBS_AR            LNCaP_NT-3_FBS_AR
## 19        group LNCaP_TRIM33-5#MC-C2_FBS_AR            LNCaP_NT-3_FBS_AR
## 20        group LNCaP_TRIM33-5#MC-C2_FBS_AR            LNCaP_NT-3_FBS_AR
## 21        group           LNCaP_WT_R1881_AR        LNCaP-abl_WT_R1881_AR
## 22        group           LNCaP_WT_R1881_AR        LNCaP-abl_WT_R1881_AR
## 23        group           LNCaP_WT_R1881_AR        LNCaP-abl_WT_R1881_AR
## 24        group           LNCaP_WT_R1881_AR        LNCaP-abl_WT_R1881_AR
##    paired.test                  diff.status    n median.FoldChange
## 1         TRUE           LNCaP_NT-3_FBS_IgG  131      -1.905494859
## 2         TRUE            LNCaP_NT-3_FBS_AR  868       2.840001726
## 3         TRUE                 unresponsive  362       0.001355977
## 4         TRUE                         null 1792      -0.179844475
## 5         TRUE LNCaP_TRIM33-5#MC-C2_FBS_IgG  130      -1.827148750
## 6         TRUE  LNCaP_TRIM33-5#MC-C2_FBS_AR  922       2.936158396
## 7         TRUE                 unresponsive  370      -0.002209037
## 8         TRUE                         null 1731      -0.167189320
## 9         TRUE           LNCaP_WT_R1881_IgG  185      -1.923493012
## 10        TRUE            LNCaP_WT_R1881_AR 1011       2.267629438
## 11        TRUE                 unresponsive  302       0.003921392
## 12        TRUE                         null 1655       0.228013526
## 13        TRUE       LNCaP-abl_WT_R1881_IgG  133      -1.636583027
## 14        TRUE        LNCaP-abl_WT_R1881_AR  992       2.676888784
## 15        TRUE                 unresponsive  354      -0.012761041
## 16        TRUE                         null 1674      -0.143302902
## 17        TRUE            LNCaP_NT-3_FBS_AR   46      -2.862151277
## 18        TRUE  LNCaP_TRIM33-5#MC-C2_FBS_AR   30       2.655605197
## 19        TRUE                 unresponsive 1122       0.002161318
## 20        TRUE                         null 1955      -0.162504964
## 21        TRUE        LNCaP-abl_WT_R1881_AR   43      -2.226926883
## 22        TRUE            LNCaP_WT_R1881_AR 1056       3.027185641
## 23        TRUE                 unresponsive  774       0.011976388
## 24        TRUE                         null 1280       0.233719743

5.2.1 Filtering for nuclear proteins

The RIME is a technique that involves nuclear isolation, this implies that the presence of cytoplasmic proteins is most likely to be attributed to “cytoplasmic contamination”. For this reason we will select only nuclear proteins, and remove the ribosomal and mitochondrial ones in the dpo_analyses object.

nucleus <- AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db,
                                 keytype = "GOALL",
                                 keys = "GO:0005634", #nucleus
                                 columns = c("SYMBOL", "UNIPROT"))

ribo_mito <- AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db,
                                  keytype = "GOALL",
                                  keys = c("GO:0005840", "GO:0005739"), #ribosome, mitochondria
                                  columns = c("SYMBOL", "UNIPROT"))

## Certain proteins have the .X suffix, so we collect the full names
nucleus_full <- rownames(dpo@raw.counts)[gsub("[.].*", "", rownames(dpo@raw.counts)) %in% nucleus$SYMBOL]
ribo_mito_full <- rownames(dpo@raw.counts)[gsub("[.].*", "", rownames(dpo@raw.counts)) %in% ribo_mito$SYMBOL]


## Filter dpo_analyses project
dpo_analyses_nuclear <- filter.proteins(DEprot.object = dpo_analyses,
                                        proteins = nucleus_full,
                                        mode = "keep")

dpo_analyses_nuclear <- filter.proteins(DEprot.object = dpo_analyses_nuclear,
                                        proteins = ribo_mito_full,
                                        mode = "remove")

Now the analyses can be exported using the automated function:

export.analyses(DEprot.analyses.object = dpo_analyses_nuclear, output.folder = "./RIME/export")

5.3 Visualization

For the comparison of AR-RIMEs vs the respective IgG (contrasts 1-4), we first expect that the AR-RIME displays more proteins than the IgG negative control, and secondly that AR is among the top-enriched proteins in the AR-RIME itself.
To verify this we will generate volcano plots of the log₂(Fold Change) AR-vs-IgG for each group and highlight AR.

volcano_IgG <-
  purrr::pmap(.l = list(contrast = 1:4,
                        up.color = rev(RColorBrewer::brewer.pal(n = 8, name = "Paired"))[c(1,3,5,7)]),
              .f = function(contrast, up.color) {
                plot.volcano(DEprot.analyses.object = dpo_analyses_nuclear,
                             contrast = contrast,
                             up.color = up.color,
                             down.color = "black",
                             dot.labels = "AR",
                             label.font.size = 5)
              })

patchwork::wrap_plots(volcano_IgG)

On the other hand, for the TRIM33-KO we expect that the interaction of TRIM33 with AR is lost when TRIM33 is knocked-out.

plot.volcano(DEprot.analyses.object = dpo_analyses_nuclear,
             contrast = 5,
             up.color = "#FF7F00",
             down.color = "#E31A1C",
             dot.labels = c("AR", "TRIM33"),
             label.font.size = 5)

As internal control, being the two conditions biologically similar, we do expect comparable results for the AR-RIME in LNCaP_WT_R1881_AR and LNCaP_NT-3_FBS_AR. For this we will compare the protein fold change for AR-vs-IgG for the two conditions and the overlap of differential proteins.

contrast.scatter(DEprot.analyses.object = dpo_analyses_nuclear,
                 contrast.x = 1,
                 contrast.y = 3,
                 correlation.method = "pearson")

plot.upset(DEprot.analyses.object = dpo_analyses_nuclear,
           contrast.subset = c(1,3),
           title = "Overalp AR-interactome",
           subtitle = "LNCaP_WT_R1881 vs LNCaP_NT-3_FBS")

5.4 Over-Representation Analysis (ORA)

We are going to perform ORA analyses on the differential proteins between LNCaP-WT and the LNCaP-abl in order to understand how long-term hormone deprivation impacts the AR-interactome composition.

For these analyses a geneSet table is required as a reference. The latter must be a data.frame with two columns: “gs_name” (geneSet name) and “gene_symbol” (corresponding to the protein id).
In DEprot we provide a ready-to-use list of protein complexes obtained by the CORUM database.

Notice that the “gene_symbol” values must match the “prot.id” in the result tables. To this goal, the geneset.enrichment function contains the parameter gsub.pattern.prot.id that will be used to remove a pattern from the “prot.id”s (e.g., the duplicates suffix).

## Make CORUM geneSet
data("corum_v5.0", package = "DEprot")

corum_geneSet =
  corum_v5.0 %>%
  dplyr::filter(organism == "Human") %>%
  dplyr::rename(gs_name = complex.name,
                gene_symbol = protein.members) %>%
  dplyr::select(gs_name, gene_symbol)

head(corum_geneSet)

## Run ORA
ORA.results =
  geneset.enrichment(DEprot.analyses.object = dpo_analyses_nuclear,
                     contrast = 6,
                     TERM2GENE = corum_geneSet,
                     enrichment.type = "ORA",
                     gsub.pattern.prot.id = "[.].*",
                     pvalueCutoff = 0.05,
                     qvalueCutoff = 0.05,
                     pAdjustMethod = "BH",
                     diff.status.category = "LNCaP_WT_R1881_AR",
                     dotplot.n = 10)

The ORA object contains several plots that can help the interpretation of the results. These include networks of proteins and complexes, and dot plots of GeneRatio and FoldEnrichment:

ORA.results@protein.network

ORA.results@pathway.network$plot

patchwork::wrap_plots(ORA.results@dotplot_gene.ratio, ORA.results@dotplot_fold.enrichment)

6 Session info

## $`R version:`
## [1] R version 4.4.3 (2025-02-28)
## 
## $`Base packages:`
## [1] base      datasets  graphics  grDevices methods   stats     stats4   
## [8] utils    
## 
## $`Other attached packages:`
##  [1] AnnotationDbi_1.66.0 Biobase_2.64.0       BiocGenerics_0.50.0 
##  [4] data.table_1.17.8    DEprot_0.1.0         dplyr_1.1.4         
##  [7] ggplot2_4.0.0        ggpubr_0.6.2         IRanges_2.38.1      
## [10] patchwork_1.3.2      purrr_1.1.0          S4Vectors_0.42.1    
## 
## $`Loaded via a namespace (and not attached):`
##   [1] abind_1.4-8             apcluster_1.4.14        ape_5.8-1              
##   [4] aPEAR_1.0.0             aplot_0.2.9             arules_1.7-9           
##   [7] backports_1.5.0         bayesbio_1.0.0          BiocManager_1.30.26    
##  [10] BiocParallel_1.38.0     Biostrings_2.72.1       bit_4.6.0              
##  [13] bit64_4.6.0-1           bitops_1.0-9            blob_1.2.4             
##  [16] boot_1.3-31             broom_1.0.10            bslib_0.9.0            
##  [19] cachem_1.1.0            car_3.1-3               carData_3.0-5          
##  [22] class_7.3-23            cli_3.6.5               clusterProfiler_4.12.6 
##  [25] codetools_0.2-20        colorspace_2.1-2        commonmark_2.0.0       
##  [28] compiler_4.4.3          ComplexUpset_1.3.3      conflicted_1.2.0       
##  [31] cowplot_1.2.0           crayon_1.5.3            curl_7.0.0             
##  [34] DBI_1.2.3               DEoptimR_1.1-4          dichromat_2.0-0.1      
##  [37] digest_0.6.37           doParallel_1.0.17       doRNG_1.8.6.2          
##  [40] DOSE_3.30.5             e1071_1.7-16            enrichplot_1.24.4      
##  [43] evaluate_1.0.5          expm_1.0-0              farver_2.1.2           
##  [46] fastmap_1.2.0           fastmatch_1.1-6         fgsea_1.30.0           
##  [49] fmsb_0.7.6              fontBitstreamVera_0.1.1 fontLiberation_0.1.0   
##  [52] fontquiver_0.2.1        forcats_1.0.1           foreach_1.5.2          
##  [55] Formula_1.2-5           fs_1.6.6                gdtools_0.4.4          
##  [58] generics_0.1.4          GenomeInfoDb_1.40.1     GenomeInfoDbData_1.2.12
##  [61] ggdendro_0.2.0          ggforce_0.5.0           ggfun_0.2.0            
##  [64] ggiraph_0.9.2           ggplotify_0.1.3         ggraph_2.2.2           
##  [67] ggrepel_0.9.6           ggridges_0.5.7          ggsignif_0.6.4         
##  [70] ggtext_0.1.2            ggtree_3.12.0           glmnet_4.1-10          
##  [73] glue_1.8.0              GO.db_3.19.1            GOSemSim_2.30.2        
##  [76] graphlayouts_1.2.2      grid_4.4.3              gridExtra_2.3          
##  [79] gridGraphics_0.5-1      gridtext_0.1.5          gson_0.1.0             
##  [82] gtable_0.3.6            HarmonizR_1.2.0         hms_1.1.3              
##  [85] htmltools_0.5.8.1       htmlwidgets_1.6.4       httr_1.4.7             
##  [88] httr2_1.2.1             igraph_2.1.4            iterators_1.0.14       
##  [91] itertools_0.1-3         jquerylib_0.1.4         jsonlite_2.0.0         
##  [94] KEGGREST_1.44.1         knitr_1.50              labeling_0.4.3         
##  [97] laeken_0.5.3            lattice_0.22-6          lazyeval_0.2.2         
## [100] legendry_0.2.4          lifecycle_1.0.4         limma_3.60.6           
## [103] litedown_0.7            lmtest_0.9-40           lsa_0.73.3             
## [106] lubridate_1.9.4         magrittr_2.0.4          markdown_2.0           
## [109] MASS_7.3-64             Matrix_1.7-2            MBQN_2.16.0            
## [112] MCL_1.0                 memoise_2.0.1           mgcv_1.9-1             
## [115] missForest_1.5          nlme_3.1-167            nnet_7.3-20            
## [118] nortest_1.0-4           org.Hs.eg.db_3.19.1     parallel_4.4.3         
## [121] pcaMethods_1.96.0       pheatmap_1.0.13         pillar_1.11.1          
## [124] pkgconfig_2.0.3         plotly_4.11.0           plyr_1.8.9             
## [127] png_0.1-8               polyclip_1.10-7         prettyunits_1.2.0      
## [130] progress_1.2.3          prolfqua_1.3.9          proxy_0.4-27           
## [133] qvalue_2.36.0           R.methodsS3_1.8.2       R.oo_1.27.1            
## [136] R.utils_2.13.0          R6_2.6.1                randomForest_4.7-1.2   
## [139] ranger_0.17.0           rappdirs_0.3.3          RColorBrewer_1.1-3     
## [142] Rcpp_1.1.0              RCurl_1.98-1.17         reshape2_1.4.4         
## [145] rlang_1.1.6             rmarkdown_2.30          rngtools_1.5.2         
## [148] robustbase_0.99-6       RSQLite_2.4.3           rstatix_0.7.3          
## [151] rstudioapi_0.17.1       rvcheck_0.2.1           S7_0.2.0               
## [154] sass_0.4.10             scales_1.4.0            scatterpie_0.2.6       
## [157] shadowtext_0.1.6        shape_1.4.6.1           SnowballC_0.7.1        
## [160] sp_2.2-0                splines_4.4.3           statmod_1.5.1          
## [163] stringi_1.8.7           stringr_1.5.2           survival_3.8-3         
## [166] systemfonts_1.3.1       tibble_3.3.0            tidygraph_1.3.1        
## [169] tidyr_1.3.1             tidyselect_1.2.1        tidytree_0.4.6         
## [172] timechange_0.3.0        tools_4.4.3             treeio_1.28.0          
## [175] tweenr_2.0.3            UCSC.utils_1.0.0        vcd_1.4-13             
## [178] vctrs_0.6.5             VIM_6.2.6               viridis_0.6.5          
## [181] viridisLite_0.4.2       withr_3.0.2             xfun_0.53              
## [184] xml2_1.4.1              XVector_0.44.0          yaml_2.3.10            
## [187] yulab.utils_0.2.1       zlibbioc_1.50.0         zoo_1.8-14

`DEprot` analyses examples - RIME

Sebastian Gregoricchio

30 October, 2025

1 Data description

1.1 Data import

1.2 Load `DEprot` object (dpo)

2 Batch correction/normalization

3 Imputation

3.1 Imputation non-random missing values

3.1.1 Comparison unimputed vs random-imputed

3.1.1.1 AR vs IgG RIME

3.1.1.2 AR-RIME NT vs TRIM33-KO

3.2 Imputation of random missing values

3.2.1 Choice of the imputation method

3.2.1.1 Run `RegImpute` imputation

3.2.2 Data distribution after imputation

3.2.3 Examples of post-imputation values

3.2.3.1 AR vs IgG RIME, RegImpute

3.2.3.2 AR-RIME NT vs TRIM33-KO, RegImpute

4 Quality controls

4.1 Correlations

4.2 Principal Component Analyses

5 Differential expression analyses

5.1 Test normality

5.2 Differential analyses: `limma`

5.2.1 Filtering for nuclear proteins

5.3 Visualization

5.4 Over-Representation Analysis (ORA)

6 Session info

DEprot analyses examples - RIME

Sebastian Gregoricchio

30 October, 2025

1 Data description

1.1 Data import

1.2 Load DEprot object (dpo)

2 Batch correction/normalization

3 Imputation

3.1 Imputation non-random missing values

3.1.1 Comparison unimputed vs random-imputed

3.1.1.1 AR vs IgG RIME

3.1.1.2 AR-RIME NT vs TRIM33-KO

3.2 Imputation of random missing values

3.2.1 Choice of the imputation method

3.2.1.1 Run RegImpute imputation

3.2.2 Data distribution after imputation

3.2.3 Examples of post-imputation values

3.2.3.1 AR vs IgG RIME, RegImpute

3.2.3.2 AR-RIME NT vs TRIM33-KO, RegImpute

4 Quality controls

4.1 Correlations

4.2 Principal Component Analyses

5 Differential expression analyses

5.1 Test normality

5.2 Differential analyses: limma

5.2.1 Filtering for nuclear proteins

5.3 Visualization

5.4 Over-Representation Analysis (ORA)

6 Session info

`DEprot` analyses examples - RIME

1.2 Load `DEprot` object (dpo)

3.2.1.1 Run `RegImpute` imputation

5.2 Differential analyses: `limma`