Introduction

For the prediction of risk scores for the Prostate Cancer DREAM Challenge we used Cox’s proportional hazards model in combination with variable selection and/or penalization. The data analysis and model building were carried out in R. This document gives a brief description of our work, and the computations required for fitting the final model are embedded as R code in the Results section.

It quickly became clear from the initial data analysis that two of the main challenges were variable selection and the handling of missing values. The longitudinal data provided in addition to the core table were also explored, but it was not obvious how to use the information in these tables beyond some investigations of data quality and consistency.

The model we ended up with was fairly standard and did not differ much from the model presented by Halabi et al., except that it was a generalized additive model (a gam), which allows for nonlinear relations between the continuous predictors and the log-hazard. The final model used only nine predictor variables.

We experimented with one way of extracting additional predictive information from the many variables not included in the gam. This is described further in the Methods and Results sections below, but it remained uncertain whether it produced any actual increase in performance in terms of iAUC.

Methods

The following presentation of the methods refers only to the use of the core table; we did not use the longitudinal data tables for the actual modeling. Moreover, this section does not describe the training-test splits of the training data that we used to assess how well our methods could be expected to work. It only documents how the final models were fitted to the full training data and how the predictions were computed on the validation data.

Initial data cleaning

The variables HGTBLCAT, WGTBLCAT, HEAD_AND_NECK, PANCREAS, THYROID, CREACLCA, CREACL, GLEAS_DX and STOMACH were removed from the core table as these variables have no or limited variation in the training data.

For the training data a single PSA value (Patient ID VEN-756003601) was changed from 0 to 0.01, and a single ECOG value (Patient ID CELG-00121) was changed from 3 to 2.

For both the training and the validation data sets the ECOG variable was converted to a factor.

Missing values and imputation

Some of the variables, lab values in particular, have many missing values. We investigated several imputation methods, but our conclusion was that the more complicated imputation methods resulted in worse predictive performance. Thus we ended up with a very simple imputation method, which in principle relies on a missing completely at random assumption.

For each variable we imputed missing values by sampling with replacement from the observed values of that variable in the training data.
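
A minimal sketch of this scheme, with hypothetical names x for a variable containing missing values and donor for the observed training values, could look as follows (the actual computation on the challenge data appears in the Results section):

## Minimal sketch of the marginal sampling imputation; 'x' and 'donor'
## are hypothetical names, not part of the actual analysis code.
imputeMarginal <- function(x, donor) {
  nas <- is.na(x)
  x[nas] <- sample(donor[!is.na(donor)], sum(nas), replace = TRUE)
  x
}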

Variable selection

We used lasso penalization with Cox’s partial log-likelihood as implemented in the R package glmnet in combination with a form of stability selection for the variable selection.

This was carried out by fitting the lasso path to 100 subsamples of the training data, each of half the size of the full data set. For each subsample the penalty parameter was selected by cross-validation (as implemented in the cv.glmnet function in the glmnet package), and the variables with non-zero coefficients were recorded as selected. The proportion of times each variable was selected was then computed across the 100 subsamples, and the most frequently selected variables were carried forward for further modeling.
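
In schematic form the procedure can be summarized as follows, where X and Y are placeholders for the predictor matrix and the survival response (the computation on the actual challenge data is given in the Results section):

## Schematic version of the stability selection step; X is a numeric
## predictor matrix and Y a Surv response, as constructed in the
## Results section.
stabilitySelect <- function(X, Y, B = 100) {
  freq <- numeric(ncol(X))
  for (b in seq_len(B)) {
    ii <- sample(nrow(X), nrow(X) / 2)
    fit <- cv.glmnet(X[ii, ], Y[ii, ], family = "cox")
    freq <- freq + (as.numeric(coef(fit, s = "lambda.min")) != 0)
  }
  freq / B
}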

Generalized additive model

Based on the variables selected by the lasso procedure and some additional considerations about the final model, a generalized additive proportional hazards model was fitted using the gam function in the mgcv package with the cox.ph family.

The fitting of a gam uses a basis expansion of all continuous predictors in combination with penalized likelihood estimation. The mgcv package supports an automatic selection of penalty parameters, which was used.

Improvements based on ridge regression

To include potential predictive information from the variables not selected, we fitted a model with a ridge regression penalty, as implemented in the R package glmnet, including all predictors considered in the variable selection step. The final risk prediction was obtained as a weighted linear combination of the risk predictions from the gam and the ridge model, with weights 1 and 0.25, respectively (see the Results section).

Prediction of time-to-event

For the prediction of the actual survival time (time-to-event) we used the median survival time as predicted from a proportional hazards model. We did not attempt to optimize this prediction and used the simplest method we could come up with. The model used the same variables as the gam discussed above but was fitted without basis expansion and penalization, using the coxph function from the survival package; the median survival time predictions were then computed using the survfit function.

Results

The predictions were obtained by running the code below. The raw code is found in the file Write-up.R.

library(ggplot2)
library(survival)
library(glmnet)
library(mgcv)

Data cleaning

The data in the two core tables were originally read from the csv files and are loaded below from an RData file, which must be available at the indicated path relative to the working directory of the R session.

In one of the data cleaning steps below, all values that were equal to the empty string were converted to "No".

load("../Data/Prostate_all.RData")
training <- CoreTable_training
validation <- CoreTable_validation

## Remove the variables with no or only limited variation in the training data
discard <- c("HGTBLCAT", "WGTBLCAT", "HEAD_AND_NECK", 
             "PANCREAS", "THYROID", "CREACLCA", "CREACL", 
             "GLEAS_DX", "STOMACH")
training <- subset(training, select = -which(colnames(training) %in% discard))
validation <- subset(validation, select = -which(colnames(validation) %in% discard))

## Fix the single zero PSA value and recode ECOG as a factor with levels
## 0, 1 and 2 (the single ECOG value of 3 is merged into level 2)
training <- transform(training, 
                      PSA = ifelse(PSA == 0, 0.01, PSA),
                      ECOG_C = factor((ECOG_C >= 1) + (ECOG_C >= 2)))
validation <- transform(validation, 
                        ECOG_C = factor((ECOG_C >= 1) + (ECOG_C >= 2)))

## Convert empty strings in factor variables to "No"
for (i in seq_len(ncol(training))) {
  if (is.factor(training[, i])) {
    tmp <- as.character(training[, i])
    tmp[tmp == ""] <- "No"
    training[, i] <- factor(tmp)
  }
}

## The same conversion for the validation data, in addition aligning
## factor levels with those of the training data
for (i in seq_len(ncol(validation))) {
  if (is.factor(validation[, i])) {
    tmp <- as.character(validation[, i])
    tmp[tmp == ""] <- "No"
    validation[, i] <- factor(tmp)
    name <- colnames(validation)[i]
    if (name %in% colnames(training) && 
          !identical(levels(validation[, i]), levels(training[, name])))
      validation[, i] <- factor(tmp, levels(training[, name]))
  }
}

Imputation

The implemented imputation scheme consisted of sampling, for each variable, from its marginal empirical distribution in the training data.

set.seed(1234)
## For each variable, impute missing values by sampling with replacement
## from the observed training values; the validation data are imputed
## using the same training donors
for (i in seq_len(ncol(training))) {
  x0 <- training[, i]
  nas <- is.na(x0)
  if (!all(nas)) {
    if (any(nas)) 
      training[nas, i] <- sample(x0[!nas], sum(nas), replace = TRUE)  
    name <- names(training)[i]
    if(name %in% names(validation)) {
      nasLeader <- is.na(validation[, name])
      if (any(nasLeader))
        validation[nasLeader, name] <- sample(x0[!nas], sum(nasLeader), replace = TRUE)
    }
  }
}

Variable selection

## Columns 4 and 5 of the core table hold the survival time (LKADT_P)
## and status (DEATH); 'variables' indexes the candidate predictors
variables <- c(23, 27:ncol(training))
train <- training[, c(4, 5, variables)]
XX0 <- XX <- model.matrix(~ . - 1, train[, -c(1, 2)])
YY <- Surv(train$LKADT_P, train$DEATH == "YES")

n <- nrow(XX)
p <- ncol(XX)
B <- 100
select <- matrix(0, p, B)
rownames(select) <- colnames(XX)

## Fit the lasso path to 100 random half-samples of the training data and
## record which variables have non-zero coefficients at the
## cross-validated penalty
for (b in seq_len(B)) {
  ii <- sample(n, n / 2)
  survNet <- cv.glmnet(XX[ii, ], YY[ii, ], family = "cox")
  betahat <- coef(survNet, s = "lambda.min") 
  select[, b] <- as.numeric(betahat != 0)
}

selectFreq <- rowSums(select) / B
selectFreq <- sort(selectFreq, decreasing = TRUE)
varSort <- factor(names(selectFreq), levels = names(selectFreq))  
qplot(varSort[1:20], selectFreq[1:20]) + 
  theme(axis.text.x = element_text(angle = -90)) + 
  scale_y_continuous("Selection proportion") +
  scale_x_discrete("Variable")

The variable selection procedure shows that the eight variables ALP, AST, HB, ECOG, LIVER, ADRENAL, ALB and ANALGESICS are the most stably selected variables, all selected in more than 60% of the models.

In addition to these eight variables we included PSA and LYMPH_NODES, which were suspected to be predictive based on the study by Halabi et al., and found to be so in our studies as well.

Gam

## In mgcv's cox.ph family the event indicator is supplied through the
## weights argument
form <- LKADT_P ~ s(log(ALP)) + s(HB) + s(log(AST)) + s(log(PSA)) + s(ALB) +
  ECOG_C + LIVER + ADRENAL + LYMPH_NODES + ANALGESICS
survGam <- gam(form, data = training, family = cox.ph(), weight = DEATH == "YES")
summary(survGam)
## 
## Family: Cox PH 
## Link function: identity 
## 
## Formula:
## LKADT_P ~ s(log(ALP)) + s(HB) + s(log(AST)) + s(log(PSA)) + s(ALB) + 
##     ECOG_C + LIVER + ADRENAL + LYMPH_NODES + ANALGESICS
## 
## Parametric coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## ECOG_C1        0.19669    0.08475   2.321  0.02030 * 
## ECOG_C2        0.55074    0.17385   3.168  0.00154 **
## LIVERY         0.39505    0.12764   3.095  0.00197 **
## ADRENALY       0.73817    0.22477   3.284  0.00102 **
## LYMPH_NODESY   0.21123    0.07937   2.661  0.00778 **
## ANALGESICSYES  0.07936    0.08694   0.913  0.36131   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df Chi.sq  p-value    
## s(log(ALP)) 2.535  3.213  43.38 4.29e-09 ***
## s(HB)       1.002  1.004  19.05 1.30e-05 ***
## s(log(AST)) 2.005  2.605  15.83  0.00085 ***
## s(log(PSA)) 3.641  4.586  11.85  0.02986 *  
## s(ALB)      1.003  1.005   5.73  0.01688 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Deviance explained = 10.2%
## -REML = 4144.1  Scale est. = 1         n = 1600

The effect of ANALGESICS was not significant, so it was removed from the model.

survGam <- update(survGam, . ~ . - ANALGESICS)

Risk predictions

As mentioned in Methods, the final risk predictions were given as a linear combination of the risk predictions from the gam and the risk predictions from the ridge regression model. The ridge model was fitted using the cv.glmnet function with alpha = 0. The penalty parameter was selected by cross-validation choosing the model with the minimal cross-validated negative partial log-likelihood.

survNet <- cv.glmnet(XX, YY, family = "cox", alpha = 0)
XXvalidation <- model.matrix(~ . - 1, validation[, variables])
 
riskhatGam <- predict(survGam, newdata = validation)
riskhatNet <- predict(survNet, newx = XXvalidation, s = "lambda.min")[, 1]
qplot(riskhatGam, riskhatNet) + geom_smooth()

The figure shows the scatter plot of the risk predictions on the validation data from the gam and the ridge regression model. The two predictions are clearly strongly positively correlated.

It should be mentioned that the predictions from the ridge model were generally worse than the predictions from the gam (results not shown), so the ridge predictions were given a weight of 0.25 in the final risk prediction (the gam predictions were given weight 1). The choice of weight was not optimized in a systematic way.

The following table shows the predictions made from the model, and it is identical to the risk predictions found in the file nrhFinal.csv.

riskhat <- riskhatGam + riskhatNet / 4
riskhatValidation <- 
  cbind(validation[, "RPT", drop = FALSE], data.frame(riskScoreGlobal = riskhat))
riskhatValidation
##          RPT riskScoreGlobal
## 1   AZ-00001    -0.227741878
## 2   AZ-00003    -1.103190111
## 3   AZ-00005    -1.198691034
## 4   AZ-00006     0.339399822
## 5   AZ-00007    -0.654883117
## 6   AZ-00008    -0.540225551
## 7   AZ-00010    -0.848069874
## 8   AZ-00011    -0.603070266
## 9   AZ-00012     0.653206756
## 10  AZ-00013    -0.771761179
## 11  AZ-00014    -1.228289107
## 12  AZ-00016    -0.618315877
## 13  AZ-00019     1.527487898
## 14  AZ-00020     0.943411410
## 15  AZ-00024    -0.827449711
## 16  AZ-00025    -0.088499276
## 17  AZ-00027    -0.812861251
## 18  AZ-00029     0.852866559
## 19  AZ-00030     0.609698900
## 20  AZ-00031    -0.202778004
## 21  AZ-00033    -0.077520956
## 22  AZ-00034    -0.314154140
## 23  AZ-00035    -0.747674699
## 24  AZ-00036     0.986451545
## 25  AZ-00037    -0.776053353
## 26  AZ-00038    -0.268856655
## 27  AZ-00040     0.762404322
## 28  AZ-00043     0.959094962
## 29  AZ-00044    -0.414249961
## 30  AZ-00048    -1.176070867
## 31  AZ-00049    -0.045483286
## 32  AZ-00050    -0.428833845
## 33  AZ-00054     0.076162571
## 34  AZ-00055    -0.129954797
## 35  AZ-00057    -0.761031251
## 36  AZ-00059     0.674328628
## 37  AZ-00060    -0.168608525
## 38  AZ-00061    -0.149586988
## 39  AZ-00062    -0.399143773
## 40  AZ-00063     0.399417252
## 41  AZ-00064    -0.582855492
## 42  AZ-00066    -0.753868709
## 43  AZ-00070     0.886156500
## 44  AZ-00071    -0.535554543
## 45  AZ-00072    -0.784659399
## 46  AZ-00073     0.416707657
## 47  AZ-00074    -0.609439250
## 48  AZ-00076    -1.119211602
## 49  AZ-00077    -0.839306303
## 50  AZ-00079     1.017072904
## 51  AZ-00080    -1.156928404
## 52  AZ-00081     1.059658327
## 53  AZ-00082     0.570620777
## 54  AZ-00085    -0.334923927
## 55  AZ-00086    -0.165750890
## 56  AZ-00087    -0.847282758
## 57  AZ-00088    -1.381346985
## 58  AZ-00089     0.025413845
## 59  AZ-00090    -0.318571916
## 60  AZ-00092    -1.265760484
## 61  AZ-00093     1.582272723
## 62  AZ-00094     0.536870323
## 63  AZ-00095     1.246872247
## 64  AZ-00096    -0.156402862
## 65  AZ-00101     0.370454168
## 66  AZ-00102     1.216063177
## 67  AZ-00103     0.258934587
## 68  AZ-00106    -0.694452026
## 69  AZ-00107    -1.589188080
## 70  AZ-00108    -0.600871424
## 71  AZ-00110     1.103729657
## 72  AZ-00111    -1.645813663
## 73  AZ-00113     0.617851628
## 74  AZ-00114    -1.024011670
## 75  AZ-00115    -0.976739032
## 76  AZ-00116    -0.983165744
## 77  AZ-00119    -1.218238985
## 78  AZ-00124    -0.620257818
## 79  AZ-00126     0.096439978
## 80  AZ-00127    -0.638576515
## 81  AZ-00128    -0.898806304
## 82  AZ-00129    -0.250992331
## 83  AZ-00131    -0.543888468
## 84  AZ-00135    -0.131801749
## 85  AZ-00137     1.503935192
## 86  AZ-00139     0.377326535
## 87  AZ-00140    -1.632302729
## 88  AZ-00141    -0.895207701
## 89  AZ-00143    -1.224819260
## 90  AZ-00144    -1.436653684
## 91  AZ-00145     0.222522046
## 92  AZ-00146    -1.395299366
## 93  AZ-00149    -0.502327906
## 94  AZ-00150    -0.868747405
## 95  AZ-00152     0.896252474
## 96  AZ-00153    -1.336158454
## 97  AZ-00154     0.298315780
## 98  AZ-00156     0.476482773
## 99  AZ-00158    -1.498150624
## 100 AZ-00160    -0.190270725
## 101 AZ-00161    -0.647622548
## 102 AZ-00163    -0.826800525
## 103 AZ-00164     0.024076814
## 104 AZ-00165    -0.087438092
## 105 AZ-00166     0.540547308
## 106 AZ-00167    -0.228885630
## 107 AZ-00168    -1.329570489
## 108 AZ-00169    -1.219519454
## 109 AZ-00173     0.027971461
## 110 AZ-00174    -0.158560135
## 111 AZ-00176     0.362513956
## 112 AZ-00177    -0.639935835
## 113 AZ-00178    -0.944236936
## 114 AZ-00179    -0.864908697
## 115 AZ-00181     0.144939585
## 116 AZ-00182    -1.306400094
## 117 AZ-00186    -1.094780825
## 118 AZ-00188    -1.281535150
## 119 AZ-00190    -0.188278755
## 120 AZ-00191    -1.177342156
## 121 AZ-00192    -0.359828966
## 122 AZ-00193     2.599372248
## 123 AZ-00194    -1.416672507
## 124 AZ-00197    -1.275344362
## 125 AZ-00198     0.561784838
## 126 AZ-00201    -0.301727551
## 127 AZ-00202     1.489453852
## 128 AZ-00203     0.446769894
## 129 AZ-00205    -1.065432364
## 130 AZ-00206    -1.014719884
## 131 AZ-00207    -0.618702927
## 132 AZ-00208    -0.573448119
## 133 AZ-00209    -0.006235401
## 134 AZ-00210    -0.933119722
## 135 AZ-00211    -0.153863846
## 136 AZ-00212    -0.501587207
## 137 AZ-00213    -0.028753638
## 138 AZ-00215     0.325360143
## 139 AZ-00217    -1.076018077
## 140 AZ-00218     0.456805296
## 141 AZ-00219    -2.012243396
## 142 AZ-00220     0.171416906
## 143 AZ-00221    -0.443847945
## 144 AZ-00223    -0.290542228
## 145 AZ-00224    -0.075261169
## 146 AZ-00225     0.044352746
## 147 AZ-00226    -0.636497385
## 148 AZ-00228     1.043850899
## 149 AZ-00229    -1.452929437
## 150 AZ-00230     2.176660077
## 151 AZ-00231    -1.113152152
## 152 AZ-00233    -1.026110734
## 153 AZ-00234    -0.386976431
## 154 AZ-00235     0.844471972
## 155 AZ-00236    -0.546163599
## 156 AZ-00237    -0.430129810
## 157 AZ-00238    -1.466740612
## 158 AZ-00240     0.709374300
## 159 AZ-00241    -1.135233449
## 160 AZ-00242    -0.496685188
## 161 AZ-00246     0.188687734
## 162 AZ-00249    -0.292092300
## 163 AZ-00250    -0.032683900
## 164 AZ-00251    -0.701847680
## 165 AZ-00252     0.240167784
## 166 AZ-00256     1.098760170
## 167 AZ-00257    -1.311755031
## 168 AZ-00260    -1.444792251
## 169 AZ-00261    -0.645018373
## 170 AZ-00262    -1.052420920
## 171 AZ-00265    -0.741342409
## 172 AZ-00266     0.978380598
## 173 AZ-00267    -0.689786323
## 174 AZ-00268    -1.166844402
## 175 AZ-00269    -0.969290545
## 176 AZ-00270     0.888661754
## 177 AZ-00272    -1.146435624
## 178 AZ-00274     0.581785646
## 179 AZ-00275    -0.143589052
## 180 AZ-00278    -1.039484283
## 181 AZ-00279    -0.021280286
## 182 AZ-00280     0.134776036
## 183 AZ-00282    -1.104620962
## 184 AZ-00283     1.061241940
## 185 AZ-00286     1.999464480
## 186 AZ-00287     0.062523855
## 187 AZ-00288    -0.993110530
## 188 AZ-00289    -0.737991788
## 189 AZ-00290    -0.881654365
## 190 AZ-00293    -1.100702450
## 191 AZ-00294     0.465987097
## 192 AZ-00296    -1.531913898
## 193 AZ-00297     0.390280295
## 194 AZ-00299    -0.492967014
## 195 AZ-00300     0.044611190
## 196 AZ-00301     0.630914535
## 197 AZ-00302     0.224174610
## 198 AZ-00303    -0.082199317
## 199 AZ-00305     1.664021045
## 200 AZ-00306     0.762874873
## 201 AZ-00311    -1.364303171
## 202 AZ-00312     0.236648516
## 203 AZ-00313     0.789843987
## 204 AZ-00314    -0.942994078
## 205 AZ-00316    -0.124564576
## 206 AZ-00317    -1.755595332
## 207 AZ-00318     0.435633794
## 208 AZ-00319     0.696344372
## 209 AZ-00320     1.158072574
## 210 AZ-00321     0.943169269
## 211 AZ-00322     0.620707211
## 212 AZ-00323    -0.451062117
## 213 AZ-00324    -0.891937300
## 214 AZ-00326    -0.262775316
## 215 AZ-00327    -0.108457545
## 216 AZ-00328    -0.711103120
## 217 AZ-00329    -0.126761458
## 218 AZ-00330     0.577753662
## 219 AZ-00332    -1.056013252
## 220 AZ-00337    -0.250217547
## 221 AZ-00339     0.563249089
## 222 AZ-00341    -0.127665248
## 223 AZ-00343    -0.291072854
## 224 AZ-00345    -1.379551441
## 225 AZ-00346     1.211359266
## 226 AZ-00347     1.674234603
## 227 AZ-00349     0.633147573
## 228 AZ-00352    -0.945289410
## 229 AZ-00354     0.009453291
## 230 AZ-00355     0.247078681
## 231 AZ-00356     0.160044940
## 232 AZ-00357    -0.361661135
## 233 AZ-00360    -0.284030093
## 234 AZ-00361     0.029903646
## 235 AZ-00362    -0.154294066
## 236 AZ-00364     0.283322349
## 237 AZ-00365     0.024031724
## 238 AZ-00366    -0.652925875
## 239 AZ-00367     0.945961885
## 240 AZ-00368     1.033960902
## 241 AZ-00369    -0.401722659
## 242 AZ-00370     0.016200885
## 243 AZ-00371     1.270238562
## 244 AZ-00373    -0.606323980
## 245 AZ-00374    -0.533813067
## 246 AZ-00376    -0.816915796
## 247 AZ-00377     0.670465443
## 248 AZ-00378     0.481537562
## 249 AZ-00379     1.016196409
## 250 AZ-00380    -0.730757869
## 251 AZ-00383    -1.192099016
## 252 AZ-00384    -0.474004280
## 253 AZ-00385    -1.142212626
## 254 AZ-00386    -0.598290153
## 255 AZ-00387    -1.001168382
## 256 AZ-00389     0.332010899
## 257 AZ-00390    -0.789321323
## 258 AZ-00391    -0.276589748
## 259 AZ-00392    -1.270339509
## 260 AZ-00393     1.844049455
## 261 AZ-00394    -0.509490187
## 262 AZ-00395    -0.412547723
## 263 AZ-00399    -0.083195611
## 264 AZ-00400    -0.436847520
## 265 AZ-00401    -1.475197870
## 266 AZ-00402    -0.873061665
## 267 AZ-00403     0.662446419
## 268 AZ-00404    -0.893930094
## 269 AZ-00405     1.714686648
## 270 AZ-00406    -0.309884937
## 271 AZ-00407    -0.637613975
## 272 AZ-00408     1.298395831
## 273 AZ-00410     0.452765933
## 274 AZ-00412    -0.244992228
## 275 AZ-00413    -0.844013863
## 276 AZ-00415    -1.184148960
## 277 AZ-00417    -0.880781886
## 278 AZ-00418     1.006033118
## 279 AZ-00421    -1.288822177
## 280 AZ-00422    -0.813086890
## 281 AZ-00423    -0.601081434
## 282 AZ-00424    -1.228882949
## 283 AZ-00427     0.496236643
## 284 AZ-00430    -1.081747341
## 285 AZ-00431    -0.564114691
## 286 AZ-00432     1.045248752
## 287 AZ-00433     0.081897217
## 288 AZ-00434    -0.653708261
## 289 AZ-00435    -0.274573376
## 290 AZ-00436    -0.543974182
## 291 AZ-00438     0.676335519
## 292 AZ-00439    -1.685919155
## 293 AZ-00440    -0.407125924
## 294 AZ-00441     0.432661990
## 295 AZ-00442     0.094200021
## 296 AZ-00443     1.977386641
## 297 AZ-00445    -1.499462569
## 298 AZ-00446    -0.047623752
## 299 AZ-00449    -0.876679756
## 300 AZ-00450    -0.217854149
## 301 AZ-00451    -0.566836522
## 302 AZ-00452     0.025425864
## 303 AZ-00454     0.223618450
## 304 AZ-00458    -0.508400353
## 305 AZ-00459    -0.442619954
## 306 AZ-00461     0.340567633
## 307 AZ-00462     0.410404461
## 308 AZ-00464     0.298759131
## 309 AZ-00465    -0.551610201
## 310 AZ-00467    -0.800053902
## 311 AZ-00468     0.683295322
## 312 AZ-00469    -1.272993803
## 313 AZ-00470    -0.450057848

Prediction of time-to-event

form <- Surv(LKADT_P, DEATH == "YES") ~ log(ALP) + HB + log(AST) + log(PSA) + ALB +
  ECOG_C + LIVER + ADRENAL + LYMPH_NODES
survReg <- coxph(form, data = training)
riskhatReg <- predict(survReg, newdata = validation)
qplot(riskhatGam, riskhatReg) + geom_smooth()

The figure shows the scatter plot of the risk predictions on the validation data from the gam and the regression model.

timehat <- summary(survfit(survReg, newdata = validation))$table[, "median"]
qplot(riskhatReg, timehat)

The following table shows the predictions of survival times made from the model, and it is identical to the time-to-event predictions found in the file nrhFinaltimetoevent.csv.

timehatValidation <- 
  cbind(validation[, "RPT", drop = FALSE], data.frame(TIMETOEVENT = timehat))
timehatValidation
##          RPT TIMETOEVENT
## 1   AZ-00001         644
## 2   AZ-00003         976
## 3   AZ-00005        1027
## 4   AZ-00006         520
## 5   AZ-00007         829
## 6   AZ-00008         805
## 7   AZ-00010         861
## 8   AZ-00011         798
## 9   AZ-00012         458
## 10  AZ-00013         852
## 11  AZ-00014        1083
## 12  AZ-00016         790
## 13  AZ-00019         285
## 14  AZ-00020         401
## 15  AZ-00024         857
## 16  AZ-00025         593
## 17  AZ-00027         861
## 18  AZ-00029         388
## 19  AZ-00030         402
## 20  AZ-00031         654
## 21  AZ-00033         602
## 22  AZ-00034         679
## 23  AZ-00035         837
## 24  AZ-00036         401
## 25  AZ-00037         837
## 26  AZ-00038         689
## 27  AZ-00040         413
## 28  AZ-00043         347
## 29  AZ-00044         724
## 30  AZ-00048        1027
## 31  AZ-00049         600
## 32  AZ-00050         745
## 33  AZ-00054         593
## 34  AZ-00055         629
## 35  AZ-00057         879
## 36  AZ-00059         432
## 37  AZ-00060         658
## 38  AZ-00061         602
## 39  AZ-00062         741
## 40  AZ-00063         512
## 41  AZ-00064         745
## 42  AZ-00066         870
## 43  AZ-00070         408
## 44  AZ-00071         743
## 45  AZ-00072         953
## 46  AZ-00073         526
## 47  AZ-00074         830
## 48  AZ-00076         945
## 49  AZ-00077         893
## 50  AZ-00079         362
## 51  AZ-00080        1001
## 52  AZ-00081         376
## 53  AZ-00082         495
## 54  AZ-00085         675
## 55  AZ-00086         644
## 56  AZ-00087         945
## 57  AZ-00088        1059
## 58  AZ-00089         585
## 59  AZ-00090         691
## 60  AZ-00092        1149
## 61  AZ-00093         282
## 62  AZ-00094         421
## 63  AZ-00095         354
## 64  AZ-00096         618
## 65  AZ-00101         487
## 66  AZ-00102         376
## 67  AZ-00103         528
## 68  AZ-00106         829
## 69  AZ-00107        1295
## 70  AZ-00108         742
## 71  AZ-00110         348
## 72  AZ-00111          NA
## 73  AZ-00113         456
## 74  AZ-00114         912
## 75  AZ-00115         870
## 76  AZ-00116         861
## 77  AZ-00119        1031
## 78  AZ-00124         837
## 79  AZ-00126         546
## 80  AZ-00127         845
## 81  AZ-00128         861
## 82  AZ-00129         658
## 83  AZ-00131         724
## 84  AZ-00135         615
## 85  AZ-00137         318
## 86  AZ-00139         520
## 87  AZ-00140        1160
## 88  AZ-00141         903
## 89  AZ-00143        1276
## 90  AZ-00144        1295
## 91  AZ-00145         446
## 92  AZ-00146        1091
## 93  AZ-00149         779
## 94  AZ-00150        1002
## 95  AZ-00152         345
## 96  AZ-00153        1187
## 97  AZ-00154         511
## 98  AZ-00156         469
## 99  AZ-00158        1083
## 100 AZ-00160         577
## 101 AZ-00161         798
## 102 AZ-00163         911
## 103 AZ-00164         559
## 104 AZ-00165         600
## 105 AZ-00166         399
## 106 AZ-00167         585
## 107 AZ-00168        1023
## 108 AZ-00169        1031
## 109 AZ-00173         600
## 110 AZ-00174         649
## 111 AZ-00176         526
## 112 AZ-00177         800
## 113 AZ-00178         858
## 114 AZ-00179         817
## 115 AZ-00181         519
## 116 AZ-00182        1076
## 117 AZ-00186         948
## 118 AZ-00188        1027
## 119 AZ-00190         615
## 120 AZ-00191        1102
## 121 AZ-00192         715
## 122 AZ-00193         268
## 123 AZ-00194        1178
## 124 AZ-00197        1076
## 125 AZ-00198         417
## 126 AZ-00201         743
## 127 AZ-00202         308
## 128 AZ-00203         519
## 129 AZ-00205        1076
## 130 AZ-00206         893
## 131 AZ-00207         834
## 132 AZ-00208         798
## 133 AZ-00209         602
## 134 AZ-00210         846
## 135 AZ-00211         685
## 136 AZ-00212         829
## 137 AZ-00213         549
## 138 AZ-00215         541
## 139 AZ-00217         893
## 140 AZ-00218         502
## 141 AZ-00219          NA
## 142 AZ-00220         542
## 143 AZ-00221         762
## 144 AZ-00223         723
## 145 AZ-00224         607
## 146 AZ-00225         593
## 147 AZ-00226         817
## 148 AZ-00228         405
## 149 AZ-00229        1160
## 150 AZ-00230         231
## 151 AZ-00231         949
## 152 AZ-00233         950
## 153 AZ-00234         731
## 154 AZ-00235         430
## 155 AZ-00236         797
## 156 AZ-00237         747
## 157 AZ-00238        1102
## 158 AZ-00240         449
## 159 AZ-00241        1076
## 160 AZ-00242         837
## 161 AZ-00246         534
## 162 AZ-00249         663
## 163 AZ-00250         578
## 164 AZ-00251         837
## 165 AZ-00252         536
## 166 AZ-00256         352
## 167 AZ-00257        1011
## 168 AZ-00260        1116
## 169 AZ-00261         858
## 170 AZ-00262         953
## 171 AZ-00265         889
## 172 AZ-00266         389
## 173 AZ-00267         846
## 174 AZ-00268        1011
## 175 AZ-00269         943
## 176 AZ-00270         407
## 177 AZ-00272        1011
## 178 AZ-00274         405
## 179 AZ-00275         600
## 180 AZ-00278         955
## 181 AZ-00279         560
## 182 AZ-00280         566
## 183 AZ-00282         986
## 184 AZ-00283         322
## 185 AZ-00286         390
## 186 AZ-00287         598
## 187 AZ-00288         909
## 188 AZ-00289         837
## 189 AZ-00290         897
## 190 AZ-00293         930
## 191 AZ-00294         506
## 192 AZ-00296        1187
## 193 AZ-00297         521
## 194 AZ-00299         750
## 195 AZ-00300         583
## 196 AZ-00301         425
## 197 AZ-00302         543
## 198 AZ-00303         621
## 199 AZ-00305         278
## 200 AZ-00306         400
## 201 AZ-00311        1001
## 202 AZ-00312         523
## 203 AZ-00313         368
## 204 AZ-00314         893
## 205 AZ-00316         607
## 206 AZ-00317          NA
## 207 AZ-00318         442
## 208 AZ-00319         445
## 209 AZ-00320         346
## 210 AZ-00321         437
## 211 AZ-00322         474
## 212 AZ-00323         747
## 213 AZ-00324         916
## 214 AZ-00326         691
## 215 AZ-00327         526
## 216 AZ-00328         858
## 217 AZ-00329         607
## 218 AZ-00330         434
## 219 AZ-00332         912
## 220 AZ-00337         798
## 221 AZ-00339         451
## 222 AZ-00341         677
## 223 AZ-00343         629
## 224 AZ-00345        1091
## 225 AZ-00346         326
## 226 AZ-00347         287
## 227 AZ-00349         454
## 228 AZ-00352         945
## 229 AZ-00354         551
## 230 AZ-00355         528
## 231 AZ-00356         525
## 232 AZ-00357         773
## 233 AZ-00360         582
## 234 AZ-00361         572
## 235 AZ-00362         654
## 236 AZ-00364         553
## 237 AZ-00365         585
## 238 AZ-00366         829
## 239 AZ-00367         327
## 240 AZ-00368         368
## 241 AZ-00369         723
## 242 AZ-00370         565
## 243 AZ-00371         336
## 244 AZ-00373         798
## 245 AZ-00374         798
## 246 AZ-00376         830
## 247 AZ-00377         469
## 248 AZ-00378         430
## 249 AZ-00379         405
## 250 AZ-00380         774
## 251 AZ-00383        1006
## 252 AZ-00384         760
## 253 AZ-00385         943
## 254 AZ-00386         829
## 255 AZ-00387         912
## 256 AZ-00389         523
## 257 AZ-00390         837
## 258 AZ-00391         624
## 259 AZ-00392        1083
## 260 AZ-00393         392
## 261 AZ-00394         760
## 262 AZ-00395         779
## 263 AZ-00399         607
## 264 AZ-00400         645
## 265 AZ-00401        1149
## 266 AZ-00402         870
## 267 AZ-00403         467
## 268 AZ-00404         893
## 269 AZ-00405         283
## 270 AZ-00406         739
## 271 AZ-00407         791
## 272 AZ-00408         347
## 273 AZ-00410         437
## 274 AZ-00412         624
## 275 AZ-00413         858
## 276 AZ-00415         986
## 277 AZ-00417         955
## 278 AZ-00418         298
## 279 AZ-00421        1091
## 280 AZ-00422         903
## 281 AZ-00423         744
## 282 AZ-00424         976
## 283 AZ-00427         461
## 284 AZ-00430         945
## 285 AZ-00431         773
## 286 AZ-00432         327
## 287 AZ-00433         571
## 288 AZ-00434         741
## 289 AZ-00435         649
## 290 AZ-00436         742
## 291 AZ-00438         456
## 292 AZ-00439          NA
## 293 AZ-00440         702
## 294 AZ-00441         503
## 295 AZ-00442         474
## 296 AZ-00443         233
## 297 AZ-00445        1149
## 298 AZ-00446         542
## 299 AZ-00449         949
## 300 AZ-00450         631
## 301 AZ-00451         829
## 302 AZ-00452         591
## 303 AZ-00454         520
## 304 AZ-00458         723
## 305 AZ-00459         774
## 306 AZ-00461         528
## 307 AZ-00462         516
## 308 AZ-00464         542
## 309 AZ-00465         798
## 310 AZ-00467         893
## 311 AZ-00468         453
## 312 AZ-00469        1027
## 313 AZ-00470         741

Discussion

Based on our own validation studies and earlier submissions to the leaderboard, we expect a performance in terms of iAUC between 0.75 and 0.80. The addition of the ridge model to the gam is expected to give an increase in iAUC of at most 0.02. It could be interesting to investigate whether boosting/bagging techniques could improve on the ridge component.

We did not try to include interactions between variables selected for the gam in any systematic way.
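
As a purely hypothetical illustration, a smooth interaction between two of the continuous predictors could be added via a tensor product term in mgcv:

## Hypothetical example of adding a smooth interaction between log(ALP)
## and HB via a tensor product term; no such model was fitted in our work.
survGamInt <- update(survGam, . ~ . + ti(log(ALP), HB))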

It was a little surprising that none of our attempts to come up with a clever handling of missing values had any positive impact on predictive performance; on the contrary, the more complicated imputation methods tended to perform worse.

There is one issue with the variable selection that is worth investigating further. The glmnet function standardizes all variables by default, which is sensible if they are all continuous, but less so if some are dummy variables encoding factor levels. We did some experiments with standardizing only the continuous predictors, but that did not appear to improve predictive performance in terms of iAUC.
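
Such an experiment could be set up along the following lines; identifying the continuous columns of the model matrix by counting unique values is a heuristic of our own and not part of the code above:

## Sketch: standardize only the (heuristically identified) continuous
## columns of the model matrix and turn off glmnet's internal
## standardization.
contCols <- apply(XX, 2, function(x) length(unique(x)) > 2)
XXstd <- XX
XXstd[, contCols] <- scale(XX[, contCols])
survNetStd <- cv.glmnet(XXstd, YY, family = "cox", standardize = FALSE)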

Author contributions

NRH initiated the participation in the challenge. ASB, AHP and JZ worked specifically on this subchallenge. LT and MBND worked specifically on the other subchallenge. All authors worked on the exploratory data analysis, including the longitudinal data, and on the question about imputation. This manuscript was written by NRH based on reports worked out by each of the other five authors.