For the prediction of risk scores for the Prostate Cancer DREAM Challenge we used Cox’s proportional hazards model in combination with variable selection and/or penalization. The data analysis and model building were carried out in R. This document gives a brief description of our work; the computations required for fitting the final model are embedded as R code in the Results section.
It quickly became clear from the initial data analysis that two of the main challenges were how to deal with variable selection and how to handle missing values. The longitudinal data provided in addition to the core table were also explored, but beyond some investigations of data quality and consistency it was not obvious how to use the information in these tables.
The model we ended up with was fairly standard and did not differ much from the model presented by Halabi et al., except that it was a generalized additive model (a gam), which allows for nonlinear relations between the continuous predictors and the log-hazard. This model used only nine predictor variables.
We experimented with one way of extracting additional predictive information from the many variables not included in the gam. This is described further in the Methods and Results sections below, but it remains uncertain whether it produced any actual increase in performance in terms of iAUC.
The following presentation of the methods refers only to the use of the core table; we did not use the longitudinal data tables for the actual modeling. Moreover, this section does not describe the training-test splits of the training data that we used to assess how well our methods could be expected to work. It only documents how the final models were fitted to the full training data set and how the predictions were computed on the validation data.
The variables HGTBLCAT, WGTBLCAT, HEAD_AND_NECK, PANCREAS, THYROID, CREACLCA, CREACL, GLEAS_DX and STOMACH were removed from the core table as these variables have no or limited variation in the training data.
For the training data a single PSA value (Patient ID VEN-756003601) was changed from 0 to 0.01, and a single ECOG value (Patient ID CELG-00121) was changed from 3 to 2.
For both the training and the validation data sets the ECOG variable was converted to a factor.
Some of the variables, lab values in particular, have many missing values. We investigated several imputation methods, but concluded that the more complicated ones resulted in worse predictive performance. Thus we ended up with a very simple imputation method, which in principle relies on a missing completely at random (MCAR) assumption.
For each variable we imputed missing values by sampling with replacement from the observed values of that variable in the training data.
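As a minimal sketch, using a helper name of our own choosing (the full loop over both the training and validation data is shown in the Results section), the imputation of a single variable amounts to:
impute_marginal <- function(x, donor = x) {
  ## Replace each missing value by a draw, with replacement,
  ## from the observed (training) values of the variable.
  nas <- is.na(x)
  x[nas] <- sample(donor[!is.na(donor)], sum(nas), replace = TRUE)
  x
}
For the validation data, donor is the corresponding training column, so that all imputations are drawn from the training distribution.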
For the variable selection we used lasso penalization of Cox’s partial log-likelihood, as implemented in the R package glmnet, in combination with a form of stability selection.
This was carried out by fitting the lasso path to 100 subsamples of the training data, each of half the size of the initial data set. For each subsample the optimal penalty parameter was selected by cross-validation (as implemented in the cv.glmnet function in the glmnet package), and the variables with non-zero coefficients were selected. The proportion of times each variable was selected was then computed across the 100 subsamples, and the most frequently selected variables were taken forward for further modeling.
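The computation can be sketched as follows, assuming glmnet is loaded; stabilitySelect is a wrapper name of our own, and the actual computation on the full design matrix appears in the Results section.
stabilitySelect <- function(X, Y, B = 100) {
  sel <- replicate(B, {
    ## Fit the cross-validated lasso path to a random half of the data.
    ii <- sample(nrow(X), floor(nrow(X) / 2))
    fit <- cv.glmnet(X[ii, ], Y[ii, ], family = "cox")
    ## Record which coefficients are non-zero at the selected penalty.
    as.numeric(coef(fit, s = "lambda.min") != 0)
  })
  rowMeans(sel)  ## selection proportion per variable (column order of X)
}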
Based on the variables selected by the lasso procedure and some additional considerations about the final model, a generalized additive proportional hazards model was fitted using the gam function in the mgcv package with the cox.ph family.
The fitting of a gam uses a basis expansion of all continuous predictors in combination with penalized likelihood estimation. The mgcv package supports automatic selection of the penalty parameters, which is what we used.
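For illustration only (the final model, shown in the Results section, used the defaults; the k and method arguments below merely make explicit where the basis dimension and the penalty selection enter, and the call uses the training data defined later):
## Each continuous predictor enters through a spline basis such as
## s(log(ALP), k = 10), where k bounds the basis dimension (10 is the
## default for s()); for the cox.ph family the smoothing parameters
## are selected by REML.
gam(LKADT_P ~ s(log(ALP), k = 10) + ECOG_C, data = training,
    family = cox.ph(), weights = DEATH == "YES", method = "REML")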
To include potential predictive information from the variables not selected, we fitted a model with a ridge regression penalty, as implemented in the R package glmnet, including all predictors considered in the variable selection step. The final risk prediction was obtained as a weighted linear combination of the risk predictions from the gam and from the ridge model.
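The combination itself is simple; with the weights used in the Results section (1 for the gam and 0.25 for the ridge model) it reads as follows, where combineRisk is just an illustrative name:
combineRisk <- function(riskGam, riskRidge, w = 0.25) {
  ## Weighted linear combination of the two risk predictions.
  riskGam + w * riskRidge
}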
For the prediction of the actual survival time (time-to-event) we used the median survival time as predicted from a proportional hazards model. We did not attempt to optimize this prediction and used the simplest method we could come up with. The model used the same variables as the gam discussed above but was fitted without basis expansion or penalization, using the coxph function from the survival package; the median survival time predictions were then computed using the survfit function.
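As a self-contained illustration on the lung data shipped with the survival package (the actual fit on the challenge data is given in the Results section):
library(survival)
## Fit a plain Cox model and read off the predicted median survival
## time per subject from the survfit summary table.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
med <- summary(survfit(fit, newdata = lung[1:5, ]))$table[, "median"]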
The predictions were obtained by running the code below. The raw code is found in the file Write-up.R.
library(ggplot2)
library(survival)
library(glmnet)
library(mgcv)
The data in the two core tables were loaded from file. To run the code below, the data file must be available at the path used in the code, relative to the working directory of the R session.
In one of the data cleaning steps below, all values equal to the empty string were converted to "No".
load("../Data/Prostate_all.RData")
training <- CoreTable_training
validation <- CoreTable_validation
discard <- c("HGTBLCAT", "WGTBLCAT", "HEAD_AND_NECK",
"PANCREAS", "THYROID", "CREACLCA", "CREACL",
"GLEAS_DX", "STOMACH")
training <- subset(training, select = -which(colnames(training) %in% discard))
validation <- subset(validation, select = -which(colnames(validation) %in% discard))
training <- transform(training,
PSA = ifelse(PSA == 0, 0.01, PSA),
ECOG_C = factor((ECOG_C >= 1) + (ECOG_C >= 2)))
validation <- transform(validation,
ECOG_C = factor((ECOG_C >= 1) + (ECOG_C >= 2)))
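## Recode empty strings as "No" for all factor variables in the training data.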
for (i in seq_len(ncol(training))) {
if (is.factor(training[, i])) {
tmp <- as.character(training[, i])
tmp[tmp == ""] <- "No"
training[, i] <- factor(tmp)
}
}
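## The same recoding for the validation data, aligning factor levels
## with those of the training data where necessary.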
for (i in seq_len(ncol(validation))) {
if (is.factor(validation[, i])) {
tmp <- as.character(validation[, i])
tmp[tmp == ""] <- "No"
validation[, i] <- factor(tmp)
name <- colnames(validation)[i]
if (name %in% colnames(training) &&
!identical(levels(validation[, i]), levels(training[, name])))
validation[, i] <- factor(tmp, levels(training[, name]))
}
}
The implemented imputation scheme consisted of sampling, for each variable, from its marginal empirical distribution in the training data.
set.seed(1234)
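## For each variable: impute missing values in both data sets by sampling
## with replacement from the observed training values of that variable.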
for (i in seq(1, ncol(training))) {
x0 <- training[, i]
nas <- is.na(x0)
if (!all(nas)) {
if (any(nas))
training[nas, i] <- sample(x0[!nas], sum(nas), replace = TRUE)
name <- names(training)[i]
if(name %in% names(validation)) {
nasLeader <- is.na(validation[, name])
if (any(nasLeader))
validation[nasLeader, name] <- sample(x0[!nas], sum(nasLeader), replace = TRUE)
}
}
}
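## Column 23 and columns 27 onwards of the core table hold the predictors;
## columns 4 and 5 hold the survival time (LKADT_P) and death indicator (DEATH).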
variables <- c(23, 27:ncol(training))
train <- training[, c(4, 5, variables)]
XX0 <- XX <- model.matrix(~ . - 1, train[, -c(1, 2)])
YY <- Surv(train$LKADT_P, train$DEATH == "YES")
n <- nrow(XX)
p <- ncol(XX)
B <- 100
select <- matrix(0, p, B)
rownames(select) <- colnames(XX)
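## Stability selection: for each of B random half-samples, fit the lasso path,
## select the penalty by cross-validation and record the non-zero coefficients.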
for (b in seq_len(B)) {
ii <- sample(n, n / 2 )
survNet <- cv.glmnet(XX[ii, ], YY[ii, ], family = "cox")
betahat <- coef(survNet, s = "lambda.min")
select[, b] <- as.numeric(betahat != 0)
}
selectFreq <- rowSums(select) / B
selectFreq <- sort(selectFreq, decreasing = TRUE)
varSort <- factor(names(selectFreq), levels = names(selectFreq))
qplot(varSort[1:20], selectFreq[1:20]) +
theme(axis.text.x = element_text(angle = -90)) +
scale_y_continuous("Selection proportion") +
scale_x_discrete("Variable")
The variable selection procedure shows that the eight variables ALP, AST, HB, ECOG, LIVER, ADRENAL, ALB and ANALGESICS are the most stably selected variables, all selected in more than 60% of the models.
In addition to these eight variables we included PSA and LYMPH_NODES, which were suspected to be predictive based on the study by Halabi et al., and found to be so in our studies as well.
form <- LKADT_P ~ s(log(ALP)) + s(HB) + s(log(AST)) + s(log(PSA)) + s(ALB) +
ECOG_C + LIVER + ADRENAL + LYMPH_NODES + ANALGESICS
survGam <- gam(form, data = training, family = cox.ph(), weights = DEATH == "YES")
summary(survGam)
##
## Family: Cox PH
## Link function: identity
##
## Formula:
## LKADT_P ~ s(log(ALP)) + s(HB) + s(log(AST)) + s(log(PSA)) + s(ALB) +
## ECOG_C + LIVER + ADRENAL + LYMPH_NODES + ANALGESICS
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## ECOG_C1 0.19669 0.08475 2.321 0.02030 *
## ECOG_C2 0.55074 0.17385 3.168 0.00154 **
## LIVERY 0.39505 0.12764 3.095 0.00197 **
## ADRENALY 0.73817 0.22477 3.284 0.00102 **
## LYMPH_NODESY 0.21123 0.07937 2.661 0.00778 **
## ANALGESICSYES 0.07936 0.08694 0.913 0.36131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(log(ALP)) 2.535 3.213 43.38 4.29e-09 ***
## s(HB) 1.002 1.004 19.05 1.30e-05 ***
## s(log(AST)) 2.005 2.605 15.83 0.00085 ***
## s(log(PSA)) 3.641 4.586 11.85 0.02986 *
## s(ALB) 1.003 1.005 5.73 0.01688 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Deviance explained = 10.2%
## -REML = 4144.1 Scale est. = 1 n = 1600
The effect of ANALGESICS was not significant in this model, and it was removed from the model.
survGam <- update(survGam, . ~ . - ANALGESICS)
As mentioned in Methods, the final risk predictions were given as a linear combination of the risk predictions from the gam and from the ridge regression model. The ridge model was fitted using the cv.glmnet function with alpha = 0. The penalty parameter was selected by cross-validation, choosing the model with the minimal cross-validated negative partial log-likelihood.
survNet <- cv.glmnet(XX, YY, family = "cox", alpha = 0)
XXvalidation <- model.matrix(~ . - 1, validation[, variables])
riskhatGam <- predict(survGam, newdata = validation)
riskhatNet <- predict(survNet, newx = XXvalidation, s = "lambda.min")[, 1]
qplot(riskhatGam, riskhatNet) + geom_smooth()
The figure shows the scatter plot of the risk predictions on the validation data from the gam and the ridge regression model. The two predictions are clearly strongly positively correlated.
It should be mentioned that the predictions from the ridge model were generally worse than the predictions from the gam (results not shown); the ridge predictions were therefore given a weight of 0.25 in the final risk prediction, while the gam predictions were given weight 1. The choice of weight was not optimized in a systematic way.
The following table shows the predictions made from the model, and it is identical to the risk predictions found in the file nrhFinal.csv.
riskhat <- riskhatGam + riskhatNet / 4
riskhatValidation <-
cbind(validation[, "RPT", drop = FALSE], data.frame(riskScoreGlobal = riskhat))
riskhatValidation
## RPT riskScoreGlobal
## 1 AZ-00001 -0.227741878
## 2 AZ-00003 -1.103190111
## 3 AZ-00005 -1.198691034
## 4 AZ-00006 0.339399822
## 5 AZ-00007 -0.654883117
## 6 AZ-00008 -0.540225551
## 7 AZ-00010 -0.848069874
## 8 AZ-00011 -0.603070266
## 9 AZ-00012 0.653206756
## 10 AZ-00013 -0.771761179
## 11 AZ-00014 -1.228289107
## 12 AZ-00016 -0.618315877
## 13 AZ-00019 1.527487898
## 14 AZ-00020 0.943411410
## 15 AZ-00024 -0.827449711
## 16 AZ-00025 -0.088499276
## 17 AZ-00027 -0.812861251
## 18 AZ-00029 0.852866559
## 19 AZ-00030 0.609698900
## 20 AZ-00031 -0.202778004
## 21 AZ-00033 -0.077520956
## 22 AZ-00034 -0.314154140
## 23 AZ-00035 -0.747674699
## 24 AZ-00036 0.986451545
## 25 AZ-00037 -0.776053353
## 26 AZ-00038 -0.268856655
## 27 AZ-00040 0.762404322
## 28 AZ-00043 0.959094962
## 29 AZ-00044 -0.414249961
## 30 AZ-00048 -1.176070867
## 31 AZ-00049 -0.045483286
## 32 AZ-00050 -0.428833845
## 33 AZ-00054 0.076162571
## 34 AZ-00055 -0.129954797
## 35 AZ-00057 -0.761031251
## 36 AZ-00059 0.674328628
## 37 AZ-00060 -0.168608525
## 38 AZ-00061 -0.149586988
## 39 AZ-00062 -0.399143773
## 40 AZ-00063 0.399417252
## 41 AZ-00064 -0.582855492
## 42 AZ-00066 -0.753868709
## 43 AZ-00070 0.886156500
## 44 AZ-00071 -0.535554543
## 45 AZ-00072 -0.784659399
## 46 AZ-00073 0.416707657
## 47 AZ-00074 -0.609439250
## 48 AZ-00076 -1.119211602
## 49 AZ-00077 -0.839306303
## 50 AZ-00079 1.017072904
## 51 AZ-00080 -1.156928404
## 52 AZ-00081 1.059658327
## 53 AZ-00082 0.570620777
## 54 AZ-00085 -0.334923927
## 55 AZ-00086 -0.165750890
## 56 AZ-00087 -0.847282758
## 57 AZ-00088 -1.381346985
## 58 AZ-00089 0.025413845
## 59 AZ-00090 -0.318571916
## 60 AZ-00092 -1.265760484
## 61 AZ-00093 1.582272723
## 62 AZ-00094 0.536870323
## 63 AZ-00095 1.246872247
## 64 AZ-00096 -0.156402862
## 65 AZ-00101 0.370454168
## 66 AZ-00102 1.216063177
## 67 AZ-00103 0.258934587
## 68 AZ-00106 -0.694452026
## 69 AZ-00107 -1.589188080
## 70 AZ-00108 -0.600871424
## 71 AZ-00110 1.103729657
## 72 AZ-00111 -1.645813663
## 73 AZ-00113 0.617851628
## 74 AZ-00114 -1.024011670
## 75 AZ-00115 -0.976739032
## 76 AZ-00116 -0.983165744
## 77 AZ-00119 -1.218238985
## 78 AZ-00124 -0.620257818
## 79 AZ-00126 0.096439978
## 80 AZ-00127 -0.638576515
## 81 AZ-00128 -0.898806304
## 82 AZ-00129 -0.250992331
## 83 AZ-00131 -0.543888468
## 84 AZ-00135 -0.131801749
## 85 AZ-00137 1.503935192
## 86 AZ-00139 0.377326535
## 87 AZ-00140 -1.632302729
## 88 AZ-00141 -0.895207701
## 89 AZ-00143 -1.224819260
## 90 AZ-00144 -1.436653684
## 91 AZ-00145 0.222522046
## 92 AZ-00146 -1.395299366
## 93 AZ-00149 -0.502327906
## 94 AZ-00150 -0.868747405
## 95 AZ-00152 0.896252474
## 96 AZ-00153 -1.336158454
## 97 AZ-00154 0.298315780
## 98 AZ-00156 0.476482773
## 99 AZ-00158 -1.498150624
## 100 AZ-00160 -0.190270725
## 101 AZ-00161 -0.647622548
## 102 AZ-00163 -0.826800525
## 103 AZ-00164 0.024076814
## 104 AZ-00165 -0.087438092
## 105 AZ-00166 0.540547308
## 106 AZ-00167 -0.228885630
## 107 AZ-00168 -1.329570489
## 108 AZ-00169 -1.219519454
## 109 AZ-00173 0.027971461
## 110 AZ-00174 -0.158560135
## 111 AZ-00176 0.362513956
## 112 AZ-00177 -0.639935835
## 113 AZ-00178 -0.944236936
## 114 AZ-00179 -0.864908697
## 115 AZ-00181 0.144939585
## 116 AZ-00182 -1.306400094
## 117 AZ-00186 -1.094780825
## 118 AZ-00188 -1.281535150
## 119 AZ-00190 -0.188278755
## 120 AZ-00191 -1.177342156
## 121 AZ-00192 -0.359828966
## 122 AZ-00193 2.599372248
## 123 AZ-00194 -1.416672507
## 124 AZ-00197 -1.275344362
## 125 AZ-00198 0.561784838
## 126 AZ-00201 -0.301727551
## 127 AZ-00202 1.489453852
## 128 AZ-00203 0.446769894
## 129 AZ-00205 -1.065432364
## 130 AZ-00206 -1.014719884
## 131 AZ-00207 -0.618702927
## 132 AZ-00208 -0.573448119
## 133 AZ-00209 -0.006235401
## 134 AZ-00210 -0.933119722
## 135 AZ-00211 -0.153863846
## 136 AZ-00212 -0.501587207
## 137 AZ-00213 -0.028753638
## 138 AZ-00215 0.325360143
## 139 AZ-00217 -1.076018077
## 140 AZ-00218 0.456805296
## 141 AZ-00219 -2.012243396
## 142 AZ-00220 0.171416906
## 143 AZ-00221 -0.443847945
## 144 AZ-00223 -0.290542228
## 145 AZ-00224 -0.075261169
## 146 AZ-00225 0.044352746
## 147 AZ-00226 -0.636497385
## 148 AZ-00228 1.043850899
## 149 AZ-00229 -1.452929437
## 150 AZ-00230 2.176660077
## 151 AZ-00231 -1.113152152
## 152 AZ-00233 -1.026110734
## 153 AZ-00234 -0.386976431
## 154 AZ-00235 0.844471972
## 155 AZ-00236 -0.546163599
## 156 AZ-00237 -0.430129810
## 157 AZ-00238 -1.466740612
## 158 AZ-00240 0.709374300
## 159 AZ-00241 -1.135233449
## 160 AZ-00242 -0.496685188
## 161 AZ-00246 0.188687734
## 162 AZ-00249 -0.292092300
## 163 AZ-00250 -0.032683900
## 164 AZ-00251 -0.701847680
## 165 AZ-00252 0.240167784
## 166 AZ-00256 1.098760170
## 167 AZ-00257 -1.311755031
## 168 AZ-00260 -1.444792251
## 169 AZ-00261 -0.645018373
## 170 AZ-00262 -1.052420920
## 171 AZ-00265 -0.741342409
## 172 AZ-00266 0.978380598
## 173 AZ-00267 -0.689786323
## 174 AZ-00268 -1.166844402
## 175 AZ-00269 -0.969290545
## 176 AZ-00270 0.888661754
## 177 AZ-00272 -1.146435624
## 178 AZ-00274 0.581785646
## 179 AZ-00275 -0.143589052
## 180 AZ-00278 -1.039484283
## 181 AZ-00279 -0.021280286
## 182 AZ-00280 0.134776036
## 183 AZ-00282 -1.104620962
## 184 AZ-00283 1.061241940
## 185 AZ-00286 1.999464480
## 186 AZ-00287 0.062523855
## 187 AZ-00288 -0.993110530
## 188 AZ-00289 -0.737991788
## 189 AZ-00290 -0.881654365
## 190 AZ-00293 -1.100702450
## 191 AZ-00294 0.465987097
## 192 AZ-00296 -1.531913898
## 193 AZ-00297 0.390280295
## 194 AZ-00299 -0.492967014
## 195 AZ-00300 0.044611190
## 196 AZ-00301 0.630914535
## 197 AZ-00302 0.224174610
## 198 AZ-00303 -0.082199317
## 199 AZ-00305 1.664021045
## 200 AZ-00306 0.762874873
## 201 AZ-00311 -1.364303171
## 202 AZ-00312 0.236648516
## 203 AZ-00313 0.789843987
## 204 AZ-00314 -0.942994078
## 205 AZ-00316 -0.124564576
## 206 AZ-00317 -1.755595332
## 207 AZ-00318 0.435633794
## 208 AZ-00319 0.696344372
## 209 AZ-00320 1.158072574
## 210 AZ-00321 0.943169269
## 211 AZ-00322 0.620707211
## 212 AZ-00323 -0.451062117
## 213 AZ-00324 -0.891937300
## 214 AZ-00326 -0.262775316
## 215 AZ-00327 -0.108457545
## 216 AZ-00328 -0.711103120
## 217 AZ-00329 -0.126761458
## 218 AZ-00330 0.577753662
## 219 AZ-00332 -1.056013252
## 220 AZ-00337 -0.250217547
## 221 AZ-00339 0.563249089
## 222 AZ-00341 -0.127665248
## 223 AZ-00343 -0.291072854
## 224 AZ-00345 -1.379551441
## 225 AZ-00346 1.211359266
## 226 AZ-00347 1.674234603
## 227 AZ-00349 0.633147573
## 228 AZ-00352 -0.945289410
## 229 AZ-00354 0.009453291
## 230 AZ-00355 0.247078681
## 231 AZ-00356 0.160044940
## 232 AZ-00357 -0.361661135
## 233 AZ-00360 -0.284030093
## 234 AZ-00361 0.029903646
## 235 AZ-00362 -0.154294066
## 236 AZ-00364 0.283322349
## 237 AZ-00365 0.024031724
## 238 AZ-00366 -0.652925875
## 239 AZ-00367 0.945961885
## 240 AZ-00368 1.033960902
## 241 AZ-00369 -0.401722659
## 242 AZ-00370 0.016200885
## 243 AZ-00371 1.270238562
## 244 AZ-00373 -0.606323980
## 245 AZ-00374 -0.533813067
## 246 AZ-00376 -0.816915796
## 247 AZ-00377 0.670465443
## 248 AZ-00378 0.481537562
## 249 AZ-00379 1.016196409
## 250 AZ-00380 -0.730757869
## 251 AZ-00383 -1.192099016
## 252 AZ-00384 -0.474004280
## 253 AZ-00385 -1.142212626
## 254 AZ-00386 -0.598290153
## 255 AZ-00387 -1.001168382
## 256 AZ-00389 0.332010899
## 257 AZ-00390 -0.789321323
## 258 AZ-00391 -0.276589748
## 259 AZ-00392 -1.270339509
## 260 AZ-00393 1.844049455
## 261 AZ-00394 -0.509490187
## 262 AZ-00395 -0.412547723
## 263 AZ-00399 -0.083195611
## 264 AZ-00400 -0.436847520
## 265 AZ-00401 -1.475197870
## 266 AZ-00402 -0.873061665
## 267 AZ-00403 0.662446419
## 268 AZ-00404 -0.893930094
## 269 AZ-00405 1.714686648
## 270 AZ-00406 -0.309884937
## 271 AZ-00407 -0.637613975
## 272 AZ-00408 1.298395831
## 273 AZ-00410 0.452765933
## 274 AZ-00412 -0.244992228
## 275 AZ-00413 -0.844013863
## 276 AZ-00415 -1.184148960
## 277 AZ-00417 -0.880781886
## 278 AZ-00418 1.006033118
## 279 AZ-00421 -1.288822177
## 280 AZ-00422 -0.813086890
## 281 AZ-00423 -0.601081434
## 282 AZ-00424 -1.228882949
## 283 AZ-00427 0.496236643
## 284 AZ-00430 -1.081747341
## 285 AZ-00431 -0.564114691
## 286 AZ-00432 1.045248752
## 287 AZ-00433 0.081897217
## 288 AZ-00434 -0.653708261
## 289 AZ-00435 -0.274573376
## 290 AZ-00436 -0.543974182
## 291 AZ-00438 0.676335519
## 292 AZ-00439 -1.685919155
## 293 AZ-00440 -0.407125924
## 294 AZ-00441 0.432661990
## 295 AZ-00442 0.094200021
## 296 AZ-00443 1.977386641
## 297 AZ-00445 -1.499462569
## 298 AZ-00446 -0.047623752
## 299 AZ-00449 -0.876679756
## 300 AZ-00450 -0.217854149
## 301 AZ-00451 -0.566836522
## 302 AZ-00452 0.025425864
## 303 AZ-00454 0.223618450
## 304 AZ-00458 -0.508400353
## 305 AZ-00459 -0.442619954
## 306 AZ-00461 0.340567633
## 307 AZ-00462 0.410404461
## 308 AZ-00464 0.298759131
## 309 AZ-00465 -0.551610201
## 310 AZ-00467 -0.800053902
## 311 AZ-00468 0.683295322
## 312 AZ-00469 -1.272993803
## 313 AZ-00470 -0.450057848
form <- Surv(LKADT_P, DEATH == "YES") ~ log(ALP) + HB + log(AST) + log(PSA) + ALB +
    ECOG_C + LIVER + ADRENAL + LYMPH_NODES
survReg <- coxph(form, data = training)
riskhatReg <- predict(survReg, newdata = validation)
qplot(riskhatGam, riskhatReg) + geom_smooth()
The figure shows the scatter plot of the risk predictions on the validation data from the gam and the regression model.
timehat <- summary(survfit(survReg, newdata = validation))$table[, "median"]
qplot(riskhatReg, timehat)
The following table shows the predictions of survival times made from the model, and it is identical to the time-to-event predictions found in the file nrhFinaltimetoevent.csv.
timehatValidation <-
cbind(validation[, "RPT", drop = FALSE], data.frame(TIMETOEVENT = timehat))
timehatValidation
## RPT TIMETOEVENT
## 1 AZ-00001 644
## 2 AZ-00003 976
## 3 AZ-00005 1027
## 4 AZ-00006 520
## 5 AZ-00007 829
## 6 AZ-00008 805
## 7 AZ-00010 861
## 8 AZ-00011 798
## 9 AZ-00012 458
## 10 AZ-00013 852
## 11 AZ-00014 1083
## 12 AZ-00016 790
## 13 AZ-00019 285
## 14 AZ-00020 401
## 15 AZ-00024 857
## 16 AZ-00025 593
## 17 AZ-00027 861
## 18 AZ-00029 388
## 19 AZ-00030 402
## 20 AZ-00031 654
## 21 AZ-00033 602
## 22 AZ-00034 679
## 23 AZ-00035 837
## 24 AZ-00036 401
## 25 AZ-00037 837
## 26 AZ-00038 689
## 27 AZ-00040 413
## 28 AZ-00043 347
## 29 AZ-00044 724
## 30 AZ-00048 1027
## 31 AZ-00049 600
## 32 AZ-00050 745
## 33 AZ-00054 593
## 34 AZ-00055 629
## 35 AZ-00057 879
## 36 AZ-00059 432
## 37 AZ-00060 658
## 38 AZ-00061 602
## 39 AZ-00062 741
## 40 AZ-00063 512
## 41 AZ-00064 745
## 42 AZ-00066 870
## 43 AZ-00070 408
## 44 AZ-00071 743
## 45 AZ-00072 953
## 46 AZ-00073 526
## 47 AZ-00074 830
## 48 AZ-00076 945
## 49 AZ-00077 893
## 50 AZ-00079 362
## 51 AZ-00080 1001
## 52 AZ-00081 376
## 53 AZ-00082 495
## 54 AZ-00085 675
## 55 AZ-00086 644
## 56 AZ-00087 945
## 57 AZ-00088 1059
## 58 AZ-00089 585
## 59 AZ-00090 691
## 60 AZ-00092 1149
## 61 AZ-00093 282
## 62 AZ-00094 421
## 63 AZ-00095 354
## 64 AZ-00096 618
## 65 AZ-00101 487
## 66 AZ-00102 376
## 67 AZ-00103 528
## 68 AZ-00106 829
## 69 AZ-00107 1295
## 70 AZ-00108 742
## 71 AZ-00110 348
## 72 AZ-00111 NA
## 73 AZ-00113 456
## 74 AZ-00114 912
## 75 AZ-00115 870
## 76 AZ-00116 861
## 77 AZ-00119 1031
## 78 AZ-00124 837
## 79 AZ-00126 546
## 80 AZ-00127 845
## 81 AZ-00128 861
## 82 AZ-00129 658
## 83 AZ-00131 724
## 84 AZ-00135 615
## 85 AZ-00137 318
## 86 AZ-00139 520
## 87 AZ-00140 1160
## 88 AZ-00141 903
## 89 AZ-00143 1276
## 90 AZ-00144 1295
## 91 AZ-00145 446
## 92 AZ-00146 1091
## 93 AZ-00149 779
## 94 AZ-00150 1002
## 95 AZ-00152 345
## 96 AZ-00153 1187
## 97 AZ-00154 511
## 98 AZ-00156 469
## 99 AZ-00158 1083
## 100 AZ-00160 577
## 101 AZ-00161 798
## 102 AZ-00163 911
## 103 AZ-00164 559
## 104 AZ-00165 600
## 105 AZ-00166 399
## 106 AZ-00167 585
## 107 AZ-00168 1023
## 108 AZ-00169 1031
## 109 AZ-00173 600
## 110 AZ-00174 649
## 111 AZ-00176 526
## 112 AZ-00177 800
## 113 AZ-00178 858
## 114 AZ-00179 817
## 115 AZ-00181 519
## 116 AZ-00182 1076
## 117 AZ-00186 948
## 118 AZ-00188 1027
## 119 AZ-00190 615
## 120 AZ-00191 1102
## 121 AZ-00192 715
## 122 AZ-00193 268
## 123 AZ-00194 1178
## 124 AZ-00197 1076
## 125 AZ-00198 417
## 126 AZ-00201 743
## 127 AZ-00202 308
## 128 AZ-00203 519
## 129 AZ-00205 1076
## 130 AZ-00206 893
## 131 AZ-00207 834
## 132 AZ-00208 798
## 133 AZ-00209 602
## 134 AZ-00210 846
## 135 AZ-00211 685
## 136 AZ-00212 829
## 137 AZ-00213 549
## 138 AZ-00215 541
## 139 AZ-00217 893
## 140 AZ-00218 502
## 141 AZ-00219 NA
## 142 AZ-00220 542
## 143 AZ-00221 762
## 144 AZ-00223 723
## 145 AZ-00224 607
## 146 AZ-00225 593
## 147 AZ-00226 817
## 148 AZ-00228 405
## 149 AZ-00229 1160
## 150 AZ-00230 231
## 151 AZ-00231 949
## 152 AZ-00233 950
## 153 AZ-00234 731
## 154 AZ-00235 430
## 155 AZ-00236 797
## 156 AZ-00237 747
## 157 AZ-00238 1102
## 158 AZ-00240 449
## 159 AZ-00241 1076
## 160 AZ-00242 837
## 161 AZ-00246 534
## 162 AZ-00249 663
## 163 AZ-00250 578
## 164 AZ-00251 837
## 165 AZ-00252 536
## 166 AZ-00256 352
## 167 AZ-00257 1011
## 168 AZ-00260 1116
## 169 AZ-00261 858
## 170 AZ-00262 953
## 171 AZ-00265 889
## 172 AZ-00266 389
## 173 AZ-00267 846
## 174 AZ-00268 1011
## 175 AZ-00269 943
## 176 AZ-00270 407
## 177 AZ-00272 1011
## 178 AZ-00274 405
## 179 AZ-00275 600
## 180 AZ-00278 955
## 181 AZ-00279 560
## 182 AZ-00280 566
## 183 AZ-00282 986
## 184 AZ-00283 322
## 185 AZ-00286 390
## 186 AZ-00287 598
## 187 AZ-00288 909
## 188 AZ-00289 837
## 189 AZ-00290 897
## 190 AZ-00293 930
## 191 AZ-00294 506
## 192 AZ-00296 1187
## 193 AZ-00297 521
## 194 AZ-00299 750
## 195 AZ-00300 583
## 196 AZ-00301 425
## 197 AZ-00302 543
## 198 AZ-00303 621
## 199 AZ-00305 278
## 200 AZ-00306 400
## 201 AZ-00311 1001
## 202 AZ-00312 523
## 203 AZ-00313 368
## 204 AZ-00314 893
## 205 AZ-00316 607
## 206 AZ-00317 NA
## 207 AZ-00318 442
## 208 AZ-00319 445
## 209 AZ-00320 346
## 210 AZ-00321 437
## 211 AZ-00322 474
## 212 AZ-00323 747
## 213 AZ-00324 916
## 214 AZ-00326 691
## 215 AZ-00327 526
## 216 AZ-00328 858
## 217 AZ-00329 607
## 218 AZ-00330 434
## 219 AZ-00332 912
## 220 AZ-00337 798
## 221 AZ-00339 451
## 222 AZ-00341 677
## 223 AZ-00343 629
## 224 AZ-00345 1091
## 225 AZ-00346 326
## 226 AZ-00347 287
## 227 AZ-00349 454
## 228 AZ-00352 945
## 229 AZ-00354 551
## 230 AZ-00355 528
## 231 AZ-00356 525
## 232 AZ-00357 773
## 233 AZ-00360 582
## 234 AZ-00361 572
## 235 AZ-00362 654
## 236 AZ-00364 553
## 237 AZ-00365 585
## 238 AZ-00366 829
## 239 AZ-00367 327
## 240 AZ-00368 368
## 241 AZ-00369 723
## 242 AZ-00370 565
## 243 AZ-00371 336
## 244 AZ-00373 798
## 245 AZ-00374 798
## 246 AZ-00376 830
## 247 AZ-00377 469
## 248 AZ-00378 430
## 249 AZ-00379 405
## 250 AZ-00380 774
## 251 AZ-00383 1006
## 252 AZ-00384 760
## 253 AZ-00385 943
## 254 AZ-00386 829
## 255 AZ-00387 912
## 256 AZ-00389 523
## 257 AZ-00390 837
## 258 AZ-00391 624
## 259 AZ-00392 1083
## 260 AZ-00393 392
## 261 AZ-00394 760
## 262 AZ-00395 779
## 263 AZ-00399 607
## 264 AZ-00400 645
## 265 AZ-00401 1149
## 266 AZ-00402 870
## 267 AZ-00403 467
## 268 AZ-00404 893
## 269 AZ-00405 283
## 270 AZ-00406 739
## 271 AZ-00407 791
## 272 AZ-00408 347
## 273 AZ-00410 437
## 274 AZ-00412 624
## 275 AZ-00413 858
## 276 AZ-00415 986
## 277 AZ-00417 955
## 278 AZ-00418 298
## 279 AZ-00421 1091
## 280 AZ-00422 903
## 281 AZ-00423 744
## 282 AZ-00424 976
## 283 AZ-00427 461
## 284 AZ-00430 945
## 285 AZ-00431 773
## 286 AZ-00432 327
## 287 AZ-00433 571
## 288 AZ-00434 741
## 289 AZ-00435 649
## 290 AZ-00436 742
## 291 AZ-00438 456
## 292 AZ-00439 NA
## 293 AZ-00440 702
## 294 AZ-00441 503
## 295 AZ-00442 474
## 296 AZ-00443 233
## 297 AZ-00445 1149
## 298 AZ-00446 542
## 299 AZ-00449 949
## 300 AZ-00450 631
## 301 AZ-00451 829
## 302 AZ-00452 591
## 303 AZ-00454 520
## 304 AZ-00458 723
## 305 AZ-00459 774
## 306 AZ-00461 528
## 307 AZ-00462 516
## 308 AZ-00464 542
## 309 AZ-00465 798
## 310 AZ-00467 893
## 311 AZ-00468 453
## 312 AZ-00469 1027
## 313 AZ-00470 741
Based on our own validation studies and earlier submissions to the leaderboard, we expect a performance in terms of iAUC between 0.75 and 0.80. The addition of the ridge model to the gam is expected to give at most an increase in iAUC of 0.02. It could be interesting to investigate whether boosting/bagging techniques could improve on the ridge component.
We did not try to include interactions between variables selected for the gam in any systematic way.
It was a little surprising that none of our attempts to come up with a clever handling of the missing values had any positive impact on predictive performance; if anything, the opposite was true.
There is one issue with the variable selection that is worth investigating further. The glmnet function standardizes all variables by default, which is sensible if they are all continuous, but less so if some are dummy variables encoding factor levels. We did some experiments with standardizing only the continuous predictors, but that did not appear to improve predictive performance in terms of iAUC.
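A sketch of this alternative, using the XX and YY objects from the Results section; the heuristic for identifying the continuous columns (more than two distinct values) is our own, and glmnet's internal standardization is disabled via its standardize argument.
contIdx <- apply(XX, 2, function(x) length(unique(x)) > 2)
XXstd <- XX
## Standardize only the continuous columns by hand, then fit the
## cross-validated Cox ridge path without internal standardization.
XXstd[, contIdx] <- scale(XX[, contIdx])
survNetStd <- cv.glmnet(XXstd, YY, family = "cox", standardize = FALSE)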