Model selection based on combined penalties for biomarker identification

ABSTRACT The growing role of targeted medicine has led to an increased focus on the development of actionable biomarkers. Current penalized selection methods that are used to identify biomarker panels for classification in high-dimensional data, however, often result in highly complex panels that need careful pruning for practical use. In the framework of regularization methods, a penalty that is a weighted sum of the L1 and L0 norms has been proposed to account for the complexity of the resulting model. In practice, the limitation of this penalty is that the objective function is non-convex and non-smooth, the optimization is computationally intensive, and the application to high-dimensional settings is challenging. In this paper, we propose a stepwise forward variable selection method which combines the L0 norm with the L1 or L2 norms. The penalized likelihood criterion used in the stepwise selection procedure results in more parsimonious models, keeping only the most relevant features. Simulation results and a real application show that our approach achieves prediction performance comparable to common selection methods while minimizing the number of variables in the selected model, yielding the desired parsimony.


Introduction
The high costs and long duration of clinical development, paired with high levels of attrition, require the quantification of the risk when moving from early to late stage clinical development, and biomarkers may play an important role in this quantification. However, only rarely does the number of variables (biomarkers) in the resulting panel play an active role in selection procedures. Variable selection is an important aspect in the determination of such panels in the framework of high-dimensional statistical modeling. In practice, a large number of candidate predictors are available for modeling. Keeping only the relevant variables in the model enhances the interpretation and may increase the predictability of the resulting model.
Particularly in the framework of regularization methods, various penalty functions are used to perform variable selection. Frank and Friedman (1993) proposed bridge regression by introducing a penalty of the form $L_q = \sum_{j=1}^{d} |\beta_j|^q$, $q > 0$, for the vector of regression coefficients $\beta = (\beta_1, \beta_2, \ldots, \beta_d) \in \mathbb{R}^d$.
When $q \le 1$ the penalty performs variable selection. The case $q = 1$ is the $L_1$ penalty and corresponds to the Least Absolute Shrinkage and Selection Operator (Lasso) (Tibshirani, 1996). It performs continuous shrinkage and variable selection at the same time, whereas for $q = 2$ we obtain the ridge estimator (Hoerl and Kennard, 1970), which shrinks coefficients toward zero but does not perform variable selection. The limit of $L_q$ as $q \to 0$ gives the $L_0$ penalty, which penalizes the number of nonzero coefficients and is thus appealing for model selection when sparse models are desired. However, due to its non-convexity and discontinuity at the origin, the corresponding optimization problem is difficult to solve in high dimensions. In genomic research, an $L_1$ penalty is routinely used due to its convexity and ease of optimization. However, the result of $L_1$-type regularization may not be sparse enough for a good interpretation. The development of methods that yield sparser solutions than $L_1$ penalization is becoming an essential part of the classification and feature selection area. A variable selection method that combines the $L_1$ and $L_0$ penalties was proposed by Liu and Wu (2007), who used a mixed integer programming algorithm to optimize the objective function. Their results showed that the method achieved sparser solutions than the Lasso and more stable solutions than $L_0$ regularization. However, the application was limited to moderate data sizes, due to computational inefficiency for large-scale problems. Other combinations of $L_q$ penalties have been proposed (Zou and Hastie, 2005; Huang et al., 2016), each using a different optimization algorithm to approach the solution.
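To make the family of penalties concrete, the following Python sketch (illustrative only; the paper's own implementation is in R) evaluates the bridge penalty $L_q$ for a coefficient vector, together with the exact $L_0$ count that it approaches as $q \to 0$:

```python
import numpy as np

def lq_penalty(beta, q):
    """Bridge-type penalty L_q(beta) = sum_j |beta_j|^q, for q > 0."""
    return float(np.sum(np.abs(beta) ** q))

def l0_penalty(beta, tol=1e-8):
    """L_0 penalty: the number of (numerically) nonzero coefficients."""
    return int(np.sum(np.abs(beta) > tol))

beta = np.array([0.0, 1.5, -0.5, 0.0, 2.0])
lq_penalty(beta, 1)   # L1: 1.5 + 0.5 + 2.0 = 4.0
lq_penalty(beta, 2)   # L2 (squared): 2.25 + 0.25 + 4.0 = 6.5
l0_penalty(beta)      # 3 nonzero coefficients
```

Note that only the $L_q$ penalties with $q \le 1$ have a singularity at the origin, which is what produces exact zeros in the minimizer.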
In this article, we propose a method for variable selection that penalizes the likelihood function with a linear combination of the $L_0$ norm with the $L_1$ or $L_2$ norms (CL, CL2) in a stepwise forward variable selection procedure. The aim is to obtain a model that is sparser than the model with the $L_1$ penalty alone while achieving a good predictive performance. Moreover, a strong motivation for the proposed stepwise forward variable selection method is that state-of-the-art global optimization algorithms for non-smooth and non-convex functions do not provide satisfactory results. In Section 2, we define the CL and CL2 penalties and present the algorithm for solving the penalized logistic regression problem with these combined penalties. In Section 3, we use simulated data to evaluate the performance of our method, and we compare it to the Lasso and adaptive Lasso both in terms of correct variable selection (true covariates with $\beta_j \neq 0$) and predictive performance. Finally, we show an application of our method for classification and variable selection on a real dataset with protein measurements, to identify the smallest number of predictors that can best classify responders and non-responders to a treatment.

Regularization
Suppose we have data (X, y), where $y = (y_1, y_2, \ldots, y_n)$ is the vector of responses and X is an $n \times d$ matrix of predictors. We assume that the observations are independent and the predictors are standardized. With linear predictor $\eta = X^T\beta$ and link function $g$, the generalized linear model is expressed as $g(E(y \mid X)) = \eta = X^T\beta$. (2.1) Under the regularization framework, the estimated coefficients $\hat{\beta} = (\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_d) \in \mathbb{R}^d$ are obtained by minimizing the objective function $-\log L + \lambda P(\beta)$, that is, $\hat{\beta} = \mathrm{argmin}_{\beta} \{-\log L + \lambda P(\beta)\}$, where $P(\beta)$ is a regularization term, $\lambda > 0$ is a tuning parameter, and $-\log L$ is the negative log-likelihood. One of the most popular and commonly used regularization methods is the Lasso, with $P(\beta) = \sum_{j=1}^{d} |\beta_j|$. Setting $\lambda = 0$ reverts the Lasso to maximum likelihood estimation. On the other hand, a very large $\lambda$ will completely shrink $\beta$ to zero, leading to the empty or null model. In general, moderate values of $\lambda$ cause shrinkage of the solutions toward zero, and some coefficients may be exactly zero.
Other types of $L_1$ regularization include the adaptive Lasso, where adaptive weights are used for penalizing different coefficients in the $L_1$ penalty, and which was shown to have the oracle property (Zou, 2006). A variable selection and estimation procedure is said to have the oracle property if i) it selects the true model with probability tending to 1, and ii) the estimated penalized coefficients are asymptotically normal, with the same asymptotic variance as the estimator based on the true model. However, $L_1$-type regularization is consistent only under rather restrictive assumptions (Zhao and Yu, 2006), and the coefficient estimates are severely biased due to shrinkage (Meinshausen and Yu, 2009; Fan and Li, 2001). Although the $L_0$ norm, where $P(\beta) = \sum_{j=1}^{d} 1_{\{\beta_j \neq 0\}}$ and $1_{\{\beta_j \neq 0\}}$ is the indicator of whether $\beta_j \neq 0$, tends to yield the sparsest solutions, its implementation in high-dimensional data is an NP-hard optimization problem and is not computationally feasible. Classical information criteria like AIC (Akaike, 1974) or BIC (Schwarz, 1978) lie in the general class of regularizations $\lambda P(\beta) = \lambda \sum_{j=1}^{d} 1_{\{\beta_j \neq 0\}}$ for suitable choices of $\lambda$. In order to obtain a more concise and sparse solution while keeping a high predictive accuracy of the classification model, we propose a regularization term that combines the $L_0$ norm with the $L_1$ or $L_2$ norms (Liu and Wu, 2007).

Figure 1. Plot of the $L_0$, $L_{0\varepsilon}$, $L_2$, $L_1$ norms with $\varepsilon = 0.1$ and $d = 1$.

Figure 1 plots the penalty functions $L_1$ and $L_2$ in the bottom panel and the $L_0$ penalty in the top panel. Unlike $L_2$, the penalty terms $L_1$ and $L_0$ are singular at the origin and thus perform variable selection (Fan and Li, 2001).
The combined L0 + L1 penalty

Following Liu and Wu (2007), the penalization term is defined as $CL_\varepsilon^\alpha(\beta) = (1 - \alpha)L_{0\varepsilon} + \alpha L_1$, where $0 \le \alpha \le 1$ is a weighting parameter between the $L_{0\varepsilon}$ and $L_1$ penalties, with $L_{0\varepsilon}$ given by $L_{0\varepsilon}(\beta) = \sum_{j=1}^{d} \min(|\beta_j|/\varepsilon, 1)$. (2.2) The discontinuity of $L_0$ at the origin makes the optimization difficult, and therefore we consider the continuous approximation to $L_0$ defined by (2.2). The limit of $L_{0\varepsilon}(\beta)$ as $\varepsilon \to 0$ is $L_0(\beta)$ itself, and when $\varepsilon > 0$ is small, $L_{0\varepsilon}(\beta)$ is a good approximation to $L_0(\beta)$ (Figure 1, top right). The estimated coefficients are obtained by minimizing the objective function $\hat{\beta} = \mathrm{argmin}_{\beta}\{-\log L + \lambda\, CL_\varepsilon^\alpha(\beta)\}$. (2.3)

The combined L0 + L2 penalty

We now consider another combination, the $L_0$ norm with $L_2$. The motivation for combining the $L_0$ norm with $L_2$ is to obtain a penalty that joins the nice properties of the $L_2$ norm with those of the $L_0$ norm: performing variable selection ($L_0$) while keeping groups of correlated variables in the model ($L_2$). In theory, a strictly convex penalty provides a sufficient condition for such grouping of variables, and the $L_2$ penalty guarantees strict convexity. The grouping effect refers to the simultaneous inclusion (or exclusion) of correlated predictors in the model.
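As an illustration, the sketch below implements the combined penalties in Python (the paper's implementation is in the R package stepPenal; the function names here are our own). We assume the piecewise-linear form $\min(|\beta_j|/\varepsilon, 1)$ for the $L_{0\varepsilon}$ approximation and a squared $L_2$ term, as in ridge regression:

```python
import numpy as np

def l0_eps(beta, eps=0.1):
    # Piecewise-linear approximation to L0: |b|/eps for |b| < eps, 1 otherwise.
    # As eps -> 0 this recovers the exact count of nonzero coefficients.
    return float(np.sum(np.minimum(np.abs(beta) / eps, 1.0)))

def cl_penalty(beta, alpha, eps=0.1):
    # CL = (1 - alpha) * L0_eps + alpha * L1
    return (1 - alpha) * l0_eps(beta, eps) + alpha * float(np.sum(np.abs(beta)))

def cl2_penalty(beta, alpha, eps=0.1):
    # CL2 = (1 - alpha) * L0_eps + alpha * L2 (squared, ridge-style; our assumption)
    return (1 - alpha) * l0_eps(beta, eps) + alpha * float(np.sum(beta ** 2))
```

For example, with $\beta = (0.05, 1.0, 0.0)$ and $\varepsilon = 0.1$, the small coefficient contributes 0.5 to $L_{0\varepsilon}$, the large one contributes 1, and the zero contributes nothing, so the $L_0$ part already discriminates between "small" and "truly in the model" coefficients.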

The stepwise forward procedure
In their paper, Liu and Wu (2007) proposed a global algorithm (mixed integer linear programming) to solve the corresponding difficult non-convex problem. However, its applicability was restricted to moderate data sizes. As mentioned by Frommlet and Nuel (2016), when the number of predictors grows large (d > 20) it is not possible to apply algorithms that guarantee finding the optimal solution (Furnival and Wilson, 1974), and instead heuristic search strategies like stepwise procedures may be considered. By using heuristic techniques, we can approximate the solution of non-smooth, non-convex, NP-hard optimization problems like the one in equation (2.3), for which exact algorithms are not applicable. The optimization of the objective function (2.3) is rather challenging since $CL_\varepsilon^\alpha(\beta)$ and $CL2_\varepsilon^\alpha(\beta)$ are non-convex and non-differentiable at some points of the parameter space. We apply the Broyden (1970), Fletcher (1970), Goldfarb (1970), and Shanno (1970) (BFGS) variable metric (quasi-Newton) method, which has been shown to work well for the optimization of non-smooth and non-convex functions (Lewis and Overton, 2009; Lewis and Overton, 2013; Curtis and Que, 2015). The BFGS method uses an approximation of the Hessian matrix to find stationary points of the function to be minimized. Its ability to capture the curvature information of the considered function renders the method effective.
We propose a stepwise forward variable selection procedure using the previously introduced penalized likelihood criterion for feature selection, which can be used effectively in high-dimensional data. In this stepwise forward selection framework, at each step we optimize the objective function $-\log L + \lambda\alpha L_1$ using the BFGS algorithm and obtain the coefficient estimates $\hat{\beta}$ for the candidate model. The selected model is the one that minimizes the value of the criterion $-\log L(\hat{\beta}) + \lambda[(1-\alpha)L_0(\hat{\beta}) + \alpha L_1(\hat{\beta})]$. (2.4) The suggested algorithm is described as follows:

Figure 2. Two-dimensional contour plot for the regularization terms $CL^{0.3}_{0.1}$, $L_1$, $L_2$, $CL2^{0.5}_{0.1}$ with $\varepsilon = 0.1$ and $d = 2$. The black dashed curve represents the CL penalty ($\alpha = 0.3$, $\varepsilon = 0.1$) and the green curve is the contour of the CL2 penalty ($\alpha = 0.5$, $\varepsilon = 0.1$). The outer contour shows the shape of the ridge penalty, while the blue solid line is the Lasso penalty.
Step 1 • Given a set of d standardized predictors $X_1, X_2, \ldots, X_d$ and a response $y_i \in \{0, 1\}$, $i = 1, \ldots, n$, we consider all possible univariate models ($M_1, M_2, \ldots, M_d$) and keep the one that gives the smallest value of the criterion (2.4), e.g. keep variable $X_2$.

Step 2 • With the model chosen in Step 1 (e.g. $M_2$), and in an additive way, we consider all $d - 1$ models ($M'$) formed by adding the remaining $d - 1$ variables one at a time to the model $M_2$.
Step 3 • Continue adding single variables until the value of the criterion (2.4) in the current step is larger than its value in the previous step.
The advantage of using the criterion (2.4) instead of (2.3) in the optimization is that we no longer need to consider a continuous approximation to the discontinuous $L_0$ function, and we therefore reduce the number of tuning parameters by eliminating the continuity parameter $\varepsilon$. The reason we can do so is that within each step the $L_0$ penalty term remains constant (since the dimension of the model is fixed) and hence plays no role in the determination of the regression coefficients. The $L_0$ penalty term only plays a role in the stopping criterion.
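A minimal Python sketch of the stepwise forward procedure, assuming a logistic likelihood and using SciPy's BFGS optimizer in place of the paper's R implementation (the function names and greedy loop structure are our own; the $L_0$ term is counted as the number of included variables, per the discussion above):

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X, y):
    """Negative Bernoulli log-likelihood for logit(p) = X beta (numerically stable)."""
    eta = X @ beta
    return float(np.sum(np.logaddexp(0.0, eta) - y * eta))

def fit_step(X, y, lam, alpha):
    """Within a step the L0 term is constant, so only -logL + lam*alpha*L1 is optimized."""
    obj = lambda b: neg_loglik(b, X, y) + lam * alpha * np.sum(np.abs(b))
    return minimize(obj, np.zeros(X.shape[1]), method="BFGS").x

def step_forward(X, y, lam, alpha):
    """Greedy forward selection driven by criterion (2.4):
    -logL(beta) + lam * ((1 - alpha) * |model| + alpha * L1(beta))."""
    d = X.shape[1]
    active, best_crit = [], np.inf
    while len(active) < d:
        best = None
        for j in (j for j in range(d) if j not in active):
            cols = active + [j]
            b = fit_step(X[:, cols], y, lam, alpha)
            crit = (neg_loglik(b, X[:, cols], y)
                    + lam * ((1 - alpha) * len(cols) + alpha * np.sum(np.abs(b))))
            if best is None or crit < best[0]:
                best = (crit, j)
        if best[0] >= best_crit:   # stop when (2.4) no longer decreases
            break
        best_crit, j_star = best
        active.append(j_star)
    return active, best_crit
```

BFGS is applied to a non-smooth objective here, as in the paper; in practice it terminates near the kinks of the $L_1$ term without difficulty for these problem sizes.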

Sparse logistic regression with combined penalties
As a particular example, we consider the binary regression model (2.1), where $y \in \{0, 1\}^n$ is a vector of n observed binary outcomes and p is the conditional event probability, given by $p = P(y = 1 \mid X) = \exp(X^T\beta)/(1 + \exp(X^T\beta))$. (2.5) The coefficient estimates are obtained by minimizing (2.3) with the log-likelihood of the logistic regression model.
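The event probability (2.5) can be computed as follows (Python, illustrative; `event_prob` is our own naming, written in the numerically stable logistic form rather than the ratio of exponentials):

```python
import numpy as np

def event_prob(X, beta):
    """p = exp(X beta) / (1 + exp(X beta)), evaluated as 1 / (1 + exp(-X beta))."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

# With a zero linear predictor the event probability is exactly 0.5.
X = np.array([[1.0, 0.0]])
beta = np.array([0.0, 3.0])
event_prob(X, beta)   # array([0.5])
```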

Results
In this section, we examine via simulations the performance of logistic regression when models are selected and estimated with the above-introduced combined penalties (CL, CL2), either by the stepwise forward selection introduced in Section 2 (stepCL and stepCL2) or by global minimization (CL, CL2). In the stepwise model selection scheme, we also examine the performance of the stepwise adaptive $L_1$ model with the $\lambda(1-\alpha)L_0$ selection criterion (stepAdaCL); for that model, the adaptive weights are based on the ridge regression estimator. The estimation is done with the stepwise algorithm described in Section 2.4. Although the proposed method is a stepwise variable selection procedure, we did not consider a comparison with other stepwise methods like stepwise BIC or AIC, as they tend to perform poorly when the dimension is large relative to the sample size and are usually too liberal, that is, they tend to select models with many spurious covariates (Chen and Chen, 2008). As mentioned by Zhang and Shen (2010), these criteria may be inadequate due to their non-adaptivity to the model space and the infeasibility of exhaustive search.
We include the global minimization in spite of the disadvantages mentioned in Section 2 for a comparison. We also consider the results from Lasso (L 1 penalty) and the adaptive Lasso. We compare the different methods in terms of the fraction of correctly selected variables and the prediction classification accuracy. The real data come from a biomarker study of protein measurements with the objective to select biomarkers that potentially discriminate between responders and non-responders.

Simulation study
We simulate data for a varying number of predictors. We consider two settings: a high-dimensional setting where the number of predictors (d) exceeds the number of samples (n), and a setting where the sample size exceeds the dimensionality of the data. We assume multivariate normal predictors $X_1, \ldots, X_d$ with pairwise correlation $\rho$ (compound symmetry), where $\rho$ denotes the correlation between variables $X_m, X_l$ for $m, l \in \{1, \ldots, d\}$, $m \neq l$. The true model used to generate the outcome has k informative covariates, $1 < k < d$. We consider a classification problem with a binary response y and standard normally distributed predictors $X \sim MVN(0, \Sigma)$, where $\Sigma$ is the covariance matrix. We consider the logistic model with logit link function, $\mathrm{logit}(p) = X^T\beta$, as described above, with p the probability of y = 1 given X as defined in (2.5). In other words, each component of the response vector y is viewed as a realization of a Bernoulli random variable with success probability p.
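The data-generating mechanism described above can be sketched as follows (Python, illustrative; the helper `simulate` and its defaults are our own naming, not the paper's code):

```python
import numpy as np

def simulate(n, d, k, rho, beta_val=1.0, seed=0):
    """Draw X ~ MVN(0, Sigma) with compound-symmetry correlation rho, then
    y_i ~ Bernoulli(p_i) with logit(p_i) = x_i^T beta, where the first k
    coefficients equal beta_val (assumed effect size) and the rest are zero."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((d, d), rho)
    np.fill_diagonal(Sigma, 1.0)          # unit variances, pairwise correlation rho
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    beta = np.zeros(d)
    beta[:k] = beta_val                   # k informative covariates
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    y = rng.binomial(1, p)
    return X, y, beta
```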

Tuning of parameters
All analyses were done in R version 3.2.3 (R Core Team, 2015). For the Lasso and adaptive Lasso, the glmnet library was used, and all functions used for the combined penalty approach can be found in the R package "stepPenal", available on CRAN. For the adaptive Lasso, weights were estimated by ridge regression and then used for a weighted $L_1$ penalty in the estimation of $\beta$. The optimal regularization parameters for the methods stepCL, stepAdaCL, CL, CL2, and stepCL2 were tuned by 10-fold cross-validation on the two-dimensional surface $(\alpha, \lambda)$ using a grid of values. The choice of the optimal parameters was done in the following way. For each configuration of $(\alpha, \lambda)$ in the grid, the AUC of the ROC curve was computed on each of the 10 validation sets. The average of the 10 AUCs was recorded together with its standard deviation. Selection of $(\alpha, \lambda)$ was based on the interval $A = [\mathrm{maxAUC} - \mathrm{sdAUC}, \mathrm{maxAUC}]$, where maxAUC is the maximum average AUC and sdAUC is the standard deviation of the AUCs corresponding to the $(\alpha, \lambda)$ with maximum average AUC. The $(\alpha, \lambda)$ that corresponds to the median of the AUCs in the interval A was chosen for the final model fitting. In case more than one configuration yields the median of the AUCs, we select the configuration with the largest $\lambda$ and smallest $\alpha$, to obtain the sparsest model. The use of the interval A acknowledges the sampling variability and the fact that we are aiming for a compromise between good classification performance and model complexity. In other words, a small decrease in the AUC of the ROC curve is acceptable in return for a less complex model. The Lasso and adaptive Lasso were also tuned by 10-fold cross-validation on the one-dimensional space $(\lambda)$, using the default settings of the function cv.glmnet in R with measure type "auc".
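The selection rule based on the interval A can be sketched as follows (Python; `select_config` is a hypothetical helper, we read the interval as closed on both ends, and since the exact median value may not be attained on the grid, the code picks the configuration whose mean AUC is closest to it):

```python
import numpy as np

def select_config(grid, mean_auc, sd_auc):
    """grid: list of (alpha, lam); mean_auc[i], sd_auc[i]: CV summaries per config.
    Pick the configuration whose mean AUC matches the median of the AUCs inside
    A = [maxAUC - sdAUC, maxAUC]; ties broken by largest lam, then smallest alpha."""
    mean_auc = np.asarray(mean_auc, float)
    i_max = int(np.argmax(mean_auc))
    lo = mean_auc[i_max] - sd_auc[i_max]
    in_A = [i for i in range(len(grid)) if lo <= mean_auc[i] <= mean_auc[i_max]]
    target = np.median(mean_auc[in_A])
    dists = [abs(mean_auc[i] - target) for i in in_A]
    best = min(dists)
    ties = [i for i, dist in zip(in_A, dists) if dist == best]
    # sparsest model: largest lambda, then smallest alpha
    return max(ties, key=lambda i: (grid[i][1], -grid[i][0]))
```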

Simulation results
The different classifiers were built with the estimated tuning parameters on the training set. The obtained classifiers were then applied to the testing set for classification and prediction. For the testing set, we simulated data from the same distribution as the training set with n = 1000 samples. We simulated 1000 datasets, on which we applied all methods. For each method, we computed the mean classification performance of the models on the testing sets, measured by the AUC of the ROC curve (test AUC). This is a measure of the discrimination ability of the model, that is, its ability to correctly distinguish the two classes of the response. The complexity of the resulting model was measured by the ratio of correctly selected variables (true covariates with $\beta_j \neq 0$) to the total number of variables selected by the model. We will call this ratio RCV for the rest of the paper. The ratio takes values between zero and one: it is zero when the model selects none of the informative variables, and it becomes one when the selected model includes only informative covariates. The closer the RCV is to one, the sparser the model and the more it selects the true variables. The results in Table 1 summarize the performance of the different methods in terms of model complexity. An ideal model selection method would select only the k true features and set the coefficients of the other d − k variables to zero.
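The RCV ratio is straightforward to compute; a small Python sketch (our own helper, not from the paper, with the empty model mapped to zero by convention):

```python
def rcv(selected, true_support):
    """Ratio of correctly selected variables: |selected ∩ true| / |selected|.
    Returns 0.0 for the empty model."""
    selected, true_support = set(selected), set(true_support)
    if not selected:
        return 0.0
    return len(selected & true_support) / len(selected)

rcv({0, 1, 7}, {0, 1, 2})   # 2 correct out of 3 selected -> 2/3
rcv({0, 1}, {0, 1, 2})      # only informative variables selected -> 1.0
```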
In most of the scenarios, the stepCL and stepCL2 methods yield a higher RCV than the other methods, and on average the stepCL and stepCL2 models are sparser. In scenarios 2 and 3 the adaptive Lasso yields the highest RCV, but its models are not as sparse as those of the stepwise methods. Although the stepwise methods (stepCL, stepCL2, stepAdaCL) include the fewest variables in the model, their discriminative ability in terms of AUC, as shown in Table 2, is comparable with that of the other methods, which tend to select larger models with more variables.

Table 1. Simulation results for all four scenarios based on 1000 replications. The table summarizes the complexity of the models in terms of correct non-zero estimates. The first column shows the median value (1st and 3rd quartile) of the correct variables included in the model. The second column is the median (1st and 3rd quartile) of the total variables each model selects. The third column is the average RCV (standard deviation), the ratio of the correct variables over the total variables selected. The stepCL, stepAdaCL and stepCL2 are the stepwise methods and CL, CL2 are the global optimization methods.

The stepCL2 method also has remarkable performance both in terms of sparsity and predictive discrimination. Considering the trade-off between model complexity and performance, the proposed stepwise combined penalty approach achieves a good balance: it includes fewer variables while maintaining a high predictive accuracy comparable with state-of-the-art methods. We should mention that in scenarios 2, 3, and 4 none of the methods selects all of the k informative variables; however, for the stepwise methods the AUC of the ROC curve on the testing set is greater than 90%, indicating good discrimination accuracy while including the fewest variables in the model.
In all scenarios, we found that the stepCL2 method has comparable performance to stepCL and is superior to adaptive Lasso and Lasso.
In Table 2, we present results regarding the predictive classification accuracy of the methods by the AUC of the ROC curves. We also report the Brier score (Brier, 1950) as a measure of the accuracy of predictions, defined as $BS = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{p}_i)^2$, the mean squared distance between the patient's observed status $y_i$ and the predicted probability $\hat{p}_i$. The Brier score takes values in the interval [0, 1] and, generally, the lower the Brier score, the better the classification rule. If the predicted probability were 0.5 for each individual, the resulting Brier score of 0.25 would indicate that the classification rule is random.
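A minimal Python implementation of the Brier score as defined above, including the constant-0.5 reference case:

```python
import numpy as np

def brier_score(y, p_hat):
    """Mean squared distance between observed status y_i and predicted probability p_hat_i."""
    y, p_hat = np.asarray(y, float), np.asarray(p_hat, float)
    return float(np.mean((y - p_hat) ** 2))

brier_score([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])  # constant 0.5 -> 0.25 (random rule)
brier_score([1, 0], [1.0, 0.0])                  # perfect predictions -> 0.0
```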
Empirical results from our simulations show that even in settings where n > d, the stepwise method gives the sparsest solutions while maintaining classification performance measures as good as those of the Lasso and adaptive Lasso. The CL2 method tends to select large models, due to the $L_2$ norm, which shrinks coefficients toward zero without performing variable selection. Thus, when $\alpha$ is close to 1, the model behaves similarly to ridge regression and the resulting model is complex in terms of the number of predictors. On the other hand, when $\alpha$ is closer to 0, the CL2 and stepCL2 methods borrow more of the characteristics of the $L_0$ norm and result in sparser models.
In our simulations, we also considered the case where there is no correlation among predictors (results not shown, as we do not consider it a realistic scenario). We repeated scenario 2 with the only alteration of setting $\rho = 0$. Results pointed in the same direction as those for scenario 2 in Table 1 and Table 2: the stepwise methods perform better than all other methods in terms of model complexity, yielding the sparsest models with a high classification performance.
Furthermore, we examined the situation where no predictors in the data are associated with the outcome. In this case, where the true model is the null model, none of the methods identified the true model. Running the second scenario with d = 150 predictors, none of them informative for the outcome, the stepCL method selected a median of five variables, whereas the other methods selected between 14 (AdaLasso, CL) and 19 (CL2). We observed the same pattern when repeating the first scenario with d = 50 uninformative predictors, where none of the methods selected the true model but the stepwise methods produced the sparsest solutions.

Application-real data analysis
To illustrate the applicability of the proposed method, we applied the stepwise method to a real example involving protein measurements. The dataset contained n = 53 patients with baseline measurements of d = 187 proteins. To maintain confidentiality and keep the study anonymized, the names of the proteins are not revealed; for the presentation of the results we renamed them $X_1, X_2, \ldots, X_{187}$. The objective is to extract potential candidate markers discriminating responders from non-responders based on patients' protein levels. We apply our proposed stepwise combined penalty approach with the aim of selecting a small set of proteins that can sufficiently predict response to the treatment. We compare our approach with the commonly used Lasso and adaptive Lasso, and also with the global optimization penalized methods CL and CL2.
The regularization parameters were not tuned by cross-validation, due to the relatively small sample size (n = 53). Instead, the tuning was done using the bootstrap in the following way: for a grid of values of $(\alpha, \lambda)$, we trained the models on B = 100 bootstrapped datasets (drawing samples with replacement from the original data) and evaluated their classification performance (in terms of AUC) on the original data. For each combination of the tuning parameters $(\alpha, \lambda)$, the models were trained on the B bootstrapped sets and validated on the original data, and the average AUC (over the B bootstrap samples) was recorded together with its standard deviation. The configuration of $(\alpha, \lambda)$ corresponding to the median of the AUCs in the interval A, as described in Section 3.2 'Tuning of parameters', was chosen. The results show that the stepwise methods yield the sparsest models, selecting eight variables (stepCL) and nine (stepCL2) respectively, whereas the other methods select between 22 (CL2) and 26 (Lasso) variables. Notably, the classification performance of the stepwise methods is as good as that of the other variable selection methods, even though they include the fewest predictors. To evaluate the performance of the models in the absence of an external validation dataset, we used bootstrapping: we applied all methods to another B = 1000 bootstrapped datasets of the protein data, sampled with replacement, and the selection frequencies of the top 10 variables selected by each method are reported in Figure 3. This resulted in 16 unique proteins. The first column (stepCL) is ordered by decreasing frequency of the top 10 variables selected by all methods. Dark colours indicate higher frequencies (>30%) and lighter colours depict lower frequencies.
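The bootstrap tuning loop can be sketched as follows (Python, illustrative; `fit` stands in for any of the penalized fitting routines, and the rank-based `auc` helper is our own):

```python
import numpy as np

def auc(y, scores):
    """Rank-based AUC: probability that a random positive is scored above a
    random negative, with ties counted as 1/2."""
    y, scores = np.asarray(y), np.asarray(scores)
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def bootstrap_tune(X, y, grid, fit, B=100, seed=0):
    """For each (alpha, lam) in grid: train on B bootstrap resamples of (X, y),
    score each fitted model by AUC on the ORIGINAL data, and summarize."""
    rng = np.random.default_rng(seed)
    n = len(y)
    mean_auc, sd_auc = [], []
    for alpha, lam in grid:
        aucs = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)        # sample with replacement
            beta = fit(X[idx], y[idx], alpha, lam)  # user-supplied fitting routine
            aucs.append(auc(y, X @ beta))           # validate on the original data
        mean_auc.append(float(np.mean(aucs)))
        sd_auc.append(float(np.std(aucs)))
    return mean_auc, sd_auc
```

Evaluating each bootstrap fit on the original data, as described above, keeps a single fixed validation set while the training data varies, which is what allows the standard deviation over the B fits to feed into the interval A rule.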
This figure shows that the proteins frequently selected by the stepCL and stepCL2 methods are also frequently selected by the Lasso and adaptive Lasso. Note that the stepwise methods show lower selection frequencies, because methods that select larger models automatically increase the selection counts of individual variables. Figure 4 shows boxplots of the total number of variables included in the model over the bootstrap evaluations. The stepwise methods yield consistent model selection, selecting a median of eight variables for both stepCL and stepCL2, whereas the Lasso and adaptive Lasso show large variability in the complexity of the selected model. The distribution of the AUC of the ROC curves, used as a measure of classification performance on the bootstrapped datasets, is shown in Figure 5. The stepwise methods systematically select the sparsest models while maintaining a very good classification performance.

Conclusion
In this paper, we have proposed a stepwise forward approach for model selection in the framework of penalized regression, using a penalty that combines the $L_0$ norm, which is based on the number of coefficients, with the $L_1$ norm, which is based on the size of the coefficients, or the $L_2$ norm, which takes into account the grouping effect. The aim of the proposed method is to find a model that includes as few, and only relevant, variables as possible on the one hand, while maintaining good predictive performance on the other. The combined penalization term $CL_\varepsilon^\alpha(\beta)$ introduced by Liu and Wu (2007) was limited to moderate datasets due to limitations of the optimization algorithm. Using the heuristic stepwise forward approach, we can apply the penalizations $CL^\alpha(\beta)$ and $CL2^\alpha(\beta)$ to high-dimensional data with the BFGS algorithm, which is found to work well in practice for non-convex and non-smooth functions (Lewis and Overton, 2009; Lewis and Overton, 2013; Curtis and Que, 2015). As a result, the practical implementation of the stepwise penalization method is simpler and more efficient.
We find that the combined penalty does achieve the goal of sparser selection compared to the Lasso and adaptive Lasso, while retaining the same or very similar predictive performance. Although the stepwise method is only an approximation to the true optimal solution, it appears to approximate it as well as, or occasionally even better than, the global optimization routine, and it reduces the computational time considerably. Tuning the regularization parameters $(\alpha, \lambda)$ can, however, take a few minutes. Tuning is an important aspect of penalization methods and will be further explored and improved in future work.
Simulation results and a real data application show that the proposed method yields sparser models while maintaining a good classification performance. This is an essential consideration for classification and screening applications, where the goal is to develop a test using as few features as possible in order to enhance interpretability and, potentially, the reproducibility of the results, as well as to control the cost of implementing the test. Overall, we found that our method provides a sparser model while maintaining prediction properties similar to those of other methods. We hope that this paper can serve as a first step toward learning more about the theoretical properties of this method, which seems worth further investigation.
Furthermore, it would be of great interest to extend the forward stepwise method to a bidirectional approach, considering at each step of the algorithm which variables can be included in or excluded from the model (forward and backward variable selection). As future work, we also plan to apply our method to variable selection in regression problems with a continuous response as well as with time-to-event data.

Declaration of interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.