Estimation procedure for HAL, the Highly Adaptive Lasso
fit_hal(
  X,
  Y,
  X_unpenalized = NULL,
  max_degree = 3,
  fit_type = c("glmnet", "lassi"),
  n_folds = 10,
  foldid = NULL,
  use_min = TRUE,
  reduce_basis = NULL,
  family = c("gaussian", "binomial", "cox"),
  return_lasso = TRUE,
  return_x_basis = FALSE,
  basis_list = NULL,
  lambda = NULL,
  id = NULL,
  offset = NULL,
  cv_select = TRUE,
  ...,
  yolo = TRUE
)
Arguments
| X |
An input matrix containing observations and covariates. |
| Y |
A numeric vector of observations of the outcome variable. |
| X_unpenalized |
An input matrix with the same format as X that is
appended directly to the design matrix (no basis expansion). No L1
penalization is performed on these covariates. |
| max_degree |
The highest order of interaction terms for which the basis
functions ought to be generated. The default in the usage above is 3; setting
this to NULL corresponds to generating basis functions for the full
dimensionality of the input matrix. |
| fit_type |
The specific routine to be called when fitting the Lasso
regression in a cross-validated manner. Choosing the glmnet option
will result in a call to cv.glmnet while lassi
will produce a (faster) call to a custom Lasso routine. |
| n_folds |
Integer for the number of folds to be used when splitting the
data for V-fold cross-validation. This defaults to 10. |
| foldid |
An optional vector of values between 1 and n_folds
identifying what fold each observation is in. If supplied, n_folds
can be missing. When supplied, this is passed to
cv.glmnet. |
| use_min |
A logical that determines which lambda is selected from
cv.glmnet. TRUE corresponds to
"lambda.min" and FALSE corresponds to "lambda.1se". |
| reduce_basis |
A numeric value bounded in the open interval
(0,1) indicating the minimum proportion of 1's in a basis function column
needed for the basis function to be included in the procedure to fit the
Lasso. Any basis functions with a lower proportion of 1's than the cutoff
will be removed. This argument defaults to NULL, in which case all
basis functions are used in the lasso-fitting stage of the HAL algorithm. |
| family |
A character string naming the error family for a
generalized linear model. Options are limited to "gaussian" for fitting a
standard linear model, "binomial" for penalized logistic regression, and
"cox" for a penalized proportional hazards model. Note that for
"binomial" and "cox" the argument fit_type is limited to "glmnet"; thus,
documentation of the glmnet package should be consulted for any errors
resulting from the Lasso fitting step in these cases. |
| return_lasso |
A logical indicating whether or not to return
the glmnet fit of the lasso model. |
| return_x_basis |
A logical indicating whether or not to return
the matrix of (possibly reduced) basis functions used in the HAL lasso fit. |
| basis_list |
The full set of basis functions generated from the input
data X (via a call to enumerate_basis). The dimensionality of this
structure is dim = (n * 2^(d - 1)), where n is the number of observations
and d is the number of columns in X. |
| lambda |
User-specified array of values of the lambda tuning parameter
of the Lasso L1 regression. If NULL, cv.glmnet
will be used to automatically select a CV-optimal value of this
regularization parameter. If specified, the Lasso L1 regression model will
be fit via glmnet, returning regularized coefficient values for each
value in the input array. |
| id |
A vector of ID values, used to generate cross-validation folds for
cross-validated selection of the regularization parameter lambda. |
| offset |
A vector of offset values used in fitting. |
| cv_select |
A logical specifying whether the array of values
supplied in lambda should be passed to cv.glmnet in order to
pick the optimal value based on cross-validation (when set to
TRUE), or whether to simply fit along the sequence of values (or single
value) using glmnet (when set to FALSE). |
| ... |
Other arguments passed to cv.glmnet. Please
consult its documentation for a full list of options. |
| yolo |
A logical indicating whether to print one of a curated
selection of quotes from the HAL9000 computer, from the critically
acclaimed epic science-fiction film "2001: A Space Odyssey" (1968). |
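The interplay of lambda, cv_select, and use_min described above can be illustrated with direct calls to glmnet. This is only a sketch under assumptions: the simulated data and the lambda_grid values are made up for illustration, and the code mirrors the documented behavior rather than reproducing what fit_hal does internally.

```r
# Sketch only: calls glmnet directly to mirror the lambda, cv_select, and
# use_min arguments; it does not reproduce fit_hal's internals.
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  n <- 200
  x <- matrix(rnorm(n * 5), n, 5)
  y <- x[, 1] - 2 * x[, 2] + rnorm(n)

  # Hypothetical user-supplied grid of lambda values (decreasing)
  lambda_grid <- exp(seq(log(1), log(0.001), length.out = 20))

  # cv_select = TRUE analogue: cross-validate over the supplied grid
  cv_fit <- glmnet::cv.glmnet(x, y, lambda = lambda_grid, nfolds = 10)
  lambda_min <- cv_fit$lambda.min # what use_min = TRUE would select
  lambda_1se <- cv_fit$lambda.1se # what use_min = FALSE would select

  # cv_select = FALSE analogue: fit along the grid without cross-validation
  path_fit <- glmnet::glmnet(x, y, lambda = lambda_grid)
  n_fitted <- ncol(coef(path_fit)) # one coefficient column per fitted lambda
}
```

By construction "lambda.1se" is never smaller than "lambda.min", so use_min = FALSE yields a fit that is at least as heavily penalized.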
Value
Object of class hal9001, containing a list of basis
functions, a copy map, coefficients estimated for basis functions, and
timing results (for assessing computational efficiency).
Details
The procedure uses a custom C++ implementation to generate a design
matrix consisting of basis functions corresponding to covariates and
interactions of covariates and to remove duplicate columns of indicators.
The Lasso regression is fit to this (usually) very wide matrix using either
a custom implementation (based on origami) or by a call to
cv.glmnet.
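The basis construction, duplicate removal, and the reduce_basis cutoff can be sketched in base R. This is not the package's C++ code: the helper make_indicator_basis is a hypothetical name, and the sketch assumes each basis function is an indicator that every covariate in a subset is at or above the value observed at a knot point.

```r
# Illustrative helper (hypothetical, not part of hal9001): one indicator
# column per (covariate subset, observation) pair, up to max_degree.
make_indicator_basis <- function(X, max_degree = 2) {
  n <- nrow(X)
  d <- ncol(X)
  cols <- list()
  for (degree in seq_len(min(max_degree, d))) {
    for (s in combn(d, degree, simplify = FALSE)) {
      for (i in seq_len(n)) {
        knot <- X[i, s]
        # 1 when every covariate in s is at or above its knot value
        ind <- rep(TRUE, n)
        for (j in seq_along(s)) ind <- ind & (X[, s[j]] >= knot[j])
        cols[[length(cols) + 1]] <- as.integer(ind)
      }
    }
  }
  do.call(cbind, cols)
}

X <- cbind(c(1, 2, 2, 3), c(0, 1, 1, 0))
basis <- make_indicator_basis(X, max_degree = 2)
ncol(basis) # 12: three covariate subsets times four knots

# Remove duplicate columns of indicators
basis_unique <- basis[, !duplicated(basis, MARGIN = 2), drop = FALSE]
ncol(basis_unique) # 4

# reduce_basis analogue: drop columns whose proportion of 1's is below cutoff
cutoff <- 0.3
basis_reduced <- basis_unique[, colMeans(basis_unique) >= cutoff, drop = FALSE]
ncol(basis_reduced) # 3
```

Even in this four-row example two thirds of the columns are duplicates, which suggests why removing them before the Lasso fit matters for such a wide matrix.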
Examples
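A minimal usage sketch follows. The simulated data are made up for illustration, the code assumes the hal9001 package is installed, and the new_data argument of the predict method is assumed from the package's interface rather than documented on this page.

```r
# Assumes the hal9001 package is installed; the block is skipped otherwise.
if (requireNamespace("hal9001", quietly = TRUE)) {
  set.seed(385971)
  n <- 100
  p <- 3
  x <- matrix(rnorm(n * p), n, p)
  y <- sin(x[, 1]) + 0.5 * x[, 2] + rnorm(n, sd = 0.2)

  # yolo = FALSE suppresses the HAL9000 quote
  hal_fit <- hal9001::fit_hal(X = x, Y = y, family = "gaussian", yolo = FALSE)

  # Predictions on the training data (new_data is an assumed argument name)
  preds <- predict(hal_fit, new_data = x)

  # In-sample mean squared error
  mse <- mean((y - preds)^2)
}
```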