prep_data
prep_data prepares your data for machine learning.
Some steps enhance predictive power, some make sure that the data format is
compatible with a wide array of machine learning algorithms, and others
provide protection against common problems in model deployment. The
following steps are available; those followed by * are applied by default.
Many have customization options.
Convert columns with only 0/1 to factor*
Remove columns with near-zero variance*
Convert date columns to useful features*
Fill in missing values via imputation*
Collapse rare categories into "other"*
Center numeric columns
Standardize numeric columns
Create dummy variables from categorical variables*
Add protective levels to factors for rare and missing data*
While preparing your data, a recipe will be generated for identical transformation of future data and stored in the `recipe` attribute of the output data frame. If a recipe object is passed to `prep_data` via the `recipe` argument, that recipe will be applied to the data. This allows you to transform data in model training and apply exactly the same transformations in model testing and deployment. The new data must be identical in structure to the data that the recipe was prepared with.
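A minimal sketch of that workflow (the data frames `d_train` and `d_new` and the columns `patient_id` and `diabetes` are illustrative names; `d_new` must have the same structure as `d_train`):

```r
library(healthcareai)

# Prepare training data; the recipe is stored on the returned data frame
d_train_prepped <- prep_data(d_train, patient_id, outcome = diabetes)

# In deployment, reapply the identical transformations to new data
d_new_prepped <- prep_data(d_new, recipe = d_train_prepped)
```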
prep_data(d, ..., outcome, recipe = NULL, remove_near_zero_variance = TRUE,
  convert_dates = TRUE, impute = TRUE, collapse_rare_factors = TRUE,
  center = FALSE, scale = FALSE, make_dummies = TRUE, add_levels = TRUE,
  factor_outcome = TRUE)
Argument | Description
---|---
d | A data frame or tibble containing data to prepare.
... | Optional. Unquoted variable names not to be prepped. These will be returned unaltered. Typically ID and outcome columns would go here.
outcome | Optional. Unquoted column name that indicates the target variable. If provided, this argument must be named. If the target is 0/1, it will be coerced to Y/N if factor_outcome is TRUE; other preparation steps will not be applied to the outcome.
recipe | Optional. Recipe for how to prep d. In model deployment, pass the output of this function in training to this argument in deployment to prepare the deployment data identically to how the training data was prepared. If the training data is big, pull the recipe from the "recipe" attribute of the prepped training data frame and pass that to this argument. If present, all following arguments are ignored.
remove_near_zero_variance | Logical. If TRUE (default), columns with near-zero variance will be removed. These columns either contain a single value or meet both of the following criteria: 1. they have very few unique values relative to the number of samples, and 2. the ratio of the frequency of the most common value to the frequency of the second most common value is large.
convert_dates | Logical or character. If TRUE (default), date columns are identified and used to generate day-of-week, month, and year columns, and the original date columns are removed. If FALSE, date columns are removed. If a character vector, it is passed to the `features` argument of `recipes::step_date`. E.g. if you want only quarter and year back: `convert_dates = c("quarter", "year")`.
impute | Logical or list. If TRUE (default), columns will be imputed using the mean (numeric) or a new category (nominal). If FALSE, data will not be imputed. If this is a list, it must be named, with possible entries for `numeric_method`, `nominal_method`, `numeric_params`, and `nominal_params`, which are passed on to the imputation functions.
collapse_rare_factors | Logical or numeric. If TRUE (default), factor levels representing less than 3 percent of observations will be collapsed into a new category, `other`. If numeric, must be in (0, 1) and is the proportion of observations below which levels will be grouped into other. See `recipes::step_other`.
center | Logical. If TRUE, numeric columns will be centered to have a mean of 0. Default is FALSE.
scale | Logical. If TRUE, numeric columns will be scaled to have a standard deviation of 1. Default is FALSE.
make_dummies | Logical. If TRUE (default), dummy columns will be created for categorical variables.
add_levels | Logical. If TRUE (default), "other" and "missing" will be added to all nominal columns. This is protective in deployment: new levels found in deployment will become "other", and missingness in deployment can become "missing" if the nominal imputation method is "new_category". If FALSE, these levels may still be added to some columns depending on the details of imputation and collapse_rare_factors.
factor_outcome | Logical. If TRUE (default) and all entries in outcome are 0 or 1, they will be converted to a factor with levels N and Y for classification. Note that which level is the positive class is set in training functions rather than here.
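To illustrate the date handling, here is a sketch with a toy data frame; the column names `id`, `admitted`, and `los` are invented for this example. Per the `convert_dates` description, the date column is replaced by quarter and year features:

```r
library(healthcareai)

# Toy data with an ID column, a weekly date column, and a numeric column
d <- data.frame(
  id = 1:100,
  admitted = seq(as.Date("2018-01-01"), by = "week", length.out = 100),
  los = rnorm(100)
)

# Keep only quarter and year features from the date column; leave id unprepped
prepped <- prep_data(d, id, convert_dates = c("quarter", "year"))
```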
Prepared data frame, with a reusable recipe object for future data preparation stored in the "recipe" attribute. The recipe in turn stores the names of ignored columns (those passed to ...) in its "ignored_columns" attribute.
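Because the recipe rides along as an attribute, a large training frame need not be kept around for deployment; the recipe alone can be extracted and passed to `prep_data` (the data frame names here are illustrative):

```r
# Pull just the recipe off the prepped training data
rec <- attr(d_train_prepped, "recipe")

# Later, in deployment, prep new data with the stored recipe
d_deployment_prepped <- prep_data(d_deployment, recipe = rec)
```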
To let data preparation happen automatically under the hood, see machine_learn. To take finer control of imputation, see impute, and for finer control of data prep in general, check out the recipes package: https://topepo.github.io/recipes/. To train models on prepared data, see tune_models and flash_models.
d_train <- pima_diabetes[1:700, ]
d_test <- pima_diabetes[701:768, ]

# Prep data. Ignore patient_id (identifier) and treat diabetes as outcome
d_train_prepped <- prep_data(d = d_train, patient_id, outcome = diabetes)

# Prep test data by reapplying the same transformations as to training data
d_test_prepped <- prep_data(d_test, recipe = d_train_prepped)

# View the transformations applied and the prepped data
d_test_prepped
#> Data Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          8
#>
#> Training data contained 700 data points and 340 incomplete rows.
#>
#> Operations:
#>
#> Sparse, unbalanced variable filter removed no terms [trained]
#> Mean Imputation for pregnancies, plasma_glucose, ... [trained]
#> Filling NA with missing for weight_class [trained]
#> Adding levels to: other, missing [trained]
#> Collapsing factor levels for weight_class [trained]
#> Adding levels to: other, missing [trained]
#> Dummy variables from weight_class [trained]
#>
#> # A tibble: 68 x 13
#>    pregnancies plasma_glucose diastolic_bp skinfold insulin pedigree   age
#>          <int>          <int>        <dbl>    <dbl>   <dbl>    <dbl> <int>
#>  1           2            122         76.0     27.0    200.    0.483    26
#>  2           6            125         78.0     31.0    154.    0.565    49
#>  3           1            168         88.0     29.0    154.    0.905    52
#>  4           2            129         72.3     29.1    154.    0.304    41
#>  5           4            110         76.0     20.0    100.    0.118    27
#>  6           6             80         80.0     36.0    154.    0.177    28
#>  7          10            115         72.3     29.1    154.    0.261    30
#>  8           2            127         46.0     21.0    335.    0.176    22
#>  9           9            164         78.0     29.1    154.    0.148    45
#> 10           2             93         64.0     32.0    160.    0.674    23
#> # ... with 58 more rows, and 6 more variables: diabetes <fct>,
#> #   weight_class_normal <dbl>, weight_class_obese <dbl>,
#> #   weight_class_overweight <dbl>, weight_class_other <dbl>,
#> #   weight_class_missing <dbl>

# Customize preparations:
prep_data(d = d_train, patient_id, outcome = diabetes,
          impute = list(numeric_method = "bagimpute",
                        nominal_method = "bagimpute"),
          collapse_rare_factors = FALSE, convert_dates = "year",
          center = TRUE, scale = TRUE, make_dummies = FALSE)
#> Data Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          8
#>
#> Training data contained 700 data points and 340 incomplete rows.
#>
#> Operations:
#>
#> Sparse, unbalanced variable filter removed no terms [trained]
#> Bagged tree imputation for pregnancies, plasma_glucose, ... [trained]
#> Bagged tree imputation for weight_class [trained]
#> Centering for pregnancies, plasma_glucose, ... [trained]
#> Scaling for pregnancies, plasma_glucose, ... [trained]
#> Adding levels to: other, missing [trained]
#> Adding levels to: other, missing [trained]
#>
#> # A tibble: 700 x 10
#>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin
#>         <int>       <dbl>          <dbl>        <dbl>    <dbl>   <dbl>
#>  1          1       0.646          0.871      -0.0214   0.636    0.292
#>  2          2      -0.840         -1.19       -0.515    0.0138  -0.844
#>  3          3       1.24           2.02       -0.679   -0.921    0.485
#>  4          4      -0.840         -1.06       -0.515   -0.609   -0.602
#>  5          5      -1.14           0.511      -2.65     0.636    0.150
#>  6          6       0.349         -0.176       0.143   -0.366   -0.274
#>  7          7      -0.246         -1.42       -1.83     0.325   -0.663
#>  8          8       1.83          -0.208      -0.0591   0.335   -0.153
#>  9          9      -0.543          2.47       -0.186    1.67     3.96
#> 10         10       1.24           0.119       1.95     0.310    0.659
#> # ... with 690 more rows, and 4 more variables: weight_class <fct>,
#> #   pedigree <dbl>, age <dbl>, diabetes <fct>