impute
Imputes missing values in both nominal and numeric data. Supported methods
are mean (numeric only), new_category (nominal only), bagged trees, and
k-nearest neighbors (knn).
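To make the two default methods concrete, here is a minimal base R sketch of what mean imputation and new-category imputation do conceptually. It is not the package's implementation; the toy data frame and the level name "missing" are assumptions chosen for illustration.

```r
# Toy data with one gap in a numeric column and one in a nominal column.
d <- data.frame(
  age = c(30, NA, 50),
  group = factor(c("a", NA, "b"))
)

# Mean imputation (numeric only): fill NA with the mean of the observed values.
age_mean <- mean(d$age, na.rm = TRUE)
d$age[is.na(d$age)] <- age_mean

# New-category imputation (nominal only): treat missingness as its own level.
# The level name "missing" is an assumption made for this sketch.
grp <- as.character(d$group)
grp[is.na(grp)] <- "missing"
d$group <- factor(grp)
```

Mean imputation keeps the column numeric, while the new-category approach avoids guessing a level and instead lets downstream models see that the value was absent.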
impute(d = NULL, ..., recipe = NULL, numeric_method = "mean", nominal_method = "new_category", numeric_params = NULL, nominal_params = NULL, verbose = FALSE)
Argument | Description |
---|---|
d | A data frame or tibble containing data to impute. |
... | Optional. Unquoted names of variables that should not be imputed. These will be returned unaltered. |
recipe | Optional, a recipe object or an imputed data frame (containing a recipe object as an attribute). If provided, this recipe will be applied to impute new data contained in d with values saved in the recipe. Use this param if you'd like to apply the same values used for imputation on a training dataset in production. |
numeric_method | Defaults to "mean". |
nominal_method | Defaults to "new_category". |
numeric_params | A named list with parameters to use with the chosen imputation method on numeric data. Options are |
nominal_params | A named list with parameters to use with the chosen imputation method on nominal data. Options are |
verbose | Prints a summary of what will be imputed and which method will be used. |
An imputed data frame with a reusable recipe object, stored in the attribute "recipe", for future imputation.
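The idea of reusing training-time imputation values on new data can be sketched in base R as follows. This is only a toy analogue of the recipe mechanism, not how the package stores or applies its recipe; the helpers `fit_mean_imputer` and `apply_mean_imputer`, the attribute name "toy_recipe", and the data are all hypothetical.

```r
# Hypothetical helper: learn column means from the training data.
fit_mean_imputer <- function(train, cols) {
  vapply(train[cols], mean, numeric(1), na.rm = TRUE)
}

# Hypothetical helper: fill NAs in new data with the stored training means.
apply_mean_imputer <- function(new_data, means) {
  for (col in names(means)) {
    missing_rows <- is.na(new_data[[col]])
    new_data[[col]][missing_rows] <- means[[col]]
  }
  new_data
}

train <- data.frame(insulin = c(100, NA, 200), skinfold = c(20, 30, NA))
test <- data.frame(insulin = c(NA, 500), skinfold = c(NA, 10))

means <- fit_mean_imputer(train, c("insulin", "skinfold"))
train_imputed <- apply_mean_imputer(train, means)

# Keep the learned values with the imputed data, loosely mirroring the
# "recipe" attribute. The attribute name "toy_recipe" is made up.
attr(train_imputed, "toy_recipe") <- means

# New data is filled with the training means, not its own means.
test_imputed <- apply_mean_imputer(test, attr(train_imputed, "toy_recipe"))
```

Filling new data with values learned on the training set keeps production inputs consistent with what a model saw during training, which is the purpose of passing `recipe` to `impute`.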
d <- pima_diabetes
d_train <- d[1:700, ]
d_test <- d[701:768, ]

# Train imputer
train_imputed <- impute(d = d_train, patient_id, diabetes)

# Apply to new data
impute(d = d_test, patient_id, diabetes, recipe = train_imputed)
#> Original missingness and methods used in imputation:
#>
#>       variable percent_missing imputation_method_used
#> 1 weight_class             1.5           new_category
#> 2 diastolic_bp             2.9                   mean
#> 3     skinfold            26.5                   mean
#> 4      insulin            52.9                   mean
#>
#> Current data:
#>
#> # A tibble: 68 x 10
#>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin
#>         <int>       <int>          <int>        <dbl>    <dbl>   <dbl>
#>  1        701           2            122         76.0     27.0    200.
#>  2        702           6            125         78.0     31.0    154.
#>  3        703           1            168         88.0     29.0    154.
#>  4        704           2            129         72.3     29.1    154.
#>  5        705           4            110         76.0     20.0    100.
#>  6        706           6             80         80.0     36.0    154.
#>  7        707          10            115         72.3     29.1    154.
#>  8        708           2            127         46.0     21.0    335.
#>  9        709           9            164         78.0     29.1    154.
#> 10        710           2             93         64.0     32.0    160.
#> # ... with 58 more rows, and 4 more variables: weight_class <fct>,
#> #   pedigree <dbl>, age <int>, diabetes <chr>

# Specify methods:
impute(d = d_train, patient_id, diabetes, numeric_method = "bagimpute",
       nominal_method = "new_category")
#> Original missingness and methods used in imputation:
#>
#>         variable percent_missing imputation_method_used
#> 1 plasma_glucose             0.7              bagimpute
#> 2   weight_class             1.4           new_category
#> 3   diastolic_bp             4.7              bagimpute
#> 4       skinfold            29.9              bagimpute
#> 5        insulin            48.3              bagimpute
#>
#> Current data:
#>
#> # A tibble: 700 x 10
#>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin
#>         <int>       <int>          <dbl>        <dbl>    <dbl>   <dbl>
#>  1          1           6           148.         72.0     35.0    194.
#>  2          2           1            85.         66.0     29.0    69.8
#>  3          3           8           183.         64.0     21.8    224.
#>  4          4           1            89.         66.0     23.0    94.0
#>  5          5           0           137.         40.0     35.0    168.
#>  6          6           5           116.         74.0     23.6    128.
#>  7          7           3            78.         50.0     32.0    88.0
#>  8          8          10           115.         72.2     33.0    128.
#>  9          9           2           197.         70.0     45.0    543.
#> 10         10           8           125.         96.0     33.0    211.
#> # ... with 690 more rows, and 4 more variables: weight_class <fct>,
#> #   pedigree <dbl>, age <int>, diabetes <chr>

# Specify method and param:
impute(d = d_train, patient_id, diabetes, nominal_method = "knnimpute",
       nominal_params = list(knn_K = 4))
#> Original missingness and methods used in imputation:
#>
#>         variable percent_missing imputation_method_used
#> 1 plasma_glucose             0.7                   mean
#> 2   weight_class             1.4              knnimpute
#> 3   diastolic_bp             4.7                   mean
#> 4       skinfold            29.9                   mean
#> 5        insulin            48.3                   mean
#>
#> Current data:
#>
#> # A tibble: 700 x 10
#>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin
#>         <int>       <int>          <dbl>        <dbl>    <dbl>   <dbl>
#>  1          1           6           148.         72.0     35.0    154.
#>  2          2           1            85.         66.0     29.0    154.
#>  3          3           8           183.         64.0     29.1    154.
#>  4          4           1            89.         66.0     23.0    94.0
#>  5          5           0           137.         40.0     35.0    168.
#>  6          6           5           116.         74.0     29.1    154.
#>  7          7           3            78.         50.0     32.0    88.0
#>  8          8          10           115.         72.3     29.1    154.
#>  9          9           2           197.         70.0     45.0    543.
#> 10         10           8           125.         96.0     29.1    154.
#> # ... with 690 more rows, and 4 more variables: weight_class <fct>,
#> #   pedigree <dbl>, age <int>, diabetes <chr>