outliers()
detects outliers in (generalized) linear models.
heteroskedastic()
checks a linear model for (non-)constant error variance.
autocorrelation()
checks for independence of errors.
normality()
checks linear models for (non-)normality of residuals.
multicollin()
checks predictors of linear models for multicollinearity.
check_assumptions()
checks all of the above assumptions.
check_assumptions(x, model.column = NULL, as.logical = FALSE, ...)

outliers(x, iterations = 5)

heteroskedastic(x, model.column = NULL)

autocorrelation(x, model.column = NULL, ...)

normality(x, model.column = NULL)

multicollin(x, model.column = NULL)
x | Fitted lm object (for outliers(), also glm), or a data frame with a list-variable containing fitted models. |
model.column | Name or index of the list-variable that contains the fitted model objects. Only applies if x is a data frame with a list-variable of fitted models. |
as.logical | Logical, if TRUE, the p-values returned by check_assumptions() are replaced by TRUE (assumption violated) or FALSE (assumption met). |
... | Other arguments, passed down to the wrapped test functions. |
iterations | Numeric, indicates the number of iterations to remove outliers. |
A data frame with the respective statistics.
These functions are wrappers that compute various test statistics; however,
each of them returns a tibble instead of a list of values. Furthermore,
all functions can also be applied to multiple models stored in
list-variables (see 'Examples').
outliers() wraps outlierTest and iteratively removes outliers up to
iterations times, stopping early if the r-squared value (for glm: the AIC)
does not improve after removing outliers. The function returns a tibble
with r-squared and AIC statistics for the original and updated model,
as well as the updated model itself ($updated.model), the number
($removed.count) and indices of the removed observations ($removed.obs).
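One removal step of this iterative scheme can be sketched in base R; this is an illustration, not the package's implementation, and the mtcars model is purely hypothetical:

```r
# Flag observations whose Bonferroni-adjusted studentized residual is
# significant (the criterion outlierTest uses) and refit without them.
remove_outliers_once <- function(model, data) {
  rs <- rstudent(model)
  p  <- 2 * pt(abs(rs), df = df.residual(model) - 1, lower.tail = FALSE)
  bad <- which(p * length(rs) < 0.05)  # Bonferroni-adjusted p < .05
  if (length(bad) == 0) return(model)  # nothing to remove
  update(model, data = data[-bad, , drop = FALSE])
}

fit  <- lm(mpg ~ wt + hp, data = mtcars)
fit2 <- remove_outliers_once(fit, mtcars)
c(original = summary(fit)$adj.r.squared, updated = summary(fit2)$adj.r.squared)
```

outliers() repeats such a step and tracks the r-squared/AIC improvement between refits.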
heteroskedastic() wraps ncvTest and returns the p-value of the test
statistic as a tibble. A p-value < 0.05 indicates non-constant error
variance (heteroskedasticity).
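What this test formalizes can be sketched in base R with a closely related Breusch-Pagan-style auxiliary regression; a sketch only, not the package's code, with an illustrative model:

```r
# Regress squared residuals on the fitted values; under constant variance,
# n * R^2 of this auxiliary regression is approximately chi-squared(1).
fit  <- lm(mpg ~ wt + hp, data = mtcars)
aux  <- lm(residuals(fit)^2 ~ fitted(fit))
stat <- nobs(fit) * summary(aux)$r.squared
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)
p_value  # < 0.05 would suggest non-constant variance
```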
autocorrelation() wraps durbinWatsonTest and returns the p-value of the
test statistic as a tibble. A p-value < 0.05 indicates autocorrelated
residuals. In such cases, robust standard errors (see robust()) yield
more accurate results for the estimates, or a mixed model with an error
term for the cluster groups should be used.
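The underlying Durbin-Watson statistic is simple to compute by hand in base R (the wrapped test additionally simulates a p-value); the model here is illustrative:

```r
# Values near 2 indicate no first-order autocorrelation of residuals;
# values near 0 (or 4) indicate strong positive (or negative)
# autocorrelation.
fit <- lm(mpg ~ wt + hp, data = mtcars)
e   <- residuals(fit)
dw  <- sum(diff(e)^2) / sum(e^2)
dw
```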
normality() calls shapiro.test and checks the standardized residuals for
normal distribution. The p-value of the test statistic is returned as a
tibble. A p-value < 0.05 indicates a significant deviation from normality.
Note that this formal test almost always yields significant results for
the distribution of residuals, so visual inspection (e.g. Q-Q plots) is
preferable (see plot_model with type = "diag").
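The formal test and the recommended visual check can be placed side by side in base R; an illustrative model, not one of the package's examples:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
rs  <- rstandard(fit)        # standardized residuals
p   <- shapiro.test(rs)$p.value  # formal test; often significant for large n
qqnorm(rs); qqline(rs)       # visual inspection is usually more informative
p
```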
multicollin() wraps vif and returns the maximum VIF-value from a model
as a tibble. A value larger than about 4 indicates multicollinearity.
In that case, the names of the independent variables that contribute
to the multicollinearity are printed to the console.
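The VIF itself can be computed by hand, which shows what the wrapped vif() measures; a base-R sketch with an illustrative model:

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all remaining predictors; multicollin() summarizes by the maximum.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
X <- model.matrix(fit)[, -1]  # predictor columns, intercept dropped
vifs <- vapply(seq_len(ncol(X)), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
}, numeric(1))
names(vifs) <- colnames(X)
max(vifs)
```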
check_assumptions() runs all of the above tests and returns a tibble with
all test statistics included. If the p-values are too confusing, use the
as.logical argument, which replaces all p-values with either TRUE
(assumption violated) or FALSE (model conforms to the assumption of
linear regression).
These formal tests are very strict; in most cases they flag violations of
model assumptions even when the model is actually fine. It is preferable
to check model assumptions by visual inspection (see plot_model with
type = "diag").
data(efc)
fit <- lm(barthtot ~ c160age + c12hour + c161sex + c172code, data = efc)

outliers(fit)

heteroskedastic(fit)
#>   heteroskedastic
#> 1    3.885808e-07

autocorrelation(fit)
#>   autocorrelation
#> 1               0

normality(fit)
#>   non.normality
#> 1  1.535796e-13

check_assumptions(fit)
#>
#> # Checking Model-Assumptions
#>
#>   Model: barthtot ~ c160age + c12hour + c161sex + c172code
#>
#>                            violated    statistic
#>   Heteroskedasticity            yes    p = 0.000
#>   Non-normal residuals          yes    p = 0.000
#>   Autocorrelated residuals      yes    p = 0.000
#>   Multicollinearity              no    vif = 1.153

fit <- lm(barthtot ~ c160age + c12hour + c161sex + c172code + neg_c_7,
          data = efc)

outliers(fit)
#>     models adjusted.r2      aic
#> 1 original   0.3458095 7487.639
#> 2  updated   0.3530485 7468.980

check_assumptions(fit, as.logical = TRUE)
#>   heteroskedasticity multicollinearity non.normal.resid autocorrelation
#> 1               TRUE             FALSE             TRUE            TRUE

# apply function to multiple models in list-variable
library(purrr)
library(dplyr)
tmp <- efc %>%
  bootstrap(50) %>%
  mutate(
    models = map(strap, ~lm(neg_c_7 ~ e42dep + c12hour + c161sex, data = .x))
  )

# for list-variables, argument 'model.column' is the
# quoted name of the list-variable with fitted models
tmp %>% normality("models")
#>    non.normality
#> 1   3.230058e-19
#> 2   3.827347e-16
#> 3   9.098247e-22
#> ...
#> 50  8.619752e-18

tmp %>% heteroskedastic("models")
#>   heteroskedastic
#> 1    1.117145e-06
#> 2    4.136378e-16
#> 3    2.304193e-11
#> ...
#> 50   1.059808e-12

# Durbin-Watson-Test from package 'car' takes a little bit longer due
# to simulation of p-values...
# NOT RUN {
tmp %>% check_assumptions("models", as.logical = TRUE, reps = 100)
# }