• outliers() detects outliers in (generalized) linear models.

  • heteroskedastic() checks a linear model for (non-)constant error variance.

  • autocorrelation() checks for independence of errors.

  • normality() checks linear models for (non-)normality of residuals.

  • multicollin() checks predictors of linear models for multicollinearity.

  • check_assumptions() checks all of the above assumptions.

check_assumptions(x, model.column = NULL, as.logical = FALSE, ...)

outliers(x, iterations = 5)

heteroskedastic(x, model.column = NULL)

autocorrelation(x, model.column = NULL, ...)

normality(x, model.column = NULL)

multicollin(x, model.column = NULL)

Arguments

x

A fitted lm (for outliers(), a glm model is also allowed), or a (nested) data frame with a list-variable that contains fitted model objects.

model.column

Name or index of the list-variable that contains the fitted model objects. Only applies if x is a nested data frame (e.g. with models fitted to bootstrap replicates).

as.logical

Logical, if TRUE, check_assumptions() returns TRUE or FALSE for each test, indicating whether the respective model assumption is violated or not. If FALSE (the default), the p-value of the respective test statistic is returned.

...

Other arguments, passed down to durbinWatsonTest.

iterations

Numeric, the maximum number of iterations for removing outliers.

Value

A tibble with the respective statistics.

Details

These functions are wrappers that compute various test statistics; however, each of them returns a tibble instead of a list of values. Furthermore, all functions can also be applied to multiple models stored in list-variables (see 'Examples').

outliers() wraps outlierTest and iteratively removes outliers, up to iterations times, or until the r-squared value (for glm: the AIC) no longer improves after removing outliers. The function returns a tibble with r-squared and AIC statistics for the original and updated model, as well as the updated model itself ($updated.model), the number ($removed.count) and the indices ($removed.obs) of the removed observations.
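A minimal sketch of accessing these components, assuming a fitted lm called fit as in the examples below, and assuming the components are accessible with $ as the notation above suggests:

res <- outliers(fit)
res$removed.count          # number of removed observations
res$removed.obs            # indices of the removed observations
fit2 <- res$updated.model  # refitted model without the outliers
summary(fit2)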

heteroskedastic() wraps ncvTest and returns the p-value of the test statistic as a tibble. A p-value < 0.05 indicates non-constant variance (heteroskedasticity).

autocorrelation() wraps durbinWatsonTest and returns the p-value of the test statistic as a tibble. A p-value < 0.05 indicates autocorrelated residuals. In such cases, robust standard errors (see robust) yield more accurate results for the estimates, or a mixed model with an error term for the cluster groups should be used.
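A minimal sketch, assuming that arguments supplied via ... (here reps, the number of bootstrap replications used by durbinWatsonTest for its p-value) are passed through unchanged:

# fewer bootstrap replications for a quicker, less precise p-value
autocorrelation(fit, reps = 100)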

normality() calls shapiro.test and checks the standardized residuals for normal distribution. The p-value of the test statistic is returned as a tibble. A p-value < 0.05 indicates a significant deviation from a normal distribution. Note that this formal test almost always yields significant results for the distribution of residuals, so visual inspection (e.g. Q-Q plots) is preferable (see plot_model with type = "diag").
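A sketch of the visual check mentioned above, assuming the sjPlot package (which provides plot_model()) is installed:

library(sjPlot)
# diagnostic plots, including a Q-Q plot of the residuals
plot_model(fit, type = "diag")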

multicollin() wraps vif and returns the logical result as a tibble: TRUE if multicollinearity exists, FALSE otherwise. In case of multicollinearity, the names of the independent variables that contribute to multicollinearity are printed to the console.
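A minimal sketch, using the fitted model from the examples below:

# prints the offending predictors (if any) and returns a 1-column tibble
multicollin(fit)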

check_assumptions() runs all of the above tests and returns a tibble with all test statistics included. If the p-values are too confusing, use the as.logical argument, which replaces all p-values with either TRUE (the assumption is violated) or FALSE (the model conforms to the assumption of linear regression).

Note

These formal tests are very strict, and in most cases a violation of model assumptions is flagged even though the model is actually fine. It is preferable to check model assumptions by visual inspection (see plot_model with type = "diag").

Examples

data(efc)
fit <- lm(barthtot ~ c160age + c12hour + c161sex + c172code, data = efc)
outliers(fit)
#> No outliers detected.
heteroskedastic(fit)
#> Heteroscedasticity (non-constant error variance) detected: p = 0.000
#> # A tibble: 1 x 1
#>   heteroskedastic
#>             <dbl>
#> 1     0.000000389
autocorrelation(fit)
#> Autocorrelated residuals detected: p = 0.000
#> # A tibble: 1 x 1
#>   autocorrelation
#>             <dbl>
#> 1               0
normality(fit)
#> Non-normality of residuals detected: p = 0.000
#> # A tibble: 1 x 1
#>   non.normality
#>           <dbl>
#> 1      1.54e-13
check_assumptions(fit)
#> # A tibble: 1 x 4
#>   heteroskedasticity multicollinearity non.normal.resid autocorrelation
#>                <dbl> <lgl>                        <dbl>           <dbl>
#> 1        0.000000389 FALSE                     1.54e-13               0
fit <- lm(barthtot ~ c160age + c12hour + c161sex + c172code + neg_c_7, data = efc)
outliers(fit)
#> 2 outliers removed in updated model.
#> # A tibble: 2 x 3
#>   models   adjusted.r2   aic
#>   <chr>          <dbl> <dbl>
#> 1 original       0.346 7488.
#> 2 updated        0.353 7469.
check_assumptions(fit, as.logical = TRUE)
#> # A tibble: 1 x 4
#>   heteroskedasticity multicollinearity non.normal.resid autocorrelation
#>   <lgl>              <lgl>             <lgl>            <lgl>
#> 1 TRUE               FALSE             TRUE             TRUE
# apply function to multiple models in list-variable
library(purrr)
library(dplyr)
tmp <- efc %>%
  bootstrap(50) %>%
  mutate(
    models = map(strap, ~lm(neg_c_7 ~ e42dep + c12hour + c161sex, data = .x))
  )

# for list-variables, argument 'model.column' is the
# quoted name of the list-variable with fitted models
tmp %>% normality("models")
#> # A tibble: 50 x 1
#>    non.normality
#>            <dbl>
#>  1      3.58e-19
#>  2      4.13e-16
#>  3      8.16e-22
#>  4      9.24e-19
#>  5      6.11e-18
#>  6      3.62e-20
#>  7      2.33e-19
#>  8      9.11e-17
#>  9      2.30e-19
#> 10      1.01e-20
#> # ... with 40 more rows
tmp %>% heteroskedastic("models")
#> # A tibble: 50 x 1
#>    heteroskedastic
#>              <dbl>
#>  1        9.37e- 7
#>  2        2.96e-16
#>  3        2.20e-11
#>  4        2.07e-13
#>  5        3.61e- 9
#>  6        2.07e- 6
#>  7        2.34e- 6
#>  8        6.22e-11
#>  9        2.08e- 5
#> 10        8.68e-12
#> # ... with 40 more rows
# Durbin-Watson-Test from package 'car' takes a little bit longer due
# to simulation of p-values...
# NOT RUN {
tmp %>% check_assumptions("models", as.logical = TRUE, reps = 100)
# }