vignettes/web_only/ggdotplotstats.Rmd
ggdotplotstats.Rmd
The function ggstatsplot::ggdotplotstats
can be used for data exploration and to provide an easy way to make publication-ready dot plots/charts with appropriate and selected statistical details embedded in the plot itself. In this vignette we will explore several examples of how to use it.
This function is a sister function of gghistostats
with the difference being it expects a labeled numeric variable.
ggdotplotstats
Let’s begin with a very simple example from the ggplot2
package (ggplot2::mpg
), a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov.
# looking at the structure of the data using glimpse
dplyr::glimpse(ggplot2::mpg)
#> Rows: 234
#> Columns: 11
#> $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi"...
#> $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro"...
#> $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0,...
#> $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, ...
#> $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, ...
#> $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "a...
#> $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4",...
#> $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17...
#> $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25...
#> $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
#> $ class <chr> "compact", "compact", "compact", "compact", "compact",...
Let’s say we want to visualize the distribution of mileage by car manufacturer.
# for reproducibility
set.seed(123)
# removing factor level with very few no. of observations
df <- dplyr::filter(.data = ggplot2::mpg, cyl %in% c("4", "6"))
# creating a vector of colors using `paletteer` package
paletter_vector <-
paletteer::paletteer_d(
palette = "palettetown::venusaur",
n = nlevels(as.factor(df$manufacturer)),
type = "discrete"
)
# plot
ggstatsplot::ggdotplotstats(
data = df,
x = cty,
y = manufacturer,
xlab = "city miles per gallon",
ylab = "car manufacturer",
test.value = 15.5,
point.args = list(
shape = 16,
color = paletter_vector,
size = 5
),
title = "Distribution of mileage of cars",
ggtheme = hrbrthemes::theme_ipsum_ps(),
ggstatsplot.layer = FALSE
)
grouped_ggdotplotstats
What if we want to do the same analysis separately for different engines with different numbers of cylinders?
ggstatsplot
provides a special helper function for such instances: grouped_ggdotplotstats
. This is merely a wrapper function around ggstatsplot::combine_plots
. It applies ggdotplotstats
across all levels of a specified grouping variable and then combines the individual plots into a single plot.
Let’s see how we can use this function to apply ggdotplotstats
to accomplish our task.
# for reproducibility
set.seed(123)
# removing factor level with very few no. of observations
df <- dplyr::filter(.data = ggplot2::mpg, cyl %in% c("4", "6"))
# plot
ggstatsplot::grouped_ggdotplotstats(
# arguments relevant for ggstatsplot::ggdotplotstats
data = df,
x = cty,
y = manufacturer,
xlab = "city miles per gallon",
ylab = "car manufacturer",
grouping.var = cyl, # grouping variable
type = "bayes", # Bayesian test
test.value = 15.5,
title.prefix = "cylinder count",
point.args = list(
color = "red",
size = 5,
shape = 13
),
ggtheme = ggthemes::theme_fivethirtyeight(),
# arguments relevant for ggstatsplot::combine_plots
title.text = "Fuel economy data",
plotgrid.args = list(nrow = 2)
)
purrr
Although this is a quick and dirty way to explore a large amount of data with minimal effort, it does come with an important limitation: reduced flexibility. For example, if we wanted to add, let’s say, a separate test.value
argument for each gender, this is not possible with grouped_ggdotplotstats
. For cases like these, or to run separate kinds of tests (robust for some, parametric for other, while Bayesian for some other levels of the group) it would be better to use purrr
.
See the associated vignette here: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/purrr_examples.html
The central tendency measure displayed will depend on the statistics:
Type | Measure | Function used |
---|---|---|
Parametric | mean | parameters::describe_distribution |
Non-parametric | median | parameters::describe_distribution |
Robust | trimmed mean | parameters::describe_distribution |
Bayesian | MAP estimate | parameters::describe_distribution |
MAP: maximum a posteriori probability
Following tests are carried out for each type of analyses-
Type | Test | Function used |
---|---|---|
Parametric | One-sample Student’s t-test | stats::t.test |
Non-parametric | One-sample Wilcoxon test | stats::wilcox.test |
Robust | Bootstrap-t method for one-sample test |
trimcibt (custom) |
Bayesian | One-sample Student’s t-test | BayesFactor::ttestBF |
Following effect sizes (and confidence intervals/CI) are available for each type of test-
Type | Effect size | CI? | Function used |
---|---|---|---|
Parametric | Cohen’s d, Hedge’s g | Yes |
effectsize::cohens_d , effectsize::hedges_g
|
Non-parametric | r (rank-biserial correlation) | Yes | effectsize::rank_biserial |
Robust | trimmed mean | Yes |
trimcibt (custom) |
Bayes Factor | \(\delta_{posterior}\) | Yes | bayestestR::describe_posterior |
To see how the effect sizes displayed in these tests can be interpreted, see: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/effsize_interpretation.html
If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues
For details, see- https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/session_info.html