---
title: "Rchon statistics course, part 2"
output: learnr::tutorial
runtime: shiny_prerendered
---

```{r setup, include=FALSE}
library(learnr)
library(ggplot2)
knitr::opts_chunk$set(echo = FALSE)
```

## Introduction

The tutorial provided here is an accompaniment to the 'Rchon' statistics course, part 2. It treats the basics of statistical testing in R.

In order to run this tutorial, you should have installed the latest versions of R and RStudio, as well as the `learnr` package. No other packages will be necessary, with the exception of `ggplot2`.

The purpose of this tutorial is to extend the application of R from basic data analysis and visualisation to formal statistical testing. In this tutorial, we will treat the following statistical testing methods:

+ t-test
+ ANOVA

Further testing methods (including the chi-square, Mann-Whitney and Kruskal-Wallis tests) will be treated in parts 3 and 4 of this course. Together, these methods can be considered the workhorses of statistical analysis, so it is good to know how they work and when they can be applied. They also serve to show you a number of useful functions of R.

This is not primarily a statistics tutorial, even though some topics are treated in considerable detail. It is assumed that you have knowledge of the basics of statistics. When in doubt, consult the section [Now what was that again?] at the end of this tutorial, or work your way through part 1 of this course. 

The tutorial provides background and hints to the questions posed, but will not give you the full answers. You can use the exercise blocks in the tutorial for testing code, but you are encouraged to use the R console and R scripts to experiment with and solve the problems. An introduction to the interface is provided in the accompanying documentation. If you are really stuck, the answers to the exercises can be found in the document `Solutions Tutorial 2.R`.

Once you have finished the tutorial, you can complete this part of the course by taking the test that is provided separately.

### Credits

This tutorial was developed for Archon Research School of Archaeology by Philip Verhagen (Vrije Universiteit Amsterdam) and Bjørn P. Bartholdy (Leiden University). All content is CC BY-NC-SA: it can be freely distributed and modified under the condition of proper attribution and non-commercial use.

**How to cite:**
Verhagen, P. & B.P. Bartholdy, 2021. "Rchon statistics course, part 2". Amsterdam, ARCHON Research School of Archaeology. doi: 10.5281/zenodo.5155471

The current version of this tutorial was created on 3 August 2021.

## Standard error of the mean 

Before starting with formal statistical testing, we need to understand the characteristics of our data. While inspecting box plots is useful, in many cases we are not just interested in the sample values, but we want to use them to estimate the distribution of values in the total statistical population involved. The most common measure that we would like to estimate is the population mean.

### Background

The accuracy of the estimation of the mean depends on two factors: the observed **standard deviation** in the sample (*s*), and the **sample size** (*n*). The larger the sample size, the closer the estimate of the sample mean will be to the population mean. How close this estimate is can be calculated using the **standard error of the mean** - and this works for any distribution of the data.

$$SE=\frac{s}{\sqrt{n}}$$

The standard error gives us the spread of sample means if we were to take an infinite number of samples of size *n*, and this spread will follow a normal distribution. There is therefore a probability of 68.27% (the **confidence level**) that the sample is taken from a population with a mean lying within plus or minus one standard error (a **z-score** of 1) of the sample mean. This gives the **confidence interval**:

$$CI=\overline{x}\pm z\cdot SE$$

Outcomes can then be noted like this:

`12.4 ± 3.2 (CL = 0.95)`

This can be formulated as: 

'We expect the true population mean to fall between 9.2 and 15.6 95% of the time (i.e. in 95 of the next 100 experiments)'

or:

'The confidence level of the estimate is 95%; its precision (confidence interval, error range) is 3.2'

N.B. The standard error is a valid estimate of the variability of our estimate of the mean - but **not** of the variability of the observations, which is depicted by the standard deviation. It is not possible to directly calculate a standard error of the **median**; however, this can be simulated.
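The two equations above translate directly into R. As a quick illustration (using a small made-up vector, not the course data):

```r
# a small made-up sample, for illustration only
x <- c(12.1, 13.4, 11.8, 14.2, 12.9, 13.1, 12.5, 13.8)
n <- NROW(x)

se <- sd(x) / sqrt(n)   # standard error of the mean
z  <- qnorm(0.975)      # z-value for a 95% confidence level

ci_lower <- mean(x) - z * se
ci_upper <- mean(x) + z * se
c(mean = mean(x), se = se, lower = ci_lower, upper = ci_upper)
```

Note that `qnorm(0.975)` (not 0.95) gives the two-sided 95% z-value, since 2.5% of the probability lies in each tail.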

### How to do this in R

You may already have gathered that R can also be used as a simple calculator. Going back to our original sample from Table 1.10 used in part 1, can you determine the standard error of the mean of the observed lengths? And what would be the 95% confidence interval of the estimate of the mean?

For reporting of the results, you can try to concatenate values and pieces of text using the `paste()` command. The `round()` command can be used to reduce the number of decimals in the output.
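For instance (with made-up numbers, not the exercise results), `paste()` and `round()` combine like this:

```r
# made-up values, for illustration only
mean_x <- 12.4
se <- 3.24781

paste("mean:", round(mean_x, 1), "±", round(se, 1), "(CL = 0.95)")
# "mean: 12.4 ± 3.2 (CL = 0.95)"
```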

```{r standard-error, exercise=TRUE}
# This chunk of code can be used to read tables into the exercise, but you will have to repeat this for every exercise!

setwd(dirname(rstudioapi::getSourceEditorContext()$path))
Tab10 <- read.table("./Table 1.10 Drennan.txt", header = T)

# First calculate the mean and standard deviation of the lengths, and determine the number of rows in Tab10

# Then, use the equations given above to calculate the standard errors, and the confidence interval

# The paste() command can then be used to produce customized text output of the calculated standard error

```

```{r standard-error-hint-1}

# confidence intervals can be calculated like this

# first, determine the corresponding z-value for the desired level of confidence
# then, determine the size of the interval
# last, determine the minimum and maximum value of the interval compared to the mean

```

```{r standard-error-hint-2}

# paste(statement1, statement2, etc.)
# the statement can be a string or a (calculated) number

```

## Designing plots with error bars

In many cases, you don't just want to have an estimate of the mean, but you also want to know whether a difference in means in two or more samples is due to the fact that we only have sampled a small proportion of a population, or if they reflect real differences between the distributions that they are drawn from. The site sizes registered in Table 3.5 of part 1 are a typical example of this kind of question: do mean site sizes change from the Early to the Late Bronze Age?

In order to compare data from different samples, we can start by making boxplots and histograms, like we did in part 1 of the course. However, this only tells us something about the characteristics of the samples. In order to know more about the population means, we also need to compare the standard errors obtained from the samples. This is, of course, quite simple with the equations given above.

### How to do this in R

First, some data preparation is needed.

```{r error-bars1, exercise=TRUE}

# first read the information from Table 3.5 on the site sizes for the Early and Late Bronze Age into separate vectors

# calculate means, standard deviations and number of rows for each
# N.B. use NROW() instead of nrow() when counting the number of items in a vector

# then, calculate standard errors and confidence intervals as in the previous exercise

# finally, put all the information in a new dataframe and print it

```

```{r error-bars1-hint-1}

# first read the information from Table 3.5 on the site sizes for the Early and Late Bronze Age into separate vectors
# remember that vectors can be immediately read from a column in a dataframe using $, and that you can provide a search condition using []

# vectorname <- dataframe$columnname1[dataframe$columnname2 == condition]

```

```{r error-bars1-hint-2}

# calculate means, standard deviations and number of rows for each
# N.B. use NROW() instead of nrow() when counting the number of items in a vector

```

```{r error-bars1-hint-3}

# creating a new dataframe is done like this

# data.frame(column1 = c(item1, item2, ...), column2 = c(item3, item4, ...), ...)

# finally, put all the information in a new dataframe and print it

# and save the dataframe for use in the next exercise

```

Now you can use ggplot to visualise the output contained in the dataframe.

```{r error-bars2, exercise=TRUE}

# always make sure that you have ggplot loaded!

# also, make sure that the dataframe you just created is available for the exercise

# then, tell ggplot to use the dataframe to create a plot with an x- and y-axis

# the plot can then be filled by successively adding different elements using the `+` sign.

# you can then add error bars for the 95% confidence intervals to the plot in a similar fashion

# and display the result

```

```{r error-bars2-hint-1}

# newplot <- ggplot(data = dataframe, aes(column1, column2))

# The `aes` argument takes any variable from the dataframe to be used for the x- and y-axis. 
# In our case, we are interested in the differences in means per period. 

```

```{r error-bars2-hint-2}

# new_plot <- new_plot + geom_point()

# This will plot the two means as points in the graph. Pretty impressive! 

```

```{r error-bars2-hint-3}

# new_plot <- new_plot + geom_errorbar(aes(ymin = , ymax = ), width = ) 

```
  
This already gets quite complex, and we haven't even tried to make the plot pretty yet! But at least you should now have a graph that clearly shows that the observed differences in mean site sizes are most probably not the result of sampling error.
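If you do want to polish the plot, further layers can be added in the same way. A minimal sketch, using a made-up summary dataframe (the column names here are just an example, not the ones from the exercise):

```r
library(ggplot2)

# a made-up summary dataframe, for illustration only
site_stats <- data.frame(
  period    = c("EBA", "LBA"),
  mean_size = c(1.2, 3.4),
  ci        = c(0.4, 0.9)   # half-width of the 95% confidence interval
)

p <- ggplot(site_stats, aes(period, mean_size)) +
  geom_point() +
  geom_errorbar(aes(ymin = mean_size - ci, ymax = mean_size + ci), width = 0.2) +
  labs(x = "Period", y = "Mean site size (ha)") +
  theme_minimal()
p
```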

## The t-test

A formal assessment of the difference in means requires us to take the standard deviations of the samples into account. The **t-test** does exactly that: it calculates a standard error on the basis of the **pooled standard deviations** of two samples and determines the associated confidence level. Making the calculations yourself is not too hard, but can become quite tedious, and this is of course an important reason why statistics packages exist. However, you need to know the rationale behind the test in order to interpret it correctly.

### Background

Pooled standard deviations are calculated as follows:

$$s_{p}=\sqrt{\frac{(n_{1}-1)s_{1}^{2}+(n_{2}-1)s_{2}^{2}}{n_{1}+n_{2}-2}}$$

The pooled standard error is then equal to:

$$SE_{p}=s_{p}\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}$$

The test statistic *t* then expresses the difference in sample means in standard errors:

$$t=\frac{\bar{x}_{1}-\bar{x}_{2}}{SE_{p}}$$
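The three equations are easily chained together in R. A sketch with two small made-up samples (not the course data):

```r
# two small made-up samples, for illustration only
a <- c(4.1, 5.2, 4.8, 5.5, 4.9, 5.1)
b <- c(5.9, 6.3, 5.7, 6.8, 6.1, 6.4)
n1 <- NROW(a); n2 <- NROW(b)

# pooled standard deviation and pooled standard error
s_p  <- sqrt(((n1 - 1) * sd(a)^2 + (n2 - 1) * sd(b)^2) / (n1 + n2 - 2))
se_p <- s_p * sqrt(1 / n1 + 1 / n2)
t_val <- (mean(a) - mean(b)) / se_p

# the same value of t is reported by t.test() with equal variances assumed
t.test(a, b, var.equal = TRUE)$statistic
```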


### How to do this in R

First, determine the value of *t*. This should then be compared to the **Student's t-distribution** in order to obtain its confidence level, denoted as the *p-value*. A key parameter to consider is the *degrees of freedom* involved:

$df = n_{1} + n_{2} - 2$

So, in this case this would be *(24 + 26) - 2 = 48*. 

This is another reason to use a stats package: calculating p-values by hand for given values of *t* and *df* is nearly impossible, and is quite unnecessary given the availability of a wide variety of software to do this for you. The `t.test()` command in R can be used to directly obtain the values of *t* and *p*.
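That said, R can do the lookup for you: the `pt()` function gives the cumulative t-distribution, so a two-sided p-value for given values of *t* and *df* can be obtained like this (the values here are made up):

```r
# made-up values of t and df, for illustration only
t_val <- 2.1
df <- 48

p <- 2 * pt(-abs(t_val), df)   # two-sided p-value
p
```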

```{r t-test1, exercise=TRUE}

# we are going to depart from the calculations you already made in the previous exercise for the mean,
# standard deviation and standard error for the Early and Late Bronze Age sites in Table 3.5

# first, calculate the pooled standard deviation

# then, calculate the pooled standard error

# and finally, calculate the value of t

# and compare this to the outcome of the t.test() command

```

When reporting results make sure to **include** the point estimate (observed difference in means), test statistic (with df), exact p-value (unless p < 0.001), and 95% confidence interval.

For example: 't[232] = -0.033, p = 0.974, 95%CI[-1.56, 1.51]'

You can extract this information by first storing the output of `t.test()` in an object:

`t_outcome <- t.test()`

The `str()` command will then show you the components of the resulting list. You will notice that some of these are `Named num` values; to extract just the value, use the `unname()` command. The `names()` command will give you the component names only.
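A toy example (not the course data) of extracting values from a t-test result:

```r
# two made-up vectors, for illustration only
t_out <- t.test(c(1, 2, 3, 4, 5), c(2, 3, 4, 5, 7))

names(t_out)              # the components of the result list
unname(t_out$statistic)   # the value of t, without its name
t_out$p.value             # the p-value
t_out$conf.int            # the 95% confidence interval
```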

### The small print

The calculated p-value for Table 3.5 is a very low number, indicating that it is highly improbable that the means of both populations are equal. This should not come as a surprise, given the large differences in the error bar plots. Other things to notice:

1. The standard output of the `t.test()` command for Table 3.5 produces a lower value for *df* than 48. The standard t-test works under the assumption that the variance of both samples is roughly equal in order to calculate the pooled standard deviation. However, to account for the effect of non-equal variances, the t-test in R applies a correction known as the *Welch approximation*. In practice, this implies reducing the degrees of freedom, resulting in a lower confidence level of the outcomes - if you don't want this correction, specify `var.equal = TRUE`. In Table 3.5, however, the variances of both data sets are very different, so R is correct in applying this correction.

2. The second thing to consider is whether the t-test is applied two-sided or one-sided. In many cases, we are only interested in knowing whether A is larger or smaller than B, not whether it could be either larger or smaller. This should be formulated carefully. The standard version of the t-test in R is two-sided: it tests the null hypothesis that there is no difference in means, with the alternative hypothesis that they are not the same. The p-value shown here therefore indicates that a difference this large would be extremely improbable if the means were equal. Using the test one-sided, you have to choose between "greater" and "less". Choosing "greater" will give you the probability of the observed outcome under the null hypothesis that the mean of the first sample is not larger than the mean of the second one.

3. Thirdly, a 95% confidence interval is given with the test. This means that if we were to repeat the sampling many times, the confidence interval calculated each time would contain the true difference in means in 95% of cases.

## ANOVA

ANOVA stands for **analysis of variance**, and can be applied to compare sample means of two or more data sets. 

### Background

In a similar fashion to the t-test, ANOVA compares the variance *within* samples and *between* samples to calculate a test statistic known as *F*. This quickly becomes cumbersome with larger numbers of samples and items, but we will demonstrate the approach here using Table 12.4 from Drennan (2009). It contains a list of house floor areas of five sites occupied during the Early, Middle and Late Neolithic in Heiligenstadt (Germany). *Within group variance* ${s_{W}}^{2}$ and *between group variance* ${s_{B}}^{2}$ are calculated as follows, with *m* equal to the number of groups:

$${s_{W}}^{2}=\frac{\sum_{i=1}^{m}(n_{i}-1){s_{i}}^{2}}{n_{\cdot } - m}$$

$${s_{B}}^{2}=\frac{\sum_{i=1}^{m}n_{i}(\bar{x}_{i}-\bar{x}_{\cdot})^{2}}{m - 1}$$

The F statistic can then be calculated to determine the associated p-value:

$$F=\frac{{s_{B}}^{2}}{{s_{W}}^{2}}$$
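The three equations can be checked against R's built-in `aov()` function. A sketch with three small made-up groups (not the Table 12.4 data):

```r
# made-up floor areas in three groups, for illustration only
area  <- c(10, 12, 11, 13, 15, 16, 14, 17, 20, 22, 21, 19)
group <- factor(rep(c("A", "B", "C"), each = 4))

n_i   <- tapply(area, group, NROW)
m     <- nlevels(group)
n_tot <- NROW(area)

s2_within  <- sum((n_i - 1) * tapply(area, group, var)) / (n_tot - m)
s2_between <- sum(n_i * (tapply(area, group, mean) - mean(area))^2) / (m - 1)
F_val <- s2_between / s2_within

# the same F value appears in the aov() output
summary(aov(area ~ group))
```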

### How to do this in R

```{r anova1, exercise=TRUE}

# first, let's make some boxplots of the floor areas per site

# then, calculate the basic statistics needed for the comparison
# mean, variance and number of observations per site
# this can be done quickly using the tapply() command

# if you want to display things nicely, pull them together using the cbind() command and print

# then, calculate the within group, between group variance and F

# the p-value for the calculated value of F can be obtained using the pf() command

# these outcomes match the result of the aov() command in R

```

```{r anova1-hint-1}
# you can easily display a boxplot on the basis of dataframe with two columns like this

# boxplot(dataframe$columnname1 ~ dataframe$columnname2)

```

```{r anova1-hint-2}
# calculate the basic statistics needed for the comparison
# mean, variance and number of observations per site
# this can be done quickly using the tapply() command

# tapply(dataframe$columnname1, dataframe$columnname2, statistic)

```

```{r anova1-hint-3}

# N.B. The `aov()` command will only provide full output when combined with the `summary()` command.

```

The degrees of freedom indicated in the output of the `aov()` command are equal to *m - 1* and *n - m* respectively, and the *Sum Sq* refers to the sum of squares between groups and within groups - you can check this yourself.

The results show an extremely small p-value, indicating that the means of the floor areas of the sites are not equal. From the boxplot it is already clear that we might in fact have two different groups. A quick way to compare this is to use **Tukey's test**, which will give the difference in means, confidence levels and the adjusted p-values for all possible pairs. In R, this can be done by passing the results of the `aov()` command to the `TukeyHSD()` command.

```{r tukey, exercise=TRUE}

# the trick is to run aov() on the floor area and site number, pass this to the TukeyHSD() command, 
# and store the results in a list of values
# this information can then be used directly to print and plot the outcomes

```
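As a template (again with made-up data; the exercise itself uses the Table 12.4 data), the whole pipeline looks like this:

```r
# made-up floor areas in three groups, for illustration only
area <- c(10, 12, 11, 13, 15, 16, 14, 17, 20, 22, 21, 19)
site <- factor(rep(c("A", "B", "C"), each = 4))

tukey_out <- TukeyHSD(aov(area ~ site))
tukey_out        # differences in means, confidence intervals and adjusted p-values
plot(tukey_out)  # and a plot of the pairwise confidence intervals
```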

### The small print

ANOVA is carried out under the assumption of normality and of roughly equal within-group variances. If these assumptions are violated, ANOVA cannot be expected to give reliable results, and you should apply the Kruskal-Wallis test instead (see part 4 of this course). As you may have noticed, this is the case when comparing the floor areas per site. However, as demonstrated here, the results of the ANOVA can still be useful to understand the differences in means between the groups, especially when combined with Tukey's test; but if the distributions are strongly skewed or otherwise irregular, the means themselves may not be a reliable measure.

## Quiz

To check whether you have understood the concepts treated in this part of the course, take the following quiz:

```{r quiz1, echo=FALSE}
quiz(
  question("The standard error of the mean",
         answer("only works for normal distributions"),
         answer("is normally distributed", TRUE),
         answer("can also be used for the median")
         ),
question("The confidence level of the standard error of the mean",
         answer("is also known as precision or error range"),
         answer("can be expressed in z-scores", TRUE),
         answer("also gives you the variation of the samples")
         ),
question("The p-value of the calculated value of t in the t-test",
         answer("depends on the degrees of freedom", TRUE),
         answer("can easily be calculated by hand"),
         answer("requires you to calculate within-group and between-group variance")
         ),
question("ANOVA",
         answer("determines p-values from the t-distribution"),
         answer("tests for difference in variance of two or more samples"),
         answer("can be applied to two or more normally distributed samples", TRUE)
         )
)
```

## Now what was that again?

### Normal distribution

A normal, or Gaussian distribution is a set of values that is distributed symmetrically around the mean $\mu$. In a graph, it shows a typical bell shape, with the probability of extreme values diminishing the further they are from the mean.

```{r echo=FALSE}
x <- seq(-4, 4, length = 1000)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type="l", lwd=1)
```

### z-score

If you know the standard deviation $\sigma$ of the normal distribution, then this can be used to predict the proportion of values being contained within a particular interval from the mean. The number of standard deviations from the mean is known as **z**.

|interval |proportion|
|:---|:---|
|$\mu$ ± $\sigma$ | 68.27%|
|$\mu$ ± 2 $\sigma$ | 95.45%|
|$\mu$ ± 3 $\sigma$ | 99.73%|
|$\mu$ ± 1.96 $\sigma$ | 95%|
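The proportions in this table can be verified directly with `pnorm()`, which gives the cumulative normal distribution:

```r
# proportion of a normal distribution within z standard deviations of the mean
within <- function(z) pnorm(z) - pnorm(-z)

round(within(1), 4)      # 0.6827
round(within(2), 4)      # 0.9545
round(within(1.96), 4)   # 0.95
```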

### Student's t distribution

With small samples (*n < 30*), the normal distribution is unreliable for determining confidence levels. Instead, the Student's t distribution should be used. Its shape is symmetrical like the normal distribution, but the tail ends are longer. Its shape depends on the number of items involved, which is commonly referred to as the *degrees of freedom* (equal to *n-1*). The graph below shows the normal distribution in red, and the Student's t distribution for *n=10* in blue. The value of *t* is then a measure for the confidence interval given a specific value of *n*, similar to the use of *z* with the normal distribution. The differences, however, are usually not large.

```{r echo=FALSE}
x <- seq(-4, 4, length = 1000)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type="l", lwd=1, col="red")
n <- 10
t <- dt(x, n)
lines(x, t, col="blue")
```
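You can also compare the two distributions numerically with `qt()` and `qnorm()`: the critical value for a 95% confidence level is somewhat larger for small samples, and approaches the familiar 1.96 as the sample size grows:

```r
qnorm(0.975)         # the normal z-value for a 95% confidence level (~1.96)
qt(0.975, df = 9)    # the t-value for n = 10 (df = n - 1); noticeably larger
qt(0.975, df = 100)  # with larger samples, t approaches z
```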