---
title: "Rchon statistics course, part 1"
output: learnr::tutorial
runtime: shiny_prerendered
---
```{r setup, include=FALSE}
library(learnr)
knitr::opts_chunk$set(echo = FALSE)
```
## Introduction
The tutorial provided here is an accompaniment to the 'Rchon' statistics course, part 1. It treats the basics of data manipulation in R, and teaches you how to make simple visualisations from a list of numerical data.
The purpose of this tutorial is to provide a relatively gentle introduction to R for working with archaeological datasets. More background on every topic imaginable in statistics and the use of R is available all over the internet, ranging from the R help files to discussion forums, full length tutorials and course books. Don't hesitate to consult these while going through this tutorial - there is often more than one way to tackle a particular problem. Some specific resources for archaeologists are listed below.
In this tutorial, we will start from the basics of R. There are often shortcuts and alternative solutions to the problems posed in the tutorial, but it is important to get familiar with the basic concepts of R before trying out all kinds of fancier tools.
This is not primarily a statistics tutorial; it is assumed that you have knowledge of the basic statistical concepts treated here. When in doubt, consult the section [Now what was that again?] at the end of this tutorial.
Once you have worked through this tutorial, you can finish this course by making the test that is provided separately.
### Resources
[M. Baxter & H. Cook, 2016. Life beyond Excel.](https://www.researchgate.net/publication/320592558_Basic_Statistical_Graphics_for_Archaeology_with_R_Life_Beyond_Excel)
Introduction to working with R for exploratory data analysis in archaeology
[M. Baxter, 2015. Notes on Quantitative Archaeology and R.](https://www.academia.edu/12545743/Notes_on_Quantitative_Archaeology_and_R)
Extended version of the previous, also treating more advanced topics such as PCA, cluster analysis and inferential statistics
[Carlson, D.L. 2017. Quantitative Methods in Archaeology Using R.](https://www.cambridge.org/core/books/quantitative-methods-in-archaeology-using-r/DEAE593FA2418EA3B8ECD538C34ED2D5)
Book-length tutorial, but not in Open Access, which of course goes against the philosophy of R. David Carlson's website features [example data sets](https://sites.google.com/a/tamu.edu/dlcarlson/home/r-project-for-statistical-computing) that can be used with the core statistical text books in archaeology by Shennan (1997) and Drennan (2009).
[Ben Marwick's Github repository](https://benmarwick.github.io/How-To-Do-Archaeological-Science-Using-R/ Q3 + 1.5 (Q3 - Q1)$$
Sometimes, **far outliers** are distinguished as well:
$$\ < Q1 - 3 (Q3 - Q1)$$
$$\ > Q3 + 3 (Q3 - Q1)$$
Keep in mind that the extremes (minimum and maximum value) of your dataset are not necessarily outliers, it all depends on the spread of the values around the median.
With the quartiles calculated in the previous section, you can now easily calculate the outlier values. R offers an additional command `IQR()` that will immediately give you the interquartile range.
You may have noticed that `quantiles()` produces output in a format that is known in R as `named num`. If you want to get rid of the name of the number, use `unname()`.
```{r calc-outliers-setup}
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
Tab10 <- read.table("./Table 1.10 Drennan.txt", header = T)
Length10 <- Tab10[,2]
```
```{r calc-outliers, exercise=TRUE}
```
```{r calc-outliers-solution}
#The outlier bounds can be calculated as follows
my_quartiles <- unname(quantile(Length10, probs = seq(0, 1, 0.25)))
outlier_lo <- my_quartiles[2] - 1.5 * IQR(Length10)
outlier_hi <- my_quartiles[4] + 1.5 * IQR(Length10)
```
### Standard deviation
A more usual way to quantify the spread of values in a dataset is, of course, to calculate the **standard deviation**, defined as the square root of the **variance**. The commands for this in R are simply:
`var()` # variance
`sqrt()` # square root
`sd()` # standard deviation
### Quiz
Now let's see if you have come up with the right results:
```{r quiz3}
quiz(
question("What is the interquartile range of Length10 (or Tab10[,2])?",
answer("6.9", correct = TRUE),
answer("18.9"),
answer("27.6"),
allow_retry = TRUE
),
question("What is the upper value beyond which we classify an observation as an outlier?",
answer("25.8"),
answer("36.15", correct = TRUE),
answer("48.3"),
allow_retry = TRUE
),
question("How many far outliers are there in?",
answer("0"),
answer("1", message = "The upper bound for far outliers is 46.5, the lower bound -1.8", correct = TRUE),
answer("2"),
allow_retry = TRUE
),
question("What is the standard deviation?",
answer("10.86167", correct = TRUE),
answer("23.42941"),
answer("117.976"),
allow_retry = TRUE
)
)
```
## Boxplots
The most common way to visualize the spread of values in your dataset is the boxplot, or box-and-whiskerplot. It summarizes the main characteristics of your data in a single graph. In base R, this is done using the command `boxplot()`. With `ggplot2`, we can again use the `qplot()` command to make boxplots, giving `geom="boxplot"` as an additional argument.
The central box depicts the interquartile range, with the median shown as a solid line inside the box. The 'whiskers' around the box are the outlier bounds. Any outliers are depicted as individual data points outside the whiskers. However, when the minimum and maximum values are within the outlier bounds, then the whiskers will only extend to the extremes.
```{r make-boxplots-setup}
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
Tab10 <- read.table("./Table 1.10 Drennan.txt", header = T)
Length10 <- Tab10[,2]
```
```{r make-boxplots, exercise=TRUE}
```
Again, you will notice differences in output between base R and ggplot2; there are many ways to manipulate the design of your graphs in both. A few things you might try out:
Base R
* Setting the colour of the boxes is done with the `col` argument. Colours can be specified as colour names, or as hexadecimal colour codes. See [here](https://htmlcolorcodes.com/color-picker/) to search for hex colours.
* `border` colours the outlines
* `xlab` and `ylab` can be used for axis labels

`qplot()`
* `main` can be used to add a title
* `xlab` and `ylab` can be used for axis labels
* `fill` is used to colour the box, and `color` to colour the outlines - take care however that the colour names should be coded as `I("colourname")`

A full cheatsheet for `qplot()` can be found [here](https://dcgerard.github.io/stat_412_612/lectures/03_graphics/03_qplot_cheatsheet.pdf).
## Comparing multiple datasets
More often than not, you will not just be interested in the characteristics of a single dataset, but you will want to compare the two. In R, combining visualizations for multiple datasets is fairly easy.
### Histograms for more than one dataset
With the `hist()` or `qplot()` commands, you only need to specify the `add = T` argument to place an extra histogram in your plot. In practice, having more than two of these will quickly get confusing (you can better use line graphs for that), but for side-by-side comparison of two datasets it can be very useful.
First of all, be aware that the minimum and maximum values in both datasets may not be the same, so an additional task is to first work out the values for `xlim` and `ylim` of the histogram. Try to figure out how to do this for both tables, already stored in the dataframes Tab10 and Tab11. For `xlim`, this will be straightforward on the basis of the values in the dataset, already stored in the variables Length10 and Length11.
The values of `ylim`, however, depend on the number of bins chosen. Ideally, the bin sizes for both plots should be equal! In order to know how many observations per bin are found, you can specify `plot = F` with the `hist()` command, and assign it to a variable. This will give you all the information used to plot the histogram, without actually creating the plot. The column `$counts` will give you the numbers of observations per bin.
As a finishing touch, you should set the colour of the second plot to transparent. With the `col` argument, this is done by specifying the RGB colour plus a transparency value (alpha) like this:
`col = rgb(0,1,0,0.5)`
Creating advanced plots in R takes some time, so it will pay off to save your work for future use!
```{r calc-axis-limits-setup}
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
Tab10 <- read.table("./Table 1.10 Drennan.txt", header = T)
Length10 <- Tab10[,2]
Tab11 <- read.table("./Table 1.11 Drennan.txt", header = T)
Length11 <- Tab11[,2]
```
```{r calc-axis-limits, exercise=TRUE}
```
```{r calc-axis-limits-hint-1}
# calculating xlim
my_xlim = c(min(Length10,Length11),max(Length10,Length11))
```
```{r calc-axis-limits-hint-2}
# calculating ylim
hist10 = hist(Length10, breaks = 10, plot = F)
hist11 = hist(Length11, breaks = 10, plot = F)
my_ylim = c(0,max(hist10$counts,hist11$counts))
```
```{r calc-axis-limits-hint-3}
# making the plots in red and transparent green
hist(Length10, breaks = 10, xlim = my_xlim, ylim = my_ylim, col = "red")
hist(Length11, breaks = 10, xlim = my_xlim, ylim = my_ylim, col = rgb(0,1,0,0.5), add = T)
```
### Boxplots for more than one dataset
Creating multiple boxplots is much easier. The `boxplot()` command allows you to specify a range of input vectors to display. In order to display the boxes in different colours, just provide these as a vector of colour names in the order that you want them to appear to the `col` argument like this:
`boxplot(Length10, Length11, col = c("red", "green"))`
## Final quiz
These final quiz questions are to see if you have understood the basic concepts treated in this tutorial:
```{r final-quiz}
quiz(
question("What is an argument in R?",
answer("A difference of opinion on statistics"),
answer("An option to be used with a command", correct = TRUE),
answer("An operator used to assign values to a variable"),
allow_retry = TRUE
),
question("What do the whiskers in a box plot represent?",
answer("The maximum and minimum value"),
answer("The upper and lower bounds for outliers", correct = TRUE),
answer("The interquartile range"),
allow_retry = TRUE
),
question("What is a vector in R?",
answer("A list of numbers", correct = TRUE),
answer("A geometric object with length and direction"),
answer("A computer virus"),
allow_retry = TRUE
),
question("Which one of these commands will create a range of deciles from Tab10?",
answer("quantile(Tab10[,2], probs = c(1,2,3,4,5,6,7,8,9,10))"),
answer("quantile(Tab10[,2], probs = seq(0,1,0.1))", correct = TRUE),
answer("quantile(Tab10[,2], probs = range(Tab10[,2]))"),
allow_retry = TRUE
)
)
```
## Now what was that again?
### Mean
The (arithmetic) mean is defined as the sum of all values in a list of numbers, divided by the number of values in this list. Or, expressed as an equation:
$$\bar{x} = \sum_{i=1}^{n}\frac{x_{i}}{n}$$
### Median
The median is defined as the 'middle' number in a list of numbers, when ordered from smallest to greatest. If there is an odd number of numbers, then the middle one is picked. With an even number, there is no single middle value, so then the median is defined as the mean of the two middle values.
### Mode
The mode is defined as the most frequently occurring value in a list of numbers. The mode is not necessarily unique.
### Variance and standard deviation
The standard deviation \(s\) is a measure of the spread of values around the mean. In order to calculate it, first the sum of squared deviations from the mean is determined; this is called the variance \(s^2\). The standard deviation is then the square root of the variance. Both are used extensively for statistical analysis.
$$ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x-\bar{x})^{2}$$
$$ s = \sqrt{s^2} $$