--- title: "Rchon statistics course, part 1" output: learnr::tutorial runtime: shiny_prerendered --- {r setup, include=FALSE} library(learnr) knitr::opts_chunk$set(echo = FALSE)  ## Introduction The tutorial provided here is an accompaniment to the 'Rchon' statistics course, part 1. It treats the basics of data manipulation in R, and teaches you how to make simple visualisations from a list of numerical data. The purpose of this tutorial is to provide a relatively gentle introduction to R for working with archaeological datasets. More background on every topic imaginable in statistics and the use of R is available all over the internet, ranging from the R help files to discussion forums, full length tutorials and course books. Don't hesitate to consult these while going through this tutorial - there is often more than one way to tackle a particular problem. Some specific resources for archaeologists are listed below. In this tutorial, we will start from the basics of R. There are often shortcuts and alternative solutions to the problems posed in the tutorial, but it is important to get familiar with the basic concepts of R before trying out all kinds of fancier tools. This is not primarily a statistics tutorial; it is assumed that you have knowledge of the basic statistical concepts treated here. When in doubt, consult the section [Now what was that again?] at the end of this tutorial. Once you have worked through this tutorial, you can finish this course by making the test that is provided separately. ### Resources [M. Baxter & H. Cook, 2016. Life beyond Excel.](https://www.researchgate.net/publication/320592558_Basic_Statistical_Graphics_for_Archaeology_with_R_Life_Beyond_Excel) Introduction to working with R for exploratory data analysis in archaeology [M. Baxter, 2015. Notes on Quantitative Archaeology and R.](https://www.academia.edu/12545743/Notes_on_Quantitative_Archaeology_and_R) Extended version of the previous, also treating more advanced topics such as PCA, cluster analysis and inferential statistics [Carlson, D.L. 2017. Quantitative Methods in Archaeology Using R.](https://www.cambridge.org/core/books/quantitative-methods-in-archaeology-using-r/DEAE593FA2418EA3B8ECD538C34ED2D5) Book-length tutorial, but not in Open Access, which of course goes against the philosophy of R. David Carlson's website features [example data sets](https://sites.google.com/a/tamu.edu/dlcarlson/home/r-project-for-statistical-computing) that can be used with the core statistical text books in archaeology by Shennan (1997) and Drennan (2009). [Ben Marwick's Github repository](https://benmarwick.github.io/How-To-Do-Archaeological-Science-Using-R/ Q3 + 1.5 (Q3 - Q1)$$Sometimes, **far outliers** are distinguished as well:$$\ < Q1 - 3 (Q3 - Q1)\ > Q3 + 3 (Q3 - Q1)$$Keep in mind that the extremes (minimum and maximum value) of your dataset are not necessarily outliers, it all depends on the spread of the values around the median. With the quartiles calculated in the previous section, you can now easily calculate the outlier values. R offers an additional command IQR() that will immediately give you the interquartile range. You may have noticed that quantiles() produces output in a format that is known in R as named num. If you want to get rid of the name of the number, use unname(). {r calc-outliers-setup} setwd(dirname(rstudioapi::getSourceEditorContext()path)) Tab10 <- read.table("./Table 1.10 Drennan.txt", header = T) Length10 <- Tab10[,2]  {r calc-outliers, exercise=TRUE}  {r calc-outliers-solution} #The outlier bounds can be calculated as follows my_quartiles <- unname(quantile(Length10, probs = seq(0, 1, 0.25))) outlier_lo <- my_quartiles - 1.5 * IQR(Length10) outlier_hi <- my_quartiles + 1.5 * IQR(Length10)  ### Standard deviation A more usual way to quantify the spread of values in a dataset is, of course, to calculate the **standard deviation**, defined as the square root of the **variance**. The commands for this in R are simply: var() # variance sqrt() # square root sd() # standard deviation ### Quiz Now let's see if you have come up with the right results: {r quiz3} quiz( question("What is the interquartile range of Length10 (or Tab10[,2])?", answer("6.9", correct = TRUE), answer("18.9"), answer("27.6"), allow_retry = TRUE ), question("What is the upper value beyond which we classify an observation as an outlier?", answer("25.8"), answer("36.15", correct = TRUE), answer("48.3"), allow_retry = TRUE ), question("How many far outliers are there in?", answer("0"), answer("1", message = "The upper bound for far outliers is 46.5, the lower bound -1.8", correct = TRUE), answer("2"), allow_retry = TRUE ), question("What is the standard deviation?", answer("10.86167", correct = TRUE), answer("23.42941"), answer("117.976"), allow_retry = TRUE ) )  ## Boxplots The most common way to visualize the spread of values in your dataset is the boxplot, or box-and-whiskerplot. It summarizes the main characteristics of your data in a single graph. In base R, this is done using the command boxplot(). With ggplot2, we can again use the qplot() command to make boxplots, giving geom="boxplot" as an additional argument. The central box depicts the interquartile range, with the median shown as a solid line inside the box. The 'whiskers' around the box are the outlier bounds. Any outliers are depicted as individual data points outside the whiskers. However, when the minimum and maximum values are within the outlier bounds, then the whiskers will only extend to the extremes. {r make-boxplots-setup} setwd(dirname(rstudioapi::getSourceEditorContext()path)) Tab10 <- read.table("./Table 1.10 Drennan.txt", header = T) Length10 <- Tab10[,2]  {r make-boxplots, exercise=TRUE}  Again, you will notice differences in output between base R and ggplot2; there are many ways to manipulate the design of your graphs in both. A few things you might try out: Base R * Setting the colour of the boxes is done with the col argument. Colours can be specified as colour names, or as hexadecimal colour codes. See [here](https://htmlcolorcodes.com/color-picker/) to search for hex colours. * border colours the outlines * xlab and ylab can be used for axis labels qplot() * main can be used to add a title * xlab and ylab can be used for axis labels * fill is used to colour the box, and color to colour the outlines - take care however that the colour names should be coded as I("colourname") A full cheatsheet for qplot() can be found [here](https://dcgerard.github.io/stat_412_612/lectures/03_graphics/03_qplot_cheatsheet.pdf). ## Comparing multiple datasets More often than not, you will not just be interested in the characteristics of a single dataset, but you will want to compare the two. In R, combining visualizations for multiple datasets is fairly easy. ### Histograms for more than one dataset With the hist() or qplot() commands, you only need to specify the add = T argument to place an extra histogram in your plot. In practice, having more than two of these will quickly get confusing (you can better use line graphs for that), but for side-by-side comparison of two datasets it can be very useful. First of all, be aware that the minimum and maximum values in both datasets may not be the same, so an additional task is to first work out the values for xlim and ylim of the histogram. Try to figure out how to do this for both tables, already stored in the dataframes Tab10 and Tab11. For xlim, this will be straightforward on the basis of the values in the dataset, already stored in the variables Length10 and Length11. The values of ylim, however, depend on the number of bins chosen. Ideally, the bin sizes for both plots should be equal! In order to know how many observations per bin are found, you can specify plot = F with the hist() command, and assign it to a variable. This will give you all the information used to plot the histogram, without actually creating the plot. The column counts will give you the numbers of observations per bin. As a finishing touch, you should set the colour of the second plot to transparent. With the col argument, this is done by specifying the RGB colour plus a transparency value (alpha) like this: col = rgb(0,1,0,0.5) Creating advanced plots in R takes some time, so it will pay off to save your work for future use! {r calc-axis-limits-setup} setwd(dirname(rstudioapi::getSourceEditorContext()path)) Tab10 <- read.table("./Table 1.10 Drennan.txt", header = T) Length10 <- Tab10[,2] Tab11 <- read.table("./Table 1.11 Drennan.txt", header = T) Length11 <- Tab11[,2]  {r calc-axis-limits, exercise=TRUE}  {r calc-axis-limits-hint-1} # calculating xlim my_xlim = c(min(Length10,Length11),max(Length10,Length11))  {r calc-axis-limits-hint-2} # calculating ylim hist10 = hist(Length10, breaks = 10, plot = F) hist11 = hist(Length11, breaks = 10, plot = F) my_ylim = c(0,max(hist10counts,hist11counts))  {r calc-axis-limits-hint-3} # making the plots in red and transparent green hist(Length10, breaks = 10, xlim = my_xlim, ylim = my_ylim, col = "red") hist(Length11, breaks = 10, xlim = my_xlim, ylim = my_ylim, col = rgb(0,1,0,0.5), add = T)  ### Boxplots for more than one dataset Creating multiple boxplots is much easier. The boxplot() command allows you to specify a range of input vectors to display. In order to display the boxes in different colours, just provide these as a vector of colour names in the order that you want them to appear to the col argument like this: boxplot(Length10, Length11, col = c("red", "green")) ## Final quiz These final quiz questions are to see if you have understood the basic concepts treated in this tutorial: {r final-quiz} quiz( question("What is an argument in R?", answer("A difference of opinion on statistics"), answer("An option to be used with a command", correct = TRUE), answer("An operator used to assign values to a variable"), allow_retry = TRUE ), question("What do the whiskers in a box plot represent?", answer("The maximum and minimum value"), answer("The upper and lower bounds for outliers", correct = TRUE), answer("The interquartile range"), allow_retry = TRUE ), question("What is a vector in R?", answer("A list of numbers", correct = TRUE), answer("A geometric object with length and direction"), answer("A computer virus"), allow_retry = TRUE ), question("Which one of these commands will create a range of deciles from Tab10?", answer("quantile(Tab10[,2], probs = c(1,2,3,4,5,6,7,8,9,10))"), answer("quantile(Tab10[,2], probs = seq(0,1,0.1))", correct = TRUE), answer("quantile(Tab10[,2], probs = range(Tab10[,2]))"), allow_retry = TRUE ) )  ## Now what was that again? ### Mean The (arithmetic) mean is defined as the sum of all values in a list of numbers, divided by the number of values in this list. Or, expressed as an equation:$$\bar{x} = \sum_{i=1}^{n}\frac{x_{i}}{n}$$### Median The median is defined as the 'middle' number in a list of numbers, when ordered from smallest to greatest. If there is an odd number of numbers, then the middle one is picked. With an even number, there is no single middle value, so then the median is defined as the mean of the two middle values. ### Mode The mode is defined as the most frequently occurring value in a list of numbers. The mode is not necessarily unique. ### Variance and standard deviation The standard deviation $$s$$ is a measure of the spread of values around the mean. In order to calculate it, first the sum of squared deviations from the mean is determined; this is called the variance $$s^2$$. The standard deviation is then the square root of the variance. Both are used extensively for statistical analysis.$$ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x-\bar{x})^{2} s = \sqrt{s^2}$\$