---
title: "Rchon statistics course, test part 1"
output: learnr::tutorial
runtime: shiny_prerendered
---

```{r setup, include=FALSE}
library(learnr)
knitr::opts_chunk$set(echo = FALSE)
```

## Introduction

Now that you have successfully completed the tutorial of part 1, it is time to test your abilities in R. You will now be asked to carry out all tasks in the **R console**. A big difference is that all variable names and values will now be stored, and displayed in the **Environment** tab on the right-hand side in RStudio. Variables can be overwritten any time you want, but if you want to get rid of them you need to use the `remove()` command.

As far as syntax is concerned: if a command that you type is not complete, you will see the `+` sign appear in the R console. You can then just complete the command and hit Enter to finish. If you wish to abort the command, hit the *Esc* key.

You can repeat a command that you have issued previously by using the 'up' arrow on your keyboard. Also, when you quit your RStudio project for the first time you will be given the option to store all commands that you have used in a file called `.Rhistory`, which can be useful if you want to re-use pieces of code. However, if you want to keep better control of what you are doing, it is highly recommended that you create an **R script** file. This will allow you to save bits of code for later.

### Credits

This tutorial was developed for Archon Research School of Archaeology by Philip Verhagen (Vrije Universiteit Amsterdam) and Bjørn P. Bartholdy (University of Leiden). All content is CC BY-NC-SA: it can be freely distributed and modified under the condition of proper attribution and non-commercial use.

**How to cite:** Verhagen, P. & B.P. Bartholdy, 2020. "Rchon statistics course, part 1". Amsterdam, ARCHON Research School of Archaeology. doi: 10.5281/zenodo.4094686

The current version of this tutorial was created on 16 October 2020.

## The data set

For the test, we are using Table 3.5 from ***R.D. Drennan (2009): Statistics for Archaeologists (Springer)***, containing areas in hectares of sites found in Nanxiong, China. It is organized slightly differently from the tables used in the tutorial, since it contains a field **PERIOD** with two categories, **EBA** and **LBA** (Early and Late Bronze Age).

As before, read the table into R. For the test, we want to analyze and compare the site areas per period. For this, we need one extra trick to separate the data in the dataframe according to period. This is very simply done by specifying the term we are looking for in the second column. This piece of code will select all areas for the EBA period:

`Tab35$AREA_HA[Tab35$PERIOD == "EBA"]`

By now, you will know how to store this information in separate vectors and analyze them.

The question of deciding on the right interval size for the histograms becomes more urgent here, since you will notice that the values for the two periods are quite distinct. The best strategy would be to first compare the ranges and number of observations in both periods to see what would be the best equal interval for either period, and then experiment to see whether a middle ground can be found between the two. The breaks defined for the final plots would then need to line up exactly for both histograms for the best comparison. Good luck!

## Assignment

Use the R commands treated in the tutorial to answer the following questions:

1) What are the mean and median site area for each period?
2) What are the standard deviations?
3) Are there any outliers for any of the periods?
4) Finally, produce combined boxplots and histograms of site areas for both periods.
5) On the basis of the descriptive statistics and visualisations, do you think that there is a difference in mean site size for the periods?

## Hints

1. Start by importing Table 3.5 and assigning it to a dataframe variable.
2. Separate the values in Table 3.5 for EBA and LBA into two vector variables.
3. Calculate mean, median and standard deviations for each using the appropriate commands.
4. Find the outliers for each by first establishing the lower and upper outlier bounds.
5. Make a combined boxplot.
6. Combining histograms is the trickiest part. The intervals of both plots need to be synchronized and there is no single solution to this:
    a) work out what would be a good interval width, based on the values in both EBA and LBA
    b) determine the breaks on the basis of that
    c) determine `ylim` on the basis of the frequencies per interval in both datasets; use the `plot = FALSE` option with `hist()` for this
    d) combine both histograms in one figure, using transparent colours for the second one

The correct answers can be found separately on the Zenodo repository.
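
To make the mechanics of hint 6 more concrete, here is a minimal sketch on made-up numbers. It is not the solution to the assignment: the vectors `eba` and `lba` hold hypothetical values rather than the areas from Table 3.5, and the interval width of 2 ha is an arbitrary assumption that you would need to reconsider for the real data. With the real table, the two vectors would instead come from a selection such as `Tab35$AREA_HA[Tab35$PERIOD == "EBA"]`.

```r
# Hypothetical site areas in hectares; these are NOT the values from Table 3.5
eba <- c(0.8, 1.2, 1.9, 2.4, 3.1, 4.6)
lba <- c(1.5, 3.8, 5.2, 6.9, 8.4, 11.7)

# a) + b) one set of breaks covering both vectors (assumed width: 2 ha)
width <- 2
all_values <- c(eba, lba)
breaks <- seq(floor(min(all_values) / width) * width,
              ceiling(max(all_values) / width) * width,
              by = width)

# c) count the observations per interval without drawing anything,
#    so that a common upper limit for the y axis can be determined
h_eba <- hist(eba, breaks = breaks, plot = FALSE)
h_lba <- hist(lba, breaks = breaks, plot = FALSE)
y_max <- max(h_eba$counts, h_lba$counts)

# d) draw the first histogram, then overlay the second one
#    with a semi-transparent colour (the fourth rgb() value is alpha)
hist(eba, breaks = breaks, ylim = c(0, y_max),
     col = rgb(0, 0, 1, 0.5),
     main = "Site area per period (made-up data)",
     xlab = "Area (ha)")
hist(lba, breaks = breaks, col = rgb(1, 0, 0, 0.5), add = TRUE)
legend("topright", legend = c("EBA", "LBA"),
       fill = c(rgb(0, 0, 1, 0.5), rgb(1, 0, 0, 0.5)))
```

Because both calls to `hist()` receive the same `breaks` vector, the bars of the two periods line up exactly. Rounding the lowest and highest break to a multiple of the width is just one possible choice; other widths or starting points may suit the real data better.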