## This is an R tutorial written by Bobby Ruijgrok and Elly Dutton		##
## for the BA Taal en Cognitie course Methods II 						##
## Leiden, September 2016
## revised January 2017, July 2017  												##
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##


## Welcome to this R tutorial!

## Before you do this tutorial, you need to have read the relevant sections of the reader,
## to make sure you understand the basics of how R works.

## To follow this tutorial, you should run every line that isn't a comment.
## To run a command, select the line(s) in the script and press Cmd+Enter (Mac) or Ctrl+Enter (Windows).
## You'll then be able to see the output for yourself in the Console (if you're using RStudio, it's down below).
## Try it now:
5+7
##...now take a look at the Console!

## In the Methods reader we mentioned how doing analysis with R is like writing a recipe, 
## not just making a cake. But a good recipe needs clear instructions!
## So you should add comments to your script (your "recipe"), so that you can follow it again later.
## To make a comment, put the text after a # (you can even do two ## or more).
## When running the code, R will ignore everything from a # to the end of the line.
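## A small illustration (not part of the analysis): a comment can also follow a command on the same line.
5 + 7   # R calculates 5 + 7 and skips this remark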

## When you start to write your own scripts, ALWAYS add comments next to the commands that you put in the script.
## This will save you a lot of time trying to understand your own script later.
## It also means that other people can understand your script (and follow your recipe if they want).

## OK, let's get started...


############################ PART I ############################

	##  ------ 1 : Always start by selecting a working directory  ------ 

## Work in an organised way and use dedicated folders for data you'll work on.

## Every time you start R, you need to select your working directory.
## The working directory is where R will look for data that you want to import, or save data that you want to keep.
## (We'll go through the steps for saving output later on.)

## Press Ctrl+Shift+H now to choose the working directory...

## For this tutorial, you should choose the directory of the unzipped folder; 
## it contains this script and two folders called "data" and "output".
## The Console will show the set directory command once you have set the directory; something like setwd("~/dropbox/R tutorial")

## You can always check the working directory with this command:
getwd()
## In the Console you should see the file path of the working directory you just chose.
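## A quick extra check (a small sketch): list.files() shows what is inside the working directory.
## If you picked the right folder, the list should include "data" and "output".
list.files()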


	##  ------ 2 : Defining some variables  ------ 

## First let's look at how to define variables in R. Defining variables is a core part of any programming.
## All this means is that you declare a name, and tell the program what that name refers to. Simple!
## The variable is then stored as an object.
## In RStudio you can look at the Global Environment on the right to see what you've stored.
## Give it a try:
mean_age <- 22

## Call the contents of an object by typing its name and running it:
mean_age
## now look at the Console!

## If you want to store text (a string), it should be put between quotation marks.
employee.of.the.month <- "bobby"
## Now call the contents of this object (like you did with mean_age). 

## Once you have defined a numerical object you can use it in a calculation. 
## Divide mean_age by 5; the answer will be shown in the Console!
mean_age/5

## If you'd like to output text in the Console, you can use print("Whatever text you want to print"):
print("Then we calculated the mean age divided by 5:")
mean_age/5

## How do you store the outcome of this calculation so you can refer back to it later?

## Easy: you define it as an object.
mean.age.div.by.5 <- mean_age/5
## and just to prove it:
mean.age.div.by.5
## It should be showing up in the Global Environment too!
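## Not sure whether an object holds a number or text? class() will tell you.
## A small sketch; the object names starting with 'example.' are made up for illustration:
example.number <- 22
example.text <- "22"
class(example.number)   # "numeric": you can calculate with it
class(example.text)     # "character": this is text, even though it looks like a number
example.number / 2      # works fine, because example.number is numeric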


	##  ------ 3 : Loading external data  ------ 

## R calls datasets "dataframes".
## Although you can create dataframes in R, you will normally load data that have been generated elsewhere
## (e.g. PsychoPy output).
## Now we're going to load the datafile for this tutorial.
## It's a csv file that shows a list of English-Dutch cognates. We're going to check 
## whether the mean frequency per million differs between the English (E) and Dutch (D) word lists.

## So let's load the data:
cognates <- read.table("./data/cognates.csv", header = TRUE, sep = ",")
## What do the different parts of the command do? Let's take it step by step:
## We first assign our data object a name so that we can easily use it later. Here we name it "cognates".
## read.table() is the command that, well, reads the table! 
## (There are also other possibilities for reading files, but for now we'll stick with read.table.)
## The full stop at the start of the file path (./) represents the location of the working directory. 
## We ask R to go from here into the subfolder called "data".
## If your working directory is itself a subfolder sitting next to "data", you'd have to type an extra full stop: ../ in order to 
## get R to move up one level before looking for "data".
## So then it would be something like this: "../data/cognates.csv".
## ...Confused? Well you can always just store your files in the working directory itself! 
## Then you can just use the filename: e.g. "cognates.csv".
## Our table has headers, so we use header = TRUE. 
## Finally, the delimiter (or "separator") in this .csv is a comma, i.e. sep = ",".
## Csv files (short for "comma-separated values") contain tables in which the values
## are separated by a comma (of course), semicolon, tab or space; 
## actually, any character can be used but these are the most common. 
## Tip: If a csv file won't load properly, check you have specified the right separator!
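## To see why the separator matters, here is a small sketch using a temporary file.
## (The file and its contents are made up for illustration; they are not part of the tutorial data.)
example.file <- tempfile(fileext = ".csv")
writeLines(c("word;frequency", "hotel;12.5", "film;30.1"), example.file)
## With the wrong separator (a comma), R finds only one column:
ncol(read.table(example.file, header = TRUE, sep = ","))
## With the right separator (a semicolon), R finds two columns:
ncol(read.table(example.file, header = TRUE, sep = ";"))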


## In RStudio, under Data in the Global Environment you'll see the table listed as an object.  
## You can display it here by clicking the table icon at the right under "Environment".
## Alternatively you could run the object 'cognates' to see its contents appear in the Console:
cognates

## So, we have a list of English-Dutch cognates with their frequencies per million.
## In the end we'd like to know to what extent the English frequencies differ from the Dutch.
## By now you know the drill: we start with descriptive statistics, then visualise the data, 
## and finally, run a statistical test.


## Before continuing, take some time to get familiar with the dataset structure and variables, 
## so that you will be able to understand how the commands relate to the data!
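## Some handy commands for getting to know a data frame are head(), str() and summary().
## A small sketch with a made-up data frame (the name 'example.data' and its values are invented for illustration);
## you can use the same commands on 'cognates'.
example.data <- data.frame(word = c("hotel", "film", "water"),
                           frequency = c(12.5, 30.1, 250.0))
head(example.data)      # shows the first rows (up to six)
str(example.data)       # shows the structure: each column and its type
summary(example.data)   # gives a quick summary of each column
nrow(example.data)      # the number of rows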



    ##   ------ 4 : Descriptive statistics ------ 

## So first we want to obtain the descriptive statistics for word frequency, per language.

## For the functions we use we will need an additional package called "pastecs".
## You can install it by using the following command:
install.packages("pastecs")
## Alternatively, you can click the Packages tab in the Files-plot-packages-help panel (bottom right), 
## then search for the package and install it there.
## Sometimes you'll need to restart RStudio during installation of a package; that's fine. 
## In that case, everything that you had open will reappear after the restart.

## Once you've installed a package, you need to load it, like this:
library(pastecs)
## Note: every time you start a new R session you'll need to load up the packages you need 
## by using the library() command.
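## Not sure whether a package has already been installed on your computer? A small sketch for checking:
## requireNamespace() gives TRUE if the package is installed and FALSE if it isn't (it does not install anything).
requireNamespace("pastecs", quietly = TRUE)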

## Now to get those descriptive statistics...

## To refer to a variable (column) in your table, you type the dataset name
## followed by the dollar sign $ then the column name. Like this:
cognates$frequencyMil
# To show the descriptive statistics of one variable you can use the function stat.desc()
stat.desc(cognates$frequencyMil)
## To get descriptive statistics by language you should use the by() function.
by(cognates$frequencyMil, cognates$Language, stat.desc)
## This function takes the frequency variable, groups it by the Language variable, 
## and runs the function stat.desc on each group.
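## by() works with any function, not just stat.desc. A small sketch with made-up numbers
## (the data frame 'example.scores' is invented for illustration), using mean() from base R:
example.scores <- data.frame(group = c("a", "a", "b", "b"),
                             score = c(1, 3, 10, 20))
by(example.scores$score, example.scores$group, mean)
## The mean is 2 for group a, and 15 for group b.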

## EXERCISE: Assign the descriptive statistics (grouped by language) to an object. Give it an appropriate name.
##           Then check the output in the Console by typing the name of the object you've just created.
##           How do the means compare between the two languages?
##           Is the data similarly distributed in both languages?



	##  ------ 5 : Visualise using a boxplot and a bar chart ------ 

## R has basic plotting possibilities already included, but the package "ggplot2" is way more versatile.
## ggplot2 works with layers: you define the elements of your plot in steps, and combine them.
## We'll do this now for our frequency data.

## Install and load the package (like you did with pastecs):
install.packages("ggplot2")
library(ggplot2)

## And for extra plotting options we also need the "Hmisc" package.

## EXERCISE: install and load the Hmisc package!

## Now we can start building the plot layer by layer:

## Layer 1: 
## define the basics: the x-axis (grouping variable) and y-axis (frequencies)
cognates.layer1 <- ggplot(cognates, aes(Language, frequencyMil))
## Let's break down the command again:
## we want to establish the basic layer of our plot, calling it cognates.layer1;
## then we are asking ggplot to use the dataset "cognates";
## and we define the x and y axis of the plot using aes() -- 
## aes stands for aesthetics, and the command takes the form aes(x,y).

## Layer 2: 
## now we can create the boxplot layer using the geom_boxplot() command:
cognates.layer2 <- geom_boxplot()
## (There are lots of different 'geoms' available, for creating all kinds of different graph types.)

## Layer 3: 
## this layer contains a main title, and labels for the x and y axes:
cognates.layer3 <- labs(title = "Frequencies Dutch-English cognates", x = "Language", y = "Frequencies per million")
## labs() is the command for creating all the labels, and inside it we specify title, x axis label and y axis label, as strings.
## Don't want a title? Just leave out the title = "..." part.

## Now, to bring it all together, define the whole plot as an object, adding all the layers:
cognates.Boxplot1 <- cognates.layer1 + cognates.layer2 + cognates.layer3
## Load the object to see it!

## QUESTION: How does the distribution of data compare between the two languages?
##           Does it fit the expectations you had, based on the descriptive statistics?

## It's always nice to use colours in your graphs (it's a must, actually).
## Within the aes() command, you can already assign default colours to levels of your grouping variable,
## using fill = grouping variable.
## So here, our grouping variable is the column Language:
cognates.layer1.colour <- ggplot(cognates, aes(Language, frequencyMil, fill = Language))

## EXERCISE: Re-define the boxplot object using the colourful version of layer 1. 

## If you want to, you can add ends to the whiskers to make them clearer.
## The width of the ends can also be changed, if they're too wide.
cognates.layer.clearwhiskers <- stat_boxplot(geom = "errorbar", width = 0.2)

## EXERCISE: Now add this layer to the boxplot. 
##           What is the difference between putting it before or after layer 2?

## It's very important to understand that ggplot2 builds graphs using layers.
## Once this is clear, you actually don't need to go through defining each layer as a separate object.
## Instead you can create an object with all of them at once!
## See how layers are added on using plus signs (+), just as we did above:
cognates.Boxplot2 <- ggplot(cognates, aes(Language, frequencyMil, fill = Language)) + geom_boxplot() + labs(x = "Language", y = "Frequencies per million")
## So, like many things in R, there's more than one way to build up the plot. Whatever works best for you!

## Whereas a boxplot gives insight into the range of scores, a bar chart can be used to compare the means.
## How would we make a bar chart instead of a boxplot?
## Let's look at each layer individually:
## Layer 1 said, "Using cognates dataset, please plot frequencies by language (and colour languages differently)".
## Layer 2 said, "Please make it a boxplot."
## Layer 3 said, "Give it this title and label the axes like this."

## Well, for a bar chart, we want the same dataset, the same variables, and the same labels. 
## So we only need to change layer 2! (An advantage of the layer approach!)
## Instead of using geom_boxplot(), we will define our new layer 2 using the following command:
cognates.layer2.bar <- stat_summary(fun = mean, geom = "bar")
## (In ggplot2 versions older than 3.3.0 this argument was called fun.y instead of fun.)
## This code probably looks a bit tricky. But basically this tells R to find 
## the mean of the y-axis variable (frequency), and present the information as a bar chart.

## EXERCISE: create a bar chart object like we did with the boxplot object.
##           Include the colourful layer 1, the new layer 2, and layer 3.
##           Load (i.e. run) the bar chart object so you can view it.
##           What can you tell about the difference between the means?

## Learn more about plotting using ggplot2 by reading 
## Chapter 4 of Discovering Statistics Using R, by Field et al.



	## ------ 6 : Comparing the means using a t-test -----

## We want to know whether there is a difference in means. We need statistics!
## Will the statistics uphold the intuitions we got from exploring and visualising the data?

## As you know from statistics class, you can compare means using a t-test.
## QUESTION: should it be an independent or dependent t-test?

## As you saw in the reader, the t-test function looks like this:
## t.test(variableA~variableB) 
## Here, 'variableA' is a scale variable, and 'variableB' is a binary factor.
## So, 'variableA' is the numeric variable that we are looking at the means of. For us, that's frequency.
## Then 'variableB' is the grouping variable. For us, that's language. 
## You can read it as, 'variableA by variableB', or in our case, 'frequency by language'.
## (Note: The data needs to be in long format for this function to work!)
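## To see the 'variableA by variableB' form in action, here is a small sketch with ToothGrowth,
## a dataset that comes with R (it is not part of this tutorial's data):
## 'len' is a numeric variable, and 'supp' is a factor with two levels.
example.ttest <- t.test(len ~ supp, data = ToothGrowth)
example.ttest            # prints the full output
example.ttest$p.value    # you can also pick out single parts of the output, such as the p-value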

## So far so good? OK, here's the moment you've been waiting for:
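cognates.ttest <- t.test(frequencyMil ~ Language, data=cognates, paired = FALSE)
## (Note: recent versions of R, from 4.4.0 onwards as far as we know, no longer accept 'paired' in this 
## formula form of t.test(). If you get an error mentioning 'paired', leave that part out and run the command again.)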
## So looking step by step through the command:
## it's a t.test, comparing frequencyMil between the two categories of language; 
## using the cognates dataset, and... 
## ...QUESTION: What do you think 'paired = FALSE' means? (and in what situation would we use 'paired = TRUE'?)

## EXERCISE: Now run the ttest object.
##           Inspect the output; what can you say about the means now?

## Of course, there's more to life than t-tests. To learn how to run other statistical tests in R, 
## we recommend you consult Discovering Statistics Using R by Field et al.



## ------ 7 : Saving output ------

## The first thing you'll probably want to save are your fancy plots. 
## This can easily be done using the Export button at the top of the plot window.
## You have various options for the type and size of image (we recommend pdf).
## Alternatively, you can also save it directly from R using this code:
pdf("./output/pics/cognates_boxplot.pdf")
cognates.Boxplot2
dev.off()
## Notice that we're saving it in a folder called 'pics', 
## which is in a folder called 'output' within our working directory.
## (You might choose to organise your files differently, of course.)
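## If pdf() gives an error saying it cannot open the file, the folder probably doesn't exist yet.
## A small sketch for creating it from R (assuming you want the same folder structure as above):
dir.create("./output/pics", recursive = TRUE, showWarnings = FALSE)
## recursive = TRUE also creates 'output' if needed; showWarnings = FALSE keeps R quiet if the folder is already there.
dir.exists("./output/pics")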

## EXERCISE: save your cognates bar chart using the code above, and open it 
##           by browsing to the folder using Windows Explorer (or Finder, on a Mac).


## You might also want to save some text output.
## Why would you want to do this? Well, although you have all your commands in your script,
## sometimes it's handy to save the output of your statistical tests.
## You can save text directly from R as a .txt file;
## for this we use the sink() function (think of it as 'sinking' the text into a file).

## Let's try this out.
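## We'll save the data from the objects 'cognates.descriptive.freqbylanguage' and 'cognates.ttest'.
## (The first one is the object you created in the exercise in section 4; if you gave it a different name, use that name below.)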
## Within the folder called 'output', we've also got a folder called 'stats' for this kind of output.
## (We like to be organised.)
## 'cognates_statistics.txt' will be the name of the file.

sink("./output/stats/cognates_statistics.txt")

print("Descriptive statistics")
cognates.descriptive.freqbylanguage 

print("Independent two group t-test")
cognates.ttest

sink()
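## Important: always finish with an empty sink() when you're done, otherwise
## all further output keeps going into the file instead of the Console.
## A small sketch using a temporary file (made up for illustration), which we read back with readLines():
example.txt <- tempfile(fileext = ".txt")
sink(example.txt)
print("This line goes into the file, not the Console")
sink()
readLines(example.txt)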

## EXERCISE: Find the text file in your folder and open it. Does it look how you expected?



## ------ 8 : Merging several files ------

## Very often you'll end up having one file per participant, 
## and to run analyses you'll first need to get all the data in one file.
## E-Prime data can be merged in E-Merge, as you know. What about other programs like PsychoPy?
## There are a few options. You could do it by hand (no way!!)
## or by using a Visual Basic script in Excel (complicated unless you know VBA!)
## You'll be glad to know there's a fairly simple function to do it in R!
## So let's give it a go.

## In the folder 'data' you'll see there's a folder called 'raw'. 
## In this folder the cognates for English and Dutch can be found in two separate files.
## To practice merging files, we'll merge these two. 
## After that they should look like the file you've been working on already.

## The package that you'll need is "plyr":
install.packages("plyr")
library(plyr) 

## First of all, we use a command to collect up all the csv files in a certain directory:
## (Another good reason why you should organise your files clearly in separate folders!)
csv.files <- dir(path = "./data/raw", pattern = "csv$", full.names=TRUE)
## This creates an object 'csv.files' containing all the csvs in the 'raw' folder within 'data'.
## (don't worry about the rest of the function for now!)

## Look at the Environment panel (right). Our 'csv.files' object can be found in the section 'Values'.
## Now we want to merge the files in this object into a single dataset! 
## For this we use the ldply() function (from "plyr").
cognates.merged <- ldply(csv.files, read.csv, header = TRUE, sep = ",")
## Here we create a 'cognates.merged' object, by asking ldply() to read and append each csv in turn.
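## For the curious: the same kind of merge can be sketched without plyr, using only base R.
## This is an alternative, not the method used in this tutorial; the two small files below are
## made up for illustration and written to a temporary folder.
example.dir <- file.path(tempdir(), "example_raw")
dir.create(example.dir, showWarnings = FALSE)
write.csv(data.frame(word = c("hotel", "film"), Language = "English"),
          file.path(example.dir, "english.csv"), row.names = FALSE)
write.csv(data.frame(word = c("hotel", "film"), Language = "Dutch"),
          file.path(example.dir, "dutch.csv"), row.names = FALSE)
example.files <- dir(path = example.dir, pattern = "csv$", full.names = TRUE)
## lapply() reads each file in turn; do.call(rbind, ...) stacks the resulting data frames on top of each other.
example.merged <- do.call(rbind, lapply(example.files, read.csv))
example.merged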

## EXERCISE: Load the 'cognates.merged' object and compare it 
##           with the object 'cognates' that you stored earlier.
##           Do the two objects look the same?

## Let's save the merged data as a new csv file so we don't lose it.
## This will also make it available to use in other programs.
## Run this command:
write.csv(cognates.merged, "./data/cognatesMerged.csv", row.names = FALSE)
## This writes the object 'cognates.merged' to a csv file called cognatesMerged.csv in the 'data' folder.
## R usually creates row numbers, so we use row.names = FALSE to leave them out of the csv file.

##

## You now have the basics to enable you to complete the final exercise of chapter 15. 

## If you're feeling good and progressing smoothly, go ahead with Part II, 
## which gives you extra tips for handling data and saving output.

## If on the other hand things are not making much sense, 
## try reviewing the script from the start (or give us a shout!)



############################ PART II  ############################


## ------ 9 : Reshaping data ------

## In chapter 9 of the reader we've shown you that data can be stored in 'wide' and 'long' format.
## Whereas SPSS usually takes wide format as input, 
## many procedures in R require long format (including plotting in ggplot2!).
## In the folder 'data' you'll find a file called catsWide.csv. Let's load it:
cats.wide <- read.table("./data/catsWide.csv", header = TRUE, sep = ",")

## First take a look at the data to have an idea of what you're dealing with.
## Does it look familiar? (If not, check back to figure 12 in chapter 9 of the reader!)

## The data is in wide format, and we want to reshape it into long format.
## We do this using the melt() function. Think of it as 'melting' the wide data
## so that it can be 'reshaped' as long data. (If you like.)

## melt() comes from the package "reshape" - so you have to install and load this first!
install.packages("reshape")
library(reshape)

## The form of the function is as follows: 
## new.data.frame <- melt(old.data.frame, id = c(constant variables), measure.vars = c(variables that change across columns)).

## This seems complex, but let's break it down by section:

## We want to create a new dataframe, by melting our old data frame. So we start like this:
#         cats.long <- melt(cats.wide, ...

## Now, constant variables are the ones that stay the same across several measurements. 
## In our case, it's the participants. So we put that after 'id' like so:
#         cats.long <- melt(cats.wide, id = c("participant"), ...
## Names of variables (columns) have to be in quotation marks here.
## Note: If you have other constant variables (like participant age, or counterbalancing group) 
## you can list them like this: c("participant", "age", "group").
## The code c() combines several items into one list of items (a 'vector'), as we do here.

## After 'measure.vars', you include the measurement variables.
## In a wide data set, you'll have several!
## Notice again the c() which we use to make a list.
cats.long <- melt(cats.wide, id = c("participant"), measure.vars = c("RT_picture1", "RT_picture2", "RT_picture3", "RT_picture4", "RT_picture5", "RT_picture6"))
## Now we have made the full function - run it to reshape the dataset!

## EXERCISE: compare the objects cats.wide and cats.long (in the Console).
##           Compare the outputs to figure 12 in Chapter 9.

## You'll see that the measured conditions are stacked on top of each other in one column. 
## The columns have weird names that show the measured variables.
## So let's rename the columns, to make it clearer.
## To do this, we use the names() function.
## This function takes a dataset and renames each column with a string you provide. 
## You provide the names in a list using c():
names(cats.long) <- c("participant","condition","RT")
## So note that you have to assign *every* column a name here, 
## including the ones you want to stay the same (e.g. "participant").

## And here's a tip: if you ever want to output a list of the column names in your dataset,
## just type names(dataset)! We can demonstrate this by checking that our dataset
## columns have been renamed correctly:
names(cats.long)
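## If you only want to rename one column, you can pick it out by its current name.
## A small sketch with a made-up data frame (the names and values are invented for illustration):
example.rt <- data.frame(participant = c(1, 2), value = c(512, 634))
names(example.rt)[names(example.rt) == "value"] <- "RT"
names(example.rt)
## The column 'value' is now called 'RT'; 'participant' is left as it was.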

## Reshaping long format to wide format is done with cast(), 
## which you can think of as the counterpart to melt().
## The form is like this:
#     new.data.frame <- cast(old.data.frame, variables coded within a single column ~ variables coded across many columns, value = "outcome variable")
# So let's try 
cats.wide.reshape <- cast(cats.long, participant ~ condition, value = "RT")
## Compare it to cats.wide - is it exactly the same?



  ## ------ 10 : Changing the font family of graphs ------

## The default font is Helvetica.
## Maybe you'd like to change the font in the graph, e.g. to match it with your presentation.
## The easiest way is to change the font of your graph while saving it.
## To see the fonts available in R that you can embed in a pdf, use this command:
names(pdfFonts())
## For example, if we want the font to be Palatino, we add family = "Palatino" in the pdf() function.
pdf("./output/pics/cognates_boxplot_pal.pdf", family = "Palatino")
cognates.Boxplot2
dev.off()

## You can also adjust dimensions of the plot when saving it.
## The default is 7 inches by 7 inches, but you can override it like this:
pdf("./output/pics/cognates_boxplot_pal_wide.pdf", family = "Palatino", width = 8, height = 5)
cognates.Boxplot2
dev.off()

## More options for fonts can be achieved by using the package "extrafont" but it may take a while to install it.



  ## ------ 11 : Extra options to pimp your graphs ------

## To override default colours you could add + scale_fill_manual() as another layer to a plot.
## This allows you to specify your own choice of colours for different data.
## We'll use cognates.Boxplot2 from section 5 above and add the extra layer to specify colours 
## for the grouping variable "Language". For example, English in Blue and Dutch in Green would be:
cognates.Boxplot.Mycolours <- cognates.Boxplot2 + scale_fill_manual("Language", values = c("English" = "Blue", "Dutch" = "Green"))

## Or you could assign colour codes yourself; however be consistent in presenting your data!
## For example, we could use Hex colour codes #3366FF and #336633 instead of plain blue and green.
cognates.Boxplot.Hexcolours <- cognates.Boxplot2 + scale_fill_manual("Language", values = c("English" = "#3366FF", "Dutch" = "#336633")) 
## See chapter 18.6 of the reader for colour codes!
## This is also useful: http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually
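## R also knows a long list of colour names. A small sketch for exploring them:
head(colours())             # shows the first six names in the list
"magenta" %in% colours()    # checks whether a particular name is in the list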

## EXERCISE: Change the colours of the frequencies bar graph that you made earlier.
##           Use the colour Magenta for English and Cyan for Dutch.


## The appearance of text and other layout elements can be changed with the theme() function.
## For example, we could leave out the plot legend like this:
cognates.Boxplot.Hexcolours + theme(legend.position = "none")

## And, we could put the title in bold font
cognates.Boxplot.Hexcolours + theme(plot.title = element_text(face="bold") ,legend.position = "none")

## Endless possibilities...
## By the way, it is good to know that for any function in R you can ask for quick help
## by typing a question mark in front of the function.
## For example,
?theme()
## will show you the elements that you can adjust with this function (and how!).
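## Two more ways of getting help (a small sketch, using functions you have already met):
help("t.test")      # does the same as ?t.test
args(read.table)    # shows the arguments a function takes, with their default values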



  ## ------ 12 : The scatterplot from chapter 1 ------

## Have a look at figure 1 in the first chapter of the reader. 
## We promised to show you how to produce a scatter plot in R.

## As you can see the x-axis represents reading speed and y-axis the frequencies (layer 1!)
## Layer 2 should tell R: "Make this a scatterplot"
## You'll need the command geom_point() for this (without anything between brackets).

## Like layer 1, layer 3 is business as usual -- specifying labels.

## Adding a regression line requires an extra layer, adding the function geom_smooth().
## As you know, a regression line represents the relationship between two variables.
## To get a straight line (i.e. a linear model) you need to tell geom_smooth() to use "lm" as method.
## By default, geom_smooth() shows a shaded 95% confidence interval around the regression line;
## the alpha setting controls how transparent this shading is (0 = invisible, 1 = solid), and fill sets its colour.
## We did not want the shading in our introduction graph, so we set alpha to zero.
## In summary, for a linear regression line (blue by default) without visible shading you'd need the following geom_smooth contents:
#    geom_smooth(method = "lm", alpha = 0, fill = "Blue")


## EXERCISE: Load the data from fakeCorrelation.csv. 
##           Reproduce figure 1 from chapter 1.
##           Try adding a shaded 95% confidence interval around the regression line.


##

## This is it for now!
## We hope you'll enjoy using R -- remember that learning it is best done by using it!