R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> #
> ##
> ###
> ## Start log
> logname = "log_r_basic_gwas.txt"
> mylog = file(logname, open = "wt")
> sink(mylog, append = TRUE, type = "message")
> Sys.time() ; sessionInfo()
[1] "2017-09-04 16:23:42 BST"
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.0
> 
> 
> ## DESCRIPTION
> ## Reads GWAS results for each sex.
> ## Generates manhattan, p-value distribution and QQ-plots, for each sex.
> 
> 
> ## Load packages
> library(plyr) ## For cumulative calculation of FDR values.
> library(tidyverse) ## Data formatting, and fancy plots.
> library(cowplot) ## Put fancy plots in one figure.
> library(dtplyr) ## dplyr-data.table compatibility
> library(data.table) ## Loading big data.
> library(car) ## Recoding character variables.
> 
> 
> 
> ## Load Female and Male GWAS results, combine, and format the phenotype identifiers.
> df_raw = 	do.call("rbind", lapply(
+ 	list.files(pattern = "*.gwas_res.txt", recursive = TRUE),
+ 	function(x) cbind(Phenotype = gsub(".gwas_res.txt", "", x),
+ 		fread(x, verbose = TRUE, sep = " ")))) %>%
+ 	mutate(Phenotype = gsub("lhm_hc_2017.", "", Phenotype))
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.069151 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Using supplied sep ' ' ... found ok
Detected 17 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: SNP CHR BP
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 624693 (including 1 at the end)
Count of sep: 9995072
nrow = MIN( nsep [9995072] / (ncol [17] -1), neol [624693] - endblanks [1] ) = 624692
Type codes (point  0): 44113333334444431
Type codes (point  1): 44113333334444431
Type codes (point  2): 44113333334444431
Type codes (point  3): 44113333334444431
Type codes (point  4): 44113333334444431
Type codes (point  5): 44113333334444431
Type codes (point  6): 44113333334444431
Type codes (point  7): 44113333334444431
Type codes (point  8): 44113333334444431
Type codes (point  9): 44113333334444431
Type codes (point 10): 44113333334444431
Type codes: 44113333334444431 (after applying colClasses and integer64)
Type codes: 44113333334444431 (after applying drop or select (if supplied)
Allocating 17 column slots (17 - 0 dropped)

Read 80.0% of 624692 rows
Read 624692 rows and 17 (of 17) columns from 0.069 GB file in 00:00:03
Read 624692 rows. Exactly what was estimated and allocated up front
   0.006s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.092s (  4%) Count rows (wc -l)
   0.002s (  0%) Column type detection (100 rows at 10 points)
   0.697s ( 28%) Allocation of 624692x17 result (xMB) in RAM
   1.695s ( 68%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.003s (  0%) Changing na.strings to NA
   2.495s        Total
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.069330 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Using supplied sep ' ' ... found ok
Detected 17 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: SNP CHR BP
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 624693 (including 1 at the end)
Count of sep: 9995072
nrow = MIN( nsep [9995072] / (ncol [17] -1), neol [624693] - endblanks [1] ) = 624692
Type codes (point  0): 44113333334444431
Type codes (point  1): 44113333334444431
Type codes (point  2): 44113333334444431
Type codes (point  3): 44113333334444431
Type codes (point  4): 44113333334444431
Type codes (point  5): 44113333334444431
Type codes (point  6): 44113333334444431
Type codes (point  7): 44113333334444431
Type codes (point  8): 44113333334444431
Type codes (point  9): 44113333334444431
Type codes (point 10): 44113333334444431
Type codes: 44113333334444431 (after applying colClasses and integer64)
Type codes: 44113333334444431 (after applying drop or select (if supplied)
Allocating 17 column slots (17 - 0 dropped)
Read 624692 rows. Exactly what was estimated and allocated up front
   0.006s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.098s (  7%) Count rows (wc -l)
   0.001s (  0%) Column type detection (100 rows at 10 points)
   0.412s ( 28%) Allocation of 624692x17 result (xMB) in RAM
   0.943s ( 64%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.003s (  0%) Changing na.strings to NA
   1.465s        Total
> 
> 
> 
> ## General plot settings.
> theme_w = function() {
+ 	theme_bw(base_size = 10) +
+ 		theme(
+ 			axis.ticks = element_line(size = .3),
+ 			panel.grid = element_blank(),
+ 			strip.background = element_rect(color = "grey90", fill = "grey93", size = .3))}
> 
> 
> 
> ## Plot p-value distributions.
> pdist = ggplot(df_raw, aes(P, colour = Phenotype)) +
+ 	geom_histogram(fill = "white", bins = 60) +
+ 	scale_colour_manual(values = c("red4", "blue4")) +
+ 	scale_x_continuous(name = "p-value category", expand = c(.02,0),
+ 		limits = c(0, 1),
+ 		breaks = seq(0,1, .1),
+ 		labels = c(0,"",.2,"",.4,"",.6,"", .8, "", 1)) +
+ 	scale_y_continuous(name = "Count of tests", expand = c(.01,.01))+
+ 	facet_wrap("Phenotype") +
+ 	theme_w() +
+ 	theme(
+ 		axis.line.x = element_line(size = .3, colour = "black"),
+ 		axis.line.y = element_line(size = .3, colour = "black"),
+ 		legend.position = "none",
+ 		plot.margin = unit(c(2,.5,2,.5),"lines"))
> 
> 
> 
> ## Manhattan plot
> mh = ggplot(df_raw %>% select(
+ 	Phenotype, CHR, BP, P) , aes(BP/1e+06, -log10(P), colour = Phenotype)) +
+ 	geom_point(shape = 21, stroke = .2, size = .3) +
+ 	scale_colour_manual(values = c("red4", "blue4")) +
+ 	scale_x_continuous(name = "Genomic position (Mb)", expand = c(0.01,0),
+ 		breaks = seq(0,40,5), labels = c(0, "", 10, "", 20, "", 30, "", 40)) +
+ 	scale_y_continuous(name = "-log10.p (basic GWAS test)", expand = c(0,0.02), limits = c(0, 7.1),
+ 		breaks = seq(0,7,1), labels = c(0,"",2,"",4,"",6,"")) +
+ 	facet_grid(Phenotype~CHR, space = "free_x", scale = "free_x") +
+ 	theme_w() +
+ 	theme(
+ 		axis.line.x = element_line(size = .2, colour = "black"),
+ 		axis.line.y = element_line(size = .2, colour = "black"),
+ 		legend.position = "none",
+ 		strip.placement = "outside")
> 
> 
> 
> 
> ## FUNCTION : Calculate expected p-values (+95% CI) for QQ plot.
> fun_cal_qq = function(df, mytest){
+ 	x = df %>%
+ 		sample_frac(1) %>%
+ 		filter(Phenotype == mytest) %>%
+ 		select(P) %>%
+ 		arrange(P) %>%
+ 		mutate(expP = (1:length(P) / length(P))) %>%
+ 		mutate(log10p = -log10(P)) %>%
+ 		mutate(explog10p = -log10(expP))
+ 	c95 = as.numeric(rep(0,length(x$P)))
+ 	c05 = c95
+ 	for (i in 1:length(x$P)) {
+ 		c95[i] = as.numeric(-log10(qbeta(0.05,i,length(x$P) - i + 1)))
+ 		c05[i] = as.numeric(-log10(qbeta(0.95,i,length(x$P) - i + 1)))}
+ 	final = as.data.table(cbind(x, c05 = as.numeric(c05), c95 = as.numeric(c95), Phenotype = mytest))
+ 	rm(x) ; rm(c05) ; rm(c95) ; return(final) }
> 
> 
> 
> Sys.time() ; print("Making data for QQ plots...")
[1] "2017-09-04 16:23:54 BST"
[1] "Making data for QQ plots..."
> 
> df_qq = do.call("rbind", list(
+ 	fun_cal_qq(df_raw, "Female"),
+ 	fun_cal_qq(df_raw, "Male")))
> 
> Sys.time() ; print("Completed QQ plot data.")
[1] "2017-09-04 16:24:47 BST"
[1] "Completed QQ plot data."
> head(df_qq)
           P         expP   log10p explog10p      c05      c95 Phenotype
1: 2.633e-07 1.600789e-06 6.579549  5.795666 5.319164 7.085605    Female
2: 8.168e-07 3.201578e-06 6.087884  5.494636 5.119535 6.244995    Female
3: 1.458e-06 4.802367e-06 5.836242  5.318545 4.996617 5.883076    Female
4: 1.550e-06 6.403155e-06 5.809668  5.193606 4.906161 5.660113    Female
5: 1.550e-06 8.003944e-06 5.809668  5.096696 4.834080 5.501166    Female
6: 1.946e-06 9.604733e-06 5.710857  5.017515 4.773940 5.378523    Female
> 
> 
> ## QQ plot.
> qqplot =	ggplot(df_qq, aes(explog10p, log10p)) +
+ 	geom_ribbon(aes(ymin = as.numeric(c05), ymax = as.numeric(c95)),
+ 		fill = "grey70", alpha = .8, linetype = 0) +
+ 	geom_abline(intercept = 0, slope = 1, size = .1, colour = "grey40") +
+ 	geom_point(aes(colour = Phenotype), size = .6, stroke = .2) +
+ 	scale_colour_manual(values = c("red4", "blue4")) +
+ 	coord_fixed(ratio = 1, expand = c(0,0), xlim = c(0,7.1), ylim = c(0,7.1)) +
+ 	scale_x_continuous(name = "Expected -log10.p", breaks = seq(0,7,1), labels = c(0,"",2,"",4,"",6,"")) +
+ 	scale_y_continuous(name = "Observed -log10.p", breaks = seq(0,7,1), labels = c(0,"",2,"",4,"",6,"")) +
+ 	facet_wrap("Phenotype") +
+ 	theme_w() +
+ 	theme(legend.position = "none")
> 
> 
> 
> 
> Sys.time() ; print("Starting plots....")
[1] "2017-09-04 16:24:47 BST"
[1] "Starting plots...."
> 
> save_plot("plot_basic_gwas_lhm.png",
+ 	ggdraw() +
+ 		draw_plot(mh, 0, .5, 1, .5) +
+ 		draw_plot(pdist, 0, 0, .5, .5) +
+ 		draw_plot(qqplot, .5, 0, .5, .5) +
+ 		draw_plot_label(c("A", "B", "C"), c(0, 0, .55), c(.98, .48, .48), size = 10)
+ 	, base_height = 6, base_width = 8)
> 
> Sys.time() ; print("Plots saved to file.")
[1] "2017-09-04 16:26:53 BST"
[1] "Plots saved to file."
> 
> 
> 
> ## Links
> ## Morrow lab, Sussex Uni http://www.sussex.ac.uk/lifesci/morrowlab/
> ## Project https://zenodo.org/communities/sussex_drosophila_gwas/
> ## Plink https://www.cog-genomics.org/plink2
> 
> 
> ls(all.names = TRUE)
 [1] ".Random.seed" "df_qq"        "df_raw"       "fun_cal_qq"   "logname"     
 [6] "mh"           "mylog"        "pdist"        "qqplot"       "theme_w"     
> rm(list = ls())
> Sys.time()
[1] "2017-09-04 16:26:53 BST"
> print("William P. Gilks, wpgilks@gmail.com, University of Sussex 2017. End of script")
[1] "William P. Gilks, wpgilks@gmail.com, University of Sussex 2017. End of script"
> sink() ; unlink(mylog)