#+OPTIONS: ':nil *:t -:t ::t <:t H:3 \n:nil ^:nil arch:headline
#+OPTIONS: author:t c:nil creator:nil d:(not "LOGBOOK") date:t e:t
#+OPTIONS: email:nil f:t inline:t num:t p:nil pri:t prop:nil stat:t
#+OPTIONS: tags:t tasks:t tex:t timestamp:t title:t toc:t todo:t |:t
#+TITLE: Data analysis for the GenABEL paper
#+DATE: <2016-04-29 vr>
#+AUTHOR: L.C. Karssen
#+EMAIL: l.c.karssen@polyomica.com
#+LANGUAGE: en
#+SELECT_TAGS: export
#+EXCLUDE_TAGS: noexport
#+CREATOR: Emacs 24.5.1 (Org mode 8.3.3)

# Typeset and format code listings
#+LaTeX_HEADER: \lstset{backgroundcolor=\color[rgb]{0.9,0.9,0.9},
#+LaTeX_HEADER: keywordstyle=\color[rgb]{0,0,1},
#+LaTeX_HEADER: commentstyle=\color[rgb]{0.3,0.6,0.3},
#+LaTeX_HEADER: stringstyle=\color[rgb]{0.627,0.126,0.941},
#+LaTeX_HEADER: breaklines=false, showstringspaces=false,
#+LaTeX_HEADER: prebreak = \raisebox{0ex}[0ex][0ex]{\ensuremath{\hookleftarrow}},
#+LaTeX_HEADER: frame=single,
#+LaTeX_HEADER: basicstyle=\scriptsize\ttfamily}

# Typeset and format example blocks
#+LATEX_HEADER: \RequirePackage{fancyvrb}
#+LATEX_HEADER: \DefineVerbatimEnvironment{verbatim}{Verbatim}{fontsize=\scriptsize,formatcom = {\color[rgb]{0.5,0,0}}}

# Name the R session
#+PROPERTY: session GenABEL_R

* Introduction
This document describes the data extraction and analysis used for the
GenABEL project paper. Not all data in the paper was extracted using
scripts. In some cases manual extraction was more time-efficient.

The code contained in this document is in the public domain.

* Initialisation
Load some R packages we will need:
#+begin_src R :results none
library(dplyr)
library(ggplot2)
#+end_src

* Determining the number of bug reports
In this section we will determine the number of bug reports submitted
to the GenABEL bug tracker on R-forge and GitHub.

** Number of bugs reported on R-forge
On <2016-04-16 za> I logged into the GenABEL Project R-forge site and
downloaded the CSV of the bug tracker data.
The file is called
#+begin_src R :results none
trackerfile <- "tracker_report-2016-04-16.csv"
#+end_src

Let's read the data:
#+begin_src R :results none
trackerdata <- read.table(trackerfile, header=TRUE, sep=";")
#+end_src

The table contains src_R{nrow(trackerdata)} {{{results(=86=)}}} rows and
src_R{ncol(trackerdata)} {{{results(=20=)}}} columns. The column names
are:
#+begin_src R :exports both
colnames(trackerdata)
#+end_src

#+RESULTS:
| artifact_id        |
| status_id          |
| status_name        |
| priority           |
| submitter_id       |
| submitter_name     |
| assigned_to_id     |
| assigned_to_name   |
| open_date          |
| close_date         |
| last_modified_date |
| summary            |
| details            |
| Resolution         |
| Operating.System   |
| Severity           |
| Hardware           |
| Version            |
| Component          |
| URL                |

In total src_R{nrow(trackerdata)} {{{results(=86=)}}} bugs were reported.

In order to find out how many non-core members submitted a bug report,
we first define the list of core members:
#+begin_src R :exports both
coremembers <- c("Yurii Aulchenko",
                 "Lennart Karssen",
                 "Maarten Kooyman",
                 "Maksim Struchalin",
                 "Xia Shen")
#+end_src

#+RESULTS:
| Yurii Aulchenko   |
| Lennart Karssen   |
| Maarten Kooyman   |
| Maksim Struchalin |
| Xia Shen          |

This is the list of bugs submitted by the other (non-core)
contributors:
#+begin_src R :exports both
noncore_submitted_bugs <- trackerdata %>%
    select(artifact_id, submitter_name) %>%
    filter(!submitter_name %in% coremembers)
#+end_src

#+RESULTS:
| 1641 | Karl Forner     |
| 2664 | Daniel Taliun   |
| 1673 | Daniel Taliun   |
| 1817 | Benedikt Haug   |
| 1846 | Anne Grotenhuis |
| 5118 | Ge Zhang        |
| 5883 | Farid Radmanesh |
| 2437 | Daniel Taliun   |
| 4915 | Hermann Norpois |
| 4916 | Hermann Norpois |
| 4919 | Daniel Taliun   |

A total of src_R{nrow(noncore_submitted_bugs)} {{{results(=11=)}}} bugs
were submitted directly to the R-forge bug tracker by non-core
members.
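
The filtering step above can be tried out without the tracker file.
The following is a minimal sketch in base R on a made-up data frame:
the ids and the names "Jane Doe" and "John Roe" are hypothetical and
do not come from the tracker, and =toy=, =core=, =toy_noncore= and
=bugs_per_submitter= are names introduced only for this illustration.
It applies the same =%in%= test and then counts the reports per
non-core submitter.
#+begin_src R :results none
## Hypothetical stand-in for the tracker data
toy <- data.frame(artifact_id=c(1, 2, 3, 4),
                  submitter_name=c("Xia Shen", "Jane Doe",
                                   "Jane Doe", "John Roe"),
                  stringsAsFactors=FALSE)
## Hypothetical list of core members for this toy example
core <- c("Xia Shen")

## Keep only the rows whose submitter is not a core member
toy_noncore <- toy[!toy$submitter_name %in% core, ]

## Number of reports per non-core submitter
bugs_per_submitter <- table(toy_noncore$submitter_name)
#+end_src

Counting per submitter in this way makes it easy to see whether a few
people account for most of the external reports.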
This is the list of non-core submitters:
#+begin_src R :exports both
unique(noncore_submitted_bugs[,2])
#+end_src

#+RESULTS:
| Karl Forner     |
| Daniel Taliun   |
| Benedikt Haug   |
| Anne Grotenhuis |
| Ge Zhang        |
| Farid Radmanesh |
| Hermann Norpois |

Figuring out how many bugs were reported on the forum and then added
to the R-forge tracker by GenABEL team members is more difficult. A
first attempt, matching keywords in the summary, details and URL
fields:
#+begin_src R :exports both
non_core_bugs <- trackerdata %>%
    filter(submitter_name %in% coremembers) %>%
    filter(grepl("forum|from", summary, ignore.case=TRUE) |
           grepl("forum|mail|reported|reporting", details, ignore.case=TRUE) |
           grepl("forum", URL, ignore.case=TRUE)
           ) %>%
    select(artifact_id)
nrow(non_core_bugs)
#+end_src

#+RESULTS:
: 37

In the end, I went through the tracker manually. The following bugs
were added to the tracker by core members, but were actually reported
by others:
#+NAME: counted_external_non
#+ATTR_LATEX: :environment longtable
| artifact_id |
|-------------|
|        1158 |
|        1186 |
|        1187 |
|        1210 |
|        1211 |
|        1227 |
|        1259 |
|        1266 |
|        1273 |
|        1280 |
|        1287 |
|        1322 |
|        1339 |
|        1388 |
|        1398 |
|        1430 |
|        1641 |
|        1676 |
|        1889 |
|        2147 |
|        2525 |
|        2886 |
|        2598 |
|        2772 |
|        4683 |
|        4700 |
|        4854 |
|        4885 |
|        5367 |
|        5409 |
|        5522 |
|        5658 |
|        5665 |
|        5726 |
|        5793 |
|        5982 |
|        6011 |
|        6041 |
|        6045 |
|        6194 |
|        6280 |

So in total: src_R[:var counted_external=counted_external_non]{nrow(counted_external)} {{{results(=41=)}}}

So src_R{format((41+11)/86*100, digits=3)} {{{results(=60.5=)}}} percent
of the R-forge bug reports were from non-core members.

** Number of bugs reported on GitHub
Given the small number of issues reported on GitHub and the fact that
the old issues for these tools have been migrated from R-forge,
extracting the required numbers by hand is currently still easier
than trying to automate it.
#+NAME: github_issues
| Tool         | external bug reports | core bug reports |
|--------------+----------------------+------------------|
| ProbABEL     |                    1 |                3 |
| OmicABEL     |                    0 |                1 |
| OmicABELnoMM |                    0 |                1 |
| filevector   |                    1 |                1 |

In total src_R[:var tbl=github_issues]{sum(tbl[,2]) + sum(tbl[,3])} {{{results(=8=)}}}
bugs were reported on GitHub that are not also present in the R-forge
tracker.

** Conclusion on bug reports
So the total number of bug reports from non-core members is:
src_R{11 + 41 + 1} {{{results(=53=)}}}, the total number of bug reports
is: src_R{86 + 8} {{{results(=94=)}}} and the total percentage is:
src_R{format((11 + 41 + 1)/(86 + 8) * 100, digits=3)} {{{results(=56.4=)}}}.

* Analysis of traffic of the GenABEL website
In this section we analyse the 'city-of-origin' data of the visitors
of the GenABEL website. For this I use a custom report that I made in
Google Analytics, which is based on the default "Location" entry, but
has some uninteresting columns removed. The data was extracted on
<2016-04-29 vr>.

Set some variables that will be used later:
#+begin_src R :results none
period      <- "April 28, 2015 -- April 28, 2016"
yrmnth      <- "2015-2016"
cities.file <- file.path("Website-Stats",
                         "Data",
                         "Analytics www.genabel.org Locatie Lennart 20150428-20160428.csv")
#+end_src

Read the data from file. Only keep the columns with the ISO country
code, the names of the cities, the number of visits and the average
visit duration. Ignore the other columns.
#+begin_src R :results none
cities.data <- read.csv(cities.file,
                        header=TRUE,
                        stringsAsFactors=FALSE,
                        skip=6,
                        colClasses=c(
                            "character",
                            "character",
                            "character",
                            rep("NULL", 3),
                            "character"
                        )
                        )
colnames(cities.data) <- c("Country", "City", "nVisits", "avgVisitDuration")
#+end_src

Unfortunately, Google formats the exported numbers according to the
browser language, the location, or something similar. As a result, the
file I exported has =.= as thousands separator and =,= as decimal
sign. This needs to be fixed before using the data.
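
To make the separator issue concrete, here is a small self-contained
sketch in base R. The input strings are made up for illustration and
=raw.counts=, =raw.decimals=, =counts= and =decimals= are hypothetical
names that do not appear in the exported file. The only assumption is
the number format described above: =.= for thousands and =,= for
decimals.
#+begin_src R :results none
## Made-up examples of numbers as they appear in such an export
raw.counts   <- c("1.234", "56", "12.345.678")
raw.decimals <- c("1.234,5", "0,75")

## Integers: drop every thousands separator. fixed=TRUE makes gsub()
## treat "." as a literal dot instead of a regular expression.
counts <- as.integer(gsub(".", "", raw.counts, fixed=TRUE))

## Decimals: first drop the thousands separators, then turn the
## decimal comma into a decimal point.
decimals <- as.numeric(gsub(",", ".",
                            gsub(".", "", raw.decimals, fixed=TRUE),
                            fixed=TRUE))
#+end_src

The order of the two substitutions matters for the decimals:
replacing the comma first would turn the decimal sign into yet
another =.= that the second step would then remove.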
Get rid of the =.= as thousands separator:
#+begin_src R
cities.data$nVisits <- as.integer(gsub("[.]", "", cities.data$nVisits))
#+end_src

Create a column with a "city name, country" label, for use on the
horizontal axis:
#+begin_src R
cities.data$CityCountry <- paste0(cities.data[, "City"],
                                  ", ",
                                  cities.data[,"Country"])
#+end_src

Remove the last row as it only contains the total number of visits.
#+begin_src R
cities.data <- cities.data[-nrow(cities.data),]
#+end_src

The total number of visits in this period was:
src_R{sum(cities.data$nVisits)} {{{results(=24227=)}}}.

Remove the cities where the average visit lasted less than 60 seconds
and those with fewer than 15 visits:
#+begin_src R :results none
## Convert HH:MM:SS to seconds
cities.data$secDuration <- sapply(
    strsplit(
        cities.data$avgVisitDuration, ":"),
    function(x) {
        x <- as.numeric(x)
        x[1] * 3600 + x[2] * 60 + x[3]
    }
)

## Find the cities where the average visit was >= minTime seconds and
## the number of visits was >= minVisits.
minVisits   <- 15
minTime     <- 60
user.cities <- cities.data[which(
    cities.data$secDuration >= minTime &
    cities.data$nVisits >= minVisits), ]
#+end_src

The total number of visits from locations with at least 15 visits and
an average visit duration of at least 60 seconds was:
src_R{sum(user.cities$nVisits)} {{{results(=16319=)}}}.

Calculate the number of visits from unknown cities:
#+begin_src R :exports both
unknownCities <- user.cities[which(user.cities$City=="(not set)"), ]
nUnknown      <- sum(unknownCities[, "nVisits"])
#+end_src

#+RESULTS:
: 696

Make a bar plot of the top 20 of cities-of-origin (Figure [[fig:visits_city]]):
#+begin_src R :results graphics :file bar_visit_city_2015-2016.pdf :exports both
## Remove the rows with visits from unknown cities ("(not set)").
user.cities <- user.cities[which(user.cities$City!="(not set)"), ]

user.cities.plot <- user.cities[1:20, c("CityCountry", "nVisits")]
## totVisits is not defined elsewhere in this document. It is assumed
## here to be the total number of visits in the period.
totVisits <- sum(cities.data$nVisits)
user.cities.plot$Percentage <-
    round(user.cities.plot$nVisits/totVisits * 100.)
user.cities.plot$factCtry <-            # Make a column with a factor
                                        # of the "city, country"
                                        # labels in the order they
                                        # appear in the df.
    factor(user.cities.plot$CityCountry,
           as.character(user.cities.plot$CityCountry))

plt <- ggplot(data=user.cities.plot,
              aes(x=factCtry,
                  y=nVisits,
                  fill=factCtry))
plt <- plt + geom_bar(stat='identity') +
    theme(axis.text.x=element_text(angle=90, # Rotated x-axis text
                                   hjust=1,
                                   vjust=0.5,
                                   size=14),
          axis.text=element_text(colour="black", size=12),
          axis.title=element_text(size=14, face="bold"),
          plot.title=element_text(face='bold')
          ) +
    xlab("City") +
    ylab("Number of visits") +
    ## Remove the legend as the country names are already in the
    ## x-labels
    guides(fill="none")
plt
#+end_src

#+NAME: fig:visits_city
#+CAPTION: Top 20 of cities with most visits of the GenABEL website.
#+RESULTS:
[[file:bar_visit_city_2015-2016.pdf]]
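
The plotting code converts the axis labels to a factor whose levels
follow the row order of the data frame. The sketch below, in base R
with made-up labels (the three city labels and the names
=city.labels=, =default.levels= and =ordered.levels= are hypothetical
and not taken from the Analytics data), shows why this is needed:
without explicit levels a factor sorts its levels alphabetically, and
ggplot2 draws discrete axes in level order, so the bars would no
longer be ordered by number of visits.
#+begin_src R :results none
## Made-up labels, listed in the order we want on the axis
city.labels <- c("Rotterdam, NL", "Boston, US", "Amsterdam, NL")

## Default behaviour: levels are sorted alphabetically
default.levels <- levels(factor(city.labels))

## Passing the labels themselves as levels keeps the given order
ordered.levels <- levels(factor(city.labels, levels=city.labels))
#+end_src

Note that this approach assumes the labels are unique, because
=factor()= does not accept duplicated levels.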