This R Notebook supports the electronic laboratory notebook (ELN) survey data shown in the publication: “Considerations for Implementing Electronic Laboratory Notebooks in an Academic Research Environment”, S.G. Higgins, A.A. Nogiwa-Valdez, M.M. Stevens (2021).

Configure environment

Load required packages:

library(here)
here() starts at /Users/stuart/OneDrive - Imperial College London/_Papers/ELN-essay/product_survey_revised
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ──────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3     ✓ purrr   0.3.4
✓ tibble  3.0.6     ✓ dplyr   1.0.4
✓ tidyr   1.1.2     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.1
── Conflicts ─────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(plotly)
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     
Registered S3 methods overwritten by 'htmltools':
  method               from         
  print.html           tools:rstudio
  print.shiny.tag      tools:rstudio
  print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio

Attaching package: ‘plotly’

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout
library(htmlwidgets)

Import data

Load the survey data from file, recode ‘ongoing’ tags in the date_defunct column to the year 2021, calculate the total number of years active (years_active), create a logical column (defunct_in_2021) flagging whether each ELN is defunct (TRUE) or still active (FALSE) in 2021, and add a row index (row_number). (Note: this Notebook expects the file ‘ELN_Review_Higgins_2021_Survey.csv’ to be present in the top-level directory identified by the here package.)

data <-
  read_csv(here("ELN_Review_Higgins_2021_Survey.csv")) %>%
  mutate(date_defunct_numeric = as.numeric(replace(date_defunct, date_defunct == "ongoing", 2021)),
         years_active = date_defunct_numeric - date_released,
         defunct_in_2021 =
           case_when(
             date_defunct == "ongoing" ~ FALSE,
             date_defunct == 2021 ~ TRUE,
             TRUE ~ TRUE
           ),
         row_number = row_number())

── Column specification ─────────────────────────────────────────────────────────────────────────────────
cols(
  product_name = col_character(),
  manufacturer = col_character(),
  date_released = col_double(),
  date_defunct = col_character(),
  codebase = col_character(),
  notes = col_character(),
  reference_1 = col_character(),
  reference_2 = col_character(),
  reference_3 = col_character(),
  reference_4 = col_character(),
  references_accessed = col_character()
)
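
To make the recoding step concrete, here is a minimal sketch on a made-up three-row table. The product names and dates are hypothetical and are not taken from the survey file. The single comparison `date_defunct != "ongoing"` is used as a shorthand that should give the same result as the case_when() call above whenever date_defunct has no missing values.

```r
library(dplyr)

# Hypothetical toy data: one defunct ELN, one ongoing ELN,
# and one ELN that became defunct in 2021 itself.
toy <- data.frame(
  product_name  = c("A", "B", "C"),
  date_released = c(2000, 2010, 2015),
  date_defunct  = c("2005", "ongoing", "2021"),
  stringsAsFactors = FALSE
)

toy_recoded <- toy %>%
  mutate(
    # Treat 'ongoing' as the survey year so a lifetime can be computed
    date_defunct_numeric = as.numeric(replace(date_defunct, date_defunct == "ongoing", "2021")),
    years_active = date_defunct_numeric - date_released,
    # Anything not tagged 'ongoing' counts as defunct, including a 2021 end date
    defunct_in_2021 = date_defunct != "ongoing"
  )

print(toy_recoded)
```

Products B and C both end up with date_defunct_numeric equal to 2021, but only C is flagged as defunct, which is why the logical column is derived from the original text column rather than from the recoded year.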

Generate statistics

How many ELNs were surveyed?

data %>%
  count()

How many of the ELNs surveyed are active (FALSE) or defunct (TRUE) in 2021?

data %>%
  count(defunct_in_2021)

What are the average and spread of the lifetimes (years_active) of the ELNs surveyed? (Note: the median absolute deviation computed by mad() uses a default scaling constant of 1.4826, so that it acts as a consistent estimator of the standard deviation for normally distributed data.)

data %>%
  summarise(mean_years_active = mean(years_active),
            sd_years_active = sd(years_active),
            median_years_active = median(years_active),
            mad_years_active = mad(years_active),
            iqr_years_active = IQR(years_active),
            range_years_active = max(years_active)-min(years_active))
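
As a small illustration of the scaling constant mentioned in the note, the sketch below compares mad() with a hand-computed median absolute deviation. The vector x is invented for the example and does not come from the survey data.

```r
# Hypothetical lifetimes in years, with one large outlier
x <- c(1, 2, 3, 4, 100)

# Unscaled median absolute deviation, computed by hand
raw_mad <- median(abs(x - median(x)))

# Default mad() multiplies the raw value by 1.4826
scaled_mad <- mad(x)

# Setting constant = 1 switches the scaling off
unscaled_mad <- mad(x, constant = 1)

print(c(raw = raw_mad, scaled = scaled_mad, unscaled = unscaled_mad))

# For comparison: the standard deviation is far more sensitive to the outlier
print(sd(x))
```

If the unscaled value is preferred when reporting, `mad(years_active, constant = 1)` could be used in the summaries instead.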

What are the average and spread of the lifetimes of ELNs, sub-divided by codebase?

data %>%
  group_by(codebase) %>%
  summarise(mean_years_active = mean(years_active),
            sd_years_active = sd(years_active),
            median_years_active = median(years_active),
            mad_years_active = mad(years_active),
            iqr_years_active = IQR(years_active),
            range_years_active = max(years_active)-min(years_active))

How many of the ELNs surveyed have open-source or proprietary codebases?

data %>%
  count(codebase)

Which are the longest-running proprietary and open-source ELNs (in the survey data)?

data %>%
  group_by(codebase) %>%
  slice_max(n=1, order_by=years_active) %>%
  select(product_name, manufacturer, years_active, date_defunct, codebase)

Generate figures

Define a theme for plotting figures:

mytheme <-
  theme_bw() +
  theme(
    panel.background = element_rect(fill = "white", colour = "black", size = 2),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank(),
    text = element_text(size = 25, face = "plain", colour = "black"),
    axis.title.x = element_text(size = 25, face = "plain"),
    axis.title.y = element_text(size = 25),
    line = element_line(size = 2),
    axis.ticks.length = unit(0.15, "cm"))

Define functions for customising the appearance of plotted figures:

get_point_colour <- function(x){
  ifelse(x==TRUE, "grey", "grey30")
}

get_line_colour <- function(x){
  ifelse(x!="opensource", "#0072B2", "#CC79A7")
}
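
Both helpers are vectorised through ifelse(), so they can be applied to whole columns at once. The short sketch below repeats the two definitions so that it runs on its own and calls them on example vectors. The label "proprietary" is an assumption about how non-open-source rows are coded; any value other than "opensource" is given the same colour.

```r
# Definitions repeated from above so this sketch is self-contained
get_point_colour <- function(x){
  ifelse(x==TRUE, "grey", "grey30")
}

get_line_colour <- function(x){
  ifelse(x!="opensource", "#0072B2", "#CC79A7")
}

# Defunct ELNs (TRUE) get light grey points, active ones dark grey
point_colours <- get_point_colour(c(TRUE, FALSE))

# "opensource" rows get one line colour; everything else gets the other
line_colours <- get_line_colour(c("opensource", "proprietary"))

print(point_colours)
print(line_colours)
```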

Produce the timeline plot featured in Figure 1 of the main manuscript:

p_timeline <-
  data %>%
  mutate(row_number = as_factor(row_number)) %>%
  mutate(row_number = fct_reorder(fct_reorder(row_number, years_active, .desc=FALSE), codebase, .desc=FALSE)) %>%
  mutate(row_number_new = as.numeric(row_number)) %>%
  ggplot() +
  geom_segment(aes(x=date_released, xend=date_defunct_numeric,y=row_number_new, yend=row_number_new),
               colour=get_line_colour(data$codebase),
               linetype="solid",
               size=0.5) +
  geom_point(aes(x=date_released, y=row_number_new), colour=get_point_colour(data$defunct_in_2021), shape=1, size=2 ) +
  geom_point(aes(x=date_defunct_numeric, y=row_number_new), colour=get_point_colour(data$defunct_in_2021), shape=16, size=0.5) + 
  scale_x_continuous(position="bottom", breaks=c(seq(1980,2021,5))) +
  coord_cartesian(xlim=c(1980,2021)) +
  theme_bw() +
  theme(
    plot.margin = margin(0.1, 0.1, 0.1, 0.1, "cm"),
    panel.border = element_blank(),
    panel.grid.major.y = element_line(colour="grey95", size=0.25),
    panel.grid.major.x = element_line(colour="grey95", size=0.25),
    panel.grid.minor.x = element_line(colour="grey95", size=0.25),
    axis.text.y = element_blank(),
    axis.text.x = element_blank(),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks.y = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "bottom"
  )

print(p_timeline)

ggsave(here("ELN_Review_Higgins_2021_Timeline.pdf"), plot=p_timeline, width=18.0, height=10, device="pdf", dpi=600, units="cm")
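
The nested fct_reorder() calls in the timeline code are intended to stack the rows by codebase and, within each codebase, by years_active. As a rough base R illustration of that ordering (not the code used for the figure), the sketch below assigns a vertical position to each row of a made-up table. The product names, lifetimes, and the "proprietary" label are hypothetical.

```r
# Hypothetical toy data, not taken from the survey file
toy <- data.frame(
  product      = c("A", "B", "C", "D"),
  codebase     = c("proprietary", "opensource", "proprietary", "opensource"),
  years_active = c(10, 3, 2, 7),
  stringsAsFactors = FALSE
)

# Sort by codebase first, then by lifetime within each codebase
ord <- order(toy$codebase, toy$years_active)

# Give each row its position in that sorted sequence (1 = bottom of the plot)
toy$y_position <- integer(nrow(toy))
toy$y_position[ord] <- seq_along(ord)

print(toy[ord, ])
```

In this toy case the open-source products occupy the lowest positions, shortest lifetime first, followed by the proprietary products in the same within-group order.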