BacDiveR

This R package provides a programmatic interface for the Bacterial Diversity Metadatabase of the DSMZ (German Collection of Microorganisms and Cell Cultures). It helps you download full datasets or just their IDs based on reproducible searches against the BacDive Web Service.

Reference

Carola Söhngen, Adam Podstawka, Boyke Bunk, Dorothea Gleim, Anna Vetcininova, Lorenz Christian Reimer, Christian Ebeling, Cezar Pendarovski, Jörg Overmann; BacDive – The Bacterial Diversity Metadatabase in 2016, Nucleic Acids Research, Volume 44, Issue D1, 4 January 2016, Pages D581–D585, doi:10.1093/nar/gkv983

Installation

  1. Because the BacDive API requires registration, please register first and wait for your access to be granted.

  2. Once you have your login credentials, install BacDiveR from GitHub with:

  3. After installing, run the following commands to save your login credentials locally:
  4. In that file, add your email and password directly after the = signs, save it, then restart R(Studio) or run:
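A minimal sketch of how re-reading an environment file and checking for saved credentials can look in base R. It works on a temporary file so that it is safe to run; the variable names DEMO_BacDive_email and DEMO_BacDive_password are hypothetical placeholders, not necessarily the names BacDiveR expects.

```r
# Assumption: the variable names are placeholders for illustration only.
demo_renviron <- tempfile(fileext = ".Renviron")
writeLines(
  c("DEMO_BacDive_email=you@example.org",
    "DEMO_BacDive_password=not-a-real-password"),
  demo_renviron
)

# Re-read the file without restarting R
readRenviron(demo_renviron)

# Both variables should now be non-empty strings
credentials_present <- all(nzchar(
  Sys.getenv(c("DEMO_BacDive_email", "DEMO_BacDive_password"))
))
credentials_present
```

For your real credentials, the same readRenviron() call would point at the .Renviron file in your home directory, for example readRenviron("~/.Renviron").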

In the following examples, data retrieval will only work if your login credentials are free of typos and were saved correctly. Console output like "{\"detail\": \"Invalid username/password\"}" or Error: $ operator is invalid for atomic vectors means that either the login credentials or the .Renviron file are incorrect. In that case, please repeat steps 2 to 4.

Example

The BacDive website lets you easily search for all of its strains within a given taxonomic unit. BacDive.DSMZ.de/index.php?search=Bacillus, for example, returns a paginated list of strains that you can then access, download and analyse further. All of that is manual work, though. BacDiveR automates this workflow:

Calling retrieve_data() with just a searchTerm results in a vector of numeric BacDive IDs. You can use such ID downloads for meta-analyses of different bacterial taxa, such as comparisons of taxon sizes as they are represented in the DSMZ’s collection.
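As a rough sketch of such a taxon size comparison: once you have one ID vector per taxon, counting and sorting them takes only base R. The helper taxon_sizes() is hypothetical, and the IDs below are made-up placeholders, not real BacDive results.

```r
# Hypothetical helper: count the IDs per taxon, largest taxon first
taxon_sizes <- function(ids_by_taxon) {
  sort(lengths(ids_by_taxon), decreasing = TRUE)
}

# Made-up placeholder IDs, for illustration only
example_ids <- list(
  "Taxon A" = c(101L, 102L, 103L),
  "Taxon B" = c(201L, 202L)
)

taxon_sizes(example_ids)

# With BacDiveR installed and credentials saved, each list element
# would instead come from a call like:
# retrieve_data(searchTerm = "Bacillus halodurans")
```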

Downloading datasets

In order to analyse the actual datasets, we now need to download them. Suppose we want to compare the optimal growth temperatures of strains from the taxon Bacillus halodurans with those of another taxon. You can of course obtain that data by feeding the ID vector obtained above into a self-made loop that calls retrieve_data(…, searchType = "bacdive_id").

However, you can save yourself some time and effort by activating the parameter force_taxon_download. This will get you all taxon data in a single (albeit large) list of dataframes. Feel free to take a break while the computers do some work for you:
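A sketch of what such a download could look like, assuming BacDiveR is installed and your credentials are saved. The argument names searchTerm and force_taxon_download are the ones mentioned in this README; the wrapper download_taxon() is a hypothetical convenience, and the calls are guarded so that nothing is downloaded in a non-interactive session.

```r
# Hypothetical wrapper around retrieve_data() for full taxon downloads
download_taxon <- function(taxon) {
  BacDiveR::retrieve_data(searchTerm = taxon, force_taxon_download = TRUE)
}

# Assumption: only run the (slow) downloads interactively and when
# the package is actually available.
if (interactive() && requireNamespace("BacDiveR", quietly = TRUE)) {
  Bac_data <- download_taxon("Bacillus halodurans")
  At_data  <- download_taxon("Aneurinibacillus thermoaerophilus")
}
```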

Extracting data fields

We wanted the growth temperatures, right? As with any other database field, you now need to determine its path within the list data structure that BacDiveR returned to you. Use either

  1. RStudio’s Environment > Data viewer, or
  2. str(Bac_data), or
  3. your web browser’s JSON viewer on the dataset’s URL: BacDive.DSMZ.de/api/bacdive/taxon/Bacillus/halodurans,

to find the $-marked path to the field of your interest. In our example, it’s $culture_growth_condition$culture_temp$temp, which we’ll now use to extract that field from all entries in our downloaded datasets.
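If you prefer to list candidate paths programmatically, one base R trick is to flatten a single entry with unlist() and read the names, which join the nesting levels with dots. The sketch below uses a made-up mock entry; only the culture_growth_condition path is taken from the example above, and some_section is a hypothetical field.

```r
# Mock entry for illustration only; not a real BacDive dataset
mock_entry <- list(
  some_section = list(some_field = "x"),
  culture_growth_condition = list(culture_temp = list(temp = "30"))
)

# Overview of the first two nesting levels
str(mock_entry, max.level = 2)

# unlist() joins nested names with ".", so swap the dots for "$"
dotted_paths <- names(unlist(mock_entry))
dollar_paths <- gsub(".", "$", dotted_paths, fixed = TRUE)
dollar_paths
```

This only gives clean results when field names contain no dots themselves and the leaves are single values; real entries may need a closer look with str().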

Multiple steps are necessary here, which could easily result in hard-to-read code if we used the regular assignment operator <-, intermediate variables and nested function calls. We will avoid this with the pipe operator %>%. It indicates that

  1. an object is passed into a function as its first argument, and that
  2. the function’s output is “piped” into the next function.

Note the ~ .x prepended to the path $culture_growth_condition$culture_temp$temp! This is map()’s way of indicating that each element in the piped-in dataset will be accessed at that path.
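A minimal sketch of such a pipeline, assuming the magrittr and purrr packages are installed. It runs on a made-up mock list instead of a real download, so the temperatures are placeholders.

```r
library("magrittr")  # provides %>%

# Mock datasets for illustration only; values are made up
mock_data <- list(
  "1" = list(culture_growth_condition = list(culture_temp = list(temp = "30"))),
  "2" = list(culture_growth_condition = list(culture_temp = list(temp = "37")))
)

temps <- mock_data %>%
  # visit each dataset (.x) and pull out the field at the given path
  purrr::map(~ .x$culture_growth_condition$culture_temp$temp) %>%
  # flatten the resulting list into a character vector
  unlist() %>%
  # convert the text values to numbers
  as.numeric()

temps
```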

Before visualising the data, we need to create a dataframe of the two datasets.
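One way to do this in base R is sketched below. The column names taxon and temp_C match what the plotting code expects; the helper as_temperature_df() and the mock values are hypothetical.

```r
# Hypothetical helper: pair each temperature with its taxon name
as_temperature_df <- function(taxon, temps) {
  data.frame(
    taxon = rep(taxon, length(temps)),
    temp_C = temps,
    stringsAsFactors = FALSE
  )
}

# Made-up values, for illustration only
mock_a <- as_temperature_df("Taxon A", c(30, 37))
mock_b <- as_temperature_df("Taxon B", c(50, 55, 55))

# Stacking works because both dataframes share the same columns
combined <- rbind(mock_a, mock_b)
combined
```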

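library("ggplot2")
library("magrittr")  # provides the %>% pipe used below

# Stack both taxa into one dataframe, then draw one notched box per taxon
rbind(temperature_Bac, temperature_At) %>%
  ggplot(aes(x = taxon, y = temp_C)) +
  geom_boxplot(notch = TRUE, varwidth = TRUE) +  # box width reflects group size
  geom_jitter(height = 0.05, alpha = 0.5) +      # overlay the individual strains
  theme(legend.position = "none")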

And thus we find that Aneurinibacillus thermoaerophilus contains strains with different growth optima (note the groups of data points), some reaching the 50-something °C range, as the "thermo" part of its name suggests. In contrast, all Bacillus halodurans strains known to BacDive were found to grow best at a lower temperature. Thanks to the notch in the Aneurinibacillus thermoaerophilus box, we can also say that there is a significant difference between the temperature ranges of these two taxa, even before digging into the numbers.
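To check the notches numerically, here is a rough sketch in base R. It assumes the interval ggplot2 documents for notched boxplots, median ± 1.58 × IQR / sqrt(n); the helper names and the mock temperatures are hypothetical, not BacDive data.

```r
# Notch interval around the median, following the 1.58 * IQR / sqrt(n) rule
notch_interval <- function(x) {
  half_width <- 1.58 * IQR(x) / sqrt(length(x))
  c(lower = median(x) - half_width, upper = median(x) + half_width)
}

# TRUE if the two notch intervals share any values
notches_overlap <- function(a, b) {
  ia <- notch_interval(a)
  ib <- notch_interval(b)
  ia[["lower"]] <= ib[["upper"]] && ib[["lower"]] <= ia[["upper"]]
}

# Made-up temperatures, for illustration only
warm_taxon <- c(50, 52, 54, 56, 58)
cool_taxon <- c(28, 30, 30, 30, 32)

notches_overlap(warm_taxon, cool_taxon)
```

Non-overlapping notches are commonly read as evidence that the two medians differ, which is a quick visual check rather than a formal test.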

Summary

BacDiveR helps you download BacDive data for investigating it offline. Use ?retrieve_data to learn more about its options.