To reproduce this talk, you will have to install the following packages from CRAN.

install.packages(c("lfstat", "osmdata", "tidyverse", "sf", "leaflet"))

In order to conveniently plot objects of class sf (simple feature) with ggplot the function geom_sf() is needed which is not yet available from CRAN. For the time being we can install the development version of ggplot2 from github.

install.packages("devtools")
library(devtools)
install_github("hadley/ggplot2")

Tidy hydrological data with list-columns

library(tidyverse)

Often time you might receive a dataset like the following where the actual observations (in our case discharge measurements) and the corresponding station metadata are stored in two separate files. The idea is to

Importing Metadata

The metadata is provided as a csv file which we are going to import using the function read_csv2(). This function is quite verbose and can spam your R console with a lot of messages.

metadata <- read_csv2(file = "./metadata.csv")
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(
##   filename = col_character(),
##   id = col_integer(),
##   river = col_character(),
##   station = col_character(),
##   lon = col_double(),
##   lat = col_double(),
##   z = col_double(),
##   catchment = col_double()
## )

To prevent this messages simply specify the arguments according to the data, e.g. by pasting the messages into the function call. Furthermore we have to append the name of directory to the file name of the csv files to obtain the full relative path.

directory <- "./discharge/"

metadata <- read_csv2(file = "./metadata.csv", locale = locale(decimal_mark = ","),
                      col_types = cols(
                        filename = col_character(),
                        id = col_character(),
                        river = col_character(),
                        station = col_character(),
                        lon = col_double(),
                        lat = col_double(),
                        z = col_double(),
                        catchment = col_double()
                      )) %>%
  mutate(filename = paste0(directory, filename))


metadata
## # A tibble: 11 x 8
##    filename             id    river            station            lon   lat      z catchment
##    <chr>                <chr> <chr>            <chr>            <dbl> <dbl>  <dbl>     <dbl>
##  1 ./discharge/1100.csv 1100  Kučnica          Cankova           16.0  46.7 206        30.4 
##  2 ./discharge/1220.csv 1220  Ledava           Polana I          16.1  46.7 191       208   
##  3 ./discharge/1300.csv 1300  Martjanski potok Martjanci         16.2  46.7 189        28.1 
##  4 ./discharge/1310.csv 1310  Kobiljski potok  Kobilje           16.4  46.7 184        48.7 
##  5 ./discharge/1350.csv 1350  Velika Krka      Hodoš             16.3  46.8 225       107   
##  6 ./discharge/3400.csv 3400  Jezernica        Mlino I           14.1  46.4 468         8.61
##  7 ./discharge/8680.csv 8680  Reka             Neblo             13.5  46.0  73.1      29.7 
##  8 ./discharge/9210.csv 9210  Rižana           Kubed II          13.9  45.5  57.7     204   
##  9 ./discharge/9280.csv 9280  Drnica           Pišine I          13.6  45.5   1.78     29.8 
## 10 ./discharge/9300.csv 9300  Dragonja         Podkaštel I       13.7  45.5   9.81     92.7 
## 11 ./discharge/4761.csv 4761  Mestinjščica     Sodna vas, I, II  15.6  46.2 193       133

Appending the Data

The dataset comprises 11 gauging stations from Slovenia. Each discharge time series is stored in a csv file with two columns (time, discharge). Let’s read in just a single file:

read_csv2(metadata$filename[1])
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(
##   time = col_date(format = ""),
##   discharge = col_double()
## )
## # A tibble: 20,088 x 2
##    time       discharge
##    <date>         <dbl>
##  1 1961-01-01     0.110
##  2 1961-01-02     0.110
##  3 1961-01-03     0.120
##  4 1961-01-04     0.570
##  5 1961-01-05     1.00 
##  6 1961-01-06     0.980
##  7 1961-01-07     0.430
##  8 1961-01-08     0.270
##  9 1961-01-09     0.210
## 10 1961-01-10     0.160
## # ... with 20,078 more rows

To import all the files we need to call an importing function (read_csv(), read_table(), …) for every file name in our metadata tibble. Instead of storing the discharge data in a new object we append another column called data. dplyr calls this ‘mutating’, the corresponding dplyr verb is mutate().

slovenia <- metadata %>%
  group_by(filename) %>%
  mutate(data = list(read_csv2(file = filename, 
                               locale = locale(decimal_mark = ","),
                               col_types = cols(
                                 time = col_date("%Y-%m-%d"),
                                 discharge = col_double()
                               )))) %>%
  ungroup()

slovenia <- slovenia %>%
  select(-filename) %>%
  select(data, everything()) %>%
  print()
## # A tibble: 11 x 8
##    data                  id    river            station            lon   lat      z catchment
##    <list>                <chr> <chr>            <chr>            <dbl> <dbl>  <dbl>     <dbl>
##  1 <tibble [20,088 × 2]> 1100  Kučnica          Cankova           16.0  46.7 206        30.4 
##  2 <tibble [19,723 × 2]> 1220  Ledava           Polana I          16.1  46.7 191       208   
##  3 <tibble [16,801 × 2]> 1300  Martjanski potok Martjanci         16.2  46.7 189        28.1 
##  4 <tibble [16,071 × 2]> 1310  Kobiljski potok  Kobilje           16.4  46.7 184        48.7 
##  5 <tibble [12,418 × 2]> 1350  Velika Krka      Hodoš             16.3  46.8 225       107   
##  6 <tibble [21,915 × 2]> 3400  Jezernica        Mlino I           14.1  46.4 468         8.61
##  7 <tibble [12,783 × 2]> 8680  Reka             Neblo             13.5  46.0  73.1      29.7 
##  8 <tibble [22,280 × 2]> 9210  Rižana           Kubed II          13.9  45.5  57.7     204   
##  9 <tibble [8,035 × 2]>  9280  Drnica           Pišine I          13.6  45.5   1.78     29.8 
## 10 <tibble [22,645 × 2]> 9300  Dragonja         Podkaštel I       13.7  45.5   9.81     92.7 
## 11 <tibble [12,053 × 2]> 4761  Mestinjščica     Sodna vas, I, II  15.6  46.2 193       133

The column data is a list and because it is part of tibble it is called a list-column. Lists are the most versatile data structures in R; anything can be stored in a list. We make use of this list-columns to store discharge time series of varying length.

The data structure of the object slovenia is very flexible an powerful. By using a tibble to hold both data and meta data we ensure that they don’t get accidentally mixed up.

Working with the Data

To access the discharge values of the list-column data it has to be unnested.

runoff <- slovenia %>%
  select(station, data) %>%
  unnest(data) %>%
  print()
## # A tibble: 184,812 x 3
##    station time       discharge
##    <chr>   <date>         <dbl>
##  1 Cankova 1961-01-01     0.110
##  2 Cankova 1961-01-02     0.110
##  3 Cankova 1961-01-03     0.120
##  4 Cankova 1961-01-04     0.570
##  5 Cankova 1961-01-05     1.00 
##  6 Cankova 1961-01-06     0.980
##  7 Cankova 1961-01-07     0.430
##  8 Cankova 1961-01-08     0.270
##  9 Cankova 1961-01-09     0.210
## 10 Cankova 1961-01-10     0.160
## # ... with 184,802 more rows
coverage <- runoff %>%
  mutate(station, time, covered = is.finite(discharge))

ggplot(coverage, aes(x = time, y = station, fill = covered)) +
  geom_raster()

Newly derived variables (e.g. the data coverage) can be appended to the existing tibble as a column. The map() functions from the package purrr make it easy to apply a function to each element of a list. In our case we want to apply the function perc_covered() to each discharge time series. Because perc_covered() returns a single number between 0 and 1 we have to use the function map_dbl().

perc_covered <- function(x)
{
  ndays <- as.double(diff(range(x$time)), unit = "days")
  nvalues <- sum(is.finite(x$discharge))
  
  return(nvalues/ndays)
}

slovenia <- slovenia %>%
  ungroup() %>%
  mutate(coverage = map_dbl(data, perc_covered)) 

slovenia %>%
  arrange(coverage)
## # A tibble: 11 x 9
##    data                  id    river            station         lon   lat      z catchment coverage
##    <list>                <chr> <chr>            <chr>         <dbl> <dbl>  <dbl>     <dbl>    <dbl>
##  1 <tibble [22,645 × 2]> 9300  Dragonja         Podkaštel I    13.7  45.5   9.81     92.7     0.638
##  2 <tibble [20,088 × 2]> 1100  Kučnica          Cankova        16.0  46.7 206        30.4     0.736
##  3 <tibble [16,801 × 2]> 1300  Martjanski potok Martjanci      16.2  46.7 189        28.1     0.759
##  4 <tibble [22,280 × 2]> 9210  Rižana           Kubed II       13.9  45.5  57.7     204       0.841
##  5 <tibble [16,071 × 2]> 1310  Kobiljski potok  Kobilje        16.4  46.7 184        48.7     0.848
##  6 <tibble [12,053 × 2]> 4761  Mestinjščica     Sodna vas, I…  15.6  46.2 193       133       0.892
##  7 <tibble [12,783 × 2]> 8680  Reka             Neblo          13.5  46.0  73.1      29.7     0.938
##  8 <tibble [8,035 × 2]>  9280  Drnica           Pišine I       13.6  45.5   1.78     29.8     0.942
##  9 <tibble [19,723 × 2]> 1220  Ledava           Polana I       16.1  46.7 191       208       0.993
## 10 <tibble [21,915 × 2]> 3400  Jezernica        Mlino I        14.1  46.4 468         8.61    0.998
## 11 <tibble [12,418 × 2]> 1350  Velika Krka      Hodoš          16.3  46.8 225       107       1.00

It is also possible to perform an analysis only on a subset of stations. The dplyr verb filter() filters only the rows (stations) matching a given criteria.

gaps <- slovenia %>%
  filter(coverage < 0.8) %>%
  transmute(data,
            label = paste(river, "at", station, "\tCatchment: ", catchment, "km²,",
                          "Altitude:", z, "m a.s.l.")) %>%
  unnest()
ggplot(gaps, aes(x = time, y = discharge)) +
  geom_line() +
  facet_wrap(~label, ncol = 1, scales = "free_y")
## Warning: Removed 365 rows containing missing values (geom_path).

We can quickly check if the coordinates of the gauging stations are plausible using an interactive leaflet map.

library(leaflet)

leaflet(slovenia) %>%
  addTiles() %>%
  addMarkers(label = ~paste(station, river, sep = " - "))
## Assuming 'lon' and 'lat' are longitude and latitude, respectively

Coordinate conversion

Leaflet requires the coordinates to be in WGS84 (EPSG code: 4326). To convert coordinates from one coordinate reference system (CRS) to another one can use the function st_transform() from the package sf (Simple Features).

If the CRS in use or its EPSG code is unknown, http://www.epsg-registry.org/ provides help in listing all EPSG codes of a projected CRS.

For example, transforming the WGS84 coordinates to LCC Europe (EPSG code: 3034) yields:

transform_crs <- function(x, y, from, to)
{
  require(sf)
  p1 <- st_as_sf(data.frame(x, y), coords = c("x", "y"), crs = from)
  st_coordinates(st_transform(p1, crs = to))
}

transform_crs(x = slovenia$lon, y = slovenia$lat, from = 4326, to = 3034)
##          X       Y
## 1  4445582 2249828
## 2  4454421 2246998
## 3  4457966 2248190
## 4  4472946 2249737
## 5  4467625 2262820
## 6  4305796 2202058
## 7  4262199 2161277
## 8  4293794 2112466
## 9  4276196 2104376
## 10 4277448 2103001
## 11 4418965 2189618

Using OpenStreetMap data in R

Data from OpenStreetMap can be accessed using the recent R package osmdata. This package only downloads the data, it doesn’t modify it. Associated to osmdata is the package osmplotr which helps in producing maps.

library(osmdata)
## Data (c) OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright

osmdata is an R package for accessing OpenStreetMap data using the Overpass API. The Overpass API is a read-only API that serves up custom selected parts of the OSM map data.

As described at https://github.com/ropensci/osmdata, Overpass API queries can be built from a base query constructed with opq() followed by add_osm_feature(). The corresponding OSM objects are then downloaded and converted to R Simple Features (sf) objects with osmdata_sf() or to R Spatial (sp) objects with osmdata_sp().

In the following example we want to use the administrative boundaries of Vienna and its river network as a backgrouńd for a map of randomly generated temperature data.

The function opq() offers several ways to specify the bounding box for the query. In case of a character string the free Nominatim API provided by OpenStreetMap is used to find the bounding box associated with place names.

Let us first fetch the administrative borders of Vienna’s districts. https://taginfo.openstreetmap.org/keys/admin_level tells us that admin_level=9 will give us the borders of the districts.

# boundingbox <- "Vienna, Austria"
boundingbox <- c(16.18, 48.12, 16.58, 48.33)

borders <- opq(bbox = boundingbox) %>%
  add_osm_feature(key = "admin_level", value = "9") %>%
  osmdata_sf() 

borders
## Object of class 'osmdata' with:
##                  $bbox : 48.12,16.18,48.33,16.58
##         $overpass_call : The call submitted to the overpass API
##             $timestamp : [ Wed 4 Apr 2018 07:44:25 ]
##            $osm_points : 'sf' Simple Features Collection with 5542 points
##             $osm_lines : 'sf' Simple Features Collection with 406 linestrings
##          $osm_polygons : 'sf' Simple Features Collection with 5 polygons
##        $osm_multilines : 'sf' Simple Features Collection with 0 multilinestrings
##     $osm_multipolygons : 'sf' Simple Features Collection with 24 multipolygons

https://wiki.openstreetmap.org/wiki/Map_Features lists all available OSM features and its tags. Every feature needed has to be downloaded in a separate query.

rivers <- opq(bbox = boundingbox) %>%
  add_osm_feature(key = "waterway", value = "river") %>%
  osmdata_sf()

streams <- opq(bbox = boundingbox) %>%
  add_osm_feature(key = "waterway", value = "stream") %>%
  osmdata_sf()

riverbank <- opq(bbox = boundingbox) %>%
  add_osm_feature(key = "waterway", value = "riverbank") %>%
  osmdata_sf()

water <- opq(bbox = boundingbox) %>%
  add_osm_feature(key = "natural", value = "water") %>%
  osmdata_sf()

surfacewater <- c(riverbank, water)

Features (like the Danube) can be much larger than the bounding box used for the query. And as osmdata doesn’t modify (crops) the data, we have to limit the extend of our map manually by passing appropriate values to coord_sf().

extract_limits <- function(x)
{
  coord <- as.numeric(strsplit(x$bbox, ",", fixed = TRUE)[[1]])
  list(x = coord[c(2, 4)], y = coord[c(1, 3)])
}

limits <- extract_limits(opq(bbox = boundingbox))

The function geom_sf() is not yet available from CRAN. You will have install the development version from github.

# install_github("hadley/ggplot2")
library(ggplot2)

ggplot() +
  geom_sf(data = borders$osm_multipolygons, fill = NA) +
  geom_sf(data = surfacewater$osm_multipolygons, fill = "lightblue", col = NA)  +
  geom_sf(data = surfacewater$osm_polygons, fill = "lightblue", col = NA) +
  geom_sf(data = streams$osm_lines, col = "lightblue", size = 0.5) +
  geom_sf(data = rivers$osm_lines, col = "#286ee0", size = 1) +
  coord_sf(xlim = limits$x, ylim = limits$y) + 
  labs(caption = "Data (c) OpenStreetMap contributors, ODbL 1.0")

The same map with simulated discharge data.

set.seed(2804)
nstation <- 11
coord <- data_frame(id = seq_len(nstation),
                    lon = runif(nstation, limits$x[1], limits$x[2]),
                    lat = runif(nstation, limits$y[1], limits$y[2]))

nyears <- 4
temp <- data_frame(id = rep(coord$id, each = nyears),
                   year = rep(seq(2000, length.out = nyears), times = nstation),
                   temperature = runif(length(year), min = 10, max = 20)) %>%
  left_join(coord, by = "id")


ggplot(temp) +
  geom_sf(data = borders$osm_multipolygons, fill = NA, size = 0.1) +
  geom_sf(data = rivers$osm_multilines, col = "#286ee0", size = 1) +
  coord_sf(xlim = limits$x, ylim = limits$y, ndiscr = 0) +
  geom_point(aes(x = lon, y = lat, size = temperature)) +
  facet_wrap(~year) +
  theme_bw() +
  theme(axis.title = element_blank()) + 
  labs(caption = "Data (c) OpenStreetMap contributors, ODbL 1.0")

Ressources