To reproduce this talk, you will have to install the following packages from CRAN.
install.packages(c("lfstat", "osmdata", "tidyverse", "sf", "leaflet"))
In order to conveniently plot objects of class sf (simple feature) with ggplot the function geom_sf()
is needed which is not yet available from CRAN. For the time being we can install the development version of ggplot2 from github.
install.packages("devtools")
library(devtools)
install_github("hadley/ggplot2")
library(tidyverse)
Often time you might receive a dataset like the following where the actual observations (in our case discharge measurements) and the corresponding station metadata are stored in two separate files. The idea is to
The metadata is provided as a csv file which we are going to import using the function read_csv2()
. This function is quite verbose and can spam your R console with a lot of messages.
metadata <- read_csv2(file = "./metadata.csv")
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(
## filename = col_character(),
## id = col_integer(),
## river = col_character(),
## station = col_character(),
## lon = col_double(),
## lat = col_double(),
## z = col_double(),
## catchment = col_double()
## )
To prevent this messages simply specify the arguments according to the data, e.g. by pasting the messages into the function call. Furthermore we have to append the name of directory to the file name of the csv files to obtain the full relative path.
directory <- "./discharge/"
metadata <- read_csv2(file = "./metadata.csv", locale = locale(decimal_mark = ","),
col_types = cols(
filename = col_character(),
id = col_character(),
river = col_character(),
station = col_character(),
lon = col_double(),
lat = col_double(),
z = col_double(),
catchment = col_double()
)) %>%
mutate(filename = paste0(directory, filename))
metadata
## # A tibble: 11 x 8
## filename id river station lon lat z catchment
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ./discharge/1100.csv 1100 Kučnica Cankova 16.0 46.7 206 30.4
## 2 ./discharge/1220.csv 1220 Ledava Polana I 16.1 46.7 191 208
## 3 ./discharge/1300.csv 1300 Martjanski potok Martjanci 16.2 46.7 189 28.1
## 4 ./discharge/1310.csv 1310 Kobiljski potok Kobilje 16.4 46.7 184 48.7
## 5 ./discharge/1350.csv 1350 Velika Krka Hodoš 16.3 46.8 225 107
## 6 ./discharge/3400.csv 3400 Jezernica Mlino I 14.1 46.4 468 8.61
## 7 ./discharge/8680.csv 8680 Reka Neblo 13.5 46.0 73.1 29.7
## 8 ./discharge/9210.csv 9210 Rižana Kubed II 13.9 45.5 57.7 204
## 9 ./discharge/9280.csv 9280 Drnica Pišine I 13.6 45.5 1.78 29.8
## 10 ./discharge/9300.csv 9300 Dragonja Podkaštel I 13.7 45.5 9.81 92.7
## 11 ./discharge/4761.csv 4761 Mestinjščica Sodna vas, I, II 15.6 46.2 193 133
The dataset comprises 11 gauging stations from Slovenia. Each discharge time series is stored in a csv file with two columns (time, discharge). Let’s read in just a single file:
read_csv2(metadata$filename[1])
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(
## time = col_date(format = ""),
## discharge = col_double()
## )
## # A tibble: 20,088 x 2
## time discharge
## <date> <dbl>
## 1 1961-01-01 0.110
## 2 1961-01-02 0.110
## 3 1961-01-03 0.120
## 4 1961-01-04 0.570
## 5 1961-01-05 1.00
## 6 1961-01-06 0.980
## 7 1961-01-07 0.430
## 8 1961-01-08 0.270
## 9 1961-01-09 0.210
## 10 1961-01-10 0.160
## # ... with 20,078 more rows
To import all the files we need to call an importing function (read_csv()
, read_table()
, …) for every file name in our metadata
tibble. Instead of storing the discharge data in a new object we append another column called data
. dplyr calls this ‘mutating’, the corresponding dplyr verb is mutate()
.
slovenia <- metadata %>%
group_by(filename) %>%
mutate(data = list(read_csv2(file = filename,
locale = locale(decimal_mark = ","),
col_types = cols(
time = col_date("%Y-%m-%d"),
discharge = col_double()
)))) %>%
ungroup()
slovenia <- slovenia %>%
select(-filename) %>%
select(data, everything()) %>%
print()
## # A tibble: 11 x 8
## data id river station lon lat z catchment
## <list> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 <tibble [20,088 × 2]> 1100 Kučnica Cankova 16.0 46.7 206 30.4
## 2 <tibble [19,723 × 2]> 1220 Ledava Polana I 16.1 46.7 191 208
## 3 <tibble [16,801 × 2]> 1300 Martjanski potok Martjanci 16.2 46.7 189 28.1
## 4 <tibble [16,071 × 2]> 1310 Kobiljski potok Kobilje 16.4 46.7 184 48.7
## 5 <tibble [12,418 × 2]> 1350 Velika Krka Hodoš 16.3 46.8 225 107
## 6 <tibble [21,915 × 2]> 3400 Jezernica Mlino I 14.1 46.4 468 8.61
## 7 <tibble [12,783 × 2]> 8680 Reka Neblo 13.5 46.0 73.1 29.7
## 8 <tibble [22,280 × 2]> 9210 Rižana Kubed II 13.9 45.5 57.7 204
## 9 <tibble [8,035 × 2]> 9280 Drnica Pišine I 13.6 45.5 1.78 29.8
## 10 <tibble [22,645 × 2]> 9300 Dragonja Podkaštel I 13.7 45.5 9.81 92.7
## 11 <tibble [12,053 × 2]> 4761 Mestinjščica Sodna vas, I, II 15.6 46.2 193 133
The column data
is a list and because it is part of tibble it is called a list-column. Lists are the most versatile data structures in R; anything can be stored in a list. We make use of this list-columns to store discharge time series of varying length.
The data structure of the object slovenia
is very flexible an powerful. By using a tibble to hold both data and meta data we ensure that they don’t get accidentally mixed up.
To access the discharge values of the list-column data
it has to be unnested.
runoff <- slovenia %>%
select(station, data) %>%
unnest(data) %>%
print()
## # A tibble: 184,812 x 3
## station time discharge
## <chr> <date> <dbl>
## 1 Cankova 1961-01-01 0.110
## 2 Cankova 1961-01-02 0.110
## 3 Cankova 1961-01-03 0.120
## 4 Cankova 1961-01-04 0.570
## 5 Cankova 1961-01-05 1.00
## 6 Cankova 1961-01-06 0.980
## 7 Cankova 1961-01-07 0.430
## 8 Cankova 1961-01-08 0.270
## 9 Cankova 1961-01-09 0.210
## 10 Cankova 1961-01-10 0.160
## # ... with 184,802 more rows
coverage <- runoff %>%
mutate(station, time, covered = is.finite(discharge))
ggplot(coverage, aes(x = time, y = station, fill = covered)) +
geom_raster()
Newly derived variables (e.g. the data coverage) can be appended to the existing tibble as a column. The map()
functions from the package purrr make it easy to apply a function to each element of a list. In our case we want to apply the function perc_covered()
to each discharge time series. Because perc_covered()
returns a single number between 0 and 1 we have to use the function map_dbl()
.
perc_covered <- function(x)
{
ndays <- as.double(diff(range(x$time)), unit = "days")
nvalues <- sum(is.finite(x$discharge))
return(nvalues/ndays)
}
slovenia <- slovenia %>%
ungroup() %>%
mutate(coverage = map_dbl(data, perc_covered))
slovenia %>%
arrange(coverage)
## # A tibble: 11 x 9
## data id river station lon lat z catchment coverage
## <list> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <tibble [22,645 × 2]> 9300 Dragonja Podkaštel I 13.7 45.5 9.81 92.7 0.638
## 2 <tibble [20,088 × 2]> 1100 Kučnica Cankova 16.0 46.7 206 30.4 0.736
## 3 <tibble [16,801 × 2]> 1300 Martjanski potok Martjanci 16.2 46.7 189 28.1 0.759
## 4 <tibble [22,280 × 2]> 9210 Rižana Kubed II 13.9 45.5 57.7 204 0.841
## 5 <tibble [16,071 × 2]> 1310 Kobiljski potok Kobilje 16.4 46.7 184 48.7 0.848
## 6 <tibble [12,053 × 2]> 4761 Mestinjščica Sodna vas, I… 15.6 46.2 193 133 0.892
## 7 <tibble [12,783 × 2]> 8680 Reka Neblo 13.5 46.0 73.1 29.7 0.938
## 8 <tibble [8,035 × 2]> 9280 Drnica Pišine I 13.6 45.5 1.78 29.8 0.942
## 9 <tibble [19,723 × 2]> 1220 Ledava Polana I 16.1 46.7 191 208 0.993
## 10 <tibble [21,915 × 2]> 3400 Jezernica Mlino I 14.1 46.4 468 8.61 0.998
## 11 <tibble [12,418 × 2]> 1350 Velika Krka Hodoš 16.3 46.8 225 107 1.00
It is also possible to perform an analysis only on a subset of stations. The dplyr verb filter()
filters only the rows (stations) matching a given criteria.
gaps <- slovenia %>%
filter(coverage < 0.8) %>%
transmute(data,
label = paste(river, "at", station, "\tCatchment: ", catchment, "km²,",
"Altitude:", z, "m a.s.l.")) %>%
unnest()
ggplot(gaps, aes(x = time, y = discharge)) +
geom_line() +
facet_wrap(~label, ncol = 1, scales = "free_y")
## Warning: Removed 365 rows containing missing values (geom_path).
We can quickly check if the coordinates of the gauging stations are plausible using an interactive leaflet map.
library(leaflet)
leaflet(slovenia) %>%
addTiles() %>%
addMarkers(label = ~paste(station, river, sep = " - "))
## Assuming 'lon' and 'lat' are longitude and latitude, respectively
Leaflet requires the coordinates to be in WGS84 (EPSG code: 4326). To convert coordinates from one coordinate reference system (CRS) to another one can use the function st_transform()
from the package sf (Simple Features).
If the CRS in use or its EPSG code is unknown, http://www.epsg-registry.org/ provides help in listing all EPSG codes of a projected CRS.
For example, transforming the WGS84 coordinates to LCC Europe (EPSG code: 3034) yields:
transform_crs <- function(x, y, from, to)
{
require(sf)
p1 <- st_as_sf(data.frame(x, y), coords = c("x", "y"), crs = from)
st_coordinates(st_transform(p1, crs = to))
}
transform_crs(x = slovenia$lon, y = slovenia$lat, from = 4326, to = 3034)
## X Y
## 1 4445582 2249828
## 2 4454421 2246998
## 3 4457966 2248190
## 4 4472946 2249737
## 5 4467625 2262820
## 6 4305796 2202058
## 7 4262199 2161277
## 8 4293794 2112466
## 9 4276196 2104376
## 10 4277448 2103001
## 11 4418965 2189618
Data from OpenStreetMap can be accessed using the recent R package osmdata. This package only downloads the data, it doesn’t modify it. Associated to osmdata is the package osmplotr which helps in producing maps.
library(osmdata)
## Data (c) OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright
osmdata is an R package for accessing OpenStreetMap data using the Overpass API. The Overpass API is a read-only API that serves up custom selected parts of the OSM map data.
As described at https://github.com/ropensci/osmdata, Overpass API queries can be built from a base query constructed with opq()
followed by add_osm_feature()
. The corresponding OSM objects are then downloaded and converted to R Simple Features (sf) objects with osmdata_sf()
or to R Spatial (sp) objects with osmdata_sp()
.
In the following example we want to use the administrative boundaries of Vienna and its river network as a backgrouńd for a map of randomly generated temperature data.
The function opq()
offers several ways to specify the bounding box for the query. In case of a character string the free Nominatim API provided by OpenStreetMap is used to find the bounding box associated with place names.
Let us first fetch the administrative borders of Vienna’s districts. https://taginfo.openstreetmap.org/keys/admin_level tells us that admin_level=9
will give us the borders of the districts.
# boundingbox <- "Vienna, Austria"
boundingbox <- c(16.18, 48.12, 16.58, 48.33)
borders <- opq(bbox = boundingbox) %>%
add_osm_feature(key = "admin_level", value = "9") %>%
osmdata_sf()
borders
## Object of class 'osmdata' with:
## $bbox : 48.12,16.18,48.33,16.58
## $overpass_call : The call submitted to the overpass API
## $timestamp : [ Wed 4 Apr 2018 07:44:25 ]
## $osm_points : 'sf' Simple Features Collection with 5542 points
## $osm_lines : 'sf' Simple Features Collection with 406 linestrings
## $osm_polygons : 'sf' Simple Features Collection with 5 polygons
## $osm_multilines : 'sf' Simple Features Collection with 0 multilinestrings
## $osm_multipolygons : 'sf' Simple Features Collection with 24 multipolygons
https://wiki.openstreetmap.org/wiki/Map_Features lists all available OSM features and its tags. Every feature needed has to be downloaded in a separate query.
rivers <- opq(bbox = boundingbox) %>%
add_osm_feature(key = "waterway", value = "river") %>%
osmdata_sf()
streams <- opq(bbox = boundingbox) %>%
add_osm_feature(key = "waterway", value = "stream") %>%
osmdata_sf()
riverbank <- opq(bbox = boundingbox) %>%
add_osm_feature(key = "waterway", value = "riverbank") %>%
osmdata_sf()
water <- opq(bbox = boundingbox) %>%
add_osm_feature(key = "natural", value = "water") %>%
osmdata_sf()
surfacewater <- c(riverbank, water)
Features (like the Danube) can be much larger than the bounding box used for the query. And as osmdata doesn’t modify (crops) the data, we have to limit the extend of our map manually by passing appropriate values to coord_sf()
.
extract_limits <- function(x)
{
coord <- as.numeric(strsplit(x$bbox, ",", fixed = TRUE)[[1]])
list(x = coord[c(2, 4)], y = coord[c(1, 3)])
}
limits <- extract_limits(opq(bbox = boundingbox))
The function geom_sf()
is not yet available from CRAN. You will have install the development version from github.
# install_github("hadley/ggplot2")
library(ggplot2)
ggplot() +
geom_sf(data = borders$osm_multipolygons, fill = NA) +
geom_sf(data = surfacewater$osm_multipolygons, fill = "lightblue", col = NA) +
geom_sf(data = surfacewater$osm_polygons, fill = "lightblue", col = NA) +
geom_sf(data = streams$osm_lines, col = "lightblue", size = 0.5) +
geom_sf(data = rivers$osm_lines, col = "#286ee0", size = 1) +
coord_sf(xlim = limits$x, ylim = limits$y) +
labs(caption = "Data (c) OpenStreetMap contributors, ODbL 1.0")
The same map with simulated discharge data.
set.seed(2804)
nstation <- 11
coord <- data_frame(id = seq_len(nstation),
lon = runif(nstation, limits$x[1], limits$x[2]),
lat = runif(nstation, limits$y[1], limits$y[2]))
nyears <- 4
temp <- data_frame(id = rep(coord$id, each = nyears),
year = rep(seq(2000, length.out = nyears), times = nstation),
temperature = runif(length(year), min = 10, max = 20)) %>%
left_join(coord, by = "id")
ggplot(temp) +
geom_sf(data = borders$osm_multipolygons, fill = NA, size = 0.1) +
geom_sf(data = rivers$osm_multilines, col = "#286ee0", size = 1) +
coord_sf(xlim = limits$x, ylim = limits$y, ndiscr = 0) +
geom_point(aes(x = lon, y = lat, size = temperature)) +
facet_wrap(~year) +
theme_bw() +
theme(axis.title = element_blank()) +
labs(caption = "Data (c) OpenStreetMap contributors, ODbL 1.0")
R for Data Science, a book by Garrett Grolemund and Hadley Wickham: http://r4ds.had.co.nz/
leaflet: http://leafletjs.com/
Overpass API: https://wiki.openstreetmap.org/wiki/Overpass_API