From dataset To RDF

library(dataset)

Our datasets are defined so that their dimensions can be easily and unambiguously reduced to triples for RDF applications. They can be serialized to, or synchronized with, semantic web applications using the rdflib package [1].

Read more about how we adopted the datacube model of SDMX and the RDF Data Cube Vocabulary for datasets as R objects in the vignette article The dataset S3 Class.

Because our datasets conform to the tidy data concept, they can be reduced to long-form triples.

       
RDF          subject   predicate    object
JSON         object    property     value
spreadsheet  row id    column name  cell
data.frame   key       variable     measurement
data.frame   key       attribute    value

Table source: rdflib

Dimension reductions of the dataset

Our datasets are tidy.

example_dataset <- readxl::read_excel(
  system.file("extdata", "rdf_example.xlsx", package = "dataset"), 
  sheet = "dataset-wide")
example_dataset
#> # A tibble: 8 × 8
#>   rowid  time geo   sex   value unit  freq  status
#>   <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> 
#> 1     1  2021 NL    F         9 NR    A     A     
#> 2     2  2021 BE    F         8 NR    A     E     
#> 3     3  2021 NL    M        10 NR    A     A     
#> 4     4  2021 BE    M         7 NR    A     A     
#> 5     5  2022 NL    F        10 NR    A     A     
#> 6     6  2022 BE    F        11 NR    A     A     
#> 7     7  2022 NL    M        NA NR    A     O     
#> 8     8  2022 BE    M        10 NR    A     A

You can start reducing the dimensions by, for example, uniting them into a single column. As more dimensions are united, the row identifier increasingly acts as a unique identifier of the observation, similar to a uniform resource identifier (URI).
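As a minimal sketch of this intermediate step, the base R code below unites only the dimensions time, geo and sex into a single key and keeps the measurement and one attribute as separate columns. The data frame `obs` is a hand-typed copy of part of the first four rows of the example above, and the column name `key` is an assumption made for illustration.

```r
# Part of the first four rows of the example dataset, typed in by hand
obs <- data.frame(
  rowid  = 1:4,
  time   = c(2021, 2021, 2021, 2021),
  geo    = c("NL", "BE", "NL", "BE"),
  sex    = c("F", "F", "M", "M"),
  value  = c(9, 8, 10, 7),
  status = c("A", "E", "A", "A"),
  stringsAsFactors = FALSE
)

# Unite the three dimensions into one key; `key` is a hypothetical column name
obs$key <- paste(obs$time, obs$geo, obs$sex, sep = "_")

# The dimension columns are now redundant, so the table needs fewer columns
reduced <- obs[, c("key", "value", "status")]

# Each observation must remain uniquely identified by the united key
stopifnot(!anyDuplicated(reduced$key))
reduced
```

The uniqueness check matters: if two observations shared the same combination of dimensions, the united key could no longer serve as an identifier.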

Eventually you can reduce the entire dataset to triples. The uri column uniquely identifies each observation, the component column keeps the main structural element of the W3C/SDMX datacube model, and the value column holds the value of the dimension, measurement or attribute.

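library(tidyr)
library(dplyr)

example_long <- example_dataset %>%
  # Build the observation identifier from all columns, keeping the originals
  unite("uri",
        all_of(c("rowid", "time", "geo", "sex", "value", "unit", "freq", "status")),
        remove = FALSE) %>%
  # pivot_longer() needs one common type for the resulting value column
  mutate(across(everything(), as.character)) %>%
  # One row per observation and component
  pivot_longer(
    cols = -any_of("uri"),
    names_to = "component",
    values_to = "value"
  )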
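
The long form maps directly onto RDF statements: uri is the subject, component the predicate and value the object. The following is only a sketch in base R, not the rdflib workflow. The namespaces `http://example.com/obs/` and `http://example.com/def/` and the helper `as_ntriples_line()` are hypothetical, and every object is written as a plain string literal without escaping or datatypes.

```r
# Two rows in the shape of `example_long`, typed in by hand for illustration
long <- data.frame(
  uri       = c("1_2021_NL_F_9_NR_A_A", "1_2021_NL_F_9_NR_A_A"),
  component = c("geo", "value"),
  value     = c("NL", "9"),
  stringsAsFactors = FALSE
)

# Hypothetical namespaces; replace them with the ones your project uses
obs_base  <- "http://example.com/obs/"
prop_base <- "http://example.com/def/"

# Hypothetical helper: format triples as N-Triples lines,
# treating every object as a plain string literal (a simplification)
as_ntriples_line <- function(subject, predicate, object) {
  sprintf('<%s%s> <%s%s> "%s" .',
          obs_base, subject, prop_base, predicate, object)
}

ntriples <- as_ntriples_line(long$uri, long$component, long$value)
writeLines(ntriples)
```

Missing values would have to be dropped before this step, because they would otherwise be written as the string "NA". For real serialization, a dedicated package such as rdflib is likely the better choice.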

The benefit of standard codelists

In this example, apart from the measurement of the observation, we used only variable names and codes that conform to SDMX. The advantage of this approach is that it is very easy to extend the dataset with further dimensions and to add human-readable labels, potentially in many natural languages.

set.seed(2022)
library(statcodelists)

# Attach the code list labels to each coded component,
# then show a 30% sample of the rows of each component
example_long %>%
  filter(.data$component == "sex") %>%
  left_join(statcodelists::CL_SEX %>%
              rename(value = "id"),
            by = "value") %>%
  bind_rows(
    example_long %>%
      filter(.data$component == "freq") %>%
      left_join(statcodelists::CL_FREQ %>%
                  rename(value = "id"),
                by = "value")
  ) %>%
  bind_rows(
    example_long %>%
      filter(.data$component == "status") %>%
      left_join(statcodelists::CL_OBS_STATUS %>%
                  rename(value = "id"),
                by = "value")
  ) %>%
  group_by(.data$component) %>%
  sample_frac(size = 0.3) %>%
  kableExtra::kbl() %>%
  kableExtra::kable_paper()
uri | component | value | name | description | name_locale | description_locale
4_2021_BE_M_7_NR_A_A | freq | A | Annual | To be used for data collected or disseminated every year | en | en
3_2021_NL_M_10_NR_A_A | freq | A | Annual | To be used for data collected or disseminated every year | en | en
6_2022_BE_F_11_NR_A_A | sex | F | Female | NA | en | NA
7_2022_NL_M_NA_NR_A_O | sex | M | Male | NA | en | NA
3_2021_NL_M_10_NR_A_A | status | A | Normal value | To be used as default value if no value is provided or when no special coded qualification is assumed. Usually, it can be assumed that the source agency assigns sufficient confidence to the provided observation and/or the value is not expected to be dramatically revised. | en | en
4_2021_BE_M_7_NR_A_A | status | A | Normal value | To be used as default value if no value is provided or when no special coded qualification is assumed. Usually, it can be assumed that the source agency assigns sufficient confidence to the provided observation and/or the value is not expected to be dramatically revised. | en | en
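
To illustrate the point about several natural languages, here is a minimal sketch in base R. The label table `sex_labels` is written by hand for this example and does not come from statcodelists; the Dutch labels in particular are an assumption made for illustration.

```r
# Coded values as they appear in the long-form example
coded <- data.frame(
  uri       = c("6_2022_BE_F_11_NR_A_A", "7_2022_NL_M_NA_NR_A_O"),
  component = "sex",
  value     = c("F", "M"),
  stringsAsFactors = FALSE
)

# Hypothetical label table, written by hand for illustration only
sex_labels <- data.frame(
  value       = c("F", "M", "F", "M"),
  name        = c("Female", "Male", "Vrouw", "Man"),
  name_locale = c("en", "en", "nl", "nl"),
  stringsAsFactors = FALSE
)

# Each coded value joins to one label per language
labelled <- merge(coded, sex_labels, by = "value", all.x = TRUE)
labelled[order(labelled$uri, labelled$name_locale), ]
```

Because the codes stay the same in every language, adding a further locale only means adding rows to the label table; the observations themselves are untouched.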

  1. Carl Boettiger: A tidyverse lover’s intro to RDF