Motivation of the dataset package

The primary aim of dataset is to build well-documented data.frames, tibbles, or data.tables that map well, in a reproducible manner, to the W3C DataSet definition within the Data Cube Vocabulary. The data cube model itself originates in the Statistical Data and Metadata eXchange (SDMX), and it is almost fully harmonised with the Resource Description Framework (RDF), the standard model for data interchange on the web.

Mapping R objects onto these models has numerous advantages: it makes data importing easier and less error-prone; it leaves plenty of room for automating documentation; and it makes publishing results from R in line with the FAIR principles much easier.

Lovers of the tidy data concept and the tidyverse know that the world is full of untidy data tables that require, on average, five times more work to tidy up than to actually analyse. But that’s not the only problem with datasets. Datasets without codebooks cannot be intelligently matched with other information sources. Data without reference to an authoritative copy cannot be validated against accidental damage during processing. A dataset without bibliographic references is not ready for publication.

The ecosystem of the dataset, statcodelists, and dataobservatory packages aims to provide a comprehensive toolkit for many reproducible research tasks.

Our dataset class follows the organizational model of the data cube, which is used by the Statistical Data and Metadata eXchange, and which is also described in a non-normative manner by the RDF Data Cube Vocabulary. While the SDMX standards predate the Resource Description Framework (RDF) framework for the semantic web, the two are already harmonised to a great degree, which enables users and data publishers to create machine-to-machine connections among statistical data. Our goal is to create a modern data frame object in R with utilities that let the R user benefit from synchronizing data with semantic web applications, including statistical resources, libraries, and open science repositories.

The dataset package aims to produce well-formatted and well-documented data.frames, tibbles, or data.tables that can be easily synchronized with web resources (such as statistical web APIs), and that are ready to be published on open science repositories with metadata that makes them findable, accessible, and interoperable in library catalogues and in statistical applications.

Because they follow the W3C Data Cube Vocabulary, which is based on the statistical SDMX data cube model [1], such standard R objects (data.frame, data.table, tibble, or well-structured lists like JSON) become highly interoperable and can be placed into relational databases, semantic web applications, archives, and repositories. They follow the FAIR principles: they are findable, accessible, interoperable, and reusable.


We also aim to replace the survey class in the retroharmonize survey harmonization package with an inherited dataset class that is optimized for social sciences survey data.

The companion statcodelists package facilitates further reproducibility with standardized, natural-language-independent codelists for categorical variables. Such categorical variables are correctly interpreted in a wide array of statistical applications and can easily be joined with data from many countries, irrespective of the primary sources’ language.

FAIR: findable datasets

library(dataset)
iris_datacite <- datacite_add(
  x = iris,
  Title = "Iris Dataset",
  Creator = person("Anderson", "Edgar", role = "aut"),
  Publisher = "American Iris Society",
  Identifier = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  PublicationYear = 1935,
  Description = "This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.",
  Language = "en")

In R, objects can have arbitrary attributes. For example, a data.frame has a class attribute that tells functions to treat the object as a data.frame. Under the hood, you still keep your data frame object of choice: the good old base R data.frame, or the more modern data.table or tibble.

We add descriptive metadata conforming to the Dublin Core and DataCite standards as data frame attributes, because they must clearly describe the dataset. These dataset-level attributes do not interfere with the tidy data concept, because that concept relates to the contents of the data frame.
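The attribute mechanism itself needs nothing beyond base R. A minimal sketch (using only base R functions and a hypothetical `"Title"` attribute, not the dataset package’s own API) shows how metadata can travel with a data frame without touching its contents:

```r
# A plain data frame; its tidy contents are just the columns.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Attach descriptive metadata as attributes; the rows and
# columns of df are left unchanged.
attr(df, "Title")   <- "Example Dataset"
attr(df, "Creator") <- "Jane Doe"

# The metadata travels with the object and can be read back:
attributes(df)$Title
#> [1] "Example Dataset"
```

The dataset package builds on this same mechanism, but uses standardized Dublin Core and DataCite fields instead of ad hoc attribute names.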

datacite(iris_datacite)
#> $names
#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
#> 
#> $Title
#>          Title titleType
#> 1 Iris Dataset     Title
#> 
#> $Creator
#> [1] "Anderson Edgar [aut]"
#> 
#> $Identifier
#> [1] "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x"
#> 
#> $Publisher
#> [1] "American Iris Society"
#> 
#> $Issued
#> [1] 1935
#> 
#> $publication_year
#> [1] 1935
#> 
#> $Type
#>   resourceType resourceTypeGeneral
#> 1      Dataset             Dataset
#> 
#> $Description
#> [1] "This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica."
#> 
#> $Geolocation
#> [1] NA
#> 
#> $Language
#> [1] "eng"
#> 
#> $Rights
#> [1] NA
#> 
#> $Size
#> [1] "12.72 kB [12.42 KiB]"

Reproducible datasets

temp_path <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = temp_path)
iris_dataset <- read_dataset(
  dataset_id = "iris_dataset",
  obs_id = NULL,
  dimensions = NULL,
  measures = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  attributes = "Species",
  Title = "Iris Dataset",
  unit = list(code = "MM", label = "millimeter"),
  .f = "utils::read.csv", file = temp_path)
attributes(iris_dataset)

Interoperable datasets

The dataset package creates data frame templates that conform to the tidy data concept but go beyond it, organizing data for survey processing, statistical aggregation, and publication. Our templates follow international metadata standards and data organization recommendations, and translate them to three R object types: the base R data.frame and its modern tibble (tidyverse) and data.table counterparts.

We aim for interoperability, even if not full compliance, with the W3C Dataset concept, the SDMX Dataset concept, and the DDI concept for survey data. These metadata concepts are mainly used in the semantic web to foster machine-to-machine communication. Because our dataset classes are not designed for web services, but for producing research output in R that easily integrates into them, we subjectively select specifications from the W3C and SDMX Dataset concepts (which are largely harmonised, and are special cases of the data cube), from the DDI standards for surveys, and from the Dublin Core and DataCite standards for publication and reuse. Our selection criterion is practical usability in reproducible research in R, particularly in the tidy data and tidymodels frameworks.

Our real goal is to facilitate efficient reproducible research workflows that flawlessly perform tasks that most R users, even scientific researchers, are unfamiliar with—or, if they are familiar with them, find tedious, because they usually get no credit for them as analysts or researchers. We want to help process survey data in a way that lets it find its way into a survey archive. We want to help the statistical processing and analysis of survey, accounts, and other transactional data into formats that are easy to publish or to add to relational databases or semantic web applications.

The W3C Dataset concept (and the more general data cube concept) is far too general to serve as a practical specification for our purposes. We use only small parts of these international standards, and our selection is therefore subjective.

See: Linked SDMX Data


[1] RDF Data Cube Vocabulary, W3C Recommendation, 16 January 2014, https://www.w3.org/TR/vocab-data-cube/; Introduction to SDMX data modeling, https://www.unescap.org/sites/default/files/Session_4_SDMX_Data_Modeling_%20Intro_UNSD_WS_National_SDG_10-13Sep2019.pdf