Published August 4, 2025 | Version v1
Conference paper Open

Better Use of Research Data through Better Context

  • 1. Huygens Institute
  • 2. Huygens Institute, KNAW
  • 1. Nationale Forschungsdateninfrastruktur (NFDI) e.V.
  • 2. University of Amsterdam

Description

Introduction

To use research data responsibly, a user must know technical details about the dataset. But they must also know its context: what do the data mean, how and why were they created, and are there gaps, errors or biases? Including this information in data shared across borders via research infrastructures can have a long-term impact on research practices and quality.

Data-Envelopes for Describing Context

Existing documentation frameworks for dataset description were found to be lacking in important areas such as provenance, bias, and machine-readability. This led to the development of data-envelopes [1], a means of supporting responsible (re)use of datasets by describing them in a structured, machine-readable format. Data-envelopes adhere to commonly used standards, e.g. DCAT-3 [2], and follow the FAIR principles. See Figure 1 for an overview of the data-envelope structure [1].

Practical Challenges

In the "Accessing Context" [3] project, data-envelopes are being created for almost 100 datasets from the Huygens Institute [4], the Netherlands Institute for Sound & Vision [5], and the Amsterdam City Archives [6], in cooperation between dataset experts and digital archivists. Consultations are being held with providers and users to discover the requirements for good descriptions. In this paper, we discuss some of the practical challenges that we discovered.

Language: A seemingly simple issue is the language of the description. It is often important to describe a dataset in the language in which it was created, but a dataset may include multiple languages. To ensure widespread reuse in the international community, it can be helpful to describe datasets in other languages as well. Yet each additional language increases the burden on the data providers and makes descriptions harder to search.

Static vs Dynamic: For dynamic datasets, some users want the most recent version, while others require static snapshots, e.g. for reproducibility. Supporting multiple static snapshots can lead to a plethora of data-envelopes. Conversely, when a dynamic dataset is described as a whole, users may not understand how the data change, and there is a risk that the description becomes outdated as the dataset changes.

Granularity: Providers and users may have different ideas about how data should be divided up for description. A provider may find it logical to divide the data according to source, whereas one user may find a thematic division helpful, and another may prefer a split according to access rights.

Conclusion and Future Work

Implementing data-envelopes has given us insights into practical questions that apply to all means of describing datasets. This work already involves three ERICs: the activities are funded by the European Research Infrastructure for Heritage Science (E-RIHS [7]), while the practical implementation is based on technical solutions developed and established as current practice within the Common Language Resources and Technology Infrastructure (CLARIN [8]) and the Digital Research Infrastructure for the Arts and Humanities (DARIAH-EU [9]). We plan to expand contact with the (inter)national data management community, to align initiatives and to move towards commonly agreed practices that lead to better dataset descriptions throughout Research Data Management infrastructure.
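The language and static-versus-dynamic trade-offs described above can be illustrated with a minimal sketch of a machine-readable description. This is not the data-envelope schema from [1]; it only assumes a JSON-LD-style record that uses a few DCAT-3 and Dublin Core Terms properties (`dct:title`, `dct:description`, `dcat:version`, `dcat:previousVersion`). The helper names `describe_dataset` and `pick_text`, the `example.org` identifiers, and all titles and descriptions are hypothetical.

```python
import json

# Namespace prefixes for DCAT and Dublin Core Terms.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def tagged(texts):
    """Turn {"nl": "...", "en": "..."} into JSON-LD language-tagged values."""
    return [{"@value": text, "@language": lang} for lang, text in texts.items()]


def describe_dataset(dataset_id, titles, descriptions, version, previous_id=None):
    """Build a description of one static snapshot (hypothetical helper).

    Each snapshot gets its own record; `previous_id` links it to the
    snapshot it supersedes, so users can trace how the data changed.
    """
    record = {
        "@context": CONTEXT,
        "@id": dataset_id,
        "@type": "dcat:Dataset",
        "dct:title": tagged(titles),
        "dct:description": tagged(descriptions),
        "dcat:version": version,
    }
    if previous_id is not None:
        record["dcat:previousVersion"] = {"@id": previous_id}
    return record


def pick_text(values, preferred, fallback="en"):
    """Return the text in the first available preferred language.

    Falls back to `fallback`, then to any available language, then None.
    """
    by_lang = {v["@language"]: v["@value"] for v in values}
    for lang in (*preferred, fallback):
        if lang in by_lang:
            return by_lang[lang]
    return next(iter(by_lang.values()), None)


# Two snapshots of an invented dataset, described in Dutch and English.
snapshot_2024 = describe_dataset(
    "https://example.org/dataset/letters/2024",
    {"nl": "Brieven", "en": "Letters"},
    {"nl": "Momentopname 2024.", "en": "Snapshot taken in 2024."},
    version="2024",
)
snapshot_2025 = describe_dataset(
    "https://example.org/dataset/letters/2025",
    {"nl": "Brieven", "en": "Letters"},
    {"nl": "Momentopname 2025.", "en": "Snapshot taken in 2025."},
    version="2025",
    previous_id=snapshot_2024["@id"],
)

if __name__ == "__main__":
    print(json.dumps(snapshot_2025, indent=2, ensure_ascii=False))
```

In this sketch, every additional language adds one entry per text field that the provider must write and maintain, and every snapshot adds a full record, which is the proliferation of descriptions noted above. A reader who asks for German text through `pick_text(snapshot_2025["dct:title"], ["de"])` receives the English fallback, which shows one way a consumer might cope with partially translated descriptions.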

Files

CoRDI_2025_paper_156.pdf (145.9 kB)
md5:212d4c1c5f539b0514d15bc4a2d49538