Published May 1, 2016 | Version v1
Journal article Open

Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use

  • 1. TU Wien
  • 2. University of Helsinki


Research data changes over time as new records are added, errors are corrected and obsolete records are deleted from a data set. Researchers rarely use an entire data set or data stream as it is, but rather create specific subsets tailored to their experiments. To keep such experiments reproducible and to share and cite the particular data used in a study, researchers need a means of identifying the exact version of a subset as it was used during a specific execution of a workflow, even if the data source is continuously evolving. In this paper, we present 14 recommendations on how to adapt a data source to provide identifiable subsets for the long term, elaborated by the RDA Working Group on Dynamic Data Citation (WGDC). The proposed solution is based on versioned data, timestamping and a query-based subsetting mechanism. We provide a detailed discussion of the recommendations and the rationale behind them, and give examples of how to implement them.
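The three building blocks named in the abstract (versioned data, timestamping and query-based subsetting) can be illustrated with a small sketch. This is not the implementation from the paper or the wording of the WGDC recommendations. It assumes a validity-interval versioning scheme, integer logical timestamps, a single parameterised query, and a SHA-256 result hash. All table names, function names and the identifier format are hypothetical.

```python
import hashlib
import sqlite3


def open_store():
    """Create an in-memory store (hypothetical schema, for illustration only)."""
    db = sqlite3.connect(":memory:")
    # Each row version is valid in [valid_from, valid_to); NULL means "current".
    db.execute(
        "CREATE TABLE records (rec_id INTEGER, value REAL, "
        "valid_from INTEGER, valid_to INTEGER)"
    )
    # Query store: query parameter, execution timestamp and result hash.
    db.execute(
        "CREATE TABLE queries (pid TEXT PRIMARY KEY, min_value REAL, "
        "executed_at INTEGER, result_hash TEXT)"
    )
    return db


def upsert(db, rec_id, value, ts):
    """Insert or correct a record: close the current version, add a new one."""
    db.execute(
        "UPDATE records SET valid_to = ? WHERE rec_id = ? AND valid_to IS NULL",
        (ts, rec_id),
    )
    db.execute("INSERT INTO records VALUES (?, ?, ?, NULL)", (rec_id, value, ts))


def delete(db, rec_id, ts):
    """Mark a record as deleted at ts; earlier versions stay queryable."""
    db.execute(
        "UPDATE records SET valid_to = ? WHERE rec_id = ? AND valid_to IS NULL",
        (ts, rec_id),
    )


def select_subset(db, min_value, ts):
    """Run the subsetting query against the data as it was at time ts."""
    return db.execute(
        "SELECT rec_id, value FROM records "
        "WHERE value >= ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?) ORDER BY rec_id",
        (min_value, ts, ts),
    ).fetchall()


def result_hash(rows):
    """Hash the sorted result set so a later re-execution can be verified."""
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()


def cite_subset(db, min_value, ts):
    """Store the query with its timestamp and hash; return an identifier."""
    digest = result_hash(select_subset(db, min_value, ts))
    pid = f"subset-{digest[:12]}"  # placeholder, not a real PID scheme
    db.execute(
        "INSERT OR IGNORE INTO queries VALUES (?, ?, ?, ?)",
        (pid, min_value, ts, digest),
    )
    return pid


def resolve(db, pid):
    """Re-execute the stored query at its original timestamp and verify it."""
    min_value, ts, digest = db.execute(
        "SELECT min_value, executed_at, result_hash FROM queries WHERE pid = ?",
        (pid,),
    ).fetchone()
    rows = select_subset(db, min_value, ts)
    if result_hash(rows) != digest:
        raise ValueError("re-executed subset does not match the stored hash")
    return rows
```

In this sketch, a subset cited at one timestamp resolves to the same rows after later corrections, deletions and additions, because those operations only close validity intervals and never overwrite earlier versions. The identifier points to the stored query and its timestamp rather than to a copy of the data.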



