Enhancing PID services towards a more fine-grained granularity level as a base for a FAIR data infrastructure

Saldanha Bach, Janete; Klas, Claus-Peter; Mutschke, Peter

doi:10.5281/zenodo.6760992

Published June 27, 2022 | Version 1.0

Poster Open


1. GESIS – Leibniz Institute for the Social Sciences

Assigning a PID to a whole dataset, the common practice in research data management, is not enough to unambiguously identify the piece of information used, to cite the data properly and, consequently, to promote proper credit for research results.

The growing availability of research data in data repositories increases data visibility and intensifies re-use and reproducibility efforts. Data per se has various levels of granularity. In the Social Sciences, for instance, the variable is the construct that provides evidence for research results and allows future inferences and analyses. A variable in Social Sciences research data is a unit of quantitative data, commonly obtained through survey questionnaires or experiments and represented in a tabular dataset format. When it comes to re-use, researchers are much more interested in the concept of those variables.

When re-used, variables are currently cited "in the text" without a unique identifier; usually, only the study or parts of the questions are cited. These non-standard practices make it inefficient for the service provider to identify critical variables and for the researcher to re-use them. They also hinder automated access to variables, which makes harmonization a costly and time-consuming task.

To solve this problem, we propose assigning a PID at the variable level so that each variable can be cited unambiguously. A Persistent Identifier (PID) is a persistent, unique, and globally resolvable identifier based on an openly specified PID Scheme (EOSC, 2020, see doi: 10.2777/926037). PIDs are the established means of data identification, whatever the standard (DOI, Handle, URN, ARK). A study that re-uses data (variables) relies on the analysis of those variables to provide results and recommendations, make inferences, and produce outcomes through secondary data analysis practices. Under this approach, variable identification must be precise, unique, traceable, unambiguous, long-lasting, and free of duplicates, and thus qualify for accurate data citation using a PID.

PIDs as data identifiers have considerable advantages in terms of machine-actionable features: they enable citation tracking and aggregation, allow scientific outputs to be combined, strengthen authoritative attribution, and promote digital connections among researchers, organizations, and research outputs.

Our proposal aims to provide the most complete citation possible, even though this level of granularity requires adding a PID to a smaller unit within the dataset. Assigning PIDs to variables will make research data easier to find and cite, improving data findability and accessibility. Research data identification becomes more unambiguous when data are cited at the most discrete level, such as a variable. A more detailed citation method can build trust in data, provide provenance, foster reusability, and favor reproducibility.

The variable registration service assigns a PID following the Handle standard, using a third-party PID registration service (ePIC API, see http://www.pidconsortium.net/). The service also enables "bulk registration," which means the registration of many variables at once. It provides a landing page and requires a minimal metadata schema comprising Study DOI, variable name, variable label, landing page, resource type, title, creators, publisher, publication date, and availability.
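The minimal metadata schema listed above can be illustrated with a short Python sketch that checks records for completeness before a bulk registration. This is only an illustration under assumptions: the class name, field names, and helper functions are hypothetical, the field types are guesses, and no call to the ePIC API is made or implied.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass
class VariableRecord:
    """Hypothetical container for the minimal metadata named in the text."""
    study_doi: str
    variable_name: str
    variable_label: str
    landing_page: str
    resource_type: str
    title: str
    creators: List[str]
    publisher: str
    publication_date: str
    availability: str


def missing_fields(record: VariableRecord) -> List[str]:
    """Return the names of metadata fields that are empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def split_bulk(
    records: List[VariableRecord],
) -> Tuple[List[VariableRecord], List[Tuple[VariableRecord, List[str]]]]:
    """Separate complete records from incomplete ones.

    Complete records could then be handed to a registration step;
    incomplete ones are returned together with their missing fields.
    """
    ready, rejected = [], []
    for record in records:
        missing = missing_fields(record)
        if missing:
            rejected.append((record, missing))
        else:
            ready.append(record)
    return ready, rejected
```

Checking completeness before submission keeps a single incomplete record from blocking the rest of a bulk request; how the actual service handles such cases is not described in the abstract.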

We argue that more specific data citation practices can reach a higher maturity level, in the best sense of the FAIR Principles. We assessed the service under the FAIR Data Maturity Model (RDA Working Group on FAIR Data Maturity Model, 2020, see doi: 10.15497/rda00050), applying the stricter evaluation method to each indicator, that is, a binary pass or fail assessment. The service passes 100% of the indicators at levels 1 and 2. It achieves 88% compliance at level 3 and 89% at level 4. At level 5, 80% of the indicators are passed. Our service meets all indicators classified as essential. The four indicators that were not met belong to the important and useful categories. It is worth highlighting that the failed indicators concern automated features: including references and/or qualified references to other data, and accessing data automatically (i.e., by a computer program). We intend to address these in our future work. Given the high relevance of the service for implementing FAIR, we aim to provide reusable and generalized components as a blueprint for other projects.
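The pass or fail evaluation described above reduces to a simple percentage per level. The following Python sketch shows that arithmetic on made-up input; the function name and the data layout are assumptions, and the example values are not the indicators or results of the assessment reported here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_level(results: Iterable[Tuple[int, bool]]) -> Dict[int, int]:
    """Compute the percentage of passed indicators for each level.

    `results` holds (level, passed) pairs, one per assessed indicator.
    Percentages are rounded to the nearest integer.
    """
    totals: Dict[int, int] = defaultdict(int)
    passed: Dict[int, int] = defaultdict(int)
    for level, ok in results:
        totals[level] += 1
        if ok:
            passed[level] += 1
    return {
        level: round(100 * passed[level] / totals[level])
        for level in sorted(totals)
    }


# Illustrative input only (hypothetical indicators, not the real assessment).
example = [(1, True), (1, True), (3, True), (3, False)]
rates = pass_rate_by_level(example)
```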

Notes

Funded by Deutsche Forschungsgemeinschaft as part of NFDI under project number 442494171.
