Published January 9, 2023 | Version 1.0.0

TOWARDS A DATA QUALITY FRAMEWORK FOR EOSC

  • 1. Barcelona Supercomputing Center, Spain
  • 2. European Research Infrastructure on Highly Pathogenic Agents
  • 3. University of Tartu, Estonia
  • 4. University of Helsinki, Finland
  • 5. Politecnico di Milano, Italy
  • 6. Biozentrum, University of Basel, Switzerland
  • 7. National Physical Laboratory, Teddington, UK
  • 8. Vienna University of Technology, Library, Austria
  • 9. SWITCH, Zürich, Switzerland
  • 10. Deutsches Klimarechenzentrum GmbH (DKRZ), Hamburg, Germany
  • 11. Novo Nordisk Foundation Center for Stem Cell Medicine- reNEW, University of Copenhagen, Denmark

Description

The European Open Science Cloud (EOSC) Association leverages thirteen Task Forces (TFs), grouped into five Advisory Boards, to help steer the implementation of EOSC. This document is released by the Data Quality subgroup of the “FAIR Metrics and Data Quality” TF. Data quality is critical in ensuring the credibility, legitimacy, and actionability of resources within EOSC. Indeed, certification and conformity mechanisms must be established to assure researchers that the infrastructures where they deposit and access data conform to clear rules and criteria. If researchers feel a loss of control and visibility, or have concerns about how professionally their data will be managed, additional barriers to data sharing will emerge. Informed by the results of a systematic literature review and a community consultation utilising surveys, presentations, and case studies, this TF identified key concepts and formulated recommendations. Let us start with a quick view of the critical concepts.

Following the definition given in ISO 8000, by “data quality” we mean the degree to which a set of inherent characteristics of data fulfils requirements. Aligning actual data characteristics with the desired requirements implies that quality depends on context (both the dataset’s application and its lifecycle) and on stakeholders, which are central elements in setting the requirements. Requirements must have a clear target aspect (e.g., privacy) and a level to reach (e.g., the GDPR standard), and can be distinguished into functional (targeting the question “are you producing the right thing for its application?”) and non-functional (targeting the question “are you producing it right?”). The goal of data quality management is to ensure that: (i) valuable information to understand the dataset is available, (ii) the dataset is reliable and ready to be used according to non-functional requirements (fit-for-use), and (iii) the dataset meets functional requirements (fit-for-purpose) when the purpose is known; if the purpose is unknown, data quality management ensures that the information necessary for users to make self-assessments of fitness-for-purpose is available. Note that in a FAIR ecosystem, where datasets could be reused far from the original purpose, information about the intended new purpose and the associated functional requirements is limited.
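
As a purely illustrative sketch (not part of the report), the notion of a requirement described above can be made concrete by recording its target aspect, the level to reach, and whether it is functional or non-functional; the class and field names below, and the classification of the examples, are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """One data quality requirement.

    aspect:     the target aspect the requirement addresses (e.g. privacy)
    level:      the level to reach (e.g. the GDPR standard)
    functional: True if it asks "are you producing the right thing for its
                application?", False if it asks "are you producing it right?"
    """
    aspect: str
    level: str
    functional: bool


# Illustrative examples (assumed, not taken from the report):
privacy = Requirement(aspect="privacy", level="GDPR standard", functional=False)
coverage = Requirement(aspect="spatial coverage", level="covers the study region",
                       functional=True)
```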

This document explains basic concepts to build a solid basis for a mutual understanding of data quality in a multidisciplinary environment. These range from the difference between quality control, assurance, and management to categories of quality dimensions, as well as typical approaches and workflows to curate and disseminate dataset quality information, minimum requirements, indicators, certification, and vocabulary. These concepts are explored with attention to the importance of carefully evaluating available resources when deciding how sophisticated the quality assessments should be. Human resources, technology capabilities, and capacity-building plans constrain the design of sustainable solutions.

We identified several benefits (or risks) of having good (or poor) quality and which stakeholders are impacted. Despite these benefits, barriers and concerns prevent the provision of quality-assessed datasets; we identified these issues in detail.

Distilling the knowledge accumulated in this Task Force, we extracted cross-domain commonalities, lessons learned, and challenges. The resulting main recommendations are:

  1. Data quality assessment needs standards to check data against; unfortunately, not all communities have agreed on standards, so EOSC should assist and push each community to agree on community standards to guarantee the FAIR exchange of research data. Although we extracted a few examples highlighting this gap, the current situation requires a more detailed and systematic evaluation in each community. Establishing a quality management function can help in this direction because the process can identify which standards already in use by some initiatives can be enforced as general requirements for that community. We recommend that EOSC consider taking the opportunity to encourage communities to reach a consensus on using their standards.

  2. Data in EOSC need to be served with enough information for the user to understand how to read and correctly interpret the dataset, what restrictions are in place for its use, and which processes were involved in its production. EOSC should ensure that the dataset is structured and documented in a way that can be (re)used and understood. Quality assessments in EOSC should not be concerned with checking the soundness of the data content. Aspects like uncertainty are also important to properly (re)use a dataset; still, these aspects must be evaluated outside the EOSC ecosystem, which only checks that evidence about data content assessments is available. Following stakeholders’ expectations, we recommend that EOSC be equipped with essential data quality management, i.e., it should perform tasks like controlling the availability of basic metadata and documentation and performing basic metadata compliance checks (a minimal sketch of such a check appears after this list). The EOSC quality management should not change data but point to deficiencies that the data provider or producer can address.

  3. Errors found by the curators or users need to be rectified by the data producer/provider. If that is not possible, errors need to be documented. Improving data quality as close to the source (i.e., producer or provider) as possible is highly recommended. Quality assessments conducted in EOSC should be shown first to the data provider, to give them a chance to improve the data, and only then to the users.

  4. User engagement is necessary to understand the user requirements (needs, expectations, etc.); it may or may not be part of a quality management function. Determining and evaluating stakeholder needs is not a one-time requirement but a continuous and collaborative part of the service delivery process.

  5. It is recommended to develop a proof-of-concept quality function performing basic quality assessments tailored to EOSC needs (e.g., data reliability and usability). These assessments can also support rewarding the research teams most committed to providing FAIR datasets. The proof-of-concept function cannot be a theoretical conceptualization of what is preferable in terms of quality; instead, it must be constrained by the reality of dealing with an enormous amount of data within a reasonable time and workforce.

  6. Data quality is a concern for all stakeholders, as detailed further in this document. Quality assessments must be a multi-actor process between the data provider, EOSC, and users, potentially extended to other actors in the long run. The resulting content of quality assessments should be captured in structured, human- and machine-readable, standards-based formats (a second sketch after this list illustrates one possibility). Dataset information must be easily comparable across similar products, which calls for providing homogeneous quality information.

  7. We defined a number of requirements valid for all datasets in EOSC (and beyond), as well as specific aspects of a maturity matrix gauging the maturity of a community when dealing with quality. Further refinement will be necessary in the future, and specific standards to follow will need to be identified.
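
To ground recommendation 2, the following is a minimal sketch of the kind of basic metadata compliance check EOSC could run; the required fields and function name are assumptions, since the actual schema would be the one agreed by each community, and the check only reports deficiencies rather than changing the data.

```python
# Hypothetical list of required metadata fields; in practice, the schema would
# come from the community-agreed standard for each discipline.
REQUIRED_FIELDS = ("title", "creator", "description", "license", "publication_date")


def check_basic_metadata(record: dict) -> list[str]:
    """Return the deficiencies found in a metadata record.

    The check never modifies the record: it only points to gaps that the data
    provider or producer can address, as recommended above.
    """
    deficiencies = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            deficiencies.append(f"missing or empty required field: {field}")
    return deficiencies


if __name__ == "__main__":
    example_record = {"title": "Example dataset", "creator": "", "license": "CC-BY-4.0"}
    for issue in check_basic_metadata(example_record):
        print(issue)
```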
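
Similarly, for recommendation 6, the snippet below serialises a hypothetical quality assessment record to JSON to show what a structured, human- and machine-readable format could look like; the field names and the example identifier are assumptions and do not reflect an EOSC or community standard.

```python
import json
from datetime import date

# Hypothetical assessment record; field names are illustrative only and would in
# practice follow a standards-based, community-agreed schema.
assessment = {
    "dataset_id": "doi:10.1234/example",  # placeholder identifier, not a real DOI
    "assessed_by": "EOSC quality function (proof of concept)",
    "assessment_date": date.today().isoformat(),
    "checks": [
        {"check": "basic metadata availability", "result": "pass"},
        {"check": "documentation availability", "result": "pass"},
        {"check": "metadata compliance with community schema", "result": "fail",
         "deficiency": "missing or empty required field: license"},
    ],
    # Content soundness (e.g. uncertainty) is assessed outside EOSC; EOSC only
    # records whether evidence of such assessments is available.
    "content_assessment_evidence_available": False,
}

# JSON keeps the record both human- and machine-readable; a shared schema would
# make assessments comparable across similar datasets.
print(json.dumps(assessment, indent=2))
```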

Files (6.1 MB)

Towards_a_data_quality_framework_for_eosc_final.pdf