FAIRsFAIR Data Object Assessment Metrics

This specification (v0.4) contains 17 core metrics proposed by FAIRsFAIR to evaluate the FAIRness of research data objects in Trustworthy Digital Repositories (TDRs). Two new metrics representing the FAIR principle A1 have been added into the specification. Metric descriptions (e.g., related resources, comments) were refined based on feedback received from external users and pilot repositories.


Versioning History
Version: 0.5
Date: 8 March 2022
Notes: The specification now also includes a definition of compliance (maturity) levels (0-3) for each metric. A draft of this version of the metrics has been published as an appendix of deliverable 4.5 (https://doi.org/10.5281/zenodo.5336159). The current version contains a corrigendum with respect to FsF-I1-02M, which was wrongly attributed to the I1 principle during the transition from v0.3 to v0.4 (see below). Instead, it clearly belongs to the I2 principle and has to be relabelled to 'FsF-I2-01M'.

Version: 0.4
Date: 12 October 2020
Notes: This specification includes 17 metrics. Two metrics representing the principle A1 have been added into the specification. Metric descriptions (e.g., related resources, comments) were refined based on feedback received from external users and pilot repositories.

Introduction
The overall goal of FAIRsFAIR 1 is to develop practical solutions to facilitate the application of the FAIR principles 2 throughout the research data life cycle. One of the expected outcomes of FAIRsFAIR is building pilots to support the assessment of FAIR digital objects from selected members of the European network of FAIR-enabling Trustworthy Digital Repositories (TDRs). While FAIR principles may apply to any digital objects, we are concerned with a subset of digital objects: research data 3 that are collected, measured, or created for purposes of scientific analysis.

Purpose
This specification (v0.5) presents 17 minimum viable metrics to systematically measure the extent to which research data objects are FAIR. A research data object 4 may comprise data, metadata, and documentation (such as policies and procedures). These components influence the implementation of the FAIR assessment. For instance, they can either be resources to be evaluated or evidence of enabling FAIR. The metrics are developed in stages, and are based on indicators proposed by the RDA FAIR Data Maturity Model Working Group 5, in addition to prior work conducted by the project partners such as FAIRdat 6 and FAIREnough 7, and the WDS/RDA Assessment of Data Fitness for Use checklist 8. We have evaluated and improved the metrics, for example through focus groups, internal reviews, public feedback, and tools (F-UJI 9, FAIR-Aware 10) implemented to support FAIR assessment in selected use cases 11. Datasets from five CoreTrustSeal certified repositories 12 have been tested with the automated FAIR assessment tool (F-UJI) developed. We welcome the possible adaptations of the metrics and the tools to support different FAIR assessment scenarios 13 in the research data lifecycle.

Scope
In its current form, the specification applies metrics that may correspond to a part of or the whole of a FAIR principle. To be inclusive of current data practices, we will continue improving the metrics through several iterations based on feedback from stakeholders interested in FAIR, and on the implementation of our use cases to demonstrate FAIR assessment. A new metric will be incorporated into the specification if required by a majority of participating TDRs. Ultimately, we strive to define metrics to address most FAIR principles and as explicitly as possible, both at data and metadata level. We recognize that data quality elements (e.g., completeness, precision/accuracy, validity, ease of data use), and data archival, preservation, and retention aspects are essential, but they are not within the scope of this specification. In addition to defining metrics against FAIR principles, the assessment of the metrics proposed in this specification depends on several factors below.
• In the FAIR ecosystem 14, FAIR assessment must go beyond the object itself. FAIR enabling services and repositories are vital to ensure that research data objects remain FAIR over time. Importantly, machine-readable services (e.g., registries) and documents (e.g., policies) are required to enable automated tests.
• In addition to repository and services requirements, automated testing depends on clear, machine assessable criteria. Some aspects (rich, plurality, accurate, relevant) specified in FAIR principles still require human mediation and interpretation.
• The tests must focus on generally applicable data/metadata characteristics until domain/community-driven criteria have been agreed (e.g., appropriate schemas and required elements for usage/access control). For example, for some of the metrics (i.e., on I and R principles), the automated tests we proposed only inspect the 'surface' of criteria to be evaluated. Therefore, tests are designed in consideration of generic cross-domain metadata standards such as Dublin Core, DCAT, DataCite, schema.org, etc.

Metric Outline
The metrics are specified following the template (Table 1), modified from Wilkinson et al. (2018) 15. In each metric table, we provide the descriptions and assessment details of the metric, and its alignment with the relevant FAIR principle and CoreTrustSeal requirement(s).

Metric Identifier
The local (FAIRsFAIR) identifier of the metric (for more details, see Figure 1).

Metric Name
Metric name in a human readable form.

Description
The definition of the metric, including examples.

FAIR Principle
The FAIR principle most related to the metric.

CoreTrustSeal Alignment
The CoreTrustSeal requirement(s) most related to the metric.

Assessment
Requirements and methods to perform the assessment against the metric.

Comments
A list of related resources which may be used as a reference basis to implement the assessment, constraints and limitations of the proposed assessment.

Each of the FAIRsFAIR metrics is identified following a naming convention. For example, in Figure 1, the identifier starts with the shortened form of the project's name, followed by the related FAIR principle identifier and local identifier. The last part of the identifier distinguishes the resource that will be evaluated based on the metric, e.g., data or metadata.
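The naming convention can be illustrated with a small parser. This is a minimal sketch rather than part of the specification: the regular expression, the `MetricId` type and the function name are hypothetical and were inferred from the identifiers listed in this document, and reading the suffix 'MD' as "metadata and data" is an assumption.

```python
import re
from typing import NamedTuple


class MetricId(NamedTuple):
    project: str    # shortened project name, e.g. "FsF"
    principle: str  # FAIR principle identifier, e.g. "F1" or "R1.3"
    local_id: str   # local number, e.g. "01"
    resource: str   # "D" (data), "M" (metadata) or "MD" (assumed: both)


# Hypothetical pattern inferred from the identifiers used in this document.
_METRIC_ID = re.compile(r"^(FsF)-([FAIR]\d(?:\.\d)?)-(\d{2})(MD|M|D)$")


def parse_metric_id(identifier: str) -> MetricId:
    """Split a metric identifier into the parts named by the convention."""
    match = _METRIC_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a FAIRsFAIR metric identifier: {identifier!r}")
    return MetricId(*match.groups())
```

For example, `parse_metric_id("FsF-R1.3-02D")` yields the project `FsF`, the principle `R1.3`, the local identifier `02` and the resource `D`.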

Identifier Name
FsF-F1-01D Data is assigned a globally unique identifier.
FsF-F2-01M Metadata includes descriptive core elements (creator, title, data identifier, publisher, publication date, summary and keywords) to support data findability.
FsF-F3-01M Metadata includes the identifier of the data it describes.
FsF-F4-01M Metadata is offered in such a way that it can be retrieved by machines.
FsF-A1-01M Metadata contains access level and access conditions of the data.
FsF-A1-02M Metadata is accessible through a standardized communication protocol
FsF-A1-03D Data is accessible through a standardized communication protocol
FsF-A2-01M Metadata remains available, even if the data is no longer available.
FsF-I1-01M Metadata is represented using a formal knowledge representation language.
FsF-I3-01M Metadata includes links between the data and its related entities.
FsF-R1-01MD Metadata specifies the content of the data.
FsF-R1.1-01M Metadata includes license information under which data can be reused.
FsF-R1.3-01M Metadata follows a standard recommended by the target research community of the data.
FsF-R1.3-02D Data is available in a file format recommended by the target research community.

Metric Name
Data is assigned a persistent identifier.

Description
In this specification, we make a distinction between the uniqueness and persistence of an identifier. An HTTP URL (the address of a given unique resource on the web) is globally unique, but may not be persistent, as the URL of the data may no longer be accessible (link rot problem) or the data available under the original URL may have changed (content drift problem). Identifiers based on, e.g., the Handle System, DOI, or ARK are both globally unique and persistent. They are maintained and governed such that they remain stable and resolvable for the long term. The persistent identifier (PID) of a data object may resolve (point) to a landing page with metadata containing further information on how to access the data content, in some cases to a downloadable artefact, or to nothing if the data or repository is no longer maintained. Therefore, ensuring persistence is a shared responsibility between a PID service provider (e.g., DataCite) and its clients (e.g., data repositories). For example, the DOI system guarantees the persistence of its identifiers through its social (e.g., policy) and technical infrastructures, whereas a data provider ensures the availability of the resource (e.g., landing page, downloadable artefact) associated with the identifier.

Background
The EOSC PID policy requires a PID to be globally unique, persistent, and

Method
Check if the data identifier specified is based on a commonly accepted persistent identifier scheme and syntax, and it resolves to a landing page with metadata containing further information on how to access the data object. Note that this assessment method follows the current best practice to have a PID resolve to a landing page instead of its actual content.
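The syntax part of this check could look like the sketch below. It is an illustration under stated assumptions, not the method of any particular tool: the patterns are simplified and non-exhaustive, the names `PID_PATTERNS` and `detect_pid_scheme` are hypothetical, and the resolution step (following the identifier to its landing page) is deliberately left out because it needs network access.

```python
import re

# Simplified, non-exhaustive syntax patterns (assumptions for illustration).
# A real assessment would rely on a maintained registry of identifier schemes.
PID_PATTERNS = {
    "doi": re.compile(
        r"^(?:https?://(?:dx\.)?doi\.org/|doi:)?(10\.\d{4,9}/\S+)$", re.I),
    "handle": re.compile(
        r"^(?:https?://hdl\.handle\.net/|hdl:)(\d[\d.]*/\S+)$", re.I),
    "ark": re.compile(
        r"^(?:https?://[^/]+/)?(ark:/?\d{5,}/\S+)$", re.I),
}


def detect_pid_scheme(identifier: str):
    """Return the name of the first matching PID scheme, or None."""
    candidate = identifier.strip()
    for scheme, pattern in PID_PATTERNS.items():
        if pattern.match(candidate):
            return scheme
    return None
```

A plain HTTP URL such as `https://example.org/dataset/42` matches none of the patterns, which reflects the distinction made above: it may be globally unique, but its syntax gives no indication of persistence.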

COMMENTS Related Resources
• A wiki entry on persistent identifier, https://en.wikipedia.org/wiki/Persistent_identifier
• The assessment verifies the resolvability of the specified identifier to a landing page, but a PID may resolve to a data file or a web service response.
• A registry of persistent identifiers should provide the list of identifiers as well as associated policy documents for ensuring persistence that may be periodically reviewed and updated. If a policy document is issued with a validity period, this should be captured by the registry.
• A PID service provider may periodically check if an identifier within its registry is resolvable (e.g., https://support.datacite.org/docs/link-checker). While the PID itself may be persistent, it may not resolve to a downloadable artefact if the data or repository is no longer maintained.

FIELD DESCRIPTION
Metric Identifier
FsF-F2-01M

Metric Name
Metadata includes descriptive core elements (creator, title, data identifier, publisher, publication date, summary and keywords) to support data findability.

Description
Metadata is descriptive information about a data object. Since the metadata required differs depending on the users and their applications, this metric focuses on core metadata. The core metadata is the minimum descriptive information required to enable data finding, including citation which makes it easier to find data. We determine the required metadata based on common data citation guidelines (e.g., DataCite

Method
Use the data identifier to access its metadata document. Parse or retrieve core metadata, e.g., through one or more options below, combine the results and then verify presence/absence of the core elements in the metadata.
• Structured data embedded in the landing page of the identifier (e.g., Schema.org, Dublin Core and OpenGraph meta tags)
• Typed Links in the HTTP Link header; for more information, see https://signposting.org/conventions/
• The assessment assumes that the identifier resolves to a landing page (e.g., HTML) that contains the metadata of the data. The landing page may not necessarily be an HTML page.
• Data providers may use different standards to expose the metadata of their data.
• The metadata records maintained by a data provider might not be accessible, due to, e.g., broken link of the landing page, proprietary metadata standard used, and restricted metadata.
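The "combine the results and then verify presence/absence" step of the method can be sketched as follows. This is a hedged illustration only: the flat dictionary layout, the key names in `CORE_ELEMENTS` and both function names are hypothetical, and a real implementation would first map each source (embedded structured data, typed links) onto such a common structure.

```python
# Hypothetical key names for the core elements listed in this metric.
CORE_ELEMENTS = ("creator", "title", "identifier", "publisher",
                 "publication_date", "summary", "keywords")


def merge_metadata(*sources: dict) -> dict:
    """Combine metadata from several sources; earlier sources take precedence."""
    merged = {}
    for source in sources:
        for key, value in source.items():
            # Skip empty values so a later source can still fill the gap.
            if value not in (None, "", [], {}) and key not in merged:
                merged[key] = value
    return merged


def check_core_elements(metadata: dict) -> dict:
    """Map each core element to True (present) or False (absent or empty)."""
    return {element: bool(metadata.get(element)) for element in CORE_ELEMENTS}
```

Giving earlier sources precedence is a design choice made for this sketch; the specification only asks that the results be combined.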

FIELD DESCRIPTION
Metric Identifier
FsF-F3-01M

Metric Name
Metadata includes the identifier of the data it describes.

Description
The metadata should explicitly specify the identifier of the data (content) such that users can discover and access the data through the metadata. If the identifier specified is persistent and points to a landing page, the data identifier and links to download the data content should be taken into account in the assessment.
• A metadata standard may not support any element or include multiple elements through which a data identifier may be specified.
• Different practices of associating data with its metadata should be handled as part of the assessment:
o Data is assigned with an identifier that resolves to a page that contains metadata of the data. The metadata may contain the identifier and a URL to access the data (contents). In this case, the access URL should be tested.
o Data and metadata are assigned with separate identifiers. Therefore, the data identifier should be tested.

FIELD DESCRIPTION
Metric Identifier
FsF-F4-01M

Metric Name
Metadata is offered in such a way that it can be retrieved by machines.

Description
This metric refers to ways through which the metadata of data is exposed or provided in a standard and machine-readable format. Assessing this metric will require an understanding of the capabilities offered by the data repository used to host the data. Metadata may be available through multiple endpoints. For example, if data is hosted by a repository, the repository may disseminate its metadata through a metadata harvesting protocol (e.g., via OAI-PMH) and/or a web service. Metadata may also be embedded as structured data on a data page

Assessment
The following methods may be applied to determine if metadata of the data is accessible programmatically:
• Check if the metadata provision endpoint returns metadata records based on a request using the data identifier (see comment* below)
• Check if search engine friendly structured data is embedded in the data landing page with a proper resource type, e.g., schema.org representation of type 'Dataset' or 'Collection'.
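The second check can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: it covers only JSON-LD embedded in `<script type="application/ld+json">` elements (RDFa and Microdata are not handled, nor are `@graph` containers), it works on an HTML string that has already been retrieved, and the class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # ignore malformed blocks instead of failing the check


def has_dataset_markup(html: str, accepted=("Dataset", "Collection")) -> bool:
    """Return True if any top-level JSON-LD item has an accepted @type."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for doc in parser.documents:
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            types = [types] if isinstance(types, str) else types
            if any(t in accepted for t in types):
                return True
    return False
```

As the limitations below point out, a positive result only shows that typed structured data is present on the landing page; it does not show that the dataset is actually indexed by a search engine.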

COMMENTS Related Resources
• Google reference documentation on representing structured data of Dataset, https://developers.google.com/search/docs/data-types/dataset

Known Limitations/Constraints
• *Data providers may expose their metadata through different ways, e.g., OAI-PMH, REST API using JSONAPI specification, and Catalog Service for the Web (CSW). Their endpoints (URLs) should be machine discoverable and accessible. The metadata access endpoints of a repository can be found through FAIRsharing and re3data. However, at present, it is not possible to programmatically discover the metadata endpoints of a repository based on a data identifier, unless they are explicitly specified in the metadata or the landing page of the data. Mapping the client ids from DataCite's PID service to re3data identifiers is in progress and might provide a starting point for the assessment.
• Structured data may be provided in different formats: JSON-LD, RDFa or Microdata. The variety of formats should be handled as part of the assessment.
• The assessment only verifies if structured data is present on the data landing page with a proper type (e.g., Dataset or Collection). Embedding structured data does not guarantee that the data will be present on search results. To verify that the data is findable through a web search engine, we should perform a search through the search engine API based on the data identifier and its descriptive metadata (e.g., title, author). However, most of the web search engine APIs (e.g., Google Custom Search, Bing Web Search API) offer a limited number of free search queries.

FIELD DESCRIPTION
Metric Identifier
FsF-A1-01M

Metric Name
Metadata contains access level and access conditions of the data.

Description
This metric determines if the metadata includes the level of access to the data such as public, embargoed, restricted, or metadata-only access and its access conditions. Both access level and conditions are necessary information to potentially gain access to the data. It is recommended that data should be as open as possible and as closed as necessary.
• There are no access conditions for public data. Datasets should be released into the public domain (e.g., with an appropriate public-domain-equivalent license such as Creative Commons CC0 license) and openly accessible without restrictions when possible.
• Embargoed access refers to data that will be made publicly accessible at a specific date. For example, a data author may release their data after having published their findings from the data. Therefore, access conditions such as the date on which the data will be released publicly are essential and should be specified in the metadata.
• Restricted access refers to data that can be accessed under certain conditions (e.g., because of commercial, sensitive, or other confidentiality reasons or the data is only accessible via a subscription or a fee). Restricted data may be available to a particular group of users or after permission is granted. For restricted data, the metadata should include the conditions of access to the data such as point of contact or instructions to access the data.
• Metadata-only access refers to data that is not made publicly available and for which only metadata is publicly available.
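The access levels and the conditions expected for each of them can be expressed as a simple consistency check. The sketch below is illustrative only: the key names `access_level`, `embargo_end_date` and `access_instructions` are hypothetical and do not come from any metadata standard, and the check looks only at the presence of the information, not at its correctness.

```python
# Access levels named in this metric description.
ACCESS_LEVELS = {"public", "embargoed", "restricted", "metadata-only"}


def check_access_metadata(metadata: dict) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    level = metadata.get("access_level")  # hypothetical key name
    if level not in ACCESS_LEVELS:
        return [f"unknown or missing access level: {level!r}"]
    problems = []
    if level == "embargoed" and not metadata.get("embargo_end_date"):
        problems.append("embargoed data should state its release date")
    if level == "restricted" and not metadata.get("access_instructions"):
        problems.append(
            "restricted data should state a contact or access instructions")
    return problems
```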

FAIR Principle
A1: (Meta)data are retrievable by their identifier using a standardized communication protocol
Note: This metric is about ensuring provision of metadata related to data access. This metadata is important to retrieve data using a standardized communication protocol, thus we mapped it to the principle A1.

CoreTrustSeal Alignment
R2. The repository maintains all applicable licenses covering data access and use and monitors compliance
R15. The repository functions on well-supported operating systems and other core infrastructural software and is using hardware and software technologies appropriate to the services it provides to its Designated Community

ASSESSMENT Requirement(s)
• The metadata standard considered as part of the assessment may not include all of the elements for representing data access levels and related access information. The access information may be expressed in an unstructured manner, e.g., as a 'comment' in the metadata document.
• The assessment of this metric only checks the metadata of access restrictions, but it does not validate if the access conditions specified are correct.
• The assessment should be complemented with the evaluation of the data access mechanism based on the specified access levels, e.g., data is not accessible, accessible in a semi-automated (mediated access to data via data custodian), or automated fashion.
• A data object may consist of several files with different access levels; some are with open access while others are with restricted access. So mixed access levels may apply to the object.

Metric Name
Metadata is accessible through a standardized communication protocol

Description
Given an identifier of a dataset, the metadata of the dataset should be retrievable using a standard communication protocol. Consider, for example, the application layer protocols such as HTTP, HTTPS, FTP, TFTP, SFTP and AtomPub. Avoid disseminating metadata using a proprietary protocol (e.g., Apple Filing Protocol).

FAIR Principle
A1: (Meta)data are retrievable by their identifier using a standardized communication protocol

CoreTrustSeal Alignment
R15. The repository functions on well-supported operating systems and other core infrastructural software and is using hardware and software technologies appropriate to the services it provides to its Designated Community.

ASSESSMENT Requirement(s)
• Data identifier (IRI, URL)

Compliance Levels
level test score
Landing page link is based on standardized web communication protocols. 1

Assessment
Use the data identifier to access its landing page (metadata document). Verify the application protocol used to serve the page based on the scheme part of the IRI. In case external metadata is linked to the landing page by typed links, use the data identifier specified in the typed link.
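Checking the scheme part of an IRI is straightforward with the standard library. The following is a minimal sketch: `STANDARD_SCHEMES` is an assumed, non-normative set derived from the protocols named in the description that have a URI scheme of their own, and the function name is hypothetical.

```python
from urllib.parse import urlparse

# Assumed set of acceptable schemes; not a normative list.
STANDARD_SCHEMES = {"http", "https", "ftp", "tftp", "sftp"}


def uses_standard_protocol(iri: str) -> bool:
    """Return True if the scheme part of the IRI names a standard protocol."""
    scheme = urlparse(iri).scheme.lower()
    return scheme in STANDARD_SCHEMES
```

An identifier given without a resolver URL, such as `doi:10.1234/x`, fails this test because `doi` is not a communication protocol; it would first have to be turned into its resolvable HTTP(S) form.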

Known Limitations/Constraints
• The metadata of a dataset may be shared in different ways (landing page, dedicated API, link relation type). The assessment assumes that the identifier resolves to a landing page (e.g., HTML) that contains the metadata of the dataset or includes typed links resolving to an external metadata resource.

Metric Name
Data is accessible through a standardized communication protocol

Description
Given an identifier of a dataset, the dataset should be retrievable using a standard communication protocol such as HTTP, HTTPS, FTP, TFTP, SFTP, FTAM and AtomPub. Avoid disseminating data using a proprietary protocol.

FAIR Principle
A1: (Meta)data are retrievable by their identifier using a standardized communication protocol

CoreTrustSeal Alignment
R15. The repository functions on well-supported operating systems and other core infrastructural software and is using hardware and software technologies appropriate to the services it provides to its Designated Community

ASSESSMENT Requirement(s)
• Data identifier (IRI, URL)

Compliance Levels
level test score
Metadata includes a resolvable link to data which is based on standardized web communication protocols. 1

Assessment
Check the application protocol of the data identifier based on the scheme part of the given IRI. In case external metadata is linked to the landing page by typed links, use the data identifier specified in the typed link.

Known Limitations/Constraints
• Restricted or sensitive datasets may not be retrievable over the Web. Special authorization services may be required to retrieve these datasets from the data bank.

FIELD DESCRIPTION
Metric Identifier
FsF-A2-01M

Metric Name
Metadata remains available, even if the data is no longer available.

Description
This metric determines if the metadata will be preserved even when the data they represent are no longer available, replaced or lost. Similar to metric FsF-F4-01M, answering this metric will require an understanding of the capabilities offered, data preservation plan and policies implemented by the data repository and data services (e.g., DataCite PID service). Continued access to metadata depends on a data repository's preservation practice which is usually documented in the repository's service policies or statements. A trustworthy data repository offering DOIs and implementing a PID Policy should guarantee that metadata will remain accessible even when data is no longer available for any reason (e.g., by providing a tombstone page).

FAIR Principle
A2. Metadata should be accessible even when the data is no longer available

CoreTrustSeal Alignment
R10. The repository assumes responsibility for long-term preservation and manages this function in a planned and documented way

ASSESSMENT Requirement(s)
--

Assessment
The preservation of the metadata of a data object can only be tested programmatically if the object has been deleted or replaced. So this test is only applicable for deleted, replaced or obsolete objects. Importantly, continued access to metadata depends on a data repository's preservation practice. Therefore, we consider that the assessment of this metric applies at the level of a repository, not at the level of individual objects. For this reason, we excluded its assessment details from this specification.
Depending on the supported persistent identifier type, some metadata may be preserved by default in a registry maintained by a PID provider (e.g., DataCite). In addition to a repository's preservation policy or statement, an exchange protocol may indicate the status of records in an archive. For instance, the OAI-PMH harvesting protocol offers a field to declare one of three levels (no, persistent, and transient) of support for deleted records.
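As an illustration of the OAI-PMH case, the declared level of support for deleted records can be read from the `deletedRecord` element of an `Identify` response. The sketch below assumes the response has already been retrieved as a string; the function name is hypothetical, and the result describes what the repository declares, not whether metadata is in fact preserved.

```python
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def deleted_record_support(identify_xml: str):
    """Return 'no', 'persistent' or 'transient' from an Identify response.

    Returns None if the element is missing or holds an unexpected value.
    """
    root = ET.fromstring(identify_xml)
    element = root.find("oai:Identify/oai:deletedRecord", OAI_NS)
    if element is None or element.text is None:
        return None
    value = element.text.strip()
    return value if value in {"no", "persistent", "transient"} else None
```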

Known Limitations/Constraints
• Data preservation statements are usually found in a repository's data policy or other governance documents. Machine-actionable representation of preservation policies in repository catalogues and registries such as re3data is important to enable an automated assessment of the statements. Further work in this area is needed, for example to enable data producers to receive repository recommendations, based on preservation requirements expressed in machine-actionable DMPs, e.g., http://dx.doi.org/10.2218/ijdc.v15i1.704
• Currently, PID providers (e.g., DataCite) do not offer any tombstone pages automatically for unavailable objects. Data providers may maintain the pages instead, for example https://doi.pangaea.de/10.1594/PANGAEA.715333

FIELD DESCRIPTION
Metric Identifier
FsF-I1-01M

Metric Name
Metadata is represented using a formal knowledge representation language.

Description
Knowledge representation is vital for machine-processing of the knowledge of a domain. Expressing the metadata of a data object using a formal knowledge representation will enable machines to process it in a meaningful way and enable more data exchange possibilities. Examples of knowledge representation languages are RDF, RDFS, and OWL. These languages may be serialized (written) in different formats. For instance, RDF/XML, RDFa, Notation3, Turtle, N-Triples and N-Quads, and JSON-LD are RDF serialization formats.

FAIR Principle
I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
Note: The I1 principle loosely defines the use of knowledge representation. Therefore, we define two metrics corresponding to the principle concerning metadata. The metric FsF-I1-01M focuses on making the metadata available for machine-mediated interpretation, whereas the metric FsF-I1-02M focuses on the use of semantic resources to enrich metadata.

CoreTrustSeal Alignment
R14. The repository enables reuse of the data over time, ensuring that appropriate metadata are available to support the understanding and use of the data
R15. The repository functions on well-supported operating systems and other core infrastructural software and is using hardware and software technologies appropriate to the services it provides to its Designated Community

ASSESSMENT Requirement(s)

Assessment
Machine-actionable representation (e.g., RDF) of the metadata may be retrieved as follows:
• If content negotiation is supported, use the identifier to request a machine-actionable representation, e.g., an RDF-based document.
• Use the 'typed links' given in the HTML header section of the landing page to access the RDF-based metadata of the data, e.g., https://data.gov.lv/dati/lv/dataset/covid-19
• Query the SPARQL endpoint using the identifier (or optionally title) of the data, for example by using metadata elements from dcterms and dcat standards. Perform a full text-search within the SPARQL query if it is supported.

Known Limitations/Constraints
• Based on a data identifier, it is not possible to programmatically discover the SPARQL endpoint provided by a data repository, unless the endpoint information is specified in the repository metadata, e.g., https://www.re3data.org/repository/r3d100012203
• The RDF-based metadata may not be supported by the data repository which curates the data, but it may be available through external linked data repositories, e.g., bio2rdf.
• RDF data may be serialized in a number of different ways. Therefore, the variety of serialization formats (and their respective MIME types) should be considered when performing the SPARQL query.
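For the content negotiation step and the variety of serialization formats mentioned above, a client needs a mapping between RDF serializations and their MIME types. The sketch below builds an `Accept` header and classifies a returned `Content-Type`; the preference order and the function names are assumptions made for illustration, and no request is actually sent.

```python
# Registered media types of common RDF serializations.
RDF_MIME_TYPES = {
    "RDF/XML": "application/rdf+xml",
    "Turtle": "text/turtle",
    "N-Triples": "application/n-triples",
    "N-Quads": "application/n-quads",
    "Notation3": "text/n3",
    "JSON-LD": "application/ld+json",
}


def build_accept_header(preferred=("JSON-LD", "Turtle", "RDF/XML")) -> str:
    """Build an Accept header value with decreasing quality values."""
    parts = []
    for rank, name in enumerate(preferred):
        quality = max(1.0 - 0.1 * rank, 0.1)
        parts.append(f"{RDF_MIME_TYPES[name]};q={quality:.1f}")
    return ", ".join(parts)


def rdf_format_of(content_type: str):
    """Return the serialization name for a Content-Type value, or None."""
    mime = content_type.split(";", 1)[0].strip().lower()
    for name, value in RDF_MIME_TYPES.items():
        if value == mime:
            return name
    return None
```

A response whose `Content-Type` maps to `None` here (for example `text/html`) suggests that the server ignored the negotiation request and returned the landing page instead.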

FIELD DESCRIPTION
Metric Identifier
FsF-I2-01M

Metric Name
Metadata uses semantic resources.

Description
A metadata document or selected parts of the document may incorporate additional terms from semantic resources (also referred to as semantic artefacts) that unambiguously describe the contents so they can be processed automatically by machines. This metadata enrichment may facilitate enhanced data search and interoperability of data from different sources. Ontologies, thesauri, and taxonomies are kinds of semantic resources, and they come with varying degrees of expressiveness and computational complexity. Knowledge organization schemes such as thesauri and taxonomies are semantically less formal than ontologies.

FAIR Principle
I2. (Meta)data use vocabularies that follow FAIR principles

CoreTrustSeal Alignment
R14. The repository enables reuse of the data over time, ensuring that appropriate metadata are available to support the understanding and use of the data
R15. The repository functions on well-supported operating systems and other core infrastructural software and is using hardware and software technologies appropriate to the services it provides to its Designated Community

ASSESSMENT Requirement(s)

Assessment
This assessment is the continuation of the assessment FsF-I1-01M, but focuses on the metadata contents.

Known Limitations/Constraints
• The assessment checks the inclusion of semantic markup in the metadata page, not its contents and quality, e.g., if the terms used are in appropriate context and accessible over the web.
• There is no up-to-date, maintained, cross domain ontology catalogue, registry or ontology library available.
• It is hard to verify if the metadata uses FAIR vocabularies as the criteria defining a FAIR vocabulary have not yet been fully developed and recommended.

Metric Name
Metadata includes links between the data and its related entities.

Description
Linking data to its related entities will increase its potential for reuse. The linking information should be captured as part of the metadata. A dataset may be linked to its prior version, related datasets or resources (e.g., publication, physical sample, funder, repository, platform, site, or observing network registries). Links between data and its related entities should be expressed through relation types (e.g., DataCite Metadata Schema specifies relation types between research objects through the fields 'RelatedIdentifier' and 'RelationType'), and preferably use persistent identifiers for related entities (e.g., ORCID for contributors, DOI for publications, and ROR for institutions).

FAIR Principle
I3. (Meta)data include qualified references to other (meta)data

CoreTrustSeal Alignment
R11. The repository has appropriate expertise to address technical data and metadata quality and ensures that sufficient information is available for end users to make quality-related evaluations

ASSESSMENT Requirement(s)

Assessment
• Use the data identifier to access its metadata record.
• Check the metadata elements which indicate the relationship between data and related entities.
• Test if the URLs of the related entities are active (not broken links).
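The second step, collecting the qualified relations from a metadata record, could be sketched as follows. The key names `relatedIdentifiers`, `relatedIdentifier` and `relationType` follow the JSON layout used for DataCite records and are an assumption here; other schemas use different properties. Testing whether the URLs are active requires network access and is left out, so `is_actionable` only separates web links from plain strings.

```python
def related_entities(metadata: dict) -> list:
    """Return (relation type, identifier) pairs from a DataCite-style record."""
    pairs = []
    for entry in metadata.get("relatedIdentifiers", []):
        identifier = entry.get("relatedIdentifier")
        relation = entry.get("relationType")
        # Only a reference with both parts counts as a qualified relation.
        if identifier and relation:
            pairs.append((relation, identifier))
    return pairs


def is_actionable(identifier: str) -> bool:
    """Return True if the identifier is given as an HTTP(S) link."""
    return identifier.lower().startswith(("http://", "https://"))
```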

Known Limitations/Constraints
• Different metadata schemas may use different properties to specify the relation between data and its related entities.
• The assessment regards any relation between data and its related entities as a success. It does not consider the quantity or types of relations.
• Links to related resources are not necessarily expressed as actionable links but may also be strings such as citations.

FIELD DESCRIPTION
Metric Identifier
FsF-R1-01MD

Metric Name
Metadata specifies the content of the data.

Description
This metric evaluates if the content of the dataset is specified in the metadata, and it should be an accurate reflection of the actual data deposited. Examples of the properties specifying data content are resource type (e.g., data or a collection of data), variable(s) measured or observed, method, data format and size. Ideally, ontological vocabularies should be used to describe data content (e.g., variable) to support interdisciplinary reuse.

FAIR Principle
R1: (Meta)data are richly described with a plurality of accurate and relevant attributes
Note: Data quality aspect is not explicitly addressed by FAIR principles. However, an accurate description of the data content is important for assessing the quality of the data. We regard the properties of data content as part of rich metadata, therefore we map this metric to its closest principle R1.

CoreTrustSeal Alignment
R11. The repository has appropriate expertise to address technical data and metadata quality and ensures that sufficient information is available for end users to make quality-related evaluations

ASSESSMENT Requirement(s)

Assessment
• Use the data identifier to access its metadata document. Verify the presence/absence of elements representing data content descriptions in the metadata document.
• Use the data access URL specified in the metadata to retrieve the actual data.
• Check if ontology terms are used to describe data content.
• Compare the content descriptions found with actual data properties (see comment* below).
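As an illustration of the comparison step, the sketch below checks declared content descriptors against a small CSV payload. The descriptor names (`format`, `size`, `variables`) and the sample data are assumptions made for the example, and the detected media type is supplied by the caller (for instance from a content analysis toolkit) rather than inferred here.

```python
import csv
import io

# Hypothetical content descriptors taken from a metadata document.
declared = {
    "format": "text/csv",
    "size": 36,  # bytes
    "variables": ["station", "temperature"],
}

# Stand-in for the data retrieved via the data access URL.
data = b"station,temperature\nA1,12.5\nB2,13.1\n"

def compare_csv_content(declared, data, detected_format="text/csv"):
    """Compare declared descriptors with properties of the actual CSV data."""
    rows = csv.reader(io.StringIO(data.decode("utf-8")))
    header = next(rows, [])
    return {
        "format": declared.get("format") == detected_format,
        "size": declared.get("size") == len(data),
        # Every declared variable must appear as a column name.
        "variables": set(declared.get("variables", [])) <= set(header),
    }
```

Each key in the result reports one descriptor, so a partial mismatch (for example a wrong size) stays visible instead of collapsing into a single pass or fail.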

Related Resources
• Frictionless Data, https://frictionlessdata.io/
• CSV on the Web: A Primer, https://www.w3.org/TR/tabular-data-primer/
• Apache Tika (an example of content analysis toolkit), https://tika.apache.org/

Known Limitations/Constraints
• *The proposed assessment has some general limitations and some cases where future expansion is dependent on contexts:
  o Descriptors (mandatory and optional properties of a schema) may influence metadata completeness.
  o Validation of descriptor content is beyond the scope of this test as it would depend on human judgement.
  o A detailed assessment of data file properties would depend on some agreed mechanism for defining and agreeing domain requirements.
• General-purpose metadata standards such as DataCite Metadata Schema and Schema.org provide elements to represent content descriptions. Thus, it is possible to check programmatically if the descriptions required are present in the metadata. However, the conformance/matching test may become a challenge due to the variety of data types and data sizes. Standardized tabular data and self-describing data formats (e.g., HDF, NetCDF, Parquet) are promising, but are not a solution for every research domain. Another challenge is that unstructured content descriptions might be included in a data file; fuzzy text-matching algorithms can be useful here.
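To illustrate the fuzzy text-matching idea mentioned above, the following sketch matches variable names declared in metadata against names found in a data file, using Python's standard `difflib` module. The normalisation rule, the cutoff value of 0.6, and the sample names are assumptions chosen for the example, not part of the metric.

```python
import difflib

def normalise(text):
    """Lower-case, treat underscores as spaces, collapse whitespace."""
    return " ".join(text.lower().replace("_", " ").split())

def match_descriptors(declared, found, cutoff=0.6):
    """Map each declared name to its closest name found in the data, or None."""
    candidates = {normalise(name): name for name in found}
    result = {}
    for name in declared:
        close = difflib.get_close_matches(
            normalise(name), list(candidates), n=1, cutoff=cutoff)
        result[name] = candidates[close[0]] if close else None
    return result
```

For example, `match_descriptors(["Salinity"], ["salinty", "depth"])` pairs the declared variable with the misspelled column name, while a declared variable with no similar counterpart maps to `None`.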

FIELD DESCRIPTION

Metric Identifier
FsF-R1.1-01M

Metric Name
Metadata includes license information under which data can be reused.

Description
This metric evaluates if data is associated with a license because otherwise users cannot reuse it in a clear legal context. We encourage the application of licenses for all kinds of data whether public, restricted or for specific users.
• The assessment checks if the license information is provided as part of the metadata. It does not validate if the specified license is the most appropriate license for the data. There may be quite specific circumstances related to the data that cannot be explicitly expressed in the metadata as to why a license was chosen.
• As part of the future improvement, the assessment of the metric may take into account several aspects of a data license such as (i) standard or bespoke license and (ii) machine-readability of license.
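A license check of this kind might look like the sketch below. It tests whether license information is present and, as one possible reading of the "standard or bespoke" distinction, whether the value matches an entry in a small table of SPDX identifiers. The table is a tiny illustrative subset, and the `license` field name and return structure are assumptions; a real check would consult the full SPDX license list.

```python
# Illustrative subset of SPDX identifiers and their canonical URLs.
SPDX_SUBSET = {
    "CC-BY-4.0": "https://creativecommons.org/licenses/by/4.0/",
    "CC0-1.0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "ODbL-1.0": "https://opendatacommons.org/licenses/odbl/1-0/",
}

def _normalise_url(value):
    """Compare URLs without regard to case, scheme variant or trailing slash."""
    return value.strip().lower().replace("http://", "https://").rstrip("/")

def assess_license(metadata):
    """Report whether a license is present and whether it is a known standard one."""
    value = (metadata.get("license") or "").strip()
    if not value:
        return {"present": False, "standard": False, "spdx_id": None}
    for spdx_id, url in SPDX_SUBSET.items():
        if value.lower() == spdx_id.lower() or _normalise_url(value) == _normalise_url(url):
            return {"present": True, "standard": True, "spdx_id": spdx_id}
    # Bespoke license or rights statement: present, but not machine-recognised.
    return {"present": True, "standard": False, "spdx_id": None}
```

Consistent with the limitation stated above, the sketch only establishes presence and recognisability; it makes no judgement about whether the license is appropriate for the data.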

FIELD DESCRIPTION

Metric Identifier
FsF-R1.2-01M

Metric Name
Metadata includes provenance information about data creation or generation.

Description
Data provenance (also known as lineage) represents a dataset's history, including the people, entities, and processes involved in its creation, management and longer-term curation. It is essential that data producers provide provenance information about the data to enable informed use and reuse. The levels of provenance information needed can vary depending on the data type (e.g., measurement, observation, derived data, or data product) and research domains. For that reason, it is difficult to define a set of finite provenance properties that will be adequate for all domains. Based on existing work, we suggest that the following provenance properties of data generation or collection are included in the metadata record as a minimum.
• Sources of data, e.g., datasets the data is derived from and instruments

Assessment
Use the data identifier to access its metadata record. Verify the presence/absence of metadata element(s) corresponding to the minimum data provenance properties.
• Presence of PROV-O or PAV information in RDFa microformats (landing page) or in RDF metadata.
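The presence/absence test can be sketched as a lookup of candidate metadata fields for each minimum provenance property named in this metric (sources, creation or collection date, contributors, and publication, modification and versioning information). The field names in the mapping are hypothetical; an actual assessment would map each property to the elements of the metadata schema in use.

```python
# Hypothetical mapping from minimum provenance properties to metadata fields.
PROVENANCE_FIELDS = {
    "sources": ["isDerivedFrom", "source", "instrument"],
    "creation_date": ["dateCreated", "dateCollected"],
    "contributors": ["contributor"],
    "publication_versioning": ["datePublished", "dateModified", "version"],
}

def assess_provenance(metadata):
    """List which provenance properties are backed by a non-empty field."""
    found = {
        prop: [field for field in fields if metadata.get(field)]
        for prop, fields in PROVENANCE_FIELDS.items()
    }
    missing = [prop for prop, fields in found.items() if not fields]
    return {"found": found, "missing": missing}
```

Reporting the missing properties separately keeps the result useful for data published at an early stage, where incomplete provenance may be expected.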

COMMENTS

Related Resources
• PROV Model Primer, https://www.w3.org/TR/prov-primer/
• The proposed minimum provenance properties are not final; new properties may be incorporated into the assessment if the requirement emerges. Properties such as processes/methods (incl. model, instrument, etc.) used in the data creation depend on domain standards.
• We regard references to related works (scholarly articles, data papers, preceding or associated data) as useful provenance information.This property of provenance is considered as part of FsF-I3-01M, therefore we excluded it from the assessment.
• Data may be published at different analysis stages (raw, processed, derivative, product).The completeness of the provenance information may depend on the stage at which the data is published.

FIELD DESCRIPTION

Metric Identifier
FsF-R1.3-01M

Metric Name
Metadata follows a standard recommended by the target research community of the data.

Description
In addition to core metadata required to support data discovery (covered under metric FsF-F2-01M), metadata to support data reusability should be made available following community-endorsed metadata standards. Some communities have well-established metadata standards (e.g., geospatial: ISO19115; biodiversity: DarwinCore, ABCD, EML; social science: DDI; astronomy: International Virtual Observatory Alliance Technical Specifications) while others have limited standards or standards that are under development (e.g., engineering and linguistics). The use of community-endorsed metadata standards is usually encouraged and for long-term storage, https://www.iso.org/standard/73117.html
• File type support lists provided by open source and commercial statistics (e.g., https://de.mathworks.com/help/matlab/import_export/supported-file-formats.html) or spreadsheet processing software vendors (e.g., https://support.microsoft.com/en-us/office/fileformats-that-are-supported-in-excel-0943ff2c-6014-4e8d-aaea-b83d51d46247?ui=en-us&rs=en-us&ad=us).

Known Limitations/Constraints
• *At present, there is a lack of reference resources (registries) against which a file format test can be checked programmatically. Common file formats endorsed by communities are not available through a registry but on static web pages (see resources above). This is an issue for the scientific community as a whole. Further work is needed to develop a standard approach to defining which formats are open and suitable for long-term preservation and use, and to managing those community-specific lists over time.
• Not all data can be made available in an open, non-proprietary, widely supported format, such as most 3D data, CAD data, dynamic spreadsheets or databases with specific significant characteristics which cannot be exported.
• Standard formats in earth system modeling (atmosphere, ocean) are netCDF and GRIB. GRIB is used for internal storage rather than for publication.
• Commonly used community file formats are not necessarily very domain specific. Some very generic file formats, e.g., for spreadsheets, are widely used by the scientific community.
• Data files may be made available using an archive file format (e.g., *.zip). In addition to the archive format, the actual file formats should be specified in the metadata such that machines can extract/unzip the downloaded file and read the actual files programmatically.
• Many scientific formats do not have an associated mime-type (e.g., BUFR), thus are hard to detect.
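The archive case can be illustrated with a short sketch that opens a zip file and reports a media type for each contained file. The extension table is a small hypothetical stand-in for a format registry, which, as noted above, does not currently exist in machine-readable form; formats without a known media type (such as BUFR) are reported as `None`.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical extension-to-media-type table standing in for a format registry.
KNOWN_TYPES = {
    ".csv": "text/csv",
    ".json": "application/json",
    ".nc": "application/x-netcdf",
}

def list_archive_formats(archive_bytes):
    """Map each file inside a zip archive to a media type, or None if unknown."""
    formats = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            suffix = PurePosixPath(info.filename).suffix.lower()
            formats[info.filename] = KNOWN_TYPES.get(suffix)
    return formats

def make_demo_archive():
    """Build a small in-memory archive so the sketch runs without any files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data/obs.csv", "a,b\n1,2\n")
        archive.writestr("data/obs.bufr", b"BUFR")
    return buffer.getvalue()
```

The result can then be compared with the file formats declared in the metadata; a `None` entry marks a file whose format cannot be confirmed from its name alone.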
FAIR Principle
F3: Metadata clearly and explicitly include the identifier of the data they describe

CoreTrustSeal Alignment
R13. The repository enables users to discover the data and refer to them in a persistent way through proper citation

ASSESSMENT
Requirement(s)

Table 1. Modified Metric Template

Table 2. List of Metrics.

Generic PID definitions, Initial Persistent Identifier Policy for the EOSC 18, ESIP 19, and IASSIST 20), and metadata recommendations for data discovery (e.g., EOSC Datasets Minimum Information (EDMI) 21, DataCite Metadata Schema, W3C Recommendation Data on the Web Best Practices and Data Catalog Vocabulary). This metric focuses on domain-agnostic core metadata. Domain or discipline-specific metadata specifications are covered under metric FsF-R1.3-01M. A repository should adopt a schema that includes properties of core metadata, whereas data authors should take the responsibility of providing core metadata.

Background
Following data citation guidelines (Data Citation Synthesis Group, 2014 22; Ball & Duke, 2015 23; Mooney & Newton, 2012 24; and Fenner et al., 2019 25), the metadata properties necessary for proper data citation are: creator, title, publication date, publisher, and identifier. In addition, an abstract or summary and keywords are essential to enable discoverability, and the indication of a resource type is necessary to distinguish research data objects from other digital objects (Fenner et al., 2019 24). The resulting set of core descriptive metadata elements (creator, title, publisher, publication date, summary, keywords, identifier) aligns well with existing recommendations for data discovery and core metadata definition (Asmi et al., 2017 21; DataCite Metadata Working Group, 2019 26; Loscio et al., 2017 27; and Albertoni et al., 2020 28). This set of metadata elements is present in most domain-agnostic metadata standards such as Dublin Core, DCAT-2, schema.org/Dataset, and DataCite schema.

FAIR Principle
F2. Data are described with rich metadata

CoreTrustSeal Alignment
R13. The repository enables users to discover the data and refer to them in a persistent way through proper citation

19 ESIP Data Preservation and Stewardship Committee. 2019. "Data Citation Guidelines for Earth Science Data, Version 2." ESIP. https://doi.org/10.6084/m9.figshare.8441816.v1.
20 https://iassistdata.org/community/data-citation-ig/data-citation-resources/
22 DataCite Metadata Working Group. 2019. "DataCite Metadata Schema Documentation for the Publication and Citation of Research Data. Version 4.3." DataCite e.V. 2019. https://doi.org/10.14454/7xq3-zf69.

• Content negotiation (including external negotiation services offered by PID providers)

Check if metadata has been made available via common methods at all.
Check if data citation metadata is available.
Check if core descriptive metadata is available.
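The sketch below shows, under stated assumptions, how these checks could be prepared: it builds content negotiation requests for an identifier URL and tests a retrieved metadata record for the citation and core descriptive elements named in this metric. The media types are common examples rather than a list prescribed by the specification, the flat field names are hypothetical, and the requests are only constructed here, not sent.

```python
from urllib.request import Request

# Example media types to ask for; assumptions, not a prescribed list.
ACCEPT_TYPES = [
    "application/ld+json",
    "application/vnd.datacite.datacite+json",
    "application/rdf+xml",
]

# Element sets as named in the metric description; field names are hypothetical.
CITATION_ELEMENTS = ["creator", "title", "publication_date", "publisher", "identifier"]
CORE_ELEMENTS = ["creator", "title", "publisher", "publication_date",
                 "summary", "keywords", "identifier"]

def build_negotiation_requests(identifier_url, accept_types=ACCEPT_TYPES):
    """One request per media type, each carrying its own Accept header."""
    return [Request(identifier_url, headers={"Accept": media_type})
            for media_type in accept_types]

def missing_elements(metadata, required):
    """Names of required elements that are absent or empty in the record."""
    return [name for name in required if not metadata.get(name)]
```

Passing each request to `urllib.request.urlopen` would perform the actual negotiation; the responses would still need to be parsed into a record before `missing_elements` can be applied.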

• Data identifier (IRI, URL)
• Machine-accessible and readable metadata
• Signposting the Scholarly Web, https://signposting.org/conventions/
• FAIR Signposting Profile, https://signposting.org/FAIR/

Known Limitations/Constraints

for use by web search engines such as Google and Bing or be available as linked (open) data.

FAIR Principle
F4. (Meta)data are registered or indexed in a searchable resource

CoreTrustSeal Alignment
R13. The repository enables users to discover the data and refer to them in a persistent way through proper citation

• Data identifier (IRI, URL)
• Machine-accessible and readable metadata

Assessment
Use the data identifier to access its metadata document. Check the presence/absence of data access level through metadata element(s). If it is embargoed data, check if the embargo end date is specified. If it is restricted data, check if the data access conditions are specified.

COMMENTS
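The access-level logic described in this assessment can be sketched as a small decision function. The field names (`access_level`, `embargo_end_date`, `access_conditions`) and the level values are assumptions for illustration; real metadata would express them through schema-specific elements or controlled vocabularies.

```python
from datetime import date

def assess_access(metadata):
    """Check that the access level and its dependent details are specified."""
    level = metadata.get("access_level")
    if not level:
        return {"level": None, "complete": False,
                "reason": "no access level in metadata"}
    if level == "embargoed":
        try:
            # Embargoed data must state when the embargo ends (ISO 8601 date).
            date.fromisoformat(metadata.get("embargo_end_date"))
        except (TypeError, ValueError):
            return {"level": level, "complete": False,
                    "reason": "embargo end date missing or not an ISO date"}
    if level == "restricted" and not metadata.get("access_conditions"):
        return {"level": level, "complete": False,
                "reason": "access conditions not specified"}
    return {"level": level, "complete": True, "reason": "ok"}
```

The function only verifies that the information is stated; it does not test whether the data is actually reachable under the stated conditions.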

• Data identifier (IRI, URL)
• Optionally a metadata provision endpoint (SPARQL endpoint)
• Machine-accessible and readable metadata
• Registry of semantic resources

Compliance Levels level test score
1 Vocabulary namespace URIs can be identified in metadata 2 3 Namespaces of known semantic resources can be identified in metadata 1

Compare the remaining namespaces with entries from existing (known) ontology registries (see examples listed in Related Resources).
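The two tests above (identifying namespace URIs, then comparing the remaining ones with known semantic resources) can be sketched as follows. The set of default namespaces to exclude and the set of known namespaces are small hand-picked assumptions standing in for a real registry of semantic resources.

```python
# Structural namespaces excluded before the comparison (an assumption).
DEFAULT_NAMESPACES = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/01/rdf-schema#",
    "http://www.w3.org/2001/XMLSchema#",
}

# Tiny stand-in for a registry of known semantic resources.
KNOWN_NAMESPACES = {
    "http://purl.org/dc/terms/",
    "http://www.w3.org/ns/prov#",
    "http://www.w3.org/ns/dcat#",
}

def namespace_of(term_uri):
    """Cut a term URI at its last '#' or '/' to obtain the namespace URI."""
    if "#" in term_uri:
        return term_uri.split("#", 1)[0] + "#"
    return term_uri.rsplit("/", 1)[0] + "/"

def assess_namespaces(term_uris):
    """Collect namespaces used in metadata and flag those found in the registry."""
    namespaces = {namespace_of(uri) for uri in term_uris}
    remaining = namespaces - DEFAULT_NAMESPACES
    return {
        "namespaces": sorted(namespaces),
        "known": sorted(remaining & KNOWN_NAMESPACES),
    }
```

A non-empty `namespaces` list corresponds to the first test and a non-empty `known` list to the second.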

Without an explicit license, users do not have a clear idea of what can be done with your data. Licenses can be of standard type (Creative Commons, Open Data Commons Open Database License) or bespoke licenses, and rights statements which indicate the conditions under which data can be reused. It is highly recommended to use a standard, machine-readable license such that it can be interpreted by machines and humans. In order to inform users about what rights they have to use a dataset, the license information should be specified as part of the dataset's metadata.

• SPDX license registry, https://spdx.org/licenses/
• Rights statements of cultural heritage objects, https://rightsstatements.org/page/1.0/?language=en
• ARDC Data Rights Management Guide, https://ardc.edu.au/guides/research-data-rights-

FAIR Principle
R1.1. (Meta)data are released with a clear and accessible data usage license

• Data creation or collection date
• Contributors involved in data creation and their roles
• Data publication, modification and versioning information

There are various ways through which provenance information may be included in a metadata record. Some of the provenance properties (e.g., instrument, contributor) may be best represented using PIDs (such as DOIs for data, ORCIDs for researchers). This way, humans and systems can retrieve more information about each of the properties by resolving the PIDs. Alternatively, the provenance information can be given in a linked provenance record expressed explicitly in, e.g., PROV-O or PAV or Vocabulary of Interlinked Datasets (VoID).