Open Science Graphs Must Interoperate!

Open Science Graphs (OSGs) are Scientific Knowledge Graphs whose intent is to improve the overall FAIRness of science by enabling open access to graph representations of metadata about the people, artefacts, and institutions involved in the research lifecycle, as well as the relationships between these entities, in order to support stakeholder needs such as discovery, reuse, reproducibility, statistics, trends, monitoring, impact, validation, and assessment. The represented information may span entities such as research artefacts (e.g. publications, data, software, samples, instruments) and items of their content (e.g. statistical hypothesis tests reported in publications), research organisations, researchers, services, projects, and funders. OSGs include relationships between such entities and sometimes formalised (semantic) concepts characterising them, such as machine-readable concept descriptions for advanced discoverability, interoperability, and reuse. OSGs are generally valuable individually, but would greatly benefit from information exchange across their collections, thereby improving their efficacy in serving stakeholder needs. They could, therefore, reuse and exploit the data aggregation and added value that characterise each OSG, decentralising the effort and capitalising on synergies, as no one-size-fits-all solution exists. The RDA IG on Open Science Graphs for FAIR Data is investigating the motivation and challenges underpinning the realisation of an Interoperability Framework for OSGs. This work describes the key motivations for i) the definition of a classification for OSGs to compare their features and identify commonalities, differences, and added value, and for ii) the definition of an Interoperability Framework, specifically an information model and APIs that enable a seamless exchange of information across graphs.


Introduction
The Open Science movement is urging scientists, communities, institutions, and policymakers to define and adopt methodologies, practices, and tools for openly publishing research artefacts beyond the scientific article, including research data, software, and digital experiments. As a consequence of this trend, researchers are increasingly depositing these artefacts and metadata about them, together with relationships among artefacts and other relevant contextual entities, such as those described in metadata registries about authors (e.g. ORCID), organisations (e.g. ROR, GRID), and data repositories (e.g. re3data). De facto, Open Science publishing practices materialise a global and decentralised Open Science Graph.
Naturally, there is great interest in contributing to and/or consuming such a graph for discovering and reusing artefacts as well as monitoring Open Science. To address this interest, several initiatives are building specialised Open Science Graphs (OSGs), capable of serving specific user needs. Google Scholar, Scopus [3], Web of Science [4], Microsoft Academic Graph [17], FREYA PID Graph [7], Research Graph Foundation [1], OpenAIRE Research Graph [12], Open Research Knowledge Graph [11], Scholexplorer [5], Human Brain Project Knowledge Graph, Open Citations [14], Crossref [9], SciGraph, Semantic Scholar [8], Dimensions [10], as well as the CERIF graphs built via Current Research Information Systems (CRIS), are just a few existing OSGs.
The fragmentation of these specialised OSGs motivates our interest in providing OSGs with an Interoperability Framework, whose drivers are manifold. First, interoperability would reduce duplication of effort and capitalise on synergies and complementarity. Second, interoperability enables information to circulate and thus supports the enrichment and quality of individual OSGs, as well as more redundancy to safeguard information availability and persistence. Third, interoperability will elevate OSGs as the backbone of Open Science scholarly communication.
The Research Data Alliance (RDA) Interest Group (IG) on Open Science Graphs for FAIR Data is currently investigating the motivation and challenges underpinning the realisation of an Interoperability Framework for OSGs. The work presented here describes the motivations and challenges underlying the goal of an OSG Interoperability Framework, identified as:
- Need to define a classification for OSGs that supports assessing their value, comparing their features, and identifying differences. To this end, the presented preliminary analysis of the FREYA PID Graph, OpenAIRE Research Graph, Open Research Knowledge Graph, Research Graph, and Scholexplorer paves the way for a classification of OSGs.
- Need to define an agreed-upon information model and APIs that enable the seamless exchange of information across OSGs.
The results of our preliminary investigation suggest that there is a need for a community-driven initiative that ensures common terminology (i.e. classification) and interoperability-enabled added value scholarly communication services that exploit the full potential of OSGs.

A Classification for Open Science Graphs
The fabric required to enact Open Science is a digital infrastructure based on an Interoperability Framework that captures research artefacts (in particular articles, datasets, software, services, workflows), metadata about artefacts, people and institutions as well as their relationships, as they evolve over time. This infrastructure relies on the adoption of Persistent Identifiers (PIDs) and metadata standards for the persistent identification and description of such entities across data sources (e.g. repositories, archives), thematic services (e.g. research infrastructures), and research communities.
Open Science Graphs (OSGs) are use case driven specialisations of Scientific Knowledge Graphs that build on the fabric of PIDs, metadata, and relationships. Figure 1 depicts OSGs in their context. Their scope differs according to served stakeholders, whose needs range from discovery, access, and reuse of research artefacts to monitoring and evaluating funding efforts, and identifying research trends. Stakeholder needs also drive the selection of data sources a particular OSG ought to integrate and the required data processing and enrichment capabilities.
The increasingly diverse and complex OSG landscape fuels an urgent demand for a classification to i) help service providers build the added value services needed on top of OSGs, ii) assist consumers in selecting the services that meet their needs, and iii) help OSG providers identify and communicate the characteristics of their service, and understand how to benefit from other OSGs.
As a first attempt to develop a classification framework that supports the systematic description of OSG characteristics and the identification of their commonalities and differences, we introduce in the following some existing OSGs that have been developed in recent years, namely: FREYA PID Graph, OpenAIRE Research Graph, Open Research Knowledge Graph, Research Graph, and Scholexplorer. This selection is by no means exhaustive. Indeed, additional initiatives do exist, e.g. Microsoft Academic Graph, SciGraph, Crossref, Dimensions, Semantic Scholar, Open Citations. Still, we argue that the selected OSGs reasonably represent the broader landscape.

Existing OSGs
FREYA PID Graph. The PID Graph is a scholarly infrastructure built by the partners in the EC-funded FREYA project, with the core infrastructure hosted by DataCite. The PID Graph identifies all nodes in the graph using persistent identifiers (PIDs) and describes these nodes, as well as the edges between them, using the metadata associated with these PIDs. The graph is federated, with PIDs and associated metadata provided by a number of PID providers who store this information in their respective services.
The development of the PID Graph is driven by user stories that the FREYA project partners have initially identified and that are continuously expanded. A distinguishing feature of these user stories is that they cannot be easily resolved with existing scholarly infrastructure, as they assume an underlying graph. Many of these user stories are around the discovery of connected resources, and the tracking of reuse.
The research entities supported by the PID Graph currently include publications, datasets, software, physical samples, instruments, services, people, research organisations, funders, and research data repository registries from the PID providers Crossref, DataCite, ORCID and ROR, currently totalling about 35 million resources.
The PID Graph is queried using GraphQL, a widely used open source technology that aims to make it easy to build client applications for the PID Graph. The fields that describe resources have been harmonised across resource types to simplify working with the PID Graph and to enhance connections between resources. Many Jupyter notebooks have been written to explore the PID Graph, and they are openly available for reuse. All information in the PID Graph is available for reuse without restrictions, and the software stack powering the PID Graph is available as open source software.
OpenAIRE Research Graph. The mission of the OpenAIRE initiative, one of the foundations of the European Open Science Cloud (EOSC), is to provide training, dissemination, and technical services to seed (and support) Open Science publishing practices into the research lifecycle. To this end, one key activity of OpenAIRE aims at the construction of the OpenAIRE Research Graph by aggregating and integrating metadata records relative to digital research products (literature, datasets, software, and others) from more than 13,000 scholarly data sources world-wide (scientific repositories, archives, registries, databases, publishers), for a current total of more than 114 million publications and 10 million research data. The graph is also algorithmically enhanced so as to i) find and merge metadata records that describe the same entity (literature and organisations), and ii) apply inference techniques on the metadata records and mine full-texts of Open Access publications to add new properties and new semantic relationships. End-user claims provided via the Web portal are also fed into the loop, so as to drive the processing of raw metadata.
The OpenAIRE Research Graph data model is described in detail in [13] and its modelled entities are: literature, datasets, software, funders, funding streams, grants, organisations, researchers, and data sources. Its content supports a number of analytics and applications such as discovery, research impact assessment, Open Science monitoring, brokering, reporting to funders, and statistics.
The graph is redistributed free of charge for everyone to use, both in bulk access mode (snapshot dump [12], OAI-PMH) and in selective access mode via APIs (REST Search API, LOD). It is released under a CC-BY licence because the graph integrates sources with licences stricter than CC0 (e.g. Microsoft Academic Graph, Springer Nature).
Open Research Knowledge Graph. The Open Research Knowledge Graph (ORKG) [11] is a scholarly infrastructure and open project led by the TIB Leibniz Information Centre for Science and Technology that aims to publish scholarly knowledge communicated in the literature in structured and semantic form.
The entity of primary interest to ORKG is therefore the research article (paper) and, importantly, article content. ORKG models article contents as "research contribution", an abstract concept that, in general terms, relates the problem addressed by a contribution with the materials and methods used, and the obtained results.
ORKG enables a range of new applications, including automated comparisons. As a classic example, it is possible to automatically compare the characteristics of sorting algorithms, e.g. their best and worst-case complexity. Given precision, recall and F1 score of classification algorithms across the literature on a specific problem, say road-vehicle detection, it is possible to create leaderboards automatically, showing the trend of classification performance over time and the currently leading approach.
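The leaderboard idea can be sketched in a few lines of Python. This is a minimal illustration, not ORKG's implementation: the `Result` record, the approach names, and the F1 values are all hypothetical placeholders chosen only to show the mechanics of ranking reported scores and tracking the leading approach over time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One reported score for an approach (hypothetical structure)."""
    approach: str
    year: int
    f1: float


def leaderboard(results):
    """Return all results ordered by F1 score, best first."""
    return sorted(results, key=lambda r: r.f1, reverse=True)


def best_per_year(results):
    """Map each year to the best result reported up to and including it."""
    trend = {}
    best = None
    for r in sorted(results, key=lambda r: r.year):
        if best is None or r.f1 > best.f1:
            best = r
        trend[r.year] = best
    return trend


# Illustrative values only; they are not taken from any publication.
results = [
    Result("approach-A", 2016, 0.71),
    Result("approach-B", 2018, 0.83),
    Result("approach-C", 2018, 0.79),
    Result("approach-D", 2019, 0.81),
]

for rank, r in enumerate(leaderboard(results), start=1):
    print(rank, r.approach, r.f1)
```

Once contributions are stored as structured records rather than free text, such rankings reduce to a sort over comparable values, which is the point the paragraph above makes.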
The primary data sources for ORKG are peer-reviewed research articles. When data published in the literature (e.g. as a plot) is deposited in a research data repository, such repositories are an additional important data source. Furthermore, ORKG relies on third-party terminologies to align resources and thus ensure data interoperability and reusability.
ORKG adds value by making scholarly knowledge published in the literature more accessible to and processable by machines. As a multimodal infrastructure, ORKG integrates diverse forms of data (i.e. scholarly knowledge) acquisition, specifically manual crowdsourcing, automated text mining, and scholarly knowledge exchange among research infrastructures, services and tools, e.g. data analysis environments such as Jupyter.
ORKG data export and provision is primarily via its REST API and SPARQL endpoint.
Research Graph. Research Graph is a distributed network of scholarly works including data from data repositories such as NCI in Australia [15], academic and grey literature (e.g. GESIS, ICPSR), grants and funders (e.g. Australian Research Council, NIH) and researchers and research organisation information. Research Graph initially formed as a Graph Database by participants in the DDRI Working Group of Research Data Alliance [2] to connect datasets and metadata about data collections across repositories and data infrastructures. This graph later extended to a distributed network of graphs connecting via graph augmentation functionality running on a hybrid (national, private and commercial) cloud. At the time of writing this article, the graph holds close to 250 million nodes, including metadata about 180 million publications, 51 million datasets, 55 thousand grants, 1.4 million organisations, and 8.6 million researchers.
Research Graph is accessible to the partner organisations via the Augment API, a cloud-hosted capability that creates graphs from bibliographic records and extends them using information available in Research Graph clusters. The schema used for this transformation is based on the minimum required fields for identifying a research object, a trade-off between completeness and practicality that favours practicality. The graph schema [1] supports XML, RDF/XML, and JSON-LD [16], and the endpoint supports Cloud Hosted Services, REST API and GraphQL.
Research Graph is mainly used by data infrastructures, repositories, and research systems for discovery of related scholarly works, such as related datasets, and of connections between grants and research output. Metadata about Research Graph is available on researchgraph.org and github.com/researchgraph/schema. The input API supports RDF, DDI, RIF-CS, Dublin Core, Scholix, DataCite, Crossref, and many other metadata formats, and the output includes Research Graph Schema (JSON, XML), JSON-LD and RDF/XML. Research Graph includes a subgraph reusable under CC-BY licence, while some other parts are accessible for limited use only under NC-ND-SA-CreativeCommons.
Scholexplorer. Scholexplorer [5] populates and provides access to a graph of Scholix [6] links between dataset and literature objects, and between dataset and dataset objects. Links (and objects) are provided by data sources managed by publishers, data centres, or other organisations providing services to store and manage links between datasets and publications, such as Crossref, DataCite, PubMed, EMBL-EBI data sources, Pangaea, and OpenAIRE. Scholexplorer aggregates link metadata harvested from these data sources as Scholix records and, out of these, builds a harmonised and de-duplicated graph of scholarly objects that today counts over 21 million publications, 53 million datasets, and over 269 million bi-directional semantic links between them. The graph is openly accessible under CC-BY licence via REST search APIs that return links in Scholix format, and via periodic dumps on Zenodo.

Classification
Based on a comparison of the five initiatives described above (Figure 2), we propose a first classification across seven main features, regarded as relevant to both OSG consumers and OSG providers.
[...] model (e.g. metadata structural and semantic transformations), and ii) by enhancing the metadata via web crawling, interlinking, inference, full-text mining, AI, user annotations and feedback, etc.
5. Data export and provision: OSGs offer access to their content via APIs (e.g. OAI-PMH, SPARQL, GraphQL, ad-hoc REST APIs, etc.) and standard exchange formats (e.g. XML, JSON, RDF) that implement standard metadata formats (e.g. DataCite, Scholix.org, Dublin Core, ORCID profile, CERIF) or proprietary formats.
6. FAIRness: FAIRness of OSGs concerns their nature as datasets with respect to being Findable, Accessible, Interoperable, and Reusable. Practices vary, but in general, OSGs are available via standard exchange formats (e.g. XML, JSON, RDF) and accessible via standard protocols, from simple download to GraphQL, OAI-PMH, or proprietary search REST APIs. In some cases, accessibility is facilitated by minting a DOI for the OSG collection and, in some cases, complicated by the fact that consumers need to go through tollgated cloud services to access the graph. OSG schemata pose the hardest interoperability and reusability challenge, as they follow application-driven interpretations of research entities, which complicate OSG reuse and integration.
7. Openness: Different OSGs are released and redistributed under different licences (CC0, CC-BY, CC-SA, etc.). In general, the licence applies to the whole graph, but, in some cases, different parts of the graph can be released under different licences, be accessible only to a limited number of stakeholders, or be behind a paywall. In other cases, for example for Microsoft Academic Graph, the graph is released openly with ODC-BY licence, but a small fee is needed to sustain the provisioning platform (i.e. Azure).
While the table already provides evidence for the value of a classification, it also highlights the need for common agreements on classification criteria. For example, aspects such as coverage of the data sources aggregated by the OSG may be of interest, as a graph may focus on a geographical region, be cross-community or community-specific (e.g. computer science and neuroscience in the early Semantic Scholar), or be able to capture geospatial descriptions (e.g. INSPIRE in Research Graph).

A Framework for Open Science Graphs Interoperability
We advocate for the establishment of a community-driven Interoperability Framework in order to mediate the diverse data models and technologies used by existing OSGs. The drivers for conceiving an Interoperability Framework for OSGs are manifold.
Firstly, as we have seen in Section 2, the various OSGs differ in scope, extent and technological details as they strive to capture various aspects of scholarly communication from diverse perspectives and at different levels of abstraction and granularity. Thus, the information held by different OSGs can be overlapping or complementary. With overlap we gain plurality, e.g. different identifiers for the same papers, authors, organisations, etc., while with complementarity we gain completeness and coverage, e.g. by integrating information of various granularity as published by various OSGs.
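To make the distinction between overlap and complementarity concrete, the following Python sketch merges records from different graphs whenever they share an identifier. It is an illustrative assumption, not the method of any OSG discussed here: the record layout (`ids`, `fields`), the function name `merge_records`, and the identifiers and values are all hypothetical.

```python
def merge_records(records):
    """Merge records that share at least one identifier.

    Each record is a dict with an "ids" set and a "fields" dict
    (a hypothetical layout used only for this sketch).
    """
    merged = []
    for rec in records:
        ids = set(rec["ids"])
        fields = dict(rec["fields"])
        # Collect every already-merged group that overlaps with this record.
        overlapping = [g for g in merged if g["ids"] & ids]
        for group in overlapping:
            # Overlap: keep all identifiers known for the same entity.
            ids |= group["ids"]
            # Complementarity: existing values win, new fields fill gaps.
            fields = {**fields, **group["fields"]}
            merged.remove(group)
        merged.append({"ids": ids, "fields": fields})
    return merged


# Placeholder records standing in for exports of three different graphs.
osg_a = {"ids": {"doi:10.1234/example"}, "fields": {"title": "An example paper"}}
osg_b = {"ids": {"doi:10.1234/example", "pmid:0000000"},
         "fields": {"funder": "Example Funder"}}
osg_c = {"ids": {"doi:10.5678/other"}, "fields": {"title": "Another paper"}}

combined = merge_records([osg_a, osg_b, osg_c])
print(len(combined))  # two distinct entities remain
```

The first two records collapse into one entity that carries both identifiers (plurality) and both the title and the funder (completeness), while the third stays separate. Real de-duplication is considerably harder, since identifiers are often missing or inconsistent.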
Secondly, despite building on data sources with clear sustainability plans, some OSGs have unclear directions, lack a viable business model, and thus might cease to exist. Given this risk, OSG content should be federated, shared, and possibly fed back to original data sources where it can be managed sustainably for the common good of both the scientific community and, more broadly, society.
Thirdly, OSGs and more generally Scientific Knowledge Graphs should act as the backbone of modern Open Science scholarly communication, embody its core principles, and foster its adoption along several dimensions such as discoverability, monitoring, and FAIRness. This is especially relevant for the non-commercial OSGs and their leading role in open innovation with best-in-class, cutting-edge services, free at the point of use.
It is therefore of paramount importance to exchange OSG content and capitalise on the non-negligible acquisition, integration and enrichment efforts performed by the various OSGs. To facilitate information exchange between OSGs, the Interoperability Framework may rely on an agreed-upon lingua franca. This was already achieved with the specification of Scholix [6], an agreed-upon high-level interoperability framework for exchanging information about the links between scholarly literature and data. However, Scholix operated within a much narrower scope. Given the complexity of the modelled information and the ambition of the endeavour, for OSGs a set of "dialects" rather than a single lingua franca may be more viable while still efficiently catalysing interoperability.
OSG content exchange has to occur on at least two levels of abstraction: information model and technology. In regard to information modelling, there is an urgent need for the various OSGs to define, bottom-up, a common model that can maximise information exchange and has the flexibility to accommodate unforeseen extensions, use cases, and stakeholders. From a technological standpoint, we need a portfolio of operational frameworks supporting a seamless exchange of information across different OSGs by means of operators/primitives. Doing so implies supporting a plethora of exchange formats (and the respective mappings to the common model) such as CSV, XML, RDF, JSON-LD, Scholix, and OAI-ORE, as well as different APIs enabling the provisioning of OSG information, such as REST, SPARQL, and GraphQL.
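The idea of per-format mappings into a common model can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Link` class stands in for a common model, and the JSON keys (`source`, `relation`, `target`) and CSV columns (`src`, `rel`, `dst`) are invented for the example; none of them reproduce the Scholix schema or the export format of any existing OSG.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical common model: a typed link between two PIDs."""
    source_pid: str
    relation: str
    target_pid: str


def from_json(payload):
    """Map a JSON array of link objects (assumed keys) to the common model."""
    return [Link(d["source"], d["relation"], d["target"])
            for d in json.loads(payload)]


def from_csv(payload):
    """Map CSV rows (assumed column names) to the common model."""
    reader = csv.DictReader(io.StringIO(payload))
    return [Link(row["src"], row["rel"], row["dst"]) for row in reader]


# One mapper per supported exchange format; new formats register here.
MAPPERS = {"json": from_json, "csv": from_csv}


def to_common(fmt, payload):
    """Dispatch a payload to the mapper registered for its format."""
    return MAPPERS[fmt](payload)


json_payload = ('[{"source": "doi:10.1234/paper", "relation": "References", '
                '"target": "doi:10.5678/dataset"}]')
csv_payload = "src,rel,dst\ndoi:10.1234/paper,References,doi:10.5678/dataset\n"

links_a = to_common("json", json_payload)
links_b = to_common("csv", csv_payload)
print(links_a == links_b)
```

Both payloads describe the same link and map to equal `Link` values, so a consumer only ever works against the common model. Keeping the mappers in a registry keeps the cost of supporting an additional format local to one function.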
To this end, we envisage the European Open Science Cloud (EOSC) as one optimal channel through which such an Interoperability Framework for OSGs could be developed via consensus and for the benefit of Open Science, at least at a pan-European level. EOSC is being constructed following a System of Systems paradigm, where local autonomy and differences are fostered as they can be an added value, and where convergence is recommended and facilitated via common interoperability frameworks to optimise cost and maximise the efficiency of science. OSGs would, in such an ecosystem, become the means for i) bridging research infrastructures, i.e. thematic and scholarly communication services, and ii) offering EOSC users, such as researchers, research communities, policymakers, and SMEs, the tools to discover and monitor trends and impact of science. The RDA IG on Open Science Graphs for FAIR Data is and will be contributing to the definition of the EOSC interoperability frameworks to ensure that specific solutions will be sought. Finally, another channel that could help bring this discussion onto the global landscape, with the long-term perspective of a broader "Global Open Science Cloud", is the RDA IG on Global Open Research Commons.

Conclusions
In this paper, we targeted two challenges of working with Open Science Graphs (OSGs). On the one hand, OSGs would benefit from a classification framework that enables their inspection and comparison along key features. On the other hand, we argue that an Interoperability Framework is pivotal to enable a seamless exchange of information among OSGs with the resulting suggested benefits.
We proposed a preliminary classification framework by analysing a selection of representative OSGs, namely: FREYA PID Graph, OpenAIRE Research Graph, Open Research Knowledge Graph, Research Graph, and Scholexplorer. Moreover, we outlined the main drivers and desiderata of a possible Interoperability Framework.
Going forward, we see the RDA Interest Group on Open Science Graphs for FAIR Data as an important community for making further progress on aligning the various OSG initiatives, in particular through concrete work on interoperability.