Combining Algorithms and Human Expertise: OpenAIRE's Entity Disambiguation Method
Authors/Creators
Description
Lightning Talk at the International Digital Curation Conference 2025.
The presentation examines OpenAIRE's solution to the “entity disambiguation” problem, presenting a hybrid data curation method that combines deduplication algorithms with the expertise of human curators to ensure high-quality, interoperable scholarly information.
Entity disambiguation is invaluable to building a robust and interconnected open scholarly communication system. It involves accurately identifying and differentiating entities such as authors, organisations, data sources and research results across various entity providers. This task is particularly complex in contexts like the OpenAIRE Graph, where metadata is collected from over 100,000 data sources. Different metadata describing the same entity can be collected multiple times, potentially providing different information, such as different Persistent Identifiers (PIDs) or names, for the same entity. This heterogeneity poses several challenges to the disambiguation process. For example, the same organisation may be referenced using different names in different languages, or abbreviations. In some cases, even the use of PIDs might not be effective, as different identifiers may be assigned by different data providers. Therefore, accurate entity disambiguation is essential for ensuring data quality, improving search and discovery, facilitating knowledge graph construction, and supporting reliable research impact assessment.
To address this challenge, OpenAIRE employs a deduplication algorithm to identify and merge duplicate entities, configured to handle different entity types. While the algorithm proves effective for research results, when applied to organisations and data sources, it needs to be complemented with human curation and validation since additional information may be needed.
OpenAIRE's data source disambiguation relies primarily on the OpenAIRE technical team overseeing the deduplication process and ensuring accurate matches across DRIS, FAIRSharing, re3data, and OpenDOAR registries. While the algorithm automates much of the process, human experts verify matches, address discrepancies and actively search for matches not proposed by the algorithm. External stakeholders, such as data source managers, can also contribute by submitting suggestions through a dedicated ticketing system. So far OpenAIRE curated almost 3 935 groups for a total of 8 140 data sources.
To address organisational disambiguation, OpenAIRE developed OpenOrgs, a hybrid system combining automated processes and human expertise. The tool works on organisational data aggregated from multiple sources (ROR registry, funders databases, CRIS systems, and others) by the OpenAIRE infrastructure, automatically compares metadata, and suggests potential merged entities to human curators. These curators, authorised experts in their respective research landscapes, validate merged entities, identify additional duplicates, and enrich organisational records with missing information such as PIDs, alternative names, and hierarchical relationships. With over 100 curators from 40 countries, OpenOrgs has curated more than 100,000 organisations to date. A dataset containing all the OpenOrgs organizations can be found on Zenodo (https://doi.org/10.5281/zenodo.13271358). This presentation demonstrates how OpenAIRE's entity disambiguation techniques and OpenOrgs aim to be game-changers for the research community by building and maintaining an integrated open scholarly communication system in the years to come.
Files
IDCC25-OpenAIRE.pdf
Files
(2.2 MB)
| Name | Size | Download all |
|---|---|---|
|
md5:f8d6c86d10936fb776bba9fa012f457c
|
2.2 MB | Preview Download |
Additional details
Funding
Dates
- Available
-
2025-02-19