Published June 24, 2021 | Version 1.0
Project deliverable Open

D3.1 Semantic resources

  • 1. EXPERT SYSTEM IBERIA SL
  • 2. Universidad Politecnica de Madrid

Description

This deliverable describes the extension and customization of the text mining and enrichment services that will be integrated in EOSC during RELIANCE. Currently available through ROHub.org, these services provide a variety of scientific communities with text analytics functionality, helping to make research data, materials, and results machine-readable and easier for scientists and machines alike to discover. Although already in operation, the services need to be tailored to the specific vocabulary used by the RELIANCE research communities, and therefore extended and customized, so that they can deliver domain-specific information to those communities.

We start with a survey that elicits from the communities the specific fields of research relevant to their work, as well as the key journals and venues where they usually communicate their results. Then, from SciGraph, a knowledge graph of scientific publications released by Springer Nature, we harvest a corpus of scientific papers published in the last 5 years that belong to those fields of research. The resulting corpus serves two purposes. First, it enhances the coverage of the scientific terminology supported by the RELIANCE text mining services. Second, it allows us to train new language models, either from scratch or by fine-tuning existing pre-trained models, which enable the development of further experimental text mining services based on natural language understanding and machine reading comprehension of scientific documents. Herein, we mainly focus on the former, while the latter will be addressed in forthcoming deliverables.
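As a rough illustration of the corpus selection step, the following minimal sketch keeps only records that fall within a 5-year window and match at least one surveyed field of research. The `Publication` record, its attribute names, and the `select_corpus` function are hypothetical and do not reflect the actual SciGraph schema or the project's harvesting code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Publication:
    """Hypothetical, simplified publication record (not the SciGraph schema)."""
    title: str
    year: int
    fields: FrozenSet[str]


def select_corpus(
    records: Iterable[Publication],
    target_fields: Iterable[str],
    reference_year: int,
    window_years: int = 5,
) -> List[Publication]:
    """Keep records published in the last `window_years` years (inclusive of
    `reference_year`) that belong to at least one target field of research."""
    earliest = reference_year - window_years + 1
    targets = {f.lower() for f in target_fields}
    return [
        r
        for r in records
        if earliest <= r.year <= reference_year
        and targets & {f.lower() for f in r.fields}
    ]
```

Field names are compared case-insensitively here; a real harvest would more likely match on the classification codes used by the source graph.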

We run a text mining analysis of our corpus with special attention to the entities, phrases, and concepts, as well as the relationships between them, that were not previously covered by our text mining and enrichment services. These linguistic artifacts, which represent the missing pieces of information needed to successfully analyze documents of interest to the RELIANCE communities, are integrated by knowledge engineers and linguists into Sensigrafo, the lexico-semantic knowledge graph at the core of the RELIANCE text mining services. As a result of this process, the text mining and enrichment services are able to understand the domain terminology used by the target scientific communities in RELIANCE. The resulting text corpus and domain-specific terminology have been released and are publicly available through Zenodo.
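The idea of surfacing terminology that the existing services do not yet cover can be sketched as a simple frequency count of word n-grams that are absent from a known vocabulary. This is only an illustrative approximation under stated assumptions: the `candidate_terms` function, the small stop-word list, and the thresholds are hypothetical, and the actual analysis relies on the project's own text mining services and on manual curation by knowledge engineers and linguists.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Tiny illustrative stop-word list (an assumption, not a project resource).
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "on", "for"}


def candidate_terms(
    documents: Iterable[str],
    known_terms: Iterable[str],
    max_len: int = 3,
    min_count: int = 2,
) -> List[Tuple[str, int]]:
    """Return (term, count) pairs for n-grams of up to `max_len` words that
    occur at least `min_count` times and are not in `known_terms`."""
    known = {t.lower() for t in known_terms}
    counts: Counter = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z][a-z\-]+", doc.lower())
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i : i + n]
                # Skip n-grams that start or end with a stop word.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return [
        (term, count)
        for term, count in counts.most_common()
        if count >= min_count and term not in known
    ]
```

The output of such a pass would be a list of candidates for human review, not terms ready to be added to a knowledge graph.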

In this deliverable, we also introduce pre-trained language models and their application in the context of the RELIANCE text mining and enrichment services. Pre-trained language models like BERT were trained on large general-purpose corpora and have proven to be very useful to tackle different natural language understanding challenges by fine-tuning for specific tasks on domain-specific data. Currently, language models represent the state of the art in many tasks in natural language understanding. In RELIANCE, we plan to use them as a complementary resource to further improve performance in text mining tasks like text classification. In addition, we will explore the application of language models to everyday tasks in a researcher’s life that can benefit from machine understanding of natural language, like the comprehension of scientific documents or the analysis of scientific claims.

Finally, this deliverable analyzes other resources that we plan to leverage in RELIANCE, like the OpenAIRE knowledge graph, which interlinks scientific results, including papers, data, and software, across different repositories. Enriching such resources through the RELIANCE text mining and enrichment services will make them easier for the EOSC communities to find and will support the scalable creation of research objects. We also review the ongoing OpenAIRE open citation initiative, which aims to provide a citation-based graph of research work through OpenAIRE and has the potential to become a valuable resource for RELIANCE as well.

Notes

This is the draft version of the deliverable, not yet approved by the European Commission.

Files

D3.1-Semantic Resources_v1.0.pdf (1.6 MB)
md5:aff32aa47ebfec08c652acbc6cc2d40e

Additional details

Funding

European Commission
RELIANCE - REsearch LIfecycle mAnagemeNt for Earth Science Communities and CopErnicus users in EOSC 101017501