Published December 20, 2021 | Version v1
Project deliverable Open

D3.9 Report on Ontology and Vocabulary Collection and Publication

  • 1. CNR-ILC
  • 2. CLARIN/ERIC
  • 3. CLARIN/WWI
  • 4. CESSDA/UL-ADP

Description

This deliverable pertains to SSHOC Task 3.1, which was responsible for investigating and providing resources and tools to support the multilingual aspects of the future pan-EU SSH infrastructure.

Making data and services accessible and usable in SSH depends to a large extent on providing relevant translations, translated metadata concepts, multilingual vocabularies, terminology extraction across languages, and multilingual databases.

The deliverable offers a detailed report on the gathering and translation of relevant SSH metadata, ontologies and vocabularies for the use cases indicated in the task’s topics: multilingual metadata concepts and vocabularies, and the multilingual occupation ontology with cross-country female occupational titles.

In accordance with SSHOC and the EOSC FAIR recommendations and requirements, the metadata vocabularies and ontologies have been published in several formats and through several facilities.

Section 1. The introduction sets out the landscape and describes the need for multilingual vocabularies, both for classification and for discovery, in the context of a cloud-based infrastructure that will offer access to research data and related services adapted to the needs of the SSH community.

Section 2. “Multilingual metadata” investigates the use and testing of Natural Language Processing (NLP) approaches and Machine Translation (MT) to make metadata more accessible in national languages other than English. The selected case study was the recommended metadata set of the CLARIN Concept Registry (CCR): the whole set of metadata and definitions was translated into French, Greek, and Italian. The section describes the machine-translation and evaluation process and compares different technologies.

Section 3. “Multilingual vocabularies and ontologies” introduces two further typical case studies. The first addresses a pressing need in social science research: many surveys ask respondents to specify their occupation, and the occupational ontology is used for these survey questions. In many languages the occupational titles for males and females are not identical. Section 3.1 describes the enrichment of the occupational ontology with lists of male and female titles for several languages, namely Dutch, German, Slovenian and French.

The second case study focuses on the automatic extraction of terminology from texts: a list of domain-specific terms was automatically extracted from a corpus on Data Curation and Stewardship, validated by domain experts, automatically translated into multiple languages (Dutch, French, German, Greek, Italian, Slovenian) and linked to other existing terminologies.

Section 4 describes the SKOS-ification and publication process of the results, together with the challenges posed by multilinguality.
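
As a rough illustration of what SKOS-ification of a multilingual term can involve, the sketch below serialises one concept with language-tagged preferred labels as Turtle. It is not taken from the deliverable: the example.org URIs, the function names and the sample labels are all hypothetical, and the deliverable's actual tooling and modelling choices may differ. The one rule it enforces comes from the SKOS specification, which allows a concept at most one skos:prefLabel per language tag, a constraint that multilingual vocabularies have to respect.

```python
# Minimal sketch, standard library only. All URIs and labels are
# illustrative assumptions, not data from the deliverable.
SKOS = "http://www.w3.org/2004/02/skos/core#"


def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def concept_to_turtle(uri, pref_labels, scheme_uri):
    """Serialise one SKOS concept as Turtle.

    pref_labels is a list of (label, language_tag) pairs. SKOS permits
    at most one skos:prefLabel per language tag, so duplicates are rejected.
    """
    if not pref_labels:
        raise ValueError("at least one preferred label is required")
    seen = set()
    for _, lang in pref_labels:
        if lang in seen:
            raise ValueError(f"more than one skos:prefLabel for language {lang!r}")
        seen.add(lang)

    label_lines = [
        f'    skos:prefLabel "{escape_literal(label)}"@{lang}'
        for label, lang in pref_labels
    ]
    lines = [
        f"@prefix skos: <{SKOS}> .",
        "",
        f"<{uri}> a skos:Concept ;",
        f"    skos:inScheme <{scheme_uri}> ;",
        " ;\n".join(label_lines) + " .",
    ]
    return "\n".join(lines)


# Hypothetical concept with labels in four languages.
example = concept_to_turtle(
    "https://example.org/vocab/metadata",
    [("metadata", "en"), ("métadonnées", "fr"),
     ("metadati", "it"), ("Metadaten", "de")],
    "https://example.org/vocab",
)
print(example)
```

Further labels in the same language, such as variant forms, would need a different property (for example skos:altLabel) rather than a second skos:prefLabel; which property suits a given vocabulary is a modelling decision this sketch does not make.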

Section 5 offers an overview of the exploitation and sustainability of the results and of how they are made available to the community.

Finally, the Conclusions reflect on the Machine Translation approaches adopted for translating the vocabularies into multiple languages and on their advantages in terms of time saving, and offer some first recommendations to the community.

Files

D3.9 Report on Ontology and Vocabulary Collection and Publication.pdf

Files (1.3 MB)