Project deliverable Open Access

TRIPLE Deliverable: D2.5 - Report on Data Enrichment

De Santis, Luca


JSON-LD (schema.org) Export

{
  "inLanguage": {
    "alternateName": "eng", 
    "@type": "Language", 
    "name": "English"
  }, 
  "description": "<p>In this deliverable, the strategies for data enrichment in TRIPLE are presented. Through the Core Pipeline, named SCRE, metadata on publications and projects in the Social Sciences and Humanities are automatically harvested, mapped to the TRIPLE data model, curated, enriched and finally saved in the GoTriple platform’s indexes.<br>\nThe document starts by presenting how SCRE imports publication metadata from OAI-PMH endpoints and from OpenAIRE and Isidore data dumps. This reflects the content integration strategies planned in the project. On the one hand, OAI-PMH is a well-known and established standard for content harvesting: many data providers, especially small ones, support it, which facilitates their onboarding in GoTriple. The support for OpenAIRE and Isidore, on the other hand, responds to the wish to also harvest data from large aggregators, a strategy that allowed GoTriple to quickly present a significant number of publications in its index (more than 4 million at the time of writing).<br>\nThen the normalisation strategies applied to the acquired metadata are described. After analysing the first batches of acquired data, rules were defined to normalise and clean the following metadata attributes: publication date, language codes, keywords, document types, licences, access rights and authors’ names. The definition of controlled vocabularies for some of these attributes is also presented.</p>\n\n<p>Then the enrichment services are explained, including language recognition, translation, automatic classification and annotation.<br>\nThe services to detect duplicate publications and to disambiguate authors are also discussed, followed by the acquisition and processing of project metadata.</p>\n\n<p>Some final remarks on the data enrichment process, including the difficulties that were faced and solved, conclude the document.</p>", 
  "license": "https://creativecommons.org/licenses/by/4.0/legalcode", 
  "creator": [
    {
      "affiliation": "Net7", 
      "@id": "https://orcid.org/0000-0003-0527-840X", 
      "@type": "Person", 
      "name": "De Santis, Luca"
    }
  ], 
  "url": "https://zenodo.org/record/7359654", 
  "datePublished": "2022-09-30", 
  "version": "Draft", 
  "keywords": [
    "SSH", 
    "Data enrichment", 
    "Metadata", 
    "Open Science", 
    "OPERAS", 
    "TRIPLE"
  ], 
  "@context": "https://schema.org/", 
  "identifier": "https://doi.org/10.5281/zenodo.7359654", 
  "@id": "https://doi.org/10.5281/zenodo.7359654", 
  "@type": "CreativeWork", 
  "name": "TRIPLE Deliverable: D2.5 - Report on Data Enrichment"
}
