Project deliverable Open Access

TRIPLE Deliverable: D2.5 - Report on Data Enrichment

De Santis, Luca


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.7359654", 
  "language": "eng", 
  "title": "TRIPLE Deliverable: D2.5 - Report on Data Enrichment", 
  "issued": {
    "date-parts": [
      [
        2022, 
        9, 
        30
      ]
    ]
  }, 
  "abstract": "<p>In this deliverable, the strategies for data enrichment in TRIPLE are presented. Through the Core&nbsp;Pipeline, named SCRE, metadata regarding publications and projects for the Social Sciences and&nbsp;Humanities are automatically harvested, mapped in the TRIPLE data model, curated, enriched&nbsp;and finally saved in the GoTriple platform&rsquo;s indexes.<br>\nThe document starts by presenting the ways SCRE imports publications metadata from&nbsp;OAI-PMH endpoints, OpenAIRE and Isidore data dumps. This reflects the strategies for&nbsp;integrating content which was planned in the project. On the one hand, OAI-PMH is a<br>\nwell-known and established standard for content harvesting: many data providers, especially&nbsp;those of small dimension, support it, facilitating therefore their onboarding in GoTriple. The&nbsp;support for OpenAIRE and Isidore, on the other hand, responds to the wish to also harvest data&nbsp;from large aggregators, a strategy that allowed GoTriple to quickly present a significant amount<br>\nof publications in its index (more than 4 million at the time of writing).<br>\nThen the normalisation strategies applied to the acquired metadata are described. By analysing&nbsp;the first batches of acquired data, it has been decided to define the rules to normalise and clean&nbsp;the attributes for the following metadata: publication date, language codes, keywords,&nbsp;document types, licences, access rights and authors&rsquo; names. In the document, the definition of<br>\ncontrolled vocabularies for some of these attributes is also presented.&nbsp;</p>\n\n<p>Then enrichment services are explained, including language recognition, translation, automatic&nbsp;classification and annotation.<br>\nThe services to detect duplicate publications and to disambiguate authors are also discussed,&nbsp;followed by the presentation of the acquisition and processing of project metadata&nbsp;</p>\n\n<p>Some final remarks on the data enrichment process, including the difficulties that have been<br>\nfaced and solved, conclude the document.</p>", 
  "author": [
    {
      "family": "De Santis", 
      "given": "Luca"
    }
  ], 
  "note": "The TRIPLE project (https://project.gotriple.eu/) is financed under the Horizon 2020 framework (https://cordis.europa.eu/project/id/863420), under Grant Agreement No. 863420, with approx. 5.6 million Euros for a duration of 42 months (2019-2023). The content of this deliverable reflects only TRIPLE's view and the Commission is not responsible for any use that may be made of the information it contains.\n---\nAt the heart of the project is the development of the GoTriple platform (https://www.gotriple.eu/), an innovative multilingual and multicultural discovery solution.", 
  "version": "Draft", 
  "type": "report", 
  "id": "7359654"
}