TRIPLE Deliverable: D2.4 Report on identification and creation of new vocabularies
Description
The GoTriple platform is a discovery service for SSH publications. It can be classed as an aggregator, since it harvests publication metadata records from distributed sources (namely other aggregators or repositories). During ingestion, the pipeline transforms metadata records into the Triple Data Model and performs a series of cleansing, normalisation and enrichment procedures in order to deal with metadata heterogeneity, increase multilingualism and improve content searchability and discoverability. Finally, it stores and indexes the enriched metadata records, making them searchable via the GoTriple search engine.
Two of the most important enrichment procedures that metadata records undergo are classification and annotation. The former uses machine learning technology to automatically classify each publication using the MORESS classification scheme (D2.31). The latter searches specific metadata fields of a record (titles, descriptions/abstracts and subjects/keywords) to assign them concepts from a multilingual LOD vocabulary of SSH concepts. The record is then updated with the respective links (concept URIs) to the concepts, as well as all available labels in the different languages. We call a concept URI with all the available labels that we add to a metadata record an annotation or Triple Keyword. Triple Keywords are distinguished from the subjects/keywords of the original metadata. Since objects are indexed with annotation labels in all available languages, they are found when a search term matches an annotation label in any of the available languages. This way, both searchability and multilingualism are increased.
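The annotation step described above can be illustrated with a minimal Python sketch. It is not the GoTriple implementation: the `Concept` class, the `annotate` and `matches` functions, the `triple_keywords` field name, the `example.org` URIs and the two-concept vocabulary are all hypothetical, and simple whole-word label matching stands in for whatever matching the real annotation service uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Concept:
    uri: str                 # concept URI (hypothetical placeholder below)
    labels: dict[str, str]   # language code -> label


# Hypothetical two-concept vocabulary, for illustration only.
VOCABULARY = [
    Concept("http://example.org/concept/1",
            {"en": "Sociology", "fr": "Sociologie", "de": "Soziologie"}),
    Concept("http://example.org/concept/2",
            {"en": "Linguistics", "fr": "Linguistique", "de": "Linguistik"}),
]

# Metadata fields that are searched for concept labels (assumed field names).
ANNOTATED_FIELDS = ("title", "abstract", "keywords")


def _contains(text: str, label: str) -> bool:
    """Case-insensitive whole-word (or whole-phrase) match of label in text."""
    pattern = r"(?<!\w)" + re.escape(label.lower()) + r"(?!\w)"
    return re.search(pattern, text.lower()) is not None


def annotate(record: dict, vocabulary: list[Concept] = VOCABULARY) -> dict:
    """Return a copy of record with a 'triple_keywords' list added.

    Each annotation carries the concept URI and ALL labels of the concept,
    regardless of the language in which the match was found.
    """
    parts = []
    for name in ANNOTATED_FIELDS:
        value = record.get(name, "")
        parts.append(" ".join(value) if isinstance(value, list) else str(value))
    text = " ".join(parts)

    annotations = [
        {"uri": concept.uri, "labels": dict(concept.labels)}
        for concept in vocabulary
        if any(_contains(text, label) for label in concept.labels.values())
    ]
    return {**record, "triple_keywords": annotations}


def matches(record: dict, term: str) -> bool:
    """True if the search term equals an annotation label in any language."""
    term = term.lower()
    return any(
        term == label.lower()
        for annotation in record.get("triple_keywords", [])
        for label in annotation["labels"].values()
    )
```

Under these assumptions, a record titled "An introduction to sociology" receives one annotation, and `matches(annotated, "Soziologie")` returns `True` even though the original metadata contains no German text. The original `keywords` field is left untouched, which mirrors the distinction between Triple Keywords and the subjects/keywords of the source record.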
This deliverable describes the work and presents the outcome of task T2.4 “Cartography and creation of new vocabularies”. The objective of the task was to create a vocabulary of SSH concepts with labels in the 10 languages supported by the annotation service. The outcome is the GoTriple Vocabulary, a multilingual hierarchical set of 3,375 SSH-related concepts. It is a subset of LCSH (Library of Congress Subject Headings) that covers popular SSH subject areas. The English labels are complemented with labels in Greek, French, Polish, German, Italian, Portuguese, Spanish, Croatian and Ukrainian. The vocabulary conforms to the SKOS data model and is published as Linked Open Data (LOD) at http://semantics.gr/authorities/vocabularies/SSH-LCSH on Semantics.gr, a platform developed by EKT for managing and publishing LOD vocabularies, thesauri and authority files of any schema. The vocabulary is used by the annotation service but is also a standalone product, since it is published under an open license and can be used by the SSH research communities. The biggest challenges we faced in creating the vocabulary were a) choosing a base vocabulary, b) defining a reasonable number of SSH concepts and c) adding labels in all GoTriple languages.
Files
D2.4 Report on identification and creation of new vocabularies_DRAFT.pdf (1.9 MB)
md5:ad87d6ecbef836a075f148c529f2f81d