Corpus annotation and dictionary linking using Wikibase

Lindemann, David

doi:10.5281/zenodo.12078616

Published June 19, 2024 | Version v1

Poster Open

Lindemann, David (Contact person)¹

1. University of the Basque Country

This poster presents a data model and two initial use cases for representing text corpus data on Wikibase instances, including morphosyntactic, semantic and philological annotations as well as links to dictionary entries. Wikibase (cf. Diefenbach et al. 2021), an extension of MediaWiki, is the software that underlies Wikidata (Vrandečić & Krötzsch 2014), an exceptionally large, crowdsourced, queryable knowledge graph. Wikidata includes nodes for ontological concepts, on the one hand, and for lexemes, lexeme senses and lexeme forms, on the other, together with annotations on and relations between them. The first use case, for which the model was proposed, concerns documents that belong to the Basque Historical Corpus, although we claim that the model can serve in other contexts, too. That corpus contains literary texts written in Basque before 1900. Today it exists in several versions, stored in separate and incompatible data silos (based on relational databases) and made available through different online user front ends. One version displays historical documents in a standardized orthography; a second version, based on the first, allows for lemma-based searches; and a third version contains morphosyntactic annotations (part of speech, inflected form descriptions, and corresponding lemmata). Some texts are also published elsewhere, sometimes in an electronic format, together with philological annotations. The second use case is an experiment in linking a Serbian literary corpus in NIF format to a Serbian dictionary in Ontolex-Lemon.

Heavily inspired by the latest trends in the field of Linguistic Linked Open Data, we model a corpus token as a node in a knowledge graph, and link it (1) to the respective paragraph (Basque) or token (Serbian) in the source document; (2) to a lexeme node, which is annotated with the standard lemma; (3) to a lexical form associated with that lexeme, which is annotated with the grammatical features describing the form; (4) to a lexical sense associated with that lexeme, which is annotated with a sense gloss; (5) to an ontology concept representing the word sense; and (6) to a text chain containing philological annotations. Furthermore, we represent token spans as separate nodes; these are linked to the tokens they contain, and to annotations that apply to the whole span. We implement and populate the model on our own Wikibase instance hosted on Wikibase Cloud. The core classes and properties that a Wikibase uses by default for describing lexemes implement Ontolex-Lemon (McCrae et al. 2017), the W3C-recommended model for lexical data, so that the created datasets are compatible with the Linguistic Linked Open Data Cloud. We define the properties that describe corpus tokens as equivalent to their counterparts in NIF, a standard for corpus annotation (Hellmann et al. 2013). We are currently populating the proposed model with tokens from a 1737 Basque manuscript, the transcription of which has been carried out on Wikisource. We are also inserting annotations of the types described above, including philological annotations by Lakarra (1985), as well as direct links to the corresponding paragraph in the manuscript transcription on the Wikisource platform.
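The six token links and the span nodes described above can be illustrated with a small sketch. It is not the poster's implementation: the class names, field names and property labels are hypothetical, and the identifiers in the usage note are placeholders that only imitate the Wikibase ID shapes (Q for items, L for lexemes, L-F for forms, L-S for senses).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Token:
    """One corpus token, modeled as a node with up to six outgoing links."""
    token_id: str
    surface: str                        # word form as it appears in the text
    source_ref: str                     # (1) paragraph or token in the source document
    lexeme: Optional[str] = None        # (2) lexeme node carrying the standard lemma
    form: Optional[str] = None          # (3) lexical form with grammatical features
    sense: Optional[str] = None         # (4) lexical sense with a sense gloss
    concept: Optional[str] = None       # (5) ontology concept for the word sense
    philological_note: Optional[str] = None  # (6) text with philological annotations


@dataclass
class TokenSpan:
    """A span node linked to its tokens and to span-level annotations."""
    span_id: str
    tokens: List[Token]
    annotations: List[str] = field(default_factory=list)


# Hypothetical property labels; a real Wikibase would use its own P-numbers.
LINKS = [
    ("source_ref", "in source segment"),
    ("lexeme", "has lexeme"),
    ("form", "has form"),
    ("sense", "has sense"),
    ("concept", "has concept"),
    ("philological_note", "philological note"),
]


def to_statements(token: Token) -> List[Tuple[str, str, str]]:
    """Flatten a token into (subject, property, value) triples, skipping unset links."""
    statements = []
    for attr, prop in LINKS:
        value = getattr(token, attr)
        if value is not None:
            statements.append((token.token_id, prop, value))
    return statements


def span_statements(span: TokenSpan) -> List[Tuple[str, str, str]]:
    """Link a span node to its contained tokens and to its own annotations."""
    contains = [(span.span_id, "contains token", t.token_id) for t in span.tokens]
    notes = [(span.span_id, "span annotation", a) for a in span.annotations]
    return contains + notes
```

For example, `to_statements(Token("Q1", "tokenA", source_ref="para-1", lexeme="L1", form="L1-F1"))` yields three triples, one per link that is set. Keeping every link optional except the source reference reflects the assumption that tokens are annotated incrementally.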

Files

Poster_Lindemann_Wikibase.pdf (1.9 MB, md5:f43762de0fca0dd7931bf85d9da67405)

