Published October 1, 2019 | Version v1
Conference paper | Open

Aggregating Dictionaries into the Language Portal Sõnaveeb: Issues With and Without Solutions

  • 1. Institute of the Estonian Language

Description

In this paper we present Sõnaveeb (Wordweb), a new type of language portal from the Institute of the Estonian Language that brings together data from a growing number of dictionaries and termbases. Sõnaveeb currently displays a total of 150,000 Estonian headwords, drawn from many databases, together with many new types of lexicographic information: collocations, etymology, multi-word expressions, etc.

The paper reports on the problems encountered so far: keeping information consistent and avoiding duplicates when unifying the dictionaries; turning dictionary-specific information into customisations of the central service; deciding on deliberate ambiguities; parsing data fields that contain more than one data element, including textual condensation; moving from annotating form (e.g. italics) to annotating content (e.g. a citation); moving from (near-)duplicates to sensible information fragments; deciding on the advantage of an app over a responsive web page; and possible legal problems regarding the authorship of the new central resource, as it may become difficult to show who authored which part of the published resource.

The development of Sõnaveeb continues in three directions: tighter aggregation of the existing datasets, the addition of new data from other dictionaries and termbases, and the compilation of new data in Ekilex, the new dictionary writing system (DWS).

Files

eLex_2019_Aggregating dictionaris into the language portal Sõnaveeb.pdf

Files (712.1 kB)

Additional details

Funding

ELEXIS – European Lexicographic Infrastructure 731015
European Commission