Published June 7, 2023 | Version v1
Presentation Open

Cultural heritage data as sources for databases of historical language use of Hungarian

  • 1. Hungarian Research Centre for Linguistics

Description

The Middle Hungarian period, i.e. the interval between the second third of the sixteenth century and the second third of the eighteenth century, remains comparatively underexplored. It is also the earliest period in the history of Hungarian for which enough extant text material is at our disposal to study the language use of everyday private life with the necessary thoroughness (cf. Dömötör–Gugán–Varga 2021).
The present proposal focuses on two databases designed by the presenters and their team: The Old and Middle Hungarian corpus of informal language use (Történeti Magánéleti Korpusz, TMK) and The corpus of memoirs and dramas (Középmagyar emlékirat- és drámakorpusz). Both corpora contain texts that are important sources of Hungarian cultural heritage: ego-documents by noblemen and noblewomen, genres related to everyday language use that also involve speakers of lower social status, and constructed dialogs imitating everyday language use in fiction.
The Old and Middle Hungarian corpus of informal language use (tmk.nytud.hu) consists of private letters and records of witch trials dating from the fifteenth-century beginnings to 1772, a total of 8 million characters. This presentation highlights some requirements and steps of the corpus building, which was carried out by historical linguists in collaboration with a computational linguist. These include manual normalization and disambiguation for diachronic adequacy, morphological analysis, and the query interface. The database is the first fully normalized and annotated historical corpus of Hungarian supplemented with sociolinguistic information (Novák–Gugán–Varga–Dömötör 2018).
The other topic of the presentation is The corpus of memoirs and dramas, which is currently being built following the guidelines developed for the previous corpus (cf. Gugán 2020). The language use of memoirs and dramas in Middle Hungarian proved to be a suitable extension to the more directly speech-related sources of TMK. Memoirs are ego-documents, yet they are farther from informal language use than private letters. Dramas are constructed texts, but they are speech-purposed as well. The four registers to be included (private letters, witch trial records, memoirs, and dramas) therefore all share certain characteristics, but each differs in at least one feature.
In both corpora, all records are normalized and morphologically annotated. The new corpus is also planned to receive a freely available, user-friendly query interface, providing a valuable source of information for historical linguists and for specialists or students of related fields.
