Corpus of Hungarian Lyrical Poetry
Description
The Corpus of Hungarian Lyrical Poetry was created by the Research Group in Stylistics at Eötvös Loránd University. The corpus contains manually corrected annotations for parts of speech, lemmas, and morphosyntactic features. Syntactic features related to verbs and their complements are also annotated manually.
Subcorpora, sizes, sampling methods:
- All: 531 texts (525 at the manual annotation layers, from which 6 texts were removed). Number of tokens: 170,000
- Canonical poems from the 20th century: 159 poems were selected from popular high school literary anthologies published between 2007 and 2020. Poems were selected if they appeared in at least three of the five textbooks used. Six long poems were removed from the level2_corrected files, which contain the manually corrected morphosyntactic annotations. Number of tokens: 38,000
- Contemporary poems: 120 poems were selected from the database of the Digitális Irodalmi Akadémia (Digital Literary Academy). Three poems by each author were sampled randomly. Number of tokens: 19,000
- Song lyrics: 144 songs were selected from the annual streaming and radio charts provided by MAHASZ (Hungarian Association of Record Producers), covering the period from 2014 to 2022. A maximum of three songs per artist were included. Number of tokens: 52,000
- Slam poetry texts: 108 texts were selected based on the view counts of videos from the Slam Poetry Magyarország (Slam Poetry Hungary) YouTube channel. A maximum of three texts per author was included. Number of tokens: 60,000
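The selection rules described above can be illustrated with a minimal Python sketch. It is not the project's actual sampling code: the function names, the input layouts, and the idea of applying the per-author cap in rank order are assumptions made for illustration. Only the thresholds (three of five anthologies, three items per author) and the text counts come from the description.

```python
import random
from collections import Counter, defaultdict

# Text counts per subcorpus, as listed above.
TEXTS = {"canonical": 159, "contemporary": 120, "songs": 144, "slam": 108}


def select_canonical(anthologies, min_books=3):
    """Keep poems that appear in at least `min_books` of the anthologies.

    `anthologies` is a hypothetical list of lists of poem identifiers,
    one inner list per textbook.
    """
    counts = Counter(poem for book in anthologies for poem in set(book))
    return sorted(poem for poem, n in counts.items() if n >= min_books)


def cap_per_author(ranked, max_per_author=3):
    """Walk a ranked list of (author, title) pairs and keep at most
    `max_per_author` items per author, preserving rank order.

    Ranking by chart position or view count is an assumption here.
    """
    kept, seen = [], defaultdict(int)
    for author, title in ranked:
        if seen[author] < max_per_author:
            kept.append((author, title))
            seen[author] += 1
    return kept


def sample_per_author(poems_by_author, k=3, seed=0):
    """Randomly sample k poems per author (fewer if an author has fewer).

    `poems_by_author` maps an author to a list of poem identifiers.
    """
    rng = random.Random(seed)
    return {a: rng.sample(p, min(k, len(p))) for a, p in poems_by_author.items()}


if __name__ == "__main__":
    # The four subcorpora add up to the total number of texts.
    print(sum(TEXTS.values()))
```

The token counts listed per subcorpus add up to 169,000, which is consistent with the rounded total of 170,000.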
Annotation levels:
- level0: Plain texts without annotations: manually corrected transcriptions of the slam poetry texts, and the texts of songs, canonical poems, and contemporary poems generated from the level1 files.
- level1: Manually created annotations of structural units (lines, stanzas).
- level2: Automatically created annotations of lemma, part of speech, and morphosyntactic features of words.
- level2_corrected: Manually corrected annotations of lemma, part of speech, and morphosyntactic features of words.
- level3_without_meter: Automatically created annotations of rhyme patterns, rhyme pairs, rhythm of lines, alliterations, and phonological features of words, in addition to the already created annotations.
- level3: Automatically created annotations of quantitative and qualitative meters, in addition to the already created annotations.
- level4: Modified position and/or name of some elements and attributes, and automatically generated annotations of syllable counts, word counts per structural unit, and other simple numeric features, alongside existing annotations.
- level5: Manually annotated syntactic features. The annotations focus on the dependency relations between the verb and its complements.
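For users who need to tell manually checked layers from automatic ones, the levels above can be summarized in a small lookup table. This is only a sketch: the level names and short descriptions follow the list, while the "manual", "automatic", and "mixed" labels, the table layout, and the helper function are assumptions made for illustration (level4 is labeled automatic here because its new annotations are generated automatically).

```python
# Level names follow the list above; the method labels and this layout
# are illustrative assumptions, not part of the corpus distribution.
ANNOTATION_LEVELS = {
    "level0": ("mixed", "plain texts without annotations"),
    "level1": ("manual", "structural units (lines, stanzas)"),
    "level2": ("automatic", "lemma, part of speech, morphosyntactic features"),
    "level2_corrected": ("manual", "corrected lemma, part of speech, morphosyntactic features"),
    "level3_without_meter": ("automatic", "rhyme, rhythm, alliteration, phonological features"),
    "level3": ("automatic", "quantitative and qualitative meters"),
    "level4": ("automatic", "renamed/moved elements, syllable and word counts"),
    "level5": ("manual", "verb-complement dependency relations"),
}


def levels_by_method(method):
    """Return level names with the given method label, in listing order."""
    return [name for name, (m, _) in ANNOTATION_LEVELS.items() if m == method]


if __name__ == "__main__":
    print(levels_by_method("manual"))
```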
Funding:
Corpus building was supported by project No. K-137659 (_Corpus-based cognitive poetic research on person marking constructions_) of the National Research, Development and Innovation Office of Hungary.
Copyright:
The content of the corpus is not public and cannot be disclosed due to copyright protection. The corpus is for research purposes only and can only be used by those who have received permission from the principal investigator.