Published May 27, 2025 | Version v1.0

ELTE Poetry Corpus

Description

ELTE Poetry Corpus is a database developed by the Department of Digital Humanities at Eötvös Loránd University. It currently contains the complete poems of 53 canonical Hungarian poets, together with annotations of the poems' sound devices and of the grammatical features of their words. The data are provided in XML, in both TEI and non-TEI formats.

Size:

- number of poets: 53
- number of poems: 14 358
- number of words: 2 859 163
- number of tokens: 3 621 416

TEI Levels:

The corpus was compiled from the collection of the Hungarian Electronic Library (http://mek.oszk.hu), which contains numerous poetic oeuvres in digitized form.

1. The texts from the Hungarian Electronic Library were converted into TEI XML format, following the guidelines of the Text Encoding Initiative.
2. The automatically converted poems containing the annotations of structural units were checked manually (level1).
3. Then, we tokenized the poems and annotated the grammatical features of words using e-magyar, an NLP toolchain for Hungarian texts. The level2 folder contains the TEI XML files in which the morphosyntactic features (the values of the msd attributes) are annotated in the Universal Dependencies format, while the level2_emMorph folder contains the same files with the morphosyntactic features annotated in e-magyar's own emMorph format.
4. After the grammatical annotation, we also annotated the rhyme patterns, the rhyme pairs, the rhythm of lines, the alliterations, the phonological features of words, and the meter of the poems (level3).
5. Finally, we added further annotations of poetic features to the corpus and changed the name and the position of some elements and attributes, using a non-TEI XML format defined for the project (level4).
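
As an illustration of how the level2 annotation might be consumed, the following minimal Python sketch counts the values of the msd attributes in a TEI XML document. The inline sample, including its element names (w, l, lg) and feature strings, is a hypothetical stand-in and is not taken from the corpus; the actual files may be structured differently. The only assumption borrowed from the description above is that morphosyntactic features are stored in msd attributes in Universal Dependencies style.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: element names and feature strings are assumptions,
# not copied from the corpus. Only the "msd" attribute name comes from
# the corpus description.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><lg><l>
    <w msd="Case=Nom|Number=Sing">ég</w>
    <w msd="Case=Nom|Number=Sing">föld</w>
    <w msd="Mood=Ind|Number=Sing|Person=3">áll</w>
  </l></lg></body></text>
</TEI>"""


def count_msd_values(root):
    """Count msd attribute values on every element, whatever its tag or namespace."""
    counts = Counter()
    for elem in root.iter():
        msd = elem.get("msd")
        if msd is not None:
            counts[msd] += 1
    return counts


def count_features(msd_counts):
    """Split 'Key=Value|Key=Value' strings into counts of individual features."""
    features = Counter()
    for msd, n in msd_counts.items():
        for pair in msd.split("|"):
            if "=" in pair:
                features[pair] += n
    return features


root = ET.fromstring(SAMPLE)
msd_counts = count_msd_values(root)
feature_counts = count_features(msd_counts)
```

To run this on a real file, `ET.parse(path).getroot()` can replace `ET.fromstring(SAMPLE)`. Iterating over all elements rather than a fixed tag name keeps the sketch independent of the exact element that carries the msd attribute.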

Files

poetry-corpus-1.0.zip (568.2 MB)
md5:4951413ae1d3cc6598364e10c8d5d825
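
The MD5 checksum listed for the archive can be used to check that a download was not corrupted in transit. The following is a minimal Python sketch; the function names and the local file path in the usage note are hypothetical, and an MD5 match only indicates integrity against accidental corruption, not authenticity.

```python
import hashlib

# Checksum listed for poetry-corpus-1.0.zip on this page.
EXPECTED_MD5 = "4951413ae1d3cc6598364e10c8d5d825"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("poetry-corpus-1.0.zip")` would return True for an intact copy saved under that (assumed) local name. Reading in chunks avoids loading the whole 568.2 MB archive into memory.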