Published October 23, 2025 | Version 1.0.0
Dataset
Open
PoeTree. Poetry Corpora in Czech, English, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Slovenian, and Spanish
Creators
- Plecháč, Petr (1)
- Šeļa, Artjoms (1, 2)
- Bermúdez Sabel, Helena (3)
- Bobenhausen, Klemens (4)
- Cinková, Silvie (1, 5)
- Dale, Ingerid Løyning (6)
- Delente, Éliane (7)
- De Sisto, Mirella (8)
- Haider, Thomas (9)
- Hammerich, Benjamin (4)
- Horváth, Péter (10)
- Kvinnsland, Ranveig (11)
- Kočnik, Neža (12)
- Kolár, Robert (1)
- Korchagin, Kirill (13)
- Martynenko, Antonina (1)
- Mittmann, Adiel (14)
- Nagy, Benjamin (15)
- Navarro Colorado, Borja (16)
- Nugues, Lara (17)
- Palkó, Gábor (10)
- Plungian, Vladimir (13, 18)
- Renault, Richard (7)
- Ruiz Fabo, Pablo (19)
- Seláf, Levente (10)
- Sitchinava, Dmitri (20)
1. Czech Academy of Sciences, Institute of Czech Literature
2. University of Tartu
3. Jinntec GmbH
4. Metricalizer
5. Charles University
6. National Library of Norway
7. Université de Caen Normandie
8. Tilburg University
9. University of Passau
10. Eötvös Loránd University
11. University of Oslo
12. University of Ljubljana
13. Institute of Russian Language, Russian Academy of Sciences
14. Universidade Federal de Santa Catarina
15. The Institute of the Polish Language of the Polish Academy of Sciences
16. University of Alicante
17. University of Basel
18. Institute of Linguistics, Russian Academy of Sciences
19. Université de Strasbourg
20. University of Potsdam
Description
PoeTree is a dataset comprising nearly 335,000 poems (90,000,000 tokens) in 11 languages (Czech, English, French, German, Hungarian, Italian, Norwegian, Portuguese, Spanish, Slovenian, and Russian). Each corpus has been deduplicated, enriched with Universal Dependencies, provided with additional metadata, and converted into a unified JSON structure (schema available at https://versologie.cz/poetree/json-schema).
- cs (~80k poems)
  - derived from Corpus of Czech Verse
- de (~74k poems)
  - derived from Metricalizer and Deutsches Lyrik Korpus
- en (~40k poems)
  - based on texts from Project Gutenberg
- es (~9k poems)
  - derived from Corpus of Spanish Golden-Age Sonnets and Diachronic Spanish Sonnet Corpus
- fr (~18k poems)
  - derived from Malherbə
- hu (~13k poems)
  - derived from ELTE Poetry Corpus
- it (~40k poems)
  - derived from Biblioteca Italiana
- no (~3k poems)
  - derived from NORN Poems
- pt (~5k poems)
  - derived from Poemas
- ru (~45k poems)
  - derived from Corpus of Russian Poetry
- sl (~5k poems)
  - based on texts from Wikisource
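As a rough sketch of how one of the per-language archives might be read, the Python snippet below walks a zip file and parses every `.json` member. It assumes that each archive (for example `cs.zip`) contains JSON files following the unified structure; the internal layout of the archives is not described here, the function name `iter_json_documents` is hypothetical, and no field names are relied on, so the published JSON schema remains the reference for what each document contains.

```python
import json
import zipfile
from typing import Any, Iterator, Tuple


def iter_json_documents(zip_path: str) -> Iterator[Tuple[str, Any]]:
    """Yield (member_name, parsed_json) for every .json file in a zip archive.

    Assumption: the archive stores its documents as JSON files; the folder
    layout inside the archive is not assumed.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".json"):
                continue  # skip directories and non-JSON members
            with archive.open(name) as handle:
                # json.load reads the bytes and detects UTF-8/16/32 itself
                yield name, json.load(handle)
```

A loop such as `for name, document in iter_json_documents("cs.zip"): ...` would then process one parsed document at a time rather than loading the whole archive into memory.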
New in v. 1.0.0:
- PoeTree.no added
- PoeTree.(cs,de,en,fr,hu,it,ru,sl) enriched with geolocation mentions
- Updated and corrected metadata in PoeTree.(de,en,es,ru)
- Multiple text corrections in PoeTree.ru
Files
cs.zip
Files (2.1 GB)

| Checksum | Size |
|---|---|
| md5:4f82d1b34f7835838d560000683223d1 | 513.0 MB |
| md5:c244d8aab3b087bbcd9527988cee70bd | 406.2 MB |
| md5:66d12f53340624a203e81187ad865fdf | 357.0 MB |
| md5:8192cc8349b416d039dbcc732c299291 | 29.4 MB |
| md5:96f4a80d1b5e968c6c012c47051c27f3 | 160.8 MB |
| md5:f580c5eb9573b4c7faf631d7109215a5 | 79.1 MB |
| md5:fc02f2f57876ab7c7d6cbfbf46178638 | 282.5 MB |
| md5:2f7ed6e957c43d77f4eeec377e6ff3b5 | 16.7 MB |
| md5:0afbd72966a71d399f7ce60f2eb5d13c | 28.7 MB |
| md5:ec605cc4b0bd8e50bb6e5a31e0b2dc32 | 247.7 MB |
| md5:5097ec0d4ab56eb980341b7b10db2492 | 23.2 MB |
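The MD5 values listed above can be used to check that a download arrived intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and which checksum belongs to which archive has to be taken from the record's own file listing.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 with an expected value such as 'md5:4f82...'."""
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]  # drop the listing's prefix
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use flat even for the archives of several hundred megabytes. MD5 is adequate here for detecting transfer corruption, not as a security guarantee.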
Additional details
References
- Bobenhausen, K., & Hammerich, B. (2015). Métrique littéraire, métrique linguistique et métrique algorithmique de l'allemand mises en jeu dans le programme Metricalizer². Langages, (199), 67–87.
- Delente, É., & Renault, R. (2021). Projet Anamètre : présentation, limites et avancées. In A.-S. Bories, G. Purnelle, & H. Marchal (Eds.), Plotting Poetry, On Mechanically-Enhanced Reading (pp. 73–92). Presses universitaires de Liège.
- Horváth, P., Kundráth, P., Indig, B., Fellegi, Z., Szlávich, E., Borbála Bajzát, T., Sárközi-Lindner, Z., Vida, B., Karabulut, A., Timári, M., & Palkó, G. (2022). ELTE Poetry Corpus: a machine annotated database of canonical Hungarian poetry. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022) (pp. 3471–3478). ELRA. https://aclanthology.org/2022.lrec-1.372/
- Mittmann, A., Pergher, P. H., & Luiz dos Santos, A. (2019). What Rhythmic Signature Says About Poetic Corpora. In P. Plecháč, B. P. Scherr, T. Skulacheva, H. Bermúdez-Sabel, & R. Kolár (Eds.), Quantitative Approaches to Versification (pp. 153–172). ICL CAS. https://versologie.cz/conference2019/proceedings/mittmann-pergher-dossantos.pdf
- Navarro-Colorado, B., Ribez Lafoz, M., & Sánchez, N. (2016). Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 4360–4364). ELRA. http://www.lrec-conf.org/proceedings/lrec2016/pdf/453_Paper.pdf
- Plecháč, P., & Kolár, R. (2015). The Corpus of Czech Verse. Studia Metrica et Poetica, 2(1), 107–118. https://doi.org/10.12697/smp.2015.2.1.05
- Ruiz Fabo, P., Bermúdez Sabel, H., Martínez Cantón, C., & González-Blanco, E. (2021). The Diachronic Spanish Sonnet Corpus: TEI and linked open data encoding, data distribution, and metrical findings. Digital Scholarship in the Humanities, 36(Supplement_1), i68–i80. https://doi.org/10.1093/llc/fqaa035
- Grishina, E., Korchagin, K., Plungian, V., & Sichinava, D. (2009). Poeticheskii korpus v ramkah NKRIA: obschaia struktura i perspektivy ispolzovania. In Natsionalnii korpus russkogo iazyka: 2006–2008. Novye rezultaty i perspektivy (pp. 71–113). Nestor-Istoria.
- Haider, T. (2021). Metrical Tagging in the Wild: Building and Annotating Poetry Corpora with Rhythmic Features. In P. Merlo, J. Tiedemann, & R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 3715–3725). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.325
- Kvinnsland, R., Dale, I. L., & Tungland, L. M. (2024). Rediscovering the 1890s: A Norwegian Poetry Corpus. In W. Haverals, M. Koolen, & L. Thompson (Eds.), Proceedings of the Computational Humanities Research Conference 2024 (pp. 1259–1271). CEUR. https://ceur-ws.org/Vol-3834/#paper24