Published October 23, 2025 | Version 1.0.0
Dataset Open

PoeTree. Poetry Corpora in Czech, English, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Slovenian, and Spanish

  • 1. Czech Academy of Sciences, Institute of Czech Literature
  • 2. University of Tartu
  • 3. Jinntec GmbH
  • 4. Metricalizer
  • 5. Charles University
  • 6. National Library of Norway
  • 7. Université de Caen Normandie
  • 8. Tilburg University
  • 9. University of Passau
  • 10. Eötvös Loránd University
  • 11. University of Oslo
  • 12. University of Ljubljana
  • 13. Institute of Russian Language, Russian Academy of Sciences
  • 14. Universidade Federal de Santa Catarina
  • 15. The Institute of the Polish Language of the Polish Academy of Sciences
  • 16. University of Alicante
  • 17. University of Basel
  • 18. Institute of Linguistics, Russian Academy of Sciences
  • 19. Université de Strasbourg
  • 20. University of Potsdam

Description

PoeTree is a dataset comprising nearly 335,000 poems (90,000,000 tokens) in 11 languages: Czech, English, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Slovenian, and Spanish. Each corpus has been deduplicated, annotated with Universal Dependencies, supplemented with additional metadata, and converted into a unified JSON structure (schema available at https://versologie.cz/poetree/json-schema).
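Because every corpus shares one JSON structure, a single reader can in principle serve all eleven archives. The following is a minimal sketch that assumes each zip archive contains JSON files as members; that layout is an assumption, not something stated in this record. The function name `iter_json_documents` is hypothetical, and no field names are assumed: consult the published schema for the actual keys.

```python
import json
import zipfile
from typing import Any, Iterator, Tuple


def iter_json_documents(zip_path: str) -> Iterator[Tuple[str, Any]]:
    """Yield (member_name, parsed_json) for every .json member of a zip archive.

    Assumption: the archive stores its documents as .json members.
    Members with other extensions (and directory entries) are skipped.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".json"):
                continue  # skip directories and non-JSON members
            with archive.open(name) as handle:
                # json.load accepts a binary file object and decodes UTF-8
                yield name, json.load(handle)
```

Calling `sum(1 for _ in iter_json_documents("cs.zip"))` would then count the JSON documents in a downloaded archive without unpacking it to disk.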

New in v. 1.0.0:

  • PoeTree.no added
  • PoeTree.(cs,de,en,fr,hu,it,ru,sl) enriched with geolocation mentions
  • Updated and corrected metadata in PoeTree.(de,en,es,ru)
  • Multiple text corrections in PoeTree.ru

Files

cs.zip

Files (2.1 GB)

Checksum (MD5)                        Size
md5:4f82d1b34f7835838d560000683223d1  513.0 MB
md5:c244d8aab3b087bbcd9527988cee70bd  406.2 MB
md5:66d12f53340624a203e81187ad865fdf  357.0 MB
md5:8192cc8349b416d039dbcc732c299291  29.4 MB
md5:96f4a80d1b5e968c6c012c47051c27f3  160.8 MB
md5:f580c5eb9573b4c7faf631d7109215a5  79.1 MB
md5:fc02f2f57876ab7c7d6cbfbf46178638  282.5 MB
md5:2f7ed6e957c43d77f4eeec377e6ff3b5  16.7 MB
md5:0afbd72966a71d399f7ce60f2eb5d13c  28.7 MB
md5:ec605cc4b0bd8e50bb6e5a31e0b2dc32  247.7 MB
md5:5097ec0d4ab56eb980341b7b10db2492  23.2 MB
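The MD5 checksums listed above can be used to confirm that a download arrived intact. A minimal sketch follows, using only the Python standard library; the function names `md5_of_file` and `matches_checksum` are hypothetical, and which checksum belongs to which archive has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large archives are not held in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with a listed value such as 'md5:4f82...'."""
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("cs.zip", "md5:...")` with the value listed for that archive should return `True` for an uncorrupted download.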

Additional details

References

  • Bobenhausen, K., & Hammerich, B. (2015). Métrique littéraire, métrique linguistique et métrique algorithmique de l'allemand mises en jeu dans le programme Metricalizer². Langages, (199), 67–87.
  • Delente, É., & Renault, R. (2021). Projet Anamètre : présentation, limites et avancées. In A.-S. Bories, G. Purnelle, & H. Marchal (Eds.), Plotting Poetry, On Mechanically-Enhanced Reading (pp. 73–92). Presses universitaires de Liège.
  • Horváth, P., Kundráth, P., Indig, B., Fellegi, Z., Szlávich, E., Borbála Bajzát, T., Sárközi-Lindner, Z., Vida, B., Karabulut, A., Timári, M., & Palkó, G. (2022). ELTE Poetry Corpus: a machine annotated database of canonical Hungarian poetry. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022) (pp. 3471–3478). ELRA. https://aclanthology.org/2022.lrec-1.372/
  • Mittmann, A., Pergher, P. H., & Luiz dos Santos, A. (2019). What Rhythmic Signature Says About Poetic Corpora. In P. Plecháč, B. P. Scherr, T. Skulacheva, H. Bermúdez-Sabel, & R. Kolár (Eds.), Quantitative Approaches to Versification (pp. 153–172). ICL CAS. https://versologie.cz/conference2019/proceedings/mittmann-pergher-dossantos.pdf
  • Navarro-Colorado, B., Ribes Lafoz, M., & Sánchez, N. (2016). Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 4360–4364). ELRA. http://www.lrec-conf.org/proceedings/lrec2016/pdf/453_Paper.pdf
  • Plecháč, P., & Kolár, R. (2015). The Corpus of Czech Verse. Studia Metrica et Poetica, 2(1), 107–118. https://doi.org/10.12697/smp.2015.2.1.05
  • Ruiz Fabo, P., Bermúdez Sabel, H., Martínez Cantón, C., & González-Blanco, E. (2021). The Diachronic Spanish Sonnet Corpus: TEI and linked open data encoding, data distribution, and metrical findings. Digital Scholarship in the Humanities, 36(Supplement_1), i68–i80. https://doi.org/10.1093/llc/fqaa035
  • Grishina, E., Korchagin, K., Plungian, V., & Sichinava, D. (2009). Poeticheskii korpus v ramkah NKRIA: obschaia struktura i perspektivy ispolzovania. In Natsionalnii korpus russkogo iazyka: 2006–2008. Novye rezultaty i perspektivy (pp. 71–113). Nestor-Istoria.
  • Haider, T. (2021). Metrical Tagging in the Wild: Building and Annotating Poetry Corpora with Rhythmic Features. In P. Merlo, J. Tiedemann, & R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 3715–3725). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.325
  • Kvinnsland, R., Dale, I. L., & Tungland, L. M. (2024). Rediscovering the 1890s: A Norwegian Poetry Corpus. In W. Haverals, M. Koolen, & L. Thompson (Eds.), Proceedings of the Computational Humanities Research Conference 2024 (pp. 1259–1271). CEUR. https://ceur-ws.org/Vol-3834/#paper24