Published April 2, 2024 | Version 0.0.2
Dataset Open

PoeTree. Poetry Treebanks in Czech, English, French, German, Hungarian, Italian, Portuguese, Russian, Slovenian and Spanish

  • 1. Czech Academy of Sciences, Institute of Czech Literature
  • 2. Charles University
  • 3. The Institute of the Polish Language of the Polish Academy of Sciences
  • 4. University of Tartu
  • 5. Tilburg University
  • 6. University of Basel
  • 7. University of Passau
  • 8. University of Ljubljana
  • 9. Université de Caen Normandie
  • 10. Metricalizer
  • 11. Universidade Federal de Santa Catarina
  • 12. Eötvös Loránd University
  • 13. University of Alicante
  • 14. University of Strasbourg
  • 15. Jinntec GmbH
  • 16. Institute of Russian Language, Russian Academy of Sciences
  • 17. Institute of Linguistics, Russian Academy of Sciences
  • 18. University of Potsdam

Description

PoeTree (Poetry Treebanks) is a dataset comprising over 330,000 poems / 89,000,000 tokens in ten languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Russian, Slovenian, and Spanish). Each corpus has been deduplicated, enriched with Universal Dependencies annotation, provided with additional metadata, and converted into a unified JSON structure (schema available at https://versologie.cz/poetree/json-schema).
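Since the corpora share one JSON structure, a single loader can serve every language. The following is a minimal sketch, assuming each corpus is distributed as a zip archive whose members ending in `.json` each hold one JSON document; the internal layout of the archives is not described here, and `iter_json_documents` is a hypothetical helper name. The sketch does not validate against the published schema and makes no assumption about field names.

```python
import io
import json
import zipfile
from typing import Any, Iterator, Tuple


def iter_json_documents(archive) -> Iterator[Tuple[str, Any]]:
    """Yield (member_name, parsed_json) for every .json member of a zip archive.

    `archive` may be a filesystem path or a binary file-like object.
    Assumption: each .json member contains exactly one JSON document.
    """
    with zipfile.ZipFile(archive) as zf:
        # Sort member names so iteration order is deterministic.
        for name in sorted(zf.namelist()):
            if name.lower().endswith(".json"):
                with zf.open(name) as fh:
                    yield name, json.load(fh)


if __name__ == "__main__":
    # Self-contained demo on an in-memory archive with made-up content.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("demo/poem.json", json.dumps({"example_key": "example value"}))
        zf.writestr("demo/notes.txt", "not JSON, skipped by the loader")
    buffer.seek(0)
    for member, document in iter_json_documents(buffer):
        print(member, type(document).__name__)
```

Streaming members one at a time keeps memory use low, which matters for archives of several hundred megabytes.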

New in v. 0.0.2:

  • PoeTree.sl added
  • PoeTree.de enriched with Deutsches Lyrik Korpus

Files

cs.zip

Files (2.3 GB)

Checksum                              Size
md5:f5a700ae5b0fdb3bed39b98f593496ec  554.1 MB
md5:d6a1ef2a3aee6dc9216510db0d30b800  434.3 MB
md5:19912a9e048744f6eea150a15d829c2f  388.0 MB
md5:c471a04e4cad3a914310a36292f9a1e3  31.9 MB
md5:028eeb1dd613fe59f93f9a2aed08a3c9  172.4 MB
md5:e9185276ea96bd513ef894899b745bc0  83.7 MB
md5:e8c4d075fce2c29cfb9dd108670559ce  305.6 MB
md5:efa7eb72d5ce86b8aea90ca75ba0aecd  31.3 MB
md5:b7964a83a441b7602ebd71210d5e1fec  258.1 MB
md5:b95b18e5949c16b2b930d5f4e8d7501b  26.1 MB
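The MD5 values listed for the files make it possible to confirm that a download arrived intact. The following is a minimal sketch using only the Python standard library; `md5_of_file` and `matches_checksum` are hypothetical helper names, and the caller is assumed to supply the checksum listed for the file in question.

```python
import hashlib


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read 1 MiB at a time so large archives never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed: str) -> bool:
    """Compare a file against a listed checksum such as 'md5:0123...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_checksum("cs.zip", "md5:<listed value>")` returns `True` only when the local file is byte-identical to the published one; on a mismatch, the usual remedy is to download the file again.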

Additional details

References

  • Bobenhausen, K., & Hammerich, B. (2015). Métrique littéraire, métrique linguistique et métrique algorithmique de l'allemand mises en jeu dans le programme Metricalizer². Langages, (199), 67–87.
  • Delente, É., & Renault, R. (2021). Projet Anamètre : présentation, limites et avancées. In A.-S. Bories, G. Purnelle, & H. Marchal (Eds.), Plotting Poetry, On Mechanically-Enhanced Reading (pp. 73–92). Presses universitaires de Liège.
  • Horváth, P., Kundráth, P., Indig, B., Fellegi, Z., Szlávich, E., Borbála Bajzát, T., Sárközi-Lindner, Z., Vida, B., Karabulut, A., Timári, M., & Palkó, G. (2022). ELTE Poetry Corpus: a machine annotated database of canonical Hungarian poetry. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022) (pp. 3471–3478). ELRA. https://aclanthology.org/2022.lrec-1.372/
  • Mittmann, A., Pergher, P. H., & Luiz dos Santos, A. (2019). What Rhythmic Signature Says About Poetic Corpora. In P. Plecháč, B. P. Scherr, T. Skulacheva, H. Bermúdez-Sabel, & R. Kolár (Eds.), Quantitative Approaches to Versification (pp. 153–172). ICL CAS. https://versologie.cz/conference2019/proceedings/mittmann-pergher-dossantos.pdf
  • Navarro-Colorado, B., Ribes Lafoz, M., & Sánchez, N. (2016). Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 4360–4364). ELRA. http://www.lrec-conf.org/proceedings/lrec2016/pdf/453_Paper.pdf
  • Plecháč, P., & Kolár, R. (2015). The Corpus of Czech Verse. Studia Metrica et Poetica, 2(1), 107–118. https://doi.org/10.12697/smp.2015.2.1.05
  • Ruiz Fabo, P., Bermúdez Sabel, H., Martínez Cantón, C., & González-Blanco, E. (2021). The Diachronic Spanish Sonnet Corpus: TEI and linked open data encoding, data distribution, and metrical findings. Digital Scholarship in the Humanities, 36(Supplement_1), i68–i80. https://doi.org/10.1093/llc/fqaa035
  • Grishina, E., Korchagin, K., Plungian, V., & Sichinava, D. (2009). Poeticheskii korpus v ramkah NKRIA: obschaia struktura i perspektivy ispolzovania. In Natsionalnii korpus russkogo iazyka: 2006–2008. Novye rezultaty i perspektivy (pp. 71–113). Nestor-Istoria.
  • Haider, T. (2021). Metrical Tagging in the Wild: Building and Annotating Poetry Corpora with Rhythmic Features. In P. Merlo, J. Tiedemann, & R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 3715–3725). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.325