There is a newer version of the record available.

Published October 16, 2023 | Version 0.0.1
Dataset Open

PoeTree. Poetry Treebanks in Czech, English, French, German, Hungarian, Italian, Portuguese, Russian and Spanish

  • 1. Czech Academy of Sciences, Institute of Czech Literature
  • 2. Charles University
  • 3. The Institute of the Polish Language of the Polish Academy of Sciences
  • 4. University of Tartu
  • 5. Tilburg University
  • 6. University of Basel
  • 7. University of Passau
  • 8. Université de Caen Normandie
  • 9. Metricalizer
  • 10. Universidade Federal de Santa Catarina
  • 11. Eötvös Loránd University
  • 12. University of Alicante
  • 13. University of Strasbourg
  • 14. Jinntec GmbH
  • 15. Institute of Russian Language, Russian Academy of Sciences
  • 16. Institute of Linguistics, Russian Academy of Sciences
  • 17. University of Potsdam

Description

PoeTree (Poetry Treebanks) is a dataset comprising over 300,000 poems and 84,000,000 tokens in nine languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Russian, and Spanish). Each corpus has been deduplicated, annotated with Universal Dependencies, supplemented with additional metadata, and converted into a unified JSON structure (schema available at https://versologie.cz/poetree/json-schema).
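As a rough illustration of working with the unified JSON structure, the sketch below walks through one downloaded archive and parses each JSON file it finds. It assumes that the archives contain files ending in `.json`, which the description does not state explicitly. The function name `iter_json_members` is hypothetical, and no field names are assumed; consult the schema linked above for the actual layout.

```python
import json
import zipfile
from typing import Any, Iterator, Tuple


def iter_json_members(zip_path: str) -> Iterator[Tuple[str, Any]]:
    """Yield (member name, parsed JSON) for every .json file in a zip archive.

    Assumption: the archive stores its documents as .json members.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".json"):
                continue  # skip directories and non-JSON members
            with archive.open(name) as handle:
                # json.load accepts a binary handle and detects UTF-8/16/32
                yield name, json.load(handle)
```

Because the function is a generator, only one document is held in memory at a time, which may matter for archives of several hundred megabytes.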

Files

cs.zip

Files (2.2 GB)

Checksum                              Size
md5:7adb32d5511ea610058ff1cfa30b1d59  554.1 MB
md5:2fb5de44301a6e393f4a8bd19ec54d18  333.1 MB
md5:8157f5d7a2dbb667d0d9a6a4da3f6e4e  388.0 MB
md5:9d090d9bb54946b32ee76f4f10b18ea7  31.9 MB
md5:6b28a244226503e78e14a2b1d771c39c  172.4 MB
md5:64fb203e5bbe4cec14f5761100a4096c  83.7 MB
md5:4342e2a8767037de509e6846d63a388b  305.6 MB
md5:7d82d408c5046c1ed9cd711ba3994321  31.3 MB
md5:fbdddb6ee7077a1c1f94c215b5d0099b  258.1 MB
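The MD5 values listed above can be used to check that a download arrived intact. The following is a minimal sketch of such a check using Python's standard `hashlib` module; the function names and the file path in the usage note are hypothetical, and which checksum belongs to which archive must be taken from the record itself.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:7adb32d5...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("downloaded.zip", "md5:...")` would return `True` only when the local file's digest equals the value copied from the list; the path and the elided checksum here are placeholders.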

Additional details

References

  • Bobenhausen, K., & Hammerich, B. (2015). Métrique littéraire, métrique linguistique et métrique algorithmique de l'allemand mises en jeu dans le programme Metricalizer². Langages, (199), 67–87.
  • Delente, É., & Renault, R. (2021). Projet Anamètre : présentation, limites et avancées. In A.-S. Bories, G. Purnelle, & H. Marchal (Eds.), Plotting Poetry, On Mechanically-Enhanced Reading (pp. 73–92). Presses universitaires de Liège.
  • Horváth, P., Kundráth, P., Indig, B., Fellegi, Z., Szlávich, E., Borbála Bajzát, T., Sárközi-Lindner, Z., Vida, B., Karabulut, A., Timári, M., & Palkó, G. (2022). ELTE Poetry Corpus: a machine annotated database of canonical Hungarian poetry. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022) (pp. 3471–3478). ELRA. https://aclanthology.org/2022.lrec-1.372/
  • Mittmann, A., Pergher, P. H., & Luiz dos Santos, A. (2019). What Rhythmic Signature Says About Poetic Corpora. In P. Plecháč, B. P. Scherr, T. Skulacheva, H. Bermúdez-Sabel, & R. Kolár (Eds.), Quantitative Approaches to Versification (pp. 153–172). ICL CAS. https://versologie.cz/conference2019/proceedings/mittmann-pergher-dossantos.pdf
  • Navarro-Colorado, B., Ribes Lafoz, M., & Sánchez, N. (2016). Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 4360–4364). ELRA. http://www.lrec-conf.org/proceedings/lrec2016/pdf/453_Paper.pdf
  • Plecháč, P., & Kolár, R. (2015). The Corpus of Czech Verse. Studia Metrica et Poetica, 2(1), 107–118. https://doi.org/10.12697/smp.2015.2.1.05
  • Ruiz Fabo, P., Bermúdez Sabel, H., Martínez Cantón, C., & González-Blanco, E. (2021). The Diachronic Spanish Sonnet Corpus: TEI and linked open data encoding, data distribution, and metrical findings. Digital Scholarship in the Humanities, 36(Supplement_1), i68–i80. https://doi.org/10.1093/llc/fqaa035
  • Grishina, E., Korchagin, K., Plungian, V., & Sichinava, D. (2009). Poeticheskii korpus v ramkah NKRIA: obschaia struktura i perspektivy ispolzovania. In Natsionalnii korpus russkogo iazyka: 2006–2008. Novye rezultaty i perspektivy (pp. 71–113). Nestor-Istoria.