PoeTree. Poetry Treebanks in Czech, English, French, German, Hungarian, Italian, Portuguese, Russian and Spanish

Plecháč, Petr; Kolár, Robert; Cinková, Silvie; Šeļa, Artjoms; De Sisto, Mirella; Nugues, Lara; Haider, Thomas; Nagy, Benjamin; Delente, Éliane; Renault, Richard; Bobenhausen, Klemens; Hammerich, Benjamin; Mittmann, Adiel; Palkó, Gábor; Horváth, Péter; Navarro Colorado, Borja; Ruiz Fabo, Pablo; Bermúdez Sabel, Helena; Korchagin, Kirill; Plungian, Vladimir; Sitchinava, Dmitri

doi:10.5281/zenodo.10008459

Published October 16, 2023 | Version 0.0.1

Dataset Open

1. Czech Academy of Sciences, Institute of Czech Literature
2. Charles University
3. The Institute of the Polish Language of the Polish Academy of Sciences
4. University of Tartu
5. Tilburg University
6. University of Basel
7. University of Passau
8. Université de Caen Normandie
9. Metricalizer
10. Universidade Federal de Santa Catarina
11. Eötvös Loránd University
12. University of Alicante
13. Université de Strasbourg
14. Jinntec GmbH
15. Institute of Russian Language, Russian Academy of Sciences
16. Institute of Linguistics, Russian Academy of Sciences
17. University of Potsdam

PoeTree (Poetry Treebanks) is a dataset comprising over 300,000 poems / 84,000,000 tokens in nine languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Spanish, and Russian). Each corpus has been deduplicated, enriched with Universal Dependencies, provided with additional metadata and converted into a unified JSON structure (schema available at https://versologie.cz/poetree/json-schema).
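Because every corpus shares one JSON structure, a first look at a downloaded archive can be done with the Python standard library alone. The sketch below is only an illustration under stated assumptions: it assumes that each language archive is a zip file containing `.json` members, and it deliberately assumes nothing about field names, which are defined by the schema linked above. The helper names `iter_json_documents` and `summarize_top_level_keys` are hypothetical and not part of the dataset.

```python
import json
import zipfile
from pathlib import Path
from typing import Any, Dict, Iterator, Tuple


def iter_json_documents(archive_path: Path) -> Iterator[Tuple[str, Any]]:
    """Yield (member_name, parsed_json) for every .json member of a zip archive.

    Assumption: the archive stores its documents as individual .json files.
    Members with any other extension are skipped.
    """
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".json"):
                continue
            with archive.open(name) as handle:
                # json.load accepts a binary file object and decodes UTF-8 itself.
                yield name, json.load(handle)


def summarize_top_level_keys(archive_path: Path, limit: int = 100) -> Dict[str, int]:
    """Count how often each top-level key occurs in the first `limit` JSON objects.

    This inspects the structure without assuming any particular field names.
    """
    counts: Dict[str, int] = {}
    for index, (_, document) in enumerate(iter_json_documents(archive_path)):
        if index >= limit:
            break
        if isinstance(document, dict):
            for key in document:
                counts[key] = counts.get(key, 0) + 1
    return counts
```

For example, `summarize_top_level_keys(Path("cs.zip"))` would report which top-level keys occur in the first 100 JSON objects of the Czech archive, which can then be compared against the published schema.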

cs (~80k poems)
- derived from Corpus of Czech Verse
de (~50k poems)
- derived from Metricalizer
en (~40k poems)
- based on the texts from Project Gutenberg
es (~9k poems)
- derived from Corpus of Spanish Golden-Age Sonnets and Diachronic Spanish Sonnet Corpus
fr (~18k poems)
- derived from Malherbə
hu (~13k poems)
- derived from ELTE Poetry Corpus
it (~40k poems)
- derived from Biblioteca Italiana
pt (~5k poems)
- derived from Poemas
ru (~45k poems)
- derived from Corpus of Russian Poetry

Files

Files (2.2 GB)

Name	MD5	Size
cs.zip	7adb32d5511ea610058ff1cfa30b1d59	554.1 MB
de.zip	2fb5de44301a6e393f4a8bd19ec54d18	333.1 MB
en.zip	8157f5d7a2dbb667d0d9a6a4da3f6e4e	388.0 MB
es.zip	9d090d9bb54946b32ee76f4f10b18ea7	31.9 MB
fr.zip	6b28a244226503e78e14a2b1d771c39c	172.4 MB
hu.zip	64fb203e5bbe4cec14f5761100a4096c	83.7 MB
it.zip	4342e2a8767037de509e6846d63a388b	305.6 MB
pt.zip	7d82d408c5046c1ed9cd711ba3994321	31.3 MB
ru.zip	fbdddb6ee7077a1c1f94c215b5d0099b	258.1 MB
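
The MD5 checksums listed for the archives can be used to confirm that a download completed without corruption. The following is a minimal sketch using only `hashlib` from the standard library. The checksums are copied from the file list; the function names `md5_of_file` and `verify_download` are hypothetical helpers introduced here for illustration, and the sketch assumes the downloaded file keeps its original name.

```python
import hashlib
from pathlib import Path

# Checksums as published in the file list of this record.
EXPECTED_MD5 = {
    "cs.zip": "7adb32d5511ea610058ff1cfa30b1d59",
    "de.zip": "2fb5de44301a6e393f4a8bd19ec54d18",
    "en.zip": "8157f5d7a2dbb667d0d9a6a4da3f6e4e",
    "es.zip": "9d090d9bb54946b32ee76f4f10b18ea7",
    "fr.zip": "6b28a244226503e78e14a2b1d771c39c",
    "hu.zip": "64fb203e5bbe4cec14f5761100a4096c",
    "it.zip": "4342e2a8767037de509e6846d63a388b",
    "pt.zip": "7d82d408c5046c1ed9cd711ba3994321",
    "ru.zip": "fbdddb6ee7077a1c1f94c215b5d0099b",
}


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path) -> bool:
    """Compare a downloaded archive against the published checksum for its file name."""
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"No published checksum for {path.name!r}")
    return md5_of_file(path) == expected
```

Calling `verify_download(Path("cs.zip"))` returns `True` only when the local file matches the published digest. Reading in chunks matters here because the largest archive is over 500 MB.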

Additional details

References

Bobenhausen, K., & Hammerich, B. (2015). Métrique littéraire, métrique linguistique et métrique algorithmique de l'allemand mises en jeu dans le programme Metricalizer². Langages, (199), 67–87.
Delente, É., & Renault, R. (2021). Projet Anamètre : présentation, limites et avancées. In A.-S. Bories, G. Purnelle, & H. Marchal (Eds.), Plotting Poetry, On Mechanically-Enhanced Reading (pp. 73–92). Presses universitaires de Liège.
Horváth, P., Kundráth, P., Indig, B., Fellegi, Z., Szlávich, E., Borbála Bajzát, T., Sárközi-Lindner, Z., Vida, B., Karabulut, A., Timári, M., & Palkó, G. (2022). ELTE Poetry Corpus: a machine annotated database of canonical Hungarian poetry. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022) (pp. 3471–3478). ELRA. https://aclanthology.org/2022.lrec-1.372/
Mittmann, A., Pergher, P. H., & Luiz dos Santos, A. (2019). What Rhythmic Signature Says About Poetic Corpora. In P. Plecháč, B. P. Scherr, T. Skulacheva, H. Bermúdez-Sabel, & R. Kolár (Eds.), Quantitative Approaches to Versification (pp. 153–172). ICL CAS. https://versologie.cz/conference2019/proceedings/mittmann-pergher-dossantos.pdf
Navarro-Colorado, B., Ribez Lafoz, M., & Sánchez, N. (2016). Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 4360–4364). ELRA. http://www.lrec-conf.org/proceedings/lrec2016/pdf/453_Paper.pdf
Plecháč, P., & Kolár, R. (2015). The Corpus of Czech Verse. Studia Metrica et Poetica, 2(1), 107–118. https://doi.org/10.12697/smp.2015.2.1.05
Ruiz Fabo, P., Bermúdez Sabel, H., Martínez Cantón, C., & González-Blanco, E. (2021). The Diachronic Spanish Sonnet Corpus: TEI and linked open data encoding, data distribution, and metrical findings. Digital Scholarship in the Humanities, 36(Supplement_1), i68–i80. https://doi.org/10.1093/llc/fqaa035
Grishina, E., Korchagin, K., Plungian, V., & Sichinava, D. (2009). Poeticheskii korpus v ramkah NKRIA: obschaia struktura i perspektivy ispolzovania. In Natsionalnii korpus russkogo iazyka: 2006–2008. Novye rezultaty i perspektivy (pp. 71–113). Nestor-Istoria.
