Published October 16, 2023 | Version 0.0.1
Dataset
Open
PoeTree. Poetry Treebanks in Czech, English, French, German, Hungarian, Italian, Portuguese, Russian and Spanish
Creators
- Plecháč, Petr (1)
- Kolár, Robert (1)
- Cinková, Silvie (1, 2)
- Šeļa, Artjoms (3, 4)
- De Sisto, Mirella (5)
- Nugues, Lara (6)
- Haider, Thomas (7)
- Nagy, Benjamin (3)
- Delente, Éliane (8)
- Renault, Richard (8)
- Bobenhausen, Klemens (9)
- Hammerich, Benjamin (9)
- Mittmann, Adiel (10)
- Palkó, Gábor (11)
- Horváth, Péter (11)
- Navarro Colorado, Borja (12)
- Ruiz Fabo, Pablo (13)
- Bermúdez Sabel, Helena (14)
- Korchagin, Kirill (15)
- Plungian, Vladimir (15, 16)
- Sitchinava, Dmitri (17)
- 1. Czech Academy of Sciences, Institute of Czech Literature
- 2. Charles University
- 3. The Institute of the Polish Language of the Polish Academy of Sciences
- 4. University of Tartu
- 5. Tilburg University
- 6. University of Basel
- 7. University of Passau
- 8. Université de Caen Normandie
- 9. Metricalizer
- 10. Universidade Federal de Santa Catarina
- 11. Eötvös Loránd University
- 12. University of Alicante
- 13. University of Strasbourg
- 14. Jinntec GmbH
- 15. Institute of Russian Language, Russian Academy of Sciences
- 16. Institute of Linguistics, Russian Academy of Sciences
- 17. University of Potsdam
Description
PoeTree (Poetry Treebanks) is a dataset comprising over 300,000 poems / 84,000,000 tokens in nine languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Spanish, and Russian). Each corpus has been deduplicated, enriched with Universal Dependencies, provided with additional metadata and converted into a unified JSON structure (schema available at https://versologie.cz/poetree/json-schema).
- cs (~80k poems)
  - derived from Corpus of Czech Verse
- de (~50k poems)
  - derived from Metricalizer
- en (~40k poems)
  - based on texts from Project Gutenberg
- es (~9k poems)
  - derived from Corpus of Spanish Golden-Age Sonnets and Diachronic Spanish Sonnet Corpus
- fr (~18k poems)
  - derived from Malherbə
- hu (~13k poems)
  - derived from ELTE Poetry Corpus
- it (~40k poems)
  - derived from Biblioteca Italiana
- pt (~5k poems)
  - derived from Poemas
- ru (~45k poems)
  - derived from Corpus of Russian Poetry
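Because every corpus is converted into the same JSON structure, a first look at a downloaded archive can be scripted in the same way for each language. The following is a minimal sketch, not part of the dataset's tooling: it assumes the archive contains files ending in `.json` (the internal layout is not described here), and the function name `summarize_archive` is hypothetical. It only counts the JSON documents and tallies their top-level keys, so it does not rely on any particular field from the schema.

```python
import json
import zipfile
from collections import Counter


def summarize_archive(zip_path):
    """Count JSON documents in a ZIP archive and tally their top-level keys.

    Assumption: members of interest end in ".json"; the real archive
    layout may differ.
    """
    key_counts = Counter()
    n_docs = 0
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue  # skip directories and non-JSON members
            with archive.open(name) as handle:
                doc = json.load(handle)
            n_docs += 1
            if isinstance(doc, dict):
                key_counts.update(doc.keys())
    return n_docs, key_counts
```

Calling `summarize_archive` on a downloaded archive returns the number of parsed documents and a `Counter` of key names, which can then be compared against the published JSON schema.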
Files
(2.2 GB)
MD5 checksum | Size |
---|---|
md5:7adb32d5511ea610058ff1cfa30b1d59 | 554.1 MB |
md5:2fb5de44301a6e393f4a8bd19ec54d18 | 333.1 MB |
md5:8157f5d7a2dbb667d0d9a6a4da3f6e4e | 388.0 MB |
md5:9d090d9bb54946b32ee76f4f10b18ea7 | 31.9 MB |
md5:6b28a244226503e78e14a2b1d771c39c | 172.4 MB |
md5:64fb203e5bbe4cec14f5761100a4096c | 83.7 MB |
md5:4342e2a8767037de509e6846d63a388b | 305.6 MB |
md5:7d82d408c5046c1ed9cd711ba3994321 | 31.3 MB |
md5:fbdddb6ee7077a1c1f94c215b5d0099b | 258.1 MB |
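Each file is listed with an MD5 checksum, so a download can be checked for corruption before unpacking. The sketch below uses only Python's standard `hashlib`; the helper names `md5_of_file` and `matches_listed_checksum` are hypothetical, and the path passed in is whatever local file name the download was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(local_path, "md5:7adb32d5511ea610058ff1cfa30b1d59")` should return `True` only if the local file is byte-identical to the file published with that checksum.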
Additional details
References
- Bobenhausen, K., & Hammerich, B. (2015). Métrique littéraire, métrique linguistique et métrique algorithmique de l'allemand mises en jeu dans le programme Metricalizer². Langages, (199), 67–87.
- Delente, É., & Renault, R. (2021). Projet Anamètre : présentation, limites et avancées. In A.-S. Bories, G. Purnelle, & H. Marchal (Eds.), Plotting Poetry, On Mechanically-Enhanced Reading (pp. 73–92). Presses universitaires de Liège.
- Horváth, P., Kundráth, P., Indig, B., Fellegi, Z., Szlávich, E., Borbála Bajzát, T., Sárközi-Lindner, Z., Vida, B., Karabulut, A., Timári, M., & Palkó, G. (2022). ELTE Poetry Corpus: a machine annotated database of canonical Hungarian poetry. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022) (pp. 3471–3478). ELRA. https://aclanthology.org/2022.lrec-1.372/
- Mittmann, A., Pergher, P. H., & Luiz dos Santos, A. (2019). What Rhythmic Signature Says About Poetic Corpora. In P. Plecháč, B. P. Scherr, T. Skulacheva, H. Bermúdez-Sabel, & R. Kolár (Eds.), Quantitative Approaches to Versification (pp. 153–172). ICL CAS. https://versologie.cz/conference2019/proceedings/mittmann-pergher-dossantos.pdf
- Navarro-Colorado, B., Ribes Lafoz, M., & Sánchez, N. (2016). Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 4360–4364). ELRA. http://www.lrec-conf.org/proceedings/lrec2016/pdf/453_Paper.pdf
- Plecháč, P., & Kolár, R. (2015). The Corpus of Czech Verse. Studia Metrica et Poetica, 2(1), 107–118. https://doi.org/10.12697/smp.2015.2.1.05
- Ruiz Fabo, P., Bermúdez Sabel, H., Martínez Cantón, C., & González-Blanco, E. (2021). The Diachronic Spanish Sonnet Corpus: TEI and linked open data encoding, data distribution, and metrical findings. Digital Scholarship in the Humanities, 36(Supplement_1), i68–i80. https://doi.org/10.1093/llc/fqaa035
- Grishina, E., Korchagin, K., Plungian, V., & Sichinava, D. (2009). Poeticheskii korpus v ramkah NKRIA: obschaia struktura i perspektivy ispolzovania. In Natsionalnii korpus russkogo iazyka: 2006–2008. Novye rezultaty i perspektivy (pp. 71–113). Nestor-Istoria.