Poesi.as dataset
Description
Collection of poems, mostly Spanish, from the 21st century and before.
Some stats:
- Number of poems: 25,187
- Number of words: 7,918,679
Two JSON files are provided:
- poesias_corpora.json: This is the JSON used to generate the txt files.
- poesias_corpora_old_spanish.json: This JSON is still a work in progress. It contains old Spanish poems, mostly by Alfonso X, which are not included in the corpora folder.
An additional CSV file, authors.csv, provides reconciled information for authors of the 20th century and earlier. Identifiers (VIAF, BnF, BNE, LoC, ISNI), dates of birth and death, and gender are also added as they appear in Wikidata.
This repo is a dump of the website www.poesi.as. We do not own the rights to any of the works published here.
For any violations or infringement of copyright, take proper action within the scope of the original website.
Public Domain
The script extract.py generates a public domain corpus in JSON extracted from the corpus in poesi.as. The number of years since the death of an author needed for a work to be considered in the public domain can be specified using -y YEARS (--years YEARS). It defaults to 80, as per Spanish copyright law.
$ python extract.py > public_domain.json
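The rule described above (an author must have died at least YEARS years ago) can be illustrated with a minimal sketch. This is not the code of extract.py: the record fields (`author`, `death_year`, `title`), the `is_public_domain` helper, the inclusive comparison, and the choice to exclude authors with unknown death years are all assumptions made for illustration.

```python
from datetime import date


def is_public_domain(death_year, years=80, current_year=None):
    """Return True if the author died at least `years` years ago.

    Assumption: authors with an unknown death year are treated as
    not in the public domain.
    """
    if death_year is None:
        return False
    if current_year is None:
        current_year = date.today().year
    return current_year - death_year >= years


# Hypothetical records; the real JSON layout may differ.
poems = [
    {"author": "Author A", "death_year": 1936, "title": "Poem 1"},
    {"author": "Author B", "death_year": 1999, "title": "Poem 2"},
    {"author": "Author C", "death_year": None, "title": "Poem 3"},
]

# Keep only poems whose author passes the threshold (fixed year for reproducibility).
public_domain = [
    poem for poem in poems
    if is_public_domain(poem["death_year"], years=80, current_year=2024)
]
```

Passing a different `years` value mirrors what the -y option is described as doing; whether the real script counts the boundary year inclusively is not stated in the source.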
Files
Name | Size | Checksum |
---|---|---|
linhd-postdata/poesi.as-v1.0.0.zip | 33.3 MB | md5:5e9506a0e9a8178d093ca245fb8160bf |
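The md5 checksum listed for the archive can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard hashlib module; the helper names `md5_of_file` and `verify`, and the local file path in the usage note, are assumptions for illustration.

```python
import hashlib

# Checksum as listed for the archive in this record.
EXPECTED_MD5 = "5e9506a0e9a8178d093ca245fb8160bf"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

Assuming the archive was saved locally as poesi.as-v1.0.0.zip, calling `verify("poesi.as-v1.0.0.zip")` should return True for an uncorrupted download.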
Additional details
Related works
- Is supplement to
- https://github.com/linhd-postdata/poesi.as/tree/v1.0.0 (URL)