Published May 29, 2024 | Version v1.0
Dataset Open

CEREAL I, el Corpus del Español REAL

  • 1. German Research Centre for Artificial Intelligence
  • 2. University of Bologna

Description

Content:

CEREAL (see the project website, https://cereal-es.github.io/CEREAL/) is a document-level corpus of Spanish texts extracted from OSCAR. Each document in the corpus is classified according to its country of origin. CEREAL covers 24 countries where Spanish is spoken. Following OSCAR, we provide our annotations under the CC0 license, but we do not hold the copyright of the text content, which comes from OSCAR and therefore from Common Crawl.

The process used to build the corpus, and the corpus characteristics, are described in:

Cristina España-Bonet and Alberto Barrón-Cedeño. "Elote, Choclo and Mazorca: on the Varieties of Spanish." In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), Mexico City, Mexico, June 2024.

The corpus used to train the classifier and the sentence-level version of CEREAL are available at
https://zenodo.org/records/11390829

Files Description:

See the README.txt file

Files (45.5 GB)

Checksum                              Size
md5:3264ab4c54994c5006330066a445bb9b  1.2 MB
md5:5f89b58752f945d453820ccb5396378d  2.4 GB
md5:38cd20545d6febac486a883b0be7af32  87.8 MB
md5:3b676373c7dbf082fee16e5c8e1743b4  1.4 GB
md5:e9c964b03c56040170d6156a52f36e70  3.3 GB
md5:7fa369be1e4782ebfaae2c0b084feeca  988.3 MB
md5:40d388adff8ccea94beebab10e4b2c34  17.1 GB
md5:8c1dfc783d7570379850d705298b7986  5.5 GB
md5:d705181c1a980b3e547a6251dd42264c  3.2 GB
md5:f8b4f277ff06736243ee20fa0f51c431  743.1 MB
md5:ef68a92eec82184956dfc92c682bcbe3  69.5 MB
md5:24dba5f5a71c0f28e767c2c588732cc1  184.9 MB
md5:ea823ecd515a41bb04b91b1aec548fca  101.4 MB
md5:a67ca770299777eeda7b3de3d79937d0  143.8 MB
md5:52e72a72e236128aa410dc1a12b87f4b  6.8 GB
md5:396dd0d794c0f37dfd5938c837693396  702.0 kB
md5:2bbcb95da69e73a4ce9ad7259e226a92  46.3 MB
md5:6639bdf4ed243290b186462b620114e4  57.3 MB
md5:3a6b5696e4ba6c3a6ba7712eea253af3  2.5 GB
md5:360e57cf2c3ed98e2ddbe7b72298f0fe  37.0 MB
md5:6d89c10999378edea1167245b4f92c71  37.5 MB
md5:1c9654773b03a828b2d017c506387963  432.9 MB
md5:2cb27b2791548e647e46566c67761e66  139.9 kB
md5:5438411f5251831c1ca200689ee1b987  11.6 MB
md5:840971a806c1d418cef70c20476592db  65.9 MB
md5:c0af8f2dd061db547570c9681e622484  33.5 MB
md5:762e37cbfaf62e584f1b23a3e25f1c78  36.5 MB
md5:82503a794997384bf03c610d1c2dba58  166.5 MB
md5:cd6db126c081d2f7f4b28353e51ecd95  112.2 MB
md5:094d6765babc4449624f22ee231950cf  208.6 kB
md5:bbf32add3b66161c690812b5387e76d9  816.6 kB
md5:a6de56343e00cc3c6770f1a7fc46e55c  7.6 kB
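
Each file in the listing carries an MD5 checksum, and several files are multiple gigabytes, so it can be worth checking a download before using it. The following is a minimal sketch of such a check, not part of the CEREAL distribution: the helper names `md5_of_file` and `matches_listing` are hypothetical, and the file path passed to them is whatever local name the downloaded file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte files are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a local file against a checksum copied from the listing.

    `listed_checksum` may include the "md5:" prefix shown in the listing
    (e.g. "md5:3264ab4c54994c5006330066a445bb9b") or be the bare digest.
    """
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listing(local_path, "md5:3264ab4c54994c5006330066a445bb9b")` returns `True` only when the local file is byte-identical to the published one; a `False` result usually means an interrupted or corrupted download.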

Additional details

Additional titles

Subtitle (English)
Document-level corpus

Related works

Is described by
Other: https://cereal-es.github.io/CEREAL/ (URL)
Is supplemented by
Dataset: 10.5281/zenodo.11390829 (DOI)
Model: https://huggingface.co/cristinae/cereal (URL)

Software

Repository URL
https://github.com/cristinae/docTransformer
Programming language
Python