CEREAL I, el Corpus del Español REAL
Description
Content:
CEREAL is a document-level corpus of Spanish texts extracted from OSCAR. Each document is classified according to its country of origin, and the corpus covers 24 countries where Spanish is spoken. Following OSCAR, we provide our annotations under the CC0 license, but we do not hold the copyright of the content text, which comes from OSCAR and therefore from Common Crawl.
The construction process and the characteristics of the corpus are described in:
Cristina España-Bonet and Alberto Barrón-Cedeño. "Elote, Choclo and Mazorca: on the Varieties of Spanish." In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), Mexico City, Mexico, June 2024.
The corpus used to train the classifier and the sentence-level version of CEREAL are available at:
https://zenodo.org/records/11390829
Files Description:
See the README.txt file
Files
(45.5 GB)
MD5 checksum | Size |
---|---|
3264ab4c54994c5006330066a445bb9b | 1.2 MB |
5f89b58752f945d453820ccb5396378d | 2.4 GB |
38cd20545d6febac486a883b0be7af32 | 87.8 MB |
3b676373c7dbf082fee16e5c8e1743b4 | 1.4 GB |
e9c964b03c56040170d6156a52f36e70 | 3.3 GB |
7fa369be1e4782ebfaae2c0b084feeca | 988.3 MB |
40d388adff8ccea94beebab10e4b2c34 | 17.1 GB |
8c1dfc783d7570379850d705298b7986 | 5.5 GB |
d705181c1a980b3e547a6251dd42264c | 3.2 GB |
f8b4f277ff06736243ee20fa0f51c431 | 743.1 MB |
ef68a92eec82184956dfc92c682bcbe3 | 69.5 MB |
24dba5f5a71c0f28e767c2c588732cc1 | 184.9 MB |
ea823ecd515a41bb04b91b1aec548fca | 101.4 MB |
a67ca770299777eeda7b3de3d79937d0 | 143.8 MB |
52e72a72e236128aa410dc1a12b87f4b | 6.8 GB |
396dd0d794c0f37dfd5938c837693396 | 702.0 kB |
2bbcb95da69e73a4ce9ad7259e226a92 | 46.3 MB |
6639bdf4ed243290b186462b620114e4 | 57.3 MB |
3a6b5696e4ba6c3a6ba7712eea253af3 | 2.5 GB |
360e57cf2c3ed98e2ddbe7b72298f0fe | 37.0 MB |
6d89c10999378edea1167245b4f92c71 | 37.5 MB |
1c9654773b03a828b2d017c506387963 | 432.9 MB |
2cb27b2791548e647e46566c67761e66 | 139.9 kB |
5438411f5251831c1ca200689ee1b987 | 11.6 MB |
840971a806c1d418cef70c20476592db | 65.9 MB |
c0af8f2dd061db547570c9681e622484 | 33.5 MB |
762e37cbfaf62e584f1b23a3e25f1c78 | 36.5 MB |
82503a794997384bf03c610d1c2dba58 | 166.5 MB |
cd6db126c081d2f7f4b28353e51ecd95 | 112.2 MB |
094d6765babc4449624f22ee231950cf | 208.6 kB |
bbf32add3b66161c690812b5387e76d9 | 816.6 kB |
a6de56343e00cc3c6770f1a7fc46e55c | 7.6 kB |
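Since several of the files are multiple gigabytes, it can be worth checking a download against the MD5 value listed for it. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are illustrative assumptions and are not part of the CEREAL distribution.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks
    so that multi-gigabyte files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with a listed value.

    `expected` may be given with or without the "md5:" prefix,
    in upper or lower case.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum(local_path, "md5:3264ab4c54994c5006330066a445bb9b")` returns `True` only if the file at `local_path` (a placeholder for wherever the download was saved) has exactly that digest. A mismatch usually points to an incomplete or corrupted download.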
Additional details
Additional titles
- Subtitle (English)
- Document-level corpus
Related works
- Is described by
- Other: https://cereal-es.github.io/CEREAL/ (URL)
- Is supplemented by
- Dataset: 10.5281/zenodo.11390829 (DOI)
- Model: https://huggingface.co/cristinae/cereal (URL)
Software
- Repository URL
- https://github.com/cristinae/docTransformer
- Programming language
- Python