TUBERCLE, a Localised Document-Level Catalan Corpus
Authors/Creators
Description
Content:
TUBERCLE is a Catalan document-level corpus extracted from Colossal OSCAR. Documents are classified by country of origin, which is derived from each document's URL. The corpus covers 5 countries where Catalan is spoken. It was created mirroring CEREAL (project website: https://cereal-es.github.io/CEREAL/), a document-level corpus for the varieties of Spanish. Following OSCAR and CEREAL, we release our annotations under a CC0 license, but we do not hold the copyright of the text content, which comes from Common Crawl.
The process used to build the corpus is the same as for CEREAL and is described in:
Cristina España-Bonet and Alberto Barrón-Cedeño. "Elote, Choclo and Mazorca: on the Varieties of Spanish." In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), Mexico City, Mexico, June 2024.
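As the description notes, the country label is derived from each document's URL. The snippet below is only a minimal sketch of that idea, assuming the signal is the country-code top-level domain of the host name. The function name `country_code_from_url` is hypothetical, and the actual rules used for TUBERCLE (including which domains map to which of the 5 countries) are those described in the paper above and in README.txt, not the ones shown here.

```python
from typing import Optional
from urllib.parse import urlparse


def country_code_from_url(url: str) -> Optional[str]:
    """Return the two-letter top-level domain of a URL, or None.

    Hypothetical helper for illustration only: it assumes the country
    is signalled by a country-code TLD and ignores every other cue.
    """
    host = urlparse(url).hostname  # lower-cased host name, or None
    if not host:
        return None
    # Take the last dot-separated label of the host name.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    # Country-code TLDs are exactly two ASCII letters.
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        return tld
    return None
```

Under this assumption, a URL hosted on a generic domain such as `.com`, or one given without a scheme, yields `None` and would need separate handling.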
Files Description:
See the README.txt file
Files
Total size: 24.7 GB
| MD5 checksum | Size |
|---|---|
| eefd1b08eb8b8377fb2f9153a406613c | 4.3 kB |
| f2a87ef825a26e65f1303de7ee847c90 | 111.7 MB |
| 9a4a7dbd42784e7d19d4c00cd67cbc9e | 12.2 GB |
| 13614f4563c2272656b014669d6758f9 | 4.5 GB |
| 51768594fbeba8b6e11abf7642127b68 | 817.3 MB |
| a43fde893c4fa9ee1c263647201f0bcb | 12.2 MB |
| 0edb9a85e2f8ce85fc5261caac4bedd1 | 9.7 MB |
| c09a701d72df7fa79bd07f867c9970f4 | 7.0 GB |
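Since several of the files are multiple gigabytes, it can be worth checking a download against its published MD5 checksum. The following is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `matches_checksum` are illustrative, not part of the corpus distribution.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value, with or without 'md5:'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_checksum(local_path, "md5:c09a701d72df7fa79bd07f867c9970f4")`, where `local_path` is wherever the corresponding file was saved, returns `True` only if the download is intact.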
Additional details
Related works
- Is described by
  - Other: https://cereal-es.github.io/CEREAL/ (URL)