Published December 20, 2025 | Version v1.0
Dataset Open

TUBERCLE, a Localised Document-Level Catalan Corpus

  • 1. German Research Centre for Artificial Intelligence

Description

Content:

TUBERCLE is a Catalan document-level corpus extracted from Colossal OSCAR. Documents are classified by country of origin, which is derived from the URL of each document. The corpus covers 5 countries where Catalan is spoken. It was created mirroring CEREAL, a document-level corpus for the varieties of Spanish (see the project website, https://cereal-es.github.io/CEREAL/). Following OSCAR and CEREAL, we provide our annotations under a CC0 license, but we do not hold the copyright of the content text, which comes from Common Crawl.
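As a rough illustration of how a country label can be derived from a URL, the sketch below reads the top-level domain of the host name. It is only an assumption about the general idea: the function name `country_from_url` and the mapping `CCTLD_TO_COUNTRY` are hypothetical, the listed domains are examples rather than the corpus's actual five countries, and the real procedure is the one described in the CEREAL paper and the README.txt file.

```python
from urllib.parse import urlsplit

# Hypothetical, illustrative mapping from country-code top-level domain
# to a country label. This is NOT the list of countries used in TUBERCLE.
CCTLD_TO_COUNTRY = {
    "ad": "Andorra",
    "es": "Spain",
    "fr": "France",
    "it": "Italy",
}


def country_from_url(url):
    """Guess a country label from the URL's top-level domain, or return None."""
    host = urlsplit(url).hostname  # None when the URL has no scheme or host
    if not host:
        return None
    # Take the last dot-separated label of the host name as the TLD.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return CCTLD_TO_COUNTRY.get(tld)
```

Documents whose URL carries a generic domain (for example `.com`) get no label under this simplified rule.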

The process to build the corpus is the same as that for CEREAL and can be found in:

Cristina España-Bonet and Alberto Barrón-Cedeño. "Elote, Choclo and Mazorca: on the Varieties of Spanish." In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), Mexico City, Mexico, June 2024.

Files Description:

See the README.txt file

Files

README.txt

Files (24.7 GB)

Checksum                              Size
md5:eefd1b08eb8b8377fb2f9153a406613c  4.3 kB
md5:f2a87ef825a26e65f1303de7ee847c90  111.7 MB
md5:9a4a7dbd42784e7d19d4c00cd67cbc9e  12.2 GB
md5:13614f4563c2272656b014669d6758f9  4.5 GB
md5:51768594fbeba8b6e11abf7642127b68  817.3 MB
md5:a43fde893c4fa9ee1c263647201f0bcb  12.2 MB
md5:0edb9a85e2f8ce85fc5261caac4bedd1  9.7 MB
md5:c09a701d72df7fa79bd07f867c9970f4  7.0 GB
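
Since several of the files are multiple gigabytes, it can be worth checking a download against the MD5 checksums listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_listing` are hypothetical, and the file path is whatever name the file was saved under locally.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that multi-gigabyte files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:eefd1b08...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_listing(local_path, "md5:eefd1b08eb8b8377fb2f9153a406613c")` returns `True` only when the downloaded file is byte-identical to the published one.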

Additional details

Related works

Is described by
Other: https://cereal-es.github.io/CEREAL/ (URL)