TUBERCLE, a Localised Document-Level Catalan Corpus

España-Bonet, Cristina

doi:10.5281/zenodo.17996190

Published December 20, 2025 | Version v1.0

Dataset Open

España-Bonet, Cristina (Researcher)¹

1. German Research Centre for Artificial Intelligence

Content:

TUBERCLE is a Catalan document-level corpus extracted from Colossal OSCAR. Documents are classified by country of origin, which is extracted from the URL of each document. The corpus covers five countries where Catalan is spoken. It was created by mirroring CEREAL (see the project website), a document-level corpus for the varieties of Spanish. Following OSCAR and CEREAL, we provide our annotations under a CC0 license, but we do not hold the copyright of the text content, which comes from Common Crawl.

The process to build the corpus is the same as that for CEREAL and can be found in:

Cristina España-Bonet and Alberto Barrón-Cedeño. "Elote, Choclo and Mazorca: on the Varieties of Spanish." In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), Mexico City, Mexico, June 2024.
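As a rough illustration of labelling a document by the country found in its URL, the sketch below looks up the top-level domain of the host name. It is an assumption, not the authors' pipeline: the actual procedure is the one described in the paper cited above and may use other signals. The function name `country_label` and the mapping `TLD_TO_LABEL` are hypothetical, and the mapping only covers file suffixes from this record that coincide with well-known country-code domains (`ad`, `es`, `fr`, `it`). The `ca` suffix is left out on purpose, because this record does not say which URLs map to it.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a URL's top-level domain to a corpus label.
# Illustrative only; the real mapping is defined by the corpus authors.
TLD_TO_LABEL = {"ad": "ad", "es": "es", "fr": "fr", "it": "it"}


def country_label(url: str) -> str:
    """Return a country label for a URL, or "unk" when none applies."""
    # hostname is None for strings that do not parse as absolute URLs.
    host = urlparse(url).hostname or ""
    # The last dot-separated label of the host is the top-level domain.
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_TO_LABEL.get(tld, "unk")


if __name__ == "__main__":
    for example in ("https://www.example.ad/noticies", "https://example.org/"):
        print(example, "->", country_label(example))
```

Anything that does not match the mapping falls back to `unk`, which corresponds to the `tubercle.unk.bz2` file name in this record.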

Files Description:

See the README.txt file

Files (24.7 GB)

Name	MD5	Size
README.txt	eefd1b08eb8b8377fb2f9153a406613c	4.3 kB
tubercle.ad.bz2	f2a87ef825a26e65f1303de7ee847c90	111.7 MB
tubercle.all.bz2	9a4a7dbd42784e7d19d4c00cd67cbc9e	12.2 GB
tubercle.ca.bz2	13614f4563c2272656b014669d6758f9	4.5 GB
tubercle.es.bz2	51768594fbeba8b6e11abf7642127b68	817.3 MB
tubercle.fr.bz2	a43fde893c4fa9ee1c263647201f0bcb	12.2 MB
tubercle.it.bz2	0edb9a85e2f8ce85fc5261caac4bedd1	9.7 MB
tubercle.unk.bz2	c09a701d72df7fa79bd07f867c9970f4	7.0 GB
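Because several of these archives are multiple gigabytes, it may be worth checking a download against the MD5 value listed for it before decompressing. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `verify` are hypothetical, and the local path is assumed to be wherever the file was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # Assumes tubercle.it.bz2 was downloaded to the current directory.
    ok = verify("tubercle.it.bz2", "0edb9a85e2f8ce85fc5261caac4bedd1")
    print("checksum OK" if ok else "checksum mismatch")
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the file again is the simplest fix.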

Additional details

Is described by: Other: https://cereal-es.github.io/CEREAL/ (URL)

	All versions	This version
Views	306	306
Downloads	104	104
Data volume	310.8 GB	310.8 GB
