Dataset Open Access

Catalan Textual Corpus

De Gibert Bonet, Ona; Armengol-Estapé, Jordi; Rodriguez-Penagos, Carlos; Melero, Maite; Villegas, Marta; Carrino, Casimiro Pio; Armentano-Oller, Carme; Gonzalez-Agirre, Aitor; Asensio, Alejandro

The Catalan Textual Corpus is a 1760-million-token web corpus of Catalan built from several sources. It combines existing corpora such as DOGC, CaWac (non-deduplicated version), Oscar (unshuffled version), Open Subtitles and Catalan Wikipedia with three new crawls: the Catalan General Crawling, obtained by crawling the 500 most popular .cat and .ad domains; the Catalan Government Crawling, obtained by crawling the .gencat domain and its subdomains, which belong to the Catalan Government; and the ACN corpus, with 220k news items from March 2015 to October 2020, crawled from the Catalan News Agency.

It consists of 1,758,388,896 tokens, 73,172,152 sentences and 12,556,365 documents. Documents are separated by single newlines. These boundaries have been preserved wherever the license allowed it.
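As a rough illustration of the layout described above, the following Python sketch streams a plain-text file and counts documents and tokens. It assumes that "separated by single new lines" means one document per line, that the text is UTF-8, and that splitting on whitespace is an acceptable stand-in for tokenization; the tokenizer behind the published counts is not stated here, so the numbers it produces need not match. The function names are hypothetical and not part of the corpus distribution.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_documents(stream: TextIO) -> Iterator[str]:
    """Yield one document per non-empty line (assumed layout)."""
    for line in stream:
        doc = line.strip()
        if doc:
            yield doc


def corpus_stats(stream: TextIO) -> Tuple[int, int]:
    """Return (document count, whitespace-token count) for a text stream."""
    n_docs = 0
    n_tokens = 0
    for doc in iter_documents(stream):
        n_docs += 1
        # Whitespace splitting is an assumption, not the corpus tokenizer.
        n_tokens += len(doc.split())
    return n_docs, n_tokens


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real corpus file.
    sample = io.StringIO("Bon dia a tothom.\nAixò és un segon document.\n")
    print(corpus_stats(sample))  # (2, 9)
```

For the real data, the same functions could be applied to a file opened with `open(path, encoding="utf-8")`, where `path` points to wherever the downloaded archive was extracted. Reading line by line keeps memory use flat, which matters for a corpus of this size.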

We license the actual packaging of these data under a Creative Commons Attribution-ShareAlike 4.0 International License.

Copyright (c) 2021 Text Mining Unit at BSC

If you use this resource in your work, please cite our latest paper:

@misc{armengolestape2021multilingual,
  title={Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan},
  author={Jordi Armengol{-}Estap{\'{e}} and Casimiro Pio Carrino and Carlos Rodriguez-Penagos and Ona de Gibert Bonet and Carme Armentano{-}Oller and Aitor Gonzalez{-}Agirre and Maite Melero and Marta Villegas},
  year={2021},
  eprint={2107.07903},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (AINA), MT4ALL and Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
Files (3.9 GB)

