Catalan Textual Corpus

De Gibert Bonet, Ona; Armengol-Estapé, Jordi; Rodriguez-Penagos, Carlos; Melero, Maite; Villegas, Marta; Carrino, Casimiro Pio; Armentano-Oller, Carme; Gonzalez-Agirre, Aitor; Asensio, Alejandro

doi:10.5281/zenodo.4519349

Published February 9, 2021 | Version 1.0.0

Dataset Open

1. Barcelona Supercomputing Center

The Catalan Textual Corpus is a 1,760-million-token web corpus of Catalan built from several sources: existing corpora such as DOGC, CaWac (non-deduplicated version), Oscar (unshuffled version), Open Subtitles and Catalan Wikipedia; and three new crawls: the Catalan General Crawling, obtained by crawling the 500 most popular .cat and .ad domains; the Catalan Government Crawling, obtained by crawling the .gencat domain and its subdomains, which belong to the Catalan Government; and the ACN corpus, with 220k news items from March 2015 to October 2020, crawled from the Catalan News Agency.

It consists of 1,758,388,896 tokens, 73,172,152 sentences and 12,556,365 documents. Documents are separated by single new lines. These document boundaries have been preserved wherever the license allowed it.
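As a rough illustration of how the document boundaries could be used, the following Python sketch iterates over documents and counts documents, sentences and tokens. It rests on assumptions that this description does not confirm: that each sentence sits on its own line, that one empty line marks the boundary between two documents, and that tokens can be approximated by whitespace splitting. The function names `iter_documents` and `corpus_stats` are hypothetical, and counts obtained this way will not necessarily match the figures reported above.

```python
import io
from typing import Iterable, Iterator, List, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of its non-empty lines.

    Assumption (not confirmed by the corpus description): an empty line
    marks the boundary between two consecutive documents.
    """
    current: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # Empty line after some content: the current document ends here.
            yield current
            current = []
    if current:
        # Last document when the input does not end with an empty line.
        yield current


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (documents, sentences, tokens) for the given lines.

    Assumptions: one sentence per line, whitespace-delimited tokens.
    """
    docs = sentences = tokens = 0
    for doc in iter_documents(lines):
        docs += 1
        sentences += len(doc)
        tokens += sum(len(sentence.split()) for sentence in doc)
    return docs, sentences, tokens


# Tiny in-memory sample standing in for a corpus file.
sample = "Bon dia a tothom.\nAvui fa sol.\n\nLa capital és Barcelona.\n"
print(corpus_stats(io.StringIO(sample)))
```

For a real file, the same functions can be fed an open file handle (for example `open(path, encoding="utf-8")`, where `path` is whatever location the downloaded corpus was saved to), which streams the text line by line instead of loading it into memory.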

We license the actual packaging of these data under a Creative Commons Attribution-ShareAlike 4.0 International License.

If you use this resource in your work, please cite our latest paper:

@misc{armengolestape2021multilingual,
  title={Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan},
  author={Jordi Armengol{-}Estap{\'{e}} and Casimiro Pio Carrino and Carlos Rodriguez-Penagos and Ona de Gibert Bonet and Carme Armentano{-}Oller and Aitor Gonzalez{-}Agirre and Maite Melero and Marta Villegas},
  year={2021},
  eprint={2107.07903},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Notes

Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (AINA), MT4ALL and Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files