Dataset Open Access
De Gibert Bonet, Ona;
Armengol-Estapé, Jordi;
Rodriguez-Penagos, Carlos;
Melero, Maite;
Villegas, Marta;
Carrino, Casimiro Pio;
Armentano-Oller, Carme;
Gonzalez-Agirre, Aitor;
Asensio, Alejandro
The Catalan Textual Corpus is a 1760-million-token web corpus of Catalan built from several sources: existing corpora such as DOGC, CaWac (non-deduplicated version), Oscar (unshuffled version), Open Subtitles, and Catalan Wikipedia; and three new crawls: the Catalan General Crawling, obtained by crawling the 500 most popular .cat and .ad domains; the Catalan Government Crawling, obtained by crawling the .gencat domain and its subdomains, which belong to the Catalan Government; and the ACN corpus, with 220k news items from March 2015 to October 2020, crawled from the Catalan News Agency.
It consists of 1,758,388,896 tokens, 73,172,152 sentences, and 12,556,365 documents. Documents are separated by single new lines. These document boundaries have been preserved wherever the license allowed it.
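As a rough illustration of how the document boundaries could be used, the following Python sketch groups lines into documents and counts documents, sentences, and tokens. It rests on assumptions not stated above: that "separated by single new lines" means one empty line between documents with one sentence per line, and that tokens are whitespace-delimited. The function names `iter_documents` and `corpus_stats` are hypothetical, and the counts it produces need not match the figures reported by the authors.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into documents.

    Assumption: an empty line marks a document boundary and every
    non-empty line is one sentence.
    """
    current: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # Empty line: close the document collected so far.
            yield current
            current = []
    if current:
        # Last document when the input does not end with an empty line.
        yield current


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (documents, sentences, tokens) using whitespace tokenization."""
    docs = sentences = tokens = 0
    for doc in iter_documents(lines):
        docs += 1
        sentences += len(doc)
        tokens += sum(len(sentence.split()) for sentence in doc)
    return docs, sentences, tokens
```

Because both functions accept any iterable of strings, an open text file can be passed directly and the corpus is streamed rather than loaded into memory. If each line of the file turns out to be a whole document instead, iterating over the lines directly is enough and the grouping step can be dropped.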
We license the actual packaging of these data under an Attribution-ShareAlike 4.0 International License.
Copyright (c) 2021 Text Mining Unit at BSC
If you use this resource in your work, please cite our latest paper:
@misc{armengolestape2021multilingual,
  title={Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan},
  author={Jordi Armengol{-}Estap{\'{e}} and Casimiro Pio Carrino and Carlos Rodriguez-Penagos and Ona de Gibert Bonet and Carme Armentano{-}Oller and Aitor Gonzalez{-}Agirre and Maite Melero and Marta Villegas},
  year={2021},
  eprint={2107.07903},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Name | MD5 | Size |
---|---|---|
catalan_textual_corpus.zip | 89b266ad780f40f8898874585648e7e2 | 3.9 GB |
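
The MD5 checksum listed for catalan_textual_corpus.zip can be used to check that a download is complete. The sketch below is one way to do this with Python's standard `hashlib` module. The helper name `md5_of_file` and the local path in the usage note are assumptions for illustration. The file is read in chunks so the 3.9 GB archive does not have to fit in memory.

```python
import hashlib

# Checksum published alongside the archive.
EXPECTED_MD5 = "89b266ad780f40f8898874585648e7e2"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Assuming the archive was saved in the current directory under its original name, `md5_of_file("catalan_textual_corpus.zip") == EXPECTED_MD5` should evaluate to `True` for an intact download.
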
Metric | All versions | This version |
---|---|---|
Views | 521 | 521 |
Downloads | 83 | 83 |
Data volume | 322.4 GB | 322.4 GB |
Unique views | 459 | 459 |
Unique downloads | 70 | 70 |