Dataset Open Access
De Gibert Bonet, Ona;
Armengol-Estapé, Jordi;
Carrino, Casimiro Pio;
Melero, Maite;
Villegas, Marta
If you use this resource in your work, please cite our latest paper:
@inproceedings{armengol-estape-etal-2021-multilingual,
title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
author = "Armengol-Estap{\'e}, Jordi and
Carrino, Casimiro Pio and
Rodriguez-Penagos, Carlos and
de Gibert Bonet, Ona and
Armentano-Oller, Carme and
Gonzalez-Agirre, Aitor and
Melero, Maite and
Villegas, Marta",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.437",
doi = "10.18653/v1/2021.findings-acl.437",
pages = "4933--4946",
}
The Catalan General Crawling Corpus is a web corpus of Catalan containing roughly 435 million tokens. It was obtained by crawling the 500 most popular .cat and .ad domains during July 2020. It consists of 434,817,705 tokens, 19,451,691 sentences and 1,016,114 documents. Documents are separated by single newlines. It is a subcorpus of the Catalan Textual Corpus.
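As a rough illustration of how the document boundaries could be consumed, the sketch below groups lines into documents. It rests on an assumption the description does not confirm: that the extracted text holds one or more lines per document, with a blank line marking each boundary. The function name `iter_documents` is hypothetical, and the splitting rule should be adjusted if the file turns out to store one document per line.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of its non-empty lines.

    Assumption (not confirmed by the dataset description): a blank
    line marks the boundary between two consecutive documents.
    """
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            # Blank line: close the document collected so far.
            yield current
            current = []
    if current:
        # The last document may not be followed by a blank line.
        yield current
```

Because the function accepts any iterable of strings, it can be fed an open text file handle (opened with `encoding="utf-8"`) and will stream the corpus without loading it fully into memory.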
We license the actual packaging of this data under an Attribution 4.0 International License.
Notice and take down policy
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Copyright (c) 2021 Text Mining Unit at BSC
Name | Size |
---|---|
catalan_general_crawling.zip (md5:d69672a38039355d0b7c4083dd986adf) | 875.1 MB |
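The MD5 checksum listed with the archive can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `archive_is_intact` are illustrative, and the local path passed to them is whatever location the archive was saved to.

```python
import hashlib

# Checksum as published in the file listing for catalan_general_crawling.zip.
EXPECTED_MD5 = "d69672a38039355d0b7c4083dd986adf"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that an archive of several hundred
    megabytes does not have to fit in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare the computed digest with the published one."""
    return md5_of_file(path) == expected.lower()
```

For example, `archive_is_intact("catalan_general_crawling.zip")` should return `True` for an uncorrupted copy saved under that name in the working directory.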
Metric | All versions | This version |
---|---|---|
Views | 287 | 61 |
Downloads | 68 | 53 |
Data volume | 59.5 GB | 46.4 GB |
Unique views | 263 | 56 |
Unique downloads | 57 | 43 |