There is a newer version of the record available.

Published March 25, 2021 | Version 1.0.0
Dataset Open

Catalan General Crawling

Description

If you use this resource in your work, please cite our latest paper:

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

The Catalan General Crawling Corpus is a 435-million-token web corpus of Catalan built from the web. It has been obtained by crawling the 500 most popular .cat and .ad domains during July 2020. It consists of 434.817.705 tokens, 19.451.691 sentences and 1.016.114 documents. Documents are separated by single new lines. It is a subcorpus of the Catalan Textual Corpus.

We license the actual packaging of this data under a Attribution 4.0 International License.

Copyright (c) 2021 Text Mining Unit at BSC

Notes

Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (AINA), MT4ALL and Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

catalan_general_crawling.zip

Files (874.1 MB)

Name Size Download all
md5:c977bd2e055d6cc0efc65af49583dd73
874.1 MB Preview Download