Published March 25, 2021 | Version 1.0.1
Dataset Open

Catalan Government Crawling

Description

The Catalan Government Crawling Corpus is a 39-million-token web corpus of Catalan. It was obtained during September and October 2020 by crawling the .gencat domain and its subdomains, which belong to the Catalan Government. It consists of 39,117,909 tokens, 1,565,433 sentences and 71,043 documents. Documents are separated by single new lines. It is a subcorpus of the Catalan Textual Corpus.
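As a rough illustration of the layout described above, the following Python sketch iterates over the documents in the archive and counts them. It rests on several assumptions that the description does not confirm. The member name passed to corpus_stats is hypothetical. The text is assumed to be UTF-8. "Separated by single new lines" is read by default as one sentence per line with one empty line between documents, and the alternative reading (one document per line) is available through a flag. Tokens are counted by whitespace splitting, so the result need not match the published figure of 39,117,909.

```python
import io
import zipfile
from typing import Iterator, TextIO


def iter_documents(lines: TextIO, blank_line_separated: bool = True) -> Iterator[str]:
    """Yield one document at a time from an open text stream."""
    if not blank_line_separated:
        # Alternative reading: every non-empty line is a whole document.
        for line in lines:
            if line.strip():
                yield line.strip()
        return
    # Default reading (an assumption): one sentence per line,
    # with a single empty line between consecutive documents.
    buffer = []
    for line in lines:
        if line.strip():
            buffer.append(line.strip())
        elif buffer:
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        yield "\n".join(buffer)


def corpus_stats(zip_path: str, member: str) -> tuple[int, int]:
    """Return (documents, whitespace tokens) for one text member of the archive.

    `member` is the name of a text file inside the zip; the actual name is
    not given in the description, so the caller has to supply it.
    """
    docs = tokens = 0
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8")  # encoding assumed
            for doc in iter_documents(text):
                docs += 1
                tokens += len(doc.split())
    return docs, tokens
```

Calling zipfile.ZipFile("catalan_government_crawling.zip").namelist() first shows which member names the archive actually contains.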

We license the actual packaging of this data under a CC0 1.0 Universal License.

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate it.

 

If you use this resource in your work, please cite our latest paper:

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

Notes

Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (AINA) and MT4ALL.

Files

catalan_government_crawling.zip (69.8 MB)
md5:255f3b1a340af23ba8f5bb3d0fb9d9da
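
The md5 value listed for the archive can be used to check that a download is complete. The sketch below is one way to do that with Python's standard hashlib module. The function names and the local file path are illustrative assumptions, and only the expected checksum is taken from this page.

```python
import hashlib

# Checksum as listed on this page for catalan_government_crawling.zip.
EXPECTED_MD5 = "255f3b1a340af23ba8f5bb3d0fb9d9da"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest against the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, is_intact("catalan_government_crawling.zip") should return True for an uncorrupted copy, assuming the file was saved under that name in the current directory.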