Catalan Government Crawling

De Gibert Bonet, Ona; Armengol-Estapé, Jordi; Carrino, Casimiro Pio; Melero, Maite; Villegas, Marta

doi:10.5281/zenodo.4636486

Published March 25, 2021 | Version 1.0.0

Dataset Open

1. Barcelona Supercomputing Center

The Catalan Government Crawling Corpus is a 39-million-token web corpus of Catalan. It was obtained by crawling the .gencat domain and its subdomains, which belong to the Catalan Government, during September and October 2020. It consists of 39,117,909 tokens, 1,565,433 sentences, and 71,043 documents. Documents are separated by single new lines. The corpus is a subcorpus of the Catalan Textual Corpus.
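As a rough illustration of the document layout described above, the following Python sketch groups lines into documents. It assumes that "separated by single new lines" means one empty line between consecutive documents; this is an interpretation, not something the record confirms, and the function name `iter_documents` is hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of its non-empty lines.

    Assumption: documents are blocks of text lines separated by an
    empty line. Adjust the splitting rule if the file is laid out
    differently.
    """
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            # Non-empty line: it belongs to the document being built.
            current.append(line)
        elif current:
            # Empty line: the current document is complete.
            yield current
            current = []
    if current:
        # Emit the last document if the input does not end with an empty line.
        yield current


# Small in-memory sample standing in for the real corpus file.
sample = ["Primera frase.\n", "Segona frase.\n", "\n", "Un altre document.\n"]
docs = list(iter_documents(sample))
```

To run this over the downloaded corpus, pass an open file handle (for example one obtained with `open(path, encoding="utf-8")`, where `path` is wherever the file was saved) instead of the sample list. Iterating lazily avoids loading all 39 million tokens into memory at once.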

We license the actual packaging of this data under a CC0 1.0 Universal License.

Notes

Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (AINA), MT4ALL and Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
