Published October 6, 2022 | Version 1.0.0
Dataset Open

BasCrawl

  • 1. Barcelona Supercomputing Center

Description

BasCrawl is a 186-million-token web corpus of Basque obtained by crawling over 12,000 domains. The list of crawled domains is included. The corpus has been preprocessed and deduplicated as described in http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405.

It consists of 186,832,691 tokens, 12,303,132 sentences and 736,180 documents. Documents are separated by single new lines.
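Since documents are separated by new lines, the corpus can be streamed without loading the full 1.4 GB into memory. The following is a minimal sketch, assuming the file is UTF-8 encoded and that each non-empty line holds one document. The function names `iter_documents` and `corpus_stats` are hypothetical, and the whitespace split is only a rough proxy, so its token count is not expected to match the figure reported above.

```python
from typing import Iterator, Tuple


def iter_documents(path: str) -> Iterator[str]:
    """Yield one document per non-empty line (assumed file layout)."""
    # Assumption: the corpus is UTF-8 text with one document per line.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            document = line.strip()
            if document:
                yield document


def corpus_stats(path: str) -> Tuple[int, int]:
    """Return (document count, whitespace-delimited token count)."""
    documents = 0
    tokens = 0
    for document in iter_documents(path):
        documents += 1
        # Whitespace splitting is a rough stand-in for the corpus tokenizer.
        tokens += len(document.split())
    return documents, tokens
```

Calling `corpus_stats("BasCrawl.txt")` on the downloaded file would then give a quick sanity check of the document count against the number stated in this description.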

We license the actual packaging of these data under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2022 Secretaría de Estado de Digitalización e Inteligencia Artificial

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

BasCrawl.txt

Files (1.4 GB)

Size        Checksum
1.4 GB      md5:c0bfbd314a46f39ae6a372fb4f574e10
189.9 kB    md5:c02c84224a39ecb301224e16668c0d94
3.7 kB      md5:583bbf274a72c74c65be103ac488c1ed
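The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and which file name goes with which checksum is an assumption the reader should confirm against the record's file list.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 with an expected value ("md5:<hex>" or bare hex)."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, assuming the 1.4 GB entry corresponds to `BasCrawl.txt`, `matches_checksum("BasCrawl.txt", "md5:c0bfbd314a46f39ae6a372fb4f574e10")` should return `True` for an intact download.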