Published October 6, 2022
| Version 1.0.0
Dataset
Open
BasCrawl
Authors/Creators
- Barcelona Supercomputing Center
Description
BasCrawl is a 186-million-token web corpus of Basque obtained by crawling more than 12,000 domains. The list of crawled domains is included. The corpus has been preprocessed and deduplicated as described in http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405.
It consists of 186,832,691 tokens, 12,303,132 sentences, and 736,180 documents. Documents are separated by single new lines.
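As a rough illustration of how the layout described above might be read, the following Python sketch splits a stream of lines into documents and counts documents, sentences, and tokens. It rests on assumptions not stated in this record: that each sentence sits on its own line, that "single new lines" means one empty line between documents, and that tokens can be approximated by whitespace splitting. The function names `iter_documents` and `corpus_stats` are hypothetical, and the counts produced this way may not match the published figures if the original tokenization differs.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of its non-empty lines.

    Assumption: an empty (or whitespace-only) line marks a document boundary.
    """
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            # Boundary reached: emit the document collected so far.
            yield current
            current = []
    if current:
        # Emit the last document if the input does not end with an empty line.
        yield current


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (documents, sentences, tokens).

    Assumption: one sentence per line, tokens approximated by whitespace splitting.
    """
    docs = sents = tokens = 0
    for doc in iter_documents(lines):
        docs += 1
        sents += len(doc)
        tokens += sum(len(sentence.split()) for sentence in doc)
    return docs, sents, tokens


# Tiny in-memory example standing in for the real corpus file.
sample = [
    "Kaixo mundua.\n",
    "Bigarren esaldia da.\n",
    "\n",
    "Beste dokumentu bat.\n",
]
print(corpus_stats(sample))
```

To run this on the downloaded corpus, an open file handle can be passed in place of `sample`, for example `open(path, encoding="utf-8")` with `path` pointing at the extracted text file. Because the function iterates line by line, the whole corpus does not need to fit in memory. If the file turns out to hold one document per line instead, the boundary test in `iter_documents` would need to change accordingly.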
We license the actual packaging of these data under a Creative Commons Attribution 4.0 International License.
Copyright (c) 2022 Secretaría de Estado de Digitalización e Inteligencia Artificial