BasCrawl
Authors/Creators
- 1. Barcelona Supercomputing Center
Description
BasCrawl is a 186-million-token web corpus of Basque obtained by crawling over 12,000 domains. The list of crawled domains is included. The corpus has been preprocessed and deduplicated as described in http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405.
It consists of 186,832,691 tokens, 12,303,132 sentences and 736,180 documents. Documents are separated by single new lines.
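As a rough illustration of the layout described above, the following Python sketch streams documents from the corpus file without loading it into memory. It is not the authors' tooling. The function name `iter_documents` and its flag are hypothetical, and the exact meaning of "separated by single new lines" is an assumption: by default the sketch treats a blank line as the document boundary (with one sentence per line), and the flag switches to the alternative reading in which every non-empty line is one document.

```python
from typing import Iterable, Iterator, List


def iter_documents(
    lines: Iterable[str], blank_line_separated: bool = True
) -> Iterator[List[str]]:
    """Yield each document as a list of its non-empty lines.

    Assumption (not confirmed by the record): with
    blank_line_separated=True a blank line ends a document;
    with False, every non-empty line is its own document.
    """
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not blank_line_separated:
            # Alternative reading: one document per non-empty line.
            if line.strip():
                yield [line]
            continue
        if line.strip():
            current.append(line)
        elif current:
            # Blank line closes the document collected so far.
            yield current
            current = []
    if current:
        # Flush the last document if the input has no trailing blank line.
        yield current
```

To count documents, one could open the file with `open("BasCrawl.txt", encoding="utf-8")` (UTF-8 is assumed here, not stated in the record) and evaluate `sum(1 for _ in iter_documents(f))`. Whichever setting of the flag yields a count close to 736,180 is likely the correct reading of the format.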
We license the actual packaging of these data under a Creative Commons Attribution 4.0 International License.
Copyright (c) 2022 Secretaría de Estado de Digitalización e Inteligencia Artificial
Files
BasCrawl.txt
Total size: 1.4 GB
| Checksum | Size |
|---|---|
| md5:c0bfbd314a46f39ae6a372fb4f574e10 | 1.4 GB |
| md5:c02c84224a39ecb301224e16668c0d94 | 189.9 kB |
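The MD5 values listed for the files can be used to check a download for corruption. A minimal sketch in Python follows. The helper name `md5_of_file` is hypothetical, and pairing the 1.4 GB checksum with `BasCrawl.txt` is an assumption, since the listing does not show file names next to the checksums.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a 1.4 GB file never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksum listed for the 1.4 GB file (assumed to be BasCrawl.txt).
EXPECTED_MD5 = "c0bfbd314a46f39ae6a372fb4f574e10"
```

A download would then be considered intact if `md5_of_file("BasCrawl.txt") == EXPECTED_MD5` holds. MD5 is adequate for detecting transfer errors, although it is not a safeguard against deliberate tampering.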