Published October 27, 2022 | Version 1.0.0
Dataset
Open
CSIC Spanish Corpus
- 1. Barcelona Supercomputing Center
Description
The CSIC Spanish corpus is a 146-million-token corpus of Spanish scientific magazines from the revistas.csic.es/ repository. The corpus has been preprocessed and deduplicated using the Corpus-Cleaner pipeline.
It consists of 146,795,650 tokens, 4,395,368 sentences and 30,929 documents. Documents are separated by single newlines.
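Since documents are separated by single newlines, the file can be streamed one document at a time rather than loaded whole. The following is a minimal sketch, not the authors' tooling: the function names `iter_documents` and `corpus_stats` are hypothetical, UTF-8 encoding is an assumption, and whitespace splitting is only a rough stand-in for whatever tokenizer produced the counts reported above, so the token total it returns may differ from the published figure.

```python
from typing import Iterator, Tuple


def iter_documents(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield one document per non-empty line of the corpus file.

    Assumes one document per line and UTF-8 text (both assumptions).
    """
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            text = line.strip()
            if text:  # skip blank lines, if there are any
                yield text


def corpus_stats(path: str) -> Tuple[int, int]:
    """Return (document_count, whitespace_token_count) for the file."""
    n_docs = 0
    n_tokens = 0
    for doc in iter_documents(path):
        n_docs += 1
        # Whitespace tokenization: an approximation, not the original tokenizer.
        n_tokens += len(doc.split())
    return n_docs, n_tokens
```

Calling `corpus_stats("csic_es.txt")` on the downloaded file would give a document count to compare against the figure stated above. Streaming keeps memory use flat despite the file's size.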
We license the actual packaging of these data under an Attribution 4.0 International License.
Copyright by Secretaría de Estado de Digitalización e Inteligencia Artificial (SEDIA) (2022)
Files
Total size: 929.1 MB
Name | Size | Checksum |
---|---|---|
csic_es.txt | 929.1 MB | md5:a7103b0a8c84e5036e4b31759bd7932c |
(file name not shown) | 3.5 kB | md5:917cfa3fa0baef7e9595ea33b87059f8 |
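The md5 values listed for the files can be used to check a download for corruption. Below is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `verify_download` are hypothetical, and pairing the checksum `a7103b0a8c84e5036e4b31759bd7932c` with `csic_es.txt` is an assumption based on the matching 929.1 MB size.

```python
import hashlib

# Assumed to be the checksum of csic_es.txt (the 929.1 MB file).
EXPECTED_MD5 = "a7103b0a8c84e5036e4b31759bd7932c"


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a ~1 GB file never sits fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.lower()
```

If `verify_download("csic_es.txt")` returns `False`, the download is probably incomplete or corrupted and should be retried. MD5 serves here only as an integrity check against the published value, not as a security guarantee.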