Published October 27, 2022 | Version 1.0.0
Dataset Open

CSIC Spanish Corpus

  • 1. Barcelona Supercomputing Center

Description

The CSIC Spanish corpus is a 146-million-token corpus of Spanish scientific magazines from the revistas.csic.es/ repository. The corpus has been preprocessed and deduplicated using the Corpus-Cleaner pipeline.

It consists of 146,795,650 tokens, 4,395,368 sentences and 30,929 documents. Documents are separated by single newlines.
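Since each document occupies one line, the corpus can be streamed without loading the whole file into memory. The following is a minimal sketch, not the authors' tooling: the function names are hypothetical, UTF-8 encoding is an assumption, and whitespace splitting is only a rough stand-in for whatever tokenizer produced the reported token count, so the numbers it yields may differ from those stated above.

```python
from typing import Iterable, Iterator, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one document per non-empty line (one document per line is the stated format)."""
    for line in lines:
        doc = line.rstrip("\n")
        if doc.strip():  # skip blank lines defensively
            yield doc


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (document count, whitespace-token count).

    Whitespace tokenization is an assumption made for illustration only.
    """
    n_docs = 0
    n_tokens = 0
    for doc in iter_documents(lines):
        n_docs += 1
        n_tokens += len(doc.split())
    return n_docs, n_tokens


def stats_from_file(path: str, encoding: str = "utf-8") -> Tuple[int, int]:
    """Stream a corpus file line by line; the encoding is assumed, not documented."""
    with open(path, encoding=encoding) as fh:
        return corpus_stats(fh)
```

With the file downloaded locally, `stats_from_file("csic_es.txt")` would return the two counts as a tuple.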

We license the actual packaging of these data under an Attribution 4.0 International License.

Copyright by Secretaría de Estado de Digitalización e Inteligencia Artificial (SEDIA) (2022)

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

csic_es.txt

Files (929.1 MB)

Checksum Size
md5:a7103b0a8c84e5036e4b31759bd7932c 929.1 MB
md5:917cfa3fa0baef7e9595ea33b87059f8 3.5 kB
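The MD5 checksums listed above can be used to confirm that a download completed intact. The sketch below uses only Python's standard `hashlib` module; the helper names are hypothetical, and it assumes that the 929.1 MB entry corresponds to csic_es.txt, which the listing does not state explicitly.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or plain '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("csic_es.txt", "md5:a7103b0a8c84e5036e4b31759bd7932c")` should return `True` for an uncorrupted download, provided that checksum does belong to that file.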