CSIC Spanish Corpus

Petrea, Paul Andrei; De Gibert Bonet, Ona; Villegas, Marta

doi:10.5281/zenodo.7313126

Published October 27, 2022 | Version 1.0.0

Dataset | Open access

1. Barcelona Supercomputing Center

The CSIC Spanish corpus is a 146-million-token corpus of Spanish scientific magazines from the revistas.csic.es/ repository. The corpus has been preprocessed and deduplicated using the Corpus-Cleaner pipeline.

It consists of 146,795,650 tokens, 4,395,368 sentences and 30,929 documents. Documents are separated by single newlines.
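Since each document sits on its own line, the file can be streamed one document at a time without loading all 929.1 MB into memory. The following is a minimal sketch, not the authors' tooling: the function names `iter_documents` and `corpus_stats` are hypothetical, UTF-8 encoding is an assumption, and whitespace splitting is only a rough stand-in for whatever tokenizer produced the published counts, so the token total it reports may differ from the figure above.

```python
from typing import Dict, Iterator


def iter_documents(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield one document per non-empty line of the corpus file.

    Assumes the layout described in the record: documents separated
    by single newlines. The encoding is an assumption.
    """
    with open(path, encoding=encoding) as handle:
        for line in handle:
            text = line.strip()
            if text:  # skip blank lines, should any be present
                yield text


def corpus_stats(path: str) -> Dict[str, int]:
    """Count documents and whitespace-separated tokens in a single pass."""
    documents = 0
    tokens = 0
    for doc in iter_documents(path):
        documents += 1
        # Whitespace splitting is an approximation, not the corpus tokenizer.
        tokens += len(doc.split())
    return {"documents": documents, "whitespace_tokens": tokens}
```

Calling `corpus_stats("csic_es.txt")` on the downloaded file should report a document count close to the published one if the one-document-per-line assumption holds.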

We license the actual packaging of these data under a Creative Commons Attribution 4.0 International License.

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files (929.1 MB)

Name	MD5	Size
csic_es.txt	a7103b0a8c84e5036e4b31759bd7932c	929.1 MB
README.md	917cfa3fa0baef7e9595ea33b87059f8	3.5 kB
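The MD5 checksums listed for the two files can be used to confirm that a download completed without corruption. Here is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local paths are assumed to match the file names in the record.

```python
import hashlib

# Checksums as published in the record's file listing.
EXPECTED_MD5 = {
    "csic_es.txt": "a7103b0a8c84e5036e4b31759bd7932c",
    "README.md": "917cfa3fa0baef7e9595ea33b87059f8",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Compare the file's MD5 against an expected hex digest."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `verify_download("csic_es.txt", EXPECTED_MD5["csic_es.txt"])` would be expected to return `True` for an intact copy. MD5 is adequate for detecting accidental corruption but is not a safeguard against deliberate tampering.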

	All versions	This version
Views	469	138
Downloads	492	375
Data volume	453.4 GB	344.7 GB

Resource type

Dataset

Publisher

Zenodo

Languages

Spanish

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: November 11, 2022
Modified: November 24, 2022
