Spanish Biomedical Crawled Corpus

Carrino, Casimiro Pio; Silveira-Ocampo, Joaquín; Gonzalez-Agirre, Aitor; Gutiérrez-Fandiño, Asier; Krallinger, Martin; Villegas, Marta

doi:10.5281/zenodo.5513237

Published April 12, 2022 | Version 0.3

Dataset Open

1. Barcelona Supercomputing Center (BSC)

This is the largest Spanish biomedical and health corpus to date, gathered by a massive Spanish health-domain crawler that downloaded and preprocessed more than 3,000 URLs. All the collected data have been processed to produce CoWeSe (Corpus Web Salud Español), a large-scale, high-quality corpus intended for biomedical and health NLP in Spanish.

Enlarged version with less restrictive document and sentence deduplication.

Citation

If you use this resource in your work, please cite our paper:

@misc{carrino2021spanish,
    title={Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models},
    author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Ona de Gibert Bonet and Asier Gutiérrez-Fandiño and Aitor Gonzalez-Agirre and Martin Krallinger and Marta Villegas},
    year={2021},
    eprint={2109.07765},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan-TL).

Files

Files (5.7 GB)

Name	Size	MD5
CoWeSe.txt	5.7 GB	4785967f41558f5e47ceb4aa3ab819e5
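
Because the file is large, it may be worth checking the download against the published MD5 before use. The following is a minimal sketch, not part of the original record: the helper name md5_of_file, the chunk size, and the local path "CoWeSe.txt" are assumptions made for illustration. Only the checksum value comes from the file listing.

```python
import hashlib
import os

# MD5 checksum published in the file listing for CoWeSe.txt.
EXPECTED_MD5 = "4785967f41558f5e47ceb4aa3ab819e5"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks.

    Hypothetical helper: chunked reading keeps memory use flat even for a
    multi-gigabyte file.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    local_path = "CoWeSe.txt"  # assumed download location
    if os.path.exists(local_path):
        actual = md5_of_file(local_path)
        if actual == EXPECTED_MD5:
            print("checksum OK")
        else:
            print(f"checksum mismatch: {actual}")
    else:
        print(f"{local_path} not found; download it first")
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.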

	All versions	This version
Views	4,938	2,302
Downloads	7,184	1,420
Data volume	43.7 TB	14.4 TB
