Biomedical Spanish CBOW Word Embeddings in Floret

Llop Palao, Joan

doi:10.5281/zenodo.7314041

Published November 11, 2022 | Version 1.0

Dataset Open

Biomedical Spanish CBOW Word Embeddings in Floret

Llop Palao, Joan¹

1. Barcelona Supercomputing Center

Biomedical Spanish CBOW Word Embeddings in Floret

The embeddings have been trained with a biomedical Spanish corpus using floret with the following hyperparameters:

mode: str = "floret",
model: str = "cbow",
dim: int = 300,
mincount: int = 10,
minn: int = 5,
maxn: int = 6,
neg: int = 10,
hashcount: int = 2,
bucket: int = 50000,
thread: int = 128,

The embeddings were trained on the concatenation of all corpora from the Spanish biomedical corpus that includes Spanish data from various sources for a total of 1.1B tokens across 2,5M documents.

Source	No. tokens
Medical crawler	903,558,136
Clinical cases misc.	102,855,267
EHRs documents*	95,267,204
Scielo	60,007,289
BARR2 Background	24,516,442
Wikipedia (Life Sciences)	13,890,501
Patents	13,463,387
EMEA	5,377,448
Mespen (MedlinePlus)	4,166,077
PubMed	1,858,966

More information about the corpus can be found here https://aclanthology.org/2022.bionlp-1.19/ and here https://arxiv.org/abs/2109.07765

The processing took place on an HPC node equipped with an AMD EPYC 7742 (@ 2.250GHz) processor with 128 threads.

How to use

First initialize the spacy vectors from the floret table (.floret file):

spacy init vectors es floret_embeddings_bio_es.floret floret_embeddings_bio_es --mode floret

import spacy

# Load the floret vectors
floret_embeddings = spacy.load("floret_embeddings_bio_es")

# Get the embeddings of some words
diabetes = floret_embeddings.vocab["diabetes"]
insulina = floret_embeddings.vocab["insulina"]
radiografia = floret_embeddings.vocab["radiografia"]

# Get some similarities
print(diabetes.similarity(insulina))
print(diabetes.similarity(radiografia))
# diabetes should be more similar to insuline than radiografia

Intended Uses and Limitations

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this card will be updated.

Authors

The Text Mining Unit from Barcelona Supercomputing Center.

Contact Information

For further information, send an email to plantl-gob-es@bsc.es

Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

Copyright

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan-TL).

Files

Files (2.6 GB)

Name	Size	Download all
floret_embeddings_bio_es.bin md5:fd4cc8ce4f46c592934b851ca214ca77	1.2 GB	Download
floret_embeddings_bio_es.floret md5:65de05423b54e1f10aa46991bd67d94a	115.2 MB	Download
floret_embeddings_bio_es.vec md5:f4f96f7ed627034098e1db1abc0e3df4	1.2 GB	Download

	All versions	This version
Views	311	311
Downloads	257	257
Data volume	212.0 GB	212.0 GB

Biomedical Spanish CBOW Word Embeddings in Floret

Authors/Creators

Description

Notes

Files

Files (2.6 GB)