Biomedical Spanish CBOW Word Embeddings in Floret
Description
The embeddings have been trained on a biomedical Spanish corpus with floret, using the following hyperparameters:
- mode: floret
- model: cbow
- dim: 300
- mincount: 10
- minn: 5
- maxn: 6
- neg: 10
- hashcount: 2
- bucket: 50000
- thread: 128
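Assuming the standalone floret CLI (a fork of fastText) was used for training, an invocation matching these hyperparameters would look roughly like the sketch below. The input and output paths are hypothetical placeholders, not the actual file names used:

```shell
# Hypothetical floret CLI invocation; flag names follow fastText,
# plus floret's -mode and -hashCount. Paths are placeholders.
floret cbow \
  -input biomedical_corpus_es.txt \
  -output floret_embeddings_bio_es \
  -mode floret \
  -dim 300 \
  -minCount 10 \
  -minn 5 \
  -maxn 6 \
  -neg 10 \
  -hashCount 2 \
  -bucket 50000 \
  -thread 128
```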
The embeddings were trained on the concatenation of all corpora from the Spanish biomedical corpus, which includes Spanish data from various sources, for a total of 1.1B tokens across 2.5M documents.
| Source | No. tokens |
|---|---|
| Medical crawler | 903,558,136 |
| Clinical cases misc. | 102,855,267 |
| EHRs documents* | 95,267,204 |
| Scielo | 60,007,289 |
| BARR2 Background | 24,516,442 |
| Wikipedia (Life Sciences) | 13,890,501 |
| Patents | 13,463,387 |
| EMEA | 5,377,448 |
| Mespen (MedlinePlus) | 4,166,077 |
| PubMed | 1,858,966 |
More information about the corpus can be found at https://aclanthology.org/2022.bionlp-1.19/ and https://arxiv.org/abs/2109.07765.
The processing took place on an HPC node equipped with an AMD EPYC 7742 (@ 2.250GHz) processor with 128 threads.
How to use
First, initialize the spaCy vectors from the floret table (.floret file):

```shell
spacy init vectors es floret_embeddings_bio_es.floret floret_embeddings_bio_es --mode floret
```
Then load and use the vectors from Python:

```python
import spacy

# Load the pipeline created by `spacy init vectors`
floret_embeddings = spacy.load("floret_embeddings_bio_es")

# Get the lexemes (and their embeddings) for some words
diabetes = floret_embeddings.vocab["diabetes"]
insulina = floret_embeddings.vocab["insulina"]
radiografia = floret_embeddings.vocab["radiografia"]

# Compare some similarities
print(diabetes.similarity(insulina))
print(diabetes.similarity(radiografia))
# "diabetes" should be more similar to "insulina" than to "radiografia"
```
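The similarity reported by spaCy is the cosine similarity of the word vectors. As a minimal sketch of that computation, with toy 3-dimensional vectors standing in for the real 300-dimensional floret embeddings (the values below are made up for illustration):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: dot product of the vectors divided by
    the product of their norms, the measure behind .similarity()."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for the 300-d floret embeddings.
diabetes_vec = np.array([0.9, 0.1, 0.3])
insulina_vec = np.array([0.8, 0.2, 0.4])
radiografia_vec = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(diabetes_vec, insulina_vec) >
      cosine_similarity(diabetes_vec, radiografia_vec))  # True
```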
Intended Uses and Limitations
At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this card will be updated.
Authors
The Text Mining Unit from Barcelona Supercomputing Center.
Contact Information
For further information, send an email to plantl-gob-es@bsc.es.
Funding
This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
Copyright
Copyright (c) 2022 Secretaría de Estado de Digitalización e Inteligencia Artificial
Files
Total size: 2.6 GB

| Name | Size | MD5 checksum |
|---|---|---|
|  | 1.2 GB | fd4cc8ce4f46c592934b851ca214ca77 |
|  | 115.2 MB | 65de05423b54e1f10aa46991bd67d94a |
|  | 1.2 GB | f4f96f7ed627034098e1db1abc0e3df4 |