
Spanish COVID-19 Twitter Embeddings in FastText

Miranda-Escalada, Antonio; Aguero, Marvin; Krallinger, Martin

Intro

300-dimensional FastText embeddings generated from 140 million Spanish-language tweets. All tweets are COVID-19-related, meaning that each includes one or more keywords related to COVID-19 and the lockdown.

 

Please cite:

Miranda-Escalada, A., Farré-Maduell, E., Lima-López, S., Gascó, L., Briva-Iglesias, V., Agüero-Torales, M., & Krallinger, M. (2021, June). The ProfNER shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora. In Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task (pp. 13-20).

@inproceedings{miranda2021profner,
  title={The profner shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora},
  author={Miranda-Escalada, Antonio and Farr{\'e}-Maduell, Eul{\`a}lia and Lima-L{\'o}pez, Salvador and Gasc{\'o}, Luis and Briva-Iglesias, Vicent and Ag{\"u}ero-Torales, Marvin and Krallinger, Martin},
  booktitle={Proceedings of the Sixth Social Media Mining for Health (\# SMM4H) Workshop and Shared Task},
  pages={13--20},
  year={2021}
}

 

Description

  • Cased and uncased versions are available for both the CBOW and skip-gram models.
  • The FastText parameter configuration was:
    • dim 300 
    • minCount 5
    • minn 3
    • maxn 6
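
As an illustration of what these settings mean, the sketch below collects the four listed values and shows which character n-grams a word contributes when minn is 3 and maxn is 6. The helper names (`FASTTEXT_PARAMS`, `char_ngrams`, `train`) and the corpus path argument are hypothetical, and the assumption that every other hyperparameter was left at its library default is not confirmed by this record. The `train` function relies on the `fasttext` Python package being installed.

```python
# Hyperparameters listed in the description above.
# Assumption: all other settings were left at the library defaults.
FASTTEXT_PARAMS = {"dim": 300, "minCount": 5, "minn": 3, "maxn": 6}


def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of a word, using the < and >
    boundary markers that FastText adds around each word."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def train(corpus_path, model="skipgram"):
    """Train embeddings on a tokenized, one-tweet-per-line text file.

    `model` is "skipgram" or "cbow". The import is done lazily so the
    helpers above can be used without the fasttext package.
    """
    import fasttext

    return fasttext.train_unsupervised(corpus_path, model=model, **FASTTEXT_PARAMS)


if __name__ == "__main__":
    # Subword units for one word: n-grams of length 3 to 6.
    print(char_ngrams("covid"))
```

Because vectors are built from these subword units, a word that never appeared in the tweets can still receive a vector from the n-grams it shares with known words.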

 

Preprocessing

"RT: @" patterns are removed. URLs and user mentions are replaced with the placeholders URL and @MENTION, respectively. The text is then tokenized with the NLTK TweetTokenizer.
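
To query the embeddings with new text, it may help to apply roughly the same normalization first. The following is a minimal sketch of the steps described above, not the authors' script: the regular expressions are assumptions (the exact patterns are not published in this record), and the function names `normalize` and `tokenize` are hypothetical. Tokenization uses `TweetTokenizer` from NLTK with its default options, which is also an assumption.

```python
import re

# Assumed patterns; the original expressions are not given in the record.
RT_PREFIX = re.compile(r"^RT:? @\w+:?\s*")  # leading retweet marker
URL = re.compile(r"https?://\S+")           # http and https links
MENTION = re.compile(r"@\w+")               # user mentions


def normalize(tweet):
    """Drop the retweet prefix, then replace URLs and mentions
    with the placeholders URL and @MENTION."""
    text = RT_PREFIX.sub("", tweet)
    text = URL.sub("URL", text)
    text = MENTION.sub("@MENTION", text)
    return text


def tokenize(text):
    """Tokenize normalized text with the NLTK TweetTokenizer.

    Imported lazily so that normalize() works without NLTK installed.
    """
    from nltk.tokenize import TweetTokenizer

    return TweetTokenizer().tokenize(text)


if __name__ == "__main__":
    raw = "RT @user1: Hola @user2 mira https://t.co/abc"
    print(normalize(raw))
```

For the uncased models, lowercasing the text before lookup is presumably also needed, although the record does not say at which step lowercasing was applied.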

 

Resources

For more information, see https://temu.bsc.es/smm4h-spanish

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files (28.1 GB)

Name                      Size        MD5
cbow_cased.tar.gz         7.5 GB      cb6decc3e77e3d4fe840ad7d6b296e6c
cbow_uncased.tar.gz       6.5 GB      ed75b214f4aa63e026dae7293a6ba0b7
README.txt                341 Bytes   192753a1b86627f96bdfcb4a80540b47
skipgram_cased.tar.gz     7.5 GB      fa0ed3051e6dc83adadde3009f7ef5df
skipgram_uncased.tar.gz   6.5 GB      333b6c5294836f93928be7224d933b2d
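
Since the archives are several gigabytes each, it can be worth checking a download against the published MD5 checksum before unpacking it. The sketch below is one way to do that with the Python standard library. The checksums are copied from the file list above, while the function names `md5sum` and `verify` and the chunk size are illustrative choices.

```python
import hashlib
from pathlib import Path

# MD5 checksums as published in the file list of this record.
EXPECTED_MD5 = {
    "cbow_cased.tar.gz": "cb6decc3e77e3d4fe840ad7d6b296e6c",
    "cbow_uncased.tar.gz": "ed75b214f4aa63e026dae7293a6ba0b7",
    "README.txt": "192753a1b86627f96bdfcb4a80540b47",
    "skipgram_cased.tar.gz": "fa0ed3051e6dc83adadde3009f7ef5df",
    "skipgram_uncased.tar.gz": "333b6c5294836f93928be7224d933b2d",
}


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True if the file's digest matches the published checksum.

    The file is looked up by its base name, so it must keep the
    name it has in the record.
    """
    path = Path(path)
    return md5sum(path) == EXPECTED_MD5[path.name]
```

Calling `verify("skipgram_cased.tar.gz")` in the download directory returns `True` only when the local file matches the published checksum; a `False` result usually points to an incomplete download.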