Spanish COVID-19 Twitter Embeddings in FastText
Barcelona Supercomputing Center
Description
Intro
300-dimensional FastText embeddings trained on 140 million tweets in Spanish. All tweets are COVID-19-related, meaning that each contains one or more keywords related to COVID-19 and lockdown.
Please cite:
Miranda-Escalada, A., Farré-Maduell, E., Lima-López, S., Gascó, L., Briva-Iglesias, V., Agüero-Torales, M., & Krallinger, M. (2021, June). The ProfNER shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora. In Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task (pp. 13-20).
@inproceedings{miranda2021profner,
title={The {ProfNER} shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora},
author={Miranda-Escalada, Antonio and Farr{\'e}-Maduell, Eul{\`a}lia and Lima-L{\'o}pez, Salvador and Gasc{\'o}, Luis and Briva-Iglesias, Vicent and Ag{\"u}ero-Torales, Marvin and Krallinger, Martin},
booktitle={Proceedings of the Sixth Social Media Mining for Health (\#SMM4H) Workshop and Shared Task},
pages={13--20},
year={2021}
}
Description
- Cased and uncased versions are available for both the CBOW and skip-gram models.
- The FastText parameter configuration was:
  - dim 300
  - minCount 5
  - minn 3
  - maxn 6
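As an illustration of how the parameters listed above map onto a fastText command line, the following is a minimal sketch that builds the argument list for the `cbow` and `skipgram` subcommands. The input file and output prefix names are hypothetical, and any parameters not listed above (for example epochs or learning rate) are left at whatever the tool defaults to, since the source does not state them. This is not necessarily how the authors invoked the tool.

```python
import shlex

# Values copied from the parameter list above.
FASTTEXT_PARAMS = {"dim": 300, "minCount": 5, "minn": 3, "maxn": 6}


def build_fasttext_command(model, input_path, output_prefix, params=FASTTEXT_PARAMS):
    """Return the fastText CLI argument list for one model type."""
    if model not in ("cbow", "skipgram"):
        raise ValueError(f"unsupported model type: {model!r}")
    args = ["fasttext", model, "-input", input_path, "-output", output_prefix]
    for name, value in params.items():
        # Each parameter becomes a "-name value" pair on the command line.
        args += [f"-{name}", str(value)]
    return args


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    for model in ("cbow", "skipgram"):
        cmd = build_fasttext_command(model, "tweets_cased.txt", f"covid_{model}_cased")
        print(shlex.join(cmd))
```

In fastText, `minn` and `maxn` bound the length of the character n-grams used as subword units, so with these settings each word is also represented by its 3- to 6-character n-grams.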
Preprocessing
"RT: @" patterns are removed. URLs and user mentions are replaced with the placeholder tokens URL and @MENTION, respectively. The text is then tokenized with the NLTK TweetTokenizer.
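The preprocessing steps described above can be sketched roughly as follows, which may help when preparing new text for lookup in these embeddings. The regular expressions are assumptions: the source does not give the exact patterns, so here the retweet marker ("RT" or "RT:" directly before a mention) is dropped while the mention itself is kept and replaced. Tokenization uses NLTK's `TweetTokenizer` with its default settings, which is also an assumption.

```python
import re

# Assumed patterns; the exact expressions used by the authors are not given.
RT_MARKER = re.compile(r"\bRT:?\s+(?=@)")  # "RT " or "RT: " right before a mention
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def normalize_tweet(text):
    """Remove retweet markers and replace URLs and mentions with placeholders."""
    text = RT_MARKER.sub("", text)
    text = URL_PATTERN.sub("URL", text)
    text = MENTION_PATTERN.sub("@MENTION", text)
    return text.strip()


def tokenize_tweet(text):
    """Normalize a tweet, then tokenize it with NLTK's TweetTokenizer."""
    # Imported here so that normalize_tweet works even without NLTK installed.
    from nltk.tokenize import TweetTokenizer

    return TweetTokenizer().tokenize(normalize_tweet(text))


if __name__ == "__main__":
    example = "RT @user1: Quédate en casa https://t.co/abc #COVID19"
    print(normalize_tweet(example))
```

For the uncased models, the same pipeline would presumably be followed by lowercasing, although the source does not say at which step this was applied.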
Resources
- Web
- Gold Standard corpus
- Annotation guidelines (in Spanish)
- Annotation guidelines (in English)
- Occupations gazetteer
- Conference Proceedings
For more information, see https://temu.bsc.es/smm4h-spanish