Files and code for English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19
- 1. Lund University, Faculty of Medicine, Cell Death, Lysosomes and Artificial Intelligence Group and AI Lund
- 2. Lund University, Humanities Lab
Description
Automated information extraction with natural language processing (NLP) tools is required to gain systematic insights from the large number of COVID-19 publications, reports and social media posts, which far exceed human processing capabilities.
Here we present an NLP toolbox comprising COVID-19-related dictionaries and annotated corpora in English, together with code and workflows for updating and using them. The dictionaries contain terms referring to the COVID-19 disease, the SARS-CoV-2 virus, its variants and common mutations. They were used together with the EasyNER NLP tool to annotate all 764 398 abstracts in the CORD-19 dataset, creating a very large silver standard corpus (the Lund-Annotated-CORD-19 corpus). This was complemented with a small gold standard corpus of PubMed abstracts manually annotated for key entity classes: disease, virus, symptom, protein/gene, cell type, chemical and species terms.
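To illustrate how a dictionary can be used to annotate text, here is a minimal sketch of dictionary-based tagging. It is not the EasyNER implementation: the two tiny dictionaries, the labels and the function names are assumptions made for this example, and the dictionaries in this record are far larger.

```python
import re

# Hypothetical mini-dictionaries, for illustration only.
DICTIONARIES = {
    "disease": ["COVID-19", "coronavirus disease 2019"],
    "virus": ["SARS-CoV-2", "2019-nCoV"],
}


def build_pattern(terms):
    """Compile one case-insensitive pattern that matches any dictionary term."""
    # Longest terms first, so a long term wins over a shorter term it starts with.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds keep matches from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def annotate(text, dictionaries):
    """Return sorted (start, end, matched_text, label) tuples for all matches."""
    spans = []
    for label, terms in dictionaries.items():
        for match in build_pattern(terms).finditer(text):
            spans.append((match.start(), match.end(), match.group(0), label))
    return sorted(spans)


text = "SARS-CoV-2 is the virus that causes COVID-19."
annotations = annotate(text, DICTIONARIES)
for start, end, matched, label in annotations:
    print(start, end, matched, label)
```

Exact string matching of this kind is simple and transparent, but it only finds terms and spelling variants that are listed in the dictionary.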
The toolbox can support various text analysis tasks related to COVID-19, such as named entity recognition and co-mention analysis. A preliminary version of the toolbox, released early in the pandemic, has already been used, for example, to create a COVID-19 knowledge graph and to study the evolution and variation of COVID-19-related terminology. The toolbox can also be applied in the development of other NLP tools, for example to train and evaluate large language models.
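As a rough illustration of co-mention analysis, the sketch below counts how often two entity terms occur in the same abstract. The input data and the function name are hypothetical and chosen for this example only; the corpus files in this record have their own format, which this sketch does not read.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of entity terms found in each annotated abstract.
abstracts = [
    {"SARS-CoV-2", "ACE2", "fever"},
    {"SARS-CoV-2", "ACE2"},
    {"COVID-19", "fever"},
]


def co_mentions(docs):
    """Count, for each pair of terms, the number of abstracts mentioning both."""
    counts = Counter()
    for entities in docs:
        # Sorting gives every pair one canonical order, so (a, b) and (b, a)
        # are counted together.
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts


counts = co_mentions(abstracts)
for pair, n in counts.most_common():
    print(pair, n)
```

Pair counts of this kind can serve as edge weights when building a co-mention network or knowledge graph.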
When using the toolbox, please cite this record and the associated article.
Files
Supplemental_file1.txt
Total size: 249.3 MB
Checksum | Size |
---|---|
md5:b5cd39a94d2f4d0ac2f684c3e4b89755 | 28.7 kB |
md5:25971d209ce11a588b571bc0ce36e61b | 12.6 kB |
md5:0e09381e2a90af5723c10986a490ffc1 | 34.2 MB |
md5:38c8b23c832cbe7d6cf77009a80dcd47 | 4.9 MB |
md5:63cc2d3b709f3b582666ef052ad98eca | 70.1 MB |
md5:ec1de8888196244127505204e7ff74d3 | 998 Bytes |
md5:9f26cd778b60a70022fa50cab84e0575 | 23.4 kB |
md5:6253c6b7868fec86ed2ea2b31082e227 | 139.9 MB |
md5:81051bcc2cf1bedf378224b0a93e2877 | 2 Bytes |
md5:703295f9d3910eef89cd410682e5566c | 65.9 kB |
md5:fd5dc3147c121c45c5d0a963f6620df6 | 96.0 kB |
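The MD5 checksums listed for the files can be used to check that a download is complete and uncorrupted. The following is a minimal sketch using Python's standard `hashlib` module; the function names are chosen for this example, and the file path in the usage comment is a placeholder, not a file name from this record.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for the larger files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage (placeholder path; substitute a downloaded file and its listed checksum):
# matches_listed_checksum("path/to/downloaded_file", "md5:<hex digest>")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be downloaded again.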
Additional details
Related works
- Is documented by
- Preprint: arXiv:2003.09865 (arXiv)
- Preprint: 10.48550/arXiv.2003.09865 (DOI)
Software
- Repository URL
- https://github.com/Aitslab/Covid19/
- Development Status
- Active
References
- English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19. arXiv 2020, v3 2022 (arXiv:2003.09865)