Published June 12, 2025 | Version 3.0.0
Dataset Open

Files and code for English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19

  • 1. Lund University, Faculty of Medicine, Cell Death, Lysosomes and Artificial Intelligence Group and AI Lund
  • 2. Lund University, Humanities Lab

Description

Automated information extraction with natural language processing (NLP) tools is required to gain systematic insights from the large number of COVID-19 publications, reports and social media posts, which far exceed human processing capabilities. 

Here we present an NLP toolbox comprising English COVID-19-related dictionaries and annotated corpora, together with code and workflows for updating and using them. The dictionaries contain terms referring to the COVID-19 disease, the SARS-CoV-2 virus, its variants, and common mutations. They were used together with the EasyNER NLP tool to annotate all 764 398 abstracts in the CORD-19 dataset, creating a very large silver standard corpus (the Lund-Annotated-CORD-19 corpus). This was complemented with a small gold standard corpus consisting of PubMed abstracts manually annotated for key entity classes such as disease, virus, symptom, protein/gene, cell type, chemical and species terms.
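To illustrate the general idea of dictionary-based annotation, the following is a minimal sketch in plain Python. It is not the EasyNER implementation, and the two dictionary entries and the helper name `annotate` are hypothetical placeholders chosen for illustration; the actual dictionaries in this record are far larger and may use a different file layout.

```python
import re


def annotate(text, dictionary):
    """Return (start, end, matched_text, label) for every dictionary hit.

    `dictionary` maps a term to an entity label. Matching is
    case-insensitive and only accepts hits that are not embedded
    inside a longer word.
    """
    if not dictionary:
        return []
    # Try longer terms first so that a long term wins over a shorter
    # term that is a prefix of it.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {term.lower(): label for term, label in dictionary.items()}
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Hypothetical two-entry dictionary, for illustration only.
EXAMPLE_DICTIONARY = {"COVID-19": "disease", "SARS-CoV-2": "virus"}
```

Calling `annotate("SARS-CoV-2 causes COVID-19.", EXAMPLE_DICTIONARY)` yields one span per matched term with character offsets, which is the kind of output a silver standard corpus stores for each abstract.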

The toolbox can support various text analysis tasks related to COVID-19, such as named entity recognition and co-mention analysis. A preliminary version of the toolbox, released early in the pandemic, has already been used, for example, to create a COVID-19 knowledge graph and to study the evolution and variation of COVID-19-related terminology. In addition, the toolbox can be applied in the development of other NLP tools, for example to train and evaluate large language models.

When using the toolbox, please cite this record and the associated article.

Notes

The code is written in Python 3. Required packages: PyBioC (https://github.com/2mh/PyBioC), pandas v1.4.0 or higher, and spaCy v2.2.
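Before running the code, it may help to check the installed package versions against the requirements above. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8 or later). The helper names are hypothetical, reading "spacy v2.2" as "any 2.2.x release" is an assumption, and PyBioC is left out because its installed distribution name is not stated in this record.

```python
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '1.4.0' into a tuple of three ints."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '1.4' so comparisons behave as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def installed_versions(distributions=("pandas", "spacy")):
    """Map each distribution name to its installed version, or None."""
    report = {}
    for name in distributions:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


def pandas_ok(version):
    # Requirement stated in the Notes: pandas v1.4.0 or higher.
    return version is not None and version_tuple(version) >= (1, 4, 0)


def spacy_ok(version):
    # Assumption: "spacy v2.2" means any release in the 2.2.x series.
    return version is not None and version_tuple(version)[:2] == (2, 2)
```

A typical check would call `installed_versions()` and pass the results to `pandas_ok` and `spacy_ok`.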

Files

Supplemental_file1.txt

Files (249.3 MB)

MD5 checksum                        Size
b5cd39a94d2f4d0ac2f684c3e4b89755    28.7 kB
25971d209ce11a588b571bc0ce36e61b    12.6 kB
0e09381e2a90af5723c10986a490ffc1    34.2 MB
38c8b23c832cbe7d6cf77009a80dcd47    4.9 MB
63cc2d3b709f3b582666ef052ad98eca    70.1 MB
ec1de8888196244127505204e7ff74d3    998 Bytes
9f26cd778b60a70022fa50cab84e0575    23.4 kB
6253c6b7868fec86ed2ea2b31082e227    139.9 MB
81051bcc2cf1bedf378224b0a93e2877    2 Bytes
703295f9d3910eef89cd410682e5566c    65.9 kB
fd5dc3147c121c45c5d0a963f6620df6    96.0 kB
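
The MD5 checksums listed above can be used to confirm that a downloaded file is complete and unaltered. The following is a minimal sketch using Python's standard `hashlib` module; the function names are hypothetical, and the file is read in chunks so that the larger files in this record do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected):
    """Compare a file's MD5 with a checksum from the record.

    `expected` may be given with or without the 'md5:' prefix and in
    either letter case.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For a downloaded file, `matches_record(local_path, "md5:...")` with the corresponding checksum from the list should return `True`; a `False` result suggests an incomplete or corrupted download.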

Additional details

Related works

Is documented by
Preprint: arXiv:2003.09865 (arXiv)
Preprint: 10.48550/arXiv.2003.09865 (DOI)

Software

Repository URL
https://github.com/Aitslab/Covid19/
Development Status
Active

References

  • English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19. arXiv 2020, v3 2022 (arXiv:2003.09865)