Files and code for English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19

Salma Kazemi Rashed; Rafsan Ahmed; Johan Frid; Sonja Aits

doi:10.5281/zenodo.15395348

Published June 12, 2025 | Version 3.0.0

Dataset Open

1. Lund University, Faculty of Medicine, Cell Death, Lysosomes and Artificial Intelligence Group and AI Lund
2. Lund University, Humanities Lab

Automated information extraction with natural language processing (NLP) tools is required to gain systematic insights from the large number of COVID-19 publications, reports and social media posts, which far exceed human processing capabilities.

Here we present an NLP toolbox comprising English COVID-19-related dictionaries and annotated corpora, together with code and workflows for updating and using them. Separate dictionaries cover terms for the COVID-19 disease, the SARS-CoV-2 virus, its variants, and common mutations. They were used together with the EasyNER NLP tool to annotate all 764 398 abstracts in the CORD-19 dataset, creating a very large silver standard corpus (the Lund-Annotated-CORD-19 corpus). This was complemented with a small gold standard corpus of PubMed abstracts manually annotated for key entity classes such as disease, virus, symptom, protein/gene, cell type, chemical and species terms.
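To illustrate what dictionary-based annotation means in practice, the following is a minimal sketch of case-insensitive term matching with character offsets. It is not the EasyNER implementation, and the term lists and the function names `build_matcher` and `tag` are hypothetical stand-ins rather than contents of the supplemental files.

```python
import re


def build_matcher(terms):
    """Compile one case-insensitive pattern from a list of dictionary terms."""
    # Longest terms first, so a multi-word term wins over a shorter term
    # that starts at the same position.
    ordered = sorted(set(terms), key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds prevent matches inside longer alphanumeric tokens.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def tag(text, matcher, label):
    """Return (start, end, surface form, label) for every dictionary hit."""
    return [(m.start(), m.end(), m.group(0), label) for m in matcher.finditer(text)]


# Hypothetical mini-dictionaries, for illustration only.
virus_terms = ["SARS-CoV-2", "2019-nCoV", "severe acute respiratory syndrome coronavirus 2"]
disease_terms = ["COVID-19", "coronavirus disease 2019"]

text = "SARS-CoV-2 is the virus that causes COVID-19."
spans = sorted(
    tag(text, build_matcher(virus_terms), "virus")
    + tag(text, build_matcher(disease_terms), "disease")
)
for start, end, surface, label in spans:
    print(start, end, surface, label)
```

Keeping the character offsets alongside each label is what makes such output convertible to standoff annotation formats.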

The toolbox can support various text analysis tasks related to COVID-19, such as named entity recognition and co-mention analysis. A preliminary version of the toolbox, released early in the pandemic, has already been used, for example, to create a COVID-19 knowledge graph and to study the evolution and variation of COVID-19-related terminology. In addition, the toolbox can be applied in the development of other NLP tools, for example to train and evaluate large language models.
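As a rough illustration of co-mention analysis, the sketch below counts how often two entities are annotated in the same abstract. The per-abstract entity lists are invented examples, and `co_mentions` is a hypothetical helper, not part of the toolbox code.

```python
from collections import Counter
from itertools import combinations


def co_mentions(doc_entities):
    """Count, per entity pair, the number of documents mentioning both."""
    counts = Counter()
    for entities in doc_entities:
        # Deduplicate within a document and sort so each pair has one key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Invented example: entities annotated in three abstracts.
docs = [
    ["SARS-CoV-2", "ACE2", "fever"],
    ["SARS-CoV-2", "ACE2", "ACE2"],
    ["fever", "cough"],
]
counts = co_mentions(docs)
for pair, n in counts.most_common():
    print(pair, n)
```

Counting at the document level, rather than per sentence, is only one possible choice; a narrower window gives fewer but more specific pairs.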

When using the toolbox, please cite this record and the associated article.

Notes

The code is written in Python 3. Required packages: PyBioC (https://github.com/2mh/PyBioC), pandas v1.4.0 or higher, and spaCy v2.2.

Files

Files (249.3 MB)

Name	MD5	Size
Supplemental_file1.txt	b5cd39a94d2f4d0ac2f684c3e4b89755	28.7 kB
Supplemental_file10.csv	25971d209ce11a588b571bc0ce36e61b	12.6 kB
Supplemental_file11.zip	0e09381e2a90af5723c10986a490ffc1	34.2 MB
Supplemental_file2.txt	38c8b23c832cbe7d6cf77009a80dcd47	4.9 MB
Supplemental_file3.txt	63cc2d3b709f3b582666ef052ad98eca	70.1 MB
Supplemental_file4.txt	ec1de8888196244127505204e7ff74d3	998 Bytes
Supplemental_file5.zip	9f26cd778b60a70022fa50cab84e0575	23.4 kB
Supplemental_file6.zip	6253c6b7868fec86ed2ea2b31082e227	139.9 MB
Supplemental_file7.txt	81051bcc2cf1bedf378224b0a93e2877	2 Bytes
Supplemental_file8.xml	703295f9d3910eef89cd410682e5566c	65.9 kB
Supplemental_file9.json	fd5dc3147c121c45c5d0a963f6620df6	96.0 kB
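
The MD5 checksums listed with each file can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using only the Python standard library. The helper names `md5_of` and `verify` are hypothetical, and the script assumes the files were saved to the current working directory under their original names.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Check a file's MD5 digest against the expected hex string."""
    return md5_of(path) == expected.lower()


# Checksums copied from the file listing; extend with the remaining files as needed.
EXPECTED = {
    "Supplemental_file1.txt": "b5cd39a94d2f4d0ac2f684c3e4b89755",
    "Supplemental_file9.json": "fd5dc3147c121c45c5d0a963f6620df6",
}

for name, checksum in EXPECTED.items():
    if Path(name).exists():  # skip files that have not been downloaded
        print(name, "OK" if verify(name, checksum) else "MISMATCH")
```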

Additional details

Is documented by: Preprint: arXiv:2003.09865 (arXiv); Preprint: 10.48550/arXiv.2003.09865 (DOI)

Repository URL: https://github.com/Aitslab/Covid19/
Development Status: Active

English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19. arXiv 2020, v3 2022 (arXiv:2003.09865)

COVID-19: https://www.ncbi.nlm.nih.gov/mesh/2052179

	All versions	This version
Views	630	96
Downloads	604	295
Data volume	8.3 GB	7.8 GB
