Dataset Open Access

A meta analysis of Wikipedia's coronavirus sources during the COVID-19 pandemic

Sobel, Jonathan; Benjakob, Omer; Aviram, Rona

At the height of the coronavirus pandemic, on the last day of March 2020, Wikipedia in all languages broke a record for most traffic in a single day. Since the outbreak of the COVID-19 pandemic at the start of January, tens if not hundreds of millions of people have come to Wikipedia to read, and in some cases also contribute, knowledge, information and data about the virus in an ever-growing pool of articles. Our study focuses on the scientific backbone behind the content people across the world read: which sources informed Wikipedia's coronavirus content, and how the scientific research in this field was represented on Wikipedia. Using citations as a readout, we try to map how COVID-19-related research was used in Wikipedia and analyse what happened to it before and during the pandemic. Understanding how scientific and medical information was integrated into Wikipedia, and which sources informed the COVID-19 content, is key to understanding the digital knowledge echosphere during the pandemic.

To delimit the corpus of Wikipedia articles containing Digital Object Identifiers (DOIs), we applied two different strategies. First, we scraped every Wikipedia page from the COVID-19 Wikipedia project (about 3,000 pages) and filtered them to keep only pages containing DOI citations. For our second strategy, we searched EuroPMC for Covid-19, SARS-CoV2 and SARS-nCoV19 (30,000 scientific papers, reviews and preprints), selected the scientific papers from 2019 onwards, and compared them to the citations extracted from the English Wikipedia dump of May 2020 (2,000,000 DOIs). This search led to 231 Wikipedia articles that either contain at least one citation from the EuroPMC search or belong to the Wikipedia COVID-19 project pages containing DOIs. Next, from our corpus of 231 Wikipedia articles we extracted DOIs, PMIDs, ISBNs, websites and URLs using a set of regular expressions. Subsequently, we computed several statistics for each Wikipedia article and retrieved Altmetric, CrossRef and EuroPMC information for each DOI. Finally, our method allowed us to produce tables of annotated citations and of the information extracted from each Wikipedia article, such as books, websites and newspapers.
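The extraction and corpus-selection steps described above can be illustrated with a minimal Python sketch. It is not the authors' code: the regular expressions are common heuristics chosen for illustration, the function names (`extract_dois`, `extract_pmids`, `select_corpus`) are hypothetical, and the sample wikitext and DOI are made up.

```python
import re

# Assumed patterns: widely used heuristics, not the regular expressions
# from the WikiCitationHistoRy repository.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|}<>"\]]+', re.IGNORECASE)
PMID_RE = re.compile(r'\bpmid\s*=\s*(\d+)', re.IGNORECASE)


def extract_dois(wikitext):
    """Return the set of DOIs found in a page, lower-cased for comparison."""
    return {match.rstrip('.,;').lower() for match in DOI_RE.findall(wikitext)}


def extract_pmids(wikitext):
    """Return the set of PMIDs given as `pmid=` template parameters."""
    return set(PMID_RE.findall(wikitext))


def select_corpus(pages, reference_dois):
    """Keep the titles of pages citing at least one DOI from a reference set."""
    reference = {doi.lower() for doi in reference_dois}
    return sorted(
        title for title, text in pages.items()
        if extract_dois(text) & reference
    )


# Made-up example pages (hypothetical titles, DOI and PMID).
sample_pages = {
    "Page A": "{{cite journal |title=X |doi=10.1000/ABC.123 |pmid=12345678}}",
    "Page B": "No scholarly citation here, only https://example.org/news.",
}

if __name__ == "__main__":
    print(select_corpus(sample_pages, ["10.1000/abc.123"]))
```

DOIs are lower-cased before comparison because they are case-insensitive, so the same identifier written differently in Wikipedia and in a literature search result still matches.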

Files used as input and extracted information on Wikipedia's COVID-19 sources are presented in this archive.

See the WikiCitationHistoRy GitHub repository for the R code and other bash/Python utility scripts related to this project.

Analysis of Wikipedia sources during the first wave of the COVID-19 pandemic (up to May 2020)
Files (1.1 GB)
Name Size
annotated_Altmetric_full_COVID_Corpus.txt
md5:5a28ea5c1059bd758864632fbea3be98
688.5 kB Download
annotated_Altmetric_full_euroPMC_30K_DOI.txt
md5:44d576da3f99a0da9d96090d45e2a3ca
7.2 MB Download
annotated_Altmetric_full_wikidump_DOI.txt
md5:c30cbd88c7e5d2da62a586415189cdfa
422.8 MB Download
annotated_crossRef_full_COVID_Corpus.txt
md5:5167aab665a1be93bfdaa7663153d4f6
850.3 kB Download
annotated_crossref_sub_europmc_30K_DOI.txt
md5:a5b06296a4d0ef035d77b238fc0225ee
10.1 MB Download
annotated_EPMC_DOI_COVID19_corpus.txt
md5:ceddd5f6c20da27b2e79093c4b7e319f
734.3 kB Download
annotated_EPMC_euroPMC_30K_DOI.txt
md5:d16b0380268aa5a555d86e06835fcfc7
4.7 MB Download
annotated_EPMC_wikidump_DOI_clean.txt
md5:6567532d7a93ecab93da4249e570e321
232.1 MB Download
citation_dump_010520.csv
md5:1c62a716757f2f1ae19f65b7c85da88d
398.9 MB Download
covid_full_corpus_art___web_regexp_exctracted_citations.xlsx
md5:672a8bf03958d0a292c8524792ef0b24
664.5 kB Download
covid_full_corpus_art__book_regexp_exctracted_citations.xlsx
md5:1e0ac98d58d7a0c9afdacccadc617540
34.1 kB Download
covid_full_corpus_art__cite_regexp_exctracted_citations.xlsx
md5:5847fa2f72a0cb0cf0064e270d7bfa8c
2.1 MB Download
covid_full_corpus_art__doi_regexp_exctracted_citations.xlsx
md5:efa72d30a28cfb80414a0c33d82d1b7e
104.0 kB Download
covid_full_corpus_art__isbn_regexp_exctracted_citations.xlsx
md5:a2a008d986816c413ac66d0bc6ef45ce
20.8 kB Download
covid_full_corpus_art__journal_regexp_exctracted_citations.xlsx
md5:5136d6322f225b14f98a89dba3ecfc7a
347.8 kB Download
covid_full_corpus_art__news_regexp_exctracted_citations.xlsx
md5:c98782743f7e0b2e3149207830afe6e8
439.4 kB Download
covid_full_corpus_art__pmid_regexp_exctracted_citations.xlsx
md5:4fcad422e6e692c80985e2dbe98d7b57
134.0 kB Download
covid_full_corpus_art__press_release_regexp_exctracted_citations.xlsx
md5:d0293ca01842a4c897250bec29bc3c95
10.2 kB Download
covid_full_corpus_art__ref_regexp_exctracted_citations.xlsx
md5:1693dcd1fefb2152331bbc2cbc5b740d
1.4 MB Download
covid_full_corpus_art__report_regexp_exctracted_citations.xlsx
md5:e971a71e8197725768565e07cfbfeae1
5.9 kB Download
covid_full_corpus_art__template_protected_regexp_exctracted_citations.xlsx
md5:4ecc6ad74713a72287392c9da5e6aa48
4.9 kB Download
covid_full_corpus_art__tweet_regexp_exctracted_citations.xlsx
md5:cd7d5257fab0f420f55f82f6de67d8dd
6.1 kB Download
covid_full_corpus_art__url_regexp_exctracted_citations.xlsx
md5:c6ec0af472a79928246e04b50886b089
1.0 MB Download
covid_full_corpus_art__wikihyperlink_regexp_exctracted_citations.xlsx
md5:bf07a9d04162d4edfdb4bc4a93a0d880
1.1 MB Download
europepmc_covid-19_150520.bib
md5:c829fc0c67863245f40da5bb52617758
11.3 MB Download
euroPMC_table_COVID-19_30K_05122020.csv
md5:659456dfae5b0cfc301c40da775ee669
13.3 MB Download
full_corpus_231.txt
md5:d276f1a3b0ce2498c1b879f95379fe0d
6.0 kB Download
get_anno_epmc_30K.r
md5:c05de65832c9273548bd94c7aec49dce
1.6 kB Download
interactive_paper_art_network.R
md5:28d77d86408efe63ff1e187008c46369
903 Bytes Download
interactive_timeline.R
md5:939993b0c64ad23d57095a8a2eb26952
5.5 kB Download
isbn_regexp_exctracted_citations.xlsx
md5:cc31b0d89ac91005fb9787722bd25a85
10.9 kB Download
job_submiter_wiki_dump_citation_extract.py
md5:a980845c1ef20a669fa129fcc943bc49
1.6 kB Download
job_submiter_wiki_dump_citation_extract_mwcite.py
md5:a980845c1ef20a669fa129fcc943bc49
1.6 kB Download
job_submitter.py
md5:c71af7a9cb20d5f2618f546b7939cb58
1.4 kB Download
job_submitter_Altmetric.py
md5:9751af20e841960aa32739d08df1e4d0
1.5 kB Download
job_submitter_crossref.py
md5:344cd621f4da3b026d03d01185e04ee0
1.5 kB Download
job_submitter_europmc.py
md5:c71af7a9cb20d5f2618f546b7939cb58
1.4 kB Download
journal_COVID_category.csv
md5:9a4e8906fdde8604eb0582ce75832687
383.1 kB Download
Network_V0.csv
md5:2f21438440defabbe501303e29438ee8
16.8 kB Download
news_COVID_category.csv
md5:5ab3aa1e1e9b5ede27d97bad9380bc4b
1.1 MB Download
news_COVID_category.xlsx
md5:8e61aadea3b7ff1115054d1dec960cb1
537.0 kB Download
news_regexp_exctracted_citations.xlsx
md5:76af6ed936bb06319194166679902280
393.8 kB Download
preprints_table_210520.xlsx
md5:c9e8ac14b73fe3bc72faee45cc8c2ff5
13.9 kB Download
press_release_COVID_category.csv
md5:1be772c29085c3b68a83474c6f43b973
20.5 kB Download
press_release_regexp_exctracted_citations.xlsx
md5:5845396fd81164749e0eaffdcf9d60ed
11.9 kB Download
ref_COVID_category.csv
md5:6f9b3fbd86bba69470c54246bae9c319
10.7 MB Download
ref_regexp_exctracted_citations.xlsx
md5:263e99f53fbc5f39d51d11dea4ffb3fa
665.9 kB Download
report_COVID_category.csv
md5:23cb0868110e1a0141c47c436bf3eac2
5.8 kB Download
report_regexp_exctracted_citations.xlsx
md5:61a9755ea0caf5b4094fd12f92d53689
6.5 kB Download
top_20_wiki_cited_doi_annotated_europmc.csv
md5:d69ae699a912121a470705af4b1900f2
11.0 kB Download
top_20_wiki_cited_doi_annotated_europmc.xlsx
md5:d8f55f8e22097daeef6311f182f9f96b
17.9 kB Download
top_20_wiki_cited_doi_annotated_europmc_clean.csv
md5:dac8b579ed10043f552d69bdb71d0467
9.6 kB Download
top_20_wiki_cited_doi_annotated_europmc_clean.xlsx
md5:2d9132be324f87520e1351832d5d6f1e
15.8 kB Download
top_annotated_dois_europmc200520_small.xlsx
md5:6f6563a0d18181eb40666053e7c93bb4
15.8 kB Download
tweet_regexp_exctracted_citations.xlsx
md5:42eaf8e1ecd288ea480c15e826b6cb78
5.7 kB Download
tweets_COVID_category.csv
md5:65e4f7bce7a8efb00af5312a1d81b2d7
5.8 kB Download
tweets_COVID_category.xlsx
md5:f24c027a9805a5113ee72647164e509d
16.3 kB Download
url_COVID_category.csv
md5:5e57232cc3e0db93001011df9a35b106
1.8 MB Download
url_COVID_category.xlsx
md5:42f767fb58d16cda6799a00158740fff
2.2 MB Download
url_regexp_exctracted_citations.xlsx
md5:5adfbd896ee956a98be633cd5c262a83
814.3 kB Download
web_COVID_category.csv
md5:f710d05d6ce2b1f50d2d71d1d215da98
1.2 MB Download
web_COVID_category.xlsx
md5:7e62ea43ec52b2ff6d52598945520a12
448.8 kB Download
web_regexp_exctracted_citations.xlsx
md5:524091975df6df9d555eee155f56b54d
454.4 kB Download
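
Each file in the listing above carries an MD5 checksum, which can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library; the helper names (`md5_of_file`, `verify`) are hypothetical, and the local file path is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[4:]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Checksum copied from the listing; the local path is an assumption.
    local = Path("full_corpus_231.txt")
    if local.exists():
        ok = verify(local, "md5:d276f1a3b0ce2498c1b879f95379fe0d")
        print(local.name, "OK" if ok else "MISMATCH")
```

Reading in chunks keeps memory use constant, which matters for the larger files in this archive (several are hundreds of megabytes).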