Published July 20, 2020 | Version v1
Presentation Open

Mapping the COVID-19 Literature

  • 1. Harvard University
  • 2. Austrian Academy of Sciences
  • 3. NamSor

Description

The COVID-19 mapping stems from the need to bring order to the literature, identifying the major actors and how they coordinate their work at a moment when scientific results have to be treated carefully.

Using the COVID‑19 Open Research Dataset (CORD‑19) provided by the Allen Institute for AI (2020), articles are grouped by author and their abstracts are analyzed with Natural Language Processing (NLP) techniques. In particular, we used the Term Frequency – Inverse Document Frequency algorithm, also known as TF-IDF (Salton, Wong, and Yang 1975), to create a metric based on the lexical similarity between authors. When two authors share a significant number of terms, a relation between them is established. The approach differs substantially from citation analysis: lexical analysis is a more inclusive metric that considers authors regardless of whether they are cited (Moon and Rodighiero 2020).
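The lexical relation described above can be illustrated with a minimal sketch. It assumes that each author's abstracts are concatenated into one document, that similarity is measured as the cosine between TF-IDF vectors, and that a relation is kept above a fixed threshold. The cosine measure, the threshold value, and the function names are illustrative assumptions, not details taken from the project.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (here, one per author)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of author documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lexical_edges(abstracts_by_author, threshold=0.1):
    """Link two authors when their TF-IDF similarity reaches the threshold.

    The threshold is a hypothetical parameter chosen for illustration.
    """
    authors = sorted(abstracts_by_author)
    vectors = tfidf_vectors([" ".join(abstracts_by_author[a]) for a in authors])
    edges = []
    for i in range(len(authors)):
        for j in range(i + 1, len(authors)):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                edges.append((authors[i], authors[j], score))
    return edges
```

With toy input such as `{"A": ["coronavirus spike protein binding"], "B": ["spike protein structure"], "C": ["hospital staffing policy"]}`, only A and B share terms, so only that pair is linked. No citation between the two authors is needed.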

Researchers and lexical relations are drawn as network nodes and edges, respectively. The network was created using two JavaScript libraries: PixiJS, a library written for video games that provides high-performance WebGL rendering, and D3.js, a data visualization library whose force-directed graph layout is based on Verlet integration (Verlet 1967). The result is a web-based, open-source application called the “Cartography of COVID-19 Literature” (Rodighiero, Wandl-Vogt, and Carsenat 2020).
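To make the layout step concrete, here is a minimal force-directed sketch in Python, not the D3.js code used by the application. It follows the simplified velocity Verlet scheme with unit mass and unit time step: forces are added to velocities, velocities are damped, and then added to positions. All parameter names and default values are assumptions chosen for illustration.

```python
import math


def simulate(positions, edges, steps=300, link_distance=1.0,
             link_strength=0.1, repulsion=0.05, velocity_decay=0.6):
    """Lay out a small graph with spring links and pairwise repulsion.

    positions: {node: (x, y)} initial coordinates.
    edges: iterable of (node_a, node_b) pairs.
    """
    pos = {n: [float(p[0]), float(p[1])] for n, p in positions.items()}
    vel = {n: [0.0, 0.0] for n in pos}
    nodes = list(pos)
    for _ in range(steps):
        # Repulsion: every pair of nodes pushes apart (inverse-square force).
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[b][0] - pos[a][0]
                dy = pos[b][1] - pos[a][1]
                d = math.hypot(dx, dy) or 1e-9
                f = repulsion / (d * d)
                vel[a][0] -= f * dx / d
                vel[a][1] -= f * dy / d
                vel[b][0] += f * dx / d
                vel[b][1] += f * dy / d
        # Springs: linked nodes are pulled toward the target link distance.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            d = math.hypot(dx, dy) or 1e-9
            f = link_strength * (d - link_distance)
            vel[a][0] += f * dx / d
            vel[a][1] += f * dy / d
            vel[b][0] -= f * dx / d
            vel[b][1] -= f * dy / d
        # Integration: damp the velocity, then move (unit mass, unit time step).
        for n in nodes:
            vel[n][0] *= velocity_decay
            vel[n][1] *= velocity_decay
            pos[n][0] += vel[n][0]
            pos[n][1] += vel[n][1]
    return {n: (p[0], p[1]) for n, p in pos.items()}
```

Calling `simulate({"A": (0, 0), "B": (5, 0)}, [("A", "B")])` pulls the two linked authors together until the spring and the repulsion balance, at a distance slightly above `link_distance`.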

The term cartography is a metaphor, as no geographical information is displayed. The cartographic metaphor refers to the elevation map that is used to point to the most relevant clusters, estimated according to the spatial density and the TF-IDF values. Zooming into the map, authors appear as nodes in the hexagonal grid; between each pair of neighboring authors, if any, the most relevant shared term is shown to convey the semantic meaning of their proximity. The result is one dense cartography whose reading is driven first by the elevation map, and then by a more detailed layer of information that displays researchers and terms.
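One possible reading of an elevation "estimated according to the spatial density and the TF-IDF values" is a weighted kernel density over the laid-out authors. The sketch below assumes a Gaussian kernel, a square grid rather than the hexagonal one, and one TF-IDF-derived weight per author. These choices are illustrative assumptions, not the method used by the application.

```python
import math


def elevation_grid(points, weights, width, height, cell=1.0, bandwidth=1.0):
    """Accumulate a weighted Gaussian density on a regular grid.

    points: list of (x, y) author positions.
    weights: one non-negative weight per author (hypothetical TF-IDF score).
    Returns a list of rows; grid[r][c] is the elevation at cell (r, c).
    """
    cols = int(math.ceil(width / cell))
    rows = int(math.ceil(height / cell))
    grid = [[0.0] * cols for _ in range(rows)]
    for (x, y), w in zip(points, weights):
        for r in range(rows):
            cy = (r + 0.5) * cell  # vertical center of the cell
            for c in range(cols):
                cx = (c + 0.5) * cell  # horizontal center of the cell
                d2 = (cx - x) ** 2 + (cy - y) ** 2
                grid[r][c] += w * math.exp(-d2 / (2 * bandwidth ** 2))
    return grid
```

Cells near many highly weighted authors receive the highest values, which is where such a map would draw its peaks.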

The panel on the left offers insights into the role of each author, providing the number of articles, the publication years, and the most relevant TF-IDF terms. Furthermore, through the NamSor dataset (NamSor API v2.0.9B02), we infer the nationality (also called cultural origin or ethnicity) of each researcher’s co-authors in order to understand the geo-localization of a specific cluster. Note that this information is shown only in aggregate, because NamSor’s accuracy for an individual’s nationality ranges between 85% and 95%, which introduces potential bias and error.

The “Cartography of COVID-19 Literature” is not an analytical method aimed at producing statistical results; rather, it belongs to the domain of visual methods for personal and collective exploration. As data visualizations do not provide facts but rather interpretations (Schnapp 2013), the use of this map is both digital, as the artifact is computed, and analog, as the interpretation depends on the reader. This reinterprets the “theory of knowledge” introduced by Leonardo da Vinci, in which visual observation leads to a synthesis between art and science (Heydenreich 2020).

Files

Presentation.pdf (6.1 MB)
md5:10e9dba3ed630671a5cda0aa6ec6972f

Additional details

Funding

Worldwide Map of Research P2ELP1_181930
Swiss National Science Foundation