TopicTracker: a Python pipeline to search, download and explore PubMed entries

Giovanni Spitale; Nikola Biller-Andorno

doi:10.5281/zenodo.7023618

Published February 8, 2021 | Version 1.4.0

Software Open

1. University of Zurich - Institute of Biomedical Ethics and History of Medicine

A live demo of the TopicTracker is available here (you just need to spawn a new session).

TopicTracker is a Python pipeline intended to streamline and simplify the retrieval and exploration of large amounts of PubMed entries. The software is divided into four Jupyter notebooks: 1. Search and download; 2. Content analyser; 3. Interactive data exploration; 4. Semantic networks.

The first notebook lets you build PubMed queries, download entries, parse them and save them to a .csv file. It takes a PubMed query as input and outputs a dataset, i.e. a folder containing a PubMed export, its metadata saved in the log file, and the Medline file, which you can use to import the references you are analysing into Zotero or similar software. The functions for searching, downloading and parsing live in a separate module so that they can be adapted to other projects if need be. The output of the first notebook can be explored with the second and third notebooks of this collection.
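To illustrate the parse-and-save step, here is a minimal sketch that reads MEDLINE-formatted text and writes selected fields to CSV. It is not TopicTracker's own code: the function names `parse_medline` and `records_to_csv`, the choice of fields, and the sample records are all assumptions made for illustration. It relies only on the standard MEDLINE layout (a tag of up to four characters, then "- ", with continuation lines indented).

```python
import csv
import io

# Made-up sample in MEDLINE format (hypothetical records, for illustration only).
SAMPLE = """PMID- 1
TI  - A first made-up title that wraps
      onto a second line
DP  - 2019 Jan
OT  - ethics
OT  - vaccines

PMID- 2
TI  - A second made-up title
DP  - 2020
OT  - ethics
"""


def parse_medline(text):
    """Parse MEDLINE text into a list of dicts mapping tag -> list of values."""
    records = []
    current = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(current)
            current, last_tag = {}, None
        elif line[:4].strip() and line[4:6] == "- ":
            # A new field: tag in columns 1-4, value from column 7 onwards.
            last_tag = line[:4].strip()
            current.setdefault(last_tag, []).append(line[6:].strip())
        elif last_tag is not None:
            # An indented continuation of the previous field's value.
            current[last_tag][-1] += " " + line.strip()
    if current:
        records.append(current)
    return records


def records_to_csv(records, handle, fields=("PMID", "TI", "DP", "OT")):
    """Write one row per record; repeated fields are joined with '; '."""
    writer = csv.writer(handle)
    writer.writerow(fields)
    for record in records:
        writer.writerow(["; ".join(record.get(field, [])) for field in fields])


records = parse_medline(SAMPLE)
buffer = io.StringIO()
records_to_csv(records, buffer)
print(buffer.getvalue())
```

Joining repeated fields such as keywords (OT) into one cell keeps the CSV at one row per entry; a real pipeline might prefer a separate long-format table instead.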

The second notebook analyses the trends of entities over time. It takes a dataset as input (i.e. a folder containing a PubMed export generated with the first notebook of this collection, its metadata, and the Medline file) and outputs a set of .csv files and .svg plots with the trends of keywords, MeSH terms, authors, journals, lemmas in Title/Abstract, the number of COI statements, and lemma trends in COI statements. The .csv files can then be explored further with the third notebook of this collection.
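The idea of an entity trend can be sketched as a per-year count of entries mentioning each entity. The snippet below is only an illustration under stated assumptions: the function name `keyword_trends`, the `(year, keywords)` input shape, and the sample data are hypothetical, and the notebook itself may count and normalise differently.

```python
from collections import Counter, defaultdict


def keyword_trends(entries):
    """Return {keyword: {year: number of entries mentioning it}}.

    `entries` is an iterable of (year, keywords) pairs. Keywords are
    lower-cased and counted at most once per entry.
    """
    trends = defaultdict(Counter)
    for year, keywords in entries:
        unique = {keyword.strip().lower() for keyword in keywords}
        for keyword in unique:
            if keyword:
                trends[keyword][year] += 1
    # Sort the years so each trend reads chronologically.
    return {kw: dict(sorted(counts.items())) for kw, counts in trends.items()}


# Hypothetical entries: (publication year, author keywords).
entries = [
    (2019, ["Ethics", "Vaccines"]),
    (2020, ["ethics"]),
    (2020, ["ethics", "Public Health"]),
]
trends = keyword_trends(entries)
print(trends["ethics"])
```

The same counting applies to MeSH terms, authors or journals by swapping in the corresponding field.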

The third notebook allows fully interactive exploration of the datasets preprocessed with the second notebook. You can select a dataset to work with, a set of entities to explore, and plot any entity or combination of entities.

The fourth notebook produces tabular data to be imported into Gephi to generate semantic network maps of keywords (for the time being; I may expand it to MeSH terms and lemmas in the future). With some clever clustering and layout this can produce remarkable visualizations of entire fields. See an example here.
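One common way to build such a table is a weighted keyword co-occurrence edge list, where two keywords are linked each time they appear on the same entry. The sketch below assumes that approach; the function names `cooccurrence_edges` and `write_edge_table` and the sample keyword lists are hypothetical, and the notebook's actual output format may differ. The Source, Target and Weight headers follow the column names Gephi's spreadsheet import conventionally uses for edge tables.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def cooccurrence_edges(keyword_lists):
    """Count how often each pair of keywords appears on the same entry."""
    edges = Counter()
    for keywords in keyword_lists:
        # Normalise, deduplicate and sort so each pair has one canonical order.
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        edges.update(combinations(unique, 2))
    return edges


def write_edge_table(edges, handle):
    """Write the edges as a CSV table with Source, Target and Weight columns."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])


# Hypothetical keyword lists, one per PubMed entry.
keyword_lists = [
    ["ethics", "vaccines", "trust"],
    ["Ethics", "Vaccines"],
    ["trust"],
]
edges = cooccurrence_edges(keyword_lists)
buffer = io.StringIO()
write_edge_table(edges, buffer)
print(buffer.getvalue())
```

Entries with a single keyword contribute no edges, so isolated keywords would need a separate node table if they should appear in the map.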

Dependencies (and versions) are listed in every notebook. A couple of toy datasets are provided.

New in v1.4:

- Added a fourth notebook for semantic network mapping

- Added some much larger toy datasets so that you can have more fun right off the bat

To do in v1.5:
- Understand why the PubMed APIs behave so strangely with the PDAT tag
- Handle exceptions in the tabs of notebook 3 (empty files currently produce empty DataFrames)

Files

TopicTracker v1.4.zip (294.6 MB), md5:dcffef20ef65da4d8754ffc5791051fa

Additional details

10.1016/j.heliyon.2020.e04426
