TopicTracker: a Python pipeline to search, download and explore PubMed entries
Authors/Creators
- 1. University of Zurich - Institute of Biomedical Ethics and History of Medicine
Description
A live demo of the TopicTracker is available here (you just need to spawn a new session).
TopicTracker is a Python pipeline intended to streamline and simplify the retrieval and exploration of large amounts of PubMed entries. The software is divided into four Jupyter notebooks: 1. Search and download; 2. Content analyser; 3. Interactive data exploration; 4. Semantic networks.
The first notebook lets you build PubMed queries, download the matching entries, parse them and save them to a .csv file. It takes a PubMed query as input and outputs a dataset, i.e. a folder containing the PubMed export, its metadata (saved in the log file), and the Medline file, in case you want to import the references you are analysing into Zotero or similar software. The functions for searching, downloading and parsing are kept in a separate module to simplify adaptation for other projects if need be. The output of the first notebook can be explored with the second and third notebooks of this collection.
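As an illustration of the parse-and-save step, here is a minimal sketch that turns MEDLINE-formatted text into CSV rows using only the standard library. It is not TopicTracker's own code: the function names `parse_medline` and `records_to_csv`, the chosen columns and the sample records are all made up for this example, and the download step is left out so that the sketch runs offline.

```python
import csv
import io


def parse_medline(text):
    """Parse MEDLINE-formatted text into a list of dicts (tag -> list of values)."""
    records = []
    current = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(current)
                current = {}
                last_tag = None
            continue
        if len(line) > 5 and line[4:6] == "- ":
            # Tagged line: a tag padded to four characters, then "- ", then the value.
            last_tag = line[:4].strip()
            current.setdefault(last_tag, []).append(line[6:].strip())
        elif line.startswith("      ") and last_tag is not None:
            # Continuation line: extend the most recent value of the last tag.
            current[last_tag][-1] += " " + line.strip()
    if current:
        records.append(current)
    return records


def records_to_csv(records, columns=("PMID", "DP", "TI", "OT")):
    """Serialise selected tags to CSV text; repeated tags are joined with '; '."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for record in records:
        writer.writerow(["; ".join(record.get(tag, [])) for tag in columns])
    return buffer.getvalue()


# Made-up sample records, for illustration only.
SAMPLE = """PMID- 1
DP  - 2019 Mar
TI  - A first made-up title that is long enough
      to wrap onto a second line.
OT  - ethics
OT  - big data

PMID- 2
DP  - 2020
TI  - A second made-up title.
OT  - ethics
"""

RECORDS = parse_medline(SAMPLE)
CSV_TEXT = records_to_csv(RECORDS)
```

Keeping every tag as a list of values means repeated fields such as keywords (`OT`) survive parsing and are only flattened when the CSV is written.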
The second notebook analyses the trends of entities over time. It takes as input a dataset (i.e. a folder containing a PubMed export generated with the first notebook of this collection, its metadata, and the Medline file) and outputs a set of .csv files and .svg plots with the trends of keywords, MeSH terms, authors, journals, lemmas in Title/Abstract, the number of COI statements, and lemma trends in COI statements. The .csv files can then be explored further with the third notebook of this collection.
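The idea of an entity trend can be sketched as a per-year count of entries mentioning each entity. The snippet below is an assumption about the general approach, not the notebook's actual implementation: the function name `keyword_trends`, the case-insensitive matching and the sample entries are all illustrative.

```python
from collections import Counter, defaultdict


def keyword_trends(entries):
    """Count, for each keyword, how many entries mention it in each year.

    `entries` is an iterable of (year, keywords) pairs.
    Returns a dict of the form {keyword: {year: count}}.
    """
    trends = defaultdict(Counter)
    for year, keywords in entries:
        # Normalise case and whitespace, and count each keyword once per entry.
        for keyword in {k.strip().lower() for k in keywords}:
            if keyword:
                trends[keyword][year] += 1
    # Sort years so each trend reads chronologically.
    return {k: dict(sorted(c.items())) for k, c in trends.items()}


# Made-up entries, for illustration only.
ENTRIES = [
    (2018, ["Ethics", "Big Data"]),
    (2019, ["ethics"]),
    (2019, ["ethics", "privacy"]),
]

TRENDS = keyword_trends(ENTRIES)
```

The same counting scheme would apply to other entities (MeSH terms, authors, journals) by swapping what is passed in as the second element of each pair.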
The third notebook allows fully interactive exploration of the datasets preprocessed with the second notebook. You can select a dataset to work with, a set of entities to explore, and plot any entity or combination of entities.
The fourth notebook produces tabular data that can be imported into Gephi to generate semantic network maps of keywords (for the time being; I may expand this to MeSH terms and lemmas in the future). With some clever clustering and layout, this can produce remarkable visualizations of entire fields. See an example here.
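One common way to obtain such tabular data is a keyword co-occurrence edge list, where two keywords are linked whenever they appear in the same entry. The sketch below assumes this approach, which may differ from what the notebook actually does; the function names and sample data are hypothetical. The `Source`, `Target` and `Weight` column headers are the ones Gephi's spreadsheet importer recognises for edge tables.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def cooccurrence_edges(keyword_lists):
    """Count how often each unordered pair of keywords shares an entry."""
    edges = Counter()
    for keywords in keyword_lists:
        # Normalise and deduplicate; sorting makes each pair order-independent.
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        for source, target in combinations(unique, 2):
            edges[(source, target)] += 1
    return edges


def edges_to_csv(edges):
    """Serialise the edge counts as an undirected, weighted edge table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


# Made-up keyword lists, one per entry, for illustration only.
KEYWORD_LISTS = [
    ["ethics", "big data", "privacy"],
    ["ethics", "privacy"],
    ["big data"],
]

EDGES = cooccurrence_edges(KEYWORD_LISTS)
EDGE_CSV = edges_to_csv(EDGES)
```

Entries with a single keyword contribute no edges, so isolated keywords would need a separate node table if they should appear in the map.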
Dependencies (and versions) are listed in every notebook. A couple of toy datasets are provided.
New in v1.4:
- Added a fourth notebook for semantic network mapping
- Added some much larger toy datasets so that you can have more fun right off the bat
To do in v1.5:
- understand why the PubMed APIs work so strangely with the PDAT tag
- handle exceptions (empty files -> empty DataFrames) in the tabs of notebook 3
Files
| Name | Size |
|---|---|
| TopicTracker v1.4.zip (md5:dcffef20ef65da4d8754ffc5791051fa) | 294.6 MB |
Additional details
References
- 10.1016/j.heliyon.2020.e04426