Biomedical Expert Finding Benchmark: Health Claims Matched to Researcher Publication Profiles
Authors/Creators
Description
This dataset provides the first large-scale benchmark for identifying biomedical experts qualified to verify health claims. It contains 93,404 health claims paired with 153,147 biomedical researchers, linked through their PubMed publication histories.
The dataset addresses the critical challenge of connecting informal health claims from news media with formal biomedical literature, a cross-genre retrieval task essential for combating health misinformation. Each claim is associated with relevant experts based on their documented research expertise, creating gold-standard (15,119 claims) and silver-standard (78,285 claims) subsets with different construction methodologies.
Expert profiles include comprehensive publication records from PubMed (2018-2024), with titles and abstracts representing their research domains. The dataset captures the semantic heterogeneity between how health information appears in public discourse versus scientific literature, making it valuable for developing automated expert identification systems.
Applications include:
- Benchmarking expert matching and cross-genre retrieval systems
- Training models for health misinformation detection
- Analyzing expertise patterns in biomedical research
- Supporting fact-checking workflows in clinical and public health contexts
- Studying health communication between scientific and public domains
Methods
Data Collection Methodology
We constructed this dataset through a multi-stage pipeline combining automated extraction with expert curation.
Gold-Standard Corpus (15,119 claims). We collected 150,028 health news articles (2018-2024) from reputable sources including Reuters, CNN, BBC, and specialized medical news platforms. We extracted health claims that explicitly cited peer-reviewed research and identified their Digital Object Identifiers (DOIs). Using PubMed's API, we retrieved metadata for the 14,393 unique referenced publications and their 153,147 authors, establishing direct claim-to-expert links.
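The claim-to-publication linking step described above can be sketched in a few lines. This is an illustrative sketch, not the authors' pipeline code: the DOI regex is a simplified version of common DOI-matching patterns, and the query URL uses NCBI's public E-utilities `esearch` endpoint (no network call is made here).

```python
import re

# Simplified DOI pattern as DOIs commonly appear in news text.
# Real DOIs can contain more characters (e.g. parentheses); this is a sketch.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;/:A-Za-z0-9]+")

def extract_dois(article_text: str) -> list[str]:
    """Pull candidate DOIs out of a health-news article."""
    return DOI_PATTERN.findall(article_text)

def pubmed_search_url(doi: str) -> str:
    """Build an NCBI E-utilities query that resolves a DOI to a PubMed ID."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return f"{base}?db=pubmed&term={doi}[DOI]&retmode=json"

text = "The study, doi: 10.1001/jama.2020.1585, reported a reduced risk."
dois = extract_dois(text)          # ["10.1001/jama.2020.1585"]
query = pubmed_search_url(dois[0])
```

Fetching the returned PMID's metadata (authors, title, abstract) would then use the companion `efetch` endpoint, establishing the direct claim-to-expert links.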
Silver-Standard Corpus (78,285 claims). We expanded the dataset using four complementary techniques:
- News article clustering based on temporal proximity, semantic similarity, and named entity overlap to identify additional articles covering the same research.
- Controlled paraphrasing using language models to increase linguistic diversity.
- Headline style transfer to simulate different claim formulations.
- Citation network analysis to identify related publications and their authors.
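The first expansion technique, news article clustering, can be illustrated with a toy pairwise rule: two articles are grouped when they were published close together in time and share enough named entities. The thresholds, field names, and entity sets below are illustrative assumptions, not values from the paper.

```python
from datetime import date

def entity_jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two articles' named-entity sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def same_story(art1: dict, art2: dict, max_days: int = 7,
               min_overlap: float = 0.5) -> bool:
    """Heuristic: same underlying research if temporally close and entity-similar."""
    days_apart = abs((art1["date"] - art2["date"]).days)
    return (days_apart <= max_days
            and entity_jaccard(art1["entities"], art2["entities"]) >= min_overlap)

a1 = {"date": date(2021, 3, 1), "entities": {"aspirin", "heart attack", "JAMA"}}
a2 = {"date": date(2021, 3, 3), "entities": {"aspirin", "heart attack", "Lancet"}}
a3 = {"date": date(2021, 9, 1), "entities": {"vitamin D", "COVID-19"}}

same_story(a1, a2)  # True: 2 days apart, entity overlap 0.5
same_story(a1, a3)  # False: months apart, no shared entities
```

In the full pipeline this pairwise test would be combined with semantic similarity of the article texts, and transitively connected articles would form a cluster inheriting the seed article's claim-to-expert links.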
Expert Profiles. For each researcher, we compiled a comprehensive publication profile by querying PubMed for all their publications from 2018-2024. Expert representations consist of concatenated publication titles and abstracts, capturing their research focus and domain expertise.
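The profile-construction step is simple to sketch: filter a researcher's publications to the 2018-2024 window and concatenate titles and abstracts into one text. The dictionary keys below (`year`, `title`, `abstract`) are hypothetical placeholders for whatever the PubMed metadata actually provides.

```python
def build_expert_profile(publications: list[dict]) -> str:
    """Concatenate titles and abstracts of 2018-2024 publications into one
    profile string representing the researcher's domain expertise."""
    parts = []
    for pub in publications:
        if 2018 <= pub["year"] <= 2024:
            parts.append(pub["title"])
            if pub.get("abstract"):          # abstracts are sometimes missing
                parts.append(pub["abstract"])
    return " ".join(parts)

pubs = [
    {"year": 2019, "title": "Statins and stroke risk.", "abstract": "We studied a cohort of..."},
    {"year": 2016, "title": "Old work.", "abstract": "Outside the 2018-2024 window."},
]
profile = build_expert_profile(pubs)  # includes only the 2019 publication
```

The resulting string can then be fed to any text-based retrieval model (sparse or dense) as the expert-side representation.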
Data Format. The dataset includes:
- Claim texts
- Expert identifiers
- Publication metadata (titles, abstracts, publication dates, MeSH terms where available)
- Relevance labels indicating the strength of claim-expert matches
- Clear marking of gold and silver subsets to support different evaluation scenarios
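To make the listed fields concrete, here is a minimal sketch of loading and filtering one record. The JSON field names (`claim_id`, `claim_text`, `subset`, `experts`, `relevance`) are assumptions for illustration only; inspect the files in `dataset.zip` for the actual schema.

```python
import json

# Hypothetical record layout -- the real field names may differ.
sample = json.loads("""
{"claim_id": "c001",
 "claim_text": "Daily aspirin may lower heart-attack risk.",
 "subset": "gold",
 "experts": [{"expert_id": "e123", "relevance": 2}]}
""")

def gold_claims(records: list[dict]) -> list[dict]:
    """Keep only gold-standard records, e.g. for the stricter evaluation scenario."""
    return [r for r in records if r["subset"] == "gold"]

gold = gold_claims([sample])  # the sample record is in the gold subset
```

Separating the subsets this way lets a system report results on the directly-linked gold claims and the automatically-expanded silver claims independently.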
All data were anonymized where appropriate, while preserving the critical relationships between claims and relevant expert publications. The dataset is released to support reproducible research in expert finding and health misinformation detection.
Other
How to Cite
If you use this dataset in your research, please cite both the dataset and the associated paper:
Dataset Citation:
Zuo, C., Wang, C., & Banerjee, R. (2025). Biomedical Expert Finding Benchmark: Health Claims Matched to Researcher Publication Profiles [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.15009723
@dataset{biomedicalexpert2025,
author = {Zuo, Chaoyuan and Wang, Chenlu and Banerjee, Ritwik},
title = {{Biomedical Expert Finding Benchmark: Health Claims Matched to Researcher Publication Profiles}},
year = {2025},
publisher = {Zenodo},
doi = {10.5281/zenodo.15009723},
url = {https://doi.org/10.5281/zenodo.15009723}
}
Paper Citation:
Zuo, C., Wang, C., & Banerjee, R. (2025). Large-scale Biomedical Expert Finding for Health Claim Verification: A PubMed-based Retrieval Framework. IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
BibTeX:
@inproceedings{healthclaimexperts2025paper,
author = {Zuo, Chaoyuan and Wang, Chenlu and Banerjee, Ritwik},
title = {{Large-scale Biomedical Expert Finding for Health Claim Verification: A PubMed-based Retrieval Framework}},
booktitle = {IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
year = {2025}
}
Citation Note: We recommend citing both the dataset and the paper to acknowledge the benchmark creation and methodology. If you use only specific subsets (gold-standard or silver-standard), please mention this in your paper's methodology section.
Files
- dataset.zip (690.8 MB), md5:642ab9f0794f3909fa3db81bb47400c6
Additional details
Funding
- National Natural Science Foundation of China, grant 62406150
- U.S. National Science Foundation, Collaborative Research: EAGER: MedAn: A Framework for Investigating Live Medical Data against Privacy Laws, award 2335686
Software
- Repository URL
- https://github.com/chzuo/bibm2025_expertfinding