Published October 30, 2025 | Version v1
Dataset (Open Access)

Biomedical Expert Finding Benchmark: Health Claims Matched to Researcher Publication Profiles

  • 1. Nankai University
  • 2. Stony Brook University

Description

This dataset provides the first large-scale benchmark for identifying biomedical experts qualified to verify health claims. It contains 93,404 health claims paired with 153,147 biomedical researchers, linked through their PubMed publication histories.

The dataset addresses the critical challenge of connecting informal health claims from news media with formal biomedical literature, a cross-genre retrieval task essential for combating health misinformation. Each claim is associated with relevant experts based on their documented research expertise, creating gold-standard (15,119 claims) and silver-standard (78,285 claims) subsets with different construction methodologies.

Expert profiles include comprehensive publication records from PubMed (2018-2024), with titles and abstracts representing their research domains. The dataset captures the semantic heterogeneity between how health information appears in public discourse versus scientific literature, making it valuable for developing automated expert identification systems.

Applications include:

  • Benchmarking expert matching and cross-genre retrieval systems
  • Training models for health misinformation detection
  • Analyzing expertise patterns in biomedical research
  • Supporting fact-checking workflows in clinical and public health contexts
  • Studying health communication between scientific and public domains

Methods

Data Collection Methodology

We constructed this dataset through a multi-stage pipeline combining automated extraction with expert curation.

Gold-Standard Corpus (15,119 claims). We collected 150,028 health news articles (2018-2024) from reputable sources including Reuters, CNN, BBC, and specialized medical news platforms. We extracted health claims that explicitly cited peer-reviewed research and identified their Digital Object Identifiers (DOIs). Using PubMed's API, we retrieved metadata for the 14,393 unique referenced publications and their 153,147 authors, establishing direct claim-to-expert links.
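
The DOI-identification step above can be sketched in a few lines. This is an illustrative example, not the authors' actual extraction code; the regular expression follows the commonly used pattern for modern DOIs, and the sample article text is invented.

```python
import re

# Pattern covering the vast majority of modern DOIs (10.NNNN/suffix).
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+')

def extract_dois(article_text: str) -> list[str]:
    """Return unique DOIs found in the article, stripping trailing punctuation."""
    dois = [m.rstrip('.,;)') for m in DOI_PATTERN.findall(article_text)]
    seen, unique = set(), []
    for d in dois:
        if d not in seen:
            seen.add(d)
            unique.append(d)
    return unique

article = ("A new study (https://doi.org/10.1001/jama.2023.1234) reports that "
           "the trial, published as 10.1056/NEJMoa2034577, found...")
print(extract_dois(article))
# → ['10.1001/jama.2023.1234', '10.1056/NEJMoa2034577']
```

In the actual pipeline, each extracted DOI would then be resolved to PubMed metadata (e.g., via NCBI's E-utilities) to recover the cited publication and its author list.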

Silver-Standard Corpus (78,285 claims). We expanded the dataset using four complementary techniques:

  1. News article clustering based on temporal proximity, semantic similarity, and named entity overlap to identify additional articles covering the same research.
  2. Controlled paraphrasing using language models to increase linguistic diversity.
  3. Headline style transfer to simulate different claim formulations.
  4. Citation network analysis to identify related publications and their authors.
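
The first expansion technique, grouping news articles that cover the same research, can be illustrated with a minimal clustering sketch. This is a stand-in for the temporal, semantic, and entity-overlap criteria described above (here reduced to token-set Jaccard overlap), and the threshold and sample headlines are invented for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_articles(articles: list[str], threshold: float = 0.3) -> list[list[int]]:
    """Greedy single-pass clustering: an article joins the first cluster
    whose seed article it overlaps beyond `threshold`."""
    tokens = [set(a.lower().split()) for a in articles]
    clusters: list[list[int]] = []
    for i, t in enumerate(tokens):
        for c in clusters:
            if jaccard(t, tokens[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

articles = [
    "new vitamin d trial shows reduced fracture risk in older adults",
    "vitamin d trial finds reduced fracture risk among older adults",
    "gut microbiome linked to depression in large cohort study",
]
print(cluster_articles(articles))  # → [[0, 1], [2]]
```

Articles that cluster with a gold-standard article inherit its claim-to-expert links, which is what makes this a silver-standard (weaker supervision) construction.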

Expert Profiles. For each researcher, we compiled a comprehensive publication profile by querying PubMed for all of their publications from 2018-2024. Expert representations consist of concatenated publication titles and abstracts, capturing their research focus and domain expertise.
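
The concatenation scheme above is straightforward; a minimal sketch follows. The field names (`title`, `abstract`) and sample records are illustrative, not the dataset's actual schema.

```python
def build_profile(publications: list[dict]) -> str:
    """Concatenate each publication's title and abstract into one profile text,
    skipping empty fields."""
    parts = []
    for pub in publications:
        parts.append(pub.get("title", "").strip())
        parts.append(pub.get("abstract", "").strip())
    return " ".join(p for p in parts if p)

pubs = [
    {"title": "Statins and cardiovascular outcomes.",
     "abstract": "We examine statin therapy in a cohort of 10,000 patients."},
    {"title": "Lipid metabolism in aging.", "abstract": ""},
]
print(build_profile(pubs))
```

The resulting text can be fed to any retrieval model (sparse or dense) as the expert-side document in claim-to-expert matching.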

Data Format. The dataset includes:

  • Claim texts
  • Expert identifiers
  • Publication metadata (titles, abstracts, publication dates, MeSH terms where available)
  • Relevance labels indicating the strength of claim-expert matches
  • Clear marking of gold and silver subsets to support different evaluation scenarios
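
A record combining these fields might be consumed as follows. The JSON layout here is hypothetical, shown only to illustrate how claims, experts, and relevance labels relate; consult the files in dataset.zip for the actual field names.

```python
import json

# Hypothetical record layout (illustrative field names).
record_json = """
{
  "claim_id": "gold-000123",
  "claim_text": "A Mediterranean diet lowers the risk of heart disease.",
  "subset": "gold",
  "experts": [
    {"expert_id": "au-98765", "relevance": 2},
    {"expert_id": "au-55501", "relevance": 1}
  ]
}
"""

record = json.loads(record_json)
# Filter to strongly relevant experts using the graded relevance labels.
relevant = [e["expert_id"] for e in record["experts"] if e["relevance"] >= 2]
print(record["subset"], relevant)  # → gold ['au-98765']
```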

All data were anonymized where appropriate, while preserving the critical relationships between claims and relevant expert publications. The dataset is released to support reproducible research in expert finding and health misinformation detection.
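
For the evaluation scenarios the subsets support, a standard retrieval metric such as mean reciprocal rank (MRR) can score a system's expert rankings against the relevance labels. The metric itself is standard; the rankings and gold sets below are toy examples, not taken from the dataset.

```python
def mean_reciprocal_rank(rankings: list[list[str]],
                         gold: list[set[str]]) -> float:
    """Average, over claims, of 1/rank of the first relevant expert
    in each system-produced ranking (0 if none is retrieved)."""
    total = 0.0
    for ranked, relevant in zip(rankings, gold):
        for rank, expert in enumerate(ranked, start=1):
            if expert in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

rankings = [["e3", "e1", "e7"],   # first relevant expert at rank 2
            ["e5", "e2", "e9"]]   # first relevant expert at rank 1
gold = [{"e1"}, {"e5", "e9"}]
print(mean_reciprocal_rank(rankings, gold))  # → 0.75
```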

Other

How to Cite

If you use this dataset in your research, please cite both the dataset and the associated paper:

Dataset Citation: 

Zuo, C., Wang, C., & Banerjee, R. (2025). Biomedical Expert Finding Benchmark: Health Claims Matched to Researcher Publication Profiles [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.15009723
BibTeX:
@dataset{biomedicalexpert2025,
  author = {Zuo, Chaoyuan and Wang, Chenlu and Banerjee, Ritwik},
  title = {{Biomedical Expert Finding Benchmark: Health Claims Matched to Researcher Publication Profiles}},
  year = {2025},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.15009723},
  url = {https://doi.org/10.5281/zenodo.15009723}
}

Paper Citation:

Zuo, C., Wang, C., & Banerjee, R. (2025). Large-scale Biomedical Expert Finding for Health Claim Verification: A PubMed-based Retrieval Framework. IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

BibTeX:

@inproceedings{healthclaimexperts2025paper,
  author = {Zuo, Chaoyuan and Wang, Chenlu and Banerjee, Ritwik},
  title = {{Large-scale Biomedical Expert Finding for Health Claim Verification: A PubMed-based Retrieval Framework}},
  booktitle = {IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
  year = {2025}
}

Citation Note: We recommend citing both the dataset and the paper to acknowledge the benchmark creation and methodology. If you use only specific subsets (gold-standard or silver-standard), please mention this in your paper's methodology section.

Files

dataset.zip (690.8 MB)
md5: 642ab9f0794f3909fa3db81bb47400c6

Additional details

Funding

  • National Natural Science Foundation of China, Grant 62406150
  • U.S. National Science Foundation, Collaborative Research: EAGER: MedAn: A Framework for Investigating Live Medical Data against Privacy Laws, Award 2335686