HDCR: Cross-lingual Medical Misinformation Detection Dataset
Description
Dataset Overview
This dataset contains 72,275 cross-lingual claim-evidence pairs for detecting fine-grained medical misinformation in English and Chinese health communications. Each sample pairs a health claim with peer-reviewed biomedical evidence and carries one label: accurate, or one of four distortion types.
Dataset Description
Data Format
The dataset is provided in three JSON files: train.json (43,363 samples, 60%), dev.json (7,229 samples, 10%), and test.json (21,683 samples, 30%). Each JSON file contains a list of samples with the following structure:
{"id": "0039959", "claim": "Workplace health promotion program delivers significant benefits", "document": "What Can You Achieve in 8 Years? A Case Study on Workplace Health Promotion Program...", "class_label": 3, "language": "en"}
Field Descriptions
id: Unique identifier for each sample
claim: Health claim from medical news sources (English or Chinese)
document: Corresponding scientific evidence (title and abstract from peer-reviewed publications)
class_label: Distortion category (0-4, see below)
language: Language of the claim ("en" for English, "cn" for Chinese)
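The structure above can be checked when loading a split. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `load_split` and `validate_sample` are hypothetical, and the field types are inferred from the single sample record shown above (for instance, that `id` is always a string).

```python
import json
from pathlib import Path

# Field names and types inferred from the sample record in this description (an assumption).
EXPECTED_TYPES = {
    "id": str,
    "claim": str,
    "document": str,
    "class_label": int,
    "language": str,
}
VALID_LABELS = {0, 1, 2, 3, 4}
VALID_LANGUAGES = {"en", "cn"}


def load_split(path):
    """Read one split file, which is expected to hold a JSON list of sample dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def validate_sample(sample):
    """Return a list of problems found in one sample; an empty list means it looks well-formed."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in sample:
            problems.append(f"missing field: {field}")
        elif not isinstance(sample[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    label = sample.get("class_label")
    if isinstance(label, int) and label not in VALID_LABELS:
        problems.append(f"class_label out of range: {label}")
    language = sample.get("language")
    if isinstance(language, str) and language not in VALID_LANGUAGES:
        problems.append(f"unknown language: {language}")
    return problems
```

Assuming the archive has been extracted into the working directory, `load_split("train.json")` would return the training samples, and `validate_sample` can be run over each one before training.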
Label Definitions
Label 0 - Not Misinformation: Accurate claim with no distortion
Label 1 - Over-generalization: Inappropriately extending limited research findings to broader populations or situations beyond validated scope
Label 2 - Improper Restriction: Inappropriately narrowing the applicability of well-established medical evidence to specific populations without scientific justification
Label 3 - Effect Exaggeration: Inappropriately amplifying treatment effects, risk levels, or statistical significance beyond what evidence supports
Label 4 - Spurious Causation: Incorrectly interpreting correlation or temporal association as a causal relationship without sufficient evidence
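For readable reports and confusion matrices, the five label definitions can be kept in a small lookup table. This is an illustrative sketch; `LABEL_NAMES` and `label_name` are hypothetical names, not identifiers shipped with the dataset.

```python
# Category names copied from the label definitions in this description.
LABEL_NAMES = {
    0: "Not Misinformation",
    1: "Over-generalization",
    2: "Improper Restriction",
    3: "Effect Exaggeration",
    4: "Spurious Causation",
}


def label_name(class_label):
    """Map a class_label value (0-4) to its category name."""
    try:
        return LABEL_NAMES[class_label]
    except KeyError:
        raise ValueError(f"unknown class_label: {class_label!r}") from None
```

For the sample record shown earlier, whose `class_label` is 3, `label_name(3)` gives "Effect Exaggeration".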
Dataset Statistics
Overall Distribution
Total samples: 72,275
English claims: 65,127 (90.1%)
Chinese claims: 7,148 (9.9%)
Training samples: 43,363 (60%)
Development samples: 7,229 (10%)
Test samples: 21,683 (30%)
Label Distribution
Not Misinformation (0): 14,455 samples (20%)
Over-generalization (1): 14,455 samples (20%)
Improper Restriction (2): 14,455 samples (20%)
Effect Exaggeration (3): 14,455 samples (20%)
Spurious Causation (4): 14,455 samples (20%)
Language Distribution by Split
Train: 39,195 English + 4,168 Chinese
Dev: 6,532 English + 697 Chinese
Test: 19,400 English + 2,283 Chinese
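The per-split figures above are internally consistent, which the sketch below confirms by summing them. It also includes a hypothetical `summarize` helper that recomputes label and language counts from loaded samples, so a downloaded copy can be compared against the reported statistics. Both function names are illustrative assumptions.

```python
from collections import Counter

# Reported language counts per split: split -> (English, Chinese).
REPORTED = {
    "train": (39195, 4168),
    "dev": (6532, 697),
    "test": (19400, 2283),
}


def reported_totals():
    """Sum the reported per-split counts into (English, Chinese, overall) totals."""
    english = sum(en for en, _ in REPORTED.values())
    chinese = sum(cn for _, cn in REPORTED.values())
    return english, chinese, english + chinese


def summarize(samples):
    """Count samples per class_label and per language in a list of sample dicts."""
    labels = Counter(s["class_label"] for s in samples)
    languages = Counter(s["language"] for s in samples)
    return labels, languages


print(reported_totals())  # (65127, 7148, 72275)
```

The printed totals match the overall distribution: 65,127 English and 7,148 Chinese claims, 72,275 in all.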
Data Collection
Source Materials
Health News Articles: 16,547 articles from authoritative medical journalism platforms. English sources include 15,108 articles from Reuters Health, CNN Health, MedPage Today, Medical News Today, and STAT News (2018-2025). Chinese sources include 1,439 articles from Chinese medical news platforms and health websites (2017-2023).
Scientific Evidence: Peer-reviewed publications from PubMed-indexed journals. Titles and abstracts were extracted via the PubMed API. Each claim was verified against the original biomedical literature.
Recommended Tasks
Primary Task
5-class Fine-grained Medical Misinformation Detection: Classify health claims into one of five categories (accurate or four distortion types)
Alternative Tasks
Binary Classification: Distinguish accurate claims from any type of medical distortion
4-class Classification: Categorize distorted claims by clinical risk type (excluding accurate claims)
Cross-lingual Transfer: Evaluate model robustness across English and Chinese medical claims
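The alternative tasks can all be derived from the original 5-class labels. The sketch below shows one possible way to do so; the function names are hypothetical, and shifting the four distortion labels down to 0-3 for the 4-class task is an assumed convention (many training frameworks expect contiguous labels starting at zero), not something the dataset prescribes.

```python
def to_binary(samples):
    """Binary task: label 0 stays 0 (accurate); labels 1-4 collapse to 1 (any distortion)."""
    return [{**s, "class_label": int(s["class_label"] != 0)} for s in samples]


def to_four_class(samples):
    """4-class task: drop accurate claims, then shift labels 1-4 to 0-3 (assumed convention)."""
    return [
        {**s, "class_label": s["class_label"] - 1}
        for s in samples
        if s["class_label"] != 0
    ]


def split_by_language(samples):
    """Cross-lingual transfer: group samples by their language code ("en" or "cn")."""
    by_language = {}
    for s in samples:
        by_language.setdefault(s["language"], []).append(s)
    return by_language
```

For a cross-lingual transfer experiment, one option is to train on `split_by_language(train)["en"]` and evaluate on `split_by_language(test)["cn"]`, where `train` and `test` are the loaded split lists. Each helper returns new dicts, so the original samples are left unmodified.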
Version History
v1.0 (2025): Initial release with 72,275 samples across English and Chinese
Ethical Considerations
This dataset is intended for research purposes to improve automated detection of medical misinformation. Users should not use this dataset to deliberately generate or spread medical misinformation. Users should also consider the potential impact of false positives in clinical decision-making contexts, recognize the limits of automated systems as a substitute for medical expertise, and respect patient privacy and confidentiality when developing applications with this dataset.
Other
How to Cite
If you use this dataset in your research, please cite both the dataset and the associated paper:
Dataset Citation:
Zuo, C., & Banerjee, R. (2025). HDCR: Cross-lingual Medical Misinformation Detection Dataset [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.17486207
@dataset{biomedicalexpert2025,
author = {Zuo, Chaoyuan and Banerjee, Ritwik},
title = {{HDCR: Cross-lingual Medical Misinformation Detection Dataset}},
year = {2025},
publisher = {Zenodo},
doi = {10.5281/zenodo.17486207},
url = {https://doi.org/10.5281/zenodo.17486207}
}
Paper Citation:
Zuo, C., Wang, C., & Banerjee, R. (2025). HDCR: Cross-lingual Medical Misinformation Detection through Contrastive Claim-Evidence Reasoning. IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
BibTeX:
@inproceedings{healthclaimexperts2025paper,
author = {Zuo, Chaoyuan and Wang, Chenlu and Banerjee, Ritwik},
title = {{HDCR: Cross-lingual Medical Misinformation Detection through Contrastive Claim-Evidence Reasoning}},
booktitle = {IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
year = {2025}
}
Files
data_set.zip (50.3 MB)
md5:b989a09450cce09557e3054b8e9e3e73
Additional details
Funding
- National Natural Science Foundation of China, grant 62406150
Software
- Repository URL: https://github.com/chzuo/bibm2025_hdcr