ChronoMedKG: A Temporally-Grounded, Evidence-Graded Biomedical Knowledge Graph and Benchmark for Temporal Clinical Reasoning

Ahmed, Md Shamim; Firoozbakht, Farzaneh; Galke Poech, Lukas; Baumbach, Jan; Röttger, Richard

doi:10.5281/zenodo.19697543

Published April 22, 2026 | Version 0.0.1

Dataset Open


1. University of Southern Denmark
2. University of Hamburg

ChronoMedKG is a temporally-grounded, evidence-graded biomedical knowledge graph built by running a four-agent disease-autonomous pipeline across 13,431 of PrimeKG's 17,080 diseases (78.6%). The pipeline yields 460,497 validated consensus triples out of 13 million extracted triples; 10,852 diseases produce surviving triples after multi-LLM consensus and Quality Controller filtering. Every edge carries temporal metadata (per-phenotype onset windows, progression stages, clinical milestones), PMID-traceable evidence text, and a six-signal credibility score.

Unlike static biomedical KGs (PrimeKG, iKraph, Hetionet), which treat associations as timeless, ChronoMedKG records when in a disease course each fact applies. The resource adds onset data for 6,250 diseases not present in any reference resource (HPOA, Orphadata, Phenopackets), of which 1,657 are Orphanet-coded rare diseases that gain structured onset representation for the first time. Validation against Orphadata reaches 92.7%; a three-LLM judge-panel audit on 100 novel-coverage diseases reaches 87.9%.

Construction uses a disease-autonomous four-agent pipeline (Disease Profiler, Evidence Harvester, Knowledge Extractor, Quality Controller) that runs end-to-end from a disease identifier. Multiple frontier LLMs extract triples in parallel; only relations supported by multi-model consensus survive credibility filtering and PrimeKG schema alignment. Total construction cost across 13,431 diseases: ~$2,400 in LLM API spend.
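The consensus step described above can be illustrated with a minimal sketch. The record does not state the actual consensus rule, so the vote threshold (`min_models`), the case-insensitive normalization, and the toy model and entity names below are all assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def consensus_triples(extractions, min_models=2):
    """Keep triples proposed by at least `min_models` distinct models.

    `extractions` maps a model name to an iterable of
    (head, relation, tail) string triples. Hypothetical helper:
    the threshold and normalization are illustrative assumptions.
    """
    votes = defaultdict(set)
    for model, triples in extractions.items():
        for head, relation, tail in triples:
            # Normalize case and whitespace so trivially different
            # spellings of the same triple count as one vote.
            key = (
                head.strip().lower(),
                relation.strip().lower(),
                tail.strip().lower(),
            )
            votes[key].add(model)
    # Return each surviving triple with the sorted list of supporting models.
    return {
        key: sorted(models)
        for key, models in votes.items()
        if len(models) >= min_models
    }
```

With three toy extractors, a triple reported by two of them survives at `min_models=2`, while triples seen by only one model are dropped. A real pipeline would presumably match on ontology identifiers rather than lowercased strings.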

ChronoMedKG ships paired with ChronoTQA, the first temporal biomedical QA benchmark: 3,341 questions across eight reported task types plus a 12-question supplementary HPOA negative-temporal MCQ probe. Frontier LLMs trail their static-question accuracy by ~30 points on temporal items, and selective retrieval against ChronoMedKG rescues 47-65% of failed long-tail queries (vs 17-29% for HPOA-RAG).

This deposit (v0.0.1) contains:
- validated_triples.jsonl (Gold, 527 MB, 460,497 rows): main product, post-QC
- consensus_triples.jsonl.gz (Silver, 30 MB): pre-QC consensus rows
- raw_triples.jsonl.gz (Bronze, 644 MB): full extraction log, 13M rows
- tqa_benchmark.json (3.2 MB): ChronoTQA, 3,341 questions
- pmc_clinical_cases.json (63 KB): 31 diagnostic-odyssey case reports
- novelty_multi_judge_v2.json (168 KB): three-LLM audit verdicts
- croissant.json: Croissant 1.0 ML metadata
- README.md, LICENSE-DATA, NOTICE
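Because the Gold file is over 500 MB, streaming it line by line is likely more practical than loading it whole. The sketch below assumes only what the file extensions suggest: one JSON object per line, gzip-compressed when the name ends in `.gz`. The row schema is not described in this record, so the helper inspects field names instead of assuming them; `iter_rows` and `field_names` are hypothetical helper names.

```python
import gzip
import json
from pathlib import Path


def iter_rows(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file.

    Transparently handles gzip-compressed files (names ending in .gz).
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def field_names(path, sample_size=100):
    """Collect the set of keys seen in the first `sample_size` rows."""
    keys = set()
    for index, row in enumerate(iter_rows(path)):
        if index >= sample_size:
            break
        keys.update(row.keys())
    return keys
```

For example, `sum(1 for _ in iter_rows("validated_triples.jsonl"))` should reproduce the stated 460,497 rows if the file holds exactly one triple per line, and `field_names("validated_triples.jsonl")` gives a quick look at the schema before writing any field-specific code.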

Files (1.2 GB)

Name	MD5	Size
consensus_triples.jsonl.gz	a5da17f73ff275e3df99860a246a4d88	29.6 MB
croissant.json	773d722830f08ac3382c47a741cffd4e	14.0 kB
LICENSE-DATA	7922b6da3d1cd1718b5828dd83a319c9	3.3 kB
NOTICE	f6f40f8561cbf1d132b0ce2b9c6190ca	1.1 kB
pmc_clinical_cases.json	f57d8eb15920df9fbb902cc4872a70a4	63.5 kB
raw_triples.jsonl.gz	352d774747f21c19e6fe8e523347072b	643.7 MB
README.md	e417513cbcdb1b88a4571113d63c270b	10.7 kB
tqa_benchmark.json	a9d077139730a66b5b14110bc43ad010	3.2 MB
validated_triples.jsonl	fcca97df3d02a82595ccacdd9e418d09	526.6 MB
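The MD5 checksums listed above can be used to confirm that downloads completed intact. The following is a minimal sketch using Python's standard `hashlib`; the checksums are copied from the file listing, while the function names and the assumption that all files sit in one local directory are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "consensus_triples.jsonl.gz": "a5da17f73ff275e3df99860a246a4d88",
    "croissant.json": "773d722830f08ac3382c47a741cffd4e",
    "LICENSE-DATA": "7922b6da3d1cd1718b5828dd83a319c9",
    "NOTICE": "f6f40f8561cbf1d132b0ce2b9c6190ca",
    "pmc_clinical_cases.json": "f57d8eb15920df9fbb902cc4872a70a4",
    "raw_triples.jsonl.gz": "352d774747f21c19e6fe8e523347072b",
    "README.md": "e417513cbcdb1b88a4571113d63c270b",
    "tqa_benchmark.json": "a9d077139730a66b5b14110bc43ad010",
    "validated_triples.jsonl": "fcca97df3d02a82595ccacdd9e418d09",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory=".", expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == checksum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify("path/to/download/dir")` reports the status of each file. Chunked reading keeps memory use flat even for the 643.7 MB Bronze archive.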

	All versions	This version
Views	117	117
Downloads	154	154
Data volume	26.2 GB	26.2 GB
