Published February 6, 2025 | Version v1
Preprint Open

Comparative Evaluation of ChatGPT-4o and DeepSeek-R1 for Emergency Severity Index (ESI) Triage Classification

Authors/Creators

Description

Large language models (LLMs) such as ChatGPT-4o and DeepSeek-R1 show promise for automating emergency triage, but their alignment with clinical standards remains understudied. This study evaluates both models against a physician gold standard using the Emergency Severity Index (ESI). ChatGPT-4o showed substantial agreement with the physician ratings (Cohen’s Kappa = 0.717, 95% CI: 0.56-0.85; 80% absolute agreement), outperforming DeepSeek-R1 (Cohen’s Kappa = 0.583, 95% CI: 0.41-0.75; 70% absolute agreement). Both models performed well on high-acuity cases (ESI 1-2), but their performance declined for the mid- and lower-acuity categories (ESI 3-5), underscoring the risk of automation bias in ambiguous scenarios.
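For readers unfamiliar with the two agreement statistics reported above, the following is a minimal sketch of how absolute agreement and unweighted Cohen's kappa can be computed from two lists of ESI levels. The function names and the ten toy ratings are hypothetical and purely illustrative; they are not the study's data, and the description does not state whether the authors used an unweighted or a weighted kappa.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of cases where both raters assigned the same ESI level."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions per category.
    p_e = sum(counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical category; kappa is undefined,
        # so this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy ratings (ESI 1-5), NOT taken from the study.
physician = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
model = [1, 2, 2, 3, 3, 4, 4, 3, 5, 5]

agreement = percent_agreement(physician, model)
kappa = cohen_kappa(physician, model)
print(f"absolute agreement = {agreement:.2f}, kappa = {kappa:.3f}")
```

Kappa is lower than raw agreement because it subtracts the agreement expected by chance, which is why the two figures in the description differ for each model.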

Files

Comparative Evaluation of ChatGPT.pdf (476.8 kB)
md5:510e4ccb048c3bc8841ee9775417efcd