Published February 6, 2025 | Version v1 | Preprint | Open
Comparative Evaluation of ChatGPT-4o and DeepSeek-R1 for Emergency Severity Index (ESI) Triage Classification
Description
Large language models (LLMs) such as ChatGPT-4o and DeepSeek-R1 show promise for automating emergency triage, but their alignment with clinical standards remains understudied. This study evaluates both models against a human physician gold standard using the Emergency Severity Index (ESI). ChatGPT-4o demonstrated substantial agreement (Cohen's Kappa = 0.717, 95% CI: 0.56-0.85; 80% absolute agreement), outperforming DeepSeek-R1 (Cohen's Kappa = 0.583, 95% CI: 0.41-0.75; 70% absolute agreement). While both models excelled in high-acuity cases (ESI 1-2), their performance declined for mid- and lower-acuity categories (ESI 3-5), underscoring the risks of automation bias in ambiguous scenarios.
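For readers unfamiliar with the two agreement statistics reported above, the following is a minimal sketch (not the authors' code) of how absolute agreement and Cohen's Kappa are computed from paired ESI ratings; the physician and model label lists are hypothetical, and in practice the reported 95% confidence intervals would typically be obtained by bootstrapping over cases.

```python
# Minimal sketch: percent (absolute) agreement and Cohen's kappa for two raters
# assigning ESI levels (1-5) to the same cases. Example labels are hypothetical.
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Fraction of cases where both raters assign the same ESI level."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)  # observed agreement
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if the two raters' label distributions were independent.
    p_e = sum(counts_a[c] * counts_b[c] for c in set(rater_a) | set(rater_b)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical illustration: physician gold-standard vs. model-assigned ESI levels.
physician = [1, 2, 2, 3, 3, 3, 4, 4, 5, 2]
model     = [1, 2, 2, 3, 4, 3, 4, 3, 5, 2]
print(f"Absolute agreement: {percent_agreement(physician, model):.0%}")
print(f"Cohen's kappa: {cohens_kappa(physician, model):.3f}")
```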
Files

| Name | Size | md5 |
|---|---|---|
| Comparative Evaluation of ChatGPT.pdf | 476.8 kB | 510e4ccb048c3bc8841ee9775417efcd |