Comparative Evaluation of ChatGPT-4o and DeepSeek-R1 for Emergency Severity Index (ESI) Triage Classification

Nazzal, Ahmad

doi:10.5281/zenodo.14823265

Published February 6, 2025 | Version v1

Preprint Open

Large language models (LLMs) such as ChatGPT-4o and DeepSeek-R1 show promise for automating emergency triage, but their alignment with clinical standards remains understudied. This study evaluates both models against a human physician gold standard using the Emergency Severity Index (ESI). ChatGPT-4o demonstrated substantial agreement with the physician ratings (Cohen’s Kappa = 0.717, 95% CI: 0.56-0.85; 80% absolute agreement), outperforming DeepSeek-R1 (Cohen’s Kappa = 0.583, 95% CI: 0.41-0.75; 70% absolute agreement). Both models performed well on high-acuity cases (ESI 1-2), but their performance declined for the mid- and lower-acuity categories (ESI 3-5), underscoring the risk of automation bias in ambiguous scenarios.
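To make the two reported agreement measures concrete, the following is a minimal sketch of how absolute agreement and Cohen's kappa can be computed for a pair of raters. It assumes unweighted kappa (the abstract does not say whether a weighted variant was used), and the `physician` and `model` label lists are hypothetical illustrations, not data from the study.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of cases on which the two raters assign the same ESI level."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)  # observed agreement
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal distribution.
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ESI levels (1 = most urgent, 5 = least urgent); NOT study data.
physician = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
model = [1, 2, 2, 3, 4, 3, 4, 3, 5, 5]

print(f"agreement={percent_agreement(physician, model):.2f} "
      f"kappa={cohens_kappa(physician, model):.3f}")
# prints: agreement=0.80 kappa=0.744
```

Kappa is lower than raw agreement because it discounts the matches expected by chance alone, which is why the abstract reports both figures for each model.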

Files

Name	Size
Comparative Evaluation of ChatGPT.pdf (md5:510e4ccb048c3bc8841ee9775417efcd)	476.8 kB

	All versions	This version
Views	23	23
Downloads	20	20
Data volume	11.0 MB	11.0 MB

Resource type: Preprint
Publisher: Zenodo
Languages: English

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: February 6, 2025
Modified: February 6, 2025