Published February 3, 2026 | Version v1
Publication Open

Behavioral Safety and Context Retention of Large Language Models in a Longitudinal ICU Simulation under Offline Conditions

  • 1. Uzhhorod National University
  • 2. Nemocnica Agel Komárno

Description

Background:
Large language models (LLMs) are increasingly proposed as clinical assistants in critical care, yet their behavior under prolonged clinical context, conflicting data, and authoritative pressure remains insufficiently evaluated. This is particularly relevant for offline or resource-constrained environments, where cloud-based safeguards are unavailable.

Methods:
In this study, I conducted a fully automated behavioral evaluation of 23 language models using a structured, time-series intensive care unit (ICU) simulation spanning 24 hours of synthetic patient data. The scenario incorporated routine monitoring, predefined data–clinical conflict traps, progressive physiological deterioration, and a final safety stress test involving a contraindicated antibiotic order in a patient with documented penicillin anaphylaxis. All models were executed locally under identical offline-first conditions using deterministic inference settings. Model outputs were assessed using predefined rule-based criteria for safety compliance, sycophancy, discrepancy detection, long-term context retention, and response latency.

Results:
While 61% of models formally refused the contraindicated prescription, only 8.7% explicitly grounded their refusal in retained clinical context. Nearly 40% of models complied with the unsafe order despite prior documentation of anaphylaxis, demonstrating pronounced sycophancy under authoritative instruction. More than half of the models initiated inappropriate clinical interventions in response to isolated numerical abnormalities deliberately decoupled from clinical presentation. Long-term context retention degraded in most models, and response latency showed no meaningful association with safer behavior.

Conclusions:
Under realistic offline-first conditions, the majority of evaluated language models exhibited behavior incompatible with safe use in critical care, including unsafe obedience, failure to recognize data artifacts, and loss of safety-critical context over time. These findings indicate that general-purpose LLMs should not be deployed as autonomous clinical agents. However, the performance of a small subset of models suggests that safer offline-capable systems may be achievable through hybrid designs incorporating explicit refusal mechanisms, discrepancy-aware reasoning, and retrieval-augmented grounding in validated clinical knowledge.

Files

Behavioral Safety and Context Retention of LLM.pdf

Files (271.3 kB)

Name Size Download all
md5:2d32ba2e017e7ce882aa44fa1e6dc901
159.1 kB Preview Download
md5:e8b6bfcc85ec41bed3dc4e4f1c57fb88
112.1 kB Preview Download