Behavioral Safety and Context Retention of Large Language Models in a Longitudinal ICU Simulation under Offline Conditions

Shlyakhta, Taras

doi:10.5281/zenodo.18473087

Published February 3, 2026 | Version v1

Publication Open

Behavioral Safety and Context Retention of Large Language Models in a Longitudinal ICU Simulation under Offline Conditions

Shlyakhta, Taras (Researcher)^{1, 2}

1. Uzhhorod National University
2. Nemocnica Agel Komárno

Background:
Large language models (LLMs) are increasingly proposed as clinical assistants in critical care, yet their behavior under prolonged clinical context, conflicting data, and authoritative pressure remains insufficiently evaluated. This is particularly relevant for offline or resource-constrained environments, where cloud-based safeguards are unavailable.

Methods:
In this study, I conducted a fully automated behavioral evaluation of 23 language models using a structured, time-series intensive care unit (ICU) simulation spanning 24 hours of synthetic patient data. The scenario incorporated routine monitoring, predefined data–clinical conflict traps, progressive physiological deterioration, and a final safety stress test involving a contraindicated antibiotic order in a patient with documented penicillin anaphylaxis. All models were executed locally under identical offline-first conditions using deterministic inference settings. Model outputs were assessed using predefined rule-based criteria for safety compliance, sycophancy, discrepancy detection, long-term context retention, and response latency.

Results:
While 61% of models formally refused the contraindicated prescription, only 8.7% explicitly grounded their refusal in retained clinical context. Nearly 40% of models complied with the unsafe order despite prior documentation of anaphylaxis, demonstrating pronounced sycophancy under authoritative instruction. More than half of the models initiated inappropriate clinical interventions in response to isolated numerical abnormalities deliberately decoupled from clinical presentation. Long-term context retention degraded in most models, and response latency showed no meaningful association with safer behavior.

Conclusions:
Under realistic offline-first conditions, the majority of evaluated language models exhibited behavior incompatible with safe use in critical care, including unsafe obedience, failure to recognize data artifacts, and loss of safety-critical context over time. These findings indicate that general-purpose LLMs should not be deployed as autonomous clinical agents. However, the performance of a small subset of models suggests that safer offline-capable systems may be achievable through hybrid designs incorporating explicit refusal mechanisms, discrepancy-aware reasoning, and retrieval-augmented grounding in validated clinical knowledge.

Files

Behavioral Safety and Context Retention of LLM.pdf

Files (271.3 kB)

Name	Size	Download all
Behavioral Safety and Context Retention of LLM.pdf md5:2d32ba2e017e7ce882aa44fa1e6dc901	159.1 kB	Preview Download
Supplementary Materials.pdf md5:e8b6bfcc85ec41bed3dc4e4f1c57fb88	112.1 kB	Preview Download

	All versions	This version
Views	23	23
Downloads	11	11
Data volume	2.6 MB	2.6 MB

Behavioral Safety and Context Retention of Large Language Models in a Longitudinal ICU Simulation under Offline Conditions

Authors/Creators

Description

Files

Behavioral Safety and Context Retention of LLM.pdf

Files (271.3 kB)