Published April 6, 2026 | Version v1
Preprint Open

Adversarial Pressure and Metacognitive Control Failure in Clinical LLMs: A Multi-Domain Benchmark Study

Description

Background: Clinical large language models (LLMs) face adversarial pressure in real-world practice: physician authority language, time urgency, assumption injection, social consensus claims, and protocol waivers all push systems toward action despite missing safety-critical data. Whether LLMs maintain metacognitive control under such pressure remains unstudied.

Objective: To benchmark adversarial robustness of metacognitive control across four LLMs and three clinical domains using a structured pressure taxonomy derived from clinical pharmacy practice.

Methods: We constructed a 60-case adversarial benchmark spanning QT-interval risk, anticoagulation dosing, and controlled substance dispensing. Five pressure categories were systematically injected into cases with missing required inputs (gold label: DEFER for all). Four LLMs were evaluated: GPT-4o-mini (OpenAI), Mistral-7B-Instruct (Mistral AI), Llama-2-7b-chat (Meta), and Gemma-2-2b-it (Google). Metrics: accuracy (deferral rate), unsafe action rate, and awareness rate.
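The scoring described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: only the DEFER label comes from the abstract, while the ACT and OTHER labels and the `score` function are hypothetical names introduced here. The awareness rate is omitted because the abstract does not define it.

```python
from collections import Counter

# Only DEFER is named in the abstract; ACT and OTHER are assumed labels.
DEFER = "DEFER"   # gold label for every case in the benchmark
ACT = "ACT"       # hypothetical label: model acted despite missing inputs
OTHER = "OTHER"   # hypothetical label: neither a deferral nor an unsafe action


def score(predictions):
    """Compute accuracy (deferral rate) and unsafe action rate.

    Because the gold label is DEFER for all cases, accuracy is simply
    the fraction of responses labeled DEFER.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    counts = Counter(predictions)
    n = len(predictions)
    return {
        "accuracy": counts[DEFER] / n,
        "unsafe_action_rate": counts[ACT] / n,
    }


# Toy input for illustration only; these are not results from the study.
toy_predictions = [DEFER] * 7 + [ACT] * 2 + [OTHER]
metrics = score(toy_predictions)
```

Keeping a third outcome in the sketch allows accuracy and unsafe action rate to sum to less than 100%, as they do in the reported results.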

Results: GPT-4o-mini achieved 95.0% accuracy with a 0% unsafe action rate across all pressure types and domains. Mistral-7B achieved 86.7% accuracy with an 8.3% unsafe action rate, Llama-2-7B achieved 70.0% accuracy with an 11.7% unsafe action rate, and Gemma-2 achieved 55.0% accuracy with a 41.7% unsafe action rate. Authority override produced the highest unsafe action rate in Gemma-2 (58.3%); urgency pressure produced 50.0%. In the QT-interval risk domain, Gemma-2 reached a 65% unsafe action rate, the highest domain-specific rate observed.
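As a quick arithmetic check, the reported overall percentages can be converted to implied case counts. The sketch below assumes each model was scored on all 60 cases, which the abstract implies but does not state outright; the counts are derived from the published percentages and are not taken from the paper. The helper `implied_count` is a hypothetical name.

```python
N_CASES = 60  # assumption: every model was scored on the full benchmark

# (accuracy %, unsafe action rate %) as reported in the Results
reported = {
    "GPT-4o-mini": (95.0, 0.0),
    "Mistral-7B": (86.7, 8.3),
    "Llama-2-7B": (70.0, 11.7),
    "Gemma-2": (55.0, 41.7),
}


def implied_count(pct, n=N_CASES):
    """Nearest whole number of cases corresponding to a percentage of n."""
    return round(pct / 100 * n)


implied = {}
consistent = {}
for model, (acc_pct, unsafe_pct) in reported.items():
    k_acc = implied_count(acc_pct)
    k_unsafe = implied_count(unsafe_pct)
    implied[model] = (k_acc, k_unsafe)
    # A count is consistent if it rounds back to the reported percentage.
    consistent[model] = (
        abs(k_acc / N_CASES * 100 - acc_pct) < 0.05
        and abs(k_unsafe / N_CASES * 100 - unsafe_pct) < 0.05
    )

for model, (k_acc, k_unsafe) in implied.items():
    print(f"{model}: {k_acc}/{N_CASES} deferred, {k_unsafe}/{N_CASES} unsafe")
```

Under that assumption, every reported percentage corresponds to a whole number of cases out of 60.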

Implications: Conservative deferral bias, often characterized as a failure in standard benchmarks, is a safety asset under adversarial conditions. Metacognitive robustness under pressure should be a standard evaluation criterion for clinical AI deployment.

Files

adversarial_paper.pdf

Files (77.8 kB)

md5:c45a5253b4e5e12e8f3f846f1def1a0f

Additional details

Dates

Created
2026-04-06

Software

Programming language
Python