Medical LLM Metacognition Is Multidimensional: A MetaMedQA Reanalysis of Confidence, Missing-Answer Recognition, and Unknown-Answer Detection
Authors/Creators
Description
Recent work using MetaMedQA argued that large language models (LLMs) lack essential metacognition for reliable medical reasoning. However, metacognition is not a single construct: confidence–correctness discrimination, missing-answer recognition, unknown-answer detection, and abstention behavior may dissociate. Here, we reanalyzed MetaMedQA using a confidence-centered evaluation framework previously developed for a controlled clinical-evidence benchmark. Two GPT-family models, gpt-4.1-nano and gpt-5.5, were evaluated on 1,373 MetaMedQA items using structured outputs containing an answer, a numerical confidence, and a more-information-needed judgment. gpt-4.1-nano achieved 56.4% accuracy, a mean confidence of 79.7%, a Brier score of 0.318, an expected calibration error of 0.276, and an AUROC2 of 0.582. Its missing-answer recall was 19.1%, and its unknown/unanswerable recall was 25.9%. gpt-5.5 improved substantially, achieving 84.9% accuracy, a mean confidence of 91.2%, a Brier score of 0.112, an expected calibration error of 0.062, and an AUROC2 of 0.819. Its missing-answer recall rose to 67.8%, and its unknown/unanswerable recall to 56.2%. Nevertheless, incorrect responses from gpt-5.5 still received high mean confidence. These results suggest that medical-LLM metacognition is better understood as a set of dissociable behavioral capacities than as a single absent-or-present property. Stronger models can show improved confidence–correctness discrimination and calibration while still retaining clinically relevant failures in missing-answer and unknown-answer recognition.
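For readers unfamiliar with the confidence metrics named above, the following is a minimal Python sketch of how a Brier score, an expected calibration error (ECE), and a confidence-based AUROC could be computed from per-item confidences and correctness labels. It is an illustration under stated assumptions, not the authors' evaluation code: confidences are assumed to be rescaled to [0, 1], the ECE is assumed to use ten equal-width bins, ties in the AUROC are assumed to count as half, and all function names are hypothetical.

```python
from typing import List, Sequence


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Size-weighted mean |confidence - accuracy| over equal-width bins.

    The bin count and equal-width binning are assumptions of this sketch.
    """
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(conf):
        # A confidence of exactly 1.0 falls into the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf[i] for i in members) / len(members)
        accuracy = sum(correct[i] for i in members) / len(members)
        ece += len(members) / len(conf) * abs(mean_conf - accuracy)
    return ece


def confidence_auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct answer gets higher confidence than an
    incorrect one (ties count as half). This is the quantity a
    confidence-correctness AUROC measures."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration, not taken from MetaMedQA).
toy_conf = [0.95, 0.85, 0.75, 0.65]
toy_correct = [1, 1, 0, 1]
print(brier_score(toy_conf, toy_correct))
print(expected_calibration_error(toy_conf, toy_correct))
print(confidence_auroc(toy_conf, toy_correct))
```

The three quantities answer different questions. The Brier score and ECE ask whether stated confidence matches observed accuracy, whereas the AUROC asks only whether correct answers are ranked above incorrect ones. A model can therefore discriminate well while remaining overconfident on the items it gets wrong.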
Files
| Name | Size |
|---|---|
| Medical LLM Metacognition Is Multidimensional.pdf (md5:83dde9ed4e2ec042752da63272a0e7ed) | 714.9 kB |