Published February 27, 2026 | Version 2
Journal article Open

Artificial Intelligence Large Language Models for Pulmonary Nodule Surgical Decision-Making: A Comparative Accuracy Study

Description

BACKGROUND

Artificial intelligence (AI) large language models show promise in medical decision-making, but their reliability in determining surgical indications for pulmonary nodules remains unexplored. We evaluated the diagnostic accuracy and consistency of three leading AI models compared with expert thoracic surgeon consensus.

METHODS

This cross-sectional diagnostic accuracy study evaluated ChatGPT-4, Claude 3.5 Sonnet, and Google Gemini Pro using 45 standardized clinical vignettes representing diverse pulmonary nodule presentations. Six thoracic surgeons, each with ≥5 years of experience, independently reviewed all vignettes to establish the reference consensus. Each AI model was tested three times per vignette to assess test-retest reliability. The primary outcome was overall diagnostic accuracy; secondary outcomes included inter-model agreement and performance across nodule categories and complexity levels.
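One way the three-run test-retest design could be scored is per vignette: a model counts as consistent on a vignette when all three runs return the same recommendation. The abstract does not define the consistency metric, so the following is only a sketch under that assumption; the function names, answer labels, and sample data are hypothetical.

```python
from collections import Counter

def consistency_rate(runs_per_vignette):
    """Fraction of vignettes whose repeated runs all give the same answer."""
    consistent = sum(1 for runs in runs_per_vignette if len(set(runs)) == 1)
    return consistent / len(runs_per_vignette)

def majority_answer(runs):
    """Most frequent answer across the repeated runs of one vignette."""
    return Counter(runs).most_common(1)[0][0]

# Hypothetical data: three runs for each of four vignettes.
runs = [
    ("surgery", "surgery", "surgery"),
    ("follow-up", "follow-up", "follow-up"),
    ("surgery", "follow-up", "surgery"),
    ("follow-up", "follow-up", "follow-up"),
]

print(consistency_rate(runs))                # 0.75
print([majority_answer(r) for r in runs])
```

Other definitions, such as pairwise agreement between runs, would give different percentages from the same data.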

RESULTS

The expert panel achieved 91.4% mean inter-rater agreement (range: 60-100%), with unanimous consensus in 46.7% of cases. Overall AI-expert agreement was 82.2% (95% CI: 71.1-93.4%). Claude and Gemini both achieved 82.2% accuracy with perfect test-retest reliability (100% consistency across three trials), while GPT-4 demonstrated 80.0% accuracy with 86.8% consistency. Inter-model agreement was highest between Claude and Gemini (100%), versus 62.2% between GPT-4 and either of the other two models. Performance varied significantly by nodule category: 100% agreement in complex scenarios (mixed pattern, multiple nodules, high-risk comorbidities, post-treatment) versus 20% in intermediate-sized solid nodules (21-30 mm).
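The reported interval is numerically consistent with a normal-approximation (Wald) interval on 37 of 45 vignettes. Both the count (inferred from 82.2% of 45) and the choice of method are assumptions, since the abstract does not state how the interval was computed. A minimal sketch of that arithmetic:

```python
import math

def wald_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] so the interval stays a valid proportion range.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Assumption: 82.2% agreement on 45 vignettes corresponds to 37 matching cases.
p, lo, hi = wald_ci(37, 45)
print(f"{p:.1%} (95% CI: {lo:.1%}-{hi:.1%})")  # 82.2% (95% CI: 71.1%-93.4%)
```

With only 45 vignettes the Wald interval can be poorly calibrated; a Wilson or exact interval would give somewhat different bounds.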

CONCLUSIONS

Leading AI large language models demonstrate substantial agreement with expert consensus in pulmonary nodule management, with Claude and Gemini showing superior consistency. However, performance varies markedly by clinical context, particularly for intermediate-sized solid nodules where guideline ambiguity is greatest. Current AI capabilities may complement but cannot replace expert thoracic surgical judgment.

KEY WORDS: Artificial intelligence; large language models; pulmonary nodule; surgical indication; diagnostic accuracy

Files

Çavuşoğlu.pdf
