Published February 18, 2026 | Version 1
Journal article | Open access

Predicting Spontaneous Pneumothorax Recurrence with Machine Learning: A Synthetic Example

Description

Aim: Recurrence after primary spontaneous pneumothorax (PSP) remains clinically relevant and may influence the intensity of follow-up and the choice of interventions. Reported recurrence rates vary widely across cohorts. Machine learning (ML) can complement conventional risk stratification by combining multiple predictors into an individualized probability estimate.

Methodology: We generated a synthetic dataset of 1,000 patients with a 12-month recurrence prevalence of 50% to demonstrate an end-to-end supervised ML workflow. Predictors were constructed to mimic common clinical and imaging-derived variables (age, sex, smoking exposure, bleb size, emphysema score, prior pneumothorax, treatment strategy, and a muscle-mass proxy). We compared penalized logistic regression with a random forest classifier, using a stratified train/test split. Model performance was assessed by discrimination (ROC-AUC), overall accuracy (Brier score), calibration intercept/slope, and decision curve analysis (DCA) for clinical utility.
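The split and the two headline metrics named above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the 30% test fraction, and the seed are assumptions, since the description does not state the split ratio or the software used.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts.

    test_fraction and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random positive case is scored above a random negative case.

    Ties count as 0.5. This pairwise form is O(n_pos * n_neg), which is fine
    for a test set of a few hundred patients.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 1,000 synthetic labels at 50% prevalence, as in the described dataset.
labels = [1, 0] * 500
train_idx, test_idx = stratified_split(labels)
```

A constant prediction of 0.5 gives a Brier score of 0.25 and a ROC-AUC of 0.5 at 50% prevalence, which is a useful reference point when reading the reported values.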

Results: On the held-out test set, logistic regression achieved ROC-AUC 0.7633 and Brier score 0.1989; the random forest achieved ROC-AUC 0.7501 and Brier score 0.2055. Calibration intercept/slope were -0.0910/1.1853 for logistic regression and -0.0438/1.2649 for the random forest. Both models showed positive net benefit at decision thresholds of 0.30 and 0.50.
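For readers who want to see how such calibration and net-benefit figures are typically computed, here is a minimal sketch. It assumes the calibration intercept and slope come from a joint logistic recalibration of the outcome on the logit of the predicted probability, and that net benefit follows the standard decision-curve formula; the description does not confirm either choice, and the function names are hypothetical.

```python
import math


def logit(p, eps=1e-12):
    """Log-odds of p, clipped away from 0 and 1 to keep the logarithm finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def calibration_intercept_slope(y_true, y_prob, n_iter=50, tol=1e-10):
    """Fit y ~ sigmoid(a + b * logit(p)) by Newton-Raphson and return (a, b).

    Assumption: intercept and slope are estimated jointly. Perfect
    calibration corresponds to a = 0 and b = 1.
    """
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu          # gradient w.r.t. intercept
            g1 += (yi - mu) * xi   # gradient w.r.t. slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break  # singular information matrix, e.g. all predictions identical
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


def net_benefit(y_true, y_prob, threshold):
    """Net benefit = TP/n - FP/n * pt / (1 - pt), treating patients with p >= pt."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)
```

Net benefit is usually read against the "treat all" and "treat none" strategies. If the test set has exactly 50% prevalence, treating everyone yields a net benefit of about 0.286 at a threshold of 0.30 and 0 at a threshold of 0.50, so a model adds value only where its curve lies above those references.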

Conclusion: This synthetic example illustrates key practical steps (data preparation, model training, evaluation, and reporting) and common pitfalls (data leakage, overfitting, and miscalibration). For real-world deployment, transparent reporting and external validation are essential.
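One of the pitfalls named above, data leakage, often enters through preprocessing. The sketch below shows a leakage-free pattern for a single numeric predictor such as age: scaling parameters are estimated on the training portion only and then reused unchanged on the test portion. The function names and the example values are illustrative assumptions, not taken from the article.

```python
import statistics


def fit_scaler(train_values):
    """Estimate mean and standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    sd = statistics.pstdev(train_values)
    return mean, sd or 1.0  # fall back to 1.0 if the predictor is constant


def apply_scaler(values, mean, sd):
    """Standardize values with parameters that were fitted elsewhere."""
    return [(v - mean) / sd for v in values]


# Hypothetical ages; the test values never influence mean or sd.
train_age = [40.0, 50.0, 60.0]
test_age = [50.0, 70.0]
mean, sd = fit_scaler(train_age)
test_scaled = apply_scaler(test_age, mean, sd)
```

Fitting the scaler on the combined data would let test-set information shape the training features, which tends to make held-out performance look better than it is.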

Files

Karataş.pdf (265.0 kB)
md5:78a57fb208f72fc67ad91375dce3eaa0