Predicting Spontaneous Pneumothorax Recurrence with Machine Learning: A Synthetic Example
Authors/Creators
Description
Aim: Recurrence after primary spontaneous pneumothorax (PSP) remains clinically relevant and may influence the intensity of follow-up and the choice of interventions. Reported recurrence rates vary widely across cohorts. Machine learning (ML) can complement conventional risk stratification by combining multiple predictors into an individualized probability estimate.
Methodology: We generated a synthetic dataset of 1,000 patients with a 12-month recurrence prevalence of 50% to demonstrate an end-to-end supervised ML workflow. Predictors were constructed to mimic common clinical and imaging-derived variables (age, sex, smoking exposure, bleb size, emphysema score, prior pneumothorax, treatment strategy, and a muscle-mass proxy). We compared penalized logistic regression with a random forest classifier, using a stratified train/test split. Model performance was assessed by discrimination (ROC-AUC), overall prediction error (Brier score; lower is better), calibration intercept and slope, and decision curve analysis (DCA) for clinical utility.
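The discrimination and Brier metrics named above can be computed with the standard library alone. The following is a minimal sketch, not the authors' code: the function names `roc_auc` and `brier_score` and the four-patient toy input are illustrative assumptions.

```python
def roc_auc(y_true, y_prob):
    """Probability that a random recurrence case is scored above a random
    non-recurrence case; tied scores count as one half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one case of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical toy data: 1 = recurrence within 12 months, 0 = no recurrence.
y_demo = [0, 0, 1, 1]
p_demo = [0.10, 0.40, 0.35, 0.80]

auc_demo = roc_auc(y_demo, p_demo)        # 3 of 4 case/non-case pairs ordered correctly
brier_demo = brier_score(y_demo, p_demo)  # roughly 0.158
```

As a reference point, a model that predicts 0.5 for every patient in a cohort with 50% prevalence has a Brier score of 0.25, so values below 0.25 indicate some gain over that uninformative baseline.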
Results: On the held-out test set, logistic regression achieved ROC-AUC 0.7633 and Brier score 0.1989; the random forest achieved ROC-AUC 0.7501 and Brier score 0.2055. Calibration intercept/slope were -0.0910/1.1853 for logistic regression and -0.0438/1.2649 for the random forest. Both models showed positive net benefit at decision thresholds of 0.30 and 0.50.
Conclusion: This synthetic example illustrates key practical steps (data preparation, model training, evaluation, and reporting) and common pitfalls (data leakage, overfitting, and miscalibration). For real-world deployment, transparent reporting and external validation are essential.
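One of the pitfalls named above, data leakage, often arises when preprocessing statistics are computed on the full dataset before splitting. The sketch below illustrates the safer order: split first, fit the scaler on the training rows only, then apply it unchanged to the test rows. The helper names, the 20% test fraction, the seed, and the toy age values are all assumptions for illustration; the abstract does not state the split ratio used.

```python
import random


def stratified_split(y, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def fit_standardizer(values):
    """Estimate mean and standard deviation; call this on training data only."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0  # fall back to 1.0 for a constant column
    return mean, std


def apply_standardizer(values, mean, std):
    """Scale any data with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical toy cohort: 10 non-recurrences, 10 recurrences, one predictor (age).
y_demo = [0] * 10 + [1] * 10
age_demo = [18 + 2 * i for i in range(20)]

train_idx, test_idx = stratified_split(y_demo, test_fraction=0.2, seed=0)

# Fit on training rows only, so no test-set information reaches the model.
mean_age, std_age = fit_standardizer([age_demo[i] for i in train_idx])
age_train = apply_standardizer([age_demo[i] for i in train_idx], mean_age, std_age)
age_test = apply_standardizer([age_demo[i] for i in test_idx], mean_age, std_age)
```

The same principle extends to imputation, feature selection, and hyperparameter tuning: each step is fitted inside the training data (or inside each cross-validation fold) and only applied to held-out data.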
Files
Karataş.pdf (265.0 kB)
md5:78a57fb208f72fc67ad91375dce3eaa0