COMFORTER study analysis
Methods
We evaluated the transportability of the REMAIN mortality model and, in response to evidence of calibration instability, assessed alternative models developed using a leave-one-site-out internal-external validation framework across the four participating sites. For the original REMAIN model, Site 2 was treated as the temporal validation cohort because it represented a later cohort from the development site, whereas Sites 1, 3, and 4 were treated as external validation cohorts.
Models
For the REMAIN validation, the published logistic regression coefficients were applied unchanged to each patient in the validation dataset, without model refitting or recalibration. The REMAIN model included age and age squared together with categorical terms for sex, comorbidity burden, acute resuscitation plan (ARP), acuity-dependency profile, hospital admission in the previous 24 hours, and surgery in the previous 24 hours.
For model re-development, we used a predictor set that built on prior insights from REMAIN about the utility of illness severity and care dependency in predicting in-hospital mortality. However, we simplified measurement of care dependency so that it was more amenable to point-of-care measurement. In addition, the acuity-dependency profile was not included in the model; direct measurements of acuity and dependency were used instead. Predictors included: age, sex, ARP, hospital admission in the previous 24 hours, surgery in the previous 24 hours, ICU discharge in the previous 24 hours, MET call history in the previous 24 hours, the National Early Warning Score (NEWS), the modified Index of Caring Dependency (mICD), and SOFA. A logistic regression model predicting in-hospital mortality with only the NEWS was used as a baseline comparator, because the NEWS is the tool currently used in practice to support escalation and decisions about seeking additional assistance for clinical deterioration. Candidate models were TabICL, group lasso logistic regression, group ridge logistic regression, AutoGluon tabular ensembles, and random forest.
TabICL (Tabular In-context Learning) is a pre-trained transformer-based tabular foundation model and a form of prior-data fitted network (Qu et al. 2026). Unlike a conventional regression or classification model, it does not estimate a new set of study-specific coefficients to be used for subsequent prediction. Instead, labelled data in a tabular format are supplied as contextual examples at the time predictions are generated, and the model estimates the probability of in-hospital mortality for a new patient by conditioning on the predictor-outcome relationships represented in those examples. We chose to evaluate the predictive performance of a tabular foundation model for this prediction task because the relationships between mortality and the candidate predictors may be non-linear and may depend on higher-order interactions among physiologic severity measures and recent markers of instability along with measures of care dependency.
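As a toy analogue of this prediction-time conditioning (not TabICL's transformer architecture), the sketch below scores a new patient directly against labelled context rows without ever fitting study-specific coefficients. The `in_context_predict` helper, the predictor values, and the nearest-neighbour weighting are all hypothetical illustrations of the general idea:

```python
import math

def in_context_predict(context_X, context_y, query, k=5):
    """Toy illustration of in-context prediction: no coefficients are
    estimated in advance; labelled context rows are supplied at prediction
    time and the estimate conditions on the nearest examples."""
    # Rank context examples by Euclidean distance to the query patient.
    dists = sorted(
        (math.dist(row, query), label)
        for row, label in zip(context_X, context_y)
    )
    # Probability estimate = inverse-distance-weighted mean outcome
    # of the k closest context examples.
    num = den = 0.0
    for d, label in dists[:k]:
        w = 1.0 / (1.0 + d)
        num += w * label
        den += w
    return num / den

# Hypothetical context rows: two predictors (e.g. a severity score and a
# dependency score) with binary in-hospital mortality labels.
ctx_X = [(1.0, 0.2), (1.2, 0.1), (4.0, 3.5), (4.2, 3.8), (3.9, 3.2)]
ctx_y = [0, 0, 1, 1, 1]
p = in_context_predict(ctx_X, ctx_y, query=(4.1, 3.6), k=3)
```

Swapping the context rows changes the predictions with no retraining step, which is the property that distinguishes in-context models from conventionally fitted regression.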
AutoGluon was used as an automated tabular prediction framework that trains multiple candidate learners and combines them through ensemble learning rather than selecting a single best model class (Erickson et al. 2020). Several candidate learners were trained including gradient boosting machines, CatBoost, XGBoost, random forest, extremely randomized trees, k-nearest neighbours, and linear models. These models were combined using five-fold bagging and one level of stacked ensembling, so that out-of-fold predictions from the base learners could be used to train a higher-level weighted ensemble while limiting optimism from in-sample prediction. This approach was included to evaluate whether combining multiple tabular learning algorithms with different inductive biases would improve predictive performance and robustness relative to reliance on a single modelling framework.
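The out-of-fold mechanism that limits in-sample optimism in stacked ensembling can be illustrated with a deliberately trivial base learner (the training-fold event rate). This is a sketch of the idea only, not AutoGluon's implementation:

```python
import random

def out_of_fold_predictions(y, n_folds=5, seed=0):
    """Minimal sketch of out-of-fold prediction for stacking: every
    observation is scored by a base learner that never saw it, so a
    meta-learner trained on these scores avoids in-sample optimism.
    The 'base learner' here is deliberately trivial: the training-fold
    event rate."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]  # partition indices
    oof = [None] * n
    for k in range(n_folds):
        held = set(folds[k])
        train = [i for i in idx if i not in held]
        fold_model = sum(y[i] for i in train) / len(train)  # "fit" on 4 folds
        for i in folds[k]:
            oof[i] = fold_model  # score only the held-out fold
    return oof

y = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
oof = out_of_fold_predictions(y)
```

In a real stacked ensemble the base learners are gradient boosting machines, random forests, and so on, and the out-of-fold scores become the meta-learner's training features.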
For the grouped penalised logistic models, continuous predictors were represented using spline terms and grouped penalties were applied so that related terms entered or shrank together. Grouping was used because several predictors were represented by multiple derived coefficients, including spline basis terms for continuous variables and sets of coefficients arising from categorical encoding. Penalising these related coefficients at the group level was intended to reduce unstable selection of isolated transformed terms, to preserve more coherent variable-level effects, and to retain a clinically interpretable linear modelling framework while still allowing greater flexibility than a conventional main-effects logistic regression (Yuan and Lin 2006).
An expanded set of predictors that included component-level NEWS, SOFA, and mICD variables was evaluated with TabICL and AutoGluon, as well as the random forest model. These approaches were considered better suited than the linear models to a larger and more granular predictor space because they can accommodate correlated component variables, non-linear effects, and higher-order interactions without requiring these structures to be specified a priori.
Data analysis
Performance of the REMAIN model was assessed for the whole dataset as an overall measure of external validation as well as separately at each site. Site 2 was treated as a temporal validation subset of the total external validation dataset and Sites 1, 3, and 4 as strict external validation. Discrimination was summarised using the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC; average precision), overall prediction error using the Brier score, and calibration using calibration-in-the-large and calibration slope, estimated by regressing the observed outcome on the logit of the predicted probability. Site-level 95% confidence intervals for ROC-AUC, PR-AUC, Brier score, calibration-in-the-large, and calibration slope were estimated from 500 non-parametric bootstrap resamples. Pooled estimates were obtained across Sites 1, 3, and 4 using random-effects meta-analysis with Sidik-Jonkman heterogeneity estimation and Hartung-Knapp confidence and prediction intervals as a sensitivity analysis for external validation of the REMAIN model. For the pooled ROC-AUC meta-analysis, within-site variance was estimated using the Hanley-McNeil method. PR-AUC and Brier score were pooled from site-specific estimates and corresponding standard errors.
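The recalibration regression described above (observed outcome regressed on the logit of the predicted probability) can be sketched as a two-parameter logistic fit. This is a minimal Newton-Raphson/IRLS illustration on simulated data, not the study's code; note that calibration-in-the-large is conventionally the intercept with the slope fixed at 1, whereas this sketch estimates both parameters jointly:

```python
import numpy as np

def calibration_slope_intercept(p_pred, y, n_iter=25):
    """Fit logit(P(y=1)) = a + b * logit(p_pred) by Newton-Raphson.
    b is the calibration slope; a is the calibration intercept with b
    estimated freely (CITL is usually a with b fixed at 1)."""
    lp = np.log(p_pred / (1 - p_pred))            # logit of predictions
    X = np.column_stack([np.ones_like(lp), lp])   # intercept + slope design
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))               # fitted probabilities
        W = mu * (1 - mu)                         # logistic IRLS weights
        grad = X.T @ (y - mu)
        hess = X.T @ (X * W[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta  # [intercept a, slope b]

# Hypothetical perfectly calibrated predictions: outcomes drawn from the
# predicted probabilities themselves, so a ~ 0 and b ~ 1 are expected.
rng = np.random.default_rng(1)
p = rng.uniform(0.02, 0.6, size=20000)
y = rng.binomial(1, p)
a, b = calibration_slope_intercept(p, y)
```

A slope below 1, as seen for REMAIN at Sites 2 and 3, means the predictions are more extreme than the observed risks they should track.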
The new models were developed using leave-one-site-out internal-external validation. In each iteration, one site was held out for testing and the candidate model was trained on the remaining three sites. This process was repeated so that each site served once as the external holdout. For the grouped penalised logistic models, preprocessing and design-matrix construction were repeated within each training set so that no information from the held-out site was used in feature engineering or tuning. Pre-specified continuous predictors were represented using spline basis functions, remaining linear and binary predictors entered as single coefficients, pairwise product terms among the predictors selected for spline modelling were added as individual interaction terms, and categorical predictors were represented by one-hot-encoded indicator variables. Groups were then defined at the predictor level: all spline basis coefficients for a given continuous predictor formed one group, all indicator variables arising from a categorical predictor formed one group, and each remaining linear or interaction term formed its own group. The resulting design matrix was standardised before penalised fitting, and the intercept was not penalised.
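The predictor-level grouping described above can be made concrete with a hypothetical design-matrix layout; the block names and widths below are illustrative, not the study's actual specification:

```python
# Hypothetical predictor layout illustrating the group definitions:
# spline bases and one-hot blocks share a group; single terms stand alone.
design_blocks = [
    ("age_spline", 4),        # 4 spline basis columns -> one group
    ("news_spline", 4),       # spline expansion of NEWS -> one group
    ("sofa_spline", 4),       # spline expansion of SOFA -> one group
    ("sex_indicator", 1),     # binary predictor -> its own group
    ("arp_onehot", 2),        # 3-level categorical -> 2 indicators, one group
    ("age_x_news", 1),        # pairwise interaction term -> its own group
]

groups, col_names = [], []
for g, (name, width) in enumerate(design_blocks):
    for j in range(width):
        groups.append(g)                  # column-to-group index
        col_names.append(f"{name}_{j}")

n_groups = len(set(groups))
```

A column-to-group index like `groups` is what a grouped penalty needs: it tells the optimiser which coefficients must enter or shrink together.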
For group lasso, penalisation was applied to each coefficient group using the Euclidean norm of that group, with the contribution of each group weighted by the square root of the number of coefficients it contained (Yuan and Lin 2006). For grouped ridge, penalisation was applied to the squared Euclidean norm of each coefficient group, with each group weighted inversely to its size, so that larger groups, such as spline expansions or multi-level categorical predictors, were not penalised more heavily solely because they contributed more coefficients (Yuan and Lin 2006). Group lasso was estimated using accelerated proximal-gradient optimisation and grouped ridge using L-BFGS optimisation. The spline basis for these grouped linear models was selected during preliminary model tuning by evaluating candidate specifications with 3, 4, or 5 knots and degree 2 or 3, ranked by validation ROC-AUC with Brier score used to break ties. The selected basis was then held fixed for the internal-external validation analyses (group lasso: 5 knots, degree 3; group ridge: 3 knots, degree 3). Within each internal-external training set, the penalty parameter was re-selected using inner 5-fold stratified cross-validation. Candidate penalty values were 0.001, 0.003, 0.01, 0.03, and 0.10 for group lasso and 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, and 0.3 for group ridge, ranked by mean cross-validated ROC-AUC with mean Brier score used to break ties. AutoGluon models were trained within the non-held-out sites and then refit on the full training data before generating holdout predictions. Model comparison focused on held-out-site ROC-AUC pooled across the four sites using the same random-effects meta-analytic framework, while PR-AUC, Brier score, calibration-in-the-large, and calibration slope were summarised descriptively at the site level. For overall discrimination and calibration figures, out-of-site predictions were concatenated across held-out sites for each model.
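The per-group shrinkage underlying the weighted group lasso corresponds to a block soft-thresholding proximal operator, applied at each accelerated proximal-gradient step. The sketch below shows one such update under the sqrt(group size) weighting; the coefficient values are illustrative, not study output:

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Proximal operator of the group-lasso penalty
    lam * sum_g sqrt(p_g) * ||beta_g||_2.
    Groups whose Euclidean norm falls below their threshold are zeroed
    jointly; surviving groups are shrunk toward zero as a block."""
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    out = np.zeros_like(beta)
    for g in np.unique(groups):
        idx = groups == g
        b_g = beta[idx]
        thresh = lam * np.sqrt(idx.sum())   # sqrt(group size) weighting
        norm = np.linalg.norm(b_g)
        if norm > thresh:
            out[idx] = (1 - thresh / norm) * b_g
    return out

beta = np.array([3.0, 4.0, 0.1, -0.1, 2.0])
groups = [0, 0, 1, 1, 2]                    # two 2-column groups, one singleton
shrunk = group_soft_threshold(beta, groups, lam=1.0)
```

Here the weak second group is removed as a unit, which is exactly the behaviour that prevents unstable selection of isolated spline or indicator terms.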
REMAIN external and temporal validation summary
- Outcome: In-hospital mortality (DeathHospDisch)
- Model: Logistic regression using the REMAIN coefficients
- External validation sites: Site 1, Site 3, Site 4
- Temporal validation site: Site 2
- Primary metrics: ROC-AUC, PR-AUC (average precision), Brier score, calibration slope/intercept.
Interpretation
Discrimination was moderate but varied across sites, indicating that the predictor set retains useful ranking ability outside the development setting.
Calibration slope suggested potential instability in transportability.
- Temporal validation showed a slope <1 (0.825), indicating that predictions were too extreme, consistent with either overfitting or temporal changes in predictor–outcome relationships.
- External validation cohorts showed variable slope estimates: one cohort had a slope <1 (0.789), suggesting a risk of overfitting similar to that seen for the development site, whereas another had a slope >1 (1.218), indicating that estimated risks were too moderate.
- Overall, these findings indicate that the development model may have been overfitted to the development data, producing overconfident predictions that did not generalise well.
Uncertainty in calibration estimates: Indicators of poor calibration were observed across temporal and external validation, suggesting that these findings may not be solely due to sampling variability.
Brier score interpretation: Brier scores were similar across cohorts and did not fully reflect differences in calibration, indicating that overall prediction error may remain comparable even when probability calibration differs.
Rationale for a new model vs. only recalibration: Variation in slope direction across cohorts (both <1 and >1) suggests that a single global recalibration is unlikely to fully address calibration differences.
Site-Level Validation Metrics
| Validation type | Site | N | Deaths | CITL (intercept) | Calibration slope | ROC-AUC [95% CI] | PR-AUC [95% CI] | Brier [95% CI] |
|---|---|---|---|---|---|---|---|---|
| External validation | Site 1 | 485 | 95 | 0.159 | 1.020 | 0.811 (0.766 to 0.855) | 0.544 (0.448 to 0.639) | 0.123 (0.105 to 0.141) |
| External validation | Site 3 | 485 | 105 | 0.077 | 0.789 | 0.766 (0.719 to 0.813) | 0.484 (0.388 to 0.580) | 0.146 (0.124 to 0.168) |
| External validation | Site 4 | 482 | 110 | 0.382 | 1.218 | 0.835 (0.796 to 0.875) | 0.597 (0.503 to 0.692) | 0.131 (0.113 to 0.149) |
| Temporal validation | Site 2 | 485 | 90 | -0.070 | 0.825 | 0.791 (0.740 to 0.841) | 0.495 (0.393 to 0.596) | 0.123 (0.102 to 0.143) |
Site-level 95% confidence intervals were estimated from 500 nonparametric bootstrap replicates.
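A minimal sketch of the percentile-bootstrap scheme for one metric (ROC-AUC) is shown below; the study's exact resampling code is not reproduced here, and the simulated data are hypothetical:

```python
import numpy as np

def roc_auc(y, p):
    """Mann-Whitney formulation of the ROC-AUC (ties from resampling are
    broken arbitrarily in this sketch)."""
    order = np.argsort(p)
    ranks = np.empty(len(p))
    ranks[order] = np.arange(1, len(p) + 1)
    n1 = int(y.sum())
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def bootstrap_auc_ci(y, p, n_boot=500, seed=0):
    """Percentile 95% CI from nonparametric bootstrap resamples,
    mirroring the 500-replicate scheme described above."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))   # resample rows with replacement
        if 0 < y[idx].sum() < len(idx):         # need both classes present
            stats.append(roc_auc(y[idx], p[idx]))
    return np.percentile(stats, [2.5, 97.5])

# Hypothetical cohort with informative predictions.
rng = np.random.default_rng(42)
p = rng.uniform(0, 1, 400)
y = rng.binomial(1, p)
lo, hi = bootstrap_auc_ci(y, p)
```

The same resample-recompute-summarise loop yields intervals for PR-AUC, Brier score, and the calibration metrics.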
Random-Effects Meta-Analysis (Pooled External Validation)
| Metric | Pooled estimate (95% CI) | 95% PI | Tau² | I² (%) |
|---|---|---|---|---|
| ROC-AUC | 0.800 (0.748 to 0.844) | 0.708 to 0.869 | 0.016 | 14.2 |
| PR-AUC (average precision) | 0.529 (0.447 to 0.611) | 0.393 to 0.661 | 0.019 | 2.5 |
| Brier score | 0.131 (0.115 to 0.148) | 0.105 to 0.162 | 0.004 | 4.8 |
Overall sample: N = 1937, deaths = 400, prevalence = 0.207.
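The Hanley-McNeil method used for within-site variance in the pooled ROC-AUC meta-analysis can be written directly from the published formula (Hanley and McNeil 1982); the numbers below are illustrative:

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Hanley-McNeil standard error of the ROC-AUC.
    n_pos = number of events (deaths), n_neg = number of non-events."""
    q1 = auc / (2 - auc)          # P(two random events outrank one non-event)
    q2 = 2 * auc**2 / (1 + auc)   # P(one event outranks two random non-events)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return math.sqrt(var)

# Illustrative: AUC of 0.80 with 100 events and 100 non-events.
se = hanley_mcneil_se(0.8, 100, 100)
```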
Evaluating new models with internal-external validation
- The above results suggest that the REMAIN logistic regression model demonstrates reasonable discrimination but variable calibration across external and temporal validation cohorts.
- We investigated models and techniques that could reduce the risk of poor calibration when a model is applied at a new site, including penalised regression techniques to reduce the risk of overfitting.
- We used an internal-external validation approach so that each model could be trained on data from multiple sites, reducing the risk of overfitting to any single site.
- For group lasso and group ridge, regularisation strength (λ) was selected within each non-holdout training set using inner 5-fold stratified cross-validation before scoring the held-out site.
- We also used a simplified measurement of care dependency and a set of predictors that are simpler for a MET team to assess at the time of decision-making and that we considered would be less susceptible to between-site measurement variability.
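The inner stratified split mentioned above can be sketched as a round-robin deal within outcome classes, so every fold keeps roughly the overall event rate. This is a minimal illustration, not the study's implementation:

```python
import random
from collections import defaultdict

def stratified_folds(y, n_folds=5, seed=0):
    """Sketch of a stratified k-fold split: indices of each outcome class
    are shuffled separately and dealt round-robin across folds, so each
    fold preserves approximately the overall event rate."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label, idx in by_class.items():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % n_folds].append(i)
    return folds

# Hypothetical training set: 20% event rate, 100 patients.
y = [1] * 20 + [0] * 80
folds = stratified_folds(y)
```

Each candidate λ is then scored on these folds, and the value with the best mean cross-validated ROC-AUC is carried forward to the held-out site.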
Site-Level Holdout Results
| Site | n | Events | Prevalence | Model | ROC-AUC | PR-AUC | Brier | CITL | Slope |
|---|---|---|---|---|---|---|---|---|---|
| Site 1 | 485 | 95 | 0.1959 | TabICL (Expanded feature set) | 0.8513 | 0.5996 | 0.1142 | -0.0416 | 0.8886 |
| | | | | TabICL | 0.8489 | 0.5835 | 0.1145 | -0.0532 | 0.9241 |
| | | | | AutoGluon (Expanded feature set) | 0.8487 | 0.5867 | 0.1143 | -0.0427 | 0.9475 |
| | | | | Group ridge | 0.8458 | 0.5732 | 0.1156 | 0.1523 | 1.1792 |
| | | | | Group lasso | 0.8445 | 0.5795 | 0.1155 | 0.0300 | 1.1050 |
| | | | | Random forest (Expanded feature set) | 0.8442 | 0.5425 | 0.1267 | -0.6731 | 1.0702 |
| | | | | AutoGluon | 0.8386 | 0.5666 | 0.1165 | -0.0362 | 1.0731 |
| | | | | Random forest | 0.8371 | 0.5214 | 0.1372 | -0.9011 | 0.9647 |
| | | | | NEWS logistic | 0.7589 | 0.4261 | 0.1343 | 0.0672 | 1.0873 |
| Site 2 | 485 | 90 | 0.1856 | TabICL (Expanded feature set) | 0.8234 | 0.5005 | 0.1226 | -0.3697 | 0.7870 |
| | | | | AutoGluon | 0.8154 | 0.4931 | 0.1233 | -0.3669 | 0.8569 |
| | | | | Random forest (Expanded feature set) | 0.8150 | 0.4966 | 0.1361 | -0.7969 | 0.9427 |
| | | | | TabICL | 0.8132 | 0.5045 | 0.1237 | -0.3483 | 0.7508 |
| | | | | Random forest | 0.8099 | 0.4898 | 0.1433 | -0.9567 | 0.8742 |
| | | | | AutoGluon (Expanded feature set) | 0.8083 | 0.4732 | 0.1267 | -0.4370 | 0.7905 |
| | | | | Group lasso | 0.8060 | 0.4847 | 0.1261 | -0.3725 | 0.8396 |
| | | | | Group ridge | 0.8008 | 0.4942 | 0.1251 | -0.3796 | 0.8176 |
| | | | | NEWS logistic | 0.7258 | 0.3473 | 0.1389 | -0.2711 | 0.9039 |
| Site 3 | 485 | 105 | 0.2165 | TabICL (Expanded feature set) | 0.8055 | 0.5612 | 0.1338 | -0.0865 | 0.7278 |
| | | | | AutoGluon (Expanded feature set) | 0.8009 | 0.5571 | 0.1333 | -0.0278 | 0.8413 |
| | | | | TabICL | 0.8001 | 0.5648 | 0.1334 | -0.1135 | 0.6933 |
| | | | | Group ridge | 0.7943 | 0.5719 | 0.1350 | -0.1128 | 0.7466 |
| | | | | Random forest (Expanded feature set) | 0.7942 | 0.5726 | 0.1419 | -0.5358 | 0.8370 |
| | | | | AutoGluon | 0.7926 | 0.5668 | 0.1337 | -0.0534 | 0.8914 |
| | | | | Group lasso | 0.7917 | 0.5658 | 0.1351 | -0.0856 | 0.7834 |
| | | | | Random forest | 0.7849 | 0.5536 | 0.1512 | -0.7195 | 0.7251 |
| | | | | NEWS logistic | 0.6995 | 0.3659 | 0.1575 | -0.3319 | 0.7244 |
| Site 4 | 482 | 110 | 0.2282 | Group lasso | 0.8940 | 0.7306 | 0.1133 | 0.8786 | 1.7451 |
| | | | | TabICL (Expanded feature set) | 0.8931 | 0.7263 | 0.1126 | 0.6853 | 1.4088 |
| | | | | AutoGluon | 0.8898 | 0.7252 | 0.1140 | 0.7005 | 1.6921 |
| | | | | AutoGluon (Expanded feature set) | 0.8890 | 0.7039 | 0.1155 | 0.7120 | 1.4556 |
| | | | | TabICL | 0.8867 | 0.7122 | 0.1136 | 0.5004 | 1.3509 |
| | | | | Group ridge | 0.8866 | 0.7160 | 0.1159 | 0.8510 | 1.7632 |
| | | | | Random forest (Expanded feature set) | 0.8765 | 0.6812 | 0.1235 | -0.4258 | 1.4764 |
| | | | | Random forest | 0.8689 | 0.6972 | 0.1353 | -0.8325 | 1.3202 |
| | | | | NEWS logistic | 0.8056 | 0.5224 | 0.1422 | 0.6235 | 1.3309 |
Random-Effects Meta-Analysis (Pooled Across Sites)
| Model | Pooled ROC-AUC | 95% CI | 95% PI | Tau² (logit AUC) | I² (%) |
|---|---|---|---|---|---|
| TabICL (Expanded feature set) | 0.8443 | 0.7698 to 0.8979 | 0.6843 to 0.9313 | 0.0600 | 56.92 |
| TabICL | 0.8382 | 0.7631 to 0.8928 | 0.6780 to 0.9273 | 0.0578 | 57.11 |
| AutoGluon (Expanded feature set) | 0.8380 | 0.7586 to 0.8948 | 0.6654 to 0.9308 | 0.0657 | 60.41 |
| Group lasso | 0.8360 | 0.7442 to 0.8993 | 0.6279 to 0.9390 | 0.0896 | 68.30 |
| AutoGluon | 0.8354 | 0.7534 to 0.8939 | 0.6558 to 0.9311 | 0.0694 | 62.13 |
| Group ridge | 0.8334 | 0.7493 to 0.8933 | 0.6483 to 0.9314 | 0.0722 | 63.77 |
| Random forest (Expanded feature set) | 0.8330 | 0.7658 to 0.8838 | 0.6966 to 0.9154 | 0.0418 | 48.61 |
| Random forest | 0.8256 | 0.7587 to 0.8769 | 0.6914 to 0.9091 | 0.0388 | 47.81 |
| NEWS logistic | 0.7481 | 0.6664 to 0.8154 | 0.5840 to 0.8628 | 0.0400 | 58.39 |
ROC-AUC Forest Plot Across Sites
Note: Site-specific confidence intervals were derived from the Hanley-McNeil standard error on the logit(AUC) scale. Weight (%) denotes the random-effects weight contributed by each site on the logit(AUC) scale. Pooled estimates, 95% confidence intervals, and prediction intervals were taken from the random-effects meta-analysis across the four sites.
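For intuition, random-effects pooling on the logit(AUC) scale can be sketched with the simpler DerSimonian-Laird tau² estimator. Note this is only an illustration: the study used Sidik-Jonkman estimation with Hartung-Knapp intervals, which give different weights and typically wider intervals, and the site-level inputs below are hypothetical:

```python
import numpy as np

def random_effects_pool(theta, se):
    """Illustrative random-effects pooling using the DerSimonian-Laird
    tau^2 estimator (NOT the Sidik-Jonkman / Hartung-Knapp procedure
    used in the study)."""
    theta, se = np.asarray(theta), np.asarray(se)
    w = 1 / se**2                                   # fixed-effect weights
    theta_fe = np.sum(w * theta) / np.sum(w)
    q = np.sum(w * (theta - theta_fe) ** 2)         # Cochran's Q
    k = len(theta)
    # DerSimonian-Laird between-site variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1 / (se**2 + tau2)                       # random-effects weights
    pooled = np.sum(w_re * theta) / np.sum(w_re)
    return pooled, tau2

# Hypothetical site-level logit(AUC) estimates and standard errors.
pooled, tau2 = random_effects_pool([1.45, 1.20, 1.62, 1.38],
                                   [0.15, 0.16, 0.15, 0.17])
```

Back-transforming `pooled` through the inverse logit recovers a pooled AUC on the familiar 0-1 scale.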
TabICL Versus NEWS Logistic
Note: For each model, predictions from all held-out sites were concatenated so that each curve reflects pooled out-of-site performance across the full internal-external validation dataset.
Interpretation
TabICL achieved the numerically highest pooled ROC-AUC. Its 95% prediction interval was also among the narrowest of the evaluated models, reflecting lower between-site heterogeneity than most alternatives. Inspection of calibration measures likewise indicated that TabICL had the most consistent calibration across sites, with CITL values closest to 0 and slope values closest to 1.
Developing a new model to predict in-hospital mortality after MET review using an internal-external validation approach, which allowed for training on data from multiple sites, did not substantially improve calibration compared to the REMAIN logistic regression model.
On balance, given similar estimates of calibration, the TabICL model may be preferred given its slightly higher and more consistent performance in discrimination across sites.