COMFORTER study analysis

Published

April 27, 2026

Methods

We evaluated the transportability of the REMAIN mortality model and, in response to evidence of calibration instability, assessed alternative models developed using a leave-one-site-out internal-external validation framework across the four participating sites. For the original REMAIN model, Site 2 was treated as the temporal validation cohort because it represented a later cohort from the development site, whereas Sites 1, 3, and 4 were treated as external validation cohorts.

Models

For the REMAIN validation, the published logistic regression coefficients were applied unchanged to each patient in the validation dataset, without model refitting or recalibration. The REMAIN model included age and age squared together with categorical terms for sex, comorbidity burden, acute resuscitation plan (ARP), acuity-dependency profile, hospital admission in the previous 24 hours, and surgery in the previous 24 hours.
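
Operationally, applying the published coefficients unchanged reduces to computing each patient's linear predictor from a fixed coefficient vector and passing it through the inverse logit, with no refitting step. A minimal sketch; the coefficient values and the reduced predictor set below are illustrative placeholders, not the published REMAIN coefficients:

```python
import math

# Illustrative coefficients only -- NOT the published REMAIN values.
COEFS = {
    "intercept": -4.0,
    "age": 0.05,
    "age_sq": -0.0002,
    "male": 0.10,
    "arp": 1.20,
}

def predict_mortality(patient: dict) -> float:
    """Fixed-coefficient logistic prediction: no refitting, no recalibration."""
    lp = COEFS["intercept"]
    lp += COEFS["age"] * patient["age"]
    lp += COEFS["age_sq"] * patient["age"] ** 2
    lp += COEFS["male"] * patient["male"]
    lp += COEFS["arp"] * patient["arp"]
    return 1.0 / (1.0 + math.exp(-lp))  # inverse logit

p = predict_mortality({"age": 80, "male": 1, "arp": 1})
```

Because nothing is re-estimated, any miscalibration observed at a validation site reflects transportability of the original model rather than estimation in the new data.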

For model re-development, we used a predictor set that built on prior insights from REMAIN about the utility of illness severity and care dependency in predicting in-hospital mortality. However, we simplified measurement of care dependency so that it was more amenable to point-of-care measurement. In addition, the acuity-dependency profile was not included in the model; instead, direct measurements of acuity and dependency were used. Predictors included: age, sex, ARP, hospital admission in the previous 24 hours, surgery in the previous 24 hours, ICU discharge in the previous 24 hours, MET call history in the previous 24 hours, NEWS, modified Index of Caring Dependency, and SOFA. A logistic regression model predicting in-hospital mortality with only the National Early Warning Score (NEWS) was used as a baseline comparator, because the NEWS is the tool currently used in practice to support escalation and decisions about seeking additional assistance for clinical deterioration. Candidate models were TabICL, group lasso logistic regression, group ridge logistic regression, AutoGluon tabular ensembles, and random forest.

TabICL (Tabular In-context Learning) is a pre-trained transformer-based tabular foundation model and a form of prior-data fitted network (Qu et al. 2026). Unlike a conventional regression or classification model, it does not estimate a new set of study-specific coefficients to be used for subsequent prediction. Instead, labelled data in a tabular format are supplied as contextual examples at the time predictions are generated, and the model estimates the probability of in-hospital mortality for a new patient by conditioning on the predictor-outcome relationships represented in those examples. We chose to evaluate the predictive performance of a tabular foundation model for this prediction task because the relationships between mortality and the candidate predictors may be non-linear and may depend on higher-order interactions among physiologic severity measures and recent markers of instability along with measures of care dependency.
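
The fit-free, condition-on-context interface can be illustrated with a deliberately crude stand-in: predict for a new patient by weighting labelled context rows by similarity. This toy kernel vote only mimics the interface described above; the actual TabICL transformer conditions on the context examples in a far richer, learned way:

```python
import math

def icl_predict(context_X, context_y, query, bandwidth=1.0):
    """Toy 'in-context' prediction: there is no fitting step. Labelled
    context rows are supplied at prediction time and the estimate
    conditions on them via a Gaussian-kernel weighted average of the
    context outcomes."""
    weights = []
    for row in context_X:
        d2 = sum((a - b) ** 2 for a, b in zip(row, query))
        weights.append(math.exp(-d2 / (2 * bandwidth ** 2)))
    return sum(w * y for w, y in zip(weights, context_y)) / sum(weights)

# Two clusters of context examples: low-risk near (0, 0), high-risk near (3, 3).
X = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (2.9, 3.1)]
y = [0, 0, 1, 1]
p_low = icl_predict(X, y, (0.1, 0.0))
p_high = icl_predict(X, y, (3.0, 3.0))
```

The point of the analogy is the workflow: swapping in a different context dataset changes the predictions immediately, with no study-specific coefficients estimated or stored.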

AutoGluon was used as an automated tabular prediction framework that trains multiple candidate learners and combines them through ensemble learning rather than selecting a single best model class (Erickson et al. 2020). Several candidate learners were trained including gradient boosting machines, CatBoost, XGBoost, random forest, extremely randomized trees, k-nearest neighbours, and linear models. These models were combined using five-fold bagging and one level of stacked ensembling, so that out-of-fold predictions from the base learners could be used to train a higher-level weighted ensemble while limiting optimism from in-sample prediction. This approach was included to evaluate whether combining multiple tabular learning algorithms with different inductive biases would improve predictive performance and robustness relative to reliance on a single modelling framework.
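
The optimism-limiting mechanism described above, training the higher-level ensemble on out-of-fold rather than in-sample predictions, can be sketched with a trivial stand-in learner (AutoGluon's real base models are gradient-boosted trees and similar; the "learner" here just predicts the training-set event rate):

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def mean_rate_learner(y_train):
    """Trivial base learner: predict the training-set event rate."""
    rate = sum(y_train) / len(y_train)
    return lambda _x: rate

def out_of_fold_predictions(X, y, k=5):
    """Each observation is predicted by a model whose training folds
    exclude it; these out-of-fold predictions can then train a
    higher-level weighted ensemble without in-sample optimism."""
    oof = [None] * len(y)
    for test_idx in kfold_indices(len(y), k):
        held = set(test_idx)
        train_y = [y[i] for i in range(len(y)) if i not in held]
        model = mean_rate_learner(train_y)
        for i in test_idx:
            oof[i] = model(X[i])
    return oof

y = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
oof = out_of_fold_predictions(list(range(10)), y, k=5)
```
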

For the grouped penalised logistic models, continuous predictors were represented using spline terms and grouped penalties were applied so that related terms entered or shrank together. Grouping was used because several predictors were represented by multiple derived coefficients, including spline basis terms for continuous variables and sets of coefficients arising from categorical encoding. Penalising these related coefficients at the group level was intended to reduce unstable selection of isolated transformed terms, to preserve more coherent variable-level effects, and to retain a clinically interpretable linear modelling framework while still allowing greater flexibility than a conventional main-effects logistic regression (Yuan and Lin 2006).

An expanded set of predictors that included component-level NEWS, SOFA, and mICD variables was evaluated with TabICL and AutoGluon, as well as the random forest model. These approaches were considered better suited than the linear models to a larger and more granular predictor space because they can accommodate correlated component variables, non-linear effects, and higher-order interactions without requiring these structures to be specified a priori.

Data analysis

Performance of the REMAIN model was assessed for the whole dataset as an overall measure of external validation as well as separately at each site. Site 2 was treated as a temporal validation subset of the total external validation dataset and Sites 1, 3, and 4 as strict external validation. Discrimination was summarised using the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC; average precision), overall prediction error using the Brier score, and calibration using calibration-in-the-large and calibration slope, estimated by regressing the observed outcome on the logit of the predicted probability. Site-level 95% confidence intervals for ROC-AUC, PR-AUC, Brier score, calibration-in-the-large, and calibration slope were estimated from 500 non-parametric bootstrap resamples. Pooled estimates were obtained across Sites 1, 3, and 4 using random-effects meta-analysis with Sidik-Jonkman heterogeneity estimation and Hartung-Knapp confidence and prediction intervals as a sensitivity analysis for external validation of the REMAIN model. For the pooled ROC-AUC meta-analysis, within-site variance was estimated using the Hanley-McNeil method. PR-AUC and Brier score were pooled from site-specific estimates and corresponding standard errors.
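
The discrimination and error metrics above can be sketched from first principles: ROC-AUC as the Mann-Whitney probability that a randomly chosen death is ranked above a randomly chosen survivor, the Brier score as the mean squared error of the predicted probabilities, and the Hanley-McNeil formula for the within-site AUC standard error. A minimal sketch:

```python
def roc_auc(y, p):
    """ROC-AUC as the Mann-Whitney statistic: P(score_event > score_nonevent),
    counting ties as 1/2."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def brier(y, p):
    """Mean squared error of the predicted probabilities."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Hanley-McNeil standard error of the AUC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return var ** 0.5

y = [1, 1, 0, 0, 0]
p = [0.9, 0.4, 0.5, 0.2, 0.1]
```

Calibration-in-the-large and the calibration slope are not shown here; as described above, they come from regressing the observed outcome on the logit of the predicted probability.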

The new models were developed using leave-one-site-out internal-external validation. In each iteration, one site was held out for testing and the candidate model was trained on the remaining three sites. This process was repeated so that each site served once as the external holdout. For the grouped penalised logistic models, preprocessing and design-matrix construction were repeated within each training set so that no information from the held-out site was used in feature engineering or tuning. Pre-specified continuous predictors were represented using spline basis functions, remaining linear and binary predictors entered as single coefficients, pairwise product terms among the predictors selected for spline modelling were added as individual interaction terms, and categorical predictors were represented by one-hot-encoded indicator variables. Groups were then defined at the predictor level: all spline basis coefficients for a given continuous predictor formed one group, all indicator variables arising from a categorical predictor formed one group, and each remaining linear or interaction term formed its own group. The resulting design matrix was standardised before penalised fitting, and the intercept was not penalised.
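
The leave-one-site-out loop itself is simple; a schematic sketch with placeholder fit and score functions shows how each site serves exactly once as the external holdout while all fitting sees only the other sites:

```python
def leave_one_site_out(data_by_site, fit, score):
    """Internal-external validation: in each iteration one site is held
    out and the model is trained only on the remaining sites, so nothing
    inside `fit` (preprocessing, tuning) ever sees the holdout."""
    results = {}
    for holdout in data_by_site:
        train = [row for site, rows in data_by_site.items()
                 if site != holdout for row in rows]
        model = fit(train)
        results[holdout] = score(model, data_by_site[holdout])
    return results

# Toy demo: the 'model' is the training-set event rate and the 'score'
# is its mean absolute error on the held-out site.
data = {"Site 1": [0, 1, 0], "Site 2": [1, 1, 0],
        "Site 3": [0, 0, 0], "Site 4": [1, 0, 1]}
fit = lambda rows: sum(rows) / len(rows)
score = lambda m, rows: sum(abs(m - r) for r in rows) / len(rows)
res = leave_one_site_out(data, fit, score)
```
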

For group lasso, penalisation was applied to each coefficient group using the Euclidean norm of that group, with the contribution of each group weighted by the square root of the number of coefficients it contained (Yuan and Lin 2006). For grouped ridge, penalisation was applied to the squared Euclidean norm of each coefficient group, with each group weighted inversely to its size, so that larger groups, such as spline expansions or multi-level categorical predictors, were not penalised more heavily solely because they contributed more coefficients (Yuan and Lin 2006). Group lasso was estimated using accelerated proximal-gradient optimisation and grouped ridge using L-BFGS optimisation. The spline basis for these grouped linear models was selected during preliminary model tuning by evaluating candidate specifications with 3, 4, or 5 knots and degree 2 or 3, ranked by validation ROC-AUC with Brier score used to break ties. The selected basis was then held fixed for the internal-external validation analyses (group lasso: 5 knots, degree 3; group ridge: 3 knots, degree 3). Within each internal-external training set, the penalty parameter was re-selected using inner 5-fold stratified cross-validation. Candidate penalty values were 0.001, 0.003, 0.01, 0.03, and 0.10 for group lasso and 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, and 0.3 for group ridge, ranked by mean cross-validated ROC-AUC with mean Brier score used to break ties. AutoGluon models were trained within the non-held-out sites and then refit on the full training data before generating holdout predictions. Model comparison focused on held-out-site ROC-AUC pooled across the four sites using the same random-effects meta-analytic framework, while PR-AUC, Brier score, calibration-in-the-large, and calibration slope were summarised descriptively at the site level. For overall discrimination and calibration figures, out-of-site predictions were concatenated across held-out sites for each model.
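
In the proximal-gradient updates, the weighted group lasso penalty acts through block soft-thresholding: each group's coefficient block is shrunk toward zero by an amount proportional to the penalty times the square root of the group size, and is set exactly to zero when its Euclidean norm falls below that threshold, which is what makes whole groups enter or leave together. A sketch of the proximal operator under the weighting described above:

```python
import math

def group_soft_threshold(beta, groups, lam, step=1.0):
    """Proximal operator of the sqrt(group-size)-weighted group lasso
    penalty: shrink each group's block by step * lam * sqrt(len(g)) in
    Euclidean norm, zeroing the whole block when its norm is below the
    threshold."""
    out = list(beta)
    for g in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in g))
        thresh = step * lam * math.sqrt(len(g))
        if norm <= thresh:
            for j in g:
                out[j] = 0.0  # the whole group is selected out together
        else:
            scale = 1.0 - thresh / norm
            for j in g:
                out[j] = scale * beta[j]
    return out

# Two groups: a 4-term spline block and a single linear term.
beta = [0.05, -0.03, 0.02, 0.01, 0.80]
groups = [[0, 1, 2, 3], [4]]
shrunk = group_soft_threshold(beta, groups, lam=0.05)
```

Here the small spline block is removed as a unit while the strong single coefficient is only mildly shrunk, which is the behaviour that motivates grouping spline and indicator terms.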

REMAIN external and temporal validation summary

  • Outcome: In-hospital mortality (DeathHospDisch)
  • Model: Logistic regression using the REMAIN coefficients
  • External validation sites: Site 1, Site 3, Site 4
  • Temporal validation site: Site 2
  • Primary metrics: ROC-AUC, PR-AUC (average precision), Brier score, calibration slope/intercept.

Interpretation

  • Discrimination was moderate but varied across sites, indicating that the predictor set retains useful ranking ability outside the development setting.

  • Calibration slope suggested potential instability in transportability.

    • Temporal validation showed a slope <1 (0.825), indicating that predictions were too extreme, consistent with either overfitting or temporal changes in predictor–outcome relationships.
    • External validation cohorts showed variable slope estimates: one cohort had a slope <1 (0.789), suggesting a risk of overfitting similar to that seen in temporal validation, whereas another had a slope >1 (1.218), indicating that estimated risks were too moderate (not extreme enough).
    • Overall, these findings suggest that the development model may have been overfitted to the development data, producing overconfident predictions that did not generalise well.
  • Consistency of calibration findings: Indicators of poor calibration were observed across both temporal and external validation, suggesting that these findings are unlikely to be due solely to sampling variability.

  • Brier score interpretation: Brier scores were similar across cohorts and did not fully reflect differences in calibration, indicating that overall prediction error may remain comparable even when probability calibration differs.

  • Rationale for a new model vs. only recalibration: Variation in slope direction across cohorts (both <1 and >1) suggests that a single global recalibration is unlikely to fully address calibration differences.

Site-Level Validation Metrics

Validation type      Site    N    Deaths  CITL (intercept)  Calibration slope  ROC-AUC (95% CI)        PR-AUC (95% CI)         Brier (95% CI)
External validation  Site 1  485  95      0.159             1.020              0.811 (0.766 to 0.855)  0.544 (0.448 to 0.639)  0.123 (0.105 to 0.141)
External validation  Site 3  485  105     0.077             0.789              0.766 (0.719 to 0.813)  0.484 (0.388 to 0.580)  0.146 (0.124 to 0.168)
External validation  Site 4  482  110     0.382             1.218              0.835 (0.796 to 0.875)  0.597 (0.503 to 0.692)  0.131 (0.113 to 0.149)
Temporal validation  Site 2  485  90      -0.070            0.825              0.791 (0.740 to 0.841)  0.495 (0.393 to 0.596)  0.123 (0.102 to 0.143)

Site-level 95% confidence intervals were estimated from 500 nonparametric bootstrap replicates.

Random-Effects Meta-Analysis (Pooled External Validation)

Metric                      Pooled estimate (95% CI)  95% PI          Tau²   I² (%)
ROC-AUC                     0.800 (0.748 to 0.844)    0.708 to 0.869  0.016  14.2
PR-AUC (average precision)  0.529 (0.447 to 0.611)    0.393 to 0.661  0.019  2.5
Brier score                 0.131 (0.115 to 0.148)    0.105 to 0.162  0.004  4.8

Overall sample: N = 1937, deaths = 400, prevalence = 0.207.
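
The random-effects pooling has the same skeleton regardless of the heterogeneity estimator. The sketch below uses the simpler DerSimonian-Laird estimator on the logit(AUC) scale, purely to illustrate the mechanics; the analysis itself used Sidik-Jonkman estimation with Hartung-Knapp intervals, and the input values here are illustrative:

```python
import math

def dl_random_effects(estimates, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator
    (illustrative; the reported analysis used Sidik-Jonkman with
    Hartung-Knapp confidence and prediction intervals)."""
    k = len(estimates)
    w = [1.0 / v for v in variances]
    mu_fe = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - mu_fe) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)          # between-site variance
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    mu_re = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se_re = math.sqrt(1.0 / sum(w_re))
    return mu_re, se_re, tau2

# Illustrative site-level logit(AUC) estimates and within-site variances
# (the latter from, e.g., the Hanley-McNeil method).
logit_aucs = [1.45, 1.19, 1.62]
variances = [0.015, 0.013, 0.017]
mu, se, tau2 = dl_random_effects(logit_aucs, variances)
```

Pooled estimates on the logit scale are back-transformed through the inverse logit for reporting.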

Receiver operating characteristic curves for in-hospital mortality using the REMAIN model

Precision-recall curves for in-hospital mortality using the REMAIN model

Calibration plots for in-hospital mortality using the REMAIN model

Evaluating new models with internal-external validation

  • The above results suggest that the REMAIN logistic regression model demonstrates reasonable discrimination but variable calibration across external and temporal validation cohorts.
  • Investigated models and techniques that could reduce the risk of poor calibration when applying a model at a new site, including penalised regression approaches intended to reduce the risk of overfitting.
  • Used an internal-external validation approach so that each model could be trained on data from multiple sites, reducing the risk of overfitting to any single site.
  • For group lasso and group ridge, the regularisation strength (λ) was selected within each non-holdout training set using inner 5-fold stratified cross-validation before scoring the held-out site.
  • Also used a simplified measurement of care dependency and a set of predictors that are simpler to assess by a MET team at the time of decision-making and that we considered would be less susceptible to between-site measurement variability.

Site-Level Holdout Results

Site    n    Events  Prevalence  Model                                 ROC-AUC  PR-AUC  Brier   CITL     Slope
Site 1  485  95      0.1959      TabICL (Expanded feature set)         0.8513   0.5996  0.1142  -0.0416  0.8886
Site 1  485  95      0.1959      TabICL                                0.8489   0.5835  0.1145  -0.0532  0.9241
Site 1  485  95      0.1959      AutoGluon (Expanded feature set)      0.8487   0.5867  0.1143  -0.0427  0.9475
Site 1  485  95      0.1959      Group ridge                           0.8458   0.5732  0.1156  0.1523   1.1792
Site 1  485  95      0.1959      Group lasso                           0.8445   0.5795  0.1155  0.0300   1.1050
Site 1  485  95      0.1959      Random forest (Expanded feature set)  0.8442   0.5425  0.1267  -0.6731  1.0702
Site 1  485  95      0.1959      AutoGluon                             0.8386   0.5666  0.1165  -0.0362  1.0731
Site 1  485  95      0.1959      Random forest                         0.8371   0.5214  0.1372  -0.9011  0.9647
Site 1  485  95      0.1959      NEWS logistic                         0.7589   0.4261  0.1343  0.0672   1.0873
Site 2  485  90      0.1856      TabICL (Expanded feature set)         0.8234   0.5005  0.1226  -0.3697  0.7870
Site 2  485  90      0.1856      AutoGluon                             0.8154   0.4931  0.1233  -0.3669  0.8569
Site 2  485  90      0.1856      Random forest (Expanded feature set)  0.8150   0.4966  0.1361  -0.7969  0.9427
Site 2  485  90      0.1856      TabICL                                0.8132   0.5045  0.1237  -0.3483  0.7508
Site 2  485  90      0.1856      Random forest                         0.8099   0.4898  0.1433  -0.9567  0.8742
Site 2  485  90      0.1856      AutoGluon (Expanded feature set)      0.8083   0.4732  0.1267  -0.4370  0.7905
Site 2  485  90      0.1856      Group lasso                           0.8060   0.4847  0.1261  -0.3725  0.8396
Site 2  485  90      0.1856      Group ridge                           0.8008   0.4942  0.1251  -0.3796  0.8176
Site 2  485  90      0.1856      NEWS logistic                         0.7258   0.3473  0.1389  -0.2711  0.9039
Site 3  485  105     0.2165      TabICL (Expanded feature set)         0.8055   0.5612  0.1338  -0.0865  0.7278
Site 3  485  105     0.2165      AutoGluon (Expanded feature set)      0.8009   0.5571  0.1333  -0.0278  0.8413
Site 3  485  105     0.2165      TabICL                                0.8001   0.5648  0.1334  -0.1135  0.6933
Site 3  485  105     0.2165      Group ridge                           0.7943   0.5719  0.1350  -0.1128  0.7466
Site 3  485  105     0.2165      Random forest (Expanded feature set)  0.7942   0.5726  0.1419  -0.5358  0.8370
Site 3  485  105     0.2165      AutoGluon                             0.7926   0.5668  0.1337  -0.0534  0.8914
Site 3  485  105     0.2165      Group lasso                           0.7917   0.5658  0.1351  -0.0856  0.7834
Site 3  485  105     0.2165      Random forest                         0.7849   0.5536  0.1512  -0.7195  0.7251
Site 3  485  105     0.2165      NEWS logistic                         0.6995   0.3659  0.1575  -0.3319  0.7244
Site 4  482  110     0.2282      Group lasso                           0.8940   0.7306  0.1133  0.8786   1.7451
Site 4  482  110     0.2282      TabICL (Expanded feature set)         0.8931   0.7263  0.1126  0.6853   1.4088
Site 4  482  110     0.2282      AutoGluon                             0.8898   0.7252  0.1140  0.7005   1.6921
Site 4  482  110     0.2282      AutoGluon (Expanded feature set)      0.8890   0.7039  0.1155  0.7120   1.4556
Site 4  482  110     0.2282      TabICL                                0.8867   0.7122  0.1136  0.5004   1.3509
Site 4  482  110     0.2282      Group ridge                           0.8866   0.7160  0.1159  0.8510   1.7632
Site 4  482  110     0.2282      Random forest (Expanded feature set)  0.8765   0.6812  0.1235  -0.4258  1.4764
Site 4  482  110     0.2282      Random forest                         0.8689   0.6972  0.1353  -0.8325  1.3202
Site 4  482  110     0.2282      NEWS logistic                         0.8056   0.5224  0.1422  0.6235   1.3309

Random-Effects Meta-Analysis (Pooled Across Sites)

Model                                 Pooled ROC-AUC  95% CI            95% PI            Tau² (logit AUC)  I² (%)
TabICL (Expanded feature set)         0.8443          0.7698 to 0.8979  0.6843 to 0.9313  0.0600            56.92
TabICL                                0.8382          0.7631 to 0.8928  0.6780 to 0.9273  0.0578            57.11
AutoGluon (Expanded feature set)      0.8380          0.7586 to 0.8948  0.6654 to 0.9308  0.0657            60.41
Group lasso                           0.8360          0.7442 to 0.8993  0.6279 to 0.9390  0.0896            68.30
AutoGluon                             0.8354          0.7534 to 0.8939  0.6558 to 0.9311  0.0694            62.13
Group ridge                           0.8334          0.7493 to 0.8933  0.6483 to 0.9314  0.0722            63.77
Random forest (Expanded feature set)  0.8330          0.7658 to 0.8838  0.6966 to 0.9154  0.0418            48.61
Random forest                         0.8256          0.7587 to 0.8769  0.6914 to 0.9091  0.0388            47.81
NEWS logistic                         0.7481          0.6664 to 0.8154  0.5840 to 0.8628  0.0400            58.39

ROC-AUC Forest Plot Across Sites

Forest plot of site-specific and pooled ROC-AUC estimates from internal-external validation. Abbreviations: ROC-AUC, area under the receiver operating characteristic curve; CI, confidence interval; PI, prediction interval; Tau², between-site variance on the logit(AUC) scale; I², inconsistency statistic.

Note: Site-specific confidence intervals were derived from the Hanley-McNeil standard error on the logit(AUC) scale. Weight (%) denotes the random-effects weight contributed by each site on the logit(AUC) scale. Pooled estimates, 95% confidence intervals, and prediction intervals were taken from the random-effects meta-analysis across the four sites.

TabICL Versus NEWS Logistic

Receiver operating characteristic and calibration plots comparing the TabICL model with expanded feature set and NEWS logistic, using concatenated out-of-site predictions across all held-out sites from internal-external validation

Note: For each model, predictions from all held-out sites were concatenated so that each curve reflects pooled out-of-site performance across the full internal-external validation dataset.

Interpretation

  • TabICL achieved the numerically highest pooled ROC-AUC. Its 95% prediction interval was also among the narrowest of the models evaluated, reflecting comparatively low heterogeneity across sites. Inspection of calibration measures likewise suggested relatively consistent calibration across sites for TabICL, with CITL values close to 0 and slope values close to 1 at most sites.

  • Developing a new model to predict in-hospital mortality after MET review using an internal-external validation approach, which allowed for training on data from multiple sites, did not substantially improve calibration compared to the REMAIN logistic regression model.

  • On balance, given broadly similar calibration across candidate models, the TabICL model may be preferred because of its slightly higher and more consistent discrimination across sites.

TabICL With Expanded Feature Set Holdout-Site Plots (Internal-External)

Receiver operating characteristic curves for in-hospital mortality using the TabICL model with expanded feature set

Precision-recall curves for in-hospital mortality using the TabICL model with expanded feature set

Calibration plots for in-hospital mortality using the TabICL model with expanded feature set

Group Ridge Holdout-Site Plots (Internal-External)

Receiver operating characteristic curves for in-hospital mortality using the Group ridge model

Precision-recall curves for in-hospital mortality using the Group ridge model

Calibration plots for in-hospital mortality using the Group ridge model

References

Erickson, Nick, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. 2020. “AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.” https://arxiv.org/abs/2003.06505.
Qu, Jingang, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. 2026. “TabICLv2: A Better, Faster, Scalable, and Open Tabular Foundation Model.” https://arxiv.org/abs/2602.11139.
Yuan, Ming, and Yi Lin. 2006. “Model Selection and Estimation in Regression with Grouped Variables.” Journal of the Royal Statistical Society Series B: Statistical Methodology 68 (1): 49–67.