Locally Certified Worst-Group Reliability under Covariate Shift: When Subgroup Alarms Are Evidence

Oudghiri, Mehdi

doi:10.5281/zenodo.19708008

Published February 4, 2026 | Version v2

Preprint Open

This paper studies how to reliably detect subgroup failures in machine learning models under distribution shift.

In many real-world deployments, models are evaluated on a source dataset but applied to a different target population. A common approach is to use importance weighting to estimate performance on the target distribution. However, this work shows that standard “worst-case subgroup” evaluations can be misleading: a subgroup can show a large estimated error simply because it contains too little data to support the claim.
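As an illustration of the importance-weighting step mentioned above, here is a minimal sketch of a self-normalized weighted error estimate. It is not taken from the paper or its repository: the function name `weighted_error` and the toy numbers are hypothetical, and the weights are assumed to be given density ratios between target and source.

```python
def weighted_error(losses, weights):
    """Self-normalized importance-weighted mean of per-example losses.

    losses:  per-example losses measured on source data (e.g. 0/1 errors)
    weights: assumed importance weights, target density / source density
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * l for w, l in zip(weights, losses)) / total


# Toy data (hypothetical): the second example is three times more
# representative of the target population than the others.
losses = [0.0, 1.0, 0.0, 1.0]
weights = [1.0, 3.0, 1.0, 1.0]
target_error = weighted_error(losses, weights)
```

The unweighted error of this toy sample is 0.5, while the weighted estimate is pulled toward the heavily weighted example, which is how a few large weights can dominate an estimate.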

The main contribution is a framework for deciding when a detected subgroup failure is supported by sufficient evidence. Instead of relying on global metrics, the method introduces local confidence certificates based on the effective sample size within each subgroup. This makes it possible to distinguish between:

real failures that require action, and
apparent failures caused by insufficient data.
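The idea of gating a subgroup alarm on its effective sample size can be sketched as follows. This is an assumption-laden illustration, not the paper's certificate: it uses the standard Kish effective sample size, (sum of weights)^2 / (sum of squared weights), and a hypothetical cutoff `min_ess`; the names `kish_ess` and `audit_subgroup` are made up for this example.

```python
def kish_ess(weights):
    """Kish effective sample size: (sum w)^2 / (sum w^2)."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s * s) / s2 if s2 > 0 else 0.0


def audit_subgroup(losses, weights, min_ess=30.0):
    """Report a subgroup's weighted error together with an evidence verdict.

    min_ess is a hypothetical threshold chosen for illustration only.
    """
    ess = kish_ess(weights)
    total = sum(weights)
    error = (
        sum(w * l for w, l in zip(weights, losses)) / total
        if total > 0
        else float("nan")
    )
    verdict = "supported" if ess >= min_ess else "insufficient evidence"
    return {"error": error, "ess": ess, "verdict": verdict}


# Toy subgroup (hypothetical): four examples, one with a dominant weight.
# The weighted error looks alarming, but the effective sample size is close to 1.
report = audit_subgroup([1, 1, 0, 1], [10.0, 0.1, 0.1, 0.1])
```

In this toy case the subgroup's weighted error is very high, yet almost all of the weight sits on a single example, so the sketch flags the alarm as lacking evidence rather than as a confirmed failure.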

Experiments on synthetic and real datasets (Adult Census, ACSIncome) show that standard methods often produce false alarms, while the proposed approach flags the cases where there is insufficient evidence to draw conclusions.

Overall, this work provides a more reliable way to evaluate model robustness under distribution shift, with direct implications for high-stakes applications such as healthcare and public decision systems.

Files

preprint.pdf (522.2 kB)
md5:7e7c6872c333f21a0967b137a240cc28

Additional details

Repository URL: https://github.com/MehdiOudghiri1/shiftstat
Programming language: Python


