Locally Certified Worst-Group Reliability under Covariate Shift: When Subgroup Alarms Are Evidence
Authors/Creators
Description
This paper studies how to reliably detect subgroup failures in machine learning models under distribution shift.
In many real-world deployments, models are evaluated on a source dataset but applied to a different target population. A common approach is to use importance weighting to estimate performance on the target distribution. However, this work shows that standard “worst-case subgroup” evaluations can be misleading: a subgroup can show a large estimated error simply because too little data falls in that region to support the claim.
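As a rough illustration of the importance-weighting step described above, the following sketch computes a self-normalized weighted error for one subgroup. The function name `weighted_error` and the toy weights are hypothetical and are not taken from the preprint or its repository; the weights stand in for estimated density ratios between target and source.

```python
import numpy as np

def weighted_error(losses, weights):
    """Self-normalized importance-weighted mean of per-example losses."""
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * losses) / np.sum(weights))

# Toy subgroup: three source examples, only the last one is misclassified.
losses = [0.0, 0.0, 1.0]
# Hypothetical density ratios (target / source); the last example dominates.
weights = [1.0, 1.0, 8.0]

unweighted = float(np.mean(losses))          # error rate on the source sample
reweighted = weighted_error(losses, weights)  # estimated error on the target
print(unweighted, reweighted)
```

In this toy case a single heavily weighted example drives the target estimate from about 0.33 to 0.8, which is the kind of large but thinly supported error the description warns about.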
The main contribution is a framework to determine when a detected subgroup failure is actually supported by sufficient evidence. Instead of relying on global metrics, the method introduces local confidence certificates based on the effective sample size within each subgroup. This makes it possible to distinguish between:
- real failures that require action
- apparent failures caused by insufficient data
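The distinction above can be sketched with a simple gate on the effective sample size. This is a minimal sketch under stated assumptions, not the certificate defined in the preprint: it assumes the Kish formula for effective sample size, and the names `kish_ess` and `audit_subgroup` as well as the thresholds `alarm_threshold` and `min_ess` are hypothetical.

```python
import numpy as np

def kish_ess(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))

def audit_subgroup(losses, weights, alarm_threshold=0.3, min_ess=30.0):
    """Label one subgroup as 'alarm', 'ok', or 'insufficient evidence'.

    alarm_threshold and min_ess are illustrative values (assumptions),
    not thresholds from the preprint.
    """
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Refuse to conclude anything when the weights leave too little
    # effective data in this subgroup.
    if kish_ess(weights) < min_ess:
        return "insufficient evidence"
    error = float(np.sum(weights * losses) / np.sum(weights))
    return "alarm" if error > alarm_threshold else "ok"

# A subgroup dominated by one heavily weighted example: ESS is about 1.5.
print(audit_subgroup([0.0, 0.0, 1.0], [1.0, 1.0, 8.0]))
# A subgroup with 100 evenly weighted examples, half of them wrong.
print(audit_subgroup([1.0] * 50 + [0.0] * 50, [1.0] * 100))
```

The first call returns `insufficient evidence` even though the weighted error is high, while the second returns `alarm` because the same kind of error is backed by an effective sample size of 100.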
Experiments on synthetic and real datasets (Adult Census, ACSIncome) show that standard methods often produce false alarms, while the proposed approach correctly identifies when there is insufficient evidence to draw conclusions.
Overall, this work provides a more reliable way to evaluate model robustness under distribution shift, with direct implications for high-stakes applications such as healthcare and public decision systems.
Files
| Name | Size |
|---|---|
| preprint.pdf (md5:7e7c6872c333f21a0967b137a240cc28) | 522.2 kB |
Additional details
Software
- Repository URL: https://github.com/MehdiOudghiri1/shiftstat
- Programming language: Python
References
- Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013. URL https://arxiv.org/abs/1306.1059
- Yeounoh Chung, Tim Kraska, Neoklis Polyzotis, Ki Hyun Tae, and Steven Euijong Whang. Slice finder: Automated data slicing for model validation. In Proceedings of the 35th IEEE International Conference on Data Engineering, 2019. URL https://research.google/pubs/slice-finder-automated-data-slicing-for-model-validation/
- Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring adult: New datasets for fair machine learning. In Advances in Neural Information Processing Systems, volume 34, pages 6478–6490, 2021. URL https://papers.nips.cc/paper/2021/hash/32e54441e6382a7fbacbbbaf3c450059-Abstract.html
- Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015. URL https://www.cis.upenn.edu/~aaroth/reusable.html
- Ursula Hébert-Johnson, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. Multicalibration: Calibration for the computationally-identifiable masses. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1939–1948, 2018. URL https://proceedings.mlr.press/v80/hebert-johnson18a.html
- Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2564–2572, 2018. URL https://proceedings.mlr.press/v80/kearns18a.html
- Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 202–207, 1996. URL https://archive.ics.uci.edu/dataset/2/adult
- Sangdon Park, Osbert Bastani, James Weimer, and Insup Lee. Calibrated prediction with covariate shift via unsupervised domain adaptation, 2020. URL https://arxiv.org/abs/2003.00343
- Felipe Maia Polo and Renato Vicente. Effective sample size, dimensionality, and generalization in covariate shift adaptation. Neural Computing and Applications, 35(25):18187–18199, 2023. URL https://link.springer.com/article/10.1007/s00521-023-08492-8
- Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007. URL https://www.jmlr.org/papers/v8/sugiyama07a.html
- Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel J. Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://papers.nips.cc/paper/8522-conformal-prediction-under-covariate-shift
- Yachong Yang, Arun K. Kuchibhotla, and Eric J. Tchetgen Tchetgen. Doubly robust calibration of prediction sets under covariate shift. Journal of the Royal Statistical Society Series B, 86(4):943–965, 2024. URL https://academic.oup.com/jrsssb/article/86/4/943/7618755