IWBS: Influence-Weighted Bagged Splines for Robust Regression in Small-Data Regimes
Description
High-capacity machine learning models such as Gradient Boosting Machines (GBM) and deep Random Forests often sit on the high-variance side of the bias-variance trade-off when training data is scarce (N < 500): while they offer low bias, they frequently overfit noise or fail to approximate smooth functions due to discrete partitioning. Conversely, classical linear models (OLS) offer stability but lack the capacity to model complex dynamics. This paper introduces Influence-Weighted Bagged Splines (IWBS), an ensemble architecture designed for high-complexity, small-data regimes. IWBS combines the flexibility of randomized additive splines with a novel Out-of-Bag (OOB) Stability Weighting mechanism: unlike standard bagging, which averages learners uniformly, IWBS down-weights ensemble members that exhibit high prediction instability on held-out data. We benchmark IWBS against fully tuned tree ensembles (GBM, Random Forest) and specialized small-data solvers (Gaussian Processes, GAMs, Kernel Ridge Regression) across physical, economic, and biological domains. Results demonstrate that IWBS achieves state-of-the-art performance on signal-rich tasks (Concrete, Moneyball), outperforming both tree-based methods and kernel smoothers by capturing high-frequency nonlinearities without overfitting. We also establish the method's boundary conditions, showing that in high-noise regimes (Diabetes), global smoothers such as Kernel Ridge Regression remain superior to structure-discovery approaches.
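For intuition, the sketch below illustrates the core IWBS idea in R (the project's language): fit spline base learners on bootstrap samples and weight each member by the inverse of its out-of-bag residual variance. This is a minimal univariate illustration under stated assumptions, not the released implementation; the function names (iwbs_fit, iwbs_predict), the use of smooth.spline in place of the paper's randomized additive splines, and the inverse-variance weighting rule are all expository assumptions.

```r
## Minimal sketch of the IWBS idea (assumptions noted above), not the
## authors' implementation from https://github.com/1zzuk1/IWBS.
iwbs_fit <- function(x, y, n_learners = 50) {
  n <- length(y)
  learners <- vector("list", n_learners)
  weights  <- numeric(n_learners)
  for (b in seq_len(n_learners)) {
    idx <- sample(n, n, replace = TRUE)       # bootstrap sample
    oob <- setdiff(seq_len(n), idx)           # held-out (OOB) points
    fit <- smooth.spline(x[idx], y[idx])      # spline base learner
    pred_oob <- predict(fit, x[oob])$y
    instability <- var(y[oob] - pred_oob)     # OOB residual variance
    learners[[b]] <- fit
    weights[b] <- 1 / (instability + 1e-8)    # penalize unstable members
  }
  weights <- weights / sum(weights)           # normalize to convex weights
  list(learners = learners, weights = weights)
}

iwbs_predict <- function(model, x_new) {
  preds <- sapply(model$learners, function(f) predict(f, x_new)$y)
  drop(preds %*% model$weights)               # influence-weighted average
}

## Example usage on a noisy sine curve:
## set.seed(1)
## x <- runif(200); y <- sin(8 * x) + rnorm(200, sd = 0.3)
## m <- iwbs_fit(x, y)
## plot(x, y); lines(sort(x), iwbs_predict(m, sort(x)), col = "red")
```

Under these assumptions, a member whose held-out predictions wander receives a small weight, so the ensemble average is dominated by stable fits; this is the behavior the abstract attributes to OOB Stability Weighting.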
Files

| Name | Size | Checksum |
|---|---|---|
| IWBS.pdf | 342.9 kB | md5:828fa5e6c226790a74488d60fa64b4d2 |
Additional details
Dates
- Created: 2026-01-16

Software
- Repository URL: https://github.com/1zzuk1/IWBS
- Programming language: R
- Development Status: Active
References
- Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
- Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
- Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004. Original source for the Diabetes dataset.
- Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, 1991.
- Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
- Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
- Michael Lewis. Moneyball: The Art of Winning an Unfair Game. W. W. Norton & Company, 2004. Context for the MLB Salary dataset.
- Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
- Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. Super Learner. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.
- I-Cheng Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, 1998.