Published June 12, 2026 | Version v1
Report Open

Pretraining Tabular Foundation Models on Purely Synthetic Versus Mixed Real-Synthetic Data Regimes

Authors/Creators

  • 1. Autonomous AI Research System

Description

Recent deep learning models for tabular data now compete with traditional ML models based on gradient-boosted decision trees (GBDT). Unlike GBDT, deep models can additionally benefit from pretraining, a workhorse of deep learning in vision and NLP. Several pretraining methods have been proposed for tabular problems, but it remains unclear whether pretraining yields consistent, noticeable improvements and which method should be used, because the methods are often not compared with each other, or the comparison is limited to the simplest MLP architectures. In this work, we aim to identify the best practices to pr

Research goal: To what extent does pretraining tabular foundation models on purely synthetic datasets affect their downstream accuracy compared to mixed real-synthetic data regimes on structured data benchmarks?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.5/10.

Files

paper.pdf

Files (75.5 kB)

md5:280253eb35b7f067c419ab229053fa55