Engineering Regression Without Real-Data Training: Domain Adaptation for Tabular Foundation Models Using Multi-Dataset Embeddings

Lyle Regenwetter, Rosen Yu, Cyril Picard, Faez Ahmed · March 7, 2026

#cs.LG

arXiv:2603.04692v1

Abstract: Predictive modeling in engineering applications has long been dominated by bespoke models and small, siloed tabular datasets, limiting the applicability of large-scale learning approaches. Despite recent progress in tabular foundation models, the resulting synthetic training distributions used for pre-training may not reflect the statistical structure of engineering data, limiting transfer to engineering regression. We introduce TREDBench, a curated collection of 83 real-world tabular regression datasets with expert engineering/non-engineering labels, and use TabPFN 2.5's dataset-level embedding to study domain structure in a common representation space. We find that engineering datasets are partially distinguishable from non-engineering datasets, while standard procedurally generated datasets are highly distinguishable from engineering datasets, revealing a substantial synthetic-real domain gap. To bridge this gap without training on real engineering samples, we propose an embedding-guided synthetic data curation method: we generate and identify "engineering-like" synthetic datasets, and perform continued pre-training of TabPFN 2.5 using only the selected synthetic tasks. Across 35 engineering regression datasets, this synthetic-only adaptation improves predictive accuracy and data efficiency, outperforming TabPFN 2.5 on 29/35 datasets and AutoGluon on 27/35, with mean multiplicative data-efficiency gains of 1.75x and 4.44x, respectively. More broadly, our results indicate that principled synthetic data curation can convert procedural generators into domain-relevant "data engines," enabling foundation models to improve in data-sparse scientific and industrial domains where real data collection is the primary bottleneck.

Executive Summary

This study proposes a domain adaptation method for tabular foundation models that uses dataset-level embeddings from TabPFN 2.5 to narrow the synthetic-real domain gap. The authors introduce TREDBench, a curated collection of 83 real-world tabular regression datasets with expert engineering/non-engineering labels, and show that embedding-guided synthetic data curation improves predictive accuracy and data efficiency on engineering regression tasks. After continued pre-training on only the selected 'engineering-like' synthetic datasets, the adapted model outperforms TabPFN 2.5 on 29 of 35 engineering datasets and AutoGluon on 27 of 35, with mean multiplicative data-efficiency gains of 1.75x and 4.44x, respectively. The results matter most for data-sparse scientific and industrial domains, where real data collection is the primary bottleneck.

Key Points

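▸ TREDBench, a curated collection of 83 real-world tabular regression datasets with expert engineering/non-engineering labels
▸ An embedding-guided synthetic data curation method that uses TabPFN 2.5's dataset-level embeddings to identify 'engineering-like' synthetic datasets for continued pre-training
▸ Synthetic-only adaptation that improves predictive accuracy and data efficiency, outperforming TabPFN 2.5 on 29/35 engineering datasets and AutoGluon on 27/35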
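To make the data-efficiency claim concrete, here is a minimal sketch of one way a multiplicative data-efficiency gain could be computed. The abstract does not define the metric, so everything here is an assumption: the gain is read as the ratio between the number of training samples a baseline needs and the number the adapted model needs to reach the same error, and ratios are averaged with a geometric mean. The function names and the learning curves are hypothetical toy values, not results from the paper.

```python
import math


def samples_to_reach(curve, target_error):
    """Smallest training-set size at which `curve` reaches `target_error`.

    `curve` is a list of (n_train, error) pairs sorted by n_train, with error
    assumed to decrease as n_train grows. Interpolates linearly in log(n).
    Returns None if the curve never reaches the target.
    """
    prev_n, prev_err = curve[0]
    if prev_err <= target_error:
        return float(prev_n)
    for n, err in curve[1:]:
        if err <= target_error:
            # Interpolate between the two bracketing points in log-sample space.
            frac = (prev_err - target_error) / (prev_err - err)
            log_n = math.log(prev_n) + frac * (math.log(n) - math.log(prev_n))
            return math.exp(log_n)
        prev_n, prev_err = n, err
    return None


def data_efficiency_gain(baseline_curve, adapted_curve, n_adapted):
    """Samples the baseline needs to match the adapted model, divided by n_adapted."""
    adapted_error = dict(adapted_curve)[n_adapted]
    n_baseline = samples_to_reach(baseline_curve, adapted_error)
    return None if n_baseline is None else n_baseline / n_adapted


def geometric_mean(values):
    """Average of ratios on a multiplicative scale (an assumed aggregation)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Toy learning curves (hypothetical): the adapted model reaches each error
# level with half as many samples as the baseline.
baseline_curve = [(50, 0.40), (100, 0.30), (200, 0.20), (400, 0.10)]
adapted_curve = [(50, 0.30), (100, 0.20), (200, 0.10)]

gains = [data_efficiency_gain(baseline_curve, adapted_curve, n) for n in (50, 100, 200)]
mean_gain = geometric_mean(gains)
print(gains, mean_gain)
```

Under this reading, a gain of 1.75x would mean the baseline needs roughly 1.75 times as many training samples to match the adapted model's error. A geometric mean is used because it treats a 2x gain and a 0.5x loss symmetrically; whether the paper aggregates this way is not stated in the abstract.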

Merits

Strength in Addressing Synthetic-Real Domain Gap

The study narrows the synthetic-real domain gap without training on real engineering samples, helping the foundation model transfer to real-world engineering datasets.
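One simple way to quantify how distinguishable two groups of datasets are in an embedding space is to check how often a basic classifier can tell them apart. The sketch below uses leave-one-out accuracy of a nearest-centroid classifier; this is an illustrative stand-in, not the measure used in the paper, which the abstract does not specify. The embeddings are assumed to be precomputed dataset-level vectors, and the two-dimensional values are hypothetical toy data.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_nearest_centroid_accuracy(group_a, group_b):
    """Leave-one-out accuracy of a nearest-centroid classifier on two groups.

    Each group needs at least two members. Accuracy near 1.0 means the groups
    are easy to tell apart; accuracy near chance level means they overlap.
    """
    correct = 0
    total = len(group_a) + len(group_b)
    for own, other in ((group_a, group_b), (group_b, group_a)):
        c_other = centroid(other)
        for i, v in enumerate(own):
            # Exclude the held-out point when computing its own group's centroid.
            c_own = centroid(own[:i] + own[i + 1:])
            if sq_dist(v, c_own) < sq_dist(v, c_other):
                correct += 1
    return correct / total


# Hypothetical toy embeddings: two well-separated clusters.
engineering = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]]
synthetic = [[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]

accuracy = loo_nearest_centroid_accuracy(engineering, synthetic)
print(accuracy)
```

In the abstract's terms, a "highly distinguishable" pair of groups would score close to 1.0 on a measure like this, while a "partially distinguishable" pair would land somewhere between chance and 1.0.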

Methodological Contribution

The proposed domain adaptation method using multi-dataset embeddings is a novel and valuable contribution to the field of tabular foundation models.
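The curation idea can be sketched in a few lines. The abstract says only that 'engineering-like' synthetic datasets are identified with the help of dataset-level embeddings; it does not give the selection rule. The version below is therefore an assumption: it ranks synthetic candidates by cosine similarity to the centroid of the real engineering embeddings and keeps the top few. The function names and toy vectors are hypothetical, and the embeddings are assumed to be precomputed. Note that the real engineering datasets only guide the selection here; they are never used as training data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_engineering_like(engineering_embeddings, synthetic_embeddings, keep):
    """Indices of the `keep` synthetic datasets closest to the engineering centroid.

    Closeness is cosine similarity to the mean engineering embedding, which is
    an assumed stand-in for whatever selection rule the paper actually uses.
    """
    dim = len(engineering_embeddings[0])
    n = len(engineering_embeddings)
    center = [sum(e[i] for e in engineering_embeddings) / n for i in range(dim)]
    scored = [
        (cosine_similarity(s, center), idx)
        for idx, s in enumerate(synthetic_embeddings)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:keep]]


# Hypothetical toy embeddings.
engineering = [[1.0, 0.1], [0.9, 0.2]]
synthetic = [[0.0, 1.0], [1.0, 0.0], [0.8, 0.3], [-1.0, 0.0]]

selected = select_engineering_like(engineering, synthetic, keep=2)
print(selected)
```

The selected indices would then define the synthetic tasks used for continued pre-training. A learned classifier or a density estimate over the embedding space would be reasonable alternatives to the centroid rule, at the cost of more tuning.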

Practical Implications

The study's findings have significant practical implications for data-sparse scientific and industrial domains, where real data collection is the primary bottleneck.

Demerits

Limited Generalizability

The results may not generalize to other domains or tasks, because the method was designed for and evaluated on engineering regression.

Dependency on High-Quality Synthetic Data

The method's effectiveness depends heavily on the quality of the synthetic data: a procedural generator that yields few 'engineering-like' datasets leaves little for curation to select.

Scalability Concerns

As datasets grow in size and complexity, the method may become computationally expensive and difficult to scale.

Expert Commentary

This study makes a significant contribution to tabular foundation models by offering a domain adaptation method for engineering regression that needs no real engineering data for training. Its use of dataset-level embeddings to curate synthetic pre-training tasks shows a practical route to better generalization and data efficiency, which is particularly relevant for data-sparse scientific and industrial domains. The limitations, including uncertain generalizability beyond engineering regression and dependence on high-quality synthetic data, must be weighed carefully. Overall, the results are encouraging and suggest that principled synthetic data curation can improve model performance in data-sparse domains.

Recommendations

✓ Further research is needed to explore the generalizability of the proposed method to other domains and tasks.
✓ Investigating the scalability of the proposed method for large and complex datasets is essential for its practical adoption.

Sources

arXiv - cs.LG

Engineering Regression Without Real-Data Training: Domain Adaptation for Tabular Foundation Models Using Multi-Dataset Embeddings
