Relational In-Context Learning via Synthetic Pre-training with Structural Prior
arXiv:2603.03805v1 Announce Type: new Abstract: Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high-quality RDBs are private, scarce and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce RDB-PFN, the first relational foundation model trained purely via synthetic data. Inspired by Prior-Data Fitted Networks (PFNs) where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a Relational Prior Generator to create an infinite stream of diverse RDBs from scratch. Pre-training on over 2 million synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine in-context learning. Experiments verify RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming graph-based and single-table foundation-model baselines (given the same DFS-linearized inputs), while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN
Executive Summary
This article introduces RDB-PFN, a relational foundation model trained purely on synthetic data to sidestep the scarcity of high-quality relational databases. Across 19 real-world relational prediction tasks, RDB-PFN achieves strong few-shot performance and outperforms graph-based and single-table foundation-model baselines given the same DFS-linearized inputs. It adapts to new databases through in-context learning, and the authors report a lightweight architecture and fast inference.
Key Points
- ▸ RDB-PFN is presented as the first relational foundation model trained purely on synthetic data
- ▸ A Relational Prior Generator creates diverse synthetic relational databases from scratch, supplying over 2 million single-table and relational pre-training tasks
- ▸ RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks
Merits
Innovative Approach
Pre-training on over 2 million synthetic single-table and relational tasks produced by a Relational Prior Generator sidesteps the scarcity of high-quality relational databases, which are typically private and structurally heterogeneous.
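To make the idea of generating relational data from a structural prior concrete, here is a minimal sketch. It is not the paper's Relational Prior Generator: the function names, the two-table schema, the hand-written causal equations, and the noise scales are all assumptions chosen for illustration. It samples a parent table from a tiny structural causal model, attaches child rows through a foreign key, and derives a label that depends on both tables.

```python
import random

def sample_parent_table(n_rows, rng):
    """Sample parent rows from a tiny hand-written SCM: x1 -> x2."""
    rows = []
    for pk in range(n_rows):
        x1 = rng.gauss(0.0, 1.0)              # root variable
        x2 = 0.8 * x1 + rng.gauss(0.0, 0.5)   # caused by x1 plus noise
        rows.append({"id": pk, "x1": x1, "x2": x2})
    return rows

def sample_child_table(parents, max_children, rng):
    """Sample child rows whose values depend on the parent row they reference."""
    rows = []
    for parent in parents:
        for _ in range(rng.randint(0, max_children)):
            amount = parent["x2"] + rng.gauss(0.0, 0.3)
            rows.append({"id": len(rows), "parent_id": parent["id"], "amount": amount})
    return rows

def sample_database(n_parents=5, max_children=3, seed=0):
    """Return one synthetic two-table database with a binary label per parent row."""
    rng = random.Random(seed)
    parents = sample_parent_table(n_parents, rng)
    children = sample_child_table(parents, max_children, rng)
    # The label depends on a parent feature and on the sum of linked child values,
    # so predicting it requires information from both tables.
    totals = {p["id"]: 0.0 for p in parents}
    for child in children:
        totals[child["parent_id"]] += child["amount"]
    for p in parents:
        p["label"] = int(p["x1"] + totals[p["id"]] > 0.0)
    return {"parent": parents, "child": children}

db = sample_database(seed=1)
```

Calling `sample_database` with different seeds yields a stream of distinct small databases. A realistic prior would also randomize the schema and the causal graph itself, which this sketch keeps fixed.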
Strong Performance
RDB-PFN outperforms graph-based and single-table foundation-model baselines on 19 real-world relational prediction tasks, with the baselines given the same DFS-linearized inputs.
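The comparison relies on linearizing a relational database into a single table. The sketch below assumes that "DFS" stands for Deep Feature Synthesis-style aggregation, which the source does not spell out. The helper `flatten_one_hop`, the column names, and the toy `customers` and `orders` tables are hypothetical. The sketch joins one child table onto its parent by aggregating child values per foreign key.

```python
from collections import defaultdict
from statistics import mean

def flatten_one_hop(parents, children, fk="parent_id", value="amount"):
    """Attach count, sum, and mean of a child column to each parent row."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[fk]].append(row[value])
    flat = []
    for p in parents:
        vals = grouped.get(p["id"], [])
        flat.append({
            **p,
            f"child_{value}_count": len(vals),
            f"child_{value}_sum": sum(vals),
            # Parents without children get None rather than a made-up mean.
            f"child_{value}_mean": mean(vals) if vals else None,
        })
    return flat

customers = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
orders = [
    {"id": 10, "parent_id": 1, "amount": 20.0},
    {"id": 11, "parent_id": 1, "amount": 30.0},
]
table = flatten_one_hop(customers, orders)
```

Each row of `table` is now a fixed-width feature vector that a single-table model can consume. Deeper schemas would repeat the aggregation along each foreign-key path.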
Demerits
Limited Real-World Data
The model is trained purely on synthetic data, so its prior may not capture every structure or complexity found in real-world relational databases.
Expert Commentary
RDB-PFN is a notable step for learning on relational databases because it targets the long-standing problem of data scarcity directly. Its ability to learn from synthetic data and adapt to new databases via in-context learning makes it promising for real-world applications. However, further research is needed to understand the limitations and potential biases of RDB-PFN, particularly with regard to its performance on complex and heterogeneous databases.
Recommendations
- ✓ Further research should be conducted to evaluate the performance of RDB-PFN on a wider range of real-world relational databases.
- ✓ The development of RDB-PFN should be accompanied by the creation of policies and regulations that promote the responsible use of synthetic data and relational databases.