
TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks


Mykola Pinchuk

arXiv:2603.05764v1

Abstract: Autonomous coding agents can produce strong tabular baselines quickly on Kaggle-style tasks, but practical value depends on end-to-end correctness and reliability under time limits. This paper introduces TML-Bench, a tabular benchmark for data science agents on Kaggle-style tasks, and evaluates 10 OSS LLMs on four Kaggle competitions under three time budgets (240s, 600s, and 1200s). Each model is run five times per task and budget. A run is successful if it produces a valid submission and a private-holdout score on hidden labels that are not accessible to the agent. The paper reports median performance, success rates, and run-to-run variability. The MiniMax-M2.1 model achieves the best aggregate performance score across all four competitions under the paper's primary aggregation. Average performance improves with larger time budgets, though scaling is noisy for some individual models at the current run count. Code and materials are available at https://github.com/MykolaPinchuk/TML-bench/tree/master.

Executive Summary

This paper introduces TML-Bench, a benchmark for data science agents on Kaggle-style tasks. The authors evaluate 10 OSS LLMs on four Kaggle competitions under three time budgets and report median performance, success rates, and run-to-run variability. The MiniMax-M2.1 model achieves the best aggregate performance score, and average performance improves with larger time budgets, although scaling remains noisy for some individual models at the current run count of five per task and budget. Code and materials are available on GitHub, enabling replication and further research. The findings have implications for the development and evaluation of data science agents in real-world applications.
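The reporting scheme described above (five runs per model, task, and budget; success defined as producing a valid submission with a holdout score; median, success rate, and spread reported per cell) can be sketched as follows. This is an illustrative reconstruction, not the paper's actual code; the record layout, model and task names, and scores are all hypothetical.

```python
from statistics import median, pstdev

# Hypothetical run records: (model, task, budget_seconds, valid_submission, holdout_score).
# A failed run has no holdout score, so its score is None.
runs = [
    ("model-a", "task-1", 240, True, 0.71),
    ("model-a", "task-1", 240, True, 0.74),
    ("model-a", "task-1", 240, False, None),
    ("model-a", "task-1", 240, True, 0.69),
    ("model-a", "task-1", 240, True, 0.73),
]

def summarize(records):
    """Aggregate the runs of one (model, task, budget) cell into the
    three quantities the paper reports: success rate, median score,
    and run-to-run variability (here: population std dev of scores)."""
    scores = [s for *_, ok, s in records if ok and s is not None]
    return {
        "success_rate": len(scores) / len(records),
        "median_score": median(scores) if scores else None,
        "score_spread": pstdev(scores) if len(scores) > 1 else 0.0,
    }

stats = summarize(runs)
print(stats)  # e.g. success_rate 0.8, median_score 0.72 for the sample above
```

In practice one would group a full results table by (model, task, budget) and apply `summarize` to each group; the choice of median over mean is what makes the headline number robust to the occasional failed or outlier run.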

Key Points

  • TML-Bench is a benchmark for data science agents on Kaggle-style tasks
  • The authors evaluate 10 OSS LLMs on four Kaggle competitions with varying time budgets
  • The MiniMax-M2.1 model achieves the best aggregate performance score
  • Average performance improves with larger time budgets

Merits

Strength

TML-Bench provides a standardized evaluation framework for data science agents, enabling fair comparisons and identifying top-performing models. The study's findings have practical implications for the development and deployment of data science agents in Kaggle-style competitions and real-world applications.

Demerits

Limitation

The study's focus on Kaggle-style tasks may limit the generalizability of the findings to other domains. Additionally, the noisy scaling of individual models at the current run count may indicate the need for further research to improve the robustness and reliability of data science agents.

Expert Commentary

The introduction of TML-Bench marks a significant step forward in the evaluation and comparison of data science agents. By providing a standardized framework for benchmarking and evaluation, the authors enable researchers and practitioners to identify top-performing models and develop more effective data science agents. However, further research is needed to address the noisy scaling of individual models and to explore the generalizability of the findings to other domains.

Recommendations

  • Future studies should investigate the application of TML-Bench to other domains and tasks to assess its generalizability and scope.
  • Researchers should explore methods to improve the robustness and reliability of data science agents, such as developing more robust evaluation metrics and incorporating domain knowledge and expertise.
