Many AI Analysts, One Dataset: Navigating the Agentic Data Science Multiverse

Martin Bertran, Riccardo Fogliato, Zhiwei Steven Wu

arXiv:2602.18710v1 Announce Type: new Abstract: The conclusions of empirical research depend not only on data but on a sequence of analytic decisions that published results seldom make explicit. Past "many-analyst" studies have demonstrated this: independent teams testing the same hypothesis on the same dataset regularly reach conflicting conclusions. But such studies require months of coordination among dozens of research groups and are therefore rarely conducted. In this work, we show that fully autonomous AI analysts built on large language models (LLMs) can reproduce a similar structured analytic diversity cheaply and at scale. We task these AI analysts with testing a pre-specified hypothesis on a fixed dataset, varying the underlying model and prompt framing across replicate runs. Each AI analyst independently constructs and executes a full analysis pipeline; an AI auditor then screens each run for methodological validity. Across three datasets spanning experimental and observational designs, AI analyst-produced analyses display wide dispersion in effect sizes, p-values, and binary decisions on supporting the hypothesis or not, frequently reversing whether a hypothesis is judged supported. This dispersion is structured: recognizable analytic choices in preprocessing, model specification, and inference differ systematically across LLM and persona conditions. Critically, the effects are *steerable*: reassigning the analyst persona or LLM shifts the distribution of outcomes even after excluding methodologically deficient runs.

Executive Summary

This article presents a study in which fully autonomous AI analysts, built on large language models (LLMs), are tasked with testing a pre-specified hypothesis on a fixed dataset. The study demonstrates that AI analysts can reproduce the structured analytic diversity documented in past 'many-analyst' studies, but far more cheaply and at much larger scale. Across replicate runs, the analysts' results show wide dispersion in effect sizes, p-values, and binary decisions about whether the hypothesis is supported, and recognizable analytic choices differ systematically across LLM and persona conditions. The study also finds that these effects are steerable: reassigning the analyst persona or the underlying LLM shifts the distribution of outcomes, even after an AI auditor excludes methodologically deficient runs. This research highlights the role analytic decisions play in the scientific process and the potential for systematic biases in AI analysts' decision-making.
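The workflow described above can be sketched as a grid of replicate runs over (model, persona) conditions, with an audit filter applied before dispersion is summarized. This is a minimal illustrative simulation, not the paper's implementation: the model names, personas, rejection rate, and the simulated run outputs are all placeholders.

```python
import random
import statistics
from itertools import product

# Illustrative experiment grid; the study's actual models and personas differ.
MODELS = ["model-a", "model-b"]
PERSONAS = ["cautious statistician", "exploratory data scientist"]
REPLICATES = 5

def run_ai_analyst(model: str, persona: str, replicate: int) -> dict:
    """Stand-in for one autonomous analysis run.

    In the study, an LLM analyst constructs and executes a full analysis
    pipeline and an AI auditor screens the run for methodological validity;
    here we merely simulate the recorded outputs of such a run.
    """
    rng = random.Random(f"{model}|{persona}|{replicate}")  # deterministic seed
    effect = rng.gauss(0.2, 0.15)                 # analytic choices move the estimate
    p_value = min(1.0, abs(rng.gauss(0.05, 0.05)))
    return {
        "model": model,
        "persona": persona,
        "effect": effect,
        "p_value": p_value,
        "supported": p_value < 0.05 and effect > 0,
        "valid": rng.random() > 0.2,              # simulated auditor rejects ~20%
    }

runs = [run_ai_analyst(m, p, r)
        for m, p, r in product(MODELS, PERSONAS, range(REPLICATES))]
valid = [r for r in runs if r["valid"]]

# Dispersion across the "multiverse", after excluding audited-out runs.
effect_sd = statistics.stdev(r["effect"] for r in valid)
support_rate = sum(r["supported"] for r in valid) / len(valid)
print(f"{len(valid)} valid runs, effect sd = {effect_sd:.3f}, "
      f"support rate = {support_rate:.2f}")
```

Grouping the surviving runs by persona or model and comparing their support rates is then what makes the "steerability" of outcomes visible.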

Key Points

  • AI analysts can reproduce the structured analytic diversity of past 'many-analyst' studies at far lower cost and much greater scale.
  • The analyses show wide dispersion in effect sizes, p-values, and binary decisions about whether the hypothesis is supported.
  • Recognizable analytic choices differ systematically across LLM and persona conditions.
  • The effects are steerable, meaning that reassigning the analyst persona or LLM can shift the distribution of outcomes.

Merits

Strength

The study demonstrates that autonomous AI analysts can reproduce the structured analytic diversity previously observable only through costly multi-team studies, providing a practical probe of how analytic decisions shape empirical conclusions.

Methodological Innovation

The use of fully autonomous AI analysts built on LLMs is a novel approach to studying the scientific process.

Scalability

The study shows that AI analysts can perform many independent analyses at far lower cost and much greater scale than past 'many-analyst' studies, which required months of coordination among dozens of research groups.

Demerits

Limitation

The study uses only three datasets, spanning experimental and observational designs, so its findings may not generalize to other data domains or study designs.

Bias in AI Analysts

The study highlights that AI analysts' decision-making carries systematic, steerable biases: the choice of LLM and persona shifts analytic choices and, ultimately, the conclusions drawn.

Interpretability

While the AI auditor screens runs for methodological validity, the study does not explain why individual AI analysts make the particular analytic choices they do, which limits the interpretability of the observed dispersion.

Expert Commentary

This study is a significant contribution to our understanding of the role of AI in the scientific process. Using fully autonomous AI analysts built on LLMs is a novel way to study analytic variability, and the approach could extend to other fields where analysis pipelines are built from many discretionary choices. However, the study also raises concerns: because the outcomes are steerable through model and persona selection, AI-produced analyses can carry systematic biases that affect results. Further research is needed to characterize these limitations and how to mitigate them, and policymakers will need to weigh both the uses and the limits of AI analysts across domains.

Recommendations

  • Further research is needed to understand the potential limitations of AI analysts and how to address them.
  • Policymakers should consider the potential uses and limitations of AI analysts in various fields.
  • Researchers should be transparent about the methods used by AI analysts and the potential biases in their decision-making.
