This human study did not involve human subjects: Validating LLM simulations as behavioral evidence

Jessica Hullman, David Broska, Huaman Sun, Aaron Shaw

arXiv:2602.15785v1. Abstract: A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.

Executive Summary

This article discusses the use of large language models (LLMs) as synthetic participants in social science experiments, comparing two strategies for obtaining valid estimates of causal effects. The authors contrast heuristic approaches with statistical calibration, highlighting the strengths and limitations of each method. They emphasize that the value of either approach depends on how well LLMs approximate the relevant human populations, and they caution against treating LLMs merely as drop-in substitutes for human participants. The article provides guidance on when LLM simulations can support valid inference about human behavior, and on matching the method to exploratory versus confirmatory research goals.

Key Points

  • LLMs can be used as synthetic participants in social science experiments
  • Heuristic approaches and statistical calibration are two strategies for obtaining valid estimates of causal effects
  • The choice of approach depends on the research goal, with heuristic approaches suitable for exploratory tasks and statistical calibration more suitable for confirmatory research
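The statistical-calibration strategy described in the abstract can be illustrated with a simple difference estimator, in the spirit of prediction-powered inference: a large, cheap LLM-only sample supplies precision, while a small paired human sample measures and removes the LLM's systematic error. The sketch below is an invented toy setup, not the paper's actual method; all numbers (true effect, bias, sample sizes) and function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

TRUE_EFFECT = 0.5   # true average treatment effect in the (simulated) human population
LLM_BIAS = 0.2      # systematic error the LLM adds in the treated condition

def human_response(treat, n):
    # Human outcomes: true effect plus individual-level noise.
    return TRUE_EFFECT * treat + rng.normal(0.0, 0.5, size=n)

def llm_response(treat, n):
    # LLM simulations: systematically biased in the treated arm,
    # with some noise of their own.
    return (TRUE_EFFECT + LLM_BIAS) * treat + rng.normal(0.0, 0.1, size=n)

# Large, cheap LLM-only sample (one treated and one control arm).
n_sim = 5000
sim_t, sim_c = llm_response(1, n_sim), llm_response(0, n_sim)
tau_sim = sim_t.mean() - sim_c.mean()   # biased estimate: about TRUE_EFFECT + LLM_BIAS

# Small auxiliary human sample, with paired LLM simulations of the same design.
n_aux = 250
hum_t, hum_c = human_response(1, n_aux), human_response(0, n_aux)
aux_t, aux_c = llm_response(1, n_aux), llm_response(0, n_aux)
correction = (hum_t.mean() - aux_t.mean()) - (hum_c.mean() - aux_c.mean())

# Calibrated estimate: LLM-based estimate plus the human-measured discrepancy.
tau_cal = tau_sim + correction
print(f"sim-only: {tau_sim:.3f}  calibrated: {tau_cal:.3f}")
```

The simulation-only estimate inherits the LLM's bias, while the calibrated estimate is centered on the true effect; the auxiliary human data pays for validity, and the large synthetic sample pays for precision. This is the sense in which, under explicit assumptions, calibration can beat a human-only experiment of the same cost.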

Merits

Cost-effectiveness

Using LLMs can reduce the cost and increase the speed of social science experiments, making it possible to conduct larger and more complex studies.

Demerits

Limited generalizability

The accuracy of LLMs in approximating human behavior may be limited, particularly in complex or nuanced contexts, which can affect the validity of the results.

Expert Commentary

The article provides a thoughtful and nuanced exploration of the potential of LLMs in social science research. The authors' distinction between heuristic approaches and statistical calibration is particularly useful, highlighting the importance of careful consideration of the research goal and the limitations of the methodology. However, the article also raises important questions about the potential risks and challenges of relying on LLMs, including issues of bias and generalizability. As the use of LLMs in research continues to grow, it will be essential to address these challenges and develop best practices for ensuring the validity and reliability of the results.

Recommendations

  • Researchers should carefully evaluate the strengths and limitations of LLMs in their specific research context
  • Further research is needed to develop and refine methods for using LLMs in social science experiments, including the development of more sophisticated statistical calibration techniques