Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research

arXiv:2603.04897v1

Abstract

Qualitative analysis of open-ended interviews plays a central role in ethnographic and economic research by uncovering individuals' values, motivations, and culturally embedded financial behaviors. While large language models (LLMs) offer promising support for automating and enriching such interpretive work, their ability to produce nuanced, reliable interpretations under inherent task ambiguity remains unclear. In our work we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts. Results show that LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but struggle to recover exact value rankings, as reflected in lower RBO scores. While the average Schwartz value distributions of most models closely match those of human analysts, their uncertainty structures across the Schwartz values diverge from expert uncertainty patterns. Among the evaluated models, Qwen performs closest to expert-level agreement and exhibits the strongest alignment with expert Schwartz value distributions. LLM ensemble methods yield consistent gains across metrics, with Majority Vote and Borda Count performing best. Notably, systematic overemphasis on certain Schwartz values, like Security, suggests both the potential of LLMs to provide complementary perspectives and the need to further investigate model-induced value biases. Overall, our findings highlight both the promise and the limitations of LLMs as collaborators in inherently ambiguous qualitative value analysis.
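The abstract names Majority Vote and Borda Count as the best-performing ensemble methods but does not spell out how they were implemented. The sketch below shows one common way to combine several models' top-three lists. The function names, the point scheme (3, 2, 1 for ranks one to three), the alphabetical tie-break, and the toy rankings are all assumptions for illustration, not the authors' procedure.

```python
from collections import Counter

def majority_vote(rankings, k=3):
    """Keep the k values that appear in the most models' lists.

    Assumption: ties are broken alphabetically to keep the output deterministic.
    """
    counts = Counter(value for ranking in rankings for value in set(ranking))
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [value for value, _ in ordered[:k]]

def borda_count(rankings, k=3):
    """Sum positional points across models and keep the k highest scorers.

    Assumption: in a list of length n, rank 1 earns n points, rank 2 earns n - 1, etc.
    """
    scores = Counter()
    for ranking in rankings:
        for position, value in enumerate(ranking):
            scores[value] += len(ranking) - position
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [value for value, _ in ordered[:k]]

# Hypothetical top-three outputs from three models for one interview.
model_outputs = [
    ["Security", "Benevolence", "Achievement"],
    ["Benevolence", "Security", "Tradition"],
    ["Benevolence", "Achievement", "Security"],
]

vote_result = majority_vote(model_outputs)
borda_result = borda_count(model_outputs)
print(vote_result)
print(borda_result)
```

Majority Vote only uses set membership, while Borda Count also uses each model's ordering, so the two can disagree when models agree on which values are present but not on their order.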

Executive Summary

This article examines whether large language models (LLMs) can capture expert uncertainty in ethnographic qualitative research, specifically when identifying the top three human values expressed in long-form interviews under the Schwartz Theory of Basic Values framework. LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but struggle to recover exact value rankings, and their uncertainty structures diverge from those of expert annotators even where average value distributions match. The study highlights both the promise and the limitations of LLMs as collaborators in qualitative value analysis: systematic overemphasis on values such as Security points to model-induced biases that need further investigation, while also suggesting that LLMs can offer complementary perspectives.

Key Points

  • LLMs approach human-level performance on set-based metrics (F1, Jaccard) for identifying human values
  • LLMs struggle to recover exact value rankings, as reflected in lower RBO scores
  • Systematic overemphasis on certain Schwartz values suggests potential model-induced value biases
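To make the set-based metrics in the first point concrete, here is a minimal sketch of F1 and Jaccard computed on two top-three value sets. The function names and the example annotations are hypothetical, and the paper may aggregate these scores differently (for example, across annotators or interviews).

```python
def set_f1(predicted, reference):
    """F1 between two label sets, ignoring order."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def jaccard(predicted, reference):
    """Intersection over union of two label sets, ignoring order."""
    pred, ref = set(predicted), set(reference)
    union = pred | ref
    # Assumption: two empty sets count as perfect agreement.
    return len(pred & ref) / len(union) if union else 1.0

# Hypothetical annotations for a single interview.
model_top3 = ["Security", "Benevolence", "Achievement"]
expert_top3 = ["Benevolence", "Security", "Self-Direction"]

f1_score = set_f1(model_top3, expert_top3)
jaccard_score = jaccard(model_top3, expert_top3)
print(f1_score, jaccard_score)
```

Both metrics treat the top three as an unordered set, so swapping the first and second value leaves the score unchanged.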

Merits

Promising Performance

LLMs demonstrate strong performance on set-based metrics, indicating their potential as collaborators in qualitative research

Complementary Perspectives

LLMs may provide unique insights and perspectives that can complement human analysis

Demerits

Limited Ranking Ability

LLMs struggle to accurately rank human values, which may limit their utility in certain research contexts
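Rank-biased overlap (RBO) is the metric the study uses to assess ranking agreement. The sketch below implements a truncated, normalized form for short lists such as a top three. The persistence parameter `p = 0.9`, the normalization by total weight, and the function name are assumptions, and the paper may use a different RBO variant.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Weighted average of prefix agreement over the shared depth of two rankings.

    At each depth d, agreement is |A[:d] & B[:d]| / d and its weight is p**(d - 1),
    so disagreement at the top of the list costs more than disagreement lower down.
    Assumption: weights are renormalized so identical lists score 1.0.
    """
    depth = min(len(ranking_a), len(ranking_b))
    if depth == 0:
        return 0.0
    weighted_agreement = 0.0
    total_weight = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        weight = p ** (d - 1)
        weighted_agreement += weight * overlap / d
        total_weight += weight
    return weighted_agreement / total_weight

# Hypothetical rankings: same three values, first two swapped.
model_ranking = ["Security", "Benevolence", "Achievement"]
expert_ranking = ["Benevolence", "Security", "Achievement"]

rbo_score = rbo_truncated(model_ranking, expert_ranking)
print(rbo_score)
```

In this example the two lists contain the same values, so a set-based metric would report perfect agreement, while the RBO-style score drops below 1 because the top-ranked value differs.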

Model-Induced Biases

Systematic overemphasis on certain Schwartz values suggests that LLMs may introduce biases into the research process
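One simple way to check for this kind of overemphasis is to compare how often each value appears in model top-three lists against expert top-three lists. The sketch below is an illustrative assumption about how such a comparison could be set up; the function names and toy annotations are made up and do not reproduce the paper's data or analysis.

```python
from collections import Counter

def value_shares(annotations):
    """Fraction of all top-three slots that each value occupies."""
    counts = Counter(value for top3 in annotations for value in top3)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def overemphasis(model_annotations, expert_annotations):
    """Model share minus expert share per value; positive means the model picks it more often."""
    model = value_shares(model_annotations)
    expert = value_shares(expert_annotations)
    values = sorted(set(model) | set(expert))
    return {v: model.get(v, 0.0) - expert.get(v, 0.0) for v in values}

# Toy data for two interviews (not taken from the study).
expert_lists = [
    ["Benevolence", "Security", "Tradition"],
    ["Achievement", "Benevolence", "Self-Direction"],
]
model_lists = [
    ["Security", "Benevolence", "Tradition"],
    ["Security", "Achievement", "Benevolence"],
]

gaps = overemphasis(model_lists, expert_lists)
for value, gap in gaps.items():
    print(f"{value}: {gap:+.3f}")
```

A consistently positive gap for one value across many interviews would be a signal worth investigating, though it does not by itself show whether the model or the experts are closer to the interviewee's actual values.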

Expert Commentary

The findings point to a nuanced role for LLMs in qualitative research. Near-ceiling set-based agreement suggests the models can help identify which values are present, but weaker ranking recovery and divergent uncertainty patterns mean their outputs should not be treated as a substitute for expert judgment. The use of LLMs in research also raises questions about the nature of expertise, the role of human judgment, and the need for transparency and accountability in AI-driven research. As LLMs become more integrated into research workflows, their strengths, limitations, and potential biases need to be understood and weighed carefully.

Recommendations

  • Researchers should prioritize evaluation metrics that capture ranking quality and uncertainty patterns, not only set overlap, and that fit the research context
  • Further research is needed to investigate and mitigate model-induced biases in LLMs
  • Policymakers and researchers should work together to establish regulatory frameworks and best practices for the use of LLMs in research and decision-making processes
