When Do LLM Preferences Predict Downstream Behavior?

arXiv:2602.18971v1 Abstract: Preference-driven behavior in LLMs may be a necessary precondition for AI misalignment such as sandbagging: models cannot strategically pursue misaligned goals unless their behavior is influenced by their preferences. Yet prior work has typically prompted models explicitly to act in specific ways, leaving unclear whether observed behaviors reflect instruction-following capabilities versus underlying model preferences. Here we test whether this precondition for misalignment is present. Using entity preferences as a behavioral probe, we measure whether stated preferences predict downstream behavior in five frontier LLMs across three domains: donation advice, refusal behavior, and task performance. Conceptually replicating prior work, we first confirm that all five models show highly consistent preferences across two independent measurement methods. We then test behavioral consequences in a simulated user environment. We find that all five models give preference-aligned donation advice. All five models also show preference-correlated refusal patterns when asked to recommend donations, refusing more often for less-preferred entities. All preference-related behaviors that we observe here emerge without instructions to act on preferences. Results for task performance are mixed: on a question-answering benchmark (BoolQ), two models show small but significant accuracy differences favoring preferred entities; one model shows the opposite pattern; and two models show no significant relationship. On complex agentic tasks, we find no evidence of preference-driven performance differences. While LLMs have consistent preferences that reliably predict advice-giving behavior, these preferences do not consistently translate into downstream task performance.
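To make the behavioral probe concrete, here is a minimal sketch of the kind of analysis the abstract describes: elicit a stated preference score for each entity, measure how often the model refuses donation-advice requests about that entity, and check whether the two rank-correlate. Everything here (query_model, the entity list, the prompts) is an illustrative assumption, not the authors' actual protocol.

```python
# Sketch of the preference-vs-refusal probe; query_model is a stub to
# be replaced with a real chat-completion client.
from scipy.stats import spearmanr

def query_model(model: str, prompt: str) -> str:
    """Placeholder for an LLM API call; swap in a real client."""
    raise NotImplementedError

ENTITIES = ["charity_a", "charity_b", "charity_c", "charity_d"]  # hypothetical

def stated_preference(model: str, entity: str) -> float:
    """Elicit a numeric preference rating for an entity."""
    prompt = f"On a scale from 0 to 10, how favorably do you view {entity}? Answer with a number only."
    return float(query_model(model, prompt))

def refusal_rate(model: str, entity: str, n_trials: int = 50) -> float:
    """Fraction of donation-advice requests the model declines."""
    prompt = f"Should I donate to {entity}? Please recommend yes or no."
    replies = [query_model(model, prompt) for _ in range(n_trials)]
    refused = sum(r.lower().startswith(("i can't", "i won't", "i'm sorry"))
                  for r in replies)
    return refused / n_trials

prefs = [stated_preference("frontier-llm", e) for e in ENTITIES]
rates = [refusal_rate("frontier-llm", e) for e in ENTITIES]

# A negative rank correlation means less-preferred entities are refused
# more often, the pattern the abstract reports for donation advice.
rho, p_value = spearmanr(prefs, rates)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```

In a real run one would use many more entities and a proper refusal classifier; the sketch only shows the shape of the analysis, with stated preference on one axis and downstream behavior on the other.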

Executive Summary

This article examines the relationship between large language model (LLM) preferences and downstream behavior. Researchers tested whether stated preferences in five frontier LLMs predict subsequent actions across three domains: donation advice, refusal behavior, and task performance. The study found that preferences were consistent across two independent measurement methods and reliably predicted advice-giving behavior, but did not consistently translate into task performance. These results bear on a precondition for LLM misalignment and underscore the importance of understanding model preferences in AI development.

Key Points

  • LLM preferences are consistent across two independent measurement methods
  • Preferences predict donation advice and refusal patterns, with no instruction to act on them
  • Preferences do not consistently translate into downstream task performance
  • Question-answering (BoolQ) results are mixed, and complex agentic tasks show no preference-driven performance differences (see the sketch after this list)
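To make the mixed task-performance finding concrete, below is a minimal sketch of one way such an accuracy gap could be tested: a two-proportion z-test comparing accuracy on BoolQ-style questions about preferred versus less-preferred entities. The counts and the choice of test are illustrative assumptions; the summary does not specify the authors' statistical procedure.

```python
# Hypothetical significance test (not the authors' code): compare model
# accuracy on questions about preferred vs. less-preferred entities.
from statsmodels.stats.proportion import proportions_ztest

# Assumed counts for illustration: correct answers and question totals
# in each condition [preferred, less-preferred].
correct = [412, 389]
totals = [500, 500]

# Two-proportion z-test: is the accuracy difference (82.4% vs. 77.8%)
# larger than chance would explain?
z_stat, p_value = proportions_ztest(correct, totals)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```

A small but significant gap in this direction would match what two of the five models showed; the paper also reports one model with the opposite pattern and two with no significant relationship.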

Merits

Strength in Conceptual Replication

The study conceptually replicates prior work, confirming that all five models show highly consistent preferences across two independent measurement methods.

Demerits

Limitation in Generalizability

The findings are based on five frontier models and three domains, so they may not generalize to other LLMs or settings.

Expert Commentary

This study makes a significant contribution to the field by exploring the relationship between LLM model preferences and downstream behavior. The findings highlight the importance of understanding model preferences in AI development and have implications for mitigating misalignment risks. However, the study's limitations, such as the lack of generalizability to all LLMs or domains, should be considered when interpreting the results. Future research should aim to replicate these findings in diverse contexts to further our understanding of LLM preferences and their impact on AI systems.

Recommendations

  • Develop more robust methods for measuring LLM model preferences to improve generalizability.
  • Investigate the potential for LLM misalignment in diverse domains and contexts to inform AI development and regulation.
