Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains

Andrew Katz

arXiv:2603.14400v1 Announce Type: new Abstract: The minimal pairs paradigm of comparing model probabilities for contrasting completions has proven useful for evaluating linguistic knowledge in language models, yet its application has largely been confined to binary grammaticality judgments over syntactic phenomena. Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty. We address both limitations by extending surprisal-based evaluation from binary grammaticality contrasts to ordinal-scaled classification and scoring tasks across multiple domains. Rather than asking models to generate answers, we measure the information-theoretic "surprise" (negative log probability) they assign to each position on rating scales (e.g., 1-5 or 1-9), yielding full surprisal curves that reveal both the model's preferred response and its uncertainty via entropy. We explore this framework across four domains: social-ecological-technological systems classification, causal statement identification (binary and scaled), figurative language detection, and deductive qualitative coding. Across these domains, surprisal curves produce interpretable classification signals with clear minima near expected ordinal scale positions, and entropy over the completion tended to distinguish genuinely ambiguous items from easier items.
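The core computation can be sketched in a few lines. Assuming a model exposes logits over the tokens for each rating-scale position (function and variable names here are illustrative, not from the paper), the surprisal curve is the negative log of the softmax probability at each position, and entropy over that distribution summarizes the model's uncertainty:

```python
import math

def surprisal_curve(logits):
    """Convert raw logits over rating-scale tokens (e.g. "1".."5")
    into a probability distribution and a surprisal curve -log p(k)."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return probs, [-math.log(p) for p in probs]

def entropy(probs):
    """Shannon entropy (in nats) of the distribution over scale positions."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits a model might assign to ratings 1..5 for one item.
probs, curve = surprisal_curve([0.2, 1.5, 3.0, 1.0, -0.5])

# The predicted rating is the scale position with minimum surprisal.
predicted = min(range(len(curve)), key=curve.__getitem__) + 1
uncertainty = entropy(probs)
```

Because only the logits at the answer position are needed, this avoids autoregressive generation entirely: one forward pass yields the full curve.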

Executive Summary

The article proposes an extension of the minimal pairs paradigm that uses ordinal surprisal curves and entropy to evaluate linguistic knowledge in language models across multiple domains. This approach addresses limitations of standard prompting-based evaluation: it avoids expensive text generation, captures model uncertainty rather than discarding it, and yields interpretable classification signals. Applied to four domains, the framework produced surprisal curves with clear minima near expected scale positions, and completion entropy helped separate genuinely ambiguous items from easier ones.

Key Points

  • Extension of minimal pairs paradigm to ordinal-scaled classification and scoring tasks
  • Use of surprisal curves to reveal model preferences and uncertainty via entropy
  • Application across four domains: social-ecological-technological systems, causal statement identification, figurative language detection, and deductive qualitative coding

Merits

Improved Evaluation Methodology

By scoring every position on the rating scale rather than a single generated answer, the approach captures both the model's preferred response and its full uncertainty profile, avoiding the post-hoc rationalization and information loss inherent in generation-based evaluation.

Interpretable Results

Surprisal curves produce clear classification signals, with minima near the expected ordinal scale positions, making model behavior easier to inspect and to compare across items and models.
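To illustrate how entropy separates easy from ambiguous items (the numbers below are toy values, not the paper's data): a sharply peaked distribution over a 5-point scale yields low entropy, while a near-uniform one approaches the maximum of log K:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

easy = [0.02, 0.05, 0.85, 0.05, 0.03]       # confident: one clear surprisal minimum
ambiguous = [0.22, 0.19, 0.20, 0.18, 0.21]  # near-uniform: no clear preference

h_easy = entropy(easy)
h_amb = entropy(ambiguous)
h_max = math.log(5)  # upper bound for a 5-point scale (uniform distribution)
```

A simple decision rule, then, is to flag items whose entropy exceeds some fraction of `h_max` as candidates for human review.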

Demerits

Computational Complexity

The calculation of surprisal curves and entropy may increase computational requirements, potentially limiting the applicability of the approach to large-scale models or datasets.

Domain-Specific Limitations

The effectiveness of the proposed framework may vary across domains, requiring careful consideration of domain-specific characteristics and limitations.

Expert Commentary

The article presents a significant contribution to the field of natural language processing, offering a novel approach to evaluating linguistic knowledge in language models. By leveraging ordinal surprisal curves and entropy, the authors provide a more detailed understanding of model performance and uncertainty. The application of this framework across multiple domains demonstrates its versatility and potential for widespread adoption. However, further research is needed to address potential limitations and ensure the scalability and generalizability of the approach.

Recommendations

  • Further investigation into the computational requirements and potential optimizations of the proposed framework
  • Expansion of the approach to additional domains and tasks, including exploration of its applicability to multimodal and multilingual models
