Human Label Variation in Implicit Discourse Relation Recognition

arXiv:2602.22723v1 (Announce Type: new)

Abstract: There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.

Executive Summary

The article explores the challenges of capturing human judgment variation in Natural Language Processing (NLP) tasks, focusing on Implicit Discourse Relation Recognition (IDRR). It compares models that predict full annotation distributions with those that aim to reproduce individual annotators' interpretations. The study finds that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, while models trained on label distributions yield more stable predictions. The analysis highlights that cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.
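To make the two modeling approaches concrete, here is a minimal sketch of what "training on label distributions" can look like in practice, assuming per-item annotation counts and a PyTorch classifier. The sense inventory, function names, and data shapes below are illustrative assumptions, not details taken from the paper.

```python
# Sketch: training a classifier against full annotation distributions
# (soft labels) rather than a single majority label.
import torch
import torch.nn.functional as F

IDRR_SENSES = ["Comparison", "Contingency", "Expansion", "Temporal"]  # hypothetical sense set

def soft_label(counts: list[int]) -> torch.Tensor:
    """Turn raw annotation counts into a probability distribution."""
    c = torch.tensor(counts, dtype=torch.float)
    return c / c.sum()

def distribution_loss(logits: torch.Tensor, target_dist: torch.Tensor) -> torch.Tensor:
    """KL divergence between the model's predicted distribution and the human one."""
    log_probs = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_probs, target_dist, reduction="batchmean")

# Example: 3 of 5 annotators chose Contingency, 2 chose Expansion.
target = soft_label([0, 3, 2, 0]).unsqueeze(0)      # shape (1, 4)
logits = torch.randn(1, len(IDRR_SENSES))           # stand-in for model output
loss = distribution_loss(logits, target)
```

Under this objective, the model is rewarded for matching the spread of human judgments, which is one plausible reason such models yield more stable predictions on ambiguous items.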

Key Points

  • Human judgments in NLP tasks reflect diverse perspectives, lacking a single ground truth.
  • Models trained on full annotation distributions yield more stable predictions than annotator-specific (perspectivist) models in IDRR.
  • Cognitively demanding cases drive inconsistency in human interpretation, challenging perspectivist modeling.

Merits

Comprehensive Comparison

The article provides a rigorous comparison between models predicting full annotation distributions and perspectivist models, offering valuable insights into their performance in IDRR.

Identification of Key Challenges

The study effectively identifies cognitively demanding cases as a significant source of inconsistency in human interpretation, highlighting a critical area for future research.

Demerits

Limited Scope

The analysis is focused on IDRR, which may limit the generalizability of the findings to other NLP tasks with different characteristics.

Model Performance Variability

The performance of annotator-specific models is highly dependent on the reduction of ambiguity, which may not always be feasible in practical applications.
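For readers unfamiliar with annotator-specific modeling, the following is a hedged sketch of one common perspectivist design, in which the classifier is conditioned on an annotator embedding so that each annotator receives their own predicted label. This is an illustrative architecture under our own assumptions, not one of the models evaluated in the paper.

```python
# Sketch: a perspectivist classifier conditioned on annotator identity.
import torch
import torch.nn as nn

class AnnotatorConditionedClassifier(nn.Module):
    def __init__(self, text_dim: int, n_annotators: int, n_senses: int, ann_dim: int = 32):
        super().__init__()
        self.annotator_emb = nn.Embedding(n_annotators, ann_dim)
        self.head = nn.Linear(text_dim + ann_dim, n_senses)

    def forward(self, text_repr: torch.Tensor, annotator_id: torch.Tensor) -> torch.Tensor:
        # Concatenate the encoded argument pair with the annotator embedding,
        # then predict that specific annotator's label for the relation.
        a = self.annotator_emb(annotator_id)
        return self.head(torch.cat([text_repr, a], dim=-1))

model = AnnotatorConditionedClassifier(text_dim=768, n_annotators=10, n_senses=4)
text_repr = torch.randn(2, 768)                  # stand-in for a text encoder's output
logits = model(text_repr, torch.tensor([0, 3]))  # per-annotator predictions
```

A design like this only pays off when an annotator's disagreements are systematic; if, as the paper argues, disagreement in IDRR stems from cognitive difficulty rather than stable individual perspectives, the embedding has little consistent signal to capture.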

Expert Commentary

The article makes a significant contribution to the field of NLP by addressing the critical issue of human judgment variation. The comparison between models predicting full annotation distributions and perspectivist models is particularly insightful, as it underscores the difficulty of capturing individual annotator interpretations in highly ambiguous tasks like IDRR. The identification of cognitively demanding cases as a primary driver of inconsistency is a valuable finding that could guide future research.

However, the study's focus on IDRR limits its generalizability, and further work is needed to establish whether these findings extend to other NLP tasks.

The practical implications are substantial: for tasks with high ambiguity, models trained on label distributions may be more reliable than annotator-specific ones. At the policy level, the findings point to the need for annotation and evaluation practices that accommodate diverse human perspectives, particularly in cognitively demanding tasks.

Overall, the article offers a rigorous, well-reasoned analysis that adds genuine value to the literature on NLP and human label variation.

Recommendations

  • Future research should explore the generalizability of these findings to other NLP tasks with varying levels of ambiguity.
  • Developers of NLP models should consider incorporating label distribution predictions to improve reliability in tasks with high ambiguity, and should evaluate them with distribution-aware metrics rather than accuracy alone (see the evaluation sketch below).
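
As a follow-up to the second recommendation, here is a short sketch of evaluating a predicted distribution against the observed human label distribution using Jensen-Shannon divergence, a metric commonly used for soft-label evaluation. The distributions below are made up for illustration.

```python
# Sketch: comparing a predicted label distribution to the human one.
import numpy as np
from scipy.spatial.distance import jensenshannon

pred = np.array([0.10, 0.55, 0.30, 0.05])  # model's predicted distribution
gold = np.array([0.00, 0.60, 0.40, 0.00])  # observed annotation distribution
jsd = jensenshannon(pred, gold) ** 2       # scipy returns the JS *distance* (sqrt of the divergence)
print(f"JSD = {jsd:.4f}")                  # lower = closer match to human variation
```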
