Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference

arXiv:2602.17424v1. Abstract: Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme of the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking "the caravan" - "asylum seekers" - "those contemplating illegal entry", allowing models to capture lexical diversity and framing variation in media discourse, while maintaining the fine-grained annotation of DEs. We reannotate the NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.

Executive Summary

The article 'Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference' addresses a limitation of current cross-document coreference resolution (CDCR) datasets: they primarily focus on event resolution and employ a narrow definition of coreference, which makes them less effective for analyzing diverse and polarized news coverage. The authors propose a revised annotation scheme for the NewsWCL50 dataset that treats coreference chains as discourse elements (DEs) and conceptual units of analysis. The scheme accommodates both identity and near-identity relations, for example by linking "the caravan", "asylum seekers", and "those contemplating illegal entry", so that models can capture lexical diversity and framing variation in media discourse. NewsWCL50 and a subset of ECB+ are reannotated with a unified codebook and evaluated through lexical diversity metrics and a same-head-lemma baseline. The two reannotated datasets align closely with each other and fall between the original ECB+ and the original NewsWCL50, which supports balanced and discourse-aware CDCR research in the news domain.

Key Points

  • Current CDCR datasets primarily focus on event resolution and employ a narrow definition of coreference.
  • The proposed annotation scheme treats coreference chains as discourse elements.
  • The approach accommodates both identity and near-identity relations.
  • The reannotated datasets align closely with each other, falling between the original ECB+ and NewsWCL50.
  • This alignment supports balanced and discourse-aware CDCR research in the news domain.

Merits

Comprehensive Annotation Scheme

The proposed scheme treats coreference chains as discourse elements and conceptual units of analysis, and it admits near-identity as well as identity relations. A single chain can therefore collect the varied wording and framing that different outlets use for the same referent, while the annotation of DEs stays fine-grained.
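To make the idea of lexical diversity within a chain concrete, the following is a minimal sketch of one simple proxy: the number of distinct head lemmas in a chain and their ratio to the number of mentions. This is an illustrative assumption, not necessarily one of the metrics the paper reports. The function name `head_lemma_diversity`, the hand-assigned head lemmas, and the repeated "caravan" mention are all hypothetical.

```python
def head_lemma_diversity(head_lemmas):
    """Return (distinct head-lemma count, distinct/total ratio) for one chain."""
    normalized = [lemma.lower() for lemma in head_lemmas]
    if not normalized:
        return 0, 0.0
    unique = len(set(normalized))
    return unique, unique / len(normalized)


# Head lemmas for one hypothetical discourse element (DE).
# The first three are hand-assigned heads for the abstract's example phrases
# ("the caravan", "asylum seekers", "those contemplating illegal entry");
# the fourth is an invented repeat mention added for illustration.
chain = ["caravan", "seeker", "those", "caravan"]

unique, ratio = head_lemma_diversity(chain)
print(f"distinct head lemmas: {unique}, ratio: {ratio:.2f}")
```

A chain built under a narrow definition of coreference would tend to score low on such a proxy, since its mentions mostly share one head word; a chain that admits near-identity relations would tend to score higher.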

Balanced Evaluation

The reannotated NewsWCL50 and the reannotated ECB+ subset align closely with each other and fall between the original ECB+ and the original NewsWCL50, which gives CDCR research in the news domain a more balanced basis.

Demerits

Limited Scope of Evaluation

The evaluation is limited to lexical diversity metrics and a same-head-lemma baseline, which may not fully capture the complexity of coreference resolution in diverse and polarized news coverage.
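For readers unfamiliar with a same-head-lemma baseline, the sketch below shows the general idea: mentions are clustered together whenever their head lemmas match. It is a simplified illustration under stated assumptions, not the paper's implementation. The `Mention` type and the function name are hypothetical, head lemmas are assumed to be precomputed by a parser and lemmatizer, and the "a migrant caravan" mention is invented for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Mention(NamedTuple):
    doc_id: str
    text: str
    head_lemma: str  # assumed to be precomputed; hand-assigned here


def same_head_lemma_clusters(mentions):
    """Group mentions across documents whose head lemmas match (case-insensitive)."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[mention.head_lemma.lower()].append(mention)
    return list(clusters.values())


# Suppose all four mentions belong to one gold discourse element.
mentions = [
    Mention("doc1", "the caravan", "caravan"),
    Mention("doc2", "asylum seekers", "seeker"),
    Mention("doc3", "those contemplating illegal entry", "those"),
    Mention("doc2", "a migrant caravan", "caravan"),  # invented mention
]

predicted = same_head_lemma_clusters(mentions)
for cluster in predicted:
    print([m.text for m in cluster])
```

In this toy input the baseline links only the two "caravan" mentions and leaves the other two as singletons, so the single gold chain is split into three predicted clusters. The more lexically diverse a dataset's chains are, the more such a baseline is expected to struggle, which is why it is useful as a rough indicator of dataset difficulty.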

Potential Bias in Annotation

The reannotation process may introduce biases, particularly in the interpretation of near-identity relations, which could affect the reliability of the datasets.

Expert Commentary

The article advances cross-document coreference resolution by addressing the limitations of current datasets with a broader annotation scheme. Its main strength is the treatment of coreference chains as discourse elements and conceptual units of analysis, which lets a chain capture lexical diversity and framing variation in media discourse. The reannotated datasets align closely with each other and fall between the original ECB+ and NewsWCL50, giving CDCR research in the news domain a more balanced basis. However, the evaluation is limited to lexical diversity metrics and a same-head-lemma baseline, which may not fully capture the complexity of coreference resolution in diverse and polarized news coverage. The reannotation process may also introduce biases, particularly in the interpretation of near-identity relations. Despite these limitations, the findings may have practical and policy relevance, for instance in discussions of media regulation and the ethical use of AI in analyzing news coverage.

Recommendations

  • Future research should explore more comprehensive evaluation metrics to fully capture the complexity of coreference resolution in diverse and polarized news coverage.
  • The reannotation process should be carefully reviewed to minimize potential biases, particularly in the interpretation of near-identity relations.
