Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification
arXiv:2603.00621v1. Abstract: Research in CDCR remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominance of the CDCR definition as the event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, and cross-dataset analysis in CDCR and compare the datasets on their lexical properties, e.g., lexical composition of the annotated mentions, lexical diversity and ambiguity metrics, discuss the annotation rules and principles that lead to high lexical diversity, and examine how these metrics influence performance on the same-head-lemma baseline. Our dataset analysis shows that ECB+, the state-of-the-art benchmark for CDCR, has one of the lowest lexical diversities, and its CDCR complexity, measured by the same-head-lemma baseline, lies in the middle among all uCDCR datasets. Moreover, comparing document and mention distributions between ECB+ and uCDCR shows that using all uCDCR datasets for model training and evaluation will improve the generalizability of CDCR models. Finally, the almost identical performance on the same-head-lemma baseline, separately applied to events and entities, shows that resolving both types is a complex task and should not be steered toward ECR alone. The uCDCR dataset is available at https://huggingface.co/datasets/AnZhu/uCDCR, and the code for parsing, analyzing, and scoring the dataset is available at https://github.com/anastasia-zhukova/uCDCR.
Executive Summary
This article introduces uCDCR, a unified dataset for Cross-Document Coreference Resolution (CDCR) that consolidates diverse publicly available English CDCR corpora into a consistent format covering both entity and event coreference. The analysis compares the datasets on lexical composition, lexical diversity, and ambiguity, and finds that ECB+, the standard CDCR benchmark, has one of the lowest lexical diversities, while its difficulty under the same-head-lemma baseline falls in the middle of the uCDCR datasets. The study argues that training and evaluating on all uCDCR datasets will improve the generalizability of CDCR models, and that the almost identical baseline performance on events and entities means CDCR should not be narrowed to event coreference resolution (ECR) alone.
Key Points
- Introduction of uCDCR, a unified dataset for CDCR
- Analysis of lexical properties and annotation rules
- Comparison of the datasets on the same-head-lemma baseline
- Importance of considering both entity and event coreference
Merits
Comprehensive Dataset Unification
The uCDCR dataset provides a consistent format for diverse CDCR corpora, facilitating reproducible research and fair comparison of models.
In-Depth Analysis of Lexical Properties
The study provides valuable insights into lexical composition, diversity, and ambiguity metrics, shedding light on the complexities of CDCR.
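The abstract ties these lexical metrics to performance on the same-head-lemma baseline. As a rough illustration of the idea only, the sketch below groups mentions whose head lemmas match. The field names (`mention_id`, `head_lemma`) and the toy mentions are hypothetical and do not reflect the actual uCDCR schema or the authors' scoring code.

```python
from collections import defaultdict


def same_head_lemma_clusters(mentions):
    """Group mention IDs by the lowercased lemma of each mention's head word.

    This is a simplified sketch of a same-head-lemma baseline; the input
    layout (dicts with 'mention_id' and 'head_lemma') is an assumption.
    """
    clusters = defaultdict(list)
    for mention in mentions:
        key = mention["head_lemma"].lower()
        clusters[key].append(mention["mention_id"])
    return dict(clusters)


# Toy mentions from three hypothetical documents (not taken from uCDCR).
mentions = [
    {"mention_id": "d1_m1", "head_lemma": "earthquake"},
    {"mention_id": "d2_m4", "head_lemma": "Earthquake"},
    {"mention_id": "d2_m7", "head_lemma": "quake"},
    {"mention_id": "d3_m2", "head_lemma": "strike"},
]

clusters = same_head_lemma_clusters(mentions)
for lemma, ids in clusters.items():
    print(lemma, ids)
```

In this toy input, "quake" lands in a different cluster from "earthquake" even if both refer to the same event, which suggests why higher lexical diversity would make a dataset harder for such a baseline. Conversely, an ambiguous lemma such as "strike" would merge unrelated mentions that happen to share it.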
Demerits
Limited Domain Coverage
uCDCR consolidates only publicly available English corpora, so languages other than English and domains without an existing public CDCR corpus remain uncovered.
Dependence on Annotation Quality
uCDCR inherits the annotations of its source corpora. Although the authors correct known inconsistencies, the corpora follow varying annotation standards, so residual inconsistencies or biases may remain.
Expert Commentary
The introduction of the uCDCR dataset is a significant contribution to the field of CDCR, providing a unified framework for dataset analysis and model evaluation. The study's findings highlight the importance of considering both entity and event coreference, and the need for standardized evaluation protocols. However, further research is necessary to address the limitations of the dataset and to explore its applications in various domains. The uCDCR dataset has the potential to facilitate the development of more accurate and robust CDCR models, with implications for natural language processing and information retrieval tasks.
Recommendations
- Further expansion of the uCDCR dataset to cover additional domains
- Adoption of the paper's standardized metrics and evaluation protocols when evaluating CDCR models
- Investigation of the applications of the uCDCR dataset in various NLP and IR tasks