BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization
arXiv:2602.16843v1 Announce Type: new Abstract: Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries. We introduce BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. The proposed method assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. A single multilingual instruction-tuned language model handles question generation, question answering, candidate answer extraction, and question importance weighting. This unified design reduces system complexity and computational cost. To capture semantic consistency beyond surface-level overlap, we use BERTScore-Recall for answer comparison. We validate BanglaSummEval on 300 human-written summaries from educational and medical domains, demonstrating strong correlation with expert human judgments (Pearson's $r = 0.694$, Spearman's $\rho = 0.763$). By providing interpretable, step-wise diagnostics alongside reliable evaluation scores, BanglaSummEval offers a practical and transparent solution for factual consistency evaluation in low-resource language settings.
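The abstract's step-wise pipeline can be sketched in code. The following is a minimal illustration, not the paper's implementation: the LLM-based question generation, question answering, and BERTScore-Recall answer comparison are assumed to exist upstream and are represented only by their outputs, and the hypothetical `weighted_consistency` helper shows just the importance-weighted aggregation of per-question answer similarities into a single score.

```python
def weighted_consistency(
    similarities: list[float],  # per-question answer similarity (e.g. BERTScore-Recall), in [0, 1]
    importances: list[float],   # per-question importance weights from the LLM
) -> float:
    """Importance-weighted average of answer similarities.

    Hypothetical aggregation step: each generated question contributes its
    answer-similarity score, weighted by how important the question is.
    """
    if len(similarities) != len(importances):
        raise ValueError("one importance weight per question is required")
    total = sum(importances)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(similarities, importances)) / total
```

For example, two questions with similarities 1.0 and 0.5 and weights 2.0 and 1.0 yield (2.0 + 0.5) / 3.0 ≈ 0.833, so a summary that answers the highly weighted question correctly is penalized less for missing a minor one.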
Executive Summary
The article introduces BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. This framework assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. The proposed method demonstrates strong correlation with expert human judgments and offers a practical and transparent solution for factual consistency evaluation in low-resource language settings. However, the reliance on a single multilingual language model may limit its scalability and adaptability to different domains and languages.
Key Points
- ▸ BanglaSummEval is a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization.
- ▸ The framework assesses both factual accuracy and content coverage through automatically generated questions and answers.
- ▸ The proposed method demonstrates strong correlation with expert human judgments (Pearson's r = 0.694, Spearman's ρ = 0.763).
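The reported agreement figures are standard correlation statistics: Pearson's r measures linear agreement between metric scores and human ratings, while Spearman's ρ is simply Pearson's r computed on ranks, so it captures monotone agreement. A small pure-Python illustration (the data below is invented for demonstration and is not the paper's evaluation data):

```python
import math

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson's r: covariance normalized by the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(x: list[float]) -> list[float]:
    """1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho is Pearson's r applied to the ranks."""
    return pearson(ranks(x), ranks(y))

metric_scores = [0.2, 0.5, 0.9, 0.4]   # invented metric outputs
human_ratings = [1.0, 3.0, 5.0, 2.0]   # invented expert ratings
```

Here the metric ranks the four summaries in the same order as the humans, so ρ = 1.0 even though the relationship is not perfectly linear (r ≈ 0.994); values like the paper's r = 0.694 and ρ = 0.763 indicate strong but imperfect agreement.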
Merits
Strength in Factual Consistency Evaluation
BanglaSummEval's reference-free design evaluates factual consistency in Bangla summarization without human-annotated reference summaries, making it a valuable tool in low-resource settings where such references are scarce or expensive to produce.
Unified, Low-Cost Design
Using a single multilingual instruction-tuned language model for question generation, question answering, candidate answer extraction, and importance weighting reduces system complexity and computational cost, making the framework easier to deploy and maintain.
Demerits
Limitation in Scalability
The reliance on a single multilingual language model may limit the framework's ability to adapt to different domains and languages, potentially reducing its effectiveness in certain contexts.
Potential for Bias
The use of a pre-trained language model may introduce biases and inaccuracies, particularly if the model is not fine-tuned for the specific language or domain being evaluated.
Expert Commentary
BanglaSummEval is a significant contribution to the field of natural language processing, particularly in the area of factual consistency evaluation in low-resource languages. While the framework demonstrates strong correlation with expert human judgments, its reliance on a single multilingual language model may limit its scalability and adaptability. Nevertheless, the proposed method offers a practical and transparent solution for factual consistency evaluation in low-resource language settings. As the demand for summarization systems continues to grow, the development of evaluation metrics like BanglaSummEval will become increasingly important for ensuring the accuracy and reliability of these systems.
Recommendations
- ✓ Future research should focus on fine-tuning the language model for specific languages and domains to improve the framework's adaptability and effectiveness.
- ✓ The development of additional reference-free evaluation metrics should be explored to provide a more comprehensive understanding of summarization quality and accuracy.