Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection
arXiv:2603.18015v1 Announce Type: new Abstract: Although automated harmful content detection systems are widely used to monitor online platforms, moderators and end users frequently cannot understand the logic underlying their predictions. While recent studies have focused on increasing classification accuracy, little attention has been paid to understanding why neural models identify content as harmful, especially in borderline, contextual, and politically sensitive situations. In this work, a neural harmful content detection model trained on the Civil Comments dataset is analyzed through an explainability-driven lens. Two popular post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, are used to analyze the behavior of a RoBERTa-based classifier in both correct predictions and systematic failure cases. Despite strong overall performance, with an area under the curve of 0.93 and an accuracy of 0.94, the analysis reveals limitations that are not observable from aggregate evaluation metrics alone. Integrated Gradients tends to produce more diffuse, contextual attributions, while Shapley Additive Explanations concentrates attributions on explicit lexical cues. The resulting divergence in their outputs manifests in both false negatives and false positives. Qualitative case studies reveal recurring failure modes such as indirect toxicity, lexical over-attribution, and mishandled political discourse. The results suggest that explainable AI can foster human-in-the-loop moderation by exposing model uncertainty and making the rationale behind automated decisions more interpretable. Most importantly, this work positions explainability as a transparency and diagnostic resource for online harmful content detection systems rather than as a performance-enhancing lever.
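As a concrete illustration of the kind of pipeline the abstract describes, the sketch below scores comments with a RoBERTa-based toxicity classifier and reports the aggregate metrics the abstract cites (accuracy and ROC-AUC). The checkpoint name, example comments, and decision threshold are placeholders for illustration, not the authors' actual setup.

```python
# Minimal sketch (not the authors' code): score comments with a RoBERTa-based
# toxicity classifier and compute accuracy and ROC-AUC on held-out data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, roc_auc_score

CHECKPOINT = "path/to/roberta-civil-comments"  # hypothetical fine-tuned model

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def toxicity_scores(texts):
    """Return P(harmful) for each comment (class index 1 assumed 'harmful')."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1)[:, 1].tolist()

# Placeholder held-out data; in the paper this would be Civil Comments.
texts = ["Thanks for the thoughtful reply.", "People like you ruin every thread."]
labels = [0, 1]

scores = toxicity_scores(texts)
preds = [int(s >= 0.5) for s in scores]
print("accuracy:", accuracy_score(labels, preds))
print("roc_auc:", roc_auc_score(labels, scores))
```

As the abstract stresses, such aggregate numbers can look strong while still hiding systematic failure modes; the attribution analysis below is what surfaces them.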
Executive Summary
This article presents an explainability-driven analysis of a neural harmful content detection model, highlighting the limitations of relying solely on aggregate evaluation metrics. The study uses two popular post-hoc explanation methods to analyze the behavior of a RoBERTa-based classifier in both correct predictions and systematic failure cases. The results reveal recurring failure modes, including indirect toxicity and lexical over-attribution, and suggest that explainable AI can foster human-in-the-loop moderation by exposing model uncertainty and making the rationale behind automated decisions more interpretable. The study emphasizes the role of explainability as a transparency and diagnostic resource for online harmful content detection systems.
Key Points
- ▸ Explainability-driven analysis of a neural harmful content detection model
- ▸ Revealing limitations of relying solely on aggregate evaluation metrics
- ▸ Two post-hoc explanation methods used: Shapley Additive Explanations and Integrated Gradients (illustrated in the sketch below)
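For readers unfamiliar with these two methods, the sketch below shows one common way to obtain token-level attributions for a fine-tuned RoBERTa classifier: Integrated Gradients via Captum and Shapley Additive Explanations via the shap library. The checkpoint name, baseline choice, and example comment are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, not the authors' implementation: token-level attributions
# for a RoBERTa classifier with Integrated Gradients (Captum) and SHAP.
import torch
import shap
from captum.attr import LayerIntegratedGradients
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

CHECKPOINT = "path/to/roberta-civil-comments"  # hypothetical fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

text = "People like you ruin every thread."    # illustrative borderline comment

# --- Integrated Gradients: diffuse, contextual attributions over embeddings ---
def prob_harmful(input_ids, attention_mask):
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return torch.softmax(logits, dim=-1)[:, 1]  # class 1 assumed "harmful"

enc = tokenizer(text, return_tensors="pt")
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)  # crude baseline

lig = LayerIntegratedGradients(prob_harmful, model.roberta.embeddings)
ig_attr = lig.attribute(enc["input_ids"], baselines=baseline,
                        additional_forward_args=(enc["attention_mask"],),
                        n_steps=50)
ig_scores = ig_attr.sum(dim=-1).squeeze(0)     # one attribution per token

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for tok, score in zip(tokens, ig_scores.tolist()):
    print(f"{tok:>12s} {score:+.3f}")

# --- SHAP: tends to concentrate credit on explicit lexical cues ---------------
clf = pipeline("text-classification", model=model, tokenizer=tokenizer, top_k=None)
explainer = shap.Explainer(clf)
shap_values = explainer([text])
print(shap_values[0])                          # per-token SHAP values for the comment
```

Comparing the two outputs on the same comment is, per the paper, where the interesting signal lies: the divergence between diffuse contextual credit (IG) and concentrated lexical credit (SHAP) co-occurs with false positives and false negatives.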
Merits
Enhancing Transparency and Accountability
The study highlights the importance of explainability in providing transparency and accountability in automated decision-making processes.
Fostering Human-in-the-Loop Moderation
The results suggest that explainable AI can facilitate human-in-the-loop moderation by exposing model uncertainty and making the rationale behind automated decisions more interpretable.
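One way this might be operationalized, sketched purely as an illustration rather than anything proposed in the paper: route comments whose predicted probability sits near the decision threshold, or whose SHAP and Integrated Gradients attributions disagree strongly, to a human moderator along with the highlighted tokens. The thresholds and function name below are hypothetical.

```python
# Illustrative triage logic (not from the paper): send uncertain or
# explanation-divergent predictions to human review with their rationale.
from scipy.stats import spearmanr

def route_comment(prob_harmful, shap_scores, ig_scores,
                  low=0.35, high=0.65, min_agreement=0.4):
    """Decide whether a comment can be auto-moderated or needs a human.

    prob_harmful : model's P(harmful) for the comment
    shap_scores  : per-token SHAP attributions
    ig_scores    : per-token Integrated Gradients attributions (same length)
    """
    uncertain = low < prob_harmful < high
    agreement, _ = spearmanr(shap_scores, ig_scores)  # rank agreement of the two methods
    divergent = agreement < min_agreement
    if uncertain or divergent:
        return "human_review"   # expose rationale and uncertainty to the moderator
    return "auto_remove" if prob_harmful >= high else "auto_keep"
```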
Demerits
Methodological Limitations
The study relies on a single dataset and model, which may limit the generalizability of the findings to other contexts and datasets.
Scalability Concerns
The computational demands of the post-hoc explanation methods used in the study may pose scalability challenges for large-scale applications.
Expert Commentary
The article makes a significant contribution to the field of AI explainability by highlighting the limitations of relying solely on aggregate evaluation metrics. The study's findings suggest that explainability-driven analysis is essential for fostering transparency and accountability in AI systems. However, the study's methodological limitations and scalability concerns should be addressed in future research. Furthermore, the study's implications for fairness and bias in AI, as well as human-AI collaboration, warrant further exploration. Overall, the article provides a valuable perspective on the importance of explainability in AI and has significant implications for both practical and policy-making contexts.
Recommendations
- ✓ Recommendation 1: Future research should prioritize the development of more scalable and efficient post-hoc explanation methods to facilitate wider adoption in AI systems.
- ✓ Recommendation 2: Policymakers and regulatory bodies should incorporate explainability-driven analysis into their decision-making processes to ensure more effective and accountable online content regulation.