Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

Xiaotian Zhou, Di Tang, Xiaofeng Wang, Xiaozhong Liu

arXiv:2604.05483v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideological, or incorrect responses, limiting their applications when it is unclear on which topics their answers can be trusted. In this research, we introduce a novel algorithm, GMRL-BD, designed to identify the untrustworthy boundary (in terms of topics) of a given LLM, with only black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm employs multiple reinforcement-learning agents to efficiently identify topics (nodes in the KG) on which the LLM is likely to generate biased answers. Our experiments demonstrate the efficiency of the algorithm, which can detect the untrustworthy boundary with only a limited number of queries to the LLM. Additionally, we have released a new dataset covering popular LLMs, including Llama2, Vicuna, Falcon, Qwen2, Gemma2, and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.

Executive Summary

The article presents a novel algorithm, GMRL-BD, for detecting the untrustworthy boundary of a Large Language Model (LLM) by identifying the topics on which it is prone to biased or incorrect responses. Using a knowledge graph derived from Wikipedia and multi-agent reinforcement learning, the algorithm maps out an LLM's vulnerabilities with a small query budget. The study also introduces a new dataset labeling bias-prone topics across popular LLMs, including Llama2 and Vicuna. This research addresses a critical gap in LLM reliability, offering a scalable approach to trustworthiness assessment for black-box models, which is essential for their safe deployment in high-stakes applications.

Key Points

  • Proposes GMRL-BD, a black-box algorithm to detect untrustworthy LLM boundaries using reinforcement learning and a knowledge graph.
  • Demonstrates efficiency by identifying biased topics with limited LLM queries, reducing computational overhead.
  • Releases a comprehensive dataset labeling bias-prone topics across multiple LLMs, enabling further research and benchmarking.
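The digest does not reproduce GMRL-BD's internals, but the core idea in the key points above can be sketched in general terms: a reinforcement-learning-style agent walks a knowledge graph, spends a fixed query budget probing a black-box LLM at each visited topic, and propagates reward toward neighboring topics so exploration drifts into bias-prone regions. The toy graph, the `is_biased` oracle, and the epsilon-greedy update below are all hypothetical stand-ins, not the paper's method:

```python
import random

# Toy knowledge graph: topic -> neighboring topics (hypothetical).
KG = {
    "topics": ["science", "politics"],
    "science": ["physics", "climate"],
    "politics": ["elections", "ideology"],
    "climate": ["policy"],
    "ideology": ["policy"],
    "physics": [], "elections": [], "policy": [],
}

def is_biased(topic):
    """Stand-in for querying the black-box LLM and scoring its answer."""
    return topic in {"ideology", "elections", "policy"}  # toy ground truth

def detect_boundary(start, budget=8, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    value = {}          # learned estimate of how promising each topic is
    frontier = [start]  # unvisited candidate topics
    flagged = []        # topics where the probe detected bias
    for _ in range(budget):
        if not frontier:
            break
        # Epsilon-greedy action selection over the frontier.
        if rng.random() < epsilon:
            topic = rng.choice(frontier)
        else:
            topic = max(frontier, key=lambda t: value.get(t, 0.0))
        frontier.remove(topic)
        reward = 1.0 if is_biased(topic) else 0.0  # one query spent here
        if reward:
            flagged.append(topic)
        # Propagate value to neighbors so the walk drifts toward bias.
        for nb in KG.get(topic, []):
            value[nb] = max(value.get(nb, 0.0),
                            0.5 * (reward + value.get(topic, 0.0)))
            if nb not in frontier and nb not in flagged:
                frontier.append(nb)
        value[topic] = reward
    return flagged

print(detect_boundary("topics"))
```

Every topic the sketch flags has actually been queried, so the query budget directly bounds the cost of the detection, which is the efficiency property the key points emphasize.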

Merits

Novelty

Introduces a first-of-its-kind approach combining knowledge graphs, reinforcement learning, and black-box LLM interrogation to systematically map untrustworthy boundaries.

Scalability

Operates efficiently with minimal queries, making it feasible for large-scale LLM evaluations without excessive computational cost.

Practical Utility

Provides actionable insights for developers and policymakers to mitigate LLM biases by identifying high-risk topics.
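The digest does not show the released dataset's schema, so as a purely hypothetical illustration, suppose the labels were distributed as rows of (model, topic, biased). A deployer could then load the labels for a given model and gate high-risk topics before routing queries to it:

```python
import csv
import io

# Hypothetical sample in an assumed (model, topic, biased) schema;
# the actual released dataset's format is not specified in this digest.
sample = """model,topic,biased
Llama2,climate policy,1
Llama2,linear algebra,0
Vicuna,elections,1
"""

def load_risky_topics(fileobj, model):
    """Return the set of topics labeled as bias-prone for one model."""
    reader = csv.DictReader(fileobj)
    return {row["topic"] for row in reader
            if row["model"] == model and row["biased"] == "1"}

risky = load_risky_topics(io.StringIO(sample), "Llama2")
print(risky)  # {'climate policy'}
```

A per-model topic blocklist like this is one concrete way the "actionable insights" above could reach production systems.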

Demerits

Dataset Limitations

The bias labels are derived from Wikipedia-based knowledge graphs, which may not fully capture the nuances of real-world LLM biases or cultural contexts.

Reinforcement Learning Constraints

The multi-agent reinforcement learning approach may introduce complexity in tuning and interpreting results, potentially limiting reproducibility.

Black-Box Assumptions

The algorithm assumes black-box access, which may not align with scenarios where partial model transparency is available, potentially reducing its applicability.

Expert Commentary

This research represents a significant advancement in the field of AI safety by addressing a critical challenge: the systematic identification of untrustworthy boundaries in black-box LLMs. The authors’ novel integration of knowledge graphs and multi-agent reinforcement learning offers a scalable and efficient solution to a problem that has plagued the deployment of LLMs in sensitive domains. The release of a labeled dataset further enhances the study’s impact, providing a valuable resource for the research community. However, the reliance on Wikipedia-derived knowledge graphs may limit the generalizability of the findings to more nuanced or culturally specific biases. Additionally, the black-box assumption may not fully align with the growing trend toward model transparency. Nonetheless, the work sets a new benchmark for LLM trustworthiness assessment and underscores the importance of proactive bias detection in AI systems.

Recommendations

  • Expand the knowledge graph to include diverse sources beyond Wikipedia to improve cultural and contextual relevance of bias detection.
  • Explore hybrid approaches that leverage both black-box and partial white-box access to enhance the robustness and interpretability of the algorithm.
  • Collaborate with policymakers to develop standardized testing frameworks that incorporate GMRL-BD-like methodologies for regulatory compliance.

Sources

Original: arXiv - cs.AI