INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection
arXiv:2602.18448v1. Abstract: Administrative phone tasks drain roughly 1 trillion USD annually from U.S. healthcare, with over 500 million insurance-benefit verification calls manually handled in 2024. We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification. The corpus includes 50 de-identified, AI-initiated calls with live insurance representatives (mean 71 turns/call) and 1,000 synthetically generated calls that mirror the same workflow. All calls are annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information and Procedural compliance under explicit ask/answer logic. We define two novel evaluation tasks: (1) Phase Boundary Detection (span segmentation under phase-specific acceptance rules) and (2) Compliance Verification (IC/PC decisions given fixed spans). Per-phase scores are strong across small, low-latency baselines, but end-to-end reliability is constrained by span-boundary errors. On real calls, full-call exact segmentation is low, showing a gap between conversational fluency and audit-grade evidence.
Executive Summary
This study introduces INSURE-Dial, which the authors describe as, to their knowledge, the first public benchmark for developing and assessing compliance-aware voice agents. The corpus contains 50 de-identified, AI-initiated calls with live insurance representatives (mean 71 turns per call) and 1,000 synthetically generated calls that mirror the same workflow. Every call is annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information Compliance (IC) and Procedural Compliance (PC). The authors define two evaluation tasks: Phase Boundary Detection (span segmentation under phase-specific acceptance rules) and Compliance Verification (IC/PC decisions given fixed spans). Small, low-latency baselines score well per phase, but span-boundary errors constrain end-to-end reliability, and full-call exact segmentation on real calls is low. The results point to a gap between conversational fluency and audit-grade evidence, which matters for automating administrative phone tasks in healthcare.
Key Points
- ▸ INSURE-Dial is a public conversational dataset and benchmark for evaluating compliance-aware voice agents on insurance-benefit verification calls.
- ▸ The dataset includes 50 de-identified, AI-initiated calls with live insurance representatives and 1,000 synthetically generated calls, all annotated with a phase-structured JSON schema.
- ▸ Two evaluation tasks are proposed: Phase Boundary Detection (span segmentation) and Compliance Verification (IC/PC decisions given fixed spans).
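To make the idea of a phase-structured annotation concrete, here is a minimal Python sketch of what one annotated call might look like. The abstract names the phases and the two compliance labels but does not publish the schema in this summary, so every field name below (`start_turn`, `end_turn`, `information_compliant`, `procedural_compliant`, `call_id`) and the phase identifiers are hypothetical, chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

# Phase names follow the abstract; the identifiers themselves are assumptions.
PHASES = (
    "ivr_navigation",
    "patient_identification",
    "coverage_status",
    "medication_check_1",   # the abstract allows up to two drugs per call
    "medication_check_2",
    "agent_identification",  # the CRN phase in the abstract
)


@dataclass
class PhaseAnnotation:
    """One phase of one call: a turn span plus two compliance labels."""
    phase: str
    start_turn: Optional[int]  # inclusive turn index; None if the phase is absent
    end_turn: Optional[int]    # inclusive turn index; None if the phase is absent
    information_compliant: Optional[bool]  # IC label (hypothetical field name)
    procedural_compliant: Optional[bool]   # PC label (hypothetical field name)

    def __post_init__(self) -> None:
        if self.phase not in PHASES:
            raise ValueError(f"unknown phase: {self.phase}")
        if (self.start_turn is None) != (self.end_turn is None):
            raise ValueError("span must have both ends or neither")
        if self.start_turn is not None and self.start_turn > self.end_turn:
            raise ValueError("span start must not exceed span end")


def call_to_json(call_id: str, phases: List[PhaseAnnotation]) -> str:
    """Serialize one call's annotations to a JSON string."""
    return json.dumps(
        {"call_id": call_id, "phases": [asdict(p) for p in phases]},
        indent=2,
    )
```

A record built this way ties each IC/PC label to an explicit turn span, which is the property that span-based compliance verification relies on.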
Merits
Strength in Phase-Aware Design
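Annotating each call by phase (IVR navigation, patient identification, coverage status, medication checks, agent identification) lets compliance be judged against the specific span of dialogue that supports it rather than against the call as a whole. This span-level evidence is well suited to auditing automated administrative phone tasks in healthcare.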
Novel Evaluation Tasks
Separating Phase Boundary Detection from Compliance Verification helps isolate where a system fails: either it cannot locate a phase's span, or it is given the correct span and still misjudges IC/PC compliance.
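The abstract's observation that strong per-phase scores can coexist with low full-call exact segmentation can be illustrated with a toy calculation. The sketch below uses strict span equality as the acceptance rule, which is a simplification: the paper applies phase-specific acceptance rules that are not detailed in this summary. The function names and the toy data are assumptions made for illustration, not the authors' metric code or results.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]            # (start_turn, end_turn), inclusive
Segmentation = Dict[str, Span]    # phase name -> span for one call


def per_phase_exact_rate(
    gold: List[Segmentation], pred: List[Segmentation]
) -> Dict[str, float]:
    """Fraction of calls in which each phase's predicted span equals gold."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for g, p in zip(gold, pred):
        for phase, span in g.items():
            totals[phase] = totals.get(phase, 0) + 1
            hits[phase] = hits.get(phase, 0) + int(p.get(phase) == span)
    return {phase: hits[phase] / totals[phase] for phase in totals}


def full_call_exact_rate(
    gold: List[Segmentation], pred: List[Segmentation]
) -> float:
    """Fraction of calls in which every phase span matches gold exactly."""
    exact = sum(1 for g, p in zip(gold, pred) if g == p)
    return exact / len(gold)


# Toy data (not from the paper): four calls, three phases each.
gold_call = {"ivr": (0, 4), "patient_id": (5, 9), "coverage": (10, 15)}
gold = [dict(gold_call) for _ in range(4)]
pred = [
    {"ivr": (0, 5), "patient_id": (5, 9), "coverage": (10, 15)},   # ivr off by one
    {"ivr": (0, 4), "patient_id": (6, 9), "coverage": (10, 15)},   # patient_id off
    {"ivr": (0, 4), "patient_id": (5, 9), "coverage": (10, 14)},   # coverage off
    {"ivr": (0, 4), "patient_id": (5, 9), "coverage": (10, 15)},   # all correct
]

phase_rates = per_phase_exact_rate(gold, pred)
call_rate = full_call_exact_rate(gold, pred)
```

In this toy set each phase is segmented correctly in three of four calls, yet only one call in four is correct end to end, because a single boundary error anywhere in a call fails the whole call.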
Demerits
Limited Generalizability
The benchmark covers a single workflow, insurance-benefit verification calls, so its findings may not transfer to other domains such as customer service or technical support.
Need for Further Research
The results leave the gap between conversational fluency and audit-grade evidence open. Per-phase scores are strong, yet full-call exact segmentation on real calls remains low, so further work on span-boundary accuracy is needed before such agents can be audited reliably.
Expert Commentary
INSURE-Dial is a useful contribution to conversational AI and compliance auditing. Its phase-aware annotation and its two separate evaluation tasks make it possible to tell segmentation failures apart from compliance-judgment failures. Its limitations, namely a single insurance workflow and low full-call exact segmentation on real calls, show that continued investigation is needed. If those gaps are closed, compliance-aware voice agents could take on a meaningful share of administrative phone work in healthcare and possibly in other regulated settings.
Recommendations
- ✓ Future research should focus on developing more robust evaluation frameworks for compliance-aware voice agents, including the use of more diverse and representative datasets.
- ✓ Policymakers should consider the implications of AI on regulatory compliance and the potential for AI to streamline bureaucratic processes, and develop guidelines for the development and deployment of compliance-aware voice agents.