Academic

ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

arXiv:2604.05378v1 Announce Type: new Abstract: Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance change

K
Kaiser Hamid, Can Cui, Nade Liang
· · 1 min read · 12 views

arXiv:2604.05378v1 Announce Type: new Abstract: Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.

Executive Summary

The article presents ICR-Drive, a diagnostic framework designed to evaluate the robustness of language-conditioned autonomous driving systems against variations in instruction phrasing, specificity, and potential misleading inputs. By introducing controlled perturbations across four families—Paraphrase, Ambiguity, Noise, and Misleading—the framework systematically assesses how minor linguistic changes impact driving performance. The study demonstrates that even subtle alterations in instructions can significantly degrade performance, exposing critical reliability gaps in current autonomous driving models. This work highlights the necessity of incorporating instruction counterfactual robustness in safety-critical deployments of embodied foundation models.

Key Points

  • ICR-Drive introduces a structured methodology to evaluate the robustness of language-driven autonomous driving systems to instruction perturbations.
  • The framework isolates performance changes attributable to instruction language by replaying identical routes under matched simulator configurations, ensuring controlled comparisons.
  • Experiments on models like LMDrive and BEVDriver reveal substantial performance drops and distinct failure modes due to minor instruction variations, underscoring existing reliability gaps.
  • The study quantifies robustness using standard CARLA Leaderboard metrics, providing a measurable benchmark for future improvements.
  • Misleading instructions, which conflict with navigation goals, are particularly detrimental, simulating real-world scenarios where adversarial or poorly framed inputs may occur.

Merits

Methodological Rigor

The article introduces a highly controlled and systematic approach to evaluating instruction robustness, which is critical for safety-critical applications like autonomous driving. The use of matched simulator configurations and replayed routes ensures that performance changes are attributable solely to instruction variations.

Novelty and Relevance

The focus on instruction counterfactual robustness is novel in the context of autonomous driving, addressing a gap in current evaluations that assume perfect or well-formed instructions. The introduction of perturbation families like Misleading and Ambiguity provides a realistic lens through which to assess system resilience.

Empirical Insight

The study provides empirical evidence that minor linguistic changes can induce significant performance degradation, offering actionable insights into the vulnerabilities of current models. The findings are particularly relevant for deployment scenarios where instructions may vary in quality or intent.

Scalability

The framework is designed to be scalable and adaptable to other embodied foundation models beyond autonomous driving, making it a versatile tool for evaluating language robustness in various domains.

Demerits

Limited Real-World Validation

While the study uses the CARLA simulator for controlled experiments, the real-world applicability of these findings may be limited due to the controlled nature of simulation environments. Real-world driving scenarios involve far greater complexity and unpredictability in instruction variations.

Focus on Language Perturbations Only

The framework primarily evaluates linguistic perturbations and does not account for other critical factors in autonomous driving, such as sensor noise, dynamic environmental changes, or multi-modal input variations. This narrow focus may overlook broader robustness challenges.

Dependence on CARLA Metrics

The reliance on CARLA Leaderboard metrics for quantifying robustness may not fully capture the nuances of real-world driving performance. Additional metrics or human evaluations could provide a more comprehensive assessment.

Potential Overfitting to Perturbations

The systematic generation of perturbation families may lead to models that are optimized for these specific variations rather than generalizing to unseen or more naturalistic instruction variations encountered in real-world deployments.

Expert Commentary

The introduction of ICR-Drive represents a significant step forward in the evaluation of language-conditioned autonomous driving systems, addressing a critical gap in current robustness assessments. The methodological rigor of the framework, particularly its use of controlled perturbations and matched simulator conditions, provides a robust foundation for isolating the impact of instruction variations. However, the study also raises important questions about the scalability and real-world applicability of these findings. While the CARLA simulator offers a controlled environment, the complexity of real-world driving—where instructions are dynamic, multi-modal, and context-dependent—poses additional challenges not fully captured by this work. The emphasis on misleading instructions is particularly noteworthy, as it aligns with broader concerns in AI safety regarding adversarial inputs and intent misalignment. For practical deployment, this framework should be complemented by real-world testing and integration with broader robustness evaluation methodologies. The insights provided by ICR-Drive are invaluable for guiding both developers and regulators in ensuring that autonomous systems are not only technically proficient but also resilient to the vagaries of human communication.

Recommendations

  • Integrate ICR-Drive or similar frameworks into the development and testing pipelines of autonomous driving systems to systematically evaluate instruction robustness from early stages.
  • Expand the perturbation families to include multi-modal variations (e.g., visual or auditory instructions) and dynamic scenarios to better reflect real-world conditions.
  • Collaborate with regulatory bodies to establish standardized benchmarks for instruction robustness, ensuring consistency and comparability across models and deployments.
  • Develop hybrid models that combine learned policies with rule-based systems to handle ambiguous or conflicting instructions more robustly.
  • Invest in research to explore the generalization of instruction robustness across different languages, cultural contexts, and dialects, ensuring inclusivity and equitable performance.

Sources

Original: arXiv - cs.CL