
When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling

Yuanhang Li

arXiv:2604.03562v1 Announce Type: new Abstract: Adaptive reward design for deep reinforcement learning (DRL) in multi-beam LEO satellite scheduling is motivated by the intuition that regime-aware reward weights should outperform static ones. We systematically test this intuition and uncover a switching-stability dilemma: near-constant reward weights (342.1 Mbps) outperform carefully tuned dynamic weights (103.3+/-96.8 Mbps) because PPO requires a quasi-stationary reward signal for value function convergence. Weight adaptation, regardless of quality, degrades performance by repeatedly restarting convergence. To understand why specific weights matter, we introduce a single-variable causal probing method that independently perturbs each reward term by +/-20% and measures the PPO response after 50k steps. Probing reveals counterintuitive leverage: a +20% increase in the switching penalty yields +157 Mbps for polar handover and +130 Mbps for hot-cold regimes, findings inaccessible to human experts or trained MLPs without systematic probing. We evaluate four MDP architect variants (fixed, rule-based, learned MLP, fine-tuned LLM) across known and novel traffic regimes. The MLP achieves 357.9 Mbps on known regimes and 325.2 Mbps on novel regimes, while the fine-tuned LLM collapses to 45.3+/-43.0 Mbps due to weight oscillation rather than lack of domain knowledge: output consistency, not knowledge, is the binding constraint. Our findings provide an empirically grounded roadmap for LLM-DRL integration in communication systems, identifying where LLMs add irreplaceable value (natural language intent understanding) versus where simpler methods suffice.

Executive Summary

The study critically examines adaptive reward design in deep reinforcement learning (DRL) for multi-beam LEO satellite scheduling, challenging the prevailing assumption that dynamic rewards outperform static ones. The research uncovers a 'switching-stability dilemma,' wherein near-constant reward weights yield superior performance (342.1 Mbps) compared to carefully tuned dynamic weights (103.3±96.8 Mbps). The authors introduce a novel causal probing method to dissect reward term perturbations, revealing counterintuitive insights such as a +20% increase in switching penalty improving throughput by up to 157 Mbps. The study evaluates four MDP architect variants, demonstrating that while MLPs achieve robust performance (357.9 Mbps on known regimes, 325.2 Mbps on novel regimes), fine-tuned LLMs collapse due to reward weight oscillation. The findings provide a pragmatic framework for integrating LLMs with DRL in communication systems, emphasizing the primacy of reward signal stability over domain-specific knowledge.
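The dilemma contrasts two reward-shaping strategies: fixed weights that keep the reward signal quasi-stationary, and regime-aware weights that change the reward scale whenever the traffic regime switches. A minimal sketch of the contrast, using hypothetical term names and weight values rather than the paper's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class RewardTerms:
    """Per-step reward components (hypothetical names)."""
    throughput: float      # delivered throughput, normalized
    fairness: float        # e.g. a Jain-style fairness score
    switch_penalty: float  # cost of beam/handover switches

def static_reward(t: RewardTerms) -> float:
    # Near-constant weights: the reward signal stays quasi-stationary,
    # so PPO's value function only has to converge once.
    w = {"throughput": 1.0, "fairness": 0.3, "switch": 0.5}
    return (w["throughput"] * t.throughput
            + w["fairness"] * t.fairness
            - w["switch"] * t.switch_penalty)

def dynamic_reward(t: RewardTerms, regime: str) -> float:
    # Regime-aware weights: every regime switch rescales the reward,
    # repeatedly restarting value-function convergence (the dilemma).
    regime_weights = {
        "polar_handover": {"throughput": 0.8, "fairness": 0.2, "switch": 0.9},
        "hot_cold":       {"throughput": 1.2, "fairness": 0.5, "switch": 0.3},
    }
    w = regime_weights[regime]
    return (w["throughput"] * t.throughput
            + w["fairness"] * t.fairness
            - w["switch"] * t.switch_penalty)
```

The two functions score the same state differently depending on the active regime, which is exactly the non-stationarity the study identifies as harmful to PPO.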

Key Points

  • Adaptive reward design in DRL for LEO satellite scheduling fails to outperform static rewards due to the 'switching-stability dilemma,' where reward signal non-stationarity disrupts PPO convergence.
  • A single-variable causal probing method reveals that specific reward term perturbations (e.g., +20% switching penalty) can yield significant throughput gains (up to +157 Mbps), challenging human intuition and traditional MLP-based approaches.
  • Fine-tuned LLMs underperform MLPs (45.3±43.0 Mbps vs. at least 325.2 Mbps) due to reward weight oscillation, highlighting that output consistency, not domain knowledge, is the binding constraint in LLM-DRL integration.
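The single-variable probing protocol described above can be sketched as follows; the `train_ppo` stub, function names, and dictionary layout are placeholders, not the paper's implementation:

```python
import copy

def train_ppo(weights: dict, steps: int = 50_000) -> float:
    """Placeholder: train PPO under the given reward weights for
    `steps` steps and return mean throughput (Mbps)."""
    raise NotImplementedError

def causal_probe(weights, perturbation=0.20, steps=50_000, train=train_ppo):
    """Perturb each reward term by +/- `perturbation` in isolation
    and record the change in PPO throughput versus the baseline."""
    baseline = train(weights, steps)
    responses = {}
    for term in weights:
        for sign in (+1, -1):
            probed = copy.deepcopy(weights)
            probed[term] *= (1 + sign * perturbation)
            responses[(term, sign * perturbation)] = train(probed, steps) - baseline
    return responses  # e.g. {("switch_penalty", +0.2): delta_mbps, ...}
```

Because each run varies exactly one term, any throughput delta can be attributed to that term alone, which is what makes the reported +157 Mbps leverage of the switching penalty interpretable as a causal effect.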

Merits

Methodological Innovation

The introduction of causal probing as a systematic tool to dissect reward term perturbations is a significant advancement. It enables empirical validation of counterintuitive insights that are inaccessible to traditional methods, such as human expert intuition or even trained MLPs.

Empirical Rigor

The study employs a robust experimental framework, evaluating four MDP architect variants across known and novel traffic regimes. The inclusion of both static and dynamic reward designs, alongside comprehensive performance metrics, ensures a thorough assessment of the switching-stability dilemma.

Theoretical Contribution

The identification of the 'switching-stability dilemma' advances the theoretical understanding of reward design in DRL. It underscores the critical role of reward signal stationarity in value function convergence, a principle with broad implications beyond satellite scheduling.

Demerits

Limited Generalizability

The study focuses narrowly on LEO satellite scheduling, leaving open the question of whether the switching-stability dilemma applies to other domains or DRL architectures. Further research is needed to validate the findings across diverse applications.

Probing Method Constraints

The causal probing method, while innovative, is limited to ±20% perturbations and 50k-step evaluations. The extent to which these constraints capture the full complexity of reward term interactions remains unclear.

LLM Collapse Attribution

The claim that LLM collapse is solely due to reward weight oscillation lacks nuanced analysis. Additional factors, such as model capacity, training data quality, or reward function design, may also contribute to performance degradation.

Expert Commentary

This study makes a notable contribution at the intersection of reinforcement learning, reward design, and large language models, challenging conventional wisdom with empirical rigor. The discovery of the 'switching-stability dilemma' is particularly noteworthy, as it reframes the debate on adaptive rewards by demonstrating the primacy of reward signal stationarity over adaptability. The introduction of causal probing as a methodological tool is a standout innovation, providing a systematic pathway to dissect reward term interactions and uncover counterintuitive insights that elude traditional approaches. However, the study's narrow focus on LEO satellite scheduling raises questions about generalizability, and the attribution of LLM collapse to reward oscillation alone warrants further investigation. Nonetheless, the findings offer a pragmatic roadmap for LLM-DRL integration in communication systems, with broader implications for reward-sensitive applications. The work underscores the need for a shift in AI system design toward prioritizing stability and empirical validation over domain-specific knowledge or adaptive complexity.

Recommendations

  • Adopt causal probing as a standard tool in reward function design to systematically validate reward term interactions and uncover non-intuitive insights.
  • Prioritize reward signal stationarity in DRL systems, particularly in PPO-based architectures, to ensure robust value function convergence and performance stability.
  • Exercise caution when integrating LLMs with DRL systems, conducting rigorous tests for reward stability and weight oscillation to mitigate performance collapse.
  • Expand the scope of causal probing to include longitudinal studies across diverse DRL architectures and domains to validate the generalizability of the switching-stability dilemma.
  • Develop policy frameworks that mandate transparency and robustness testing in AI-driven scheduling systems, particularly in critical infrastructure like satellite networks.
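The stability-testing recommendation could start with something as simple as measuring the dispersion of an architect's proposed weights across repeated queries; this is a hypothetical check, not the paper's protocol:

```python
import statistics

def weight_oscillation(weight_history: list[dict]) -> dict:
    """Per-term standard deviation of reward weights proposed by an
    architect (rule-based, MLP, or LLM) across successive queries.
    High dispersion flags the oscillation the study links to the
    fine-tuned LLM's collapse."""
    terms = weight_history[0].keys()
    return {t: statistics.pstdev([w[t] for w in weight_history]) for t in terms}

def is_stable(weight_history: list[dict], tol: float = 0.05) -> bool:
    """Accept an architect only if no term's std-dev exceeds `tol`."""
    return all(s <= tol for s in weight_oscillation(weight_history).values())
```

Gating deployment on a check like this operationalizes the study's finding that output consistency, not domain knowledge, is the binding constraint.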

Sources

Original: arXiv - cs.AI