K^2-Agent: Co-Evolving Know-What and Know-How for Hierarchical Mobile Device Control

arXiv:2603.00676v1

Abstract: Existing mobile device control agents often perform poorly when solving complex tasks requiring long-horizon planning and precise operations, typically due to a lack of relevant task experience or unfamiliarity with skill execution. We propose K2-Agent, a hierarchical framework that models human-like cognition by separating and co-evolving declarative (knowing what) and procedural (knowing how) knowledge for planning and execution. K2-Agent's high-level reasoner is bootstrapped from a single demonstration per task and runs a Summarize-Reflect-Locate-Revise (SRLR) loop to distill and iteratively refine task-level declarative knowledge through self-evolution. The low-level executor is trained with our curriculum-guided Group Relative Policy Optimization (C-GRPO), which (i) constructs a balanced sample pool using decoupled reward signals and (ii) employs dynamic demonstration injection to guide the model in autonomously generating successful trajectories for training. On the challenging AndroidWorld benchmark, K2-Agent achieves a 76.1% success rate using only raw screenshots and open-source backbones. Furthermore, K2-Agent shows powerful dual generalization: its high-level declarative knowledge transfers across diverse base models, while its low-level procedural skills achieve competitive performance on unseen tasks in ScreenSpot-v2 and Android-in-the-Wild (AitW).
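To make the C-GRPO description more concrete, the sketch below shows the group-relative advantage used in standard GRPO (each rollout's reward is normalized against the mean and standard deviation of its own group), together with one possible reading of "dynamic demonstration injection". The abstract does not specify the injection rule, so `inject_demo_if_needed`, its trigger condition (no successful rollout in the group), and the reward values are assumptions made purely for illustration, not the authors' method.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def inject_demo_if_needed(rewards, demo_reward=1.0):
    """Hypothetical injection rule (an assumption, not taken from the paper).

    If no rollout in the group earned a positive reward, swap the last
    rollout for a demonstration with reward `demo_reward`, so the group
    carries a non-zero learning signal.
    """
    if max(rewards) <= 0.0:
        return rewards[:-1] + [demo_reward], True
    return list(rewards), False


# A group where every rollout failed yields all-zero advantages.
failed_group = [0.0, 0.0, 0.0, 0.0]
flat = group_relative_advantages(failed_group)

# After the (assumed) injection, the demonstration gets a positive advantage.
mixed_group, injected = inject_demo_if_needed(failed_group)
shaped = group_relative_advantages(mixed_group)
```

The point of the example is the failure mode: when all rewards in a group are equal, every advantage is zero and the policy receives no gradient, which is one plausible motivation for injecting a successful trajectory.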

Executive Summary

K2-Agent is a hierarchical framework for mobile device control that separates and co-evolves declarative (know-what) and procedural (know-how) knowledge. It reaches a 76.1% success rate on the AndroidWorld benchmark using only raw screenshots and open-source backbones. It also shows dual generalization: the high-level declarative knowledge transfers across diverse base models, and the low-level procedural skills perform competitively on unseen tasks in ScreenSpot-v2 and Android-in-the-Wild (AitW).
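The division of labor between the two levels can be illustrated with a toy control loop. This is a minimal sketch under stated assumptions: the `Reasoner` class, the `run_episode` function, and the representation of declarative knowledge as a list of text notes are all hypothetical, and the `revise` method is only a trivial stand-in for the Locate and Revise stages of the SRLR loop. In the paper both levels are model-driven; here they are plain Python placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Reasoner:
    """High-level planner holding task-level declarative knowledge as notes."""
    knowledge: List[str] = field(default_factory=list)

    def plan(self, task: str) -> List[str]:
        # Placeholder: a real reasoner would prompt a model with task + notes.
        return [f"{task}: {note}" for note in self.knowledge]

    def revise(self, failed_step: int, correction: str) -> None:
        # Stand-in for Locate + Revise: overwrite the note blamed for failure.
        self.knowledge[failed_step] = correction


def run_episode(reasoner: Reasoner,
                executor: Callable[[str], bool],
                task: str) -> int:
    """Run each subgoal through the low-level executor.

    Returns the index of the first failing subgoal, or -1 on success.
    """
    for i, subgoal in enumerate(reasoner.plan(task)):
        if not executor(subgoal):
            return i
    return -1


# Toy executor: fails on any subgoal that mentions "wrong".
def toy_executor(subgoal: str) -> bool:
    return "wrong" not in subgoal


reasoner = Reasoner(knowledge=["open the notes app", "tap the wrong button"])
first_failure = run_episode(reasoner, toy_executor, "create note")
reasoner.revise(first_failure, "tap the save button")
second_result = run_episode(reasoner, toy_executor, "create note")
```

In this toy run the first episode fails at the second subgoal, the faulty note is revised, and the second episode completes. The executor is untouched by the revision, which mirrors the idea that know-what and know-how are stored and updated separately.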

Key Points

  • Hierarchical framework for mobile device control
  • Co-evolution of declarative and procedural knowledge
  • Dual generalization across diverse base models and unseen tasks

Merits

Effective Knowledge Representation

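Separating declarative knowledge (used by the high-level reasoner for planning) from procedural knowledge (used by the low-level executor for operations) lets each be refined by its own mechanism: the SRLR loop for the former and C-GRPO training for the latter.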

Improved Generalization

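The abstract reports generalization at both levels: the reasoner's declarative knowledge transfers across diverse base models, and the executor's procedural skills remain competitive on unseen tasks in ScreenSpot-v2 and AitW.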

Demerits

Complexity of Implementation

Combining a hierarchical architecture with two separate learning procedures (the SRLR loop for the reasoner and C-GRPO training for the executor) may make the system harder to implement and may require significant computational resources.

Expert Commentary

K2-Agent advances mobile device control by pairing hierarchical knowledge representation with the co-evolution of planning and execution knowledge, reaching a 76.1% success rate on AndroidWorld. The dual generalization is particularly noteworthy: declarative knowledge transfers across base models, and procedural skills carry over to unseen tasks. However, implementation complexity and potential computational requirements must be weighed carefully in future applications and extensions of this work.

Recommendations

  • Further research on simplifying the implementation and reducing computational requirements
  • Exploration of potential applications in accessibility and assistive technologies

Sources