
Learning Next Action Predictors from Human-Computer Interaction

arXiv:2603.05923v1 Announce Type: new Abstract: Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts -- it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user's multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user's next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP's predicted trajectories are well-aligned with what a user does next (LLM-judge score $\geq$ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.
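
The task formalization above can be sketched as a minimal typed contract: a sequence of multimodal interaction events in, a predicted next action out. All names and fields below are hypothetical illustrations, since the abstract does not specify a concrete API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class InteractionEvent:
    """One multimodal observation: a screenshot reference, a click, or sensor data."""
    timestamp: float
    screenshot_path: Optional[str] = None   # path to the captured screen image, if any
    click: Optional[Tuple[int, int]] = None # (x, y) tap coordinates, if any
    sensors: dict = field(default_factory=dict)

@dataclass
class NAPExample:
    """A next-action-prediction instance: a context window plus the ground-truth action."""
    history: List[InteractionEvent]  # the user's recent interaction sequence
    next_action: str                 # free-text description of the action that followed

def predict_next_action(history: List[InteractionEvent]) -> str:
    """Placeholder predictor illustrating only the input/output contract;
    a real model would reason over the full multimodal history."""
    if not history:
        return "open home screen"
    last = history[-1]
    return f"continue from event at t={last.timestamp:.1f}"

events = [InteractionEvent(timestamp=0.0, click=(120, 480)),
          InteractionEvent(timestamp=2.5, screenshot_path="frame_001.png")]
print(predict_next_action(events))  # -> continue from event at t=2.5
```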

Executive Summary

The article introduces a novel framework for next action prediction (NAP) in proactive AI systems, leveraging multimodal user interactions to anticipate user behavior. By annotating longitudinal, naturalistic data using vision-language models and releasing an open-source pipeline, the authors scale data collection to over 360K actions across 20 users and 1,800 hours of usage. The LongNAP model combines parametric and in-context learning to reason over interaction histories and is trained via policy gradients. Evaluated via an LLM-as-judge metric, LongNAP significantly outperforms baselines (79% over supervised fine-tuning, 39% over prompted models), with 17.1% of predictions aligning with user actions (rising to 26% with high confidence). The work establishes NAP as a viable and scalable task with meaningful potential for real-world applications.

Key Points

  • Formalization of NAP as a predictive task requiring multimodal context
  • Use of open-source pipeline to annotate large-scale longitudinal data
  • Development of LongNAP to integrate parametric and in-context learning for improved prediction
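
As a rough illustration of the generate-retrieve-apply loop described above, the sketch below retrieves past reasoning traces by cosine similarity over bag-of-words vectors and prepends them to a prediction prompt. The trace library, toy embedding, and prompt format are stand-in assumptions for illustration, not the paper's implementation (LongNAP learns these components via policy gradients).

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a learned representation."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_traces(context: str, trace_library: list, k: int = 2) -> list:
    """Return the k past reasoning traces most similar to the current context."""
    ranked = sorted(trace_library,
                    key=lambda t: cosine(embed(context), embed(t)),
                    reverse=True)
    return ranked[:k]

def build_prompt(context: str, traces: list) -> str:
    """Apply retrieved traces in-context by prepending them to the prediction prompt."""
    trace_block = "\n".join(f"- {t}" for t in traces)
    return (f"Past reasoning traces:\n{trace_block}\n\n"
            f"Current context: {context}\nPredict the next action:")

library = [
    "user opens messaging app after checking email each morning",
    "user lowers screen brightness before opening the reading app",
    "user checks calendar before booking a ride",
]
traces = retrieve_traces("morning email check finished", library)
print(build_prompt("morning email check finished", traces))
```

The retrieval step here is deliberately simple; the paper's contribution is precisely that trace generation, retrieval, and in-context application are optimized jointly rather than hand-crafted.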

Merits

Scalability

The study effectively scales data collection using an open-source pipeline, enabling robust training on real-world user behavior.

Performance

LongNAP demonstrates statistically significant improvements over existing baselines, indicating a meaningful advancement in predictive accuracy.
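
The headline numbers follow a simple recipe: a prediction counts as aligned when its LLM-judge similarity score is at least 0.5, and filtering to high-confidence predictions raises the aligned fraction. The sketch below reproduces that arithmetic; the (score, confidence) pairs are invented for illustration and do not come from the paper.

```python
def aligned_fraction(scores, threshold=0.5):
    """Fraction of predictions whose LLM-judge similarity score meets the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Invented (judge_score, confidence) pairs, for illustration only.
preds = [(0.8, 0.9), (0.2, 0.4), (0.6, 0.8), (0.1, 0.3), (0.4, 0.85),
         (0.9, 0.95), (0.3, 0.5), (0.05, 0.2), (0.55, 0.85), (0.15, 0.6)]

all_scores = [s for s, _ in preds]
high_conf_scores = [s for s, c in preds if c >= 0.8]  # keep only confident predictions

print(f"aligned (all):       {aligned_fraction(all_scores):.0%}")        # 40%
print(f"aligned (high-conf): {aligned_fraction(high_conf_scores):.0%}")  # 80%
```

With these made-up values, alignment rises from 40% overall to 80% among high-confidence predictions, mirroring (in exaggerated form) the paper's reported rise from 17.1% to 26%.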

Demerits

Generalization Constraint

Although LongNAP generalizes to held-out users, the space of possible next actions is unbounded, spanning thousands of outcomes, which limits prediction precision: only 17.1% of predicted trajectories reach an LLM-judge score of at least 0.5, leaving substantial room for improvement.

Expert Commentary

This work represents a pivotal step in the evolution of proactive AI, particularly in its integration of multimodal context and in-context learning. The authors address a critical gap in AI anticipatory capabilities by grounding predictions in rich interaction histories rather than sparse prompts. The use of policy gradient methods to optimize trace generation, retrieval, and application is particularly innovative, aligning with recent advances in reinforcement learning for dynamic user modeling. However, prediction accuracy remains limited (26% alignment even among high-confidence predictions), suggesting that the field must advance beyond probabilistic heuristics to achieve true anticipatory reliability. Moreover, the ethical dimensions of inferring user intent without explicit consent warrant deeper exploration, especially as these systems scale. This paper sets a new benchmark for next-action prediction but also underscores the necessity of ongoing evaluation of both technical efficacy and societal impact.

Recommendations

  • Invest in hybrid models that combine behavioral analytics with user consent frameworks to balance prediction accuracy with ethical oversight.
  • Encourage benchmarking of predictive systems using diverse user demographics and interaction modalities to ensure generalizability and fairness.
