Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use
arXiv:2602.20426v1 Announce Type: new Abstract: The performance of LLM-based agents depends not only on the agent itself but also on the quality of the tool interfaces it consumes. While prior work has focused heavily on agent fine-tuning, tool interfaces, including natural...
PreScience: A Benchmark for Forecasting Scientific Contributions
arXiv:2602.20459v1 Announce Type: new Abstract: Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate...
KairosVL: Orchestrating Time Series and Semantics for Unified Reasoning
arXiv:2602.20494v1 Announce Type: new Abstract: Driven by the increasingly complex and decision-oriented demands of time series analysis, we introduce the Semantic-Conditional Time Series Reasoning task, which extends conventional time series analysis beyond purely numerical modeling to incorporate contextual and semantic...
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
arXiv:2602.20517v1 Announce Type: new Abstract: Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training...
From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production
arXiv:2602.20558v1 Announce Type: new Abstract: Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs into effective natural language inputs. Existing methods rely on rigid templates...
CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
arXiv:2602.20571v1 Announce Type: new Abstract: Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification, formulating a valid...
Physics-based phenomenological characterization of cross-modal bias in multimodal models
arXiv:2602.20624v1 Announce Type: new Abstract: The term 'algorithmic fairness' is used to evaluate whether AI models operate fairly in both comparative (where fairness is understood as formal equality, such as "treat like cases as like") and non-comparative (where unfairness arises...
When can we trust untrusted monitoring? A safety case sketch across collusion strategies
arXiv:2602.20628v1 Announce Type: new Abstract: AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastrophic harm. Untrusted monitoring -- using one untrusted model to oversee another...
Identifying two piecewise linear additive value functions from anonymous preference information
arXiv:2602.20638v1 Announce Type: new Abstract: Eliciting a preference model involves asking a person, named decision-maker, a series of questions. We assume that these preferences can be represented by an additive value function. In this work, we query simultaneously two decision-makers...
How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective
arXiv:2602.20687v1 Announce Type: new Abstract: Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ...
Online Algorithms with Unreliable Guidance
arXiv:2602.20706v1 Announce Type: new Abstract: This paper introduces a new model for ML-augmented online decision making, called online algorithms with unreliable guidance (OAG). This model completely separates the predictive and algorithmic components, thus offering a single well-defined analysis framework...
Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
arXiv:2602.20722v1 Announce Type: new Abstract: Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch...
Modality-Guided Mixture of Graph Experts with Entropy-Triggered Routing for Multimodal Recommendation
arXiv:2602.20723v1 Announce Type: new Abstract: Multimodal recommendation enhances ranking by integrating user-item interactions with item content, which is particularly effective under sparse feedback and long-tail distributions. However, multimodal signals are inherently heterogeneous and can conflict in specific contexts, making effective...
Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback
arXiv:2602.20728v1 Announce Type: new Abstract: Reward design has been one of the central challenges for real world reinforcement learning (RL) deployment, especially in settings with multiple objectives. Preference-based RL offers an appealing alternative by learning from human preferences over pairs...
PyVision-RL: Forging Open Agentic Vision Models via RL
arXiv:2602.20739v1 Announce Type: new Abstract: Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for...
POMDPPlanners: Open-Source Package for POMDP Planning
arXiv:2602.20810v1 Announce Type: new Abstract: We present POMDPPlanners, an open-source Python package for empirical evaluation of Partially Observable Markov Decision Process (POMDP) planning algorithms. The package integrates state-of-the-art planning algorithms, a suite of benchmark environments with safety-critical variants, automated hyperparameter...
Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
arXiv:2602.20813v1 Announce Type: new Abstract: Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios...
Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
arXiv:2602.20878v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear...
Predicting Sentence Acceptability Judgments in Multimodal Contexts
arXiv:2602.20918v1 Announce Type: new Abstract: Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context, and in document contexts. We consider the effect of prior exposure to...
Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence
arXiv:2602.20934v1 Announce Type: new Abstract: The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. While current research primarily focuses on scaling context windows or optimizing prompt engineering, the theoretical bridge...
LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
arXiv:2602.21044v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse...
Tool Building as a Path to "Superintelligence"
arXiv:2602.21061v1 Announce Type: new Abstract: The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $\gamma$. In this work, we design a benchmark to measure $\gamma$ on logical out-of-distribution inference. We construct a...
The Initial Exploration Problem in Knowledge Graph Exploration
arXiv:2602.21066v1 Announce Type: new Abstract: Knowledge Graphs (KGs) enable the integration and representation of complex information across domains, but their semantic richness and structural complexity create substantial barriers for lay users without expertise in semantic web technologies. When encountering an...
CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning
arXiv:2602.21154v1 Announce Type: new Abstract: Accurate interpretation of electrocardiogram (ECG) signals is crucial for diagnosing cardiovascular diseases. Recent multimodal approaches that integrate ECGs with accompanying clinical reports show strong potential, but they still face two main concerns from a modality...
Interpretable Medical Image Classification using Prototype Learning and Privileged Information
arXiv:2310.15741v1 Announce Type: cross Abstract: Interpretability is often an essential requirement in medical imaging. Advanced deep learning methods are required to address this need for explainability and high performance. In this work, we investigate whether additional information available during the...
ConceptRM: The Quest to Mitigate Alert Fatigue through Consensus-Based Purity-Driven Data Cleaning for Reflection Modelling
arXiv:2602.20166v1 Announce Type: cross Abstract: In many applications involving intelligent agents, the overwhelming volume of alerts (mostly false) generated by the agents may desensitize users and cause them to overlook critical issues, leading to the so-called "alert fatigue". A common...
Benchmarking Early Deterioration Prediction Across Hospital-Rich and MCI-Like Emergency Triage Under Constrained Sensing
arXiv:2602.20168v1 Announce Type: cross Abstract: Emergency triage decisions are made under severe information constraints, yet most data-driven deterioration models are evaluated using signals unavailable during initial assessment. We present a leakage-aware benchmarking framework for early deterioration prediction that evaluates model...
InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation
arXiv:2602.20294v1 Announce Type: new Abstract: Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what...
What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance
arXiv:2602.20300v1 Announce Type: new Abstract: Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. Drawing on classical linguistics, we argue that a query's form can also shape a listener's (and model's) response....
No One Size Fits All: QueryBandits for Hallucination Mitigation
arXiv:2602.20332v1 Announce Type: new Abstract: Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations...