Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance
arXiv:2602.24110v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision suffers from a critical limitation that penalizes trajectories that...
Higress-RAG: A Holistic Optimization Framework for Enterprise Retrieval-Augmented Generation via Dual Hybrid Retrieval, Adaptive Routing, and CRAG
arXiv:2602.23374v1 Announce Type: cross Abstract: The integration of Large Language Models (LLMs) into enterprise knowledge management systems has been catalyzed by the Retrieval-Augmented Generation (RAG) paradigm, which augments parametric memory with non-parametric external data. However, the transition from proof-of-concept to...
Now You See Me: Designing Responsible AI Dashboards for Early-Stage Health Innovation
arXiv:2602.23378v1 Announce Type: cross Abstract: Innovative HealthTech teams develop Artificial Intelligence (AI) systems in contexts where ethical expectations and organizational priorities must be balanced under severe resource constraints. While Responsible AI practices are expected to guide the design and evaluation...
Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages
arXiv:2602.23388v1 Announce Type: cross Abstract: The rising demand for inclusive speech technologies amplifies the need for multilingual datasets for Natural Language Processing (NLP) research. However, limited awareness of existing task-specific resources in low-resource languages hinders research. This challenge is especially...
Multi-Sourced, Multi-Agent Evidence Retrieval for Fact-Checking
arXiv:2603.00267v1 Announce Type: new Abstract: Misinformation spreading over the Internet poses a significant threat to both societies and individuals, necessitating robust and scalable fact-checking that relies on retrieving accurate and trustworthy evidence. Previous methods rely on semantic and social-contextual patterns...
TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?
arXiv:2603.00285v1 Announce Type: new Abstract: Evaluating AI agents in finance faces two key challenges: static benchmarks require costly expert annotation yet miss the dynamic decision-making central to real-world trading, while LLM-based judges introduce uncontrolled variance on domain-specific tasks. We introduce...
DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths
arXiv:2603.00309v1 Announce Type: new Abstract: The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems utilize predefined workflows or agent roles...
EmCoop: A Framework and Benchmark for Embodied Cooperation Among LLM Agents
arXiv:2603.00349v1 Announce Type: new Abstract: Real-world scenarios increasingly require multiple embodied agents to collaborate in dynamic environments under embodied constraints, as many tasks exceed the capabilities of any single agent. Recent advances in large language models (LLMs) enable high-level cognitive...
NeuroHex: Highly-Efficient Hex Coordinate System for Creating World Models to Enable Adaptive AI
arXiv:2603.00376v1 Announce Type: new Abstract: NeuroHex is a hexagonal coordinate system designed to support highly efficient world models and reference frames for online adaptive AI systems. Inspired by the hexadirectional firing structure of grid cells in the human brain, NeuroHex...
AI Runtime Infrastructure
arXiv:2603.00495v1 Announce Type: new Abstract: We introduce AI Runtime Infrastructure, a distinct execution-time layer that operates above the model and below the application, actively observing, reasoning over, and intervening in agent behavior to optimize task success, latency, token efficiency, reliability,...
DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows
arXiv:2603.00532v1 Announce Type: new Abstract: Autonomous agents are increasingly entrusted with complex, long-horizon tasks, ranging from mathematical reasoning to software generation. While agentic workflows facilitate these tasks by decomposing them into multi-step reasoning chains, reliability degrades significantly as the sequence...
EMPA: Evaluating Persona-Aligned Empathy as a Process
arXiv:2603.00552v1 Announce Type: new Abstract: Evaluating persona-aligned empathy in LLM-based dialogue agents remains challenging. User states are latent, feedback is sparse and difficult to verify in situ, and seemingly supportive turns can still accumulate into trajectories that drift from persona-specific...
Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
arXiv:2603.00578v1 Announce Type: new Abstract: Long chain-of-thought (CoT) has become a dominant paradigm for enhancing the reasoning capability of large reasoning models (LRMs); however, the performance gains often come with a substantial increase in reasoning budget. Recent studies show that existing CoT...
Heterophily-Agnostic Hypergraph Neural Networks with Riemannian Local Exchanger
arXiv:2603.00599v1 Announce Type: new Abstract: Hypergraphs are the natural description of higher-order interactions among objects, widely applied in social network analysis, cross-modal retrieval, etc. Hypergraph Neural Networks (HGNNs) have become the dominant solution for learning on hypergraphs. Traditional HGNNs are...
Machine Learning Grade Prediction Using Students' Grades and Demographics
arXiv:2603.00608v1 Announce Type: new Abstract: Student repetition in secondary education imposes significant resource burdens, particularly in resource-constrained contexts. Addressing this challenge, this study introduces a unified machine learning framework that simultaneously predicts pass/fail outcomes and continuous grades, a departure from...
MetaMind: General and Cognitive World Models in Multi-Agent Systems by Meta-Theory of Mind
arXiv:2603.00808v1 Announce Type: new Abstract: A major challenge for world models in multi-agent systems is to understand interdependent agent dynamics, predict interactive multi-agent trajectories, and plan over long horizons with collective awareness, without centralized supervision or explicit communication. In this...
MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
arXiv:2603.00873v1 Announce Type: new Abstract: With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly...
HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents
arXiv:2603.00977v1 Announce Type: new Abstract: Large language model (LLM) agents have recently demonstrated strong capabilities in interactive decision-making, yet they remain fundamentally limited in long-horizon tasks that require structured planning and reliable execution. Existing approaches predominantly rely on flat autoregressive...
CollabEval: Enhancing LLM-as-a-Judge via Multi-Agent Collaboration
arXiv:2603.00993v1 Announce Type: new Abstract: Large Language Models (LLMs) have revolutionized AI-generated content evaluation, with the LLM-as-a-Judge paradigm becoming increasingly popular. However, current single-LLM evaluation approaches face significant challenges, including inconsistent judgments and inherent biases from pre-training data. To address...
DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
arXiv:2603.01106v1 Announce Type: new Abstract: Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it...
TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation
arXiv:2603.00025v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) is an offline post-SFT method for aligning language models from preference pairs, with strong results in instruction following and summarization. However, DPO's sequence-level implicit reward can be brittle for token-critical structured prediction...
Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models
arXiv:2603.00029v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit highly anisotropic internal representations, often characterized by massive activations, a phenomenon where a small subset of feature dimensions possesses magnitudes significantly larger than the rest. While prior works view these...
GRIP: Geometric Refinement and Adaptive Information Potential for Data Efficiency
arXiv:2603.00031v1 Announce Type: new Abstract: The performance of Large Language Models (LLMs) is increasingly governed by data efficiency rather than raw scaling volume. However, existing selection methods often decouple global distribution balancing from local instance selection, compromising the hierarchical integrity...
Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
arXiv:2603.00296v1 Announce Type: new Abstract: Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length...
Federated Inference: Toward Privacy-Preserving Collaborative and Incentivized Model Serving
arXiv:2603.02214v1 Announce Type: new Abstract: Federated Inference (FI) studies how independently trained and privately owned models can collaborate at inference time without sharing data or model parameters. While recent work has explored secure and distributed inference from disparate perspectives, a...
Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
arXiv:2603.02239v1 Announce Type: new Abstract: The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. This dataset spans nine engineering fields (namely: civil, mechanical, electrical, chemical,...
SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning
arXiv:2603.02240v1 Announce Type: new Abstract: We present SuperLocalMemory, a local-first memory system for multi-agent AI that defends against OWASP ASI06 memory poisoning through architectural isolation and Bayesian trust scoring, while personalizing retrieval through adaptive learning-to-rank -- all without cloud dependencies...
A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities
arXiv:2603.02540v1 Announce Type: new Abstract: Large language models (LLMs) exhibit a unified "general factor" of capability across 10 benchmarks, a finding confirmed by our factor analysis of 156 models, yet they still struggle with simple tasks that are trivial for humans. This...
AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows
arXiv:2603.02601v1 Announce Type: new Abstract: Autonomous AI agents are deployed at unprecedented scale, yet no principled methodology exists for verifying that an agent has not regressed after changes to its prompts, tools, models, or orchestration logic. We present AgentAssay, the...
See and Remember: A Multimodal Agent for Web Traversal
arXiv:2603.02626v1 Announce Type: new Abstract: Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose...