Intelligence as Trajectory-Dominant Pareto Optimization
arXiv:2602.13230v1 Announce Type: new Abstract: Despite recent advances in artificial intelligence, many systems stagnate in long-horizon adaptability even as performance optimization continues. This work argues that such limitations do not primarily arise from insufficient learning, data, or model capacity, but...
PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading
arXiv:2602.13232v1 Announce Type: new Abstract: We present PlotChain, a deterministic, generator-based benchmark for evaluating multimodal large language models (MLLMs) on engineering plot reading: recovering quantitative values from classic plots (e.g., Bode/FFT, step response, stress-strain, pump curves) rather than OCR-only extraction or...
DPBench: Large Language Models Struggle with Simultaneous Coordination
arXiv:2602.13255v1 Announce Type: new Abstract: Large language models are increasingly deployed in multi-agent systems, yet we lack benchmarks that test whether they can coordinate under resource contention. We introduce DPBench, a benchmark based on the Dining Philosophers problem that evaluates...
Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol
arXiv:2602.13320v1 Announce Type: new Abstract: As AI agents powered by large language models (LLMs) increasingly use external tools for high-stakes decisions, a critical reliability question arises: how do errors propagate across sequential tool calls? We introduce the first theoretical framework...
Hippocampus: An Efficient and Scalable Memory Module for Agentic AI
arXiv:2602.13594v1 Announce Type: new Abstract: Agentic AI systems require persistent memory to store user-specific histories beyond the limited context window of LLMs. Existing memory systems use dense vector databases or knowledge-graph traversal (or a hybrid), incurring high retrieval latency and poor storage...
Small Reward Models via Backward Inference
arXiv:2602.13551v1 Announce Type: new Abstract: Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require...
GRRM: Group Relative Reward Modeling for Machine Translation
arXiv:2602.14028v1 Announce Type: new Abstract: While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains like Machine Translation hinges on accurate intra-group ranking. We identify that standard Scalar Quality Metrics (SQM) fall...
WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics
arXiv:2602.17990v1 Announce Type: new Abstract: LLM-based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult, because metric scores are often not calibrated, and score changes do not directly communicate the severity of...
Federated Reasoning Distillation Framework with Model Learnability-Aware Data Allocation
arXiv:2602.18749v1 Announce Type: new Abstract: Data allocation plays a critical role in federated reasoning collaboration between large language models (LLMs) and small language models (SLMs). Nevertheless, existing data allocation methods fail to address an under-explored challenge in collaboration: bidirectional model learnability...
LAMMI-Pathology: A Tool-Centric Bottom-Up LVLM-Agent Framework for Molecularly Informed Medical Intelligence in Pathology
arXiv:2602.18773v1 Announce Type: new Abstract: The emergence of tool-calling-based agent systems introduces a more evidence-driven paradigm for pathology image analysis in contrast to the coarse-grained text-image diagnostic approaches. With the recent large-scale experimental adoption of spatial transcriptomics technologies, molecularly validated...
How Far Can We Go with Pixels Alone? A Pilot Study on Screen-Only Navigation in Commercial 3D ARPGs
arXiv:2602.18981v1 Announce Type: new Abstract: Modern 3D game levels rely heavily on visual guidance, yet the navigability of level layouts remains difficult to quantify. Prior work either simulates play in simplified environments or analyzes static screenshots for visual affordances, but...
Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians
arXiv:2602.19141v1 Announce Type: new Abstract: "AI psychosis" or "delusional spiraling" is an emerging phenomenon where AI chatbot users find themselves dangerously confident in outlandish beliefs after extended chatbot conversations. This phenomenon is typically attributed to AI chatbots' well-documented bias towards...
DoAtlas-1: A Causal Compilation Paradigm for Clinical AI
arXiv:2602.19158v1 Announce Type: new Abstract: Medical foundation models generate narrative explanations but cannot quantify intervention effects, detect evidence conflicts, or validate literature claims, limiting clinical auditability. We propose causal compilation, a paradigm that transforms medical evidence from narrative text into...
Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem
arXiv:2602.18734v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has demonstrated strong effectiveness in knowledge-intensive tasks by grounding language generation in external evidence. Despite its success, many existing RAG systems are built on a ranking-centric, asymmetric dependency paradigm, where the...
BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models
arXiv:2602.18788v1 Announce Type: new Abstract: We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies,...
Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
arXiv:2602.19008v1 Announce Type: new Abstract: Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every...
Construct, Merge, Solve & Adapt with Reinforcement Learning for the min-max Multiple Traveling Salesman Problem
arXiv:2602.23579v1 Announce Type: new Abstract: The Multiple Traveling Salesman Problem (mTSP) extends the Traveling Salesman Problem to m tours that start and end at a common depot and jointly visit all customers exactly once. In the min-max variant, the objective...
SleepLM: Natural-Language Intelligence for Human Sleep
arXiv:2602.23605v1 Announce Type: new Abstract: We present SleepLM, a family of sleep-language foundation models that enable human sleep alignment, interpretation, and interaction with natural language. Despite the critical role of sleep, learning-based sleep analysis systems operate in closed label spaces...
MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs
arXiv:2602.23632v1 Announce Type: new Abstract: Synthesizing high-quality training data is crucial for enhancing domain models' reasoning abilities. Existing methods face limitations in long-tail knowledge coverage, effectiveness verification, and interpretability. Knowledge-graph-based approaches still fall short in functionality, granularity, customizability, and evaluation....
Reasoning-Driven Multimodal LLM for Domain Generalization
arXiv:2602.23777v1 Announce Type: new Abstract: This paper addresses the domain generalization (DG) problem in deep learning. While most DG methods focus on enforcing visual feature invariance, we leverage the reasoning capability of multimodal large language models (MLLMs) and explore the...
A Minimal Agent for Automated Theorem Proving
arXiv:2602.24273v1 Announce Type: new Abstract: We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We...
Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents
arXiv:2602.23370v1 Announce Type: cross Abstract: Long-document topic segmentation plays an important role in information retrieval and document understanding, yet existing methods still show clear shortcomings in ultra-long text settings. Traditional discriminative models are constrained by fixed windows and cannot model...
Democratizing GraphRAG: Linear, CPU-Only Graph Retrieval for Multi-Hop QA
arXiv:2602.23372v1 Announce Type: cross Abstract: GraphRAG systems improve multi-hop retrieval by modeling structure, but many approaches rely on expensive LLM-based graph construction and GPU-heavy inference. We present SPRIG (Seeded Propagation for Retrieval In Graphs), a CPU-only, linear-time, token-free GraphRAG pipeline...
TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?
arXiv:2603.00285v1 Announce Type: new Abstract: Evaluating AI agents in finance faces two key challenges: static benchmarks require costly expert annotation yet miss the dynamic decision-making central to real-world trading, while LLM-based judges introduce uncontrolled variance on domain-specific tasks. We introduce...
Optimizing In-Context Demonstrations for LLM-based Automated Grading
arXiv:2603.00465v1 Announce Type: new Abstract: Automated assessment of open-ended student responses is a critical capability for scaling personalized feedback in education. While large language models (LLMs) have shown promise in grading tasks via in-context learning (ICL), their reliability is heavily...
Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation
arXiv:2603.00546v1 Announce Type: new Abstract: Using Multimodal Large Language Models (MLLMs) as judges to achieve precise and consistent evaluations has become an emerging paradigm across various domains. Evaluating the capability and reliability of MLLM-as-a-judge systems is therefore essential for...
InfoPO: Information-Driven Policy Optimization for User-Centric Agents
arXiv:2603.00656v1 Announce Type: new Abstract: Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to...
DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
arXiv:2603.01106v1 Announce Type: new Abstract: Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it...
DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent
arXiv:2603.01152v1 Announce Type: new Abstract: Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty,...