A Geometric Taxonomy of Hallucinations in LLMs
arXiv:2602.13224v1 Announce Type: new Abstract: The term "hallucination" in large language models conflates distinct phenomena with different geometric signatures in embedding space. We propose a taxonomy identifying three types: unfaithfulness (failure to engage with provided context), confabulation (invention of semantically...
Intelligence as Trajectory-Dominant Pareto Optimization
arXiv:2602.13230v1 Announce Type: new Abstract: Despite recent advances in artificial intelligence, many systems stagnate in long-horizon adaptability even as performance optimization continues. This work argues that such limitations do not primarily arise from insufficient learning, data, or model capacity, but...
PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading
arXiv:2602.13232v1 Announce Type: new Abstract: We present PlotChain, a deterministic, generator-based benchmark for evaluating multimodal large language models (MLLMs) on engineering plot reading: recovering quantitative values from classic plots (e.g., Bode/FFT, step response, stress-strain, pump curves) rather than OCR-only extraction or...
Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents
arXiv:2602.13234v1 Announce Type: new Abstract: LLM-based role-playing has rapidly improved in fidelity, yet stronger adherence to persona constraints commonly increases vulnerability to jailbreak attacks, especially for risky or negative personas. Most prior work mitigates this issue with training-time solutions (e.g.,...
NL2LOGIC: AST-Guided Translation of Natural Language into First-Order Logic with Large Language Models
arXiv:2602.13237v1 Announce Type: new Abstract: Automated reasoning is critical in domains such as law and governance, where verifying claims against facts in documents requires both accuracy and interpretability. Recent work adopts structured reasoning pipelines that translate natural language into first-order...
Accuracy Standards for AI at Work vs. Personal Life: Evidence from an Online Survey
arXiv:2602.13283v1 Announce Type: new Abstract: We study how people trade off accuracy when using AI-powered tools in professional versus personal contexts for adoption purposes, the determinants of those trade-offs, and how users cope when AI/apps are unavailable. Because modern AI...
DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing
arXiv:2602.13318v1 Announce Type: new Abstract: Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout-aware rendering, and robust multi-turn instruction following. However, existing benchmarks and evaluation protocols do...
Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol
arXiv:2602.13320v1 Announce Type: new Abstract: As AI agents powered by large language models (LLMs) increasingly use external tools for high-stakes decisions, a critical reliability question arises: how do errors propagate across sequential tool calls? We introduce the first theoretical framework...
Translating Dietary Standards into Healthy Meals with Minimal Substitutions
arXiv:2602.13502v1 Announce Type: new Abstract: An important goal for personalized diet systems is to improve nutritional quality without compromising convenience or affordability. We present an end-to-end framework that converts dietary standards into complete meals with minimal change. Using the What...
REMem: Reasoning with Episodic Memory in Language Agent
arXiv:2602.13530v1 Announce Type: new Abstract: Humans excel at remembering concrete experiences along spatiotemporal contexts and performing reasoning across those events, i.e., the capacity for episodic memory. In contrast, memory in language agents remains mainly semantic, and current agents are not...
OpAgent: Operator Agent for Web Navigation
arXiv:2602.13559v1 Announce Type: new Abstract: To fulfill user instructions, autonomous web agents must contend with the inherent complexity and volatile nature of real-world websites. Conventional paradigms predominantly rely on Supervised Fine-Tuning (SFT) or Offline Reinforcement Learning (RL) using static datasets....
Hippocampus: An Efficient and Scalable Memory Module for Agentic AI
arXiv:2602.13594v1 Announce Type: new Abstract: Agentic AI requires persistent memory to store user-specific histories beyond the limited context window of LLMs. Existing memory systems use dense vector databases or knowledge-graph traversal (or a hybrid), incurring high retrieval latency and poor storage...
Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
arXiv:2602.13517v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does...
On Calibration of Large Language Models: From Response To Capability
arXiv:2602.13540v1 Announce Type: new Abstract: Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a...
Small Reward Models via Backward Inference
arXiv:2602.13551v1 Announce Type: new Abstract: Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require...
DistillLens: Symmetric Knowledge Distillation Through Logit Lens
arXiv:2602.13567v1 Announce Type: new Abstract: Standard Knowledge Distillation (KD) compresses Large Language Models (LLMs) by optimizing final outputs, yet it typically treats the thought process in the teacher's intermediate layers as a black box. While feature-based distillation attempts to bridge this gap,...
On Theoretically-Driven LLM Agents for Multi-Dimensional Discourse Analysis
arXiv:2602.13713v1 Announce Type: new Abstract: Identifying the strategic uses of reformulation in discourse remains a key challenge for computational argumentation. While LLMs can detect surface-level similarity, they often fail to capture the pragmatic functions of rephrasing, such as its role...
RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction
arXiv:2602.13748v1 Announce Type: new Abstract: Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack...
Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind
arXiv:2602.13832v1 Announce Type: new Abstract: Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users. However, they still struggle to comprehend and respond to users' true needs when...
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
arXiv:2602.13840v1 Announce Type: new Abstract: Large language model (LLM) agents are increasingly deployed in personalized tasks involving sensitive, context-dependent information, where privacy violations may arise in agents' actions due to the implicitness of contextual privacy. Existing approaches rely on external,...
Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach
arXiv:2602.13890v1 Announce Type: new Abstract: Retrieval Augmented Generation (RAG) is a powerful approach for enhancing the factual grounding of language models by integrating external knowledge. While widely studied for large language models, the optimization of RAG for Small Language Models...
The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective
arXiv:2602.14002v1 Announce Type: new Abstract: Large Language Models increasingly rely on self-explanations, such as chain-of-thought reasoning, to improve performance on multi-step question answering. While these explanations enhance accuracy, they are often verbose and costly to generate, raising...
Named Entity Recognition for Payment Data Using NLP
arXiv:2602.14009v1 Announce Type: new Abstract: Named Entity Recognition (NER) has emerged as a critical component in automating financial transaction processing, particularly in extracting structured information from unstructured payment data. This paper presents a comprehensive analysis of state-of-the-art NER algorithms specifically...
GRRM: Group Relative Reward Modeling for Machine Translation
arXiv:2602.14028v1 Announce Type: new Abstract: While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains like Machine Translation hinges on accurate intra-group ranking. We identify that standard Scalar Quality Metrics (SQM) fall...
Panini: Continual Learning in Token Space via Structured Memory
arXiv:2602.15156v1 Announce Type: new Abstract: Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user-specific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally...
da Costa and Tarski meet Goguen and Carnap: a novel approach for ontological heterogeneity based on consequence systems
arXiv:2602.15158v1 Announce Type: new Abstract: This paper presents a novel approach for ontological heterogeneity that draws heavily from Carnapian-Goguenism, as presented by Kutz, Mossakowski and Lücke (2010). The approach is provisionally designated da Costian-Tarskianism, named after da Costa's Principle of...
Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs
arXiv:2602.15173v1 Announce Type: new Abstract: The use of large language models either as decision support systems, or in agentic workflows, is rapidly transforming the digital ecosystem. However, the understanding of LLM decision-making under uncertainty remains limited. We initiate a comparative...
Predicting Invoice Dilution in Supply Chain Finance with Leakage Free Two Stage XGBoost, KAN (Kolmogorov Arnold Networks), and Ensemble Models
arXiv:2602.15248v1 Announce Type: new Abstract: Invoice or payment dilution, the gap between the approved invoice amount and the actual collection, is a significant source of non-credit risk and margin loss in supply chain finance. Traditionally, this risk is...
WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics
arXiv:2602.17990v1 Announce Type: new Abstract: LLM-based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult, because metric scores are often not calibrated, and score changes do not directly communicate the severity of...
CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models
arXiv:2602.17684v1 Announce Type: cross Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality...