BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
arXiv:2604.03216v1 Announce Type: new Abstract: Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under...
LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation
arXiv:2604.02954v1 Announce Type: new Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) enhances the reasoning capabilities of Large Language Models (LLMs) by grounding their responses in structured knowledge graphs. Leveraging community detection and relation filtering techniques, GraphRAG systems demonstrate inherent resistance to traditional...
Product-Stability: Provable Convergence for Gradient Descent on the Edge of Stability
arXiv:2604.02653v1 Announce Type: new Abstract: Empirically, modern deep learning training often occurs at the Edge of Stability (EoS), where the sharpness of the loss exceeds the threshold below which classical convergence analysis applies. Despite recent progress, existing theoretical explanations of...
Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints
arXiv:2604.02699v1 Announce Type: new Abstract: A previous study reported that E-Prime (English without the verb "to be") selectively altered reasoning in language models, with cross-model correlations suggesting a structural signature tied to which vocabulary was removed. I designed a replication...
Modeling and Controlling Deployment Reliability under Temporal Distribution Shift
arXiv:2604.02351v1 Announce Type: new Abstract: Machine learning models deployed in non-stationary environments are exposed to temporal distribution shift, which can erode predictive reliability over time. While common mitigation strategies such as periodic retraining and recalibration aim to preserve performance, they...
Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis
arXiv:2604.02359v1 Announce Type: cross Abstract: General-purpose Large Language Models (LLMs) are becoming widely adopted by people for mental health support. Yet emerging evidence suggests there are significant risks associated with high-frequency use, particularly for individuals suffering from psychosis, as LLMs...
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
arXiv:2604.02947v1 Announce Type: new Abstract: Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a...
An Initial Exploration of Contrastive Prompt Tuning to Generate Energy-Efficient Code
arXiv:2604.02352v1 Announce Type: cross Abstract: Although LLMs are capable of generating functionally correct code, they also tend to produce less energy-efficient code in comparison to human-written solutions. As these inefficiencies lead to higher computational overhead, they are in direct conflict...
Querying Structured Data Through Natural Language Using Language Models
arXiv:2604.03057v1 Announce Type: new Abstract: This paper presents an open source methodology for allowing users to query structured non textual datasets through natural language Unlike Retrieval Augmented Generation RAG which struggles with numerical and highly structured information our approach trains...
Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?
arXiv:2604.03016v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool...
Skeleton-based Coherence Modeling in Narratives
arXiv:2604.02451v1 Announce Type: new Abstract: Modeling coherence in text has been a task that has excited NLP researchers since a long time. It has applications in detecting incoherent structures and helping the author fix them. There has been recent work...
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
arXiv:2604.02794v1 Announce Type: new Abstract: Charts are ubiquitous in scientific and financial literature for presenting structured data. However, chart reasoning remains challenging for multimodal large language models (MLLMs) due to the lack of high-quality training data, as well as the...
Complex-Valued GNNs for Distributed Basis-Invariant Control of Planar Systems
arXiv:2604.02615v1 Announce Type: new Abstract: Graph neural networks (GNNs) are a well-regarded tool for learned control of networked dynamical systems due to their ability to be deployed in a distributed manner. However, current distributed GNN architectures assume that all nodes...
Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control
arXiv:2604.03147v1 Announce Type: new Abstract: We present a method to identify a valence-arousal (VA) subspace within large language model representations. From 211k emotion-labeled texts, we derive emotion steering vectors, then learn VA axes as linear combinations of their top PCA...
Analytic Drift Resister for Non-Exemplar Continual Graph Learning
arXiv:2604.02633v1 Announce Type: new Abstract: Non-Exemplar Continual Graph Learning (NECGL) seeks to eliminate the privacy risks intrinsic to rehearsal-based paradigms by retaining solely class-level prototype representations rather than raw graph examples for mitigating catastrophic forgetting. However, this design choice inevitably...
DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery
arXiv:2604.02346v1 Announce Type: cross Abstract: Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery...
Interpretable Deep Reinforcement Learning for Element-level Bridge Life-cycle Optimization
arXiv:2604.02528v1 Announce Type: new Abstract: The new Specifications for the National Bridge Inventory (SNBI), in effect from 2022, emphasize the use of element-level condition states (CS) for risk-based bridge management. Instead of a general component rating, element-level condition data use...
Time-Warping Recurrent Neural Networks for Transfer Learning
arXiv:2604.02474v1 Announce Type: new Abstract: Dynamical systems describe how a physical system evolves over time. Physical processes can evolve faster or slower in different environmental conditions. We use time-warping as rescaling the time in a model of a physical system....
Too Polite to Disagree: Understanding Sycophancy Propagation in Multi-Agent Systems
arXiv:2604.02668v1 Announce Type: new Abstract: Large language models (LLMs) often exhibit sycophancy: agreement with user stance even when it conflicts with the model's opinion. While prior work has mostly studied this in single-agent settings, it remains underexplored in collaborative multi-agent...
Audio Spatially-Guided Fusion for Audio-Visual Navigation
arXiv:2604.02389v1 Announce Type: cross Abstract: Audio-visual Navigation refers to an agent utilizing visual and auditory information in complex 3D environments to accomplish target localization and path planning, thereby achieving autonomous navigation. The core challenge of this task lies in the...
Speaking of Language: Reflections on Metalanguage Research in NLP
arXiv:2604.02645v1 Announce Type: new Abstract: This work aims to shine a spotlight on the topic of metalanguage. We first define metalanguage, link it to NLP and LLMs, and then discuss our two labs' metalanguage-centered efforts. Finally, we discuss four dimensions...
Principled and Scalable Diversity-Aware Retrieval via Cardinality-Constrained Binary Quadratic Programming
arXiv:2604.02554v1 Announce Type: new Abstract: Diversity-aware retrieval is essential for Retrieval-Augmented Generation (RAG), yet existing methods lack theoretical guarantees and face scalability issues as the number of retrieved passages $k$ increases. We propose a principled formulation of diversity retrieval as...
AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models
arXiv:2604.02617v1 Announce Type: new Abstract: Scientific and Technical Intelligence (S&TI) analysis requires verifying complex technical claims across rapidly growing literature, where existing approaches fail to bridge the verification gap between surface-level accuracy and deeper methodological validity. We present AutoVerifier, an...
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
arXiv:2604.02795v1 Announce Type: new Abstract: Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and...
Pragmatics Meets Culture: Culturally-adapted Artwork Description Generation and Evaluation
arXiv:2604.02557v1 Announce Type: new Abstract: Language models are known to exhibit various forms of cultural bias in decision-making tasks, yet much less is known about their degree of cultural familiarity in open-ended text generation tasks. In this paper, we introduce...
Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration
arXiv:2604.02869v1 Announce Type: new Abstract: Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization)...
Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation
arXiv:2604.03141v1 Announce Type: new Abstract: Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a...
SEDGE: Structural Extrapolated Data Generation
arXiv:2604.02482v1 Announce Type: new Abstract: This paper proposes a framework for Structural Extrapolated Data GEneration (SEDGE) based on suitable assumptions on the underlying data generating process. We provide conditions under which data satisfying new specifications can be generated reliably, together...
Generalization Limits of Reinforcement Learning Alignment
arXiv:2604.02652v1 Announce Type: new Abstract: The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely...
Dynamical structure of vanishing gradient and overfitting in multi-layer perceptrons
arXiv:2604.02393v1 Announce Type: new Abstract: Vanishing gradient and overfitting are two of the most extensively studied problems in the literature about machine learning. However, they are frequently considered in some asymptotic setting, which obscure the underlying dynamical mechanisms responsible for...