Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
arXiv:2604.05546v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between...
CODESTRUCT: Code Agents over Structured Action Spaces
arXiv:2604.05407v1 Announce Type: new Abstract: LLM-based code agents treat repositories as unstructured text, applying edits through brittle string matching that frequently fails due to formatting drift or ambiguous patterns. We propose reframing the codebase as a structured action space where...
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
arXiv:2604.05172v1 Announce Type: new Abstract: Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified...
Multi-Drafter Speculative Decoding with Alignment Feedback
arXiv:2604.05417v1 Announce Type: new Abstract: Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller model to draft future tokens, which are then verified by the target LLM. This preserves generation quality by accepting only aligned tokens....
I can’t help rooting for tiny open source AI model maker Arcee
Arcee is a tiny 26-person U.S. startup that built a high-performing, massive, open source LLM. And it's gaining popularity with OpenClaw users.
The Higher Education Accommodation Mistake
AI Appeals Processor: A Deep Learning Approach to Automated Classification of Citizen Appeals in Government Services
arXiv:2604.03672v1 Announce Type: new Abstract: Government agencies worldwide face growing volumes of citizen appeals, with electronic submissions increasing significantly over recent years. Traditional manual processing averages 20 minutes per appeal with only 67% classification accuracy, creating significant bottlenecks in public...
The Format Tax
arXiv:2604.03616v1 Announce Type: new Abstract: Asking a large language model to respond in JSON should be a formatting choice, not a capability tax. Yet we find that structured output requirements -- JSON, XML, LaTeX, Markdown -- substantially degrade reasoning and...
Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations
arXiv:2604.03493v1 Announce Type: new Abstract: Cultural representation in Large Language Model (LLM) outputs has primarily been evaluated through the proxies of cultural diversity and factual accuracy. However, a crucial gap remains in assessing cultural alignment: the degree to which generated...
LightThinker++: From Reasoning Compression to Memory Management
arXiv:2604.03679v1 Announce Type: new Abstract: Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically...
Evolutionary Search for Automated Design of Uncertainty Quantification Methods
arXiv:2604.03473v1 Announce Type: new Abstract: Uncertainty quantification (UQ) methods for large language models are predominantly designed by hand based on domain knowledge and heuristics, limiting their scalability and generality. We apply LLM-powered evolutionary search to automatically discover unsupervised UQ methods...
Automated Analysis of Global AI Safety Initiatives: A Taxonomy-Driven LLM Approach
arXiv:2604.03533v1 Announce Type: new Abstract: We present an automated crosswalk framework that compares an AI safety policy document pair under a shared taxonomy of activities. Using the activity categories defined in Activity Map on AI Safety as fixed aspects, the...
Position: Science of AI Evaluation Requires Item-level Benchmark Data
arXiv:2604.03244v1 Announce Type: new Abstract: AI evaluations have become the primary evidence for deploying generative AI systems across high-stakes domains. However, current evaluation paradigms often exhibit systemic validity failures. These issues, ranging from unjustified design choices to misaligned metrics, remain...
I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
arXiv:2604.03904v1 Announce Type: new Abstract: Large language models (LLMs) frequently produce confident but incorrect answers, partly because common binary scoring conventions reward answering over honestly expressing uncertainty. We study whether prompt-only interventions -- explicitly announcing reward schemes for answer-versus-abstain decisions...
ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
arXiv:2604.03922v1 Announce Type: new Abstract: Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test...
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
arXiv:2604.03893v1 Announce Type: new Abstract: Breakthroughs in frontier theory often depend on the combination of concrete diagrammatic notations with rigorous logic. While multimodal large language models (MLLMs) show promise in general scientific tasks, current benchmarks often focus on local information...
Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
arXiv:2604.03950v1 Announce Type: new Abstract: Transformer-based large language models (LLMs) have demonstrated remarkable performance across a wide range of real-world tasks, but their inference cost remains prohibitively high due to the quadratic complexity of attention and the memory bandwidth limitations...
PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training
arXiv:2604.03675v1 Announce Type: new Abstract: In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core...
Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors
arXiv:2604.03631v1 Announce Type: new Abstract: On-screen learning behavior provides valuable insights into how students seek, use, and create information during learning. Analyzing on-screen behavioral engagement is essential for capturing students' cognitive and collaborative processes. The recent development of Vision Language...
Selective Forgetting for Large Reasoning Models
arXiv:2604.03571v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) generate structured chains of thought (CoTs) before producing final answers, making them especially vulnerable to knowledge leakage through intermediate reasoning steps. Yet, the memorization of sensitive information in the training data...
Structural Rigidity and the 57-Token Predictive Window: A Physical Framework for Inference-Layer Governability in Large Language Models
arXiv:2604.03524v1 Announce Type: new Abstract: Current AI safety relies on behavioral monitoring and post-training alignment, yet empirical measurement shows these approaches produce no detectable pre-commitment signal in a majority of instruction-tuned models tested. We present an energy-based governance framework connecting...
Comparative reversal learning reveals rigid adaptation in LLMs under non-stationary uncertainty
arXiv:2604.04182v1 Announce Type: new Abstract: Non-stationary environments require agents to revise previously learned action values when contingencies change. We treat large language models (LLMs) as sequential decision policies in a two-option probabilistic reversal-learning task with three latent states and switch...
Explainable Model Routing for Agentic Workflows
arXiv:2604.03527v1 Announce Type: new Abstract: Modern agentic workflows decompose complex tasks into specialized subtasks and route them to diverse models to minimize cost without sacrificing quality. However, current routing architectures focus exclusively on performance optimization, leaving underlying trade-offs between model...
When Do Hallucinations Arise? A Graph Perspective on the Evolution of Path Reuse and Path Compression
arXiv:2604.03557v1 Announce Type: new Abstract: Reasoning hallucinations in large language models (LLMs) often appear as fluent yet unsupported conclusions that violate either the given context or underlying factual knowledge. Although such failures are widely observed, the mechanisms by which decoder-only...
Beyond Retrieval: Modeling Confidence Decay and Deterministic Agentic Platforms in Generative Engine Optimization
arXiv:2604.03656v1 Announce Type: new Abstract: Generative Engine Optimization (GEO) is rapidly reshaping digital marketing paradigms in the era of Large Language Models (LLMs). However, current GEO strategies predominantly rely on Retrieval-Augmented Generation (RAG), which inherently suffers from probabilistic hallucinations and...
Neural Global Optimization via Iterative Refinement from Noisy Samples
arXiv:2604.03614v1 Announce Type: new Abstract: Global optimization of black-box functions from noisy samples is a fundamental challenge in machine learning and scientific computing. Traditional methods such as Bayesian Optimization often converge to local minima on multi-modal functions, while gradient-free methods...
Algebraic Diversity: Group-Theoretic Spectral Estimation from Single Observations
arXiv:2604.03634v1 Announce Type: new Abstract: We prove that temporal averaging over multiple observations can be replaced by algebraic group action on a single observation for second-order statistical estimation. A General Replacement Theorem establishes conditions under which a group-averaged estimator from...
Improving Feasibility via Fast Autoencoder-Based Projections
arXiv:2604.03489v1 Announce Type: new Abstract: Enforcing complex (e.g., nonconvex) operational constraints is a critical challenge in real-world learning and control systems. However, existing methods struggle to efficiently enforce general classes of constraints. To address this, we propose a novel data-driven...
Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation
arXiv:2604.03395v1 Announce Type: new Abstract: We present QIMMA, a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. Rather than aggregating existing resources as-is, QIMMA applies a multi-model assessment pipeline combining automated LLM judgment with human review...
Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
arXiv:2604.03809v1 Announce Type: new Abstract: Multi-agent LLM committees replicate the same model under different role prompts and aggregate outputs by majority vote, implicitly assuming that agents contribute complementary evidence. We embed each agent's chain-of-thought rationale and measure pairwise similarity: across...