Incremental LTLf Synthesis
arXiv:2603.01201v1 Announce Type: new Abstract: In this paper, we study incremental LTLf synthesis -- a form of reactive synthesis where the goals are given incrementally while in execution. In other words, the protagonist agent is already executing a strategy for...
How Well Does Agent Development Reflect Real-World Work?
arXiv:2603.01203v1 Announce Type: new Abstract: AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how representative these benchmarking efforts are of the labor market as a whole. In this work, we systematically...
TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation
arXiv:2603.00025v1 Announce Type: new Abstract: Direct Preference Optimization is an offline post-SFT method for aligning language models from preference pairs, with strong results in instruction following and summarization. However, DPO's sequence-level implicit reward can be brittle for token-critical structured prediction...
SimpleTool: Parallel Decoding for Real-Time LLM Function Calling
arXiv:2603.00030v1 Announce Type: new Abstract: LLM-based function calling enables intelligent agents to interact with external tools and environments, yet autoregressive decoding imposes a fundamental latency bottleneck that limits real-time applications such as embodied intelligence, game AI, and interactive avatars (e.g.,...
Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
arXiv:2603.00077v1 Announce Type: new Abstract: Rubric-based evaluation with large language models (LLMs) has become standard practice for assessing text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. We present a unified...
Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization
arXiv:2603.00086v1 Announce Type: new Abstract: Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition...
Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
arXiv:2603.00296v1 Announce Type: new Abstract: Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length...
When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
arXiv:2603.00314v1 Announce Type: new Abstract: This paper details the baseline model selection, fine-tuning process, evaluation methods, and the implications of deploying more accurate LLMs in healthcare settings. As large language models (LLMs) are increasingly employed to address diverse problems, including...
Federated Inference: Toward Privacy-Preserving Collaborative and Incentivized Model Serving
arXiv:2603.02214v1 Announce Type: new Abstract: Federated Inference (FI) studies how independently trained and privately owned models can collaborate at inference time without sharing data or model parameters. While recent work has explored secure and distributed inference from disparate perspectives, a...
Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory
arXiv:2603.02473v1 Announce Type: new Abstract: Memory-augmented LLM agents store and retrieve information from prior interactions, yet the relative importance of how memories are written versus how they are retrieved remains unclear. We introduce a diagnostic framework that analyzes how performance...
NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect
arXiv:2603.02504v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong performance on natural language tasks but remain unreliable in mathematical reasoning, frequently generating fluent yet logically inconsistent solutions. We present NeuroProlog, a neurosymbolic framework that ensures verifiable reasoning by...
LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language Model
arXiv:2603.02528v1 Announce Type: new Abstract: Accurate classification of autonomous vehicle (AV) driving behaviors is critical for safety validation, performance diagnosis, and traffic integration analysis. However, existing approaches primarily rely on numerical time-series modeling and often lack semantic abstraction, limiting interpretability...
A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities
arXiv:2603.02540v1 Announce Type: new Abstract: Large language models (LLMs) exhibit a unified "general factor" of capability across 10 benchmarks, a finding confirmed by our factor analysis of 156 models, yet they still struggle with simple tasks that are trivial for humans. This...
AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows
arXiv:2603.02601v1 Announce Type: new Abstract: Autonomous AI agents are deployed at unprecedented scale, yet no principled methodology exists for verifying that an agent has not regressed after changes to its prompts, tools, models, or orchestration logic. We present AgentAssay, the...
FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing
arXiv:2603.02702v1 Announce Type: new Abstract: The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series...
Rethinking Code Similarity for Automated Algorithm Design with LLMs
arXiv:2603.02787v1 Announce Type: new Abstract: The rise of Large Language Model-based Automated Algorithm Design (LLM-AAD) has transformed algorithm development by autonomously generating code implementations of expert-level algorithms. Unlike traditional expert-driven algorithm development, in the LLM-AAD paradigm, the main design principle...
Agentified Assessment of Logical Reasoning Agents
arXiv:2603.02788v1 Announce Type: new Abstract: We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks,...
LLM-based Argument Mining meets Argumentation and Description Logics: a Unified Framework for Reasoning about Debates
arXiv:2603.02858v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong performance in analyzing and generating text, yet they struggle with explicit, transparent, and verifiable reasoning over complex texts such as those containing debates. In particular, they lack structured representations...
SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training
arXiv:2603.02908v1 Announce Type: new Abstract: In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts...
ShipTraj-R1: Reinforcing Ship Trajectory Prediction in Large Language Models via Group Relative Policy Optimization
arXiv:2603.02939v1 Announce Type: new Abstract: Recent advancements in reinforcement fine-tuning have significantly improved the reasoning ability of large language models (LLMs). In particular, methods such as group relative policy optimization (GRPO) have demonstrated strong capabilities across various fields. However, applying...
Architecting Trust in Artificial Epistemic Agents
arXiv:2603.02960v1 Announce Type: new Abstract: Large language models increasingly function as epistemic agents -- entities that can 1) autonomously pursue epistemic goals and 2) actively shape our shared knowledge environment. They curate the information we receive, often supplanting traditional search-based...
OrchMAS: Orchestrated Reasoning with Multi Collaborative Heterogeneous Scientific Expert Structured Agents
arXiv:2603.03005v1 Announce Type: new Abstract: Multi-agent large language model frameworks are promising for complex multi-step reasoning, yet existing systems remain weak for scientific and knowledge-intensive domains due to static prompts and agent roles, rigid workflows, and homogeneous model...
TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
arXiv:2603.03072v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images....
Beyond Factual Correctness: Mitigating Preference-Inconsistent Explanations in Explainable Recommendation
arXiv:2603.03080v1 Announce Type: new Abstract: LLM-based explainable recommenders can produce fluent explanations that are factually correct, yet still justify items using attributes that conflict with a user's historical preferences. Such preference-inconsistent explanations yield logically valid but unconvincing reasoning and are...
Agentic AI-based Coverage Closure for Formal Verification
arXiv:2603.03147v1 Announce Type: new Abstract: Coverage closure is a critical requirement in the Integrated Chip (IC) development process and a key metric for verification sign-off. However, traditional exhaustive approaches often fail to achieve full coverage within project timelines. This study presents an...
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
arXiv:2603.03190v1 Announce Type: new Abstract: During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing...
AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework
arXiv:2603.03233v1 Announce Type: new Abstract: Large Language Models (LLMs) show potential for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent...
Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals
arXiv:2603.03242v1 Announce Type: new Abstract: Language models deployed in online communities must adapt to norms that vary across social, cultural, and domain-specific contexts. Prior alignment approaches rely on explicit preference supervision or predefined principles, which are effective for well-resourced settings...
A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences
arXiv:2603.02213v1 Announce Type: new Abstract: Symbolic sequences such as written language and genomic DNA display characteristic frequency distributions and long-range correlations extending over many symbols. In language, this takes the form of Zipf's law for word frequencies together with persistent...
RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks
arXiv:2603.02368v1 Announce Type: new Abstract: We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast...