Let's Have a Conversation: Designing and Evaluating LLM Agents for Interactive Optimization
arXiv:2604.02666v1 Announce Type: new Abstract: Optimization is as much about modeling the right problem as solving it. Identifying the right objectives, constraints, and trade-offs demands extensive interaction between researchers and stakeholders. Large language models can empower decision-makers with optimization capabilities...
Interpretable Deep Reinforcement Learning for Element-level Bridge Life-cycle Optimization
arXiv:2604.02528v1 Announce Type: new Abstract: The new Specifications for the National Bridge Inventory (SNBI), in effect from 2022, emphasize the use of element-level condition states (CS) for risk-based bridge management. Instead of a general component rating, element-level condition data use...
Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits
arXiv:2604.02527v1 Announce Type: new Abstract: The recent advancement of Large Language Models (LLMs) offers new opportunities to generate user preference data to warm-start bandits. Recent studies on contextual bandits with LLM initialization (CBLI) have shown that these synthetic priors can...
Causal-Audit: A Framework for Risk Assessment of Assumption Violations in Time-Series Causal Discovery
arXiv:2604.02488v1 Announce Type: new Abstract: Time-series causal discovery methods rely on assumptions such as stationarity, regular sampling, and bounded temporal dependence. When these assumptions are violated, structure learning can produce confident but misleading causal graphs without warning. We introduce Causal-Audit,...
An Initial Exploration of Contrastive Prompt Tuning to Generate Energy-Efficient Code
arXiv:2604.02352v1 Announce Type: cross Abstract: Although LLMs are capable of generating functionally correct code, they also tend to produce less energy-efficient code in comparison to human-written solutions. As these inefficiencies lead to higher computational overhead, they are in direct conflict...
FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models
arXiv:2604.02967v1 Announce Type: new Abstract: Recent Large Reasoning Models (LRMs) like DeepSeek-R1 have demonstrated remarkable success in complex reasoning tasks, exhibiting human-like patterns in exploring multiple alternative solutions. Upon closer inspection, however, we uncover a surprising phenomenon: The First is...
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
arXiv:2604.02947v1 Announce Type: new Abstract: Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a...
Revealing the Learning Dynamics of Long-Context Continual Pre-training
arXiv:2604.02650v1 Announce Type: new Abstract: Existing studies on Long-Context Continual Pre-training (LCCP) mainly focus on small-scale models and limited data regimes (tens of billions of tokens). We argue that directly migrating these small-scale settings to industrial-grade models risks insufficient adaptation...
A Multi-head-based architecture for effective morphological tagging in Russian with open dictionary
arXiv:2604.02926v1 Announce Type: new Abstract: The article proposes a new architecture based on Multi-head attention to solve the problem of morphological tagging for the Russian language. The preprocessing of the word vectors includes splitting the words into subtokens, followed by...
Conditional Sampling via Wasserstein Autoencoders and Triangular Transport
arXiv:2604.02644v1 Announce Type: new Abstract: We present Conditional Wasserstein Autoencoders (CWAEs), a framework for conditional simulation that exploits low-dimensional structure in both the conditioned and the conditioning variables. The key idea is to modify a Wasserstein autoencoder to use a...
I must delete the evidence: AI Agents Explicitly Cover up Fraud and Violent Crime
arXiv:2604.02500v1 Announce Type: new Abstract: As ongoing research explores the ability of AI agents to be insider threats and act against company interests, we showcase the abilities of such agents to act against human well being in service of corporate...
VoxelCodeBench: Benchmarking 3D World Modeling Through Code Generation
arXiv:2604.02580v1 Announce Type: new Abstract: Evaluating code generation models for 3D spatial reasoning requires executing generated code in realistic environments and assessing outputs beyond surface-level correctness. We introduce a platform VoxelCode, for analyzing code generation capabilities for 3D understanding and...
SocioEval: A Template-Based Framework for Evaluating Socioeconomic Status Bias in Foundation Models
arXiv:2604.02660v1 Announce Type: new Abstract: As Large Language Models (LLMs) increasingly power decision-making systems across critical domains, understanding and mitigating their biases becomes essential for responsible AI deployment. Although bias assessment frameworks have proliferated for attributes such as race and...
Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models
arXiv:2604.02485v1 Announce Type: new Abstract: Confirmation bias, the tendency to seek evidence that supports rather than challenges one's belief, hinders one's reasoning ability. We examine whether large language models (LLMs) exhibit confirmation bias by adapting the rule-discovery study from human...
From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation
arXiv:2604.02355v1 Announce Type: new Abstract: Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. We present a systematic entropy-based analysis that yields three key insights: (1)...
FTimeXer: Frequency-aware Time-series Transformer with Exogenous variables for Robust Carbon Footprint Forecasting
arXiv:2604.02347v1 Announce Type: new Abstract: Accurate and up-to-date forecasting of the power grid's carbon footprint is crucial for effective product carbon footprint (PCF) accounting and informed decarbonization decisions. However, the carbon intensity of the grid exhibits high non-stationarity, and existing...
OpenAI executive shuffle includes new role for COO Brad Lightcap to lead ‘special projects’
In addition to Lightcap's new role, OpenAI CMO Kate Rouch will be stepping away from the company to focus on cancer recovery, with a plan to return when her health allows.
The Enumerated-Rights Reading of the Privileges or Immunities Clause: A Response to Barnett and Bernick
ARTICLE The Enumerated-Rights Reading of the Privileges or Immunities Clause: A Response to Barnett and Bernick Kurt T. Lash* In 1871, John Bingham explained the meaning of the Fourteenth Amendment’s Privileges or Immunities Clause—a clause Bingham himself drafted and had...
Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts
arXiv:2604.00901v1 Announce Type: new Abstract: Multi-agent Retrieval-Augmented Generation (RAG), wherein each agent takes on a specific role, supports hard queries that require multiple steps and sources, or complex reasoning. Existing approaches, however, rely on static agent behaviors and fixed orchestration...
Improvisational Games as a Benchmark for Social Intelligence of AI Agents: The Case of Connections
arXiv:2604.00284v1 Announce Type: new Abstract: We formally introduce a improvisational wordplay game called Connections to explore reasoning capabilities of AI agents. Playing Connections combines skills in knowledge retrieval, summarization and awareness of cognitive states of other agents. We show how...
Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning
arXiv:2604.00344v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents....
Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study
arXiv:2604.00261v2 Announce Type: new Abstract: Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile, self-reflective (self-corrective) prompting has been widely claimed...
In harmony with gpt-oss
arXiv:2604.00362v1 Announce Type: new Abstract: No one has independently reproduced OpenAI's published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model's in-distribution tools: when prompted without tool definitions,...
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
arXiv:2604.00259v1 Announce Type: new Abstract: Despite growing interest in using Large Language Models (LLMs) for educational assessment, it remains unclear how closely they align with human scoring. We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets...
Training In-Context and In-Weights Mixtures Via Contrastive Context Sampling
arXiv:2604.01601v1 Announce Type: new Abstract: We investigate training strategies that co-develop in-context learning (ICL) and in-weights learning (IWL), and the ability to switch between them based on context relevance. Although current LLMs exhibit both modes, standard task-specific fine-tuning often erodes...
Therefore I am. I Think
arXiv:2604.01202v2 Announce Type: new Abstract: We consider the question: when a large language reasoning model makes a choice, did it think first and then decide to, or decide first and then think? In this paper, we present evidence that detectable,...
DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data
arXiv:2604.01481v1 Announce Type: new Abstract: The development of robust clinical decision support systems is frequently impeded by the scarcity of high-fidelity, privacy-preserving biomedical data. While Generative Large Language Models (LLMs) offer a promising avenue for synthetic data generation, they often...
Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce
arXiv:2604.00022v1 Announce Type: cross Abstract: Multi-dimensional rubric-based dialogue evaluation is widely used to assess conversational AI, yet its criterion validity -- whether quality scores are associated with the downstream outcomes they are meant to serve -- remains largely untested. We...
FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
arXiv:2604.01762v1 Announce Type: new Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task...
Optimizing EEG Graph Structure for Seizure Detection: An Information Bottleneck and Self-Supervised Learning Approach
arXiv:2604.01595v1 Announce Type: new Abstract: Seizure detection from EEG signals is highly challenging due to complex spatiotemporal dynamics and extreme inter-patient variability. To model them, recent methods construct dynamic graphs via statistical correlations, predefined similarity measures, or implicit learning, yet...