Simulating the Evolution of Alignment and Values in Machine Intelligence
arXiv:2604.05274v1 Announce Type: new Abstract: Model alignment is currently applied in a vacuum, evaluated primarily through standardised benchmark performance. The purpose of this study is to examine the effects of alignment on populations of models through time. We focus on...
Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities
arXiv:2604.05339v1 Announce Type: new Abstract: As LLMs become increasingly integrated into human society, evaluating their orientations on human values from social science has drawn growing attention. Nevertheless, it is still unclear why human values matter for LLMs, especially in LLM-based...
Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting
arXiv:2604.05540v1 Announce Type: new Abstract: Large language models (LLMs) can effectively handle outdated information through knowledge editing. However, current approaches face two key limitations: (I) Poor generalization: Most approaches rigidly inject new knowledge without ensuring that the model can use...
Turbulence-like 5/3 spectral scaling in contextual representations of language as a complex system
arXiv:2604.05536v1 Announce Type: new Abstract: Natural language is a complex system that exhibits robust statistical regularities. Here, we represent text as a trajectory in a high-dimensional embedding space generated by transformer-based language models, and quantify scale-dependent fluctuations along the token...
ETR: Entropy Trend Reward for Efficient Chain-of-Thought Reasoning
arXiv:2604.05355v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning improves large language model performance on complex tasks, but often produces excessively long and inefficient reasoning traces. Existing methods shorten CoTs using length penalties or global entropy reduction, implicitly assuming that low...
See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs
arXiv:2604.05650v1 Announce Type: new Abstract: Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid...
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
arXiv:2604.05426v1 Announce Type: new Abstract: Low-Rank Adaptation (LoRA) is now the dominant method for parameter-efficient fine-tuning of large language models, but achieving a high-quality adapter often requires systematic hyperparameter tuning because LoRA performance is highly sensitive to configuration choices. In...
From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs
arXiv:2604.05348v1 Announce Type: new Abstract: Hallucinations in medical large language models (LLMs) remain a safety-critical issue, particularly when available evidence is insufficient or conflicting. We study this problem in diabetic retinopathy (DR) decision settings and introduce RETINA-SAFE, an evidence-grounded benchmark...
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
arXiv:2604.05546v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between...
HYVE: Hybrid Views for LLM Context Engineering over Machine Data
arXiv:2604.05400v1 Announce Type: new Abstract: Machine data is central to observability and diagnosis in modern computing systems, appearing in logs, metrics, telemetry traces, and configuration snapshots. When provided to large language models (LLMs), this data typically arrives as a mixture...
MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
arXiv:2604.05091v1 Announce Type: new Abstract: We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU...
Multi-Agent Pathfinding with Non-Unit Integer Edge Costs via Enhanced Conflict-Based Search and Graph Discretization
arXiv:2604.05416v1 Announce Type: new Abstract: Multi-Agent Pathfinding (MAPF) plays a critical role in various domains. Traditional MAPF methods typically assume unit edge costs and single-timestep actions, which limit their applicability to real-world scenarios. MAPFR extends MAPF to handle non-unit costs...
RAG or Learning? Understanding the Limits of LLM Adaptation under Continuous Knowledge Drift in the Real World
arXiv:2604.05096v1 Announce Type: new Abstract: Large language models (LLMs) acquire most of their knowledge during pretraining, which ties them to a fixed snapshot of the world and makes adaptation to continuously evolving knowledge challenging. As facts, entities, and events change...
Bivariate Causal Discovery Using Rate-Distortion MDL: An Information Dimension Approach
arXiv:2604.05829v1 Announce Type: new Abstract: Approaches to bivariate causal discovery based on the minimum description length (MDL) principle approximate the (uncomputable) Kolmogorov complexity of the models in each causal direction, selecting the one with the lower total complexity. The premise...
DQA: Diagnostic Question Answering for IT Support
arXiv:2604.05350v1 Announce Type: new Abstract: Enterprise IT support interactions are fundamentally diagnostic: effective resolution requires iterative evidence gathering from ambiguous user reports to identify an underlying root cause. While retrieval-augmented generation (RAG) provides grounding through historical cases, standard multi-turn RAG...
UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning
arXiv:2604.05517v1 Announce Type: new Abstract: A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity...
EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents
arXiv:2604.05557v1 Announce Type: new Abstract: Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. This joint capability is not systematically...
Dynamic Linear Coregionalization for Realistic Synthetic Multivariate Time Series
arXiv:2604.05064v1 Announce Type: new Abstract: Synthetic data is essential for training foundation models for time series (FMTS), but most generators assume static correlations, and are typically missing realistic inter-channel dependencies. We introduce DynLMC, a Dynamic Linear Model of Coregionalization, that...
Extending Tabular Denoising Diffusion Probabilistic Models for Time-Series Data Generation
arXiv:2604.05257v1 Announce Type: new Abstract: Diffusion models are increasingly being utilised to create synthetic tabular and time series data for privacy-preserving augmentation. Tabular Denoising Diffusion Probabilistic Models (TabDDPM) generate high-quality synthetic data from heterogeneous tabular datasets but assume independence between...
Beneath the Surface: Investigating LLMs' Capabilities for Communicating with Subtext
arXiv:2604.05273v1 Announce Type: new Abstract: Human communication is fundamentally creative, and often makes use of subtext -- implied meaning that goes beyond the literal content of the text. Here, we systematically study whether language models can use subtext in communicative...
SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning
arXiv:2604.05135v1 Announce Type: new Abstract: We introduce SenseAI, a human-in-the-loop (HITL) validated financial sentiment dataset designed to capture not only model outputs but the full reasoning process behind them. Unlike existing resources, SenseAI incorporates reasoning chains, confidence scores, human correction...
From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI
arXiv:2604.05229v1 Announce Type: new Abstract: Agentic AI systems plan, use tools, maintain state, and produce multi-step trajectories with external effects. Those properties create a governance problem that differs materially from single-turn generative AI: important risks emerge dur- ing execution, not...
Cross-Machine Anomaly Detection Leveraging Pre-trained Time-series Model
arXiv:2604.05335v1 Announce Type: new Abstract: Achieving resilient and high-quality manufacturing requires reliable data-driven anomaly detection methods that are capable of addressing differences in behaviors among different individual machines which are nominally the same and are executing the same processes. To...
Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya
arXiv:2604.04937v1 Announce Type: new Abstract: Large language models produce fluent text but struggle with systematic reasoning, often hallucinating confident but unfounded claims. When Apple researchers added irrelevant context to mathematical problems, LLM performance degraded by 65% Apple Machine Learning Research,...
A Theoretical Framework for Statistical Evaluability of Generative Models
arXiv:2604.05324v1 Announce Type: new Abstract: Statistical evaluation aims to estimate the generalization performance of a model using held-out i.i.d.\ test data sampled from the ground-truth distribution. In supervised learning settings such as classification, performance metrics such as error rate are...
Learning Stable Predictors from Weak Supervision under Distribution Shift
arXiv:2604.05002v1 Announce Type: new Abstract: Learning from weak or proxy supervision is common when ground-truth labels are unavailable, yet robustness under distribution shift remains poorly understood, especially when the supervision mechanism itself changes. We formalize this as supervision drift, defined...
TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems
arXiv:2604.05364v1 Announce Type: new Abstract: We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, time-series forecasting has been evaluated solely on numerical accuracy, treating foundation models as ``black boxes.'' Unlike existing benchmarks, TFRBench...
What Makes a Good Response? An Empirical Analysis of Quality in Qualitative Interviews
arXiv:2604.05163v1 Announce Type: new Abstract: Qualitative interviews provide essential insights into human experiences when they elicit high-quality responses. While qualitative and NLP researchers have proposed various measures of interview quality, these measures lack validation that high-scoring responses actually contribute to...
Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space
arXiv:2604.04944v1 Announce Type: new Abstract: Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct...
Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs
arXiv:2604.05522v1 Announce Type: new Abstract: Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multi-modal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained...