Expert Divergence Learning for MoE-based Language Models
arXiv:2603.00054v1 Announce Type: new Abstract: The Mixture-of-Experts (MoE) architecture is a powerful technique for scaling language models, yet it often suffers from expert homogenization, where experts learn redundant functionalities, thereby limiting MoE's full potential. To address this, we introduce Expert...
M3-AD: Reflection-aware Multi-modal, Multi-category, and Multi-dimensional Benchmark and Framework for Industrial Anomaly Detection
arXiv:2603.00055v1 Announce Type: new Abstract: Although multimodal large language models (MLLMs) have advanced industrial anomaly detection toward a zero-shot paradigm, they still tend to produce high-confidence yet unreliable decisions in fine-grained and structurally complex industrial scenarios, and lack effective self-corrective...
Bridging Policy and Real-World Dynamics: LLM-Augmented Rebalancing for Shared Micromobility Systems
arXiv:2603.00176v1 Announce Type: new Abstract: Shared micromobility services such as e-scooters and bikes have become an integral part of urban transportation, yet their efficiency critically depends on effective vehicle rebalancing. Existing methods either optimize for average demand patterns or employ...
OSF: On Pre-training and Scaling of Sleep Foundation Models
arXiv:2603.00190v1 Announce Type: new Abstract: Polysomnography (PSG) provides the gold standard for sleep assessment but suffers from substantial heterogeneity across recording devices and cohorts. There have been growing efforts to build general-purpose foundation models (FMs) for sleep physiology, but lack...
Improving Full Waveform Inversion in Large Model Era
arXiv:2603.00377v1 Announce Type: new Abstract: Full Waveform Inversion (FWI) is a highly nonlinear and ill-posed problem that aims to recover subsurface velocity maps from surface-recorded seismic waveforms data. Existing data-driven FWI typically uses small models, as available datasets have limited...
Alibaba’s Qwen tech lead steps down after major AI push
Reactions rippled through Alibaba's Qwen team after tech lead Junyang Lin stepped down following a major model launch.
Multi-Agent Causal Reasoning for Suicide Ideation Detection Through Online Conversations
arXiv:2602.23577v1 Announce Type: new Abstract: Suicide remains a pressing global public health concern. While social media platforms offer opportunities for early risk detection through online conversation trees, existing approaches face two major limitations: (1) They rely on predefined rules (e.g.,...
Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis
arXiv:2602.24060v1 Announce Type: new Abstract: Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across language tasks. We test this claim through a comprehensive evaluation of 504 configurations across seven model families--including...
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
arXiv:2602.24172v1 Announce Type: new Abstract: Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by humans. Here we propose...
MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games
arXiv:2602.24188v1 Announce Type: new Abstract: We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective communication about private information. This enables an interactive scaling analysis, in which a fixed...
LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding
arXiv:2602.23881v1 Announce Type: cross Abstract: Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is significantly determined by...
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
arXiv:2602.24040v1 Announce Type: cross Abstract: Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback. Recent...
Dynamics of Learning under User Choice: Overspecialization and Peer-Model Probing
arXiv:2602.23565v1 Announce Type: new Abstract: In many economically relevant contexts where machine learning is deployed, multiple platforms obtain data from the same pool of users, each of whom selects the platform that best serves them. Prior work in this setting...
When Does Multimodal Learning Help in Healthcare? A Benchmark on EHR and Chest X-Ray Fusion
arXiv:2602.23614v1 Announce Type: new Abstract: Machine learning holds promise for advancing clinical decision support, yet it remains unclear when multimodal learning truly helps in practice, particularly under modality missingness and fairness constraints. In this work, we conduct a systematic benchmark...
Disentangled Mode-Specific Representations for Tensor Time Series via Contrastive Learning
arXiv:2602.23663v1 Announce Type: new Abstract: Multi-mode tensor time series (TTS) can be found in many domains, such as search engines and environmental monitoring systems. Learning representations of a TTS benefits various applications, but it is also challenging since the complexities...
Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parameteric Policies
arXiv:2602.23811v1 Announce Type: new Abstract: We investigate the theoretical aspects of offline reinforcement learning (RL) under general function approximation. While prior works (e.g., Xie et al., 2021) have established the theoretical foundations of learning a good policy from offline data...
InfoNCE Induces Gaussian Distribution
arXiv:2602.24012v1 Announce Type: new Abstract: Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In...
Users are ditching ChatGPT for Claude — here’s how to make the switch
Following controversies surrounding ChatGPT, many users are ditching the AI chatbot for Claude instead. Here's how to make the switch.
Anthropic’s Claude reports widespread outage
Anthropic's AI chatbot Claude experienced widespread service disruptions on Monday morning, with thousands of users reporting issues accessing the bot.
Uncovering Context Reliance in Unstructured Knowledge Editing
arXiv:2602.19043v1 Announce Type: new Abstract: Editing Large language models (LLMs) with real-world, unstructured knowledge is essential for correcting and updating their internal parametric knowledge. In this work, we revisit the fundamental next-token prediction (NTP) as a candidate paradigm for unstructured...
Implicit Intelligence -- Evaluating Agents on What Users Don't Say
arXiv:2602.20424v1 Announce Type: new Abstract: Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agentic benchmarks test explicit instruction-following but fail to evaluate whether...
PreScience: A Benchmark for Forecasting Scientific Contributions
arXiv:2602.20459v1 Announce Type: new Abstract: Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate...
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
arXiv:2602.20517v1 Announce Type: new Abstract: Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training...
From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production
arXiv:2602.20558v1 Announce Type: new Abstract: Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs into effective natural language inputs. Existing methods rely on rigid templates...
Recursive Belief Vision Language Model
arXiv:2602.20659v1 Announce Type: new Abstract: Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress,...
Counterfactual Simulation Training for Chain-of-Thought Faithfulness
arXiv:2602.20710v1 Announce Type: new Abstract: Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this...
Modality-Guided Mixture of Graph Experts with Entropy-Triggered Routing for Multimodal Recommendation
arXiv:2602.20723v1 Announce Type: new Abstract: Multimodal recommendation enhances ranking by integrating user-item interactions with item content, which is particularly effective under sparse feedback and long-tail distributions. However, multimodal signals are inherently heterogeneous and can conflict in specific contexts, making effective...
Pipeline for Verifying LLM-Generated Mathematical Solutions
arXiv:2602.20770v1 Announce Type: new Abstract: With the growing popularity of Large Reasoning Models and their results in solving mathematical problems, it becomes crucial to measure their capabilities. We introduce a pipeline for both automatic and interactive verification as a more...
POMDPPlanners: Open-Source Package for POMDP Planning
arXiv:2602.20810v1 Announce Type: new Abstract: We present POMDPPlanners, an open-source Python package for empirical evaluation of Partially Observable Markov Decision Process (POMDP) planning algorithms. The package integrates state-of-the-art planning algorithms, a suite of benchmark environments with safety-critical variants, automated hyperparameter...
Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
arXiv:2602.20813v1 Announce Type: new Abstract: Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios...