Detecting Multi-Agent Collusion Through Multi-Agent Interpretability
arXiv:2604.01151v1 Announce Type: new Abstract: As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception...
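The linear-probe technique the abstract refers to can be pictured with a minimal sketch: a logistic-regression probe fit on frozen activation vectors labeled honest versus deceptive. Everything below (dimensions, data, the planted "deception direction") is synthetic, not the paper's setup.

```python
import numpy as np

# Minimal sketch of a linear probe on (hypothetical) model activations.
# Each row is an activation vector from some layer; labels mark whether
# the agent's message was honest (0) or deceptive (1).
rng = np.random.default_rng(0)
d = 16
w_true = rng.normal(size=d)                      # planted "deception direction"
X = rng.normal(size=(200, d))
y = (X @ w_true > 0).astype(float)               # synthetic labels

# Logistic-regression probe trained by plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

acc = ((1.0 / (1.0 + np.exp(-(X @ w))) > 0.5) == y).mean()
print(round(acc, 2))
```

Because the labels are linearly separable by construction, the probe recovers the planted direction; on real activations, probe accuracy is the quantity such detection work measures.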
Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study
arXiv:2604.00261v2 Announce Type: new Abstract: Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile, self-reflective (self-corrective) prompting has been widely claimed...
Therefore I am. I Think
arXiv:2604.01202v2 Announce Type: new Abstract: We consider the question: when a large language reasoning model makes a choice, did it think first and then decide, or decide first and then think? In this paper, we present evidence that detectable,...
MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning
arXiv:2604.01694v1 Announce Type: new Abstract: Minor Component Adaptation (MiCA) is a novel parameter-efficient fine-tuning method for large language models that focuses on adapting underutilized subspaces of model representations. Unlike conventional methods such as Low-Rank Adaptation (LoRA), which target dominant subspaces,...
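The abstract's contrast between dominant and minor subspaces can be illustrated with a small linear-algebra sketch: split a weight matrix's right singular directions at rank k, and confine the fine-tuning update to the minor part. This illustrates the subspace idea only, not MiCA's actual algorithm.

```python
import numpy as np

# Hedged sketch: restrict an update to the "minor" (small-singular-value)
# subspace of a weight matrix, the opposite of LoRA-style dominant-subspace
# adaptation. Names and shapes are illustrative.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))
U, S, Vt = np.linalg.svd(W)

k = 2                                   # dominant rank
P_minor = Vt[k:].T @ Vt[k:]             # projector onto minor row-space

dW = rng.normal(size=(8, 8)) * 0.1
dW_minor = dW @ P_minor                 # update confined to minor subspace

# The constrained update leaves the dominant input directions untouched:
print(np.allclose((W + dW_minor) @ Vt[:k].T, W @ Vt[:k].T))
```

The design intuition: dominant directions carry the bulk of pretrained behavior, so writing new knowledge into minor directions interferes with it less.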
Training In-Context and In-Weights Mixtures Via Contrastive Context Sampling
arXiv:2604.01601v1 Announce Type: new Abstract: We investigate training strategies that co-develop in-context learning (ICL) and in-weights learning (IWL), and the ability to switch between them based on context relevance. Although current LLMs exhibit both modes, standard task-specific fine-tuning often erodes...
Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency
arXiv:2604.00130v1 Announce Type: new Abstract: Chain-of-Thought (CoT) prompting has significantly improved the reasoning capabilities of large language models (LLMs). However, conventional CoT often relies on unstructured, flat reasoning chains that suffer from redundancy and suboptimal performance. In this work, we...
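The difference between a flat chain and a structured one can be sketched as prompt templates; the plan-then-expand form below is one plausible instantiation, not the paper's exact scheme.

```python
# Hedged sketch contrasting a flat chain-of-thought prompt with a
# two-level hierarchical one (outline first, expand selectively).
def flat_cot(question: str) -> str:
    return f"{question}\nLet's think step by step."

def hierarchical_cot(question: str) -> str:
    return (
        f"{question}\n"
        "First, outline the 2-4 high-level steps needed to answer.\n"
        "Then, expand only the steps that require detailed reasoning.\n"
        "Finally, state the answer."
    )

prompt = hierarchical_cot("How many weekdays are in March 2025?")
print(prompt.splitlines()[1])
```

The efficiency claim in the abstract maps onto the "expand only" instruction: steps that need no detail are not elaborated, shortening the generated chain.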
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
arXiv:2604.00688v2 Announce Type: new Abstract: We present OmniVoice, a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that...
Adaptive Parallel Monte Carlo Tree Search for Efficient Test-time Compute Scaling
arXiv:2604.00510v1 Announce Type: new Abstract: Monte Carlo Tree Search (MCTS) is an effective test-time compute scaling (TTCS) method for improving the reasoning performance of large language models, but its highly variable execution time leads to severe long-tail latency in practice....
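For readers unfamiliar with MCTS, the loop whose variable runtime the abstract targets is selection, expansion, rollout, backup. A compact sketch on a toy problem (choose bits to maximize the number of ones in a 6-bit string) shows that loop; the paper's adaptive parallel scheduling is not reproduced.

```python
import math, random

DEPTH = 6  # toy problem: pick 6 bits, reward = fraction of ones

class Node:
    def __init__(self, bits=()):
        self.bits, self.children = bits, {}
        self.visits, self.value = 0, 0.0

def ucb(parent, child, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent.visits) / child.visits)

def search(n_sims=400, seed=0):
    random.seed(seed)
    root = Node()
    for _ in range(n_sims):
        node, path = root, [root]
        while len(node.bits) < DEPTH and len(node.children) == 2:  # selection
            node = max(node.children.values(), key=lambda ch: ucb(node, ch))
            path.append(node)
        if len(node.bits) < DEPTH:                                 # expansion
            b = random.choice([b for b in (0, 1) if b not in node.children])
            node.children[b] = Node(node.bits + (b,))
            node = node.children[b]
            path.append(node)
        bits = list(node.bits)                                     # rollout
        while len(bits) < DEPTH:
            bits.append(random.randint(0, 1))
        reward = sum(bits) / DEPTH
        for n in path:                                             # backup
            n.visits += 1
            n.value += reward
    out, node = [], root                        # most-visited path from root
    while node.children:
        b, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        out.append(b)
    return out

print(search())
```

Each simulation's cost depends on how deep selection and rollout go, which is exactly the source of the variable, long-tailed execution time the abstract describes.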
Agent psychometrics: Task-level performance prediction in agentic coding benchmarks
arXiv:2604.00594v1 Announce Type: new Abstract: As the focus in LLM-based coding shifts from static single-step code generation to multi-step agentic interaction with tools and environments, understanding which tasks will challenge agents and why becomes increasingly difficult. This is compounded by...
Detecting Complex Money Laundering Patterns with Incremental and Distributed Graph Modeling
arXiv:2604.01315v1 Announce Type: new Abstract: Money launderers exploit the limitations of existing detection approaches by deliberately concealing their financial footprints, replicating transaction patterns that monitoring systems cannot easily distinguish from legitimate activity. As a...
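One concrete pattern graph-based AML systems hunt for is a cyclic fund flow, i.e. money returning to its origin through intermediaries. A toy DFS cycle check over a transaction edge list illustrates the graph-modeling angle; the incremental and distributed machinery is not reproduced here.

```python
from collections import defaultdict

# Hedged sketch: detect a layering loop (A -> B -> C -> A) in a toy
# transaction graph via three-color depth-first search.
def has_cycle(edges):
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def dfs(u):
        color[u] = GRAY                      # on the current DFS path
        for v in graph[u]:
            if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                return True                  # back edge => cycle
        color[u] = BLACK                     # fully explored
        return False
    return any(color[u] == WHITE and dfs(u) for u in list(graph))

txns = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
print(has_cycle(txns))
```

Real systems look for weighted, time-bounded variants of such cycles (and fan-in/fan-out motifs) rather than bare reachability, but the graph formulation is the same.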
Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce
arXiv:2604.00022v1 Announce Type: cross Abstract: Multi-dimensional rubric-based dialogue evaluation is widely used to assess conversational AI, yet its criterion validity -- whether quality scores are associated with the downstream outcomes they are meant to serve -- remains largely untested. We...
DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data
arXiv:2604.01481v1 Announce Type: new Abstract: The development of robust clinical decision support systems is frequently impeded by the scarcity of high-fidelity, privacy-preserving biomedical data. While Generative Large Language Models (LLMs) offer a promising avenue for synthetic data generation, they often...
Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms
arXiv:2604.00012v1 Announce Type: cross Abstract: Despite the impressive performance of general-purpose large language models (LLMs), they often require fine-tuning or post-training to excel at specific tasks. For instance, large reasoning models (LRMs), such as the DeepSeek-R1 series, demonstrate strong reasoning...
Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry
arXiv:2604.00319v1 Announce Type: new Abstract: We develop algorithms for collaborative control of AI agents and critics in a multi-actor, multi-critic federated multi-agent system. Each AI agent and critic has access to classical machine learning or generative AI foundation models. The...
AI Company Safety Practices Fall Short of Public Commitments and Show Structural Weaknesses, as Top Performers Widen the Gap
But in a win for transparency, five leading companies participated in the scorecard's survey for the first time, providing critical new information to the public.
Data centers get ready — the Senate wants to see your power bills
Senators Josh Hawley and Elizabeth Warren want the Energy Information Administration to gather more details about how data centers use power — and how that affects the grid.
MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
arXiv:2603.23519v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, as existing medical-related benchmarks rarely stress-test the long-context memory, interference robustness, and...
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
arXiv:2603.23516v1 Announce Type: new Abstract: Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language...
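A generic sparse-memory attention step — score per-chunk summary keys, keep the top-k chunks, attend only within them — conveys why effective context can grow far beyond full attention's limits. The sketch below is this generic pattern, not the paper's MSA architecture.

```python
import numpy as np

# Hedged sketch of chunked sparse memory attention. Shapes are toy:
# 100 memory chunks of 64 states each, but only k=4 chunks are attended.
rng = np.random.default_rng(0)
d, chunk, n_chunks, k = 32, 64, 100, 4

memory = rng.normal(size=(n_chunks, chunk, d))     # past key/value states
chunk_keys = memory.mean(axis=1)                   # one summary key per chunk
q = rng.normal(size=d)                             # current query

top = np.argsort(chunk_keys @ q)[-k:]              # select k relevant chunks
selected = memory[top].reshape(-1, d)              # k*chunk candidate keys

scores = selected @ q / np.sqrt(d)                 # scaled dot-product
attn = np.exp(scores - scores.max())
attn /= attn.sum()                                 # softmax over selection
out = attn @ selected                              # attended output

print(out.shape, len(selected))
```

Per query this attends over 256 states instead of 6,400; with fixed k, the per-token cost stays constant as the memory grows, which is what makes 100M-token scaling plausible.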
Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems
arXiv:2603.23508v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) is increasingly deployed in enterprise search and document-centric assistants, where responses must be grounded in long and complex source materials. In practice, verifying that generated answers faithfully reflect retrieved documents is difficult:...
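A minimal form of the verification step the abstract describes: score each answer sentence against the retrieved passages and flag low-support sentences. Token overlap here is a toy stand-in for the entailment or attribution model a real verifier would use.

```python
# Hedged sketch of claim-level grounding checks in a RAG pipeline.
def support(sentence: str, passages: list[str]) -> float:
    toks = set(sentence.lower().split())
    return max(len(toks & set(p.lower().split())) / len(toks)
               for p in passages)

passages = ["The warranty period is 24 months from purchase."]
answer = ["The warranty period is 24 months.", "Refunds take 3 days."]
flags = [s for s in answer if support(s, passages) < 0.5]
print(flags)
```

The "real-time" challenge in the abstract amounts to making the per-sentence scorer fast enough to run inline over long documents, rather than as an offline audit.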
Internal Safety Collapse in Frontier Large Language Models
arXiv:2603.23509v1 Announce Type: new Abstract: This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content...
Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language
arXiv:2603.23529v1 Announce Type: new Abstract: Large Language Models (LLMs) consistently underperform in low-resource linguistic contexts such as Konkani. This performance deficit stems from acute training data scarcity compounded by high script diversity across Devanagari, Romi and Kannada orthographies. To...
Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages
arXiv:2603.23521v1 Announce Type: new Abstract: Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are...
Do 3D Large Language Models Really Understand 3D Spatial Relationships?
arXiv:2603.23523v1 Announce Type: new Abstract: Recent 3D Large-Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that a language model simply fine-tuned on text-only question-answer pairs can perform comparably to or even surpass these...
Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes
arXiv:2603.23507v1 Announce Type: new Abstract: While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose...
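The deletion side of a deletion-insertion process can be sketched in a few lines: the forward corruption removes tokens outright (rather than masking them), so the reverse model must decide both what to insert and where. Only the toy forward step is shown; the learned reverse process is the paper's contribution.

```python
import random

# Hedged sketch of a deletion-based forward corruption process, the
# counterpart to masking in masked diffusion language models.
def delete_corrupt(tokens: list[str], p: float, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    return [t for t in tokens if rng.random() > p]

seq = "the quick brown fox jumps".split()
print(delete_corrupt(seq, p=0.4))
```

Note the corrupted sequence is shorter than the original, unlike a masked sequence; this is what frees generation from the fixed-length, mask-slot paradigm.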
Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks
arXiv:2603.23646v1 Announce Type: new Abstract: While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on...
IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge
arXiv:2603.23750v1 Announce Type: new Abstract: Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions),...
Self-Distillation for Multi-Token Prediction
arXiv:2603.23911v1 Announce Type: new Abstract: As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) can accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges:...
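The basic MTP mechanic can be sketched with one extra head per future offset, all reading the same hidden state; a verifier pass then accepts the longest matching prefix of the drafted tokens. Shapes and names below are illustrative, and the paper's self-distillation scheme is not reproduced.

```python
import numpy as np

# Hedged sketch of multi-token prediction: n_heads output heads share
# one hidden state and draft tokens at offsets t+1, t+2, t+3 in a
# single forward pass (e.g. for speculative decoding).
rng = np.random.default_rng(0)
d, vocab, n_heads = 64, 1000, 3

h = rng.normal(size=d)                        # hidden state at position t
heads = rng.normal(size=(n_heads, vocab, d))  # one linear head per offset

draft = [int(np.argmax(W @ h)) for W in heads]   # draft tokens t+1..t+3
print(draft)
```

Acceleration comes from amortization: one forward pass proposes several tokens, and a cheaper verification step keeps only the prefix the base model agrees with.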
OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models
arXiv:2603.23938v1 Announce Type: new Abstract: Most testbeds for omni-modal models assess multimodal understanding via textual outputs, leaving it unclear whether these models can properly speak their answers. To study this, we introduce OmniACBench, a benchmark for evaluating context-grounded acoustic control...
Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction
arXiv:2603.23550v1 Announce Type: new Abstract: Multi-turn human-AI collaboration is fundamental to deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions via reinforcement learning is hindered by the sparsity of verifiable intermediate rewards and...