CREATE: Testing LLMs for Associative Creativity
arXiv:2603.09970v1 Announce Type: new Abstract: A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models...
Self-hosted Lecture-to-Quiz: Local LLM MCQ Generation with Deterministic Quality Control
arXiv:2603.08729v1 Announce Type: cross Abstract: We present an end-to-end self-hosted (API-free) pipeline, where API-free means that lecture content is not sent to any external LLM service, that converts lecture PDFs into multiple-choice questions (MCQs) using a local LLM plus deterministic...
Fish Audio S2 Technical Report
arXiv:2603.08823v1 Announce Type: cross Abstract: We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data...
PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration
arXiv:2603.08935v1 Announce Type: cross Abstract: Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible. Although institutions are rapidly digitizing pathology workflows, storing data without effective...
VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
arXiv:2603.08936v1 Announce Type: cross Abstract: Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally,...
Hindsight Credit Assignment for Long-Horizon LLM Agents
arXiv:2603.08754v1 Announce Type: new Abstract: Large Language Model (LLM) agents often face significant credit assignment challenges in long-horizon, multi-step tasks due to sparse rewards. Existing value-free methods, such as Group Relative Policy Optimization (GRPO), encounter two fundamental bottlenecks: inaccurate step-level...
SoftJAX & SoftTorch: Empowering Automatic Differentiation Libraries with Informative Gradients
arXiv:2603.08824v1 Announce Type: new Abstract: Automatic differentiation (AD) frameworks such as JAX and PyTorch have enabled gradient-based optimization for a wide range of scientific fields. Yet, many "hard" primitives in these libraries such as thresholding, Boolean logic, discrete indexing, and...
Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models
arXiv:2603.08859v1 Announce Type: new Abstract: Hybrid sequence models, which combine Transformer and state-space model layers, seek to gain the expressive versatility of attention as well as the computational efficiency of state-space model layers. Despite burgeoning interest in hybrid models, we lack a basic...
When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency
arXiv:2603.09024v1 Announce Type: new Abstract: Sudden concept drift makes previously trained predictors unreliable, yet deciding when to retrain and what post-drift data size is sufficient is rarely addressed. We propose CALIPER, a detector- and model-agnostic, data-only test that estimates...
Two Teachers Better Than One: Hardware-Physics Co-Guided Distributed Scientific Machine Learning
arXiv:2603.09032v1 Announce Type: new Abstract: Scientific machine learning (SciML) is increasingly applied to in-field processing, controlling, and monitoring; however, wide-area sensing, real-time demands, and strict energy and reliability constraints make centralized SciML implementation impractical. Most SciML models assume raw data...
SCALAR: Learning and Composing Skills through LLM Guided Symbolic Planning and Deep RL Grounding
arXiv:2603.09036v1 Announce Type: new Abstract: LM-based agents excel when given high-level action APIs but struggle to ground language into low-level control. Prior work has LLMs generate skills or reward functions for RL, but these one-shot approaches lack feedback to correct...
Exclusive Self Attention
arXiv:2603.09078v1 Announce Type: new Abstract: We introduce exclusive self attention (XSA), a simple modification of self attention (SA) that improves the Transformer's sequence modeling performance. The key idea is to constrain attention to capture only information orthogonal to the token's own...
Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
arXiv:2603.09117v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers severely from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies are devoted to directly incorporating calibration objective...
Interactive 3D visualization of surface roughness predictions in additive manufacturing: A data-driven framework
arXiv:2603.09353v1 Announce Type: new Abstract: Surface roughness in Material Extrusion Additive Manufacturing varies across a part and is difficult to anticipate during process planning because it depends on both printing parameters and local surface inclination, which governs the staircase effect....
Amazon launches its healthcare AI assistant on its website and app
Health AI can answer questions, explain health records, manage prescription renewals, book appointments, and more.
Elaborating a Human Rights-Friendly Copyright Framework for Generative AI
"Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
arXiv:2603.06816v1 Announce Type: new Abstract: The alignment problem refers to the concern of ensuring that powerful intelligences remain compatible with human preferences and values as capabilities increase. Current large language models (LLMs) show misaligned behaviors, such as strategic deception, manipulation, and reward-seeking, that...
Rethinking Personalization in Large Language Models at the Token Level
arXiv:2603.06595v1 Announce Type: new Abstract: With large language models (LLMs) now performing strongly across diverse tasks, there is growing demand for them to personalize outputs for individual users. Personalization is typically framed as an additional layer on top of a...
Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks
arXiv:2603.06942v1 Announce Type: new Abstract: Recent advances have made long-form report-generating systems widely available. This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods. Many of the meta-evaluations...
Elenchus: Generating Knowledge Bases from Prover-Skeptic Dialogues
arXiv:2603.06974v1 Announce Type: new Abstract: We present Elenchus, a dialogue system for knowledge base construction grounded in inferentialist semantics, where knowledge engineering is re-conceived as explicitation rather than extraction from expert testimony or textual content. A human expert develops a...
Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
arXiv:2603.07017v1 Announce Type: new Abstract: Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale, and slow to...
AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
arXiv:2603.07019v1 Announce Type: new Abstract: Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge. Beyond evaluation, these structured criteria can serve as signals for model alignment, reinforcement learning, and self-correction. To support these use...
Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment
arXiv:2603.07023v1 Announce Type: new Abstract: Despite the promise of Retrieval-Augmented Generation in grounding Multimodal Large Language Models with external knowledge, the transition to extensive contexts often leads to significant attention dilution and reasoning hallucinations. The surge in information density causes...
Emotion Transcription in Conversation: A Benchmark for Capturing Subtle and Complex Emotional States through Natural Language
arXiv:2603.07138v1 Announce Type: new Abstract: Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interactions. However, existing methods predominantly employ categorical or dimensional emotion annotations, which often fail to adequately represent complex, subtle, or culturally specific emotional nuances....
Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster
arXiv:2603.07238v1 Announce Type: new Abstract: Similarities between language representations derived from Self-Supervised Speech Models (S3Ms) have been observed to primarily reflect geographic proximity or surface typological similarities driven by recent expansion or contact, potentially missing deeper genealogical signals. We investigate...
Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios
arXiv:2603.07372v1 Announce Type: new Abstract: Quality Estimation (QE) is essential for assessing machine translation quality in reference-less settings, particularly for domain-specific and low-resource language scenarios. In this paper, we investigate sentence-level QE for English to Indic machine translation across four...
Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams
arXiv:2603.07392v1 Announce Type: new Abstract: LLMs operating in dynamic real-world contexts often encounter knowledge that evolves continuously or emerges incrementally. To remain accurate and effective, models must adapt to newly arriving information on the fly. We introduce Online Adaptation to...
The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling
arXiv:2603.07461v1 Announce Type: new Abstract: Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We introduce the Dual-Stream Transformer, which decomposes the residual stream into two functionally distinct components: a token stream updated...
Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data
arXiv:2603.07534v1 Announce Type: new Abstract: Accent is an integral part of society, reflecting multiculturalism and shaping how individuals express identity. The majority of English speakers are non-native (L2) speakers, yet current Text-To-Speech (TTS) systems primarily model American-accented English due to limited...