Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment
arXiv:2603.07023v1 Announce Type: new Abstract: Despite the promise of Retrieval-Augmented Generation in grounding Multimodal Large Language Models with external knowledge, the transition to extensive contexts often leads to significant attention dilution and reasoning hallucinations. The surge in information density causes...
Emotion Transcription in Conversation: A Benchmark for Capturing Subtle and Complex Emotional States through Natural Language
arXiv:2603.07138v1 Announce Type: new Abstract: Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interactions. However, existing methods predominantly employ categorical or dimensional emotion annotations, which often fail to adequately represent complex, subtle, or culturally specific emotional nuances....
Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin
arXiv:2603.07286v1 Announce Type: new Abstract: Global safety models exhibit strong performance across widely used benchmarks, yet their training data rarely captures the cultural and linguistic nuances of Taiwanese Mandarin. This limitation results in systematic blind spots when interpreting region-specific risks...
How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
arXiv:2603.07346v1 Announce Type: new Abstract: Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study we designed a methodological framework to assess the impact of denoising. More specifically, we explored a...
Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios
arXiv:2603.07372v1 Announce Type: new Abstract: Quality Estimation (QE) is essential for assessing machine translation quality in reference-less settings, particularly for domain-specific and low-resource language scenarios. In this paper, we investigate sentence-level QE for English to Indic machine translation across four...
The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling
arXiv:2603.07461v1 Announce Type: new Abstract: Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We introduce the Dual-Stream Transformer, which decomposes the residual stream into two functionally distinct components: a token stream updated...
TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning
arXiv:2603.07528v1 Announce Type: new Abstract: Table reasoning requires models to jointly perform semantic understanding and precise numerical operations. Most existing methods rely on a single-turn reasoning paradigm over tables which suffers from context overflow and weak numerical sensitivity. To address...
MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs
arXiv:2603.07539v1 Announce Type: new Abstract: Islamic inheritance law ('ilm al-mawarith) is challenging for large language models because solving inheritance cases requires complex, structured multi-step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce MAWARITH, a...
StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control
arXiv:2603.07599v1 Announce Type: new Abstract: Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. For a more realistic interactive experience with customized styles, current SLMs have managed to interpret and...
Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation
arXiv:2603.07825v1 Announce Type: new Abstract: The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance....
An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data
arXiv:2603.07841v1 Announce Type: new Abstract: Recent advances in large language models have strengthened Text2SQL systems that translate natural language questions into database queries. A persistent deployment challenge is to assess a newly trained Text2SQL system on an unseen and unlabeled...
CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training
arXiv:2603.06610v1 Announce Type: new Abstract: Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquitous use-case of leveraging third-party pre-trained models, which...
OptiRoulette Optimizer: A New Stochastic Meta-Optimizer for up to 5.3x Faster Convergence
arXiv:2603.06613v1 Announce Type: new Abstract: This paper presents OptiRoulette, a stochastic meta-optimizer that selects update rules during training instead of fixing a single optimizer. The method combines warmup optimizer locking, random sampling from an active optimizer pool, compatibility-aware learning-rate scaling...
Reward Under Attack: Analyzing the Robustness and Hackability of Process Reward Models
arXiv:2603.06621v1 Announce Type: new Abstract: Process Reward Models (PRMs) are rapidly becoming the backbone of LLM reasoning pipelines, yet we demonstrate that state-of-the-art PRMs are systematically exploitable under adversarial optimization pressure. To address this, we introduce a three-tiered diagnostic framework...
Grouter: Decoupling Routing from Representation for Accelerated MoE Training
arXiv:2603.06626v1 Announce Type: new Abstract: Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to simultaneously train expert weights while searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads...
Leakage Safe Graph Features for Interpretable Fraud Detection in Temporal Transaction Networks
arXiv:2603.06632v1 Announce Type: new Abstract: Illicit transaction detection is often driven by transaction-level attributes; however, fraudulent behavior may also manifest through network structure such as central hubs, high-flow intermediaries, and coordinated neighborhoods. This paper presents a time-respecting,...
SmartBench: Evaluating LLMs in Smart Homes with Anomalous Device States and Behavioral Contexts
arXiv:2603.06636v1 Announce Type: new Abstract: Due to the strong context-awareness capabilities demonstrated by large language models (LLMs), recent research has begun exploring their integration into smart home assistants to help users manage and adjust their living environments. While LLMs have...
SR-TTT: Surprisal-Aware Residual Test-Time Training
arXiv:2603.06642v1 Announce Type: new Abstract: Test-Time Training (TTT) language models achieve theoretically infinite context windows with an O(1) memory footprint by replacing the standard exact-attention KV-cache with hidden state "fast weights" W_fast updated via self-supervised learning during inference. However, pure...
Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment
arXiv:2603.06727v1 Announce Type: new Abstract: Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why a model refuses a request, nor intervene when its safety judgments fail. We propose Safe...
Improved Constrained Generation by Bridging Pretrained Generative Models
arXiv:2603.06742v1 Announce Type: new Abstract: Constrained generative modeling is fundamental to applications such as robotic control and autonomous driving, where models must respect physical laws and safety-critical constraints. In real-world settings, these constraints rarely take the form of simple linear...
Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection
arXiv:2603.05623v1 Announce Type: cross Abstract: Camera-LiDAR fusion is widely used in autonomous driving to enable accurate 3D object detection. However, bird's-eye view (BEV) fusion detectors can degrade significantly under domain shift and sensor failures, limiting reliability in real-world deployment. Existing...
On the Reliability of AI Methods in Drug Discovery: Evaluation of Boltz-2 for Structure and Binding Affinity Prediction
arXiv:2603.05532v1 Announce Type: cross Abstract: Despite continuing hype about the role of AI in drug discovery, no "AI-discovered drugs" have so far received regulatory approval. Here we assess one of the latest AI-based tools in this domain. The ability...
Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent
arXiv:2603.05578v1 Announce Type: cross Abstract: Research on self-evolving language agents has accelerated, drawing increasing attention to their ability to create, adapt, and maintain tools from task requirements. However, existing benchmarks predominantly rely on predefined specifications, which limits scalability and hinders...
RACAS: Controlling Diverse Robots With a Single Agentic System
arXiv:2603.05621v1 Announce Type: cross Abstract: Many robotic platforms expose an API through which external software can command their actuators and read their sensors. However, transitioning from these low-level interfaces to high-level autonomous behaviour requires a complicated pipeline, whose components demand...
Artificial Intelligence for Climate Adaptation: Reinforcement Learning for Climate Change-Resilient Transport
arXiv:2603.06278v1 Announce Type: new Abstract: Climate change is expected to intensify rainfall and, consequently, pluvial flooding, leading to increased disruptions in urban transportation systems over the coming decades. Designing effective adaptation strategies is challenging due to the long-term, sequential nature...
Molecular Representations for AI in Chemistry and Materials Science: An NLP Perspective
arXiv:2603.05525v1 Announce Type: cross Abstract: Deep learning, a subfield of machine learning, has gained importance in various application areas in recent years. Its growing popularity has led it to enter the natural sciences as well. This has created the need...
From Toil to Thought: Designing for Strategic Exploration and Responsible AI in Systematic Literature Reviews
arXiv:2603.05514v1 Announce Type: cross Abstract: Systematic Literature Reviews (SLRs) are fundamental to scientific progress, yet the process is hindered by a fragmented tool ecosystem that imposes a high cognitive load. This friction suppresses the iterative, exploratory nature of scholarly work....
DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
arXiv:2603.05912v1 Announce Type: new Abstract: Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers...
Model Change for Description Logic Concepts
arXiv:2603.05562v1 Announce Type: cross Abstract: We consider the problem of modifying a description logic concept in light of models represented as pointed interpretations. We call this setting model change, and distinguish three main kinds of changes: eviction, which consists of...
VDCook: DIY video data cook your MLLMs
arXiv:2603.05539v1 Announce Type: cross Abstract: We introduce VDCook: a self-evolving video data operating system, a configurable video data construction platform for researchers and vertical domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio,...