Beyond Test-Time Training: Learning to Reason via Hardware-Efficient Optimal Control
arXiv:2603.09221v1 Announce Type: new Abstract: Associative memory has long underpinned the design of sequential models. Beyond recall, humans reason by projecting future states and selecting goal-directed actions, a capability that modern language models increasingly require but do not natively encode....
Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training
arXiv:2603.09253v1 Announce Type: new Abstract: We study efficient reasoning under tight compute. We ask how to make structured, correct decisions without increasing test-time cost. We add two training-only components to small and medium Transformers that also transfer to...
Transductive Generalization via Optimal Transport and Its Application to Graph Node Classification
arXiv:2603.09257v1 Announce Type: new Abstract: Many existing transductive bounds rely on classical complexity measures that are computationally intractable and often misaligned with empirical behavior. In this work, we establish new representation-based generalization bounds in a distribution-free transductive setting, where learned...
TA-GGAD: Testing-time Adaptive Graph Model for Generalist Graph Anomaly Detection
arXiv:2603.09349v1 Announce Type: new Abstract: A significant number of anomalous nodes in the real world, such as fake news, noncompliant users, malicious transactions, and malicious posts, severely compromise the health of the graph data ecosystem and urgently require effective identification...
Birthright citizenship: legal takeaways of mice and men and elephants and dogs
Brothers in Law is a recurring series by brothers Akhil and Vikram Amar, with special emphasis on measuring what the Supreme Court says against what the Constitution itself says.
AI Now Co-ED Amba Kak Gives Remarks Before the UN General Assembly on AI Governance - AI Now Institute
ChatGPT can now create interactive visuals to help you understand math and science concepts
Instead of just reading an explanation or looking at a static diagram, users can now engage directly with interactive visuals.
AgentMail raises $6M to build an email service for AI agents
AgentMail provides an API platform that lets you give AI agents their own email inboxes, with support for two-way conversations, parsing, threading, labeling, searching, and replying.
YouTube expands AI deepfake detection to politicians, government officials, and journalists
YouTube's AI deepfake detection tool is becoming available to politicians, journalists, and officials, letting them flag unauthorized likenesses for removal.
"Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
arXiv:2603.06816v1 Announce Type: new Abstract: The alignment problem refers to the concern of ensuring that powerful artificial intelligences remain compatible with human preferences and values as their capabilities increase. Current large language models (LLMs) show misaligned behaviors, such as strategic deception, manipulation, and reward-seeking, that...
A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
arXiv:2603.06594v1 Announce Type: new Abstract: Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness...
Hierarchical Latent Structures in Data Generation Process Unify Mechanistic Phenomena across Scale
arXiv:2603.06592v1 Announce Type: new Abstract: Contemporary studies have uncovered many puzzling phenomena in the neural information processing of Transformer-based language models. Building a robust, unified understanding of these phenomena requires disassembling a model within the scope of its training. While...
Reforming the Mechanism: Editing Reasoning Patterns in LLMs with Circuit Reshaping
arXiv:2603.06923v1 Announce Type: new Abstract: Large language models (LLMs) often exhibit flawed reasoning ability that undermines reliability. Existing approaches to improving reasoning typically treat it as a general and monolithic skill, applying broad training which is inefficient and unable to...
AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
arXiv:2603.07019v1 Announce Type: new Abstract: Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge. Beyond evaluation, these structured criteria can serve as signals for model alignment, reinforcement learning, and self-correction. To support these use...
Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment
arXiv:2603.07023v1 Announce Type: new Abstract: Despite the promise of Retrieval-Augmented Generation in grounding Multimodal Large Language Models with external knowledge, the transition to extensive contexts often leads to significant attention dilution and reasoning hallucinations. The surge in information density causes...
Emotion Transcription in Conversation: A Benchmark for Capturing Subtle and Complex Emotional States through Natural Language
arXiv:2603.07138v1 Announce Type: new Abstract: Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interactions. However, existing methods predominantly employ categorical or dimensional emotion annotations, which often fail to adequately represent complex, subtle, or culturally specific emotional nuances....
Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin
arXiv:2603.07286v1 Announce Type: new Abstract: Global safety models exhibit strong performance across widely used benchmarks, yet their training data rarely captures the cultural and linguistic nuances of Taiwanese Mandarin. This limitation results in systematic blind spots when interpreting region-specific risks...
How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
arXiv:2603.07346v1 Announce Type: new Abstract: Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study we designed a methodological framework to assess the impact of denoising. More specifically, we explored a...
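The abstract above studies how label noise affects language-model-based classifiers, but the paper's exact protocol is truncated here. A minimal sketch of one common way to run such an experiment, assuming a generic synthetic-noise setup (the function name and the "easy"/"hard" label set are illustrative, not from the paper): flip a controlled fraction of training labels and measure how accuracy degrades as the noise rate grows.

```python
import random

def inject_label_noise(labels, noise_rate, label_set, seed=0):
    """Randomly flip a fraction of labels to a different class.

    Simulates annotation noise so one can chart classifier accuracy
    against training-data quality.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(round(noise_rate * len(noisy)))
    for i in rng.sample(range(len(noisy)), n_flip):
        # Replace with a uniformly chosen *different* label.
        alternatives = [lab for lab in label_set if lab != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy

labels = ["easy", "hard", "easy", "easy", "hard", "hard", "easy", "hard"]
noisy = inject_label_noise(labels, noise_rate=0.25, label_set=["easy", "hard"])
changed = sum(a != b for a, b in zip(labels, noisy))
print(changed)  # 2 of 8 labels flipped
```

Sweeping `noise_rate` over, say, 0.0 to 0.5 and retraining the classifier at each level gives a simple degradation curve of the kind such studies report.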
RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts
arXiv:2603.07366v1 Announce Type: new Abstract: Many errors in student essays can be explained by influence from the native language (L1). L1 interference refers to errors influenced by a speaker's first language, such as using stadion instead of stadium, reflecting lexical...
Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios
arXiv:2603.07372v1 Announce Type: new Abstract: Quality Estimation (QE) is essential for assessing machine translation quality in reference-less settings, particularly for domain-specific and low-resource language scenarios. In this paper, we investigate sentence-level QE for English to Indic machine translation across four...
The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling
arXiv:2603.07461v1 Announce Type: new Abstract: Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We introduce the Dual-Stream Transformer, which decomposes the residual stream into two functionally distinct components: a token stream updated...
Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech
arXiv:2603.07513v1 Announce Type: new Abstract: Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive...
TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning
arXiv:2603.07528v1 Announce Type: new Abstract: Table reasoning requires models to jointly perform semantic understanding and precise numerical operations. Most existing methods rely on a single-turn reasoning paradigm over tables which suffers from context overflow and weak numerical sensitivity. To address...
MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs
arXiv:2603.07539v1 Announce Type: new Abstract: Islamic inheritance law ('ilm al-mawarith) is challenging for large language models because solving inheritance cases requires complex, structured multi-step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce MAWARITH, a...
StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control
arXiv:2603.07599v1 Announce Type: new Abstract: Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. For a more realistic interactive experience with customized styles, current SLMs have managed to interpret and...
Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation
arXiv:2603.07825v1 Announce Type: new Abstract: The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance....
An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data
arXiv:2603.07841v1 Announce Type: new Abstract: Recent advances in large language models have strengthened Text2SQL systems that translate natural language questions into database queries. A persistent deployment challenge is to assess a newly trained Text2SQL system on an unseen and unlabeled...
Switchable Activation Networks
arXiv:2603.06601v1 Announce Type: new Abstract: Deep neural networks, and more recently large-scale generative models such as large language models (LLMs) and large vision-action models (LVAs), achieve remarkable performance across diverse domains, yet their prohibitive computational cost hinders deployment in resource-constrained...
Scale Dependent Data Duplication
arXiv:2603.06603v1 Announce Type: new Abstract: Data duplication during pretraining can degrade generalization and lead to memorization, motivating aggressive deduplication pipelines. However, at web scale, it is unclear what constitutes a "duplicate": beyond surface-form matches, semantically equivalent documents (e.g. translations) may...
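The abstract above distinguishes surface-form duplicates from semantically equivalent ones. As a point of reference for the surface-form side only (this is a generic sketch, not the paper's pipeline), a standard near-duplicate check compares character k-gram "shingle" sets with Jaccard similarity; semantic duplicates such as translations score near zero under it, which is exactly the gap the abstract points at.

```python
def shingles(text, k=3):
    """Set of character k-grams after whitespace normalization."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(x, y, threshold=0.8):
    """Surface-form near-duplicate test: high shingle overlap."""
    return jaccard(shingles(x), shingles(y)) >= threshold

# Trivial edit: caught as a duplicate.
same = is_near_duplicate("The cat sat on the mat.",
                         "The cat sat on the mat")
# A translation carries the same meaning but shares few shingles.
translated = is_near_duplicate("The cat sat on the mat.",
                               "Le chat est assis sur le tapis.")
print(same, translated)  # True False
```

Production pipelines typically approximate this with MinHash/LSH rather than exact set intersection, but the failure mode on translations is the same.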
Know When You're Wrong: Aligning Confidence with Correctness for LLM Error Detection
arXiv:2603.06604v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed in critical decision-making systems, the lack of reliable methods to measure their uncertainty presents a fundamental trustworthiness risk. We introduce a normalized confidence score based on output...
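The abstract above introduces a normalized confidence score based on output probabilities; the exact formulation is truncated here, so the following is a generic sketch of one common normalization, not the paper's method: the length-normalized sequence confidence, i.e. the geometric mean of per-token probabilities, exp of the mean token log-probability. Dividing by length keeps long answers comparable to short ones.

```python
import math

def normalized_confidence(token_logprobs):
    """Length-normalized sequence confidence.

    exp(mean of token log-probs) = geometric mean of per-token
    probabilities, in [0, 1]; higher means more confident.
    """
    if not token_logprobs:
        raise ValueError("empty sequence")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# A confident 3-token answer vs. a hesitant one (illustrative log-probs).
confident = normalized_confidence([-0.05, -0.10, -0.02])
hesitant = normalized_confidence([-1.2, -0.9, -2.1])
print(confident > 0.9, hesitant < 0.3)  # True True
```

For error detection, one would threshold this score (or calibrate it against held-out correctness labels) so that low confidence flags likely errors.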