SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
arXiv:2602.23286v1 Announce Type: new Abstract: Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated -...
Sustainable LLM Inference using Context-Aware Model Switching
arXiv:2602.22261v1 Announce Type: new Abstract: Large language models have become central to many AI applications, but their growing energy consumption raises serious sustainability concerns. A key limitation in current AI deployments is the reliance on a one-size-fits-all inference strategy where...
Support Tokens, Stability Margins, and a New Foundation for Robust LLMs
arXiv:2602.22271v1 Announce Type: new Abstract: Self-attention is usually described as a flexible, content-adaptive way to mix a token with information from its past. We re-interpret causal self-attention transformers, the backbone of modern foundation models, within a probabilistic framework, much like...
Training Agents to Self-Report Misbehavior
arXiv:2602.22303v1 Announce Type: new Abstract: Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment training aims to prevent such behavior by reinforcing the correct goals, but alignment may not always succeed and can lead to...
Calibrated Test-Time Guidance for Bayesian Inference
arXiv:2602.22428v1 Announce Type: new Abstract: Test-time guidance is a widely used mechanism for steering pretrained diffusion models toward outcomes specified by a reward function. Existing approaches, however, focus on maximizing reward rather than sampling from the true Bayesian posterior, leading...
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
arXiv:2602.21420v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while they improve Pass@1 accuracy through sharpened...
ECHOSAT: Estimating Canopy Height Over Space And Time
arXiv:2602.21421v1 Announce Type: cross Abstract: Forest monitoring is critical for climate change mitigation. However, existing global tree height maps provide only static snapshots and do not capture temporal forest dynamics, which are essential for accurate carbon accounting. We introduce ECHOSAT,...
Disaster Question Answering with LoRA Efficiency and Accurate End Position
arXiv:2602.21212v1 Announce Type: new Abstract: Natural disasters such as earthquakes, torrential rainfall, floods, and volcanic eruptions occur with extremely low frequency and affect limited geographic areas. When individuals face disaster situations, they often experience confusion and lack the domain-specific knowledge...
Structured Prompt Language: Declarative Context Management for LLMs
arXiv:2602.21257v1 Announce Type: new Abstract: We present SPL (Structured Prompt Language), a declarative SQL-inspired language that treats large language models as generative knowledge bases and their context windows as constrained resources. SPL provides explicit WITH BUDGET/LIMIT token management, an automatic...
Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models
arXiv:2602.21262v1 Announce Type: new Abstract: With increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, it is important to understand the risks they introduce as advisors. To be useful advisors, LLMs must sift through large amounts...
ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
arXiv:2602.21265v1 Announce Type: new Abstract: We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable...
Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
arXiv:2602.21543v1 Announce Type: new Abstract: Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in...
MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification
arXiv:2602.21608v1 Announce Type: new Abstract: Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce. Existing sentiment and sarcasm models largely focus on monolingual English or high-resource languages and struggle...
Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
arXiv:2602.21646v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text...
Sparsity Induction for Accurate Post-Training Pruning of Large Language Models
arXiv:2602.21652v1 Announce Type: new Abstract: Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency. Post-training sparsity (PTS), which reduces model cost by removing weights from dense networks, is...
Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
arXiv:2602.21720v1 Announce Type: new Abstract: Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular. Following prior work that relates cross-linguistic tendencies to biases in learning, we ask whether regular...
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
arXiv:2602.21786v1 Announce Type: new Abstract: Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework...
FewMMBench: A Benchmark for Multimodal Few-Shot Learning
arXiv:2602.21854v1 Announce Type: new Abstract: As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under...
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection
arXiv:2602.21887v1 Announce Type: new Abstract: Current large reasoning models (LRMs) have shown strong ability on challenging tasks after reinforcement learning (RL) based post-training. However, previous work mainly focuses on English reasoning in expectation of the strongest performance, despite the demonstrated...
MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents
arXiv:2602.21941v1 Announce Type: new Abstract: Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions. However, existing studies still rely on pure textual benchmarks to evaluate the text responses of MRPAs,...
Large Language Models are Algorithmically Blind
arXiv:2602.21947v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and...
Robust AI Evaluation through Maximal Lotteries
arXiv:2602.21297v1 Announce Type: new Abstract: The standard way to evaluate language models on subjective tasks is through pairwise comparisons: an annotator chooses the "better" of two responses to a prompt. Leaderboards aggregate these comparisons into a single Bradley-Terry (BT) ranking,...
Proximal-IMH: Proximal Posterior Proposals for Independent Metropolis-Hastings with Approximate Operators
arXiv:2602.21426v1 Announce Type: new Abstract: We consider the problem of sampling from a posterior distribution arising in Bayesian inverse problems in science, engineering, and imaging. Our method belongs to the family of independence Metropolis-Hastings (IMH) sampling algorithms, which are common...
WaterVIB: Learning Minimal Sufficient Watermark Representations via Variational Information Bottleneck
arXiv:2602.21508v1 Announce Type: new Abstract: Robust watermarking is critical for intellectual property protection, whereas existing methods face a severe vulnerability against regeneration-based AIGC attacks. We identify that existing methods fail because they entangle the watermark with high-frequency cover texture, which...
Muon+: Towards Better Muon via One Additional Normalization Step
arXiv:2602.21545v1 Announce Type: new Abstract: The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional...
Deep Clustering based Boundary-Decoder Net for Inter and Intra Layer Stress Prediction of Heterogeneous Integrated IC Chip
arXiv:2602.21601v1 Announce Type: new Abstract: High stress occurs when 3D heterogeneous IC packages are subjected to thermal cycling at extreme temperatures. Stress mainly occurs at the interface between different materials. We investigate stress image using latent space representation which is...
How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence
Legal Artificial Intelligence (LegalAI) focuses on applying artificial intelligence, especially natural language processing, to tasks in the legal domain. In recent years, LegalAI has rapidly drawn increasing attention from both AI researchers and legal professionals, as...
Anthropic CEO stands firm as Pentagon deadline looms
Anthropic CEO Dario Amodei said Thursday that he "cannot in good conscience accede" to the Pentagon's demands to give the military unrestricted access to its AI systems.
Read AI launches an email-based ‘digital twin’ to help you with schedules and answers
Read AI is launching Ada, which can reply with your availability and extract answers from the company knowledge base and the web.