SCOTUStoday for Monday, March 2
If you are looking for a great introduction to this morning’s argument in United States v. Hemani, please check out this animated explainer, done in partnership with Briefly. Our live […] The post SCOTUStoday for Monday, March 2 appeared first on SCOTUSblog.
From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production
arXiv:2602.20558v1 Announce Type: new Abstract: Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs into effective natural language inputs. Existing methods rely on rigid templates...
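The abstract contrasts learned verbalization with the rigid-template baselines it aims to replace. A minimal sketch of such a template baseline, with illustrative field names (`action`, `item`, `category` are our assumptions, not the paper's schema):

```python
# Hypothetical sketch of a rigid template baseline for verbalizing
# structured user interaction logs into a natural-language prompt --
# the kind of approach the abstract says existing methods rely on.

def verbalize(log: list[dict]) -> str:
    """Render structured interaction events as natural-language text."""
    lines = []
    for event in log:
        lines.append(
            f"The user {event['action']} item '{event['item']}' "
            f"in category {event['category']}."
        )
    return " ".join(lines)

log = [
    {"action": "viewed", "item": "wireless mouse", "category": "electronics"},
    {"action": "purchased", "item": "USB hub", "category": "electronics"},
]
prompt = verbalize(log)
```

The paper's point is that a fixed template like this cannot adapt phrasing to the downstream recommender, which is what "learning optimal verbalization" addresses.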
PyVision-RL: Forging Open Agentic Vision Models via RL
arXiv:2602.20739v1 Announce Type: new Abstract: Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for...
Qwen-BIM: developing large language model for BIM-based design with domain-specific benchmark and dataset
arXiv:2602.20812v1 Announce Type: new Abstract: As the construction industry advances toward digital transformation, BIM (Building Information Modeling)-based design has become a key driver supporting intelligent construction. Although Large Language Models (LLMs) have shown potential in promoting BIM-based design, the lack...
Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
arXiv:2602.20813v1 Announce Type: new Abstract: Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios...
Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence
arXiv:2602.20934v1 Announce Type: new Abstract: The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. While current research primarily focuses on scaling context windows or optimizing prompt engineering, the theoretical bridge...
LogicGraph : Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
arXiv:2602.21044v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse...
Tool Building as a Path to "Superintelligence"
arXiv:2602.21061v1 Announce Type: new Abstract: The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $\gamma$. In this work, we design a benchmark to measure $\gamma$ on logical out-of-distribution inference. We construct a...
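The role of the step-success probability $\gamma$ can be made concrete with a little arithmetic: under an independence assumption (ours, for illustration; the paper's model may differ), a k-step inference chain succeeds with probability $\gamma^k$, and n independent search attempts succeed with probability $1 - (1 - \gamma^k)^n$.

```python
# Illustrative arithmetic for the step-success probability gamma:
# a k-step chain succeeds iff every step does, and test-time search
# retries the whole chain n times. Independence is assumed here
# purely for illustration.

def chain_success(gamma: float, k: int) -> float:
    """P(all k steps succeed) under independent steps."""
    return gamma ** k

def search_success(gamma: float, k: int, n: int) -> float:
    """P(at least one of n independent attempts succeeds)."""
    return 1.0 - (1.0 - chain_success(gamma, k)) ** n

p_chain = chain_success(0.9, 10)       # ~0.349: long chains decay fast
p_search = search_success(0.9, 10, 8)  # retries recover much of the loss
```

This is why a "sufficient" $\gamma$ matters: for small $\gamma$, no feasible number of retries rescues long derivations.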
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
arXiv:2602.21172v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we...
No One Size Fits All: QueryBandits for Hallucination Mitigation
arXiv:2602.20332v1 Announce Type: new Abstract: Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations, yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations...
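The title's premise is that no single query-rewrite strategy fits all inputs, which suggests a bandit over strategies. The truncated abstract does not give the paper's formulation, so the following is a generic epsilon-greedy sketch with hypothetical arm names and a hypothetical 0/1 reward (1 if the answer was judged non-hallucinated):

```python
import random

# Generic epsilon-greedy bandit over query-rewrite strategies.
# Arm names and the reward signal are illustrative assumptions,
# not the QueryBandits formulation.

ARMS = ["no_rewrite", "add_context", "decompose", "paraphrase"]

class EpsilonGreedy:
    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}

    def select(self) -> str:
        # Explore with probability epsilon, otherwise exploit.
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))
        return max(self.values, key=self.values.get)

    def update(self, arm: str, reward: float) -> None:
        # Incremental mean of observed rewards for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

With epsilon set to 0 the policy is purely greedy, which makes the selection deterministic once one arm has accrued reward.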
A Dynamic Survey of Soft Set Theory and Its Extensions
arXiv:2602.21268v1 Announce Type: new Abstract: Soft set theory provides a direct framework for parameterized decision modeling by assigning to each attribute (parameter) a subset of a given universe, thereby representing uncertainty in a structured way [1, 2]. Over the past...
Latent Context Compilation: Distilling Long Context into Compact Portable Memory
arXiv:2602.21221v1 Announce Type: cross Abstract: Efficient long-context LLM deployment is stalled by a dichotomy between amortized compression, which struggles with out-of-distribution generalization, and Test-Time Training, which incurs prohibitive synthetic data costs and requires modifying model weights, creating stateful parameters that...
Measuring Pragmatic Influence in Large Language Model Instructions
arXiv:2602.21223v1 Announce Type: cross Abstract: It is not only what we ask large language models (LLMs) to do that matters, but also how we prompt. Phrases like "This is urgent" or "As your supervisor" can shift model behavior without altering...
Bounded Rationality and the Theory of Property
ARTICLE by Oren Bar-Gill* & Nicola Persico**. Strong property rule protection—implemented via injunctions, criminal sanctions, and supercompensatory damages—is a defining aspect of property. What is the theoretical justification for property rule protection? The conventional...
Autonomous Vehicles and Liability: Who Is Responsible When AI Drives?
As autonomous vehicles approach widespread deployment, legal frameworks for determining liability in accidents involving self-driving cars remain uncertain.
ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces
arXiv:2602.21231v1 Announce Type: cross Abstract: We present ACAR (Adaptive Complexity and Attribution Routing), a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (sigma) computed from N=3 probe samples to route tasks across single-model, two-model, and...
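The abstract's routing mechanism is concrete enough to sketch: sample N=3 probe answers, compute their self-consistency variance (sigma), and escalate to a larger ensemble when disagreement is high. The numeric scoring of probes and the thresholds below are illustrative assumptions, not the paper's values:

```python
from statistics import pvariance

# Sketch of adaptive complexity routing by self-consistency variance:
# consistent probe answers stay on the cheap single-model path,
# disagreement escalates to larger ensembles. Thresholds are
# illustrative, not ACAR's.

def route(probe_scores: list[float], threshold: float = 0.01) -> str:
    """Route a task based on the variance of N probe-sample scores."""
    sigma2 = pvariance(probe_scores)
    if sigma2 <= threshold:
        return "single-model"
    elif sigma2 <= 4 * threshold:
        return "two-model"
    return "full-ensemble"

low = route([0.81, 0.80, 0.82])   # near-identical probes
high = route([0.10, 0.90, 0.50])  # strong disagreement
```

Logging `sigma2` and the chosen tier per task would give the auditable decision trace the title refers to.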
A General Equilibrium Theory of Orchestrated AI Agent Systems
arXiv:2602.21255v1 Announce Type: cross Abstract: We establish a general equilibrium theory for systems of large language model (LLM) agents operating under centralized orchestration. The framework is a production economy in the sense of Arrow-Debreu (1954), extended to infinite-dimensional commodity spaces...
A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications
arXiv:2602.21267v1 Announce Type: cross Abstract: Cybersecurity threats are becoming increasingly sophisticated, making traditional defense mechanisms and manual red teaming approaches insufficient for modern organizations. While red teaming has long been recognized as an effective method to identify vulnerabilities by simulating...
Equitable Evaluation via Elicitation
arXiv:2602.21327v1 Announce Type: cross Abstract: Individuals with similar qualifications and skills may vary in their demeanor, or outward manner: some tend toward self-promotion while others are modest to the point of omitting crucial information. Comparing the self-descriptions of equally qualified...
Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
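For orientation, the standard DPO loss on one preference pair is $-\log \sigma\big(\beta[(\log\pi(y_w) - \log\pi_{\mathrm{ref}}(y_w)) - (\log\pi(y_l) - \log\pi_{\mathrm{ref}}(y_l))]\big)$. One plausible reading of "alignment-weighted" is a per-example weight on this loss; the abstract is truncated, so the weight `w` below is our assumption:

```python
import math

# Standard per-pair DPO loss with an optional per-example weight w,
# as one plausible reading of "alignment-weighted DPO". The weighting
# scheme is an assumption, not the paper's definition.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def weighted_dpo_loss(
    logp_chosen: float, logp_rejected: float,
    ref_logp_chosen: float, ref_logp_rejected: float,
    beta: float = 0.1, w: float = 1.0,
) -> float:
    # Implicit reward margin between chosen and rejected responses,
    # measured relative to the frozen reference policy.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -w * math.log(sigmoid(beta * margin))

loss = weighted_dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1, w=2.0)
```

A safety-focused weighting would presumably upweight pairs where the rejected response is harmful, concentrating gradient signal on safety-critical preferences.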
arXiv:2602.21346v1 Announce Type: cross Abstract: Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable...
Towards single-shot coherent imaging via overlap-free ptychography
arXiv:2602.21361v1 Announce Type: cross Abstract: Ptychographic imaging at synchrotron and XFEL sources requires dense overlapping scans, limiting throughput and increasing dose. Extending coherent diffractive imaging to overlap-free operation on extended samples remains an open problem. Here, we extend PtychoPINN (O....
FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning
arXiv:2602.21399v1 Announce Type: cross Abstract: Federated Learning (FL) enables collaborative model training across multiple clients without sharing their private data. However, data heterogeneity across clients leads to client drift, which degrades the overall generalization performance of the model. This effect...
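One common way to make aggregation "gradient-guided" is to weight each client's update by its agreement with the average update direction, damping drifted clients. This is our illustrative reading under that assumption, not necessarily the FedVG rule:

```python
import math

# Hedged sketch: cosine-similarity-weighted aggregation of client
# updates, one plausible instance of gradient-guided aggregation.
# Updates opposing the mean direction are clipped to zero weight
# rather than subtracted.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def aggregate(updates: list[list[float]]) -> list[float]:
    dim = len(updates[0])
    mean = [sum(u[i] for u in updates) / len(updates) for i in range(dim)]
    weights = [
        max(0.0, dot(u, mean) / (norm(u) * norm(mean) + 1e-12))
        for u in updates
    ]
    total = sum(weights) or 1.0
    return [
        sum(w * u[i] for w, u in zip(weights, updates)) / total
        for i in range(dim)
    ]

# Third client drifts opposite the consensus and is down-weighted.
agg = aggregate([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
```

Plain FedAvg would let the drifted client cancel most of the consensus signal; the weighting keeps the aggregate aligned with the majority direction.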
Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation
arXiv:2602.22215v1 Announce Type: new Abstract: Large Language Models (LLMs) demonstrate potential in the field of scientific idea generation. However, the generated results often lack controllable academic context and traceable inspiration pathways. To bridge this gap, this paper proposes a scientific...
FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation
arXiv:2602.22273v1 Announce Type: new Abstract: We introduce FIRE, a comprehensive benchmark designed to evaluate both the theoretical financial knowledge of LLMs and their ability to handle practical business scenarios. For theoretical assessment, we curate a diverse set of examination questions...
ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization
arXiv:2602.22465v1 Announce Type: new Abstract: Large language models are increasingly applied to operational decision-making where the underlying structure is constrained optimization. Existing benchmarks evaluate whether LLMs can formulate optimization problems as solver code, but leave open a complementary question. Can...
VeRO: An Evaluation Harness for Agents to Optimize Agents
arXiv:2602.22480v1 Announce Type: new Abstract: An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this...
A Mathematical Theory of Agency and Intelligence
arXiv:2602.22519v1 Announce Type: new Abstract: To operate reliably under changing conditions, complex systems require feedback on how effectively they use resources, not just whether objectives are met. Current AI systems process vast information to produce sophisticated predictions, yet predictions can...
Decomposing Physician Disagreement in HealthBench
arXiv:2602.22758v1 Announce Type: new Abstract: We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it. Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9%...
Certified Circuits: Stability Guarantees for Mechanistic Circuits
arXiv:2602.22968v1 Announce Type: new Abstract: Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits - minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods...
Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts
arXiv:2602.22359v1 Announce Type: new Abstract: This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels. It foregrounds prompt-sensitivity...