IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering
arXiv:2602.17687v1 Announce Type: cross Abstract: AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent...
Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction
arXiv:2602.17689v1 Announce Type: cross Abstract: Medical vision-language models show strong potential for joint reasoning over medical images and clinical text, but their performance often degrades under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles. Existing...
Agentic Unlearning: When LLM Agent Meets Machine Unlearning
arXiv:2602.17692v1 Announce Type: cross Abstract: In this paper, we introduce agentic unlearning, which removes specified information from both model parameters and persistent memory in agents with closed-loop interaction. Existing unlearning methods target parameters alone, leaving two critical gaps: (i) parameter-memory...
UBio-MolFM: A Universal Molecular Foundation Model for Bio-Systems
arXiv:2602.17709v1 Announce Type: cross Abstract: All-atom molecular simulation serves as a quintessential "computational microscope" for understanding the machinery of life, yet it remains fundamentally limited by the trade-off between quantum-mechanical (QM) accuracy and biological scale. We present UBio-MolFM, a universal...
"Everyone's using it, but no one is allowed to talk about it": College Students' Experiences Navigating the Higher Education Environment in a Generative AI World
arXiv:2602.17720v1 Announce Type: cross Abstract: Higher education students are increasingly using generative AI in their academic work. However, existing institutional practices have not yet adapted to this shift. Through semi-structured interviews with 23 college students, our study examines the environmental...
Five Fatal Assumptions: Why T-Shirt Sizing Systematically Fails for AI Projects
arXiv:2602.17734v1 Announce Type: cross Abstract: Agile estimation techniques, particularly T-shirt sizing, are widely used in software development for their simplicity and utility in scoping work. However, when we apply these methods to artificial intelligence initiatives -- especially those involving large...
GeneZip: Region-Aware Compression for Long Context DNA Modeling
arXiv:2602.17739v1 Announce Type: cross Abstract: Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy...
The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems
arXiv:2602.17753v1 Announce Type: cross Abstract: Agentic AI systems are increasingly capable of performing professional and personal tasks with limited human involvement. However, tracking these developments is difficult because the AI agent ecosystem is complex, rapidly evolving, and inconsistently documented, posing...
Impact of Artificial Intelligence on Dental Education: A Review and Guide for Curriculum Update
In this intellectual work, the clinical and educational aspects of dentistry were confronted with practical applications of artificial intelligence (AI). The aim was to provide an up-to-date overview of the upcoming changes and a brief analysis of the influential advancements...
Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
arXiv:2602.18582v1 Announce Type: new Abstract: When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed. As AI agents tackle increasingly complex tasks, aligning their behavior...
Many AI Analysts, One Dataset: Navigating the Agentic Data Science Multiverse
arXiv:2602.18710v1 Announce Type: new Abstract: The conclusions of empirical research depend not only on data but on a sequence of analytic decisions that published results seldom make explicit. Past "many-analyst" studies have demonstrated this: independent teams testing the same hypothesis...
Task-Aware Exploration via a Predictive Bisimulation Metric
arXiv:2602.18724v1 Announce Type: new Abstract: Accelerating exploration in visual reinforcement learning under sparse rewards remains challenging due to the substantial task-irrelevant variations. Despite advances in intrinsic exploration, many methods either assume access to low-dimensional states or lack task-aware exploration strategies,...
Federated Reasoning Distillation Framework with Model Learnability-Aware Data Allocation
arXiv:2602.18749v1 Announce Type: new Abstract: Data allocation plays a critical role in federated large language model (LLM) and small language model (SLM) reasoning collaboration. Nevertheless, existing data allocation methods fail to address an under-explored challenge in collaboration: bidirectional model learnability...
TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models
arXiv:2602.18884v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs), particularly smaller, deployable variants, exhibit a critical deficiency in understanding temporal and procedural visual data, a bottleneck hindering their application in real-world embodied AI. This gap is largely caused by...
DREAM: Deep Research Evaluation with Agentic Metrics
arXiv:2602.18940v1 Announce Type: new Abstract: Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer...
(Perlin) Noise as AI coordinator
arXiv:2602.18947v1 Announce Type: new Abstract: Large scale control of nonplayer agents is central to modern games, while production systems still struggle to balance several competing goals: locally smooth, natural behavior, and globally coordinated variety across space and time. Prior approaches...
When Do LLM Preferences Predict Downstream Behavior?
arXiv:2602.18971v1 Announce Type: new Abstract: Preference-driven behavior in LLMs may be a necessary precondition for AI misalignment such as sandbagging: models cannot strategically pursue misaligned goals unless their behavior is influenced by their preferences. Yet prior work has typically prompted...
How Far Can We Go with Pixels Alone? A Pilot Study on Screen-Only Navigation in Commercial 3D ARPGs
arXiv:2602.18981v1 Announce Type: new Abstract: Modern 3D game levels rely heavily on visual guidance, yet the navigability of level layouts remains difficult to quantify. Prior work either simulates play in simplified environments or analyzes static screenshots for visual affordances, but...
Benchmark Test-Time Scaling of General LLM Agents
arXiv:2602.18998v1 Announce Type: new Abstract: LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that...
Evaluating Large Language Models on Quantum Mechanics: A Comparative Study Across Diverse Models and Tasks
arXiv:2602.19006v1 Announce Type: new Abstract: We present a systematic evaluation of large language models on quantum mechanics problem-solving. Our study evaluates 15 models from five providers (OpenAI, Anthropic, Google, Alibaba, DeepSeek) spanning three capability tiers on 20 tasks covering derivations,...
Defining Explainable AI for Requirements Analysis
arXiv:2602.19071v1 Announce Type: new Abstract: Explainable Artificial Intelligence (XAI) has become popular in the last few years. The Artificial Intelligence (AI) community in general, and the Machine Learning (ML) community in particular, is coming to the realisation that in many...
Post-Routing Arithmetic in Llama-3: Last-Token Result Writing and Rotation-Structured Digit Directions
arXiv:2602.19109v1 Announce Type: new Abstract: We study three-digit addition in Meta-Llama-3-8B (base) under a one-token readout to characterize how arithmetic answers are finalized after cross-token routing becomes causally irrelevant. Causal residual patching and cumulative attention ablations localize a sharp boundary...
DoAtlas-1: A Causal Compilation Paradigm for Clinical AI
arXiv:2602.19158v1 Announce Type: new Abstract: Medical foundation models generate narrative explanations but cannot quantify intervention effects, detect evidence conflicts, or validate literature claims, limiting clinical auditability. We propose causal compilation, a paradigm that transforms medical evidence from narrative text into...
Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
arXiv:2602.19160v1 Announce Type: new Abstract: This paper examines the reasoning capabilities of Large Language Models (LLMs) from a novel perspective, focusing on their ability to operate within formally specified, rule-governed environments. We evaluate four LLMs (Gemini 2.5 Pro and Flash...
Robust Exploration in Directed Controller Synthesis via Reinforcement Learning with Soft Mixture-of-Experts
arXiv:2602.19244v1 Announce Type: new Abstract: On-the-fly Directed Controller Synthesis (OTF-DCS) mitigates state-space explosion by incrementally exploring the system and relies critically on an exploration policy to guide search efficiently. Recent reinforcement learning (RL) approaches learn such policies and achieve promising...
Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces
arXiv:2602.19367v1 Announce Type: new Abstract: The Platonic Representation Hypothesis posits that learned representations from models trained on different modalities converge to a shared latent structure of the world. However, this hypothesis has largely been examined in vision and language, and...
Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift
arXiv:2602.18699v1 Announce Type: new Abstract: Most semantic drift studies report multiple signals, e.g., embedding displacement, neighbor changes, distributional divergence, and recursive trajectory instability, without a shared explanatory theory that relates them. This paper proposes a formalization of these signals in...
DeepInnovator: Triggering the Innovative Capabilities of LLMs
arXiv:2602.18920v1 Announce Type: new Abstract: The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously generate novel and...
Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language
arXiv:2602.18964v1 Announce Type: new Abstract: Sarcasm detection poses a fundamental challenge in computational semantics, requiring models to resolve disparities between literal and intended meaning. The challenge is amplified in low-resource languages where annotated datasets are scarce or nonexistent. We present...
Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
arXiv:2602.19008v1 Announce Type: new Abstract: Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every...