Gadgets
The Verge is about technology and how it makes us feel. Founded in 2011, we offer our audience everything from breaking news to reviews to award-winning features and investigations, on our site, in video, and in podcasts.
HBO
Originally a private cable network, then a premium cable channel, then a mini-network of specialized and dedicated channels, HBO has evolved into a powerhouse of original content production. Game of Thrones is its most obvious success story. But series like...
Entertainment
The Verge’s entertainment section collects the latest news from the worlds of pop culture, music, movies, television, and video games. Whether you want to know what to watch on Netflix or how to make the most of your streaming service...
Laptops
The Verge is about technology and how it makes us feel. Founded in 2011, we offer our audience everything from breaking news to reviews to award-winning features and investigations, on our site, in video, and in podcasts.
Nintendo
The Verge is about technology and how it makes us feel. Founded in 2011, we offer our audience everything from breaking news to reviews to award-winning features and investigations, on our site, in video, and in podcasts.
AI
Artificial intelligence is more a part of our lives than ever before. While some might call it hype and compare it to NFTs or 3D TVs, AI is causing a sea change in nearly every part of the technology industry....
VeRA: Verified Reasoning Data Augmentation at Scale
arXiv:2602.13217v1 Announce Type: new Abstract: The main issue with most evaluation schemes today is their "static" nature: the same problems are reused repeatedly, allowing for memorization, format exploitation, and eventual saturation. To measure genuine AI progress, we need evaluation that...
Intelligence as Trajectory-Dominant Pareto Optimization
arXiv:2602.13230v1 Announce Type: new Abstract: Despite recent advances in artificial intelligence, many systems exhibit stagnation in long-horizon adaptability despite continued performance optimization. This work argues that such limitations do not primarily arise from insufficient learning, data, or model capacity, but...
PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading
arXiv:2602.13232v1 Announce Type: new Abstract: We present PlotChain, a deterministic, generator-based benchmark for evaluating multimodal large language models (MLLMs) on engineering plot reading-recovering quantitative values from classic plots (e.g., Bode/FFT, step response, stress-strain, pump curves) rather than OCR-only extraction or...
X-Blocks: Linguistic Building Blocks of Natural Language Explanations for Automated Vehicles
arXiv:2602.13248v1 Announce Type: new Abstract: Natural language explanations play a critical role in establishing trust and acceptance of automated vehicles (AVs), yet existing approaches lack systematic frameworks for analysing how humans linguistically construct driving rationales across diverse scenarios. This paper...
DPBench: Large Language Models Struggle with Simultaneous Coordination
arXiv:2602.13255v1 Announce Type: new Abstract: Large language models are increasingly deployed in multi-agent systems, yet we lack benchmarks that test whether they can coordinate under resource contention. We introduce DPBench, a benchmark based on the Dining Philosophers problem that evaluates...
Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol
arXiv:2602.13320v1 Announce Type: new Abstract: As AI agents powered by large language models (LLMs) increasingly use external tools for high-stakes decisions, a critical reliability question arises: how do errors propagate across sequential tool calls? We introduce the first theoretical framework...
Hippocampus: An Efficient and Scalable Memory Module for Agentic AI
arXiv:2602.13594v1 Announce Type: new Abstract: Agentic AI require persistent memory to store user-specific histories beyond the limited context window of LLMs. Existing memory systems use dense vector databases or knowledge-graph traversal (or hybrid), incurring high retrieval latency and poor storage...
LLM-Powered Automatic Translation and Urgency in Crisis Scenarios
arXiv:2602.13452v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly proposed for crisis preparedness and response, particularly for multilingual communication. However, their suitability for high-stakes crisis contexts remains insufficiently evaluated. This work examines the performance of state-of-the-art LLMs and...
Small Reward Models via Backward Inference
arXiv:2602.13551v1 Announce Type: new Abstract: Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require...
GRRM: Group Relative Reward Modeling for Machine Translation
arXiv:2602.14028v1 Announce Type: new Abstract: While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains like Machine Translation hinges on accurate intra-group ranking. We identify that standard Scalar Quality Metrics (SQM) fall...
Alignment in Time: Peak-Aware Orchestration for Long-Horizon Agentic Systems
arXiv:2602.17910v1 Announce Type: new Abstract: Traditional AI alignment primarily focuses on individual model outputs; however, autonomous agents in long-horizon workflows require sustained reliability across entire interaction trajectories. We introduce APEMO (Affect-aware Peak-End Modulation for Orchestration), a runtime scheduling layer that...
WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics
arXiv:2602.17990v1 Announce Type: new Abstract: LLM-based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult, because metric scores are often not calibrated, and score changes do not directly communicate the severity of...
Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO
arXiv:2602.17686v1 Announce Type: cross Abstract: Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to faithfully reproduce. Existing approaches either compress reasoning into single-step,...
Federated Reasoning Distillation Framework with Model Learnability-Aware Data Allocation
arXiv:2602.18749v1 Announce Type: new Abstract: Data allocation plays a critical role in federated large language model (LLM) and small language models (SLMs) reasoning collaboration. Nevertheless, existing data allocation methods fail to address an under-explored challenge in collaboration: bidirectional model learnability...
LAMMI-Pathology: A Tool-Centric Bottom-Up LVLM-Agent Framework for Molecularly Informed Medical Intelligence in Pathology
arXiv:2602.18773v1 Announce Type: new Abstract: The emergence of tool-calling-based agent systems introduces a more evidence-driven paradigm for pathology image analysis in contrast to the coarse-grained text-image diagnostic approaches. With the recent large-scale experimental adoption of spatial transcriptomics technologies, molecularly validated...
Early Evidence of Vibe-Proving with Consumer LLMs: A Case Study on Spectral Region Characterization with ChatGPT-5.2 (Thinking)
arXiv:2602.18918v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly used as scientific copilots, but evidence on their role in research-level mathematics remains limited, especially for workflows accessible to individual researchers. We present early evidence for vibe-proving with a...
How Far Can We Go with Pixels Alone? A Pilot Study on Screen-Only Navigation in Commercial 3D ARPGs
arXiv:2602.18981v1 Announce Type: new Abstract: Modern 3D game levels rely heavily on visual guidance, yet the navigability of level layouts remains difficult to quantify. Prior work either simulates play in simplified environments or analyzes static screenshots for visual affordances, but...
K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model
arXiv:2602.19128v1 Announce Type: new Abstract: Optimizing GPU kernels is critical for efficient modern machine learning systems yet remains challenging due to the complex interplay of design factors and rapid hardware evolution. Existing automated approaches typically treat Large Language Models (LLMs)...
Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians
arXiv:2602.19141v1 Announce Type: new Abstract: "AI psychosis" or "delusional spiraling" is an emerging phenomenon where AI chatbot users find themselves dangerously confident in outlandish beliefs after extended chatbot conversations. This phenomenon is typically attributed to AI chatbots' well-documented bias towards...
DoAtlas-1: A Causal Compilation Paradigm for Clinical AI
arXiv:2602.19158v1 Announce Type: new Abstract: Medical foundation models generate narrative explanations but cannot quantify intervention effects, detect evidence conflicts, or validate literature claims, limiting clinical auditability. We propose causal compilation, a paradigm that transforms medical evidence from narrative text into...
Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment
arXiv:2602.19223v1 Announce Type: new Abstract: The optimization of urban energy systems is crucial for the advancement of sustainable and resilient smart cities, which are becoming increasingly complex with multiple decision-making units. To address scalability and coordination concerns, Multi-Agent Reinforcement Learning...
IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking
arXiv:2602.19416v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking - models exploit spurious correlations in proxy rewards without genuine alignment. Compounding this, the objectives internalized during RLHF remain opaque,...
Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem
arXiv:2602.18734v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has demonstrated strong effectiveness in knowledge-intensive tasks by grounding language generation in external evidence. Despite its success, many existing RAG systems are built based on a ranking-centric, asymmetric dependency paradigm, where the...
BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models
arXiv:2602.18788v1 Announce Type: new Abstract: We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies,...