PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
arXiv:2603.23231v1 Announce Type: new Abstract: Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. However, prior evaluations typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while...
Towards Automated Community Notes Generation with Large Vision Language Models for Combating Contextual Deception
arXiv:2603.22453v1 Announce Type: new Abstract: Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, their reliance on human contributors limits both timeliness and scalability. In this work, we study the...
Synthetic or Authentic? Building Mental Patient Simulators from Longitudinal Evidence
arXiv:2603.22704v1 Announce Type: new Abstract: Patient simulation is essential for developing and evaluating mental health dialogue systems. As most existing approaches rely on snapshot-style prompts with limited profile information, homogeneous behaviors and incoherent disease progression in multi-turn interactions have become...
Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases
arXiv:2603.22767v1 Announce Type: new Abstract: Observational studies can yield clinically actionable evidence at scale, but executing them on real-world databases is open-ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps...
Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs
arXiv:2603.22446v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR's distributional effects organized...
Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts
arXiv:2603.22837v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly utilised for social simulation and persona generation, necessitating an understanding of how they represent geopolitical identities. In this paper, we analyse personas generated for Palestinian and Israeli identities by...
PaperVoyager: Building Interactive Web with Visual Language Models
arXiv:2603.22999v1 Announce Type: new Abstract: Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which...
When Language Models Lose Their Mind: The Consequences of Brain Misalignment
arXiv:2603.23091v1 Announce Type: new Abstract: While brain-aligned large language models (LLMs) have garnered attention for their potential as cognitive models and for enhanced safety and trustworthiness in AI, the role of this brain alignment in linguistic competence remains...
Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy
arXiv:2603.23146v1 Announce Type: new Abstract: The widespread adoption of Large Language Models (LLMs) has made the detection of AI-generated text a pressing and complex challenge. Although many detection systems report high benchmark accuracy, their reliability in real-world settings remains uncertain,...
UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities
arXiv:2603.23160v1 Announce Type: new Abstract: Benchmarking AI systems in multi-turn interactive scenarios is essential for understanding their practical capabilities in real-world applications. However, existing evaluation protocols are highly heterogeneous, differing significantly in dataset formats, model interfaces, and evaluation pipelines, which...
ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment
arXiv:2603.23184v1 Announce Type: new Abstract: Reward modeling represents a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling is heavily contingent upon experimental feedback data with high collection costs. In this work, we...
Is AI Catching Up to Human Expression? Exploring Emotion, Personality, Authorship, and Linguistic Style in English and Arabic with Six Large Language Models
arXiv:2603.23251v1 Announce Type: new Abstract: The advancing fluency of LLMs raises important questions about their ability to emulate complex human traits, including emotional expression and personality, across diverse linguistic and cultural contexts. This study investigates whether LLMs can convincingly mimic...
ST-GDance++: A Scalable Spatial-Temporal Diffusion for Long-Duration Group Choreography
arXiv:2603.22316v1 Announce Type: new Abstract: Group dance generation from music requires synchronizing multiple dancers while maintaining spatial coordination, making it highly relevant to applications such as film production, gaming, and animation. Recent group dance generation models have achieved promising generation...
Large Language Models for Missing Data Imputation: Understanding Behavior, Hallucination Effects, and Control Mechanisms
arXiv:2603.22332v1 Announce Type: new Abstract: Data imputation is a cornerstone technique for handling missing values, which are pervasive in real-world datasets. Despite recent progress, prior studies on LLM-based imputation remain limited by scalability challenges, restricted...
Unveiling the Mechanism of Continuous Representation Full-Waveform Inversion: A Wave Based Neural Tangent Kernel Framework
arXiv:2603.22362v1 Announce Type: new Abstract: Full-waveform inversion (FWI) estimates physical parameters in the wave equation from limited measurements and has been widely applied in geophysical exploration, medical imaging, and non-destructive testing. Conventional FWI methods are limited by their notorious sensitivity...
Instruction-Tuned, but Not More Verifiable Instruction-Following: A Cross-Task Diagnosis for LoRA Adapters
arXiv:2603.22379v1 Announce Type: new Abstract: Adapters are often selected and deployed based on nominal labels (e.g., instruction-tuned), which implicitly suggest what capability improves after adaptation. We test whether nominal training objectives reliably align with realized cross-task capability gains by evaluating...
Symbolic Graph Networks for Robust PDE Discovery from Noisy Sparse Data
arXiv:2603.22380v1 Announce Type: new Abstract: Data-driven discovery of partial differential equations (PDEs) offers a promising paradigm for uncovering governing physical laws from observational data. However, in practical scenarios, measurements are often contaminated by noise and limited by sparse sampling, which...
Neural Structure Embedding for Symbolic Regression via Continuous Structure Search and Coefficient Optimization
arXiv:2603.22429v1 Announce Type: new Abstract: Symbolic regression aims to discover human-interpretable equations that explain observational data. However, existing approaches rely heavily on discrete structure search (e.g., genetic programming), which often leads to high computational cost, unstable performance, and limited scalability...
Model Predictive Control with Differentiable World Models for Offline Reinforcement Learning
arXiv:2603.22430v1 Announce Type: new Abstract: Offline Reinforcement Learning (RL) aims to learn optimal policies from fixed offline datasets, without further interactions with the environment. Such methods train an offline policy (or value function), and apply it at inference time without...
All of DOGE’s work could be undone as lawsuit against Musk proceeds
Musk’s X posts bragging about DOGE may trigger reversals of its biggest wins.
Governance-Aware Vector Subscriptions for Multi-Agent Knowledge Ecosystems
arXiv:2603.20833v1 Announce Type: new Abstract: As AI agent ecosystems grow, agents need mechanisms to monitor relevant knowledge in real time. Semantic publish-subscribe systems address this by matching new content against vector subscriptions. However, in multi-agent settings where agents operate under...
Me, Myself, and π: Evaluating and Explaining LLM Introspection
arXiv:2603.20276v1 Announce Type: new Abstract: A hallmark of human intelligence is introspection: the ability to assess and reason about one's own cognitive processes. Introspection has emerged as a promising but contested capability in large language models (LLMs). However, current evaluations often...
Modeling Epistemic Uncertainty in Social Perception via Rashomon Set Agents
arXiv:2603.20750v1 Announce Type: new Abstract: We present an LLM-driven multi-agent probabilistic modeling framework that demonstrates how differences in students' subjective social perceptions arise and evolve in real-world classroom settings, under constraints from an observed social network and limited questionnaire data....
Towards Intelligent Geospatial Data Discovery: a knowledge graph-driven multi-agent framework powered by large language models
arXiv:2603.20670v1 Announce Type: new Abstract: The rapid growth in the volume, variety, and velocity of geospatial data has created data ecosystems that are highly distributed, heterogeneous, and semantically inconsistent. Existing data catalogs, portals, and infrastructures still rely largely on keyword-based...
Locally Coherent Parallel Decoding in Diffusion Language Models
arXiv:2603.20216v1 Announce Type: new Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete...
Decoding the decoder: Contextual sequence-to-sequence modeling for intracortical speech decoding
arXiv:2603.20246v1 Announce Type: new Abstract: Speech brain-computer interfaces require decoders that translate intracortical activity into linguistic output while remaining robust to limited data and day-to-day variability. While prior high-performing systems have largely relied on framewise phoneme decoding combined with downstream...
FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems
arXiv:2603.20252v1 Announce Type: new Abstract: As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA...
RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models
arXiv:2603.21341v1 Announce Type: new Abstract: Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them that can readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in...
LLM-Driven Heuristic Synthesis for Industrial Process Control: Lessons from Hot Steel Rolling
arXiv:2603.20537v1 Announce Type: new Abstract: Industrial process control demands policies that are interpretable and auditable, requirements that black-box neural policies struggle to meet. We study an LLM-driven heuristic synthesis framework for hot steel rolling, in which a language model iteratively...
Can we automatize scientific discovery in the cognitive sciences?
arXiv:2603.20988v1 Announce Type: new Abstract: The cognitive sciences aim to understand intelligence by formalizing underlying operations as computational models. Traditionally, this follows a cycle of discovery where researchers develop paradigms, collect data, and test predefined model classes. However, this manual...