RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction
arXiv:2603.12666v1 Announce Type: new Abstract: Retrosynthesis prediction is a core task in organic synthesis that aims to predict reactants for a given product molecule. Traditionally, chemists select a plausible bond disconnection and derive corresponding reactants, which is time-consuming and requires...
Disentangled Latent Dynamics Manifold Fusion for Solving Parameterized PDEs
arXiv:2603.12676v1 Announce Type: new Abstract: Generalizing neural surrogate models across different PDE parameters remains difficult because changes in PDE coefficients often make learning harder and optimization less stable. The problem becomes even more severe when the model must also predict...
Before quantum computing arrives, this startup wants enterprises already running on it
After selling his AI startup to AMD for $665 million, Peter Sarlin is back with Qutwo, a new venture building the infrastructure it believes enterprises will need when quantum computing finally arrives.
MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries
arXiv:2603.11223v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers when indexing reduces text to triples: important contextual nuance is lost, degrading performance in downstream Question-Answering (QA) tasks, particularly for...
RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents
arXiv:2603.11337v1 Announce Type: new Abstract: LLM agents increasingly perform end-to-end ML engineering tasks where success is judged by a single scalar test metric. This creates a structural vulnerability: an agent can increase the reported score by compromising the evaluation pipeline...
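The vulnerability described above — an agent inflating a single scalar test metric by tampering with the evaluation pipeline — can be guarded against by recomputing the metric independently from data the agent cannot write to. This is a minimal sketch of that idea, not the benchmark's actual harness; all names are illustrative.

```python
def evaluate_with_integrity_check(agent_report: dict,
                                  heldout_predictions: list,
                                  heldout_labels: list) -> dict:
    """Recompute the test metric independently and flag mismatches.

    `agent_report` is the score the agent claims; the held-out data is
    assumed to live outside the agent's writable workspace, so a
    compromised pipeline cannot alter it.  (Illustrative, not the
    paper's setup.)
    """
    correct = sum(p == y for p, y in zip(heldout_predictions, heldout_labels))
    recomputed = correct / len(heldout_labels)
    return {
        "reported": agent_report["score"],
        "recomputed": recomputed,
        # A reported score that the trusted recomputation cannot
        # reproduce is a sign the evaluation was gamed.
        "suspect": abs(agent_report["score"] - recomputed) > 1e-6,
    }
```

The design choice is simply separation of trust: the metric that matters is the one computed by the verifier, never the one reported by the agent.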
ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions
arXiv:2603.11281v1 Announce Type: new Abstract: Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer...
Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects
arXiv:2603.11412v1 Announce Type: new Abstract: Under surprisal theory, linguistic representations affect processing difficulty only through the bottleneck of surprisal. Our best estimates of surprisal come from large language models, which have no explicit representation of structural ambiguity. While LLM surprisal...
Temporal Text Classification with Large Language Models
arXiv:2603.11295v1 Announce Type: new Abstract: Languages change over time. Computational models can be trained to recognize such changes, enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating...
CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
arXiv:2603.11863v1 Announce Type: new Abstract: The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the...
DocSage: An Information Structuring Agent for Multi-Doc Multi-Entity Question Answering
arXiv:2603.11798v1 Announce Type: new Abstract: Multi-document Multi-entity Question Answering inherently demands models to track implicit logic between multiple entities across scattered documents. However, existing Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks suffer from critical limitations: standard RAG's vector...
Gender Bias in Generative AI-assisted Recruitment Processes
arXiv:2603.11736v1 Announce Type: new Abstract: In recent years, generative artificial intelligence (GenAI) systems have assumed increasingly crucial roles in selection processes, personnel recruitment and analysis of candidates' profiles. However, the employment of large language models (LLMs) risks reproducing, and in...
VisiFold: Long-Term Traffic Forecasting via Temporal Folding Graph and Node Visibility
arXiv:2603.11816v1 Announce Type: new Abstract: Traffic forecasting is a cornerstone of intelligent transportation systems. While existing research has made significant progress in short-term prediction, long-term forecasting remains a largely uncharted and challenging frontier. Extending the prediction horizon intensifies two critical...
Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
arXiv:2603.11388v1 Announce Type: new Abstract: Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem where aligned...
The Density of Cross-Persistence Diagrams and Its Applications
arXiv:2603.11623v1 Announce Type: new Abstract: Topological Data Analysis (TDA) provides powerful tools to explore the shape and structure of data through topological features such as clusters, loops, and voids. Persistence diagrams are a cornerstone of TDA, capturing the evolution of...
Adversarial Reinforcement Learning for Detecting False Data Injection Attacks in Vehicular Routing
arXiv:2603.11433v1 Announce Type: new Abstract: In modern transportation networks, adversaries can manipulate routing algorithms using false data injection attacks, such as simulating heavy traffic with multiple devices running crowdsourced navigation applications, to mislead vehicles toward suboptimal routes and increase congestion....
Evaluating Explainable AI Attribution Methods in Neural Machine Translation via Attention-Guided Knowledge Distillation
arXiv:2603.11342v1 Announce Type: new Abstract: The study of the attribution of input features to the output of neural network models is an active area of research. While numerous Explainable AI (XAI) techniques have been proposed to interpret these models, the...
Automating Skill Acquisition through Large-Scale Mining of Open-Source Agentic Repositories: A Framework for Multi-Agent Procedural Knowledge Extraction
arXiv:2603.11808v1 Announce Type: new Abstract: The transition from monolithic large language models (LLMs) to modular, skill-equipped agents represents a fundamental architectural shift in artificial intelligence deployment. While general-purpose models demonstrate remarkable breadth in declarative knowledge, their utility in autonomous workflows...
Markovian Generation Chains in Large Language Models
arXiv:2603.11228v1 Announce Type: new Abstract: The widespread use of large language models (LLMs) raises an important question: how do texts evolve when they are repeatedly processed by LLMs? In this paper, we define this iterative inference process as Markovian generation...
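The iterative process defined above — text repeatedly fed back through an LLM — is modeled as a Markov chain, and Markov chains with a fixed transition kernel converge to a stationary distribution. A toy sketch of that dynamic over a two-state space (states and probabilities are invented for illustration, not taken from the paper):

```python
def step(dist, P):
    """One re-processing pass, modeled as a Markov transition dist' = dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def iterate(dist, P, k):
    """Apply k passes; repeated application drifts toward the stationary distribution."""
    for _ in range(k):
        dist = step(dist, P)
    return dist

# Two text "states" with transition probabilities; starting entirely in
# state 0, repeated processing converges to the stationary distribution
# (5/6, 1/6) for this kernel.
P = [[0.9, 0.1],
     [0.5, 0.5]]
final = iterate([1.0, 0.0], P, 50)
```

Solving pi = pi P for this kernel gives pi = (5/6, 1/6), which the iteration approaches geometrically.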
The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning
arXiv:2603.11266v1 Announce Type: new Abstract: Unlearning in Large Language Models (LLMs) aims to enhance safety, mitigate biases, and comply with legal mandates, such as the right to be forgotten. However, existing unlearning methods are brittle: minor query modifications, such as...
Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation
arXiv:2603.11067v1 Announce Type: new Abstract: Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques, especially training-free approaches that improve models at inference time without updating weights. Most training-free...
PACED: Distillation at the Frontier of Student Competence
arXiv:2603.11178v1 Announce Type: new Abstract: Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not...
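The two failure modes above — near-zero gradients on mastered problems and incoherent gradients on problems far beyond reach — suggest filtering training examples to a band around the student's current competence. A minimal sketch using per-problem student loss as the competence signal; the thresholds and function name are illustrative assumptions, not the paper's method.

```python
def select_frontier(problems, student_loss, low=0.1, high=2.0):
    """Keep problems whose student loss falls in a competence band.

    Loss below `low` ~ already mastered (near-zero gradients, wasted
    compute); loss above `high` ~ far beyond reach (incoherent
    gradients that can erode existing capabilities).  Thresholds are
    illustrative, not from the paper.
    """
    return [p for p, loss in zip(problems, student_loss)
            if low <= loss <= high]
```

In practice the band would move as the student improves, so the filter is re-applied each round with refreshed loss estimates.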
MaterialFigBENCH: Benchmark Dataset with Figures for Evaluating College-Level Materials Science Problem-Solving Abilities of Multimodal Large Language Models
arXiv:2603.11414v1 Announce Type: new Abstract: We present MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that primarily rely...
AI Knows What's Wrong But Cannot Fix It: Helicoid Dynamics in Frontier LLMs Under High-Stakes Decisions
arXiv:2603.11559v1 Announce Type: new Abstract: Large language models perform reliably when their outputs can be checked: solving equations, writing code, retrieving facts. They perform differently when checking is impossible, as when a clinician chooses an irreversible treatment on incomplete data,...
Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale
arXiv:2603.11513v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) is widely deployed to improve factual accuracy in language models, yet it remains unclear whether smaller models of 7B parameters or fewer can effectively utilize retrieved information. To investigate this...
Streaming Translation and Transcription Through Speech-to-Text Causal Alignment
arXiv:2603.11578v1 Announce Type: new Abstract: Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription...
Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information
arXiv:2603.11749v1 Announce Type: new Abstract: Why do language models sometimes prefer correct statements even when trained on mixed-quality data? We introduce the Compression-Consistency Principle: next-token prediction favors hypotheses that allow shorter and more internally consistent descriptions of the training data....
Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents
arXiv:2603.11772v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising technology for legal document consultation, yet its application in Chinese legal scenarios faces two key limitations: existing benchmarks lack specialized support for joint retriever-generator evaluation, and mainstream...
DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining
arXiv:2603.11838v1 Announce Type: new Abstract: In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias that undermines their forecasting validity, as they may have already seen the true outcome during training. To address this, we present...
BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs
arXiv:2603.11991v1 Announce Type: new Abstract: Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI),...
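The core mechanism named above — matching texts directly to human-readable label descriptions — can be illustrated with a toy similarity matcher. Real systems in the benchmark use cross-encoders, embedding models, rerankers, or LLMs; this sketch substitutes a simple bag-of-words cosine similarity purely to show the matching step, and all names and label descriptions are invented.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_classify(text: str, label_descriptions: dict) -> str:
    """Assign the label whose description is most similar to the text.

    No task-specific training: labels are defined only by their
    human-readable descriptions, which is the premise of ZSC.
    """
    text_vec = Counter(text.lower().split())
    return max(
        label_descriptions,
        key=lambda lab: cosine(text_vec,
                               Counter(label_descriptions[lab].lower().split())),
    )
```

Swapping the bag-of-words vectors for dense sentence embeddings (or scoring each text-description pair with a cross-encoder) recovers the stronger approaches the benchmark compares.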
To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times
arXiv:2603.12105v1 Announce Type: new Abstract: Large Language Models (LLMs) have recently been shown to produce estimates of psycholinguistic norms, such as valence, arousal, or concreteness, for words and multiword expressions, that correlate with human judgments. These estimates are obtained by...