LLM-Driven Heuristic Synthesis for Industrial Process Control: Lessons from Hot Steel Rolling
arXiv:2603.20537v1 Announce Type: new Abstract: Industrial process control demands policies that are interpretable and auditable, requirements that black-box neural policies struggle to meet. We study an LLM-driven heuristic synthesis framework for hot steel rolling, in which a language model iteratively...
FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems
arXiv:2603.20252v1 Announce Type: new Abstract: As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA...
Reasoning Traces Shape Outputs but Models Won't Say So
arXiv:2603.20620v1 Announce Type: new Abstract: Can we trust the reasoning traces that large reasoning models (LRMs) produce? We investigate whether these traces faithfully reflect what drives model outputs, and whether models will honestly report their influence. We introduce Thought Injection,...
Agentic AI and the next intelligence explosion
arXiv:2603.20639v1 Announce Type: new Abstract: The "AI singularity" is often miscast as a monolithic, godlike mind. Evolution suggests a different path: intelligence is fundamentally plural, social, and relational. Recent advances in agentic AI reveal that frontier reasoning models, such as...
Efficient Counterfactual Reasoning in ProbLog via Single World Intervention Programs
arXiv:2603.20505v1 Announce Type: new Abstract: Probabilistic Logic Programming (PLP) languages, like ProbLog, naturally support reasoning under uncertainty, while maintaining a declarative and interpretable framework. Meanwhile, counterfactual reasoning (i.e., answering ``what if'' questions) is critical for ensuring AI systems are robust...
Seed1.8 Model Card: Towards Generalized Real-World Agency
arXiv:2603.20633v1 Announce Type: new Abstract: We present Seed1.8, a foundation model aimed at generalized real-world agency: going beyond single-turn prediction to multi-turn interaction, tool use, and multi-step execution. Seed1.8 keeps strong LLM and vision-language performance while supporting a unified agentic...
KLDrive: Fine-Grained 3D Scene Reasoning for Autonomous Driving based on Knowledge Graph
arXiv:2603.21029v1 Announce Type: new Abstract: Autonomous driving requires reliable reasoning over fine-grained 3D scene facts. Fine-grained question answering over multi-modal driving observations provides a natural way to evaluate this capability, yet existing perception pipelines and driving-oriented large language model (LLM)...
Grounded Chess Reasoning in Language Models via Master Distillation
arXiv:2603.20510v1 Announce Type: new Abstract: Language models often lack grounded reasoning capabilities in specialized domains where training data is scarce but bespoke systems excel. We introduce a general framework for distilling expert system reasoning into natural language chain-of-thought explanations, enabling...
Do LLM-Driven Agents Exhibit Engagement Mechanisms? Controlled Tests of Information Load, Descriptive Norms, and Popularity Cues
arXiv:2603.20911v1 Announce Type: new Abstract: Large language models make agent-based simulation more behaviorally expressive, but they also sharpen a basic methodological tension: fluent, human-like output is not, by itself, evidence for theory. We evaluate what an LLM-driven simulation can credibly...
Enhancing Safety of Large Language Models via Embedding Space Separation
arXiv:2603.20206v1 Announce Type: new Abstract: Large language models (LLMs) have achieved impressive capabilities, yet ensuring their safety against harmful prompts remains a critical challenge. Recent work has revealed that the latent representations (embeddings) of harmful and safe queries in LLMs...
Deep reflective reasoning in interdependence constrained structured data extraction from clinical notes for digital health
arXiv:2603.20435v1 Announce Type: new Abstract: Extracting structured information from clinical notes requires navigating a dense web of interdependent variables where the value of one attribute logically constrains others. Existing Large Language Model (LLM)-based extraction pipelines often struggle to capture these...
A Framework for Low-Latency, LLM-driven Multimodal Interaction on the Pepper Robot
arXiv:2603.21013v1 Announce Type: new Abstract: Despite recent advances in integrating Large Language Models (LLMs) into social robotics, two weaknesses persist. First, existing implementations on platforms like Pepper often rely on cascaded Speech-to-Text (STT)->LLM->Text-to-Speech (TTS) pipelines, resulting in high latency and...
From 50% to Mastery in 3 Days: A Low-Resource SOP for Localizing Graduate-Level AI Tutors via Shadow-RAG
arXiv:2603.20650v1 Announce Type: new Abstract: Deploying high-fidelity AI tutors in schools is often blocked by the Resource Curse -- the need for expensive cloud GPUs and massive data engineering. In this practitioner report, we present a replicable Standard Operating Procedure...
Thinking into the Future: Latent Lookahead Training for Transformers
arXiv:2603.20219v1 Announce Type: new Abstract: Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although very scalable, this objective forces the model to commit at every step, preventing it from exploring or...
Knowledge Boundary Discovery for Large Language Models
arXiv:2603.21022v1 Announce Type: new Abstract: We propose Knowledge Boundary Discovery (KBD), a reinforcement learning based framework to explore the knowledge boundaries of the Large Language Models (LLMs). We define the knowledge boundary by automatically generating two types of questions: (i)...
Where can AI be used? Insights from a deep ontology of work activities
arXiv:2603.20619v1 Announce Type: new Abstract: Artificial intelligence (AI) is poised to profoundly reshape how work is executed and organized, but we do not yet have deep frameworks for understanding where AI can be used. Here we provide a comprehensive ontology...
AgenticGEO: A Self-Evolving Agentic System for Generative Engine Optimization
arXiv:2603.20213v1 Announce Type: new Abstract: Generative search engines represent a transition from traditional ranking-based retrieval to Large Language Model (LLM)-based synthesis, transforming optimization goals from ranking prominence towards content inclusion. Generative Engine Optimization (GEO), specifically, aims to maximize visibility and...
The AI Scientific Community: Agentic Virtual Lab Swarms
arXiv:2603.21344v1 Announce Type: new Abstract: In this short note we propose using agentic swarms of virtual labs as a model of an AI Science Community. In this paradigm, each particle in the swarm represents a complete virtual laboratory instance, enabling...
Expected Reward Prediction, with Applications to Model Routing
arXiv:2603.20217v1 Announce Type: new Abstract: Reward models are a standard tool to score responses from LLMs. Reward models are built to rank responses to a fixed prompt sampled from a single model, for example to choose the best of n...
An experimental study of KV cache reuse strategies in chunk-level caching systems
arXiv:2603.20218v1 Announce Type: new Abstract: Retrieval-augmented generation improves large language models' accuracy by adding relevant retrieved text to the prompt. Chunk level caching (CLC) accelerates inference by precomputing KV caches for these retrieved chunks and reusing them. However, these caches...
Multi-RF Fusion with Multi-GNN Blending for Molecular Property Prediction
arXiv:2603.20724v1 Announce Type: new Abstract: Multi-RF Fusion achieves a test ROC-AUC of 0.8476 +/- 0.0002 on ogbg-molhiv (10 seeds), placing #1 on the OGB leaderboard ahead of HyperFusion (0.8475 +/- 0.0003). The core of the method is a rank-averaged ensemble...
SciNav: A General Agent Framework for Scientific Coding Tasks
arXiv:2603.20256v1 Announce Type: new Abstract: Autonomous science agents built on large language models (LLMs) are increasingly used to generate hypotheses, design experiments, and produce reports. However, prior work mainly targets open-ended scientific problems with subjective outputs that are difficult to...
Coding Agents are Effective Long-Context Processors
arXiv:2603.20432v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in scaling to access massive contexts. However, the access is via the latent and uninterpretable attention mechanisms, and LLMs fail to effective process long context, exhibiting significant...
Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study
arXiv:2603.20514v1 Announce Type: new Abstract: Large Language Models (LLMs) offer significant potential for delivering health information. However, their reliability in low-resource contexts remains uncertain. This study evaluates GPT-4, Gemini Pro, Llama~3, and Mistral-7B on health crisis-related enquiries concerning COVID-19, dengue,...
Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
arXiv:2603.20562v1 Announce Type: new Abstract: Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation,...
JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMs
arXiv:2603.20581v1 Announce Type: new Abstract: Social biases reflected in language are inherently shaped by cultural norms, which vary significantly across regions and lead to diverse manifestations of stereotypes. Existing evaluations of social bias in large language models (LLMs) for non-English...