LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs
arXiv:2602.16902v1 Announce Type: new Abstract: We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a...
Narrow fine-tuning erodes safety alignment in vision-language agents
arXiv:2602.16931v1 Announce Type: new Abstract: Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces...
SourceBench: Can AI Answers Reference Quality Web Sources?
arXiv:2602.16942v1 Announce Type: new Abstract: Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across...
LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation
arXiv:2602.16953v1 Announce Type: new Abstract: Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback is often expensive and slow to obtain, making online reinforcement learning (RL) impractical. High-coverage hardware verification exemplifies this challenge due...
Automating Agent Hijacking via Structural Template Injection
arXiv:2602.16958v1 Announce Type: new Abstract: Agent hijacking, highlighted by OWASP as a critical threat to the Large Language Model (LLM) ecosystem, enables adversaries to manipulate execution by injecting malicious instructions into retrieved content. Most existing attacks rely on manually crafted,...
HQFS: Hybrid Quantum Classical Financial Security with VQC Forecasting, QUBO Annealing, and Audit-Ready Post-Quantum Signing
arXiv:2602.16976v1 Announce Type: new Abstract: Financial risk systems usually follow a two-step routine: a model predicts return or risk, and then an optimizer makes a decision such as a...
Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases
arXiv:2602.17001v1 Announce Type: new Abstract: Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to help non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such...
M2F: Automated Formalization of Mathematical Literature at Scale
arXiv:2602.17016v1 Announce Type: new Abstract: Automated formalization of mathematics enables mechanical verification but remains limited to isolated theorems and short snippets. Scaling to textbooks and research papers is largely unaddressed, as it requires managing cross-file dependencies, resolving imports, and ensuring...
Sales Research Agent and Sales Research Bench
arXiv:2602.17017v1 Announce Type: new Abstract: Enterprises increasingly need AI systems that can answer sales-leader questions over live, customized CRM data, but most available models do not expose transparent, repeatable evidence of quality. This paper describes the Sales Research Agent in...
IntentCUA: Learning Intent-level Representations for Skill Abstraction and Multi-Agent Planning in Computer-Use Agents
arXiv:2602.17049v1 Announce Type: new Abstract: Computer-use agents operate over long horizons under noisy perception, multi-window contexts, and evolving environment states. Existing approaches, from RL-based planners to trajectory retrieval, often drift from user intent and repeatedly solve routine subproblems, leading to error...
Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning
arXiv:2602.17062v1 Announce Type: new Abstract: Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often...
Predictive Batch Scheduling: Accelerating Language Model Training Through Loss-Aware Sample Prioritization
arXiv:2602.17066v1 Announce Type: new Abstract: We introduce Predictive Batch Scheduling (PBS), a novel training optimization technique that accelerates language model convergence by dynamically prioritizing high-loss samples during batch construction. Unlike curriculum learning approaches that require predefined difficulty metrics or hard...
Instructor-Aligned Knowledge Graphs for Personalized Learning
arXiv:2602.17111v1 Announce Type: new Abstract: Mastering educational concepts requires understanding both their prerequisites (e.g., recursion before merge sort) and sub-concepts (e.g., merge sort as part of sorting algorithms). Capturing these dependencies is critical for identifying students' knowledge gaps and enabling...
Epistemology of Generative AI: The Geometry of Knowing
arXiv:2602.17116v1 Announce Type: new Abstract: Generative AI presents an unprecedented challenge to our understanding of knowledge and its production. Unlike previous technological transformations, where engineering understanding preceded or accompanied deployment, generative AI operates through mechanisms whose epistemic character remains obscure,...
Continual learning and refinement of causal models through dynamic predicate invention
arXiv:2602.17217v1 Announce Type: new Abstract: Efficiently navigating complex environments requires agents to internalize the underlying logic of their world, yet standard world modelling methods often struggle with sample inefficiency, lack of transparency, and poor scalability. We propose a framework for...
From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences
arXiv:2602.17221v1 Announce Type: new Abstract: Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences. Positioned as a "methodological experiment," this study...
Decoding the Human Factor: High Fidelity Behavioral Prediction for Strategic Foresight
arXiv:2602.17222v1 Announce Type: new Abstract: Predicting human decision-making in high-stakes environments remains a central challenge for artificial intelligence. While large language models (LLMs) demonstrate strong general reasoning, they often struggle to generate consistent, individual-specific behavior, particularly when accurate prediction depends...
Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy
arXiv:2602.17229v1 Announce Type: new Abstract: The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics. This study investigates the internal neural representations of cognitive complexity using Bloom's Taxonomy as a hierarchical lens. By analyzing...
References Improve LLM Alignment in Non-Verifiable Domains
arXiv:2602.16802v1 Announce Type: new Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether...
Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
arXiv:2602.16811v1 Announce Type: new Abstract: Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite...
Claim Automation using Large Language Model
arXiv:2602.16836v1 Announce Type: new Abstract: While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed...
BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization
arXiv:2602.16843v1 Announce Type: new Abstract: Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on...
Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect
arXiv:2602.16852v1 Announce Type: new Abstract: Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out-a...
Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
arXiv:2602.17051v1 Announce Type: new Abstract: Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages. This study investigates how different approaches for cross-lingual text classification can support reliable...
BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
arXiv:2602.17072v1 Announce Type: new Abstract: Large language model (LLM)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low...
Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests
arXiv:2602.17108v1 Announce Type: new Abstract: The Thematic Apperception Test (TAT) is a psychometrically grounded, multidimensional assessment framework that systematically differentiates between cognitive-representational and affective-relational components of personality-like functioning. This test is a projective psychological framework designed to uncover unconscious aspects of...
What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform
arXiv:2602.17194v1 Announce Type: new Abstract: Text-based telemedicine has become a common mode of care, requiring clinicians to deliver medical advice clearly and effectively in writing. As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain...
Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective
arXiv:2602.17283v1 Announce Type: new Abstract: While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content. To...
Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
arXiv:2602.17316v1 Announce Type: new Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts....
RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
arXiv:2602.17366v1 Announce Type: new Abstract: Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this...