Consistency of Large Reasoning Models Under Multi-Turn Attacks
arXiv:2602.13093v2 Announce Type: new Abstract: Large reasoning models with reasoning capabilities achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning...
Optimal Take-off under Fuzzy Clearances
arXiv:2602.13166v1 Announce Type: new Abstract: This paper presents a hybrid obstacle avoidance architecture that integrates Optimal Control under clearance with a Fuzzy Rule Based System (FRBS) to enable adaptive constraint handling for unmanned aircraft. Motivated by the limitations of classical...
A Lightweight LLM Framework for Disaster Humanitarian Information Classification
arXiv:2602.12284v1 Announce Type: cross Abstract: Timely classification of humanitarian information from social media is critical for effective disaster response. However, deploying large language models (LLMs) for this task faces challenges in resource-constrained emergency settings. This paper develops a lightweight, cost-effective...
Energy-Aware Reinforcement Learning for Robotic Manipulation of Articulated Components in Infrastructure Operation and Maintenance
arXiv:2602.12288v1 Announce Type: cross Abstract: With the growth of intelligent civil infrastructure and smart cities, operation and maintenance (O&M) increasingly requires safe, efficient, and energy-conscious robotic manipulation of articulated components, including access doors, service drawers, and pipeline valves. However, existing...
OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
arXiv:2602.12304v1 Announce Type: cross Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task:...
Perceptual Self-Reflection in Agentic Physics Simulation Code Generation
arXiv:2602.12311v1 Announce Type: cross Abstract: We present a multi-agent framework for generating physics simulation code from natural language descriptions, featuring a novel perceptual self-reflection mechanism for validation. The system employs four specialized agents: a natural language interpreter that converts user...
Visible and Hyperspectral Imaging for Quality Assessment of Milk: Property Characterisation and Identification
arXiv:2602.12313v1 Announce Type: cross Abstract: Rapid and non-destructive assessment of milk quality is crucial to ensuring both nutritional value and food safety. In this study, we investigated the potential of visible and hyperspectral imaging as cost-effective and quick-response alternatives to...
Free Lunch in Medical Image Foundation Model Pre-training via Randomized Synthesis and Disentanglement
arXiv:2602.12317v1 Announce Type: cross Abstract: Medical image foundation models (MIFMs) have demonstrated remarkable potential for a wide range of clinical tasks, yet their development is constrained by the scarcity, heterogeneity, and high cost of large-scale annotated datasets. Here, we propose...
ForeAct: Steering Your VLA with Efficient Visual Foresight Planning
arXiv:2602.12322v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models convert high-level language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present Visual Foresight Planning (ForeAct), a general and efficient planner that guides a VLA...
Policy4OOD: A Knowledge-Guided World Model for Policy Intervention Simulation against the Opioid Overdose Crisis
arXiv:2602.12373v1 Announce Type: cross Abstract: The opioid epidemic remains one of the most severe public health crises in the United States, yet evaluating policy interventions before implementation is difficult: multiple policies interact within a dynamic system where targeting one risk...
Reproducing DragDiffusion: Interactive Point-Based Editing with Diffusion Models
arXiv:2602.12393v1 Announce Type: cross Abstract: DragDiffusion is a diffusion-based method for interactive point-based image editing that enables users to manipulate images by directly dragging selected points. The method claims that accurate spatial control can be achieved by optimizing a single...
Correctness, Artificial Intelligence, and the Epistemic Value of Mathematical Proof
arXiv:2602.12463v1 Announce Type: cross Abstract: We argue that it is neither necessary nor sufficient for a mathematical proof to have epistemic value that it be "correct", in the sense of formalizable in a formal proof system. We then present a...
Designing RNAs with Language Models
arXiv:2602.12470v1 Announce Type: cross Abstract: RNA design, the task of finding a sequence that folds into a target secondary structure, has broad biological and biomedical impact but remains computationally challenging due to the exponentially large sequence space and exponentially many...
propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
arXiv:2602.12414v1 Announce Type: new Abstract: Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce...
RBCorr: Response Bias Correction in Language Models
arXiv:2602.12445v1 Announce Type: new Abstract: Language models (LMs) are known to be prone to response biases, which present as option preference biases in fixed-response questions. It is therefore imperative to develop low-cost and effective response bias correction methods to improve...
Discovering Semantic Latent Structures in Psychological Scales: A Response-Free Pathway to Efficient Simplification
arXiv:2602.12575v1 Announce Type: new Abstract: Psychological scale refinement traditionally relies on response-based methods such as factor analysis, item response theory, and network psychometrics to optimize item composition. Although rigorous, these approaches require large samples and may be constrained by data...
Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR
arXiv:2602.12642v1 Announce Type: new Abstract: Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its...
ReFilter: Improving Robustness of Retrieval-Augmented Generation via Gated Filter
arXiv:2602.12709v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) has become a dominant paradigm for grounding large language models (LLMs) with external evidence in knowledge-intensive question answering. A core design choice is how to fuse retrieved samples into the LLMs, where...
Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks
arXiv:2602.12759v1 Announce Type: new Abstract: Standard evaluation in NLP typically indicates that system A is better on average than system B, but it provides little info on how to improve performance and, what is worse, it should not come as...
RAT-Bench: A Comprehensive Benchmark for Text Anonymization
arXiv:2602.12806v1 Announce Type: new Abstract: Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or Anthropic's...
Left-right asymmetry in predicting brain activity from LLMs' representations emerges with their formal linguistic competence
arXiv:2602.12811v1 Announce Type: new Abstract: When humans and large language models (LLMs) process the same text, activations in the LLMs correlate with brain activity measured, e.g., with functional magnetic resonance imaging (fMRI). Moreover, it has been shown that, as the...
MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models
arXiv:2602.12871v1 Announce Type: new Abstract: We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks largely rely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the...
When Words Don't Mean What They Say: Figurative Understanding in Bengali Idioms
arXiv:2602.12921v1 Announce Type: new Abstract: Figurative language understanding remains a significant challenge for Large Language Models (LLMs), especially for low-resource languages. To address this, we introduce a new idiom dataset, a large-scale, culturally-grounded corpus of 10,361 Bengali idioms. Each idiom...
Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models
arXiv:2602.12937v1 Announce Type: new Abstract: Being modeled as a single-label classification task for a long time, recent work has argued that Arabic Dialect Identification (ADI) should be framed as a multi-label classification task. However, ADI remains constrained by the availability...
Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models
arXiv:2602.12996v1 Announce Type: new Abstract: Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps...
Can we trust AI to detect healthy multilingual English speakers among the cognitively impaired cohort in the UK? An investigation using real-world conversational speech
arXiv:2602.13047v1 Announce Type: new Abstract: Conversational speech often reveals early signs of cognitive decline, such as dementia and MCI. In the UK, one in four people belongs to an ethnic minority, and dementia prevalence is expected to rise most rapidly...
Exploring a New Competency Modeling Process with Large Language Models
arXiv:2602.13084v1 Announce Type: new Abstract: Competency modeling is widely used in human resource management to select, develop, and evaluate talent. However, traditional expert-driven approaches rely heavily on manual analysis of large volumes of interview transcripts, making them costly and prone...
SCOPE: Selective Conformal Optimized Pairwise LLM Judging
arXiv:2602.13110v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective...
OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report
arXiv:2602.13139v1 Announce Type: new Abstract: Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural...
Beyond Musical Descriptors: Extracting Preference-Bearing Intent in Music Queries
arXiv:2602.12301v1 Announce Type: cross Abstract: Although annotated music descriptor datasets for user queries are increasingly common, few consider the user's intent behind these descriptors, which is essential for effectively meeting their needs. We introduce MusicRecoIntent, a manually annotated corpus of...