Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction
arXiv:2602.17106v1 Announce Type: new Abstract: Sustainability or ESG rating agencies use company disclosures and external data to produce scores or ratings that assess the environmental, social, and governance performance of a company. However, sustainability ratings across agencies for a single...
From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences
arXiv:2602.17221v1 Announce Type: new Abstract: Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences. Positioned as a "methodological experiment," this study...
Decoding the Human Factor: High Fidelity Behavioral Prediction for Strategic Foresight
arXiv:2602.17222v1 Announce Type: new Abstract: Predicting human decision-making in high-stakes environments remains a central challenge for artificial intelligence. While large language models (LLMs) demonstrate strong general reasoning, they often struggle to generate consistent, individual-specific behavior, particularly when accurate prediction depends...
Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
arXiv:2602.16811v1 Announce Type: new Abstract: Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite...
One-step Language Modeling via Continuous Denoising
arXiv:2602.16813v1 Announce Type: new Abstract: Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime,...
Claim Automation using Large Language Model
arXiv:2602.16836v1 Announce Type: new Abstract: While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed...
ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
arXiv:2602.16938v1 Announce Type: new Abstract: The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real...
ReIn: Conversational Error Recovery with Reasoning Inception
arXiv:2602.17022v1 Announce Type: new Abstract: Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses...
What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform
arXiv:2602.17194v1 Announce Type: new Abstract: Text-based telemedicine has become a common mode of care, requiring clinicians to deliver medical advice clearly and effectively in writing. As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain...
Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
arXiv:2602.17316v1 Announce Type: new Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts....
Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian
arXiv:2602.17475v1 Announce Type: new Abstract: Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around...
Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics
arXiv:2602.17513v1 Announce Type: new Abstract: Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation...
Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
arXiv:2602.17546v1 Announce Type: new Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between...
Intent Laundering: AI Safety Datasets Are Not What They Seem
arXiv:2602.16729v1 Announce Type: cross Abstract: We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world attacks based on three key properties:...
PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency
arXiv:2602.16745v1 Announce Type: new Abstract: Test-time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample-efficient test-time self-consistency under a limited budget remains an open challenge. We introduce PETS (Principled and Efficient Test-Time Self-Consistency), which initiates a principled...
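Plain self-consistency, the baseline that PETS aims to make sample-efficient, aggregates stochastically sampled reasoning trajectories by majority vote over their final answers. A minimal sketch with a toy black-box sampler (the `noisy_model` function and its answer distribution are illustrative, not from the paper, and PETS's budget-allocation scheme is not reproduced here):

```python
from collections import Counter
import random

def self_consistency(sample_answer, budget, rng):
    """Aggregate stochastically sampled answers by majority vote.

    sample_answer: black-box sampler returning one final answer per call
    budget: number of reasoning trajectories to spend on this question
    """
    votes = Counter(sample_answer(rng) for _ in range(budget))
    answer, count = votes.most_common(1)[0]
    return answer, count / budget  # answer plus empirical agreement rate

# Toy sampler: the "model" answers 42 with probability 0.7, otherwise noise.
def noisy_model(rng):
    return 42 if rng.random() < 0.7 else rng.randint(0, 9)

ans, agreement = self_consistency(noisy_model, budget=32, rng=random.Random(0))
```

The agreement rate is one natural signal a budget-allocation method can exploit: questions whose votes concentrate early need fewer further samples.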
TopoFlow: Physics-guided Neural Networks for high-resolution air quality prediction
arXiv:2602.16821v1 Announce Type: new Abstract: We propose TopoFlow (Topography-aware pollutant Flow learning), a physics-guided neural network for efficient, high-resolution air quality prediction. To explicitly embed physical processes into the learning framework, we identify two critical factors governing pollutant dynamics: topography...
What is the Value of Censored Data? An Exact Analysis for the Data-driven Newsvendor
arXiv:2602.16842v1 Announce Type: new Abstract: We study the offline data-driven newsvendor problem with censored demand data. In contrast to prior works where demand is fully observed, we consider the setting where demand is censored at the inventory level and only...
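For context, the fully-observed baseline this abstract contrasts with is the sample-average-approximation (SAA) newsvendor, which orders at the empirical quantile of demand at the critical fractile cu/(cu+co). Censoring sales at the inventory level hides the upper demand tail and biases that quantile downward. A toy illustration (the costs and demand distribution are made up, and this is not the paper's analysis):

```python
import math
import random

def saa_newsvendor(demands, underage_cost, overage_cost):
    """Sample-average-approximation order quantity.

    With fully observed demand, the optimal order is the empirical
    quantile at the critical fractile cu / (cu + co).
    """
    fractile = underage_cost / (underage_cost + overage_cost)
    s = sorted(demands)
    k = max(0, math.ceil(fractile * len(s)) - 1)  # smallest k with (k+1)/n >= fractile
    return s[k]

rng = random.Random(0)
demand = [rng.randint(50, 150) for _ in range(1000)]
q = saa_newsvendor(demand, underage_cost=3, overage_cost=1)  # 75% fractile

# Censoring at inventory level 100 truncates the upper tail of the sample:
censored = [min(d, 100) for d in demand]
q_censored = saa_newsvendor(censored, 3, 1)  # biased low relative to q
```

Here the censored quantile can never exceed the historical inventory level, which is exactly the bias a censoring-aware method must correct for.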
Malliavin Calculus as Stochastic Backpropagation
arXiv:2602.17013v1 Announce Type: new Abstract: We establish a rigorous connection between pathwise (reparameterization) and score-function (Malliavin) gradient estimators by showing that both arise from the Malliavin integration-by-parts identity. Building on this equivalence, we introduce a unified and variance-aware hybrid estimator...
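The two estimator families this abstract unifies can be illustrated on a toy objective. Below, the pathwise (reparameterization) and score-function estimators both target d/dμ E[f(x)] for x ~ N(μ, σ²) with f(x) = x², whose true gradient is 2μ. The setup is purely illustrative and is not the paper's variance-aware hybrid estimator:

```python
import random

mu, sigma, n = 1.5, 1.0, 200_000
f = lambda x: x * x  # E[f] = mu^2 + sigma^2, so d/dmu E[f] = 2*mu

# Pathwise (reparameterization): write x = mu + sigma*eps and
# differentiate through the sample: d f(mu + sigma*eps)/d mu = 2*x.
rng = random.Random(0)
pathwise = sum(2 * (mu + sigma * rng.gauss(0, 1)) for _ in range(n)) / n

# Score-function (REINFORCE): E[f(x) * d log p(x; mu)/d mu],
# where d log N(x; mu, sigma^2)/d mu = (x - mu) / sigma^2.
rng = random.Random(0)  # same samples, for a fair comparison
score = 0.0
for _ in range(n):
    x = mu + sigma * rng.gauss(0, 1)
    score += f(x) * (x - mu) / sigma**2
score /= n

# Both estimate the same gradient 2*mu = 3.0; on this smooth f the
# pathwise estimator has much lower variance.
```

Integration by parts is what makes the two expectations equal, which is the identity the paper traces back to Malliavin calculus.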
Transforming Behavioral Neuroscience Discovery with In-Context Learning and AI-Enhanced Tensor Methods
arXiv:2602.17027v1 Announce Type: new Abstract: Scientific discovery pipelines typically involve complex, rigid, and time-consuming processes, from data preparation to analyzing and interpreting findings. Recent advances in AI have the potential to transform such pipelines in a way that domain experts...
Spatio-temporal dual-stage hypergraph MARL for human-centric multimodal corridor traffic signal control
arXiv:2602.17068v1 Announce Type: new Abstract: Human-centric traffic signal control in corridor networks must increasingly account for multimodal travelers, particularly high-occupancy public transportation, rather than focusing solely on vehicle-centric performance. This paper proposes STDSH-MARL (Spatio-Temporal Dual-Stage Hypergraph based Multi-Agent Reinforcement Learning),...
AdvSynGNN: Structure-Adaptive Graph Neural Nets via Adversarial Synthesis and Self-Corrective Propagation
arXiv:2602.17071v1 Announce Type: new Abstract: Graph neural networks frequently encounter significant performance degradation when confronted with structural noise or non-homophilous topologies. To address these systemic vulnerabilities, we present AdvSynGNN, a comprehensive architecture designed for resilient node-level representation learning. The proposed...
Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum
arXiv:2602.17080v1 Announce Type: new Abstract: Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability, Muon utilizes the weight...
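Muon's distinctive step is orthogonalizing the momentum matrix, i.e. replacing it by an approximation of the U Vᵀ factor of its SVD, typically via a Newton-Schulz iteration. A simplified sketch of that step (production Muon uses a tuned quintic polynomial and per-layer scaling; the plain cubic iteration here is illustrative only):

```python
import numpy as np

def newton_schulz_orthogonalize(m, steps=30):
    """Approximate U @ V.T from the SVD of m via Newton-Schulz iteration.

    This maps every singular value of m toward 1 while preserving the
    singular vectors -- the orthogonalization Muon applies to momentum.
    """
    x = m / (np.linalg.norm(m) + 1e-12)  # scale so all singular values <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # cubic Newton-Schulz step
    return x

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 3))              # stand-in for a momentum matrix
o = newton_schulz_orthogonalize(g)
# o now has (near-)orthonormal columns: o.T @ o is close to the identity
```

The iteration only needs matrix multiplies, which is why it is preferred over an explicit SVD on accelerators.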
A breakdown of the court’s tariff decision
Empirical SCOTUS is a recurring series by Adam Feldman that looks at Supreme Court data, primarily in the form of opinions and oral arguments, to provide insights into the justices' decision making and […]
Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis
arXiv:2602.15909v1 Announce Type: cross Abstract: Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class...
From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants
arXiv:2602.15859v1 Announce Type: new Abstract: Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off - particularly in domains that depend heavily on real-time information. This...
VDLM: Variable Diffusion LMs via Robust Latent-to-Text Rendering
arXiv:2602.15870v1 Announce Type: new Abstract: Autoregressive language models decode left-to-right with irreversible commitments, limiting revision during multi-step reasoning. We propose VDLM, a modular variable diffusion language model that separates semantic planning from text rendering. VDLM applies LLaDA-style masked diffusion over...
P-RAG: Prompt-Enhanced Parametric RAG with LoRA and Selective CoT for Biomedical and Multi-Hop QA
arXiv:2602.15874v1 Announce Type: new Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities but remain limited by their reliance on static training data. Retrieval-Augmented Generation (RAG) addresses this constraint by retrieving external knowledge during inference, though it still depends heavily on...
Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities
arXiv:2602.16093v1 Announce Type: new Abstract: Post-training endows pretrained LLMs with a variety of desirable skills, including instruction-following, reasoning, and others. However, these post-trained LLMs only encode knowledge up to a cut-off date, necessitating continual adaptation. Unfortunately, existing solutions cannot simultaneously...
Beyond Learning: A Training-Free Alternative to Model Adaptation
arXiv:2602.16189v1 Announce Type: new Abstract: Despite the continuous research and evolution of language models, they sometimes underperform previous versions. Existing approaches to overcome these challenges are resource-intensive, highlighting the need for alternatives that enable immediate action. We assume that each...