Auto Researching, not hyperparameter tuning: Convergence Analysis of 10,000 Experiments
arXiv:2603.15916v1 Announce Type: new Abstract: When LLM agents autonomously design ML experiments, do they perform genuine architecture search -- or do they default to hyperparameter tuning within a narrow region of the design space? We answer this question by analyzing...
Evaluating Causal Discovery Algorithms for Path-Specific Fairness and Utility in Healthcare
arXiv:2603.15926v1 Announce Type: new Abstract: Causal discovery in health data faces evaluation challenges when ground truth is unknown. We address this by collaborating with experts to construct proxy ground-truth graphs, establishing benchmarks for synthetic Alzheimer's disease and heart failure clinical...
Discovery of interaction and diffusion kernels in particle-to-mean-field multi-agent systems
arXiv:2603.15927v1 Announce Type: new Abstract: We propose a data-driven framework to learn interaction kernels in stochastic multi-agent systems. Our approach aims at identifying the functional form of nonlocal interaction and diffusion terms directly from trajectory data, without any a priori...
Data-Local Autonomous LLM-Guided Neural Architecture Search for Multiclass Multimodal Time-Series Classification
arXiv:2603.15939v1 Announce Type: new Abstract: Applying machine learning to sensitive time-series data is often bottlenecked by the iteration loop: Performance depends strongly on preprocessing and architecture, yet training often has to run on-premise under strict data-local constraints. This is a...
Trump's plan to shut down weather and climate center triggers lawsuit
Suit: The National Center for Atmospheric Research is to be terminated for no rational reason.
Formulating Public Pharma
In 2022, prices for both brand-name and generic drugs in the United States were nearly three times as high as prices in comparably industrialized nations, with the cost of insulin products in particular being nearly ten times as high. As...
Optimizing LLM Annotation of Classroom Discourse through Multi-Agent Orchestration
arXiv:2603.13353v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly positioned as scalable tools for annotating educational data, including classroom discourse, interaction logs, and qualitative learning artifacts. Their ability to rapidly summarize instructional interactions and assign rubric-aligned labels has...
Prompt Complexity Dilutes Structured Reasoning: A Follow-Up Study on the Car Wash Problem
arXiv:2603.13351v1 Announce Type: new Abstract: In a previous study [Jo, 2026], STAR reasoning (Situation, Task, Action, Result) raised car wash problem accuracy from 0% to 85% on Claude Sonnet 4.5, and to 100% with additional prompt layers. This follow-up asks:...
Training-Free Agentic AI: Probabilistic Control and Coordination in Multi-Agent LLM Systems
arXiv:2603.13256v1 Announce Type: new Abstract: Multi-agent large language model (LLM) systems enable complex, long-horizon reasoning by composing specialized agents, but practical deployment remains hindered by inefficient routing, noisy feedback, and high interaction cost. We introduce REDEREF, a lightweight and training-free...
Early Rug Pull Warning for BSC Meme Tokens via Multi-Granularity Wash-Trading Pattern Profiling
arXiv:2603.13830v1 Announce Type: new Abstract: The high-frequency issuance and short-cycle speculation of meme tokens in decentralized finance (DeFi) have significantly amplified rug-pull risk. Existing approaches still struggle to provide stable early warning under scarce anomalies, incomplete labels, and limited interpretability....
How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing
arXiv:2603.13259v1 Announce Type: new Abstract: When a language model is fed a wrong answer, what happens inside the network? Current understanding treats truthfulness as a static property of individual-layer representations-a direction to be probed, a feature to be extracted. Less...
Multi-Axis Trust Modeling for Interpretable Account Hijacking Detection
arXiv:2603.13246v1 Announce Type: new Abstract: This paper proposes a Hadith-inspired multi-axis trust modeling framework, motivated by a structurally analogous problem in classical Hadith scholarship: assessing the trustworthiness of information sources using interpretable, multidimensional criteria rather than a single anomaly score....
Multimodal Emotion Regression with Multi-Objective Optimization and VAD-Aware Audio Modeling for the 10th ABAW EMI Track
arXiv:2603.13760v1 Announce Type: new Abstract: We participated in the 10th ABAW Challenge, focusing on the Emotional Mimicry Intensity (EMI) Estimation track on the Hume-Vidmimic2 dataset. This task aims to predict six continuous emotion dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement,...
Multi-hop Reasoning and Retrieval in Embedding Space: Leveraging Large Language Models with Knowledge
arXiv:2603.13266v1 Announce Type: new Abstract: As large language models (LLMs) continue to grow in size, their abilities to tackle complex tasks have significantly improved. However, issues such as hallucination and the lack of up-to-date knowledge largely remain unresolved. Knowledge graphs...
Slang Context-based Inference Enhancement via Greedy Search-Guided Chain-of-Thought Prompting
arXiv:2603.13230v1 Announce Type: new Abstract: Slang interpretation has been a challenging downstream task for Large Language Models (LLMs) as the expressions are inherently embedded in contextual, cultural, and linguistic frameworks. In the absence of domain-specific training data, it is difficult...
EviAgent: Evidence-Driven Agent for Radiology Report Generation
arXiv:2603.13956v1 Announce Type: new Abstract: Automated radiology report generation holds immense potential to alleviate the heavy workload of radiologists. Despite the formidable vision-language capabilities of recent Multimodal Large Language Models (MLLMs), their clinical deployment is severely constrained by inherent limitations:...
QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models
arXiv:2603.13691v1 Announce Type: new Abstract: While Large Language Models (LLMs) excel on standardized medical exams, high scores often fail to translate to high-quality responses for real-world medical queries. Current evaluations rely heavily on multiple-choice questions, failing to capture the unstructured,...
Human Attribution of Causality to AI Across Agency, Misuse, and Misalignment
arXiv:2603.13236v1 Announce Type: new Abstract: AI-related incidents are becoming increasingly frequent and severe, ranging from safety failures to misuse by malicious actors. In such complex situations, identifying which elements caused an adverse outcome, the problem of cause selection, is a...
Intelligent Materials Modelling: Large Language Models Versus Partial Least Squares Regression for Predicting Polysulfone Membrane Mechanical Performance
arXiv:2603.13834v1 Announce Type: new Abstract: Predicting the mechanical properties of polysulfone (PSF) membranes from structural descriptors remains challenging due to extreme data scarcity typical of experimental studies. To investigate this issue, this study benchmarked knowledge-driven inference using four large language...
Deep Convolutional Architectures for EEG Classification: A Comparative Study with Temporal Augmentation and Confidence-Based Voting
arXiv:2603.13261v1 Announce Type: new Abstract: Electroencephalography (EEG) classification plays a key role in brain-computer interface (BCI) systems, yet it remains challenging due to the low signal-to-noise ratio, temporal variability of neural responses, and limited data availability. In this paper, we...
Projection-Free Evolution Strategies for Continuous Prompt Search
arXiv:2603.13786v1 Announce Type: new Abstract: Continuous prompt search offers a computationally efficient alternative to conventional parameter tuning in natural language processing tasks. Nevertheless, its practical effectiveness can be significantly hindered by the black-box nature and the inherent high-dimensionality of the...
FLUX: Data Worth Training On
arXiv:2603.13972v1 Announce Type: new Abstract: Modern large language model training is no longer limited by data availability, but by the inability of existing preprocessing pipelines to simultaneously achieve massive scale and high data quality. Current approaches are forced to sacrifice...
CMHL: Contrastive Multi-Head Learning for Emotionally Consistent Text Classification
arXiv:2603.14078v1 Announce Type: new Abstract: Textual Emotion Classification (TEC) is one of the most difficult NLP tasks. State of the art approaches rely on Large language models (LLMs) and multi-model ensembles. In this study, we challenge the assumption that larger...
Selective Fine-Tuning of GPT Architectures for Parameter-Efficient Clinical Text Classification
arXiv:2603.14183v1 Announce Type: new Abstract: The rapid expansion of electronic health record (EHR) systems has generated large volumes of unstructured clinical narratives that contain valuable information for disease identification, patient cohort discovery, and clinical decision support. Extracting structured knowledge from...
Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective
arXiv:2603.14217v1 Announce Type: new Abstract: In cognitive science and linguistic theory, dialogue is not seen as a chain of independent utterances but rather as a joint activity sustained by coherence, consistency, and shared understanding. However, many systems for open-domain and...
Automatic Inter-document Multi-hop Scientific QA Generation
arXiv:2603.14257v1 Announce Type: new Abstract: Existing automatic scientific question generation studies mainly focus on single-document factoid QA, overlooking the inter-document reasoning crucial for scientific understanding. We present AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. AIM-SciQA extracts...
MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering
arXiv:2603.14265v1 Announce Type: new Abstract: Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where...
Motivation in Large Language Models
arXiv:2603.14347v1 Announce Type: new Abstract: Motivation is a central driver of human behavior, shaping decisions, goals, and task performance. As large language models (LLMs) become increasingly aligned with human preferences, we ask whether they exhibit something akin to motivation. We...
PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark
arXiv:2603.14456v1 Announce Type: new Abstract: Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for...
Knowledge, Rules and Their Embeddings: Two Paths towards Neuro-Symbolic JEPA
arXiv:2603.13265v1 Announce Type: new Abstract: Modern self-supervised predictive architectures excel at capturing complex statistical correlations from high-dimensional data but lack mechanisms to internalize verifiable human logic, leaving them susceptible to spurious correlations and shortcut learning. Conversely, traditional rule-based inference systems...