L-PRISMA: An Extension of PRISMA in the Era of Generative Artificial Intelligence (GenAI)
arXiv:2603.19236v1 Announce Type: cross Abstract: The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework provides a rigorous foundation for evidence synthesis, yet the manual processes of data extraction and literature screening remain time-consuming and restrictive. Recent advances in...
Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis
arXiv:2603.19282v1 Announce Type: cross Abstract: In many real-world applications, large language models (LLMs) operate as independent agents without interaction, thereby limiting coordination. In this setting, we examine how prompt framing influences decisions in a threshold voting task involving individual-group interest...
A Visualization for Comparative Analysis of Regression Models
arXiv:2603.19291v1 Announce Type: cross Abstract: As regression is a widely studied problem, many methods have been proposed to solve it, each of them often requiring setting different hyper-parameters. Therefore, selecting the proper method for a given application may be very...
Spelling Correction in Healthcare Query-Answer Systems: Methods, Retrieval Impact, and Empirical Evaluation
arXiv:2603.19249v1 Announce Type: new Abstract: Healthcare question-answering (QA) systems face a persistent challenge: users submit queries with spelling errors at rates substantially higher than those found in the professional documents they search. This paper presents the first controlled study of...
From Comprehension to Reasoning: A Hierarchical Benchmark for Automated Financial Research Reporting
arXiv:2603.19254v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to generate financial research reports, shifting from auxiliary analytic tools to primary content producers. Yet recent real-world deployments reveal persistent failures--factual errors, numerical inconsistencies, fabricated references, and shallow...
From Tokens To Agents: A Researcher's Guide To Understanding Large Language Models
arXiv:2603.19269v1 Announce Type: new Abstract: Researchers face a critical choice: how to use -- or not use -- large language models in their work. Using them well requires understanding the mechanisms that shape what LLMs can and cannot do. This...
Automated Motif Indexing on the Arabian Nights
arXiv:2603.19283v1 Announce Type: new Abstract: Motifs are non-commonplace, recurring narrative elements, often found originally in folk stories. In addition to being of interest to folklorists, motifs appear as metaphoric devices in modern news, literature, propaganda, and other cultural texts. Finding...
Memory-Driven Role-Playing: Evaluation and Enhancement of Persona Knowledge Utilization in LLMs
arXiv:2603.19313v1 Announce Type: new Abstract: A core challenge for faithful LLM role-playing is sustaining consistent characterization throughout long, open-ended dialogues, as models frequently fail to recall and accurately apply their designated persona knowledge without explicit cues. To tackle this, we...
Prompt-tuning with Attribute Guidance for Low-resource Entity Matching
arXiv:2603.19321v1 Announce Type: new Abstract: Entity Matching (EM) is an important task that determines the logical relationship between two entities, such as Same, Different, or Undecidable. Traditional EM approaches rely heavily on supervised learning, which requires large amounts of high-quality...
Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
arXiv:2603.19426v1 Announce Type: new Abstract: Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, it is unclear whether probe-based signals reflect...
EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models
arXiv:2603.19532v1 Announce Type: new Abstract: Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by...
BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection
arXiv:2603.19635v1 Announce Type: new Abstract: The exponential expansion of context windows in LLMs has unlocked capabilities for long-document understanding but introduced severe bottlenecks in inference latency and information utilization. Existing compression methods often suffer from high training costs or semantic...
Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression
arXiv:2603.18426v1 Announce Type: new Abstract: What happens when multiple compression methods are combined: does the order in which they are applied matter? Joint model compression has emerged as a powerful strategy to achieve higher efficiency by combining multiple methods such as...
TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots
arXiv:2603.18008v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods--fluency metrics, preference tests, and generic dialogue benchmarks--fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that...
Large-Scale Analysis of Political Propaganda on Moltbook
arXiv:2603.18349v1 Announce Type: new Abstract: We present an NLP-based study of political propaganda on Moltbook, a Reddit-style platform for AI agents. To enable large-scale analysis, we develop LLM-based classifiers to detect political propaganda, validated against expert annotation (Cohen's $\kappa$ = 0.64-0.74)....
How Confident Is the First Token? An Uncertainty-Calibrated Prompt Optimization Framework for Large Language Model Classification and Understanding
arXiv:2603.18009v1 Announce Type: new Abstract: With the widespread adoption of large language models (LLMs) in natural language processing, prompt engineering and retrieval-augmented generation (RAG) have become mainstream to enhance LLMs' performance on complex tasks. However, LLMs generate outputs autoregressively, leading...
Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction
arXiv:2603.18085v1 Announce Type: new Abstract: Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As LLMs serve as sources of guidance, emotional support, and even informal therapy,...
BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity
arXiv:2603.18019v1 Announce Type: new Abstract: Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks...
MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning
arXiv:2603.18577v1 Announce Type: new Abstract: Text-guided image editors can now manipulate authentic medical scans with high fidelity, enabling lesion implantation/removal that threatens clinical trust and safety. Existing defenses are inadequate for healthcare. Medical detectors are largely black-box, while MLLM-based explainers...
Controllable Evidence Selection in Retrieval-Augmented Question Answering via Deterministic Utility Gating
arXiv:2603.18011v1 Announce Type: new Abstract: Many modern AI question-answering systems convert text into vectors and retrieve the closest matches to a user question. While effective for topical similarity, similarity scores alone do not explain why some retrieved text can serve...
D-Mem: A Dual-Process Memory System for LLM Agents
arXiv:2603.18631v1 Announce Type: new Abstract: Driven by the development of persistent, self-adapting autonomous agents, equipping these systems with high-fidelity memory access for long-horizon reasoning has emerged as a critical requirement. However, prevalent retrieval-based memory frameworks often follow an incremental processing...
From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory
arXiv:2603.18420v1 Announce Type: new Abstract: Embedding models group text by semantic content, what text is about. We show that temporal co-occurrence within texts discovers a different kind of structure: recurrent transition-structure concepts or what text does. We train a 29.4M-parameter...
Learned but Not Expressed: Capability-Expression Dissociation in Large Language Models
arXiv:2603.18013v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate the capacity to reconstruct and trace learned content from their training data under specific elicitation conditions, yet this capability does not manifest in standard generation contexts. This empirical observational study...
UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference
arXiv:2603.18446v1 Announce Type: new Abstract: Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed...
DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units
arXiv:2603.18612v1 Announce Type: new Abstract: We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10...
Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo
arXiv:2603.18873v1 Announce Type: new Abstract: Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for their users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking directions, with limited...
Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval
arXiv:2603.19008v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to...
Towards Differentiating Between Failures and Domain Shifts in Industrial Data Streams
arXiv:2603.18032v1 Announce Type: new Abstract: Anomaly and failure detection methods are crucial in identifying deviations from normal system operational conditions, which allows for actions to be taken in advance, usually preventing more serious damage. Long-lasting deviations indicate failures, while sudden,...
Quotient Geometry and Persistence-Stable Metrics for Swarm Configurations
arXiv:2603.18041v1 Announce Type: new Abstract: Swarm and constellation reconfiguration can be viewed as motion of an unordered point configuration in an ambient space. Here, we provide persistence-stable, symmetry-invariant geometric representations for comparing and monitoring multi-agent configuration data. We introduce a...
AGRI-Fidelity: Evaluating the Reliability of Listenable Explanations for Poultry Disease Detection
arXiv:2603.18247v1 Announce Type: new Abstract: Existing XAI metrics measure faithfulness for a single model, ignoring model multiplicity where near-optimal classifiers rely on different or spurious acoustic cues. In noisy farm environments, stationary artifacts such as ventilation noise can produce explanations...