Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
arXiv:2602.17981v1 Announce Type: new Abstract: Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high-stakes settings. We study a frequent failure mode...
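Retrieval failures of the kind this abstract studies are often surfaced with a recall@k diagnostic over gold evidence chunks. A minimal sketch of that diagnostic, with illustrative chunk ids that are not from the paper:

```python
# Hypothetical diagnostic for retrieval failures: check whether the gold
# evidence chunk for each question survives into the top-k retrieved chunks.
# Chunk ids and data below are illustrative, not from the paper.

def recall_at_k(retrieved, gold, k=5):
    """Fraction of questions whose gold chunk id appears in the top-k list."""
    hits = sum(1 for ids, g in zip(retrieved, gold) if g in ids[:k])
    return hits / len(gold)

retrieved = [["c3", "c7", "c1"], ["c2", "c9", "c4"], ["c8", "c5", "c6"]]
gold = ["c7", "c4", "c2"]
score = recall_at_k(retrieved, gold, k=2)  # only the first question hits
```

Comparing recall at several values of k separates ranking errors (gold chunk retrieved but ranked too low) from outright retrieval misses.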
Towards More Standardized AI Evaluation: From Models to Agents
arXiv:2602.18029v1 Announce Type: new Abstract: Evaluation is no longer a final checkpoint in the machine learning lifecycle. As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function. The question is no longer "How...
Perceived Political Bias in LLMs Reduces Persuasive Abilities
arXiv:2602.18092v1 Announce Type: new Abstract: Conversational AI has been proposed as a scalable way to correct public misconceptions and counter the spread of misinformation. Yet its effectiveness may depend on perceptions of its political neutrality. As LLMs enter partisan conflict, elites increasingly portray...
Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models
arXiv:2602.18171v1 Announce Type: new Abstract: Clickbait headlines degrade the quality of online information and undermine user trust. We present a hybrid approach to clickbait detection that combines transformer-based text embeddings with linguistically motivated informativeness features. Using natural language processing techniques,...
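The hybrid idea the abstract names, combining dense embeddings with hand-crafted informativeness features, can be sketched as feature concatenation followed by a logistic score. The features and weights below are made up for illustration, not the paper's pipeline:

```python
import math

# Illustrative sketch (not the paper's exact system): concatenate a dense
# headline embedding with hand-crafted informativeness features, then score
# with a logistic model. Feature choices and weights are hypothetical.

def informativeness_features(headline):
    words = headline.split()
    return [
        len(words),                                        # headline length
        sum(w[0].isupper() for w in words) / len(words),   # capitalization ratio
        float("?" in headline or "!" in headline),         # sensational punctuation
        float(any(w.lower() in {"you", "this"} for w in words)),  # deixis cue
    ]

def clickbait_score(embedding, headline, weights, bias=0.0):
    x = list(embedding) + informativeness_features(headline)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # probability-like score in (0, 1)

# Toy weights: two embedding dims + four informativeness features.
weights = [0.0, 0.0, 0.0, 0.0, 2.0, 2.0]
bait = clickbait_score([0.0, 0.0], "You Won't Believe This!", weights)
plain = clickbait_score([0.0, 0.0], "Fed raises interest rates", weights)
```

In a real system the weights would be learned and the embedding would come from a transformer encoder; the point of the sketch is only how the two feature families combine.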
Improving Sampling for Masked Diffusion Models via Information Gain
arXiv:2602.18176v1 Announce Type: new Abstract: Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty to...
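The greedy "highest local certainty" heuristic the abstract describes can be shown in a few lines: at each step, unmask the position whose most likely token has the highest probability. The distributions below are toy numbers, not model outputs:

```python
# One step of a certainty-greedy sampler for masked decoding (a sketch of
# the baseline heuristic the abstract critiques, with toy distributions).

def greedy_unmask(probs, masked):
    """Return (position, token) for one certainty-greedy decoding step."""
    pos = max(masked, key=lambda i: max(probs[i]))
    token = max(range(len(probs[pos])), key=lambda t: probs[pos][t])
    return pos, token

# Three positions, vocabulary of three tokens; positions 0 and 2 are masked.
probs = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1], [0.05, 0.9, 0.05]]
step = greedy_unmask(probs, masked=[0, 2])  # position 2 is most certain
```

The abstract's point is that this local rule can be myopic; an information-gain criterion would instead score how much committing a position reduces uncertainty elsewhere.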
Information-Theoretic Storage Cost in Sentence Comprehension
arXiv:2602.18217v1 Announce Type: new Abstract: Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have...
Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning
arXiv:2602.18326v1 Announce Type: new Abstract: We describe a modern deep learning system that automatically identifies informative contextual examples ("contexts") for first-language vocabulary instruction for high school students. Our paper compares three modeling approaches: (i) an unsupervised similarity-based strategy using...
SPQ: An Ensemble Technique for Large Language Model Compression
arXiv:2602.18420v1 Announce Type: new Abstract: This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency:...
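Two of the three SPQ stages the abstract names can be sketched on a toy weight matrix: magnitude pruning followed by symmetric linear quantization. The SVD stage is omitted here for brevity, and all thresholds are illustrative; this is not the paper's implementation:

```python
# Hedged sketch of pruning + linear quantization on a toy 2x2 weight matrix.
# Thresholds and bit widths are illustrative, not the paper's settings.

def prune(weights, keep_ratio=0.5):
    """Zero out the smallest-magnitude weights, keeping keep_ratio of them."""
    flat = sorted((abs(w) for row in weights for w in row), reverse=True)
    k = max(1, int(len(flat) * keep_ratio))
    threshold = flat[k - 1]
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]

def quantize(weights, bits=8):
    """Symmetric linear quantization to signed integers of the given width."""
    scale = max(abs(w) for row in weights for w in row) / (2 ** (bits - 1) - 1)
    q = [[round(w / scale) for w in row] for row in weights]
    return q, scale

W = [[0.9, -0.1], [0.05, -1.2]]
pruned = prune(W, keep_ratio=0.5)
q, scale = quantize(pruned, bits=8)  # dequantize with q[i][j] * scale
```

Each stage attacks a different redundancy (rank, sparsity, precision), which is why the abstract frames the combination as an ensemble.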
LATMiX: Learnable Affine Transformations for Microscaling Quantization of LLMs
arXiv:2602.17681v1 Announce Type: cross Abstract: Post-training quantization (PTQ) is a widely used approach for reducing the memory and compute costs of large language models (LLMs). Recent studies have shown that applying invertible transformations to activations can significantly improve quantization robustness...
VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
arXiv:2602.18307v1 Announce Type: cross Abstract: Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are...
Topic Modeling with Fine-tuning LLMs and Bag of Sentences
arXiv:2408.03099v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out-of-the-box despite the fact that fine-tuning is known to...
Probabilistic NDVI Forecasting from Sparse Satellite Time Series and Weather Covariates
arXiv:2602.17683v1 Announce Type: new Abstract: Accurate short-term forecasting of vegetation dynamics is a key enabler for data-driven decision support in precision agriculture. Normalized Difference Vegetation Index (NDVI) forecasting from satellite observations, however, remains challenging due to sparse and irregular sampling...
Optimal Multi-Debris Mission Planning in LEO: A Deep Reinforcement Learning Approach with Co-Elliptic Transfers and Refueling
arXiv:2602.17685v1 Announce Type: new Abstract: This paper addresses the challenge of multi-target active debris removal (ADR) in Low Earth Orbit (LEO) by introducing a unified co-elliptic maneuver framework that combines Hohmann transfers, safety ellipse proximity operations, and explicit refueling...
Calibrated Adaptation: Bayesian Stiefel Manifold Priors for Reliable Parameter-Efficient Fine-Tuning
arXiv:2602.17809v1 Announce Type: new Abstract: Parameter-efficient fine-tuning methods such as LoRA enable practical adaptation of large language models but provide no principled uncertainty estimates, leading to poorly calibrated predictions and unreliable behavior under domain shift. We introduce Stiefel-Bayes Adapters (SBA),...
Avoid What You Know: Divergent Trajectory Balance for GFlowNets
arXiv:2602.17827v1 Announce Type: new Abstract: Generative Flow Networks (GFlowNets) are a flexible family of amortized samplers trained to generate discrete and compositional objects with probability proportional to a reward function. However, learning efficiency is constrained by the model's ability to...
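GFlowNet training commonly builds on the trajectory balance (TB) objective, which the title's "Divergent Trajectory Balance" presumably modifies. A toy computation of the standard TB residual, with made-up probabilities rather than a trained model:

```python
import math

# Standard trajectory balance residual for one sampled trajectory:
# (log Z + sum log P_F - log R(x) - sum log P_B)^2. Numbers are illustrative.

def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    """Squared TB residual; zero when flows are perfectly balanced."""
    residual = log_Z + sum(log_pf) - log_reward - sum(log_pb)
    return residual ** 2

loss = trajectory_balance_loss(
    log_Z=math.log(2.0),
    log_pf=[math.log(0.5), math.log(0.5)],  # forward policy steps
    log_pb=[math.log(1.0), math.log(1.0)],  # backward policy steps
    log_reward=math.log(0.5),               # reward R(x) at the terminal state
)
```

In this toy setup the partition function, forward probabilities, and reward happen to balance exactly, so the loss is zero; training drives this residual toward zero over sampled trajectories.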
Neural Prior Estimation: Learning Class Priors from Latent Representations
arXiv:2602.17853v1 Announce Type: new Abstract: Class imbalance induces systematic bias in deep neural networks by imposing a skewed effective class prior. This work introduces the Neural Prior Estimator (NPE), a framework that learns feature-conditioned log-prior estimates from latent representations. NPE...
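The correction the abstract describes is in the spirit of logit adjustment: subtracting a (log) class prior from the logits to counteract the skew an imbalanced training set imposes. NPE itself learns feature-conditioned priors; in this generic sketch the prior is a fixed assumption:

```python
import math

# Generic logit-adjustment illustration (not the NPE method itself):
# subtract tau * log(prior) per class so rare classes are no longer
# penalized by the skewed training prior. Numbers are made up.

def adjust_logits(logits, class_priors, tau=1.0):
    """Subtract tau * log(prior) from each class logit."""
    return [z - tau * math.log(p) for z, p in zip(logits, class_priors)]

logits = [2.0, 2.0]   # model is indifferent between the two classes
priors = [0.9, 0.1]   # but class 0 dominated the training data
adjusted = adjust_logits(logits, priors)
# After adjustment, the rare class 1 receives the higher score.
```

The feature-conditioned version the abstract proposes would replace the fixed `priors` vector with per-example estimates computed from the latent representation.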
Optimizing Graph Causal Classification Models: Estimating Causal Effects and Addressing Confounders
arXiv:2602.17941v1 Announce Type: new Abstract: Graph data is becoming increasingly prevalent due to the growing demand for relational insights in AI across various domains. Organizations regularly use graph data to solve complex problems involving relationships and connections. Causal learning is...
Birthright citizenship: under the flag
Brothers in Law is a recurring series by brothers Akhil and Vikram Amar, with special emphasis on measuring what the Supreme Court says against what the Constitution itself says. For more content from […]
A Meta AI security researcher said an OpenClaw agent ran amok on her inbox
The viral X post from an AI security researcher reads like satire. But it's really a word of warning about what can go wrong when handing tasks to an AI agent.
With AI, investor loyalty is (almost) dead: At least a dozen OpenAI VCs now also back Anthropic
While some dual investors are understandable, others were more shocking, and signal the disregard of a longstanding ethical conflict-of-interest rule.
Google’s Cloud AI leads on the three frontiers of model capability
AI models are pushing against three frontiers at once: raw intelligence, response time, and a third quality you might call "extensibility."
Particle’s AI news app listens to podcasts for interesting clips so you don’t have to
AI news app Particle can now pull in key moments from podcasts, letting readers instantly play short, relevant clips alongside related stories.
World-Model-Augmented Web Agents with Action Correction
arXiv:2602.15384v1 Announce Type: new Abstract: Web agents based on large language models have demonstrated promising capability in automating web tasks. However, current web agents struggle to reason out sensible actions due to the limitations of predicting environment changes, and might...
Common Belief Revisited
arXiv:2602.15403v1 Announce Type: new Abstract: Contrary to common belief, common belief is not KD4. If individual belief is KD45, common belief does indeed lose the 5 property and keep the D and 4 properties -- and it has none of...
Quantifying construct validity in large language model evaluations
arXiv:2602.15532v1 Announce Type: new Abstract: The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know...
Recursive Concept Evolution for Compositional Reasoning in Large Language Models
arXiv:2602.15725v1 Announce Type: new Abstract: Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE. Existing methods improve reasoning by expanding...
This human study did not involve human subjects: Validating LLM simulations as behavioral evidence
arXiv:2602.15785v1 Announce Type: new Abstract: A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about...
Developing AI Agents with Simulated Data: Why, what, and how?
arXiv:2602.15816v1 Announce Type: new Abstract: As insufficient data volume and quality remain the key impediments to the adoption of modern subsymbolic AI, techniques of synthetic data generation are in high demand. Simulation offers an apt, systematic approach to generating diverse...
EduResearchBench: A Hierarchical Atomic Task Decomposition Benchmark for Full-Lifecycle Educational Research
arXiv:2602.15034v1 Announce Type: cross Abstract: While Large Language Models (LLMs) are reshaping the paradigm of AI for Social Science (AI4SS), rigorously evaluating their capabilities in scholarly writing remains a major challenge. Existing benchmarks largely emphasize single-shot, monolithic generation and thus...
Indic-TunedLens: Interpreting Multilingual Models in Indian Languages
arXiv:2602.15038v1 Announce Type: cross Abstract: Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English-centric representation spaces, making...