
SourceBench: Can AI Answers Reference Quality Web Sources?

Hexi Jin, Stephen Liu, Yuheng Li, Simran Malik, Yiying Zhang

Abstract (arXiv:2602.16942v1): Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research in the direction of GenAI and web search.

Executive Summary

The paper introduces SourceBench, a benchmark for evaluating the quality of web sources cited by large language models (LLMs) across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. The authors use an eight-metric framework covering content quality and page-level signals, and evaluate eight LLMs, Google Search, and three AI search tools over 3,996 cited sources. The study reports four key insights to guide future research on generative AI and web search. While it offers a comprehensive evaluation framework, it is limited by its reliance on human-labeled data and the potential for bias in labeling. The findings have significant implications for building more accurate and trustworthy AI-powered search tools.
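To make the framework more concrete, the sketch below models a per-source score record over the six metric names given in the abstract (the remaining two of the eight metrics are not enumerated there). The field names, the 1-5 scale, and the equal-weight aggregation are assumptions for illustration, not the paper's actual rubric.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class SourceScore:
    """Hypothetical per-source record over the metrics named in the abstract."""
    url: str
    # Content-quality metrics (named in the abstract)
    content_relevance: float = 0.0
    factual_accuracy: float = 0.0
    objectivity: float = 0.0
    # Page-level signals (examples named in the abstract)
    freshness: float = 0.0
    authority_accountability: float = 0.0
    clarity: float = 0.0

    def overall(self) -> float:
        # Unweighted mean across the scored dimensions (assumed aggregation).
        return mean([
            self.content_relevance, self.factual_accuracy, self.objectivity,
            self.freshness, self.authority_accountability, self.clarity,
        ])

# Example: scoring one cited source on an assumed 1-5 scale (values are illustrative).
s = SourceScore(
    url="https://example.com/article",
    content_relevance=5, factual_accuracy=4, objectivity=4,
    freshness=3, authority_accountability=5, clarity=4,
)
print(f"{s.url}: overall quality {s.overall():.2f}")
```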

Key Points

  • SourceBench introduces an eight-metric framework for evaluating the quality of web sources cited in AI-generated answers
  • The study evaluates eight LLMs, Google Search, and three AI search tools on 100 real-world queries using SourceBench
  • The findings yield four key insights to guide future research on generative AI and web search

Merits

Strength in Evaluation Framework

The eight-metric framework provides a comprehensive and nuanced evaluation of the quality of cited web sources

Real-world Relevance

The study evaluates LLMs and search tools across real-world queries and intents

Methodological Transparency

The authors provide detailed descriptions of their methodology and evaluation framework

Demerits

Limitation of Human-labeled Datasets

The study relies on human-labeled datasets, which may be subject to bias and variability

Scalability and Generalizability

The study covers only 100 queries and a small set of models and tools, which may not be representative of broader trends

Lack of Contextual Understanding

The evaluation framework may not fully capture the nuances of contextual understanding and common sense in AI-powered search tools

Expert Commentary

SourceBench is a significant contribution to the evaluation of AI-powered search tools. The eight-metric framework and the calibrated LLM-based evaluator offer a robust, reproducible way to assess the quality of cited web sources, and the focus on real-world queries and intents gives the results practical relevance. The main limitations are the reliance on human-labeled data, with the bias and variability that entails, and the modest scale of 100 queries. Future work could develop evaluation frameworks that better capture contextual understanding and probe labeling bias more systematically. The findings also bear on emerging guidelines and regulations for AI-powered search, underscoring the need for transparency and accountability in how these tools select and cite evidence.
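One concrete way to probe both the calibration claim and the labeling-variability concern is to measure agreement between the LLM-based evaluator and expert labels on the same sources. The minimal sketch below uses scikit-learn's Cohen's kappa; the 1-5 scale, the example labels, and the choice of quadratic weighting are assumptions for illustration, not figures from the paper.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical calibration check: compare the LLM evaluator's scores
# against expert labels for the same cited sources (values are illustrative).
human_labels = [5, 4, 3, 5, 2, 4, 3, 5, 4, 2]   # expert scores, assumed 1-5 scale
llm_scores   = [5, 4, 3, 4, 2, 4, 3, 5, 5, 2]   # calibrated LLM evaluator scores

# Quadratic weighting penalizes large disagreements more than off-by-one ones,
# which suits ordinal quality ratings.
kappa = cohen_kappa_score(human_labels, llm_scores, weights="quadratic")
print(f"LLM evaluator vs. expert agreement (quadratic kappa): {kappa:.2f}")
```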

Recommendations

  • Develop more advanced evaluation frameworks that can capture the nuances of contextual understanding and common sense in AI-powered search tools
  • Investigate the potential for bias in human-labeled datasets and explore alternative evaluation methods
