LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

arXiv:2603.06198v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi-step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT-RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT-RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human-constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM-as-a-Judge for scoring and report category-wise and overall accuracy. Across API-based and open-weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT-RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG-specialized models. We release LIT-RAGBench, including the dataset and evaluation code, at https://github.com/Koki-Itai/LIT-RAGBench.
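The abstract states that scoring uses LLM-as-a-Judge and that results are reported as category-wise and overall accuracy. The aggregation step can be sketched as follows; the function name and input shape here are illustrative assumptions, not the benchmark's actual API:

```python
from collections import defaultdict

def aggregate_accuracy(judgements):
    """Aggregate per-question judge verdicts into category-wise
    and overall accuracy.

    judgements: list of (category, correct) pairs, where `correct`
    is the binary verdict assigned by the judge model.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, correct in judgements:
        totals[category] += 1
        hits[category] += int(correct)
    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_category, overall
```

For example, three correct verdicts out of four questions would yield an overall accuracy of 0.75, with each category's score computed only over its own questions.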

Executive Summary

This article introduces LIT-RAGBench, a benchmark designed to evaluate the Generator capabilities of Large Language Models (LLMs) in Retrieval-Augmented Generation (RAG). The benchmark addresses gaps in existing evaluations by covering five capability categories: Integration, Reasoning, Logic, Table interpretation, and Abstention, each subdivided into practical evaluation aspects. The dataset comprises 114 human-constructed Japanese questions plus an English version produced by machine translation with human curation, and is released together with evaluation code. By making strengths and weaknesses measurable within each category, LIT-RAGBench supports model selection for practical RAG deployments and the development of RAG-specialized models. Across the API-based and open-weight models evaluated, no model exceeds 90% overall accuracy, underscoring the difficulty of the combined tasks.

Key Points

  • LIT-RAGBench is a comprehensive benchmarking framework for evaluating LLMs in RAG
  • The framework addresses the limitations of existing benchmarks by incorporating multiple evaluation aspects
  • The LIT-RAGBench dataset comprises 114 human-constructed Japanese questions and a human-curated English translation

Merits

Comprehensive evaluation aspects

LIT-RAGBench covers multiple evaluation aspects, including integration, reasoning, logic, table interpretation, and abstention, providing a more comprehensive understanding of LLMs' capabilities
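The abstract notes that the benchmark systematically covers patterns combining multiple aspects across categories. One way to picture such items is as questions tagged with one or more category labels; the schema below is a hypothetical illustration, not the structure of the released dataset:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """Hypothetical record for a benchmark question combining categories."""
    question: str
    documents: list[str]   # external documents provided to the Generator
    categories: set[str]   # e.g. {"Reasoning", "Table"}; items may span categories
    gold_answer: str | None  # None when the evidence is missing on purpose

    def expects_abstention(self) -> bool:
        # An Abstention item with no supporting evidence should be
        # answered with a refusal rather than a guess.
        return "Abstention" in self.categories and self.gold_answer is None
```

Under this framing, a single item can probe, say, Table interpretation and Reasoning at once, which is what makes simultaneous evaluation under unified conditions possible.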

Valuable resource for model selection

LIT-RAGBench enables the assessment of LLMs' strengths and weaknesses, facilitating model selection and development of RAG-specialized models

Open-source and publicly available

The LIT-RAGBench dataset and evaluation code are released on GitHub, making it accessible to the research community

Demerits

Limited dataset size

The LIT-RAGBench dataset comprises only 114 human-constructed questions, a size that limits statistical power and may not be sufficient to assess the robustness of LLMs

Reliance on machine translation

The English version is generated by machine translation with human curation, so residual translation artifacts may bias results and limit cross-lingual comparability

Limited representation of real-world scenarios

The dataset primarily consists of fictional entities and scenarios, which may not accurately represent real-world applications of RAG

Expert Commentary

The introduction of LIT-RAGBench marks a significant step forward in the evaluation of LLM Generators in RAG. By covering multiple capability categories under unified conditions, it provides a more complete picture of model strengths and weaknesses and supports both model selection and the development of RAG-specialized models. However, the small dataset size and the reliance on machine translation for the English version are notable limitations to address in future work. More broadly, LIT-RAGBench underscores the need for robust evaluation frameworks for LLMs, which can also inform policy decisions on the adoption and regulation of AI-powered systems.

Recommendations

  • Future research should focus on expanding the dataset size and diversity to better represent real-world scenarios
  • The development of more robust evaluation frameworks for LLMs is essential to inform policy decisions on the adoption and regulation of AI-powered systems
