Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach

arXiv:2602.13890v1 Announce Type: new Abstract: Retrieval Augmented Generation (RAG) is a powerful approach for enhancing the factual grounding of language models by integrating external knowledge. While widely studied for large language models, the optimization of RAG for Small Language Models (SLMs) remains a critical research gap, particularly in complex, multi-hop question-answering tasks that require sophisticated reasoning. In these systems, prompt template design is a crucial yet under-explored factor influencing performance. This paper presents a large-scale empirical study to investigate this factor, evaluating 24 different prompt templates on the HotpotQA dataset. The set includes a standard RAG prompt, nine well-formed techniques from the literature, and 14 novel hybrid variants, all tested on two prominent SLMs: Qwen2.5-3B Instruct and Gemma3-4B-It. Our findings, based on a test set of 18720 instances, reveal significant performance gains of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, yielding an improvement of up to 6% for both models compared to the Standard RAG prompt. This research also offers concrete analysis and actionable recommendations for designing effective and efficient prompts for SLM-based RAG systems, particularly for deployment in resource-constrained environments.

Executive Summary

The article 'Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach' investigates how prompt template design affects Retrieval Augmented Generation (RAG) in Small Language Models (SLMs) on complex, multi-hop question answering. The study evaluates 24 prompt templates, comprising a standard RAG prompt, nine established techniques from the literature, and 14 novel hybrid variants, on two SLMs, Qwen2.5-3B Instruct and Gemma3-4B-It, over 18,720 HotpotQA test instances. The best templates reach scores of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, an improvement of up to 6% over the Standard RAG prompt. The research distills these results into actionable recommendations for designing effective prompts for SLM-based RAG systems, particularly in resource-constrained environments.

Key Points

  • The study focuses on optimizing RAG for Small Language Models (SLMs) in multi-hop question-answering tasks.
  • 24 different prompt templates were evaluated, including standard RAG prompts, literature-based techniques, and novel hybrid variants.
  • The best templates scored up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, up to 6% above the Standard RAG prompt.
  • The research offers concrete analysis and actionable recommendations for designing effective prompts for SLM-based RAG systems.
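The digest does not reproduce the paper's 24 templates, but the contrast it studies can be sketched. Below is a minimal, hypothetical illustration of a standard RAG prompt next to a chain-of-thought-style hybrid, one family of techniques such studies typically cover; the template wording, the names `STANDARD_RAG` and `COT_HYBRID`, and the `build_prompt` helper are illustrative assumptions, not artifacts from the paper.

```python
# Illustrative only: these are NOT the paper's templates.
# A "standard" RAG prompt injects retrieved context and asks directly.
STANDARD_RAG = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

# A hypothetical chain-of-thought hybrid adds an explicit reasoning step,
# which matters for multi-hop questions that span several passages.
COT_HYBRID = (
    "Answer the question using only the context below. "
    "First reason step by step over the passages, then give the final answer.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Reasoning:"
)

def build_prompt(template: str, context: str, question: str) -> str:
    """Fill a template with the retrieved context and the user question."""
    return template.format(context=context, question=question)
```

Swapping the template string is the only change between conditions, which is what makes a 24-way comparison like the paper's tractable.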

Merits

Comprehensive Evaluation

The study conducts a large-scale empirical evaluation of 24 different prompt templates, providing a thorough analysis of their effectiveness in enhancing the performance of SLMs in multi-hop question-answering tasks.
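The paper's exact scoring pipeline is not shown in this digest, but HotpotQA-style QA is conventionally scored with SQuAD-style answer normalization and exact match. The sketch below assumes that convention; the function names and the aggregation into a per-template accuracy are illustrative, not taken from the paper.

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """True if the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)

def template_accuracy(predictions, golds):
    """Fraction of test instances a given template answers exactly right."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)
```

Under this scheme, each of the 24 templates would be run over the same 18,720 instances and ranked by its aggregate score.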

Actionable Recommendations

The research offers practical recommendations for designing effective prompts, which can be particularly valuable for deployment in resource-constrained environments.

Significant Performance Gains

The best templates reach scores of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, up to 6% above the Standard RAG baseline, highlighting how much prompt template design matters in SLM-based RAG systems.

Demerits

Limited Scope of Models

The study focuses on only two SLMs, Qwen2.5-3B Instruct and Gemma3-4B-It, which may limit the generalizability of the findings to other SLMs.

Dataset Specificity

The evaluation is conducted on the HotpotQA dataset, which may not fully represent the diversity of multi-hop question-answering tasks in real-world applications.

Potential Bias in Prompt Selection

The template pool was hand-selected by the authors and includes their own 14 hybrid variants, which may favor techniques they expected to succeed and bias the comparison; an independently curated template set would strengthen the evaluation.

Expert Commentary

The article presents a rigorous, well-reasoned investigation into prompt optimization for RAG with Small Language Models on multi-hop question answering. Its evaluation of 24 prompt templates, spanning a standard RAG prompt, established techniques from the literature, and novel hybrid variants, yields valuable insight into how strongly prompt design shapes SLM performance. The best templates reach 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, up to 6% above the Standard RAG baseline, underscoring the practical relevance of the work. However, the focus on only two SLMs and a single dataset (HotpotQA) may limit how far the findings generalize. Even so, the actionable recommendations can guide practitioners designing prompts for SLM-based RAG systems, particularly in resource-constrained environments, and the findings feed into broader discussions on optimizing language models for efficient real-world deployment.

Recommendations

  • Future research should expand the evaluation to include a broader range of SLMs to enhance the generalizability of the findings.
  • The study should be replicated using diverse datasets to ensure the robustness of the results across different multi-hop question-answering tasks.
