A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity

Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn

arXiv:2603.06976v1

Abstract: We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning fixed-size, semantic, structure-aware, hierarchical, adaptive, and LLM-assisted approaches are benchmarked across six diverse knowledge domains using five different embedding models. Retrieval performance is assessed using graded relevance scores from a state-of-the-art LLM evaluator, with Normalised DCG@5 (nDCG@5) as the primary metric, complemented by Hit@5 and MRR. Our experiments show that content-aware chunking significantly improves retrieval effectiveness over naive fixed-length splitting. The top-performing strategy, Paragraph Group Chunking, achieved the highest overall accuracy (mean nDCG@5 ≈ 0.459) and substantially better top-rank hit rates (Precision@1 ≈ 24%, Hit@5 ≈ 59%). In contrast, simple fixed-size character chunking baselines performed poorly (nDCG@5 < 0.244, Precision@1 ≈ 2-3%). We observe pronounced domain-specific differences: dynamic token sizing is strongest in biology, physics and health, while paragraph grouping is strongest in legal and maths. Larger embedding models yield higher absolute scores but remain sensitive to suboptimal segmentation, indicating that better chunking and larger embeddings provide complementary benefits. In addition to accuracy gains, we quantify the efficiency trade-offs of advanced chunking: producing more, smaller chunks can increase index size and latency. Consequently, we identify methods (like dynamic chunking) that approach an optimal balance of effectiveness and efficiency. These findings establish chunking as a vital lever for improving retrieval performance and reliability.
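For readers unfamiliar with the three metrics named in the abstract, the following is a minimal sketch of how they are commonly computed from a ranked list of graded relevance scores. It is not the authors' evaluation code: the linear gain, the relevance threshold of 1, and the choice to build the ideal ranking from the retrieved list itself are all assumptions made for illustration.

```python
import math

def dcg_at_k(grades, k=5):
    """Discounted cumulative gain over the first k graded relevance scores.
    Assumption: linear gain (the grade itself) with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))

def ndcg_at_k(grades, k=5):
    """DCG@k divided by the DCG@k of the best possible ordering.
    Assumption: the ideal ordering is built from the same retrieved grades."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

def hit_at_k(grades, k=5, threshold=1):
    """1.0 if any of the top-k results reaches the relevance threshold."""
    return 1.0 if any(g >= threshold for g in grades[:k]) else 0.0

def reciprocal_rank(grades, threshold=1):
    """1 / rank of the first relevant result, or 0.0 if there is none.
    MRR is the mean of this value over all queries."""
    for rank, g in enumerate(grades, start=1):
        if g >= threshold:
            return 1.0 / rank
    return 0.0

# Hypothetical grades for one query, in ranked order (0 = irrelevant).
example = [0, 2, 1, 0, 0]
scores = (ndcg_at_k(example), hit_at_k(example), reciprocal_rank(example))
```

Averaging each value over all queries gives the mean nDCG@5, Hit@5 and MRR figures of the kind the abstract reports.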

Executive Summary

This article evaluates 36 document chunking strategies for dense retrieval across six knowledge domains and five embedding models. Content-aware chunking significantly improves retrieval effectiveness over fixed-length splitting: Paragraph Group Chunking achieves the highest overall accuracy (mean nDCG@5 ≈ 0.459), while fixed-size character baselines stay below 0.244. The study also finds that the best strategy varies by domain and that larger embedding models raise absolute scores without removing sensitivity to poor segmentation, so chunking remains an important design choice in retrieval-augmented systems.

Key Points

  • Content-aware chunking outperforms naive fixed-length splitting
  • Paragraph Group Chunking achieves the highest overall accuracy (mean nDCG@5 ≈ 0.459)
  • The best strategy varies by domain: dynamic token sizing leads in biology, physics and health, while paragraph grouping leads in legal and maths
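The contrast between fixed-length splitting and paragraph grouping can be illustrated with a short sketch. The abstract does not specify how Paragraph Group Chunking is implemented, so the function names, the blank-line paragraph delimiter, and the character budgets below are assumptions chosen for illustration only.

```python
import re

def fixed_size_chunks(text, size=200):
    """Naive baseline: cut every `size` characters, ignoring all boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def paragraph_group_chunks(text, max_chars=800):
    """Group consecutive paragraphs until a character budget is reached.
    Assumptions: paragraphs are separated by blank lines, and a single
    paragraph longer than `max_chars` is kept whole rather than split."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Budget exceeded: close the current group, start a new one.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The fixed-size version can cut a sentence or even a word in half, whereas the grouped version always ends a chunk at a paragraph boundary. This is one plausible reason content-aware methods retrieve better, although the abstract itself does not give a causal explanation.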

Merits

Comprehensive Evaluation

The study benchmarks 36 chunking strategies across six domains and five embedding models, giving practitioners building retrieval-augmented systems a broad basis for comparison.

Domain-Specific Analysis

By reporting which strategies perform best in each domain, the study supports chunking choices tailored to the content being indexed.

Demerits

Complexity of Chunking Methods

With 36 methods evaluated, selecting the most suitable approach for a given application can be difficult.

Efficiency Trade-Offs

Strategies that produce more, smaller chunks can increase index size and latency, which may be a concern for large-scale deployments.
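A rough back-of-the-envelope calculation shows why chunk size affects index size. The sketch below assumes a flat index that stores one float32 vector per chunk, with no chunk overlap, compression, or metadata; the function name and the example figures are hypothetical and are not taken from the paper.

```python
import math

def index_footprint(corpus_chars, avg_chunk_chars, dim, bytes_per_value=4):
    """Estimate chunk count and raw vector storage for a flat index.
    Assumptions: non-overlapping chunks, one `dim`-dimensional vector per
    chunk, `bytes_per_value` bytes per component (4 for float32)."""
    n_chunks = math.ceil(corpus_chars / avg_chunk_chars)
    return n_chunks, n_chunks * dim * bytes_per_value

# Hypothetical corpus of 1,000,000 characters and 768-dimensional embeddings.
coarse = index_footprint(1_000_000, avg_chunk_chars=500, dim=768)
fine = index_footprint(1_000_000, avg_chunk_chars=250, dim=768)
```

Under these assumptions, halving the average chunk size doubles both the number of vectors and the raw storage. The embedding dimension multiplies the cost in the same way, so the trade-off is sharper for larger embedding models.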

Expert Commentary

The article makes a significant contribution to understanding how document chunking affects retrieval performance. Its broad evaluation and domain-specific analysis give concrete guidance for building more effective retrieval-augmented systems. In practice, however, the gains from advanced chunking must be weighed against the difficulty of choosing among many methods and the added index size and latency.

Recommendations

  • Conduct further research into chunking methods that are both more efficient and more effective
  • Account for domain-specific differences in chunking effectiveness when designing retrieval-augmented systems
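One simple way to act on the domain-specific findings is a lookup from domain to chunking strategy. The sketch below encodes only the five domain results stated in the abstract; the strategy identifiers are hypothetical labels, and falling back to paragraph grouping for unlisted domains is an assumption based on its highest overall score.

```python
# Domain -> strategy, as reported in the abstract. The string identifiers
# are hypothetical labels, not names from the authors' code.
STRATEGY_BY_DOMAIN = {
    "biology": "dynamic_token_sizing",
    "physics": "dynamic_token_sizing",
    "health": "dynamic_token_sizing",
    "legal": "paragraph_group",
    "maths": "paragraph_group",
}

# Assumption: default to the strategy with the best overall nDCG@5.
DEFAULT_STRATEGY = "paragraph_group"

def pick_strategy(domain):
    """Return the chunking strategy label for a domain name."""
    return STRATEGY_BY_DOMAIN.get(domain.strip().lower(), DEFAULT_STRATEGY)
```

A table like this is only a starting point; a system covering domains outside the study would still need its own evaluation.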
