Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models
arXiv:2604.06201v1 Announce Type: new Abstract: While most reading comprehension benchmarks for LLMs focus on factual information that can be answered by localizing specific textual evidence, many real-world tasks require understanding distributional information, such as population-level trends and preferences expressed across collections of text. We introduce Text2DistBench, a reading comprehension benchmark for evaluating LLMs' ability to infer distributional knowledge from natural language. Built from real-world YouTube comments about movie and music entities, the benchmark provides models with entity metadata and associated comments, and requires them to answer distributional questions, such as estimating the proportions of positive and negative comments, or identifying the most and second most frequent topics discussed among viewers. To support reliable and long-term evaluation, the construction pipeline of Text2DistBench is fully automated and continuously updated to incorporate newly emerging entities over time. Experiments across multiple LLMs show that while models substantially outperform random baselines, performance varies widely across different distribution types and characteristics. These findings highlight both the capabilities and limitations of current LLMs in distributional reading comprehension and demonstrate the value of Text2DistBench as a practical and scalable testbed for future research.
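To make the task concrete, a distributional question of the first kind asks a model to estimate proportions (e.g., of positive vs. negative comments) rather than locate a single fact. The abstract does not specify the scoring metric, so the sketch below uses total variation distance between the gold and predicted proportion vectors as one plausible choice; the labels and numbers are invented for illustration.

```python
from collections import Counter

def ground_truth_distribution(labels):
    """Aggregate per-comment sentiment labels into a proportion vector."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical gold labels for 10 comments vs. a model's estimated proportions.
gold = ground_truth_distribution(
    ["positive"] * 6 + ["negative"] * 3 + ["neutral"] * 1
)
estimate = {"positive": 0.5, "negative": 0.4, "neutral": 0.1}
error = total_variation_distance(gold, estimate)  # 0.1
```

A lower distance means the model's estimated distribution better matches the one aggregated from the comments; this is only one reasonable way to score such questions.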
Executive Summary
This article introduces Text2DistBench, a novel reading comprehension benchmark designed to evaluate Large Language Models' (LLMs) capacity for understanding distributional information, a critical but often overlooked aspect of real-world text analysis. Unlike traditional benchmarks focused on factual extraction, Text2DistBench assesses LLMs' ability to infer population-level trends, sentiments, and topic distributions from collections of text, specifically YouTube comments related to movies and music. The benchmark's fully automated and continuously updated construction pipeline ensures its scalability and long-term relevance. Initial experiments reveal varying performance across LLMs and distribution types, underscoring both the emerging capabilities and the significant limitations of current models in this domain. Text2DistBench thus stands as a valuable testbed for advancing research in distributional reading comprehension.
Key Points
- ▸ Traditional LLM benchmarks primarily focus on factual information retrieval, neglecting distributional understanding.
- ▸ Text2DistBench evaluates LLMs' ability to infer population-level trends, sentiments, and topic frequencies from collections of text.
- ▸ The benchmark is constructed from real-world YouTube comments about movie and music entities, providing authentic data.
- ▸ Its construction pipeline is fully automated and continuously updated, ensuring long-term reliability and scalability.
- ▸ Experimental results indicate that LLMs outperform random baselines but exhibit considerable variance in performance across different distribution types and characteristics.
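For the second question type noted above (most and second-most frequent topics), the ground truth reduces to a frequency ranking over per-comment topic labels, which can be sketched in a few lines; the topic labels here are invented:

```python
from collections import Counter

# Hypothetical topic labels assigned to individual viewer comments.
topics = ["soundtrack", "acting", "soundtrack", "plot", "acting", "soundtrack"]

# Top two topics by frequency: the model must recover this ranking
# from raw comments rather than from pre-assigned labels.
ranked = Counter(topics).most_common(2)  # [("soundtrack", 3), ("acting", 2)]
```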
Merits
Novelty of Focus
Addresses a critical gap in LLM evaluation by focusing on distributional reading comprehension, moving beyond mere factual extraction to more complex inferential tasks relevant to real-world applications.
Real-World Data Source
Utilizes authentic, messy, and diverse YouTube comments, enhancing the ecological validity and practical relevance of the benchmark compared to synthetic or curated datasets.
Automated and Scalable Construction
The fully automated and continuously updating pipeline ensures the benchmark's longevity, scalability, and ability to incorporate new entities, a significant advantage over static, manually curated datasets.
Clear Problem Definition
Articulates a distinct and challenging problem space for LLMs, encouraging the development of models with deeper analytical capabilities beyond surface-level text processing.
Demerits
Domain Specificity
Reliance on YouTube comments for movie and music entities, while authentic, may limit the generalizability of findings to other domains with different linguistic styles, topic complexities, or user demographics.
Subjectivity of 'Distributional Knowledge'
Defining and validating 'ground truth' for subjective distributional questions (e.g., 'most frequent topics') from user-generated content can be inherently challenging and potentially introduce human annotation biases, even with aggregation.
Lack of Error Analysis Depth
While acknowledging performance variance, the abstract does not detail specific types of errors or failure modes, which would be crucial for understanding the 'why' behind LLM limitations.
Potential for Prompt Engineering Exploitation
The 'distributional questions' might be susceptible to prompt engineering strategies that allow LLMs to 'guess' or approximate answers without true distributional understanding, especially if question types are predictable.
Expert Commentary
This paper's introduction of Text2DistBench marks a significant and timely evolution in LLM evaluation. The shift from atomistic factual retrieval to complex distributional inference mirrors the increasing sophistication demanded of AI in real-world analytical tasks. The choice of YouTube comments, while rich, also highlights the inherent challenges of 'ground truth' in subjective, crowd-sourced data; defining 'most frequent topics' or 'proportions of sentiment' often involves a degree of human interpretation, which LLMs must now emulate or surpass. The automated pipeline is commendable for scalability, yet the abstract leaves open questions regarding the robustness of this automation in handling evolving slang, nuanced sarcasm, or domain-specific jargon that could skew distributional understanding. Future work must rigorously probe the types of errors LLMs make in these inferential tasks – mistaking correlation for causation, misinterpreting implicit sentiment, or failing to aggregate disparate but related concepts – to truly advance the field beyond mere performance metrics. This benchmark is not just about LLM capabilities, but also about refining our understanding of what 'understanding' truly entails in complex, aggregate data.
Recommendations
- ✓ Expand Text2DistBench to include diverse domains beyond entertainment (e.g., legal texts, scientific literature, policy documents) to assess generalizability and domain-specific challenges in distributional understanding.
- ✓ Implement a detailed error taxonomy and analysis framework within the benchmark to categorize and quantify specific failure modes of LLMs in distributional reasoning, guiding targeted model improvements.
- ✓ Investigate the impact of different prompting strategies and few-shot learning on LLM performance in distributional tasks, exploring whether models are truly inferring or merely pattern-matching based on prompt structure.
- ✓ Explore methods for incorporating explainability into LLM responses for distributional questions, allowing human users to understand *why* a model reached a particular conclusion about a distribution.
- ✓ Conduct a thorough analysis of potential biases (e.g., demographic, cultural, linguistic) embedded in the YouTube comment data and assess how these biases might influence LLM-derived distributional insights.
Sources
Original: arXiv - cs.CL