Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
arXiv:2603.05890v1 Announce Type: new Abstract: What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at https://picrew.github.io/constory-bench.github.io/.
Executive Summary
This article covers ConStory-Bench, a novel benchmark for evaluating narrative consistency in long-form story generation by Large Language Models (LLMs). The study identifies common consistency errors and characterizes their tendencies, and it introduces ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. The findings can inform future efforts to improve consistency in long-form narrative generation. While the work sheds light on the limitations of LLMs, it also raises questions about how consistency errors affect the credibility and reliability of generated narratives.
Key Points
- ▸ ConStory-Bench is a benchmark designed to evaluate narrative consistency in long-form story generation.
- ▸ The study defines a taxonomy of five error categories with 19 fine-grained subtypes of consistency errors in LLM-generated stories.
- ▸ ConStory-Checker is an automated pipeline for detecting contradictions in generated narratives.
Merits
Innovative Benchmark
ConStory-Bench provides a comprehensive and structured approach to evaluating narrative consistency in LLMs, which is a crucial aspect of long-form story generation.
Insightful Analysis
The study offers a detailed examination of consistency errors and their tendencies, providing valuable insights for future research and improvement.
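One of the reported tendencies is that consistency errors occur in text segments with higher token-level entropy. The abstract does not spell out how that entropy is computed (in LLM papers it often means the entropy of the model's next-token predictive distribution); as a rough, self-contained proxy, one can score the Shannon entropy of the surface-token distribution within fixed-size windows of the story. The function names and window size below are illustrative, not from the paper:

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the token distribution in one segment."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def segment_entropies(text, window=50):
    """Split a story into fixed-size token windows and score each one.

    Segments with unusually high entropy would be the candidates to
    inspect more closely for consistency errors, per the paper's finding.
    """
    tokens = text.lower().split()
    return [
        token_entropy(tokens[i:i + window])
        for i in range(0, len(tokens), window)
    ]
```

A window of identical tokens scores 0 bits, while a window of all-distinct tokens scores log2(window) bits, so the scores are directly comparable across segments of equal length.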
Automated Pipeline
ConStory-Checker is a useful tool for detecting contradictions in generated narratives, making it an essential component of the ConStory-Bench framework.
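ConStory-Checker's actual pipeline is not detailed in the abstract beyond the fact that it detects contradictions and grounds each judgment in explicit textual evidence. A minimal sketch of that evidence-grounding idea, under the strong simplifying assumption that claims take the form "Subject is/was attribute" (the regex and function name are hypothetical, not the paper's method):

```python
import re

# Hypothetical toy claim pattern: a capitalized subject followed by a
# copula and a one-word attribute, e.g. "Mara is blind".
CLAIM = re.compile(r"\b([A-Z][a-z]+) (?:is|was) (\w+)")

def find_contradictions(story):
    """Return contradiction records, each paired with its evidence.

    Each record is (subject, earlier_attribute, later_attribute,
    earlier_sentence, later_sentence), so every judgment is grounded
    in the exact sentences that conflict.
    """
    seen = {}          # subject -> (first attribute, evidence sentence)
    contradictions = []
    for sentence in re.split(r"(?<=[.!?])\s+", story):
        for subject, attr in CLAIM.findall(sentence):
            if subject in seen and seen[subject][0] != attr:
                prev_attr, prev_sent = seen[subject]
                contradictions.append(
                    (subject, prev_attr, attr, prev_sent, sentence)
                )
            else:
                seen.setdefault(subject, (attr, sentence))
    return contradictions
```

Running it on "Mara is blind. The road was long. Mara is sighted." flags one contradiction for Mara, with both supporting sentences attached. A real checker would need entailment-level reasoning rather than string matching, but the output shape (judgment plus quoted evidence) mirrors what the abstract describes.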
Demerits
Limited Scope
The study focuses primarily on LLMs and the ConStory-Bench framework itself, leaving the broader implications and downstream applications of the findings unexplored.
Lack of Real-World Context
The study does not consider the real-world implications of consistency errors in generated narratives, which may be crucial for applications such as content generation and creative writing.
Expert Commentary
The article presents a significant contribution to the field of natural language processing, exposing the limitations of LLMs in maintaining narrative consistency. The ConStory-Bench framework and the ConStory-Checker pipeline are valuable tools for evaluating and improving the quality of AI-generated content. However, the narrow focus on LLMs and the lack of real-world context may limit the work's broader applicability. Nevertheless, the findings can inform future research and development, and the methodology may transfer to other areas of NLP. As AI-generated content becomes more widespread, addressing consistency and credibility is essential to maintaining trust in AI-generated narratives.
Recommendations
- ✓ Future research should explore the broader implications of the study's findings, including the impact of consistency errors on the credibility and reliability of AI-generated content.
- ✓ The development of more advanced benchmarks and evaluation metrics for narrative consistency is essential for improving the quality of AI-generated content.