Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
arXiv:2603.05890v1 Announce Type: new Abstract: What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at https://picrew.github.io/constory-bench.github.io/.
Executive Summary
This article covers ConStory-Bench, a novel benchmark for evaluating narrative consistency in long-form story generation by Large Language Models (LLMs). The study identifies common consistency errors and characterizes their tendencies, and it introduces ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. The findings can inform future efforts to improve consistency in long-form narrative generation. While the work sheds light on the limitations of LLMs, it also raises questions about how consistency errors affect the credibility and reliability of generated narratives.
Key Points
- ▸ ConStory-Bench is a benchmark designed to evaluate narrative consistency in long-form story generation.
- ▸ The study defines a taxonomy of five error categories with 19 fine-grained subtypes of consistency errors in LLM-generated stories.
- ▸ ConStory-Checker is an automated pipeline for detecting contradictions in generated narratives.
Merits
Innovative Benchmark
ConStory-Bench provides a comprehensive and structured approach to evaluating narrative consistency in LLMs, which is a crucial aspect of long-form story generation.
Insightful Analysis
The study offers a detailed examination of consistency errors and their tendencies, providing valuable insights for future research and improvement.
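One of the reported tendencies is that consistency errors occur in text segments with higher token-level entropy. The abstract does not spell out how that entropy is computed (in LLM papers it often means the entropy of the model's next-token predictive distribution); as a rough, self-contained proxy, one can score the Shannon entropy of the surface-token distribution within fixed-size windows of the story. The function names and window size below are illustrative, not from the paper:

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the token distribution in one segment."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def segment_entropies(text, window=50):
    """Split a story into fixed-size token windows and score each one.

    Segments with unusually high entropy would be the candidates to
    inspect more closely for consistency errors, per the paper's finding.
    """
    tokens = text.lower().split()
    return [
        token_entropy(tokens[i:i + window])
        for i in range(0, len(tokens), window)
    ]
```

A window of identical tokens scores 0 bits, while a window of all-distinct tokens scores log2(window) bits, so the scores are directly comparable across segments of equal length.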
Automated Pipeline
ConStory-Checker is a useful tool for detecting contradictions in generated narratives, making it an essential component of the ConStory-Bench framework.
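ConStory-Checker's actual pipeline is not detailed in the abstract beyond the fact that it detects contradictions and grounds each judgment in explicit textual evidence. A minimal sketch of that evidence-grounding idea, under the strong simplifying assumption that claims take the form "Subject is/was attribute" (the regex and function name are hypothetical, not the paper's method):

```python
import re

# Hypothetical toy claim pattern: a capitalized subject followed by a
# copula and a one-word attribute, e.g. "Mara is blind".
CLAIM = re.compile(r"\b([A-Z][a-z]+) (?:is|was) (\w+)")

def find_contradictions(story):
    """Return contradiction records, each paired with its evidence.

    Each record is (subject, earlier_attribute, later_attribute,
    earlier_sentence, later_sentence), so every judgment is grounded
    in the exact sentences that conflict.
    """
    seen = {}          # subject -> (first attribute, evidence sentence)
    contradictions = []
    for sentence in re.split(r"(?<=[.!?])\s+", story):
        for subject, attr in CLAIM.findall(sentence):
            if subject in seen and seen[subject][0] != attr:
                prev_attr, prev_sent = seen[subject]
                contradictions.append(
                    (subject, prev_attr, attr, prev_sent, sentence)
                )
            else:
                seen.setdefault(subject, (attr, sentence))
    return contradictions
```

Running it on "Mara is blind. The road was long. Mara is sighted." flags one contradiction for Mara, with both supporting sentences attached. A real checker would need entailment-level reasoning rather than string matching, but the output shape (judgment plus quoted evidence) mirrors what the abstract describes.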
Demerits
Limited Scope
The study focuses primarily on LLMs and the ConStory-Bench framework itself, leaving the broader implications and downstream applications of the findings unexplored.
Lack of Real-World Context
The study does not consider the real-world implications of consistency errors in generated narratives, which may be crucial for applications such as content generation and creative writing.
Expert Commentary
The article presents a significant contribution to the field of natural language processing, exposing the limitations of LLMs in maintaining narrative consistency. The ConStory-Bench framework and the ConStory-Checker pipeline are valuable tools for evaluating and improving the quality of AI-generated content. However, the narrow focus on LLMs and the lack of real-world context may limit the work's broader applicability. Nevertheless, the findings can inform future research and development, and the methodology may transfer to other areas of NLP. As AI-generated content becomes more widespread, addressing consistency and credibility is essential to maintaining trust in AI-generated narratives.
Recommendations
- ✓ Future research should explore the broader implications of the study's findings, including the impact of consistency errors on the credibility and reliability of AI-generated content.
- ✓ The development of more advanced benchmarks and evaluation metrics for narrative consistency is essential for improving the quality of AI-generated content.