Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books

W. Frederick Zimmerman

arXiv:2602.20647v1 Abstract: I introduce semantic novelty--cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs--as an information-theoretic measure of narrative structure at corpus scale. Applying it to 28,606 books in PG19 (pre-1920 English literature), I compute paragraph-level novelty curves using 768-dimensional SBERT embeddings, then reduce each to a 16-segment Piecewise Aggregate Approximation (PAA). Ward-linkage clustering on PAA vectors reveals eight canonical narrative shape archetypes, from Steep Descent (rapid convergence) to Steep Ascent (escalating unpredictability). Volume--variance of the novelty trajectory--is the strongest length-independent predictor of readership (partial rho = 0.32), followed by speed (rho = 0.19) and Terminal/Initial ratio (rho = 0.19). Circuitousness shows strong raw correlation (rho = 0.41) but is 93 percent correlated with length; after control, partial rho drops to 0.11--demonstrating that naive correlations in corpus studies can be dominated by length confounds. Genre strongly constrains narrative shape (chi-squared = 2121.6, p < 10^-242), with fiction maintaining plateau profiles while nonfiction front-loads information. Historical analysis shows books became progressively more predictable between 1840 and 1910 (T/I ratio trend r = -0.74, p = 0.037). SAX analysis reveals 85 percent signature uniqueness, suggesting each book traces a nearly unique path through semantic space. These findings demonstrate that information-density dynamics, distinct from sentiment or topic, constitute a fundamental dimension of narrative structure with measurable consequences for reader engagement. Dataset: https://huggingface.co/datasets/wfzimmerman/pg19-semantic-novelty
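
The novelty measure and the PAA reduction described in the abstract can be sketched in a few lines. This is a minimal, dependency-free illustration and not the author's code: the function names (cosine_distance, novelty_curve, paa) are hypothetical, the paragraph embeddings are assumed to be already computed (for example by an SBERT model), and details such as how the first paragraph, zero vectors, or any normalization before PAA are handled are assumptions made here.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors.

    Assumption: a zero-norm vector yields distance 0.0 (a convention
    chosen for this sketch, not taken from the paper).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return 1.0 - dot / (norm_u * norm_v)


def novelty_curve(embeddings):
    """Novelty of paragraph i = cosine distance to the centroid (mean)
    of paragraphs 0..i-1.

    The first paragraph has no predecessors, so the curve has
    len(embeddings) - 1 points (an assumption of this sketch).
    """
    if not embeddings:
        return []
    running_sum = list(embeddings[0])
    curve = []
    for i in range(1, len(embeddings)):
        centroid = [s / i for s in running_sum]  # mean of the first i paragraphs
        curve.append(cosine_distance(embeddings[i], centroid))
        running_sum = [s + x for s, x in zip(running_sum, embeddings[i])]
    return curve


def paa(series, segments=16):
    """Piecewise Aggregate Approximation: mean of each of `segments`
    contiguous, near-equal-length slices of the series."""
    n = len(series)
    if n < segments:
        raise ValueError("series must have at least `segments` points")
    reduced = []
    for k in range(segments):
        start = k * n // segments
        end = (k + 1) * n // segments
        chunk = series[start:end]
        reduced.append(sum(chunk) / len(chunk))
    return reduced
```

Under these assumptions, a book's fixed-length shape vector would be paa(novelty_curve(embeddings)), which is the kind of 16-dimensional input the abstract describes feeding into Ward-linkage clustering. For 768-dimensional embeddings across thousands of paragraphs, a vectorized array library would be the more practical choice; plain lists are used here only to keep the sketch self-contained.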

Executive Summary

The article introduces semantic novelty as a measure of narrative structure and applies it to 28,606 pre-1920 English-language books from the PG19 corpus. Using sentence embeddings and Ward-linkage clustering, the study identifies eight canonical narrative shape archetypes and examines how features of the novelty trajectory relate to readership. The findings highlight the role of information-density dynamics in narrative structure and reader engagement, and show that book length can confound correlations in corpus studies.

Key Points

  • Introduction of semantic novelty as a measure of narrative structure.
  • Identification of eight canonical narrative shape archetypes.
  • Volume (variance of the novelty trajectory) is the strongest length-independent predictor of readership (partial rho = 0.32).
  • Genre strongly constrains narrative shape, with fiction and nonfiction showing distinct patterns.
  • Historical analysis reveals a trend towards increased predictability in books from 1840 to 1910.

Merits

Innovative Methodology

Combining semantic novelty with PAA reduction and Ward-linkage clustering provides a novel approach to analyzing narrative structure at corpus scale.

Comprehensive Dataset

The study draws on a large corpus of 28,606 books from PG19, which gives its corpus-level statistics a broad base within pre-1920 English literature.

Practical Implications

The findings have direct implications for understanding reader engagement and could inform content creation strategies.

Demerits

Potential Bias in Corpus

The focus on pre-1920 English literature may limit the generalizability of the findings to contemporary or non-English works.

Length Confounds

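The study acknowledges that some raw correlations are confounded by book length. Circuitousness, for example, correlates with readership at rho = 0.41 but is 93 percent correlated with length; after controlling for length, the partial rho drops to 0.11.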
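
One common way to control for a confound such as length is a first-order partial rank correlation. The sketch below is illustrative only: the abstract does not say how the control was performed, so the use of Spearman ranks and the standard partial-correlation formula are assumptions, and the function names are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rho = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(x, y, z):
    """Rank correlation of x and y with the rank-linear effect of z removed.

    Uses the first-order partial correlation formula; undefined when
    x or y is perfectly rank-correlated with z.
    """
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

Here x would be a trajectory feature such as circuitousness, y the readership measure, and z the book length. When x and y are each strongly tied to z, the numerator shrinks and the partial value can fall well below the raw rho, which is the pattern the abstract reports for circuitousness.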

Complexity of Analysis

The advanced statistical and computational methods used may be challenging for some readers to fully grasp and replicate.

Expert Commentary

The article presents a rigorous and innovative approach to analyzing narrative structure through the lens of semantic novelty. The eight canonical narrative shape archetypes, together with the readership correlations reported for trajectory features such as Volume, speed, and Terminal/Initial ratio, offer insight into the dynamics of reader engagement, although the strongest length-independent effect is moderate (partial rho = 0.32). The methodology is commendable: SBERT embeddings, PAA reduction, and Ward-linkage clustering make a corpus of this size tractable. However, the focus on pre-1920 English literature may limit the generalizability of the findings. The explicit treatment of length confounds is a strength, as it highlights the importance of controlling for such variables in corpus studies. The historical trend towards increased predictability between 1840 and 1910 (T/I ratio trend r = -0.74, p = 0.037) is particularly intriguing and warrants further exploration. Overall, the study contributes to digital humanities and narrative theory and provides a framework for future research.

Recommendations

  • Future studies should expand the corpus to include contemporary and non-English literature to enhance the generalizability of the findings.
  • Researchers should explore the application of semantic novelty to other forms of narrative media, such as films and television series, to broaden the scope of the analysis.
