Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts
arXiv:2602.22359v1 Announce Type: new Abstract: This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels. It foregrounds prompt-sensitivity analysis as a methodological issue by varying prompt scaffolding and framing in a balanced 2x3 design. Using footnote 6 in Chubin and Moitra (1975) and Gilbert's (1977) reconstruction as a probe, I implement a two-stage GPT-5 pipeline: a citation-text-only surface classification and expectation pass, followed by cross-document interpretative reconstruction using the citing and cited full texts. Across 90 reconstructions, the model produces 450 distinct hypotheses. Close reading and inductive coding identify 21 recurring interpretative moves, and linear probability models estimate how prompt choices shift their frequencies and lexical repertoire. GPT-5's surface pass is highly stable, consistently classifying the citation as "supplementary". In reconstruction, the model generates a structured space of plausible alternatives, but scaffolding and examples redistribute attention and vocabulary, sometimes toward strained readings. Relative to Gilbert, GPT-5 detects the same textual hinges yet more often resolves them as lineage and positioning than as admonishment. The study outlines opportunities and risks of using LLMs as guided co-analysts for inspectable, contestable interpretative CCA, and it shows that prompt scaffolding and framing systematically tilt which plausible readings and vocabularies the model foregrounds.
Executive Summary
This study explores the potential of large language models (LLMs) to support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case. The author tests the sensitivity of GPT-5 to prompt scaffolding and framing using a two-stage pipeline: a citation-text-only surface classification pass followed by cross-document interpretative reconstruction. The surface pass is highly stable, consistently classifying the citation as "supplementary", while the reconstruction stage generates a structured space of plausible alternatives; prompt choices systematically shift the frequency and lexical repertoire of 21 recurring interpretative moves. The study highlights the opportunities and risks of using LLMs as guided co-analysts for CCA and emphasizes that careful prompt design is needed to obtain reliable and relevant results.
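The experimental layout described above (a balanced 2x3 design yielding 90 reconstructions and 450 hypotheses) can be sketched as a small enumeration. The scaffolding and framing labels below are illustrative assumptions, not the paper's actual condition names; only the counts come from the abstract.

```python
from itertools import product

# Hypothetical labels for the balanced 2x3 prompt design: two scaffolding
# levels crossed with three framing variants (names are assumptions).
scaffolding = ["minimal", "structured"]          # 2 levels
framing = ["neutral", "historical", "critical"]  # 3 levels

conditions = list(product(scaffolding, framing))  # 6 cells in the design
runs_per_cell = 15                                # 6 x 15 = 90 reconstructions
hypotheses_per_run = 5                            # 90 x 5 = 450 hypotheses

total_runs = len(conditions) * runs_per_cell
total_hypotheses = total_runs * hypotheses_per_run
print(len(conditions), total_runs, total_hypotheses)  # 6 90 450
```

Each cell would then feed the same two-stage pipeline (surface pass, then cross-document reconstruction), so any shift in interpretative moves across cells can be attributed to the prompt manipulation rather than the texts.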
Key Points
- ▸ GPT-5's surface classification pass is highly stable, while its cross-document reconstructions produce a structured space of plausible thick, text-grounded readings of a single hard case.
- ▸ Prompt scaffolding and framing significantly impact the frequency and lexical repertoire of recurring interpretative moves.
- ▸ LLMs can be used as guided co-analysts for CCA, but careful prompt design is crucial for reliable and relevant results.
Merits
Strength in Methodology
The study employs a rigorous methodology, including a balanced 2x3 design and two-stage pipeline, to test the sensitivity of GPT-5 to prompt scaffolding and framing.
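The abstract's "linear probability models" estimate how prompt factors shift the frequency of each coded interpretative move. A minimal sketch, using synthetic stand-in data and plain NumPy least squares (the paper's actual estimation code and variable coding are not given, so everything below is an assumption about the general technique):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 90 reconstructions, binary outcome = whether a
# given interpretative move (e.g. "lineage") appears in the output.
n = 90
scaffold = rng.integers(0, 2, n)      # 0/1 scaffolding indicator
framing = rng.integers(0, 3, n)       # three framing variants, coded 0/1/2
move_present = rng.integers(0, 2, n)  # observed move indicator (synthetic)

# Design matrix: intercept, scaffolding dummy, two framing dummies
# (framing level 0 is the reference category).
X = np.column_stack([
    np.ones(n),
    scaffold.astype(float),
    (framing == 1).astype(float),
    (framing == 2).astype(float),
])

# Linear probability model: OLS on a binary outcome, so each coefficient
# reads directly as a shift in the probability that the move appears.
beta, *_ = np.linalg.lstsq(X, move_present.astype(float), rcond=None)
print(beta.shape)  # one coefficient per design-matrix column: (4,)
```

In practice one such regression would be fit per interpretative move (21 in the study), with the coefficients summarizing how scaffolding and framing redistribute the model's attention across readings.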
Insight into LLM Limitations
The study shows that prompt scaffolding and examples can redistribute the model's attention and vocabulary, sometimes toward strained readings, underscoring the need for careful prompt design to avoid such outcomes.
Promising Applications
The study suggests that LLMs can be used as guided co-analysts for CCA, potentially facilitating more efficient and accurate interpretative analysis.
Demerits
Limited Generalizability
The study focuses on a single hard case and a specific LLM model, limiting the generalizability of the findings to other contexts and models.
Dependence on Prompt Design
The study emphasizes the importance of carefully designing prompts, but the potential consequences of poorly designed prompts are not fully explored.
Expert Commentary
The study provides valuable insights into the potential and limitations of LLMs in supporting interpretative CCA. The emphasis on prompt sensitivity analysis and the importance of careful prompt design are crucial considerations for researchers and practitioners alike. While the study's methodology is rigorous, the limited generalizability of the findings and the dependence on prompt design are notable limitations. Nevertheless, the study's contributions to the ongoing discussion on LLM interpretability and prompt engineering are significant and warrant further exploration.
Recommendations
- ✓ Researchers should prioritize the development of rigorous methodologies for testing the sensitivity of LLMs to prompt scaffolding and framing.
- ✓ Developers of LLMs should design more robust and transparent prompt engineering tools to support careful prompt design and minimize the risk of strained readings.