Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts
arXiv:2602.22359v1 Announce Type: new Abstract: This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels. It foregrounds prompt-sensitivity analysis as a methodological issue by varying prompt scaffolding and framing in a balanced 2x3 design. Using footnote 6 in Chubin and Moitra (1975) and Gilbert's (1977) reconstruction as a probe, I implement a two-stage GPT-5 pipeline: a citation-text-only surface classification and expectation pass, followed by cross-document interpretative reconstruction using the citing and cited full texts. Across 90 reconstructions, the model produces 450 distinct hypotheses. Close reading and inductive coding identify 21 recurring interpretative moves, and linear probability models estimate how prompt choices shift their frequencies and lexical repertoire. GPT-5's surface pass is highly stable, consistently classifying the citation as "supplementary". In reconstruction, the model generates a structured space of plausible alternatives, but scaffolding and examples redistribute attention and vocabulary, sometimes toward strained readings. Relative to Gilbert, GPT-5 detects the same textual hinges yet more often resolves them as lineage and positioning than as admonishment. The study outlines opportunities and risks of using LLMs as guided co-analysts for inspectable, contestable interpretative CCA, and it shows that prompt scaffolding and framing systematically tilt which plausible readings and vocabularies the model foregrounds.
Executive Summary
This study explores the potential of large language models (LLMs) to support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case. The author tests the sensitivity of GPT-5 to prompt scaffolding and framing using a two-stage pipeline: a citation-text-only surface classification pass followed by cross-document interpretative reconstruction. The surface pass is highly stable, consistently classifying the citation as "supplementary", while the reconstruction stage generates a structured space of plausible alternatives; prompt choices systematically shift the frequency and lexical repertoire of 21 recurring interpretative moves. The study highlights the opportunities and risks of using LLMs as guided co-analysts for CCA and emphasizes that careful prompt design is needed to obtain reliable and relevant results.
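The experimental layout described above (a balanced 2x3 design yielding 90 reconstructions and 450 hypotheses) can be sketched as a small enumeration. The scaffolding and framing labels below are illustrative assumptions, not the paper's actual condition names; only the counts come from the abstract.

```python
from itertools import product

# Hypothetical labels for the balanced 2x3 prompt design: two scaffolding
# levels crossed with three framing variants (names are assumptions).
scaffolding = ["minimal", "structured"]          # 2 levels
framing = ["neutral", "historical", "critical"]  # 3 levels

conditions = list(product(scaffolding, framing))  # 6 cells in the design
runs_per_cell = 15                                # 6 x 15 = 90 reconstructions
hypotheses_per_run = 5                            # 90 x 5 = 450 hypotheses

total_runs = len(conditions) * runs_per_cell
total_hypotheses = total_runs * hypotheses_per_run
print(len(conditions), total_runs, total_hypotheses)  # 6 90 450
```

Each cell would then feed the same two-stage pipeline (surface pass, then cross-document reconstruction), so any shift in interpretative moves across cells can be attributed to the prompt manipulation rather than the texts.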
Key Points
- ▸ GPT-5's surface classification pass is highly stable, while its cross-document reconstructions produce a structured space of plausible thick, text-grounded readings of a single hard case.
- ▸ Prompt scaffolding and framing significantly impact the frequency and lexical repertoire of recurring interpretative moves.
- ▸ LLMs can be used as guided co-analysts for CCA, but careful prompt design is crucial for reliable and relevant results.
Merits
Strength in Methodology
The study employs a rigorous methodology, including a balanced 2x3 design and two-stage pipeline, to test the sensitivity of GPT-5 to prompt scaffolding and framing.
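The abstract's "linear probability models" estimate how prompt factors shift the frequency of each coded interpretative move. A minimal sketch, using synthetic stand-in data and plain NumPy least squares (the paper's actual estimation code and variable coding are not given, so everything below is an assumption about the general technique):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 90 reconstructions, binary outcome = whether a
# given interpretative move (e.g. "lineage") appears in the output.
n = 90
scaffold = rng.integers(0, 2, n)      # 0/1 scaffolding indicator
framing = rng.integers(0, 3, n)       # three framing variants, coded 0/1/2
move_present = rng.integers(0, 2, n)  # observed move indicator (synthetic)

# Design matrix: intercept, scaffolding dummy, two framing dummies
# (framing level 0 is the reference category).
X = np.column_stack([
    np.ones(n),
    scaffold.astype(float),
    (framing == 1).astype(float),
    (framing == 2).astype(float),
])

# Linear probability model: OLS on a binary outcome, so each coefficient
# reads directly as a shift in the probability that the move appears.
beta, *_ = np.linalg.lstsq(X, move_present.astype(float), rcond=None)
print(beta.shape)  # one coefficient per design-matrix column: (4,)
```

In practice one such regression would be fit per interpretative move (21 in the study), with the coefficients summarizing how scaffolding and framing redistribute the model's attention across readings.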
Insight into LLM Limitations
The study shows that prompt scaffolding and examples can redistribute the model's attention and vocabulary, sometimes toward strained readings, underscoring the need for careful prompt design to avoid such outcomes.
Promising Applications
The study suggests that LLMs can be used as guided co-analysts for CCA, potentially facilitating more efficient and accurate interpretative analysis.
Demerits
Limited Generalizability
The study focuses on a single hard case and a specific LLM model, limiting the generalizability of the findings to other contexts and models.
Dependence on Prompt Design
The study emphasizes the importance of carefully designing prompts, but the potential consequences of poorly designed prompts are not fully explored.
Expert Commentary
The study provides valuable insights into the potential and limitations of LLMs in supporting interpretative CCA. The emphasis on prompt sensitivity analysis and the importance of careful prompt design are crucial considerations for researchers and practitioners alike. While the study's methodology is rigorous, the limited generalizability of the findings and the dependence on prompt design are notable limitations. Nevertheless, the study's contributions to the ongoing discussion on LLM interpretability and prompt engineering are significant and warrant further exploration.
Recommendations
- ✓ Researchers should prioritize the development of rigorous methodologies for testing the sensitivity of LLMs to prompt scaffolding and framing.
- ✓ Developers of LLMs should design more robust and transparent prompt engineering tools to support careful prompt design and minimize the risk of strained readings.