CCiV: A Benchmark for Structure, Rhythm and Quality in LLM-Generated Chinese Ci Poetry

Shangqing Zhao, Yupei Ren, Yuhao Zhou, Xiaopeng Bai, Man Lan

arXiv:2602.14081v1 Announce Type: new Abstract: The generation of classical Chinese Ci poetry, a form demanding a sophisticated blend of structural rigidity, rhythmic harmony, and artistic quality, poses a significant challenge for large language models (LLMs). To systematically evaluate and advance this capability, we introduce Chinese Cipai Variants (CCiV), a benchmark designed to assess LLM-generated Ci poetry across these three dimensions: structure, rhythm, and quality. Our evaluation of 17 LLMs on 30 Cipai reveals two critical phenomena: models frequently generate valid but unexpected historical variants of a poetic form, and adherence to tonal patterns is substantially harder than structural rules. We further show that form-aware prompting can improve structural and tonal control for stronger models, while potentially degrading weaker ones. Finally, we observe weak and inconsistent alignment between formal correctness and literary quality in our sample. CCiV highlights the need for variant-aware evaluation and more holistic constrained creative generation methods.

Executive Summary

The article introduces CCiV, a benchmark for evaluating large language models (LLMs) in generating classical Chinese Ci poetry, which requires a delicate balance of structural, rhythmic, and artistic elements. The study evaluates 17 LLMs across 30 Ci forms, revealing that models often produce valid but unexpected historical variants and struggle more with tonal patterns than structural rules. Form-aware prompting improves performance in stronger models but may degrade weaker ones. The research highlights a weak alignment between formal correctness and literary quality, emphasizing the need for variant-aware evaluation and more comprehensive methods for constrained creative generation.

Key Points

  • Introduction of CCiV benchmark for evaluating LLM-generated Ci poetry.
  • Evaluation of 17 LLMs across 30 Ci forms reveals challenges in meeting structural, rhythmic, and artistic requirements.
  • Form-aware prompting improves stronger models but may degrade weaker ones.
  • Weak alignment between formal correctness and literary quality observed.
  • Need for variant-aware evaluation and holistic constrained creative generation methods.

Merits

Comprehensive Benchmark

The CCiV benchmark provides a systematic and multi-dimensional approach to evaluating LLM-generated Ci poetry, addressing structure, rhythm, and quality.
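To make the structural and rhythmic dimensions concrete, here is a minimal, hypothetical sketch of how a checker might compare a candidate poem against a Cipai template. The template format, the ping/ze ('P'/'Z') encoding, and the scoring are illustrative assumptions, not the CCiV implementation; a real checker would consult a historical rhyme dictionary to classify each character's tone.

```python
# Hypothetical Cipai template: expected character count per line, plus a
# per-line tonal pattern ('P' = ping/level, 'Z' = ze/oblique, '*' = either).
# The values below are invented for illustration.
TEMPLATE = {
    "line_lengths": [7, 5, 7, 5],
    "tone_patterns": ["PPZZPPZ", "ZZPPZ", "ZZPPPZZ", "PPZZP"],
}

def check_structure(lines, template):
    """True if the poem has the expected line count and line lengths."""
    if len(lines) != len(template["line_lengths"]):
        return False
    return all(len(line) == n
               for line, n in zip(lines, template["line_lengths"]))

def tone_match_rate(lines, template, tone_of):
    """Fraction of constrained positions whose tone class matches.

    `tone_of` maps a character to 'P' or 'Z'; in practice this lookup
    would come from a historical rhyme dictionary.
    """
    matched = total = 0
    for line, pattern in zip(lines, template["tone_patterns"]):
        for ch, expected in zip(line, pattern):
            if expected == "*":
                continue  # position unconstrained by the template
            total += 1
            if tone_of(ch) == expected:
                matched += 1
    return matched / total if total else 1.0
```

Separating the two checks mirrors the paper's finding: a poem can pass `check_structure` completely while scoring poorly on `tone_match_rate`, which is exactly the gap CCiV measures.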

Insightful Findings

The study reveals critical phenomena such as the generation of unexpected historical variants and the difficulty in adhering to tonal patterns, offering valuable insights into LLM capabilities.
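The finding that models produce valid but unexpected historical variants suggests scoring against every recorded variant of a Cipai rather than a single canonical form. The sketch below illustrates that idea with an invented distance over line-length templates; the variant data and scoring are assumptions for illustration only.

```python
# Hypothetical variant-aware matching: compare a poem's line lengths against
# every recorded variant of the Cipai and keep the closest one, so a poem
# that realizes an unexpected but valid historical variant scores 0.

def best_variant(line_lengths, variants):
    """Return (variant_name, distance) for the closest length template.

    Distance counts per-line length mismatches plus a penalty for any
    difference in line count.
    """
    def distance(template):
        mismatches = sum(a != b for a, b in zip(line_lengths, template))
        return mismatches + abs(len(line_lengths) - len(template))
    return min(((name, distance(t)) for name, t in variants.items()),
               key=lambda pair: pair[1])
```

Under single-form evaluation, a poem matching `variant_a` below would be penalized; variant-aware matching recognizes it as a perfect fit.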

Practical Implications

The research provides practical implications for improving LLM performance in constrained creative tasks, particularly through form-aware prompting.
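Form-aware prompting, as the paper describes it, amounts to stating the target form's constraints explicitly in the prompt. The sketch below shows one plausible way to construct such a prompt from a template; the wording and template fields are illustrative assumptions, not the prompts used in CCiV.

```python
# Hypothetical form-aware prompt builder: spell out line count, per-line
# character counts, and tonal patterns so the model need not recall the
# Cipai's constraints from memory.

def form_aware_prompt(cipai, template):
    """Build a generation prompt that states the structural constraints."""
    parts = [
        f"Write a Ci poem to the tune '{cipai}'.",
        f"It must have exactly {len(template['line_lengths'])} lines.",
    ]
    rows = zip(template["line_lengths"], template["tone_patterns"])
    for i, (n, tones) in enumerate(rows, start=1):
        parts.append(f"Line {i}: {n} characters, tonal pattern {tones}.")
    return "\n".join(parts)
```

Consistent with the paper's caveat, making constraints this explicit can help models strong enough to follow them, while overloading weaker models with instructions they cannot satisfy.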

Demerits

Limited Scope

The study focuses solely on Ci poetry, which may limit the generalizability of findings to other forms of classical Chinese poetry or creative tasks.

Sample Size

Evaluating only 17 LLMs and 30 Ci forms may not capture the full spectrum of LLM capabilities and challenges in generating classical poetry.

Subjective Quality Assessment

The assessment of literary quality is subjective and may vary among evaluators, potentially affecting the consistency and reliability of the findings.

Expert Commentary

The article presents a rigorous and well-structured approach to evaluating LLM-generated Ci poetry, addressing critical dimensions such as structure, rhythm, and quality. The findings are insightful, particularly the revelation that models often produce unexpected historical variants and struggle with tonal patterns. The study's emphasis on form-aware prompting and the need for variant-aware evaluation is a significant contribution to the field. However, the limited scope and sample size may restrict the generalizability of the findings. The subjective nature of literary quality assessment also poses a challenge. Overall, the research highlights the complexities of evaluating LLMs in constrained creative tasks and underscores the need for more holistic and culturally sensitive approaches.

Recommendations

  • Future research should expand the scope of evaluation to include a broader range of classical Chinese poetry forms and a larger sample of LLMs.
  • Developers should explore advanced prompting techniques and evaluation metrics to improve the performance and cultural relevance of LLM-generated content.
  • Policymakers and industry stakeholders should collaborate to establish guidelines and standards for evaluating and regulating AI-generated creative content, ensuring cultural sensitivity and historical accuracy.
