POEMetric: The Last Stanza of Humanity
arXiv:2604.03695v1 Announce Type: new Abstract: Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation, examining 1) basic instruction-following abilities in generating poems according to a certain form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of the overall poem quality and estimation of authorship. We curated a human poem dataset - 203 English poems of 7 fixed forms annotated with meter, rhyme patterns and themes - and experimented with 30 LLMs for poetry generation based on the same forms and themes of the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, whose results were validated by human experts. Results show that, though the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as a judge; same below) and theme alignment (4.99), all models failed to reach the same level of advanced abilities as human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also defeated the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and codes are released at https://github.com/Bingru-Li/POEMetric.
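The paper's rule-based evaluation checks whether generated poems obey a fixed form's meter and rhyme pattern. The released code should be consulted for the actual method; as a minimal sketch of what a rhyme-scheme check can look like, the following uses a crude spelling-based rhyme key (a real checker would use a pronunciation dictionary such as CMUdict; all function names here are illustrative, not the paper's API):

```python
import re

def rhyme_key(word, n=3):
    """Crude rhyme key: the last n letters of the lowercased word.
    Spelling-based, so it misses eye-rhymes and sound-alikes."""
    w = re.sub(r"[^a-z]", "", word.lower())
    return w[-n:]

def detect_scheme(lines):
    """Label each line by the rhyme key of its final word: first new
    key gets 'A', the next 'B', and so on."""
    labels, seen = [], {}
    for line in lines:
        words = re.findall(r"[A-Za-z']+", line)
        key = rhyme_key(words[-1]) if words else ""
        if key not in seen:
            seen[key] = chr(ord("A") + len(seen))
        labels.append(seen[key])
    return "".join(labels)

def form_matches(lines, expected_scheme):
    """True if the detected scheme equals the expected one, e.g. 'ABAB'."""
    return detect_scheme(lines) == expected_scheme

quatrain = [
    "The autumn wind is blowing cold",
    "Across the fields of fading light",
    "The story of the year is told",
    "In colors slipping out of sight",
]
print(detect_scheme(quatrain))  # ABAB
```

A production checker would replace `rhyme_key` with phoneme-level comparison and add per-form templates (sonnet, villanelle, etc.), but the scheme-labeling loop stays the same.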
Executive Summary
This study introduces POEMetric, a comprehensive framework for evaluating poetry composition by Large Language Models (LLMs). The researchers assessed 30 LLMs against human poets using a combination of rule-based evaluation and an LLM-as-a-judge approach, with the judge's scores validated by human experts. While the top LLMs achieved high form accuracy and theme alignment, none matched human poets on advanced abilities such as creativity, idiosyncrasy, and emotional resonance, and humans also outscored the best model on overall poem quality. The study quantifies the gap between human and machine-generated poetry and underscores how far current models remain from human artistic expression.
Key Points
- ▸ Introduction of POEMetric, a framework for evaluating poetry composition by LLMs
- ▸ Comparison of LLM-generated and human-generated poetry through POEMetric
- ▸ Identification of significant gaps in LLM performance, particularly in advanced abilities
Merits
Methodological Innovation
The study pairs rule-based checks with an LLM-as-a-judge protocol whose scores are validated by human experts, yielding a more comprehensive assessment of LLM poetry than formal metrics alone.
Curated Benchmark Dataset
The researchers curated 203 human poems across 7 fixed forms, annotated with meter, rhyme patterns, and themes, and paired them with 6,090 LLM-generated poems on the same forms and themes, providing a reusable benchmark for future poetry evaluation work.
Demerits
Limited Generalizability
The study focuses on poetry in a single language (English) and a limited set of fixed forms and themes, which may not be representative of the broader poetry landscape and may limit the generalizability of the findings.
Overreliance on Rule-Based Evaluation
The evaluation leans on rule-based checks for form, which cannot capture the full range of human creativity and artistic expression and risks overweighting formal correctness at the expense of more nuanced, subjective judgments.
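To illustrate how shallow a purely formal metric can be, lexical diversity is often proxied by a type-token ratio: distinct words divided by total words. The sketch below is a minimal assumed implementation, not the paper's; it rewards varied vocabulary while remaining blind to creativity, imagery, or emotional resonance:

```python
import re

def type_token_ratio(text):
    """Lexical diversity proxy: distinct word types / total word tokens.
    High repetition drives the ratio toward 0; all-unique words give 1.0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# 4 distinct types over 8 tokens:
print(type_token_ratio("the rose is a rose is a rose"))  # 0.5
```

Note that a deliberately repetitive refrain, a hallmark of forms like the villanelle, scores poorly on this metric even when it is the poem's central artistic device, which is exactly the kind of mismatch the critique above points to.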
Expert Commentary
The study provides a nuanced, multifaceted evaluation of LLM performance in poetry composition, quantifying the gap between human and machine-generated poetry. While the headline finding is unsurprising, the systematic combination of rule-based and LLM-as-a-judge evaluation, validated by human experts, is a useful methodological contribution, and the curated dataset provides a valuable benchmark for future work. The study's limitations, notably its restriction to English poetry and a small set of fixed forms and themes, should be addressed in follow-up research.
Recommendations
- ✓ Developers and researchers should prioritize the development of more advanced and context-specific language models that can capture the full range of human creativity and artistic expression.
- ✓ Policymakers and regulatory bodies should consider the implications of AI-generated content for human creators and consumers, including issues related to authorship, ownership, and intellectual property.
Sources
Original: arXiv - cs.CL