CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
arXiv:2603.11863v1
Abstract: The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets -- CreativeBench-Combo and CreativeBench-Explore -- the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.
Executive Summary
This paper introduces CreativeBench, a benchmark for machine creativity, to address the lack of rigorous evaluation for evolutionary systems. Grounded in a classical cognitive framework and built on executable code, CreativeBench measures creativity objectively through a unified metric, the product of quality and novelty. The analysis reveals three behaviors: scaling strongly improves combinatorial creativity, larger models exhibit convergence-by-scaling (more correct but less divergent), and reasoning capabilities primarily benefit constrained exploration. The authors also propose EvoRePE, a plug-and-play inference-time steering strategy that enhances machine creativity. This work contributes to the development of more creative AI systems, with implications for applications such as code generation, artistic creation, and decision-making. By addressing the evaluation challenge, CreativeBench paves the way for more effective machine creativity research.
Key Points
- ▸ CreativeBench is introduced as a benchmark for evaluating machine creativity
- ▸ The benchmark targets combinatorial and exploratory creativity through an automated pipeline
- ▸ EvoRePE is proposed as a plug-and-play inference-time steering strategy to enhance machine creativity
Merits
Strength in theoretical foundation
CreativeBench is grounded in a classical cognitive framework, providing a rigorous theoretical foundation for evaluating machine creativity.
Objective evaluation metric
The unified metric defined as the product of quality and novelty offers an objective evaluation of creativity, distinguishing it from hallucination.
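The paper does not spell out its scoring implementation here, but the quality-times-novelty product can be illustrated with a toy sketch. In the code below, quality is approximated as the fraction of test cases a candidate solution passes, and novelty as one minus the highest textual similarity to known reference solutions; both proxies are assumptions for illustration, not the benchmark's actual definitions.

```python
from difflib import SequenceMatcher


def quality(candidate_fn, test_cases):
    """Hypothetical quality proxy: fraction of test cases passed."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures
    return passed / len(test_cases)


def novelty(candidate_src, reference_srcs):
    """Hypothetical novelty proxy: 1 minus the highest textual
    similarity between the candidate and any reference solution."""
    best = max(SequenceMatcher(None, candidate_src, ref).ratio()
               for ref in reference_srcs)
    return 1.0 - best


def creativity(candidate_fn, candidate_src, test_cases, reference_srcs):
    """Unified score as described in the abstract: quality x novelty.
    A correct but derivative solution scores near zero (low novelty),
    and so does a novel but broken one (low quality)."""
    return (quality(candidate_fn, test_cases)
            * novelty(candidate_src, reference_srcs))
```

The multiplicative form is what lets the metric separate creativity from hallucination: a hallucinated (incorrect) artifact is zeroed out by the quality factor no matter how novel it looks.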
Practical application potential
The proposed EvoRePE strategy has the potential to be applied in various domains, including code generation, artistic creation, and decision-making.
Demerits
Limited scope
The current implementation of CreativeBench focuses on code generation and may need to be extended to other domains to fully demonstrate its effectiveness.
Scalability concerns
The proposed EvoRePE strategy may not be scalable to very large models, potentially limiting its applicability.
Expert Commentary
The introduction of CreativeBench and EvoRePE represents a significant step forward in evaluating and enhancing machine creativity. While the current implementation has its limitations, the proposed framework has the potential to be extended and adapted to other domains. The work's implications for AI research and development are substantial, particularly for building more creative and effective AI systems. As the field continues to evolve, addressing the evaluation challenge will be essential to gaining a more comprehensive understanding of AI capabilities and limitations.
Recommendations
- ✓ Future research should focus on extending CreativeBench to other domains and exploring its applicability in various creative tasks.
- ✓ The proposed EvoRePE strategy should be further developed and tested to ensure its scalability and effectiveness in large-scale AI systems.