CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
arXiv:2603.11863v1
Abstract: The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets -- CreativeBench-Combo and CreativeBench-Explore -- the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.
Executive Summary
This paper introduces CreativeBench, a benchmark for machine creativity, to address the lack of rigorous evaluation for evolutionary systems. Grounded in a classical cognitive framework and built on executable code, CreativeBench measures creativity objectively through a unified metric, the product of quality and novelty. The analysis reveals three behaviors: scaling strongly improves combinatorial creativity, larger models exhibit convergence-by-scaling (more correct but less divergent), and reasoning capabilities primarily benefit constrained exploration. The authors also propose EvoRePE, a plug-and-play inference-time steering strategy that enhances machine creativity. This work contributes to the development of more creative AI systems, with implications for applications such as code generation, artistic creation, and decision-making. By addressing the evaluation challenge, CreativeBench paves the way for more effective machine creativity research.
Key Points
- ▸ CreativeBench is introduced as a benchmark for evaluating machine creativity
- ▸ The benchmark targets combinatorial and exploratory creativity through an automated pipeline
- ▸ EvoRePE is proposed as a plug-and-play inference-time steering strategy to enhance machine creativity
Merits
Strength in theoretical foundation
CreativeBench is grounded in a classical cognitive framework, providing a rigorous theoretical foundation for evaluating machine creativity.
Objective evaluation metric
The unified metric defined as the product of quality and novelty offers an objective evaluation of creativity, distinguishing it from hallucination.
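The paper does not spell out its scoring implementation here, but the quality-times-novelty product can be illustrated with a toy sketch. In the code below, quality is approximated as the fraction of test cases a candidate solution passes, and novelty as one minus the highest textual similarity to known reference solutions; both proxies are assumptions for illustration, not the benchmark's actual definitions.

```python
from difflib import SequenceMatcher


def quality(candidate_fn, test_cases):
    """Hypothetical quality proxy: fraction of test cases passed."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures
    return passed / len(test_cases)


def novelty(candidate_src, reference_srcs):
    """Hypothetical novelty proxy: 1 minus the highest textual
    similarity between the candidate and any reference solution."""
    best = max(SequenceMatcher(None, candidate_src, ref).ratio()
               for ref in reference_srcs)
    return 1.0 - best


def creativity(candidate_fn, candidate_src, test_cases, reference_srcs):
    """Unified score as described in the abstract: quality x novelty.
    A correct but derivative solution scores near zero (low novelty),
    and so does a novel but broken one (low quality)."""
    return (quality(candidate_fn, test_cases)
            * novelty(candidate_src, reference_srcs))
```

The multiplicative form is what lets the metric separate creativity from hallucination: a hallucinated (incorrect) artifact is zeroed out by the quality factor no matter how novel it looks.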
Practical application potential
The proposed EvoRePE strategy has the potential to be applied in various domains, including code generation, artistic creation, and decision-making.
Demerits
Limited scope
The current implementation of CreativeBench focuses on code generation and may need to be extended to other domains to fully demonstrate its effectiveness.
Scalability concerns
The proposed EvoRePE strategy may not be scalable to very large models, potentially limiting its applicability.
Expert Commentary
The introduction of CreativeBench and EvoRePE represents a significant step forward in evaluating and enhancing machine creativity. While the current implementation has its limitations, the proposed framework has the potential to be extended and adapted to other domains. The work's implications for AI research and development are substantial, particularly for building more creative and effective AI systems. As the field continues to evolve, addressing the evaluation challenge will be essential to gaining a more comprehensive understanding of AI capabilities and limitations.
Recommendations
- ✓ Future research should focus on extending CreativeBench to other domains and exploring its applicability in various creative tasks.
- ✓ The proposed EvoRePE strategy should be further developed and tested to ensure its scalability and effectiveness in large-scale AI systems.