CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge

arXiv:2604.03374v1 Announce Type: new Abstract: Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect seemingly unrelated pieces of information. However, most existing benchmarks for large language models (LLMs) evaluate only specific components of this process. Moreover, many creativity-oriented benchmarks rely on artificially constructed brainteasers or contrived scenarios that do not reflect how creative problem-solving occurs in real-world settings. To address this gap, we introduce CresOWLve, a benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge. Problems in CresOWLve require employing multiple creative thinking strategies, retrieving facts from diverse domains, and creatively combining them to arrive at a solution. Evaluating several frontier non-thinking and thinking LLMs, we show that CresOWLve remains highly challenging. Our analysis reveals a consistent performance gap: models perform substantially better on factual questions than on creative ones (up to a -17% drop). While models can often retrieve the relevant knowledge, they struggle to form the non-obvious creative connections required to integrate this information and arrive at the correct answer.

Executive Summary

The article introduces CresOWLve, a novel benchmark designed to evaluate the creative problem-solving capabilities of large language models (LLMs) using real-world knowledge. Unlike existing benchmarks that focus on isolated cognitive skills or contrived puzzles, CresOWLve emphasizes the integration of logical reasoning, lateral thinking, analogy-making, and commonsense knowledge to solve problems that reflect real-world complexity. The authors demonstrate that frontier LLMs, despite excelling in factual retrieval, face significant challenges in forming non-obvious connections necessary for creative problem-solving, with performance drops of up to 17% compared to factual questions. This underscores a critical limitation in current LLM architectures and highlights the need for further advancements in AI systems to bridge the gap between knowledge retrieval and creative synthesis.

Key Points

  • CresOWLve evaluates creative problem-solving in LLMs using puzzles grounded in real-world knowledge, addressing the gap left by existing benchmarks that focus on isolated cognitive skills or contrived scenarios.
  • The benchmark reveals a substantial performance gap between factual question answering and creative problem-solving in frontier LLMs, with drops of up to 17% in creative tasks.
  • Models demonstrate proficiency in retrieving relevant knowledge but struggle to form non-obvious connections required to integrate this information creatively, indicating a fundamental limitation in current AI architectures.
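The factual-vs-creative gap reported above can be illustrated with a minimal accuracy comparison. The item structure, toy results, and scoring below are hypothetical for illustration only; the article does not specify CresOWLve's data format or evaluation protocol.

```python
# Hypothetical sketch of the factual-vs-creative accuracy gap described
# in the article. The Item structure and toy results are illustrative,
# not CresOWLve's actual format or numbers.
from dataclasses import dataclass


@dataclass
class Item:
    kind: str      # "factual" or "creative"
    correct: bool  # whether the model answered this item correctly


def accuracy(items, kind):
    """Fraction of items of the given kind answered correctly."""
    subset = [it for it in items if it.kind == kind]
    return sum(it.correct for it in subset) / len(subset)


# Toy results mirroring the reported pattern: strong factual recall,
# noticeably weaker performance on creative integration.
results = (
    [Item("factual", True)] * 9 + [Item("factual", False)]
    + [Item("creative", True)] * 7 + [Item("creative", False)] * 3
)

factual_acc = accuracy(results, "factual")    # 0.9
creative_acc = accuracy(results, "creative")  # 0.7
gap = factual_acc - creative_acc              # ≈ 0.2, a 20-point drop
```

On real benchmark data the same subtraction of per-category accuracies would yield the gap the authors report (up to 17%).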

Merits

Innovative Benchmark Design

CresOWLve uniquely combines real-world knowledge with creative problem-solving tasks, offering a more holistic evaluation of LLMs compared to artificial or isolated benchmarks.

Real-World Relevance

By grounding puzzles in real-world scenarios, the benchmark provides a more accurate reflection of how creative problem-solving occurs in practice, enhancing its ecological validity.

Comprehensive Evaluation Framework

The benchmark evaluates multiple cognitive abilities simultaneously (e.g., logical reasoning, lateral thinking, analogy-making), providing a nuanced assessment of LLM capabilities.

Demerits

Limited Scope of Evaluation

The benchmark may not fully capture the breadth of creative problem-solving scenarios encountered in diverse professional or academic domains, potentially limiting its generalizability.

Dependence on Real-World Knowledge

The reliance on real-world knowledge may introduce biases or inconsistencies if the underlying data is incomplete, outdated, or culturally specific, affecting the reliability of the benchmark.

Performance Metrics Challenges

The article does not detail how creative problem-solving performance is quantified or standardized, which may complicate cross-model comparisons or longitudinal assessments.

Expert Commentary

The introduction of CresOWLve represents a significant step forward in the evaluation of LLMs, addressing a longstanding gap in benchmarking their creative problem-solving abilities. The authors’ findings underscore a critical limitation in current models: while they excel at retrieving facts, they falter when required to synthesize disparate pieces of knowledge into novel solutions. This aligns with broader observations in AI research, where systems often struggle with tasks demanding abductive reasoning or lateral thinking. The benchmark’s focus on real-world scenarios is particularly commendable, as it moves beyond the artificial constraints of contrived puzzles to reflect the messy, interconnected nature of human creativity.

However, the implications of this work extend beyond technical evaluations. The performance gap identified by CresOWLve raises important questions about the nature of creativity itself and whether it can be fully replicated by statistical models. For practitioners, this benchmark serves as a cautionary tale about overestimating the capabilities of current LLMs in domains requiring genuine innovation. Future research should explore hybrid architectures that combine the strengths of neural networks with symbolic reasoning systems to bridge this gap, while also addressing the ethical and societal implications of deploying AI in creative processes.

Recommendations

  • Develop hybrid AI models that integrate neural networks with symbolic reasoning systems to enhance creative problem-solving capabilities, drawing inspiration from cognitive architectures like ACT-R or SOAR.
  • Expand CresOWLve to include a broader range of domains and cultural contexts to ensure its generalizability and reduce biases inherent in real-world knowledge datasets.
  • Collaborate with cognitive scientists and psychologists to refine the benchmark’s design, incorporating insights from human creative processes to improve the ecological validity of the evaluation.

Sources

Original: arXiv - cs.CL