CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

arXiv:2603.02236v1 Announce Type: new Abstract: Recent studies have demonstrated the potential of Large Language Models (LLMs) in generating GPU kernels. Current benchmarks focus on the translation of high-level languages into CUDA, overlooking the more general and challenging task of text-to-CUDA generation. Furthermore, given the hardware-specific and performance-critical features of GPU programming, accurately assessing the performance of LLM-generated GPU programs is nontrivial. In this work, we introduce CUDABench, a comprehensive benchmark designed to evaluate the text-to-CUDA capabilities of LLMs. First, we construct CUDABench-Set, which covers a Breadth-Depth-Difficulty evaluation space across diverse application domains, including artificial intelligence, scientific computing, and data analytics. Furthermore, we propose CUDABench-Score and a Generative Verification Pipeline that assess (1) compilation correctness, (2) functional consistency through execution-based verification, and (3) a novel roofline-based metric, Performance-Score. Benchmarking state-of-the-art LLMs reveals insightful findings and challenges of text-to-CUDA, such as a notable mismatch between high compilation success rates and low functional correctness, a lack of domain-specific algorithmic knowledge, and suboptimal utilization of GPU hardware resources. Our benchmark is available at https://github.com/CUDA-Bench/CUDABench.

Executive Summary

The authors introduce CUDABench, a comprehensive benchmark designed to evaluate the text-to-CUDA capabilities of Large Language Models (LLMs). CUDABench assesses compilation correctness, functional consistency through execution-based verification, and a novel roofline-based metric, Performance-Score. The benchmark surfaces concrete challenges in text-to-CUDA generation: a notable mismatch between high compilation success rates and low functional correctness, a lack of domain-specific algorithmic knowledge, and suboptimal utilization of GPU hardware resources. The work is a significant contribution to LLM evaluation, underscoring the need for more robust and accurate benchmarks for GPU code generation.

Key Points

  • CUDABench is designed to evaluate the text-to-CUDA capabilities of LLMs
  • The benchmark assesses compilation correctness, functional consistency, and Performance-Score
  • The results reveal a mismatch between compilation success rates and functional correctness
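Conceptually, the three assessed dimensions form a gated pipeline: a kernel that fails to compile cannot be functionally verified, and a functionally incorrect kernel should earn no performance credit. A minimal sketch of such gated scoring follows; the names and weights here are hypothetical illustrations, not the paper's exact CUDABench-Score formula:

```python
from dataclasses import dataclass

@dataclass
class KernelResult:
    compiled: bool        # did the compiler accept the generated CUDA source?
    outputs_match: bool   # does execution output match the reference implementation?
    perf_score: float     # roofline-based performance score in [0, 1]

def cudabench_style_score(r: KernelResult) -> float:
    """Gate each stage on the previous one: a kernel that fails to compile
    scores 0; one that compiles but produces wrong output gets only a small
    base credit; a correct kernel is rewarded by its performance score.
    (Illustrative weights only -- the paper's exact formula may differ.)"""
    if not r.compiled:
        return 0.0
    if not r.outputs_match:
        return 0.2  # compiles, but functionally wrong
    return 0.2 + 0.8 * r.perf_score
```

This gating makes the paper's headline mismatch visible in the score itself: a model with many compiling but incorrect kernels plateaus at the base credit regardless of how fast those kernels are.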

Merits

Comprehensive evaluation framework

CUDABench provides a rigorous evaluation framework for LLMs, covering multiple aspects of text-to-CUDA generation.

Novel roofline-based metric

The authors propose a novel roofline-based metric, Performance-Score, which assesses generated kernels relative to the hardware's roofline limits (peak compute and memory bandwidth) rather than by raw runtime alone.
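The roofline model bounds a kernel's attainable throughput by the lesser of the device's peak compute rate and its memory bandwidth multiplied by the kernel's arithmetic intensity. A sketch of how a roofline-relative score can be computed from generic roofline arithmetic; this is not necessarily the paper's exact Performance-Score definition, and the device numbers below are illustrative (A100-like):

```python
def roofline_attainable_gflops(peak_gflops: float, mem_bw_gbs: float,
                               flops: float, bytes_moved: float) -> float:
    """Attainable performance under the roofline model: arithmetic
    intensity I = FLOPs / bytes moved, and attainable throughput is
    min(peak compute, memory bandwidth * I)."""
    intensity = flops / bytes_moved  # FLOPs per byte
    return min(peak_gflops, mem_bw_gbs * intensity)

def performance_score(achieved_gflops: float, peak_gflops: float,
                      mem_bw_gbs: float, flops: float,
                      bytes_moved: float) -> float:
    """Fraction of roofline-attainable performance actually achieved, in [0, 1]."""
    attainable = roofline_attainable_gflops(peak_gflops, mem_bw_gbs,
                                            flops, bytes_moved)
    return min(achieved_gflops / attainable, 1.0)

# Example: a memory-bound SAXPY-like kernel (2 FLOPs, 12 bytes per element,
# so intensity ~0.17 FLOP/byte) on a device with 19,500 GFLOP/s peak compute
# and 1,555 GB/s memory bandwidth. The roofline ceiling here is the
# bandwidth-limited ~259 GFLOP/s, not the compute peak.
score = performance_score(
    achieved_gflops=130.0, peak_gflops=19500.0, mem_bw_gbs=1555.0,
    flops=2e9, bytes_moved=12e9,
)
```

Scoring against the attainable ceiling rather than the compute peak is what makes a roofline metric fair to memory-bound kernels: the SAXPY-like example above can never approach 19,500 GFLOP/s, so judging it against that peak would penalize a well-written kernel.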

Demerits

Limited generalizability

The results may not be generalizable to other LLMs or application domains, given the specific focus on text-to-CUDA generation.

Need for more robust evaluation

Assessing LLM performance across diverse application domains still calls for more robust evaluation methods than any single benchmark can provide.

Expert Commentary

CUDABench's central finding, the mismatch between high compilation success rates and low functional correctness, underscores the complexity of text-to-CUDA generation: syntactic fluency in CUDA does not imply semantic correctness. The combination of an execution-based Generative Verification Pipeline with the roofline-based Performance-Score yields a more comprehensive evaluation framework than compilation checks alone, and can inform the development of more accurate and robust code-generating LLMs. However, the limited generalizability of the results and the need for more robust evaluation methods remain notable limitations of the work.

Recommendations

  • Future research should focus on developing more comprehensive evaluation frameworks for LLMs, incorporating multiple aspects of text-to-CUDA generation and diverse application domains.
  • The authors' work highlights the need for more robust methods of assessing LLM performance, which can guide the development and deployment of LLM-based code-generation systems.
