ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

arXiv:2604.03922v1 Announce Type: new Abstract: Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a \emph{circular dependency}. Our key insight is that we need not determine test correctness at all: \emph{test votes should rank, not merely count}. What matters is not how many codes pass a test, but whether the test can \emph{distinguish} correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC~(LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose \textbf{ACES}~(\textbf{A}UC \textbf{C}onsist\textbf{E}ncy \textbf{S}coring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@$k$ on multiple code generation benchmarks.

Executive Summary

This article proposes ACES (AUC Consistency Scoring), a novel approach to selecting LLM-generated code candidates using LLM-generated tests, which may themselves be incorrect. Rather than trying to determine test correctness, ACES uses leave-one-out evaluation to score each test by its ability to distinguish correct from incorrect code, breaking the circular dependency between code correctness and test correctness. The authors formalize this notion as the leave-one-out AUC (LOO-AUC) and prove that its expectation is proportional to each test's ability to separate correct code from incorrect code. ACES comes in two variants: ACES-C, which provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality, and ACES-O, which drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
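To make the leave-one-out procedure concrete, here is a minimal sketch of computing a per-test LOO-AUC from a binary pass matrix. This is an illustrative reading of the procedure described in the abstract, not the paper's exact formulation: for each held-out test, codes are ranked by their aggregate pass count on the remaining tests, and the AUC measures how well that ranking separates codes the held-out test passes from codes it fails.

```python
import numpy as np

def loo_auc(pass_matrix):
    """Per-test leave-one-out AUC (illustrative sketch).

    pass_matrix: (n_tests, n_codes) binary array; entry [t, c] = 1 iff
    candidate code c passes test t.
    """
    n_tests, _ = pass_matrix.shape
    aucs = np.full(n_tests, 0.5)  # 0.5 = uninformative, by convention
    for t in range(n_tests):
        # Rank codes by their aggregate pass count on all *other* tests.
        rest = np.delete(pass_matrix, t, axis=0)
        scores = rest.sum(axis=0)
        labels = pass_matrix[t]  # the held-out test's pass/fail verdicts
        pos, neg = scores[labels == 1], scores[labels == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue  # test passes or fails everything: leave at 0.5
        # AUC = P(a passing code outranks a failing code), ties count half.
        diff = pos[:, None] - neg[None, :]
        aucs[t] = (diff > 0).mean() + 0.5 * (diff == 0).mean()
    return aucs
```

On a toy matrix of four consistent tests and one inverted test, the consistent tests score LOO-AUC 1.0 and the inverted test scores 0.0, so discriminative tests are separated from adversarial ones without ever deciding which codes are correct.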

Key Points

  • ACES breaks the circular dependency between code correctness and test correctness by scoring each test's ability to distinguish correct from incorrect code, rather than attempting to determine whether the test itself is correct.
  • The leave-one-out AUC (LOO-AUC) measures how well a held-out test's pass/fail pattern agrees with the ranking of codes induced by the remaining tests; its expectation is provably proportional to the test's discriminative ability.
  • ACES comes in two complementary variants: ACES-C provides closed-form weights under a mild assumption on average test quality, while ACES-O drops that assumption and iteratively optimizes a differentiable LOO-AUC objective.
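Once each test carries a quality weight, selecting among candidates reduces to a weighted vote over the same pass matrix. The closed-form ACES-C weights are not specified in the abstract, so the weights in this sketch are hypothetical placeholders (e.g., something derived from each test's LOO-AUC):

```python
import numpy as np

def rank_codes(pass_matrix, test_weights):
    """Rank candidate codes by weighted pass score, best first.

    pass_matrix: (n_tests, n_codes) binary matrix.
    test_weights: (n_tests,) per-test quality weights (hypothetical here).
    """
    scores = test_weights @ pass_matrix  # weighted count of tests each code passes
    return np.argsort(-scores, kind="stable")

# Toy example: the middle test is judged unreliable (weight 0),
# so its lone vote for code 1 is ignored.
M = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])
order = rank_codes(M, np.array([1.0, 0.0, 1.0]))  # code 0 ranked first
```

Because both scoring and selection are matrix operations on the binary pass matrix, the overhead on top of executing the tests is small, consistent with the abstract's claim of negligible cost.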

Merits

Strength in Novelty

The article reframes test-based code selection: instead of deciding whether each generated test is correct, it scores tests by their ability to rank correct code above incorrect code, elegantly breaking the circular dependency between code correctness and test correctness.

Strength in Performance

Both variants of ACES achieve state-of-the-art Pass@k on multiple code generation benchmarks while operating solely on the binary pass matrix with negligible overhead.

Strength in Formalization

The authors formalize the leave-one-out AUC (LOO-AUC) and prove that its expectation is proportional to each test's ability to separate correct code from incorrect code, giving the heuristic a theoretical grounding.

Demerits

Limitation in Assumption

ACES-C relies on a mild assumption about average test quality (roughly, that the generated tests are informative on average), which may not hold in real-world scenarios where most generated tests are unreliable; ACES-O drops this assumption, but only at the cost of iterative optimization.

Limitation in Scalability

Although the authors report negligible overhead on the binary pass matrix, the scalability of leave-one-out evaluation over very large pools of candidates and tests is not demonstrated, which could limit some practical applications.

Limitation in Interpretability

The ACES-O variant, which relies on iterative optimization of a differentiable objective, may yield weights that are harder to interpret than the closed-form weights of ACES-C.

Expert Commentary

ACES tackles a thorny problem in LLM-based code generation: tests generated to select among candidate programs may themselves be wrong, and deciding which tests are correct seems to require already knowing which programs are correct. By instead asking whether a test can rank correct code above incorrect code, as captured by the leave-one-out AUC, the authors sidestep this circularity entirely. The strong Pass@k results across multiple benchmarks suggest the idea is practically effective as well as elegant. That said, ACES-C's guarantee rests on a mild assumption about average test quality, which may fail when most generated tests are unreliable, and although the method operates on the binary pass matrix with reportedly negligible overhead, its behavior on very large candidate and test pools remains to be shown. Overall, the article offers a principled and effective approach to weighting LLM-generated tests that could improve the reliability and trustworthiness of AI-generated code selection.

Recommendations

  • Further research is needed to characterize how ACES scales as the number of candidate codes and generated tests grows.
  • It would be valuable to study how ACES behaves when the average-test-quality assumption fails, and to compare it against test-weighting approaches that do not rely on leave-one-out evaluation.

Sources

Original: arXiv - cs.LG