ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
arXiv:2604.03922v1 Announce Type: new Abstract: Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a \emph{circular dependency}. Our key insight is that we need not determine test correctness at all: \emph{test votes should rank, not merely count}. What matters is not how many codes pass a test, but whether the test can \emph{distinguish} correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC~(LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose \textbf{ACES}~(\textbf{A}UC \textbf{C}onsist\textbf{E}ncy \textbf{S}coring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@$k$ on multiple code generation benchmarks.
Executive Summary
This article proposes ACES (AUC Consistency Scoring), a method for selecting correct candidates among LLM-generated code when the LLM-generated tests used for selection may themselves be incorrect. Rather than trying to determine test correctness directly, which would require already knowing which code is correct, ACES uses leave-one-out evaluation: hold out one test, rank the candidates by their aggregate scores on the remaining tests, and measure whether the held-out test's pass/fail pattern agrees with that ranking. The authors formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that its expectation is proportional to each test's ability to separate correct from incorrect code. ACES comes in two variants: ACES-C, which provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality, and ACES-O, which drops that assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
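The leave-one-out procedure operates only on the binary pass matrix. Below is a minimal illustrative sketch of how per-test LOO-AUC scores could be computed, reconstructed from the abstract's description rather than the authors' code; the function name, the use of raw pass counts as the aggregate score, and the tie-handling are assumptions:

```python
import numpy as np

def loo_auc_scores(P):
    """Score each test by leave-one-out AUC consistency.

    P: (n_codes, n_tests) binary pass matrix, P[i, j] = 1 iff code i passes test j.
    For each test j, rank codes by their total passes on the *other* tests and
    compute the AUC of that ranking against test j's own pass/fail labels.
    """
    n_codes, n_tests = P.shape
    totals = P.sum(axis=1)
    scores = np.full(n_tests, 0.5)        # 0.5 = no discriminative signal
    for j in range(n_tests):
        ranking = totals - P[:, j]        # aggregate score excluding test j
        pos = ranking[P[:, j] == 1]       # codes the held-out test passes
        neg = ranking[P[:, j] == 0]       # codes the held-out test fails
        if len(pos) and len(neg):
            diff = pos[:, None] - neg[None, :]
            # AUC = P(passing code ranked above failing code); ties count half
            scores[j] = (diff > 0).mean() + 0.5 * (diff == 0).mean()
    return scores

# Toy example: codes 0-1 correct, codes 2-3 incorrect;
# tests 0-2 pass only correct code, test 3 passes only incorrect code.
P = np.array([[1, 1, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1]])
print(loo_auc_scores(P))  # -> [1.  1.  1.  0. ]: the adversarial test scores 0
```

A test that passes codes uniformly at random would tend toward a score near 0.5, which is why it should rank rather than merely count: such a test carries no ordering information even though it casts many votes.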
Key Points
- ▸ ACES breaks the circular dependency between test correctness and code correctness by scoring each test on its ability to distinguish correct from incorrect code, rather than trying to decide whether the test itself is correct.
- ▸ The leave-one-out AUC (LOO-AUC) measures how well a held-out test's pass/fail pattern agrees with a ranking of candidates derived from the remaining tests; its expectation is provably proportional to the test's ability to separate correct from incorrect code.
- ▸ ACES is implemented in two variants: ACES-C, which provides closed-form weights, and ACES-O, which iteratively optimizes a differentiable LOO-AUC objective.
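To show how per-test scores might feed into candidate selection, here is a hypothetical weighting rule in the spirit of ACES-C's closed-form variant: center the scores at 0.5 so that uninformative tests contribute nothing and adversarial tests vote negatively, then pick the candidate with the highest weighted pass count. The centering choice and function names are illustrative assumptions, not the paper's exact weights:

```python
import numpy as np

def select_candidate(P, test_scores):
    """Pick the code candidate whose weighted pass count is highest.

    P: (n_codes, n_tests) binary pass matrix.
    test_scores: per-test quality scores in [0, 1] (e.g. LOO-AUC values).
    Centering at 0.5 makes chance-level tests neutral, while tests that
    anti-correlate with correctness (score < 0.5) count *against* the
    codes they pass.
    """
    weights = test_scores - 0.5
    return int(np.argmax(P @ weights))

# Toy example: tests 0-2 reliable (score 1.0), test 3 adversarial (score 0.0).
P = np.array([[1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])
scores = np.array([1.0, 1.0, 1.0, 0.0])
print(select_candidate(P, scores))  # -> 0: passes the most reliable tests
```

Under this rule, a candidate that passes only the adversarial test is ranked below one that passes a single reliable test, whereas an unweighted majority vote could not tell them apart.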
Merits
Strength in Novelty
The article proposes a novel approach to evaluating the correctness of LLM-generated code tests, which breaks the circular dependency of determining test correctness.
Strength in Performance
Both variants of ACES achieve state-of-the-art performance on multiple code generation benchmarks.
Strength in Formalization
The authors formalize the leave-one-out AUC (LOO-AUC) as a novel metric that measures a test's ability to separate correct code from incorrect code.
Demerits
Limitation in Assumption
ACES-C's oracle-approximation guarantee rests on a mild assumption about average test quality; when the generated test suite is predominantly unreliable, this assumption may fail and the closed-form weights may no longer approximate the oracle.
Limitation in Scalability
Leave-one-out evaluation recomputes a ranking for each held-out test, which may become costly for very large candidate and test pools, potentially limiting practical applications despite the claimed negligible overhead on the benchmarks studied.
Limitation in Interpretability
The ACES-O variant may be less interpretable than ACES-C due to its iterative optimization process.
Expert Commentary
ACES sidesteps the circular dependency between test correctness and code correctness: instead of deciding which tests are right, it scores each test by how well its pass/fail pattern agrees with a ranking built from the remaining tests. The strong Pass@k results across multiple code generation benchmarks suggest that this reframing, ranking rather than merely counting votes, is effective in practice. Two caveats temper the contribution. First, ACES-C's guarantee rests on a mild assumption about average test quality, which may not hold when generated test suites are predominantly unreliable; ACES-O relaxes this assumption at the cost of iterative optimization. Second, leave-one-out evaluation may not scale gracefully to very large candidate and test pools, although operating solely on the binary pass matrix keeps the reported overhead negligible. Overall, the article offers a principled and effective approach to selecting LLM-generated code with unreliable tests, which could improve the reliability and trustworthiness of AI-generated code.
Recommendations
- ✓ Further research is needed to investigate the scalability of the leave-one-out evaluation approach to large datasets.
- ✓ The authors should compare against, or combine ACES with, test-scoring approaches that do not rely on leave-one-out evaluation, to clarify when the ranking-based view is necessary.
Sources
Original: arXiv - cs.LG