Inference-Time Code Selection via Symbolic Equivalence Partitioning

arXiv:2604.06485v1 Announce Type: new Abstract: "Best-of-N" selection is a popular inference-time scaling method for code generation using Large Language Models (LLMs). However, to reliably identify correct solutions, existing methods often depend on expensive or stochastic external verifiers. In this paper, we propose Symbolic Equivalence Partitioning, a selection framework that uses symbolic execution to group candidate programs by semantic behavior and select a representative from the dominant functional partition. To improve grouping and selection, we encode domain-specific constraints as Satisfiability Modulo Theories (SMT) assumptions during symbolic execution to reduce path explosion and prevent invalid input searches outside the problem domain. At N=10, our method improves average accuracy over Pass@1 from 0.728 to 0.803 on HumanEval+ and from 0.516 to 0.604 on LiveCodeBench, without requiring any additional LLM inference beyond the initial N candidate generations.

David Cho, Yifan Wang, Fanping Sui, Ananth Grama

Executive Summary

The paper introduces "Symbolic Equivalence Partitioning" (SEP), a novel inference-time code selection framework for Large Language Models (LLMs) that aims to improve the accuracy of "Best-of-N" code generation. Unlike conventional methods relying on external verifiers, SEP leverages symbolic execution to semantically group candidate programs. It identifies the dominant functional partition and selects a representative, enhancing reliability without additional LLM inference. A key innovation is the integration of Satisfiability Modulo Theories (SMT) constraints during symbolic execution to mitigate path explosion and focus on valid problem domains. Empirical results demonstrate significant accuracy improvements on HumanEval+ and LiveCodeBench datasets, suggesting a more robust and efficient approach to code selection.

Key Points

  • SEP uses symbolic execution for semantic grouping of candidate programs generated by LLMs.
  • It selects a representative from the 'dominant functional partition' to enhance code selection reliability.
  • SMT constraints are incorporated into symbolic execution to reduce path explosion and ensure domain-specific validity.
  • The method improves average accuracy over Pass@1 significantly (e.g., 0.728 to 0.803 on HumanEval+) at N=10.
  • SEP achieves these gains without requiring additional LLM inference beyond initial candidate generation.
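The selection loop described in these points can be sketched in a few lines. The sketch below is a simplified stand-in, not the paper's implementation: it replaces symbolic equivalence checking with agreement on a small set of concrete probe inputs (so two candidates share a partition only when they behave identically on every probe), and the function and variable names are hypothetical. The majority-partition step, however, mirrors the selection rule the abstract describes.

```python
from collections import defaultdict

def select_by_equivalence_partitioning(candidates, probe_inputs):
    """Pick a representative from the dominant behavioral partition.

    `candidates` is a list of callables (the N generated programs).
    `probe_inputs` is a list of argument tuples; agreement on all of
    them is used here as a cheap proxy for symbolic equivalence.
    """
    partitions = defaultdict(list)
    for program in candidates:
        signature = []
        for args in probe_inputs:
            try:
                signature.append(("ok", program(*args)))
            except Exception as exc:
                # Crashing candidates fall into their own partitions,
                # keyed by the exception type they raise.
                signature.append(("err", type(exc).__name__))
        partitions[tuple(signature)].append(program)
    # The dominant functional partition is the largest group;
    # any member represents it, so return the first.
    dominant = max(partitions.values(), key=len)
    return dominant[0]

# Example: two of the three candidates implement absolute value,
# so the dominant partition wins over the lone identity function.
cands = [lambda x: abs(x), lambda x: x if x >= 0 else -x, lambda x: x]
best = select_by_equivalence_partitioning(cands, [(3,), (-2,)])
```

Note that because selection only compares candidates against each other, no ground-truth tests or extra LLM calls are needed, which is the efficiency property the paper emphasizes.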

Merits

Semantic Robustness

Grouping candidates via symbolic execution captures program behavior more faithfully than syntactic comparison or finite test suites, since it compares candidates over entire regions of the input space rather than at sampled points, leading to more reliable selections.

Efficiency

The method avoids expensive external verifiers and additional LLM inference, making it computationally more efficient than many existing Best-of-N approaches.

Novel Application of SMT

Encoding domain-specific constraints as SMT assumptions is an elegant solution to manage complexity in symbolic execution, directing analysis towards relevant problem spaces and preventing invalid input exploration.
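To make the idea concrete, here is a hypothetical SMT-LIB fragment (illustrative only, not taken from the paper) for a task whose input is an index `n` into a buffer of length `len`. The domain constraints are added as standing assumptions, so any execution path whose condition contradicts them is reported unsatisfiable and pruned:

```smt
; Hypothetical domain constraints: n must be a valid index into
; a buffer of positive length len.
(declare-const n Int)
(declare-const len Int)
(assert (>= len 1))
(assert (and (>= n 0) (< n len)))

; Per-branch feasibility query, e.g. for the branch condition n > len:
(push)
(assert (> n len))
(check-sat)   ; unsat under the assumptions, so the path is pruned
(pop)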

Tangible Performance Gains

The reported accuracy improvements on standard benchmarks like HumanEval+ and LiveCodeBench are substantial and demonstrate the practical efficacy of SEP.

Demerits

Scalability of Symbolic Execution

While SMT helps, symbolic execution remains notoriously resource-intensive for complex or large programs, potentially limiting its practical applicability to longer generated code snippets.

Completeness of Domain Constraints

The effectiveness of SMT relies heavily on the completeness and correctness of the encoded domain-specific constraints, which may be challenging to define perfectly for all problem types.

Definition of 'Dominant Functional Partition'

The paper implies a clear method for identifying the dominant partition, but the robustness of this identification, especially in cases of ambiguous or nearly equally sized partitions, warrants further scrutiny.

Generalizability Across Languages/Paradigms

The approach's performance for languages beyond those typically used in HumanEval+ (e.g., Python) or for different programming paradigms (e.g., functional) is not explicitly detailed.

Expert Commentary

This paper presents a compelling advancement in the critical area of LLM-driven code generation, moving beyond heuristic or probabilistic selection methods towards a more principled, semantic approach. The integration of symbolic execution with SMT solvers is particularly astute, addressing one of symbolic execution's primary practical impediments, path explosion, by grounding the analysis in problem-specific constraints. While the scalability of symbolic execution for arbitrary code complexity remains a perennial challenge, the targeted application here, combined with domain-specific knowledge, significantly enhances its utility. The reported accuracy gains are impressive and suggest that investing in more sophisticated semantic analysis at inference time can yield substantial dividends. Future work should delve deeper into the robustness of 'dominant partition' identification and explore the framework's adaptability to different programming paradigms and problem domains, particularly those less amenable to straightforward SMT encoding. This work paves the way for more trustworthy and efficient AI-assisted software engineering.

Recommendations

  • Conduct a thorough empirical study on the computational overhead of symbolic execution and SMT solving for varying code complexities and problem types to better understand practical scalability limits.
  • Investigate alternative or hybrid methods for defining and deriving domain-specific constraints, perhaps using LLMs themselves to suggest or refine SMT assumptions from problem descriptions.
  • Explore the framework's applicability to code generation for embedded systems or safety-critical domains where formal correctness is paramount, potentially integrating with existing verification toolchains.
  • Provide a detailed methodological exposition on how the 'dominant functional partition' is robustly identified, especially in scenarios with multiple semantically similar but distinct partitions.

Sources

Original: arXiv - cs.LG