AST-PAC: AST-guided Membership Inference for Code
arXiv:2602.13240v1 Announce Type: new Abstract: Code Large Language Models are frequently trained on massive datasets containing restrictively licensed source code. This creates urgent data governance and copyright challenges. Membership Inference Attacks (MIAs) can serve as an auditing mechanism to detect unauthorized data usage in models. While attacks like the Loss Attack provide a baseline, more sophisticated methods like Polarized Augment Calibration (PAC) remain underexplored in the code domain. This paper presents an exploratory study evaluating these methods on 3B--7B parameter code models. We find that while PAC generally outperforms the Loss baseline, its effectiveness relies on augmentation strategies that disregard the rigid syntax of code, leading to performance degradation on larger, complex files. To address this, we introduce AST-PAC, a domain-specific adaptation that uses Abstract Syntax Tree (AST)-based perturbations to generate syntactically valid calibration samples. Preliminary results indicate that AST-PAC improves as syntactic size grows, precisely where PAC degrades, but it under-mutates small files and underperforms on alphanumeric-rich code. Overall, the findings motivate future work on syntax-aware and size-adaptive calibration as a prerequisite for reliable provenance auditing of code language models.
Executive Summary
The article 'AST-PAC: AST-guided Membership Inference for Code' examines the data governance and copyright challenges posed by code Large Language Models (LLMs) trained on massive datasets that may include restrictively licensed source code. It evaluates Membership Inference Attacks (MIAs) as a means of auditing unauthorized data usage, comparing the Loss Attack baseline with the more sophisticated Polarized Augment Calibration (PAC). The study introduces AST-PAC, a domain-specific adaptation of PAC that uses Abstract Syntax Tree (AST)-based perturbations to generate syntactically valid calibration samples. Preliminary results show that AST-PAC improves on the larger, complex files where PAC degrades, but it under-mutates small files and underperforms on alphanumeric-rich code. The findings highlight the need for syntax-aware and size-adaptive calibration methods to enable reliable provenance auditing of code LLMs.
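The paper's exact perturbation scheme is not detailed in this abstract; as a minimal illustration of the underlying idea, a sketch using Python's built-in `ast` module might rename local variables while leaving the syntax tree's shape, and therefore the sample's syntactic validity, intact (names such as `RenameLocals` and `perturb` are illustrative, not from the paper):

```python
import ast

class RenameLocals(ast.NodeTransformer):
    """Toy AST perturbation: consistently rename function parameters and
    local variables, preserving the structure of the syntax tree."""

    def __init__(self):
        self.mapping = {}  # original name -> fresh name

    def _fresh(self, name):
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_arg(self, node):      # function parameters
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):     # variable reads and writes
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._fresh(node.id)
        return node

def perturb(source: str) -> str:
    """Return a syntactically valid variant of `source`."""
    tree = RenameLocals().visit(ast.parse(source))
    return ast.unparse(tree)        # requires Python 3.9+

src = "def add(x, y):\n    total = x + y\n    return total"
variant = perturb(src)
ast.parse(variant)                  # would raise if the variant were invalid
```

Unlike token-level augmentations, a transformation like this cannot produce an unparseable calibration sample, which is the property the abstract credits for AST-PAC's gains on large, complex files.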
Key Points
- ▸ Membership Inference Attacks (MIAs) can serve as an auditing mechanism for detecting unauthorized training-data usage in code LLMs.
- ▸ Polarized Augment Calibration (PAC) generally outperforms the Loss Attack baseline, but its augmentations disregard code's rigid syntax, degrading performance on larger, complex files.
- ▸ AST-PAC, an AST-based adaptation of PAC, improves performance on larger, complex files but under-mutates small files and underperforms on alphanumeric-rich code.
- ▸ The study underscores the need for syntax-aware and size-adaptive calibration methods for reliable provenance auditing.
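For context, the two attack families compared above can be sketched in a few lines. The following is a simplified, hypothetical version: the threshold and the calibration formula are illustrative stand-ins, not the paper's or PAC's exact formulation.

```python
def avg_nll(token_logprobs):
    """Average negative log-likelihood the model assigns to a sample."""
    return -sum(token_logprobs) / len(token_logprobs)

def loss_attack(token_logprobs, threshold=2.0):
    """Loss Attack baseline: flag 'member' when loss is unusually low,
    since training samples tend to be fit better. Threshold is hypothetical
    and would need calibration in practice."""
    return avg_nll(token_logprobs) < threshold

def calibrated_score(target_logprobs, augmented_logprobs_list):
    """PAC-style calibration (simplified): compare the sample's loss with
    the mean loss over its perturbed variants. A member's loss typically
    rises sharply under perturbation, so a larger gap suggests membership."""
    aug = [avg_nll(lp) for lp in augmented_logprobs_list]
    return sum(aug) / len(aug) - avg_nll(target_logprobs)

# Illustrative per-token log-probs: a memorized member vs. an unseen file.
member = [-0.1, -0.2, -0.05, -0.1]
non_member = [-3.2, -2.8, -4.1, -3.5]
```

The calibrated score makes clear why augmentation quality matters: if a perturbation breaks the code's syntax, the variant's loss is inflated for members and non-members alike, washing out the gap the attack depends on.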
Merits
Innovative Approach
The introduction of AST-PAC represents a novel approach to addressing the limitations of PAC in the face of code's rigid syntax.
Practical Relevance
The study's focus on real-world challenges in data governance and copyright issues makes it highly relevant to both academic and industry stakeholders.
Comprehensive Evaluation
The article provides a thorough evaluation of different MIA methods, offering a balanced view of their strengths and weaknesses.
Demerits
Limited Scope
The study is preliminary and limited in scope, focusing primarily on 3B--7B parameter code models, which may not be representative of all code LLMs.
Performance Limitations
AST-PAC's under-mutation of small files and underperformance on alphanumeric-rich code highlight significant limitations that need to be addressed.
Lack of Extensive Testing
The preliminary nature of the results suggests that more extensive testing and validation are required to confirm the robustness of AST-PAC.
Expert Commentary
The article 'AST-PAC: AST-guided Membership Inference for Code' makes a significant contribution to the field of AI ethics and data governance by addressing the critical issue of unauthorized data usage in code Large Language Models. The introduction of AST-PAC as a domain-specific adaptation of PAC is a noteworthy advancement, demonstrating the potential for syntax-aware calibration methods to enhance the reliability of provenance auditing. However, the study's preliminary nature and the identified limitations of AST-PAC underscore the need for further research and development. The findings have important implications for both practical applications and policy-making, highlighting the necessity for robust data governance policies and regulatory frameworks. The article's balanced evaluation of different MIA methods provides valuable insights for academics and industry professionals alike, making it a compelling read for those interested in the intersection of AI, copyright, and data privacy.
Recommendations
- ✓ Future research should focus on expanding the scope of the study to include a wider range of code LLMs and more diverse datasets to validate the robustness of AST-PAC.
- ✓ Developers of code LLMs should explore syntax-aware and size-adaptive calibration methods to improve the reliability of provenance auditing and ensure compliance with data governance standards.