AST-PAC: AST-guided Membership Inference for Code
arXiv:2602.13240v1 Announce Type: new Abstract: Code Large Language Models are frequently trained on massive datasets containing restrictively licensed source code. This creates urgent data governance and copyright challenges. Membership Inference Attacks (MIAs) can serve as an auditing mechanism to detect unauthorized data usage in models. While attacks like the Loss Attack provide a baseline, more sophisticated methods like Polarized Augment Calibration (PAC) remain underexplored in the code domain. This paper presents an exploratory study evaluating these methods on 3B--7B parameter code models. We find that while PAC generally outperforms the Loss baseline, its effectiveness relies on augmentation strategies that disregard the rigid syntax of code, leading to performance degradation on larger, complex files. To address this, we introduce AST-PAC, a domain-specific adaptation that uses Abstract Syntax Tree (AST)-based perturbations to generate syntactically valid calibration samples. Preliminary results indicate that AST-PAC improves as syntactic size grows, precisely where PAC degrades, but it under-mutates small files and underperforms on alphanumeric-rich code. Overall, the findings motivate future work on syntax-aware and size-adaptive calibration as a prerequisite for reliable provenance auditing of code language models.
Executive Summary
The article 'AST-PAC: AST-guided Membership Inference for Code' examines the data governance and copyright challenges posed by code Large Language Models (LLMs) trained on massive datasets that may include restrictively licensed source code. It evaluates Membership Inference Attacks (MIAs) as a means of auditing unauthorized data usage, comparing the Loss Attack baseline with the more sophisticated Polarized Augment Calibration (PAC). The study introduces AST-PAC, a domain-specific adaptation of PAC that uses Abstract Syntax Tree (AST)-based perturbations to generate syntactically valid calibration samples. Preliminary results show that AST-PAC improves on the larger, complex files where PAC degrades, but it under-mutates small files and underperforms on alphanumeric-rich code. The findings highlight the need for syntax-aware and size-adaptive calibration methods to enable reliable provenance auditing of code LLMs.
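The paper's exact perturbation scheme is not detailed in this abstract; as a minimal illustration of the underlying idea, a sketch using Python's built-in `ast` module might rename local variables while leaving the syntax tree's shape, and therefore the sample's syntactic validity, intact (names such as `RenameLocals` and `perturb` are illustrative, not from the paper):

```python
import ast

class RenameLocals(ast.NodeTransformer):
    """Toy AST perturbation: consistently rename function parameters and
    local variables, preserving the structure of the syntax tree."""

    def __init__(self):
        self.mapping = {}  # original name -> fresh name

    def _fresh(self, name):
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_arg(self, node):      # function parameters
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):     # variable reads and writes
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._fresh(node.id)
        return node

def perturb(source: str) -> str:
    """Return a syntactically valid variant of `source`."""
    tree = RenameLocals().visit(ast.parse(source))
    return ast.unparse(tree)        # requires Python 3.9+

src = "def add(x, y):\n    total = x + y\n    return total"
variant = perturb(src)
ast.parse(variant)                  # would raise if the variant were invalid
```

Unlike token-level augmentations, a transformation like this cannot produce an unparseable calibration sample, which is the property the abstract credits for AST-PAC's gains on large, complex files.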
Key Points
- ▸ Membership Inference Attacks (MIAs) can serve as an auditing mechanism for detecting unauthorized training-data usage in code LLMs.
- ▸ Polarized Augment Calibration (PAC) generally outperforms the Loss Attack baseline, but its augmentations disregard code's rigid syntax, degrading performance on larger, complex files.
- ▸ AST-PAC, an AST-based adaptation of PAC, improves performance on larger, complex files but under-mutates small files and underperforms on alphanumeric-rich code.
- ▸ The study underscores the need for syntax-aware and size-adaptive calibration methods for reliable provenance auditing.
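For context, the two attack families compared above can be sketched in a few lines. The following is a simplified, hypothetical version: the threshold and the calibration formula are illustrative stand-ins, not the paper's or PAC's exact formulation.

```python
def avg_nll(token_logprobs):
    """Average negative log-likelihood the model assigns to a sample."""
    return -sum(token_logprobs) / len(token_logprobs)

def loss_attack(token_logprobs, threshold=2.0):
    """Loss Attack baseline: flag 'member' when loss is unusually low,
    since training samples tend to be fit better. Threshold is hypothetical
    and would need calibration in practice."""
    return avg_nll(token_logprobs) < threshold

def calibrated_score(target_logprobs, augmented_logprobs_list):
    """PAC-style calibration (simplified): compare the sample's loss with
    the mean loss over its perturbed variants. A member's loss typically
    rises sharply under perturbation, so a larger gap suggests membership."""
    aug = [avg_nll(lp) for lp in augmented_logprobs_list]
    return sum(aug) / len(aug) - avg_nll(target_logprobs)

# Illustrative per-token log-probs: a memorized member vs. an unseen file.
member = [-0.1, -0.2, -0.05, -0.1]
non_member = [-3.2, -2.8, -4.1, -3.5]
```

The calibrated score makes clear why augmentation quality matters: if a perturbation breaks the code's syntax, the variant's loss is inflated for members and non-members alike, washing out the gap the attack depends on.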
Merits
Innovative Approach
The introduction of AST-PAC represents a novel approach to addressing the limitations of PAC in the face of code's rigid syntax.
Practical Relevance
The study's focus on real-world challenges in data governance and copyright issues makes it highly relevant to both academic and industry stakeholders.
Comprehensive Evaluation
The article provides a thorough evaluation of different MIA methods, offering a balanced view of their strengths and weaknesses.
Demerits
Limited Scope
The study is preliminary and limited in scope, focusing primarily on 3B--7B parameter code models, which may not be representative of all code LLMs.
Performance Limitations
AST-PAC's under-mutation of small files and underperformance on alphanumeric-rich code highlight significant limitations that need to be addressed.
Lack of Extensive Testing
The preliminary nature of the results suggests that more extensive testing and validation are required to confirm the robustness of AST-PAC.
Expert Commentary
The article 'AST-PAC: AST-guided Membership Inference for Code' makes a significant contribution to the field of AI ethics and data governance by addressing the critical issue of unauthorized data usage in code Large Language Models. The introduction of AST-PAC as a domain-specific adaptation of PAC is a noteworthy advancement, demonstrating the potential for syntax-aware calibration methods to enhance the reliability of provenance auditing. However, the study's preliminary nature and the identified limitations of AST-PAC underscore the need for further research and development. The findings have important implications for both practical applications and policy-making, highlighting the necessity for robust data governance policies and regulatory frameworks. The article's balanced evaluation of different MIA methods provides valuable insights for academics and industry professionals alike, making it a compelling read for those interested in the intersection of AI, copyright, and data privacy.
Recommendations
- ✓ Future research should focus on expanding the scope of the study to include a wider range of code LLMs and more diverse datasets to validate the robustness of AST-PAC.
- ✓ Developers of code LLMs should explore syntax-aware and size-adaptive calibration methods to improve the reliability of provenance auditing and ensure compliance with data governance standards.