Same Graph, Different Likelihoods: Calibration of Autoregressive Graph Generators via Permutation-Equivalent Encodings

arXiv:2604.05613v1 Announce Type: new Abstract: Autoregressive graph generators define likelihoods via a sequential construction process, but these likelihoods are only meaningful if they are consistent across all linearizations of the same graph. Segmented Eulerian Neighborhood Trails (SENT), a recent linearization method, converts graphs into sequences that can be perfectly decoded and efficiently processed by language models, but admit multiple equivalent linearizations of the same graph. We quantify violations in assigned negative log-likelihood (NLL) using the coefficient of variation across equivalent linearizations, which we call Linearization Uncertainty (LU). Training transformers under four linearization strategies on two datasets, we show that biased orderings achieve lower NLL on their native order but exhibit expected calibration error (ECE) two orders of magnitude higher under random permutation, indicating that these models have learned their training linearization rather than the underlying graph. On the molecular graph benchmark QM9, NLL for generated graphs is negatively correlated with molecular stability (AUC $=0.43$), while LU achieves AUC $=0.85$, suggesting that permutation-based evaluation provides a more reliable quality check for generated molecules. Code is available at https://github.com/lauritsf/linearization-uncertainty

Executive Summary

The article presents a critical examination of autoregressive graph generators, which define likelihoods via sequential graph construction but whose likelihoods are only meaningful if they are consistent across all equivalent linearizations of the same graph. The authors introduce Linearization Uncertainty (LU), a metric that quantifies inconsistency in assigned negative log-likelihood (NLL) across permutation-equivalent encodings, computed as the coefficient of variation of NLL over equivalent linearizations. Through experiments on transformer models trained under four linearization strategies and evaluated on two datasets, the study demonstrates that biased orderings yield lower NLL on their native order but suffer Expected Calibration Error (ECE) two orders of magnitude higher under random permutation, indicating that the models have memorized their training linearization rather than learned the underlying graph distribution. Notably, on the molecular graph benchmark QM9, LU predicts molecular stability well (AUC = 0.85), whereas NLL is in fact negatively correlated with stability (AUC = 0.43), suggesting that permutation-based evaluation is a more reliable quality check for generated molecules.

Key Points

  • Autoregressive graph generators assign likelihoods to graphs via sequential construction, but these likelihoods may vary across equivalent linearizations of the same graph.
  • Linearization Uncertainty (LU) quantifies inconsistency in negative log-likelihood (NLL) due to permutation-equivalent encodings, measured as the coefficient of variation across linearizations.
  • Biased training linearizations result in models that achieve low NLL on their native order but exhibit high Expected Calibration Error (ECE) under random permutation, indicating overfitting to the training linearization rather than the graph distribution.
  • On the QM9 molecular benchmark, LU (AUC = 0.85) is a more reliable indicator of molecular stability than NLL (AUC = 0.43), highlighting the importance of permutation-invariant evaluation.
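The LU metric described in these points can be sketched directly: given the NLLs a model assigns to several equivalent linearizations of one graph, LU is their coefficient of variation. This is a minimal illustration under that definition only; the paper's sampling strategy and model interface are not specified here.

```python
import numpy as np

def linearization_uncertainty(nlls):
    """Coefficient of variation (std / mean) of the NLLs a model assigns
    to equivalent linearizations of the same graph. Near zero means the
    model scores the graph consistently regardless of encoding order."""
    nlls = np.asarray(nlls, dtype=float)
    return float(nlls.std() / nlls.mean())

# Toy illustration: a calibrated model scores all orderings about
# equally; a biased model scores only its native order cheaply.
consistent = [12.1, 12.0, 12.2, 11.9]
biased = [8.0, 15.0, 14.5, 16.0]
assert linearization_uncertainty(consistent) < linearization_uncertainty(biased)
```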

Merits

Novelty of Linearization Uncertainty (LU)

The introduction of LU as a metric to quantify inconsistencies in likelihood assignments across permutation-equivalent graph linearizations is a significant methodological contribution. It addresses a critical gap in the evaluation of autoregressive graph generators, where traditional NLL metrics may be misleading due to their dependency on linearization order.
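Concretely, for a graph $G$ with sampled equivalent linearizations $s_1, \dots, s_k$ and model negative log-likelihoods $\mathrm{NLL}(s_i)$, the coefficient-of-variation form of the metric reads as follows (the paper's exact estimator choices, e.g. sample count and variance convention, are not specified here):

$$\mathrm{LU}(G) = \frac{\sigma}{\mu},\qquad \mu = \frac{1}{k}\sum_{i=1}^{k}\mathrm{NLL}(s_i),\qquad \sigma = \sqrt{\frac{1}{k}\sum_{i=1}^{k}\bigl(\mathrm{NLL}(s_i)-\mu\bigr)^{2}}.$$

A consistent model assigns nearly identical NLL to every $s_i$, driving $\sigma$, and hence LU, toward zero.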

Empirical Rigor and Benchmarking

The study conducts extensive experiments across four linearization strategies and two datasets, providing robust empirical evidence for the limitations of biased linearizations and the superiority of LU as an evaluation metric. The inclusion of molecular stability as an external validation metric further strengthens the analysis.

Theoretical Insight into Model Overfitting

The article offers a nuanced understanding of how autoregressive models can overfit to specific linearization biases rather than learning the underlying graph distribution. This insight has broader implications for machine learning models trained on structured data with inherent symmetries.

Demerits

Limited Generalizability of Findings

While the study presents compelling evidence on molecular graphs (QM9) and another unspecified dataset, the generalizability of these findings to other types of graphs (e.g., social networks, citation graphs) remains untested. The structural properties of molecular graphs may not fully capture the challenges posed by other graph domains.

Computational Overhead of LU Calculation

The computation of LU involves evaluating likelihoods across multiple linearizations of the same graph, which may introduce significant computational overhead, particularly for large graphs. This could limit the practical applicability of LU in high-throughput scenarios.
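The overhead can be bounded by Monte Carlo estimation: sample a small, fixed number of linearizations per graph rather than enumerating all of them. Below is a sketch under assumed interfaces — `nll_fn` stands in for a model likelihood call, and `random_linearization` is a toy node-relabeling stand-in, not the actual SENT procedure.

```python
import numpy as np

def random_linearization(edges, rng):
    # Toy stand-in for sampling one equivalent encoding: relabel nodes
    # under a random permutation, then canonicalize the edge list.
    nodes = sorted({u for e in edges for u in e})
    perm = [int(x) for x in rng.permutation(nodes)]
    relabel = dict(zip(nodes, perm))
    return tuple(sorted(tuple(sorted((relabel[u], relabel[v])))
                        for u, v in edges))

def estimate_lu(nll_fn, edges, num_samples=8, seed=0):
    # Evaluate only `num_samples` sampled linearizations, so the cost
    # grows linearly in the sample count, not in the number of orderings.
    rng = np.random.default_rng(seed)
    nlls = np.array([nll_fn(random_linearization(edges, rng))
                     for _ in range(num_samples)])
    return float(nlls.std() / nlls.mean())

# Toy NLL that depends on the encoding, so the estimate can be nonzero.
toy_nll = lambda seq: 5.0 + sum(v for _, v in seq)
```

The sample count trades estimator variance against throughput; a handful of permutations per graph already distinguishes order-consistent from order-biased scoring in the toy setting.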

Dependence on Language Model Architecture

The study focuses on transformer language models. The findings may not fully extend to other autoregressive graph generators (e.g., RNN- or GNN-based architectures) or to non-autoregressive approaches, which could behave differently under permutation-equivalent encodings.

Expert Commentary

This article makes a compelling case for reassessing how we evaluate autoregressive graph generators. The introduction of Linearization Uncertainty (LU) is particularly noteworthy, as it directly addresses a fundamental flaw in current practice: the assumption that negative log-likelihood (NLL) is a reliable indicator of model performance when likelihoods are assigned in a sequence-dependent manner. The empirical evidence is robust, demonstrating that models can achieve artificially low NLL by overfitting to their training linearizations, a phenomenon that only becomes apparent under permutation-based evaluation.

The study's focus on molecular graphs is timely, given the growing interest in AI-driven molecular design, where the stability and validity of generated molecules are paramount. However, the broader implications extend to any domain where autoregressive models are applied to structured data with symmetries, such as protein folding or social network analysis. The article also implicitly critiques the over-reliance on NLL as a universal metric in generative modeling, a critique that resonates with recent discussions in the language modeling community.

Moving forward, researchers and practitioners should prioritize permutation-invariant evaluation frameworks to ensure that models are genuinely learning the underlying data distribution rather than exploiting artifacts of their training setup.

Recommendations

  • Adopt LU as a standard evaluation metric for autoregressive graph generators, alongside or in place of NLL, to ensure robustness against permutation biases.
  • Develop permutation-equivariant or permutation-invariant architectures for graph generation, such as Graph Transformers with explicit symmetry handling, to reduce reliance on biased linearization strategies.
  • Conduct additional studies to validate LU across diverse graph domains (e.g., social networks, citation graphs) to assess its generalizability and identify domain-specific nuances.
  • Incorporate permutation-based validation into the training pipeline for autoregressive graph generators, using techniques such as data augmentation with random linearizations or curriculum learning to expose models to diverse orderings.
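The last recommendation can be sketched as a simple data-augmentation loop: serve the model a freshly sampled linearization of every graph each epoch. Here `linearize` is a placeholder for any order-randomizable encoder (e.g., a SENT-style serializer); this is an illustrative sketch, not the paper's actual pipeline.

```python
import random

def augmented_sequences(graphs, linearize, epochs=3, seed=0):
    """Yield one freshly sampled linearization of every graph per epoch,
    so the model sees many equivalent orderings of each graph and cannot
    latch onto a single fixed one."""
    rng = random.Random(seed)
    for _ in range(epochs):
        order = list(graphs)
        rng.shuffle(order)  # also vary the graph order across epochs
        for g in order:
            yield linearize(g, rng)

# Toy usage: a "linearization" that just reshuffles an edge list.
def toy_linearize(edges, rng):
    edges = list(edges)
    rng.shuffle(edges)
    return edges

seqs = list(augmented_sequences([[(0, 1), (1, 2)], [(0, 1)]], toy_linearize))
assert len(seqs) == 6  # 2 graphs x 3 epochs
```

Random-linearization augmentation is the cheapest of the listed options, since it changes only the data pipeline; curriculum schedules or symmetry-aware architectures require deeper changes to training or the model itself.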

Sources

Original: arXiv - cs.LG