Bivariate Causal Discovery Using Rate-Distortion MDL: An Information Dimension Approach

Tiago Brogueira, Mário A. T. Figueiredo · April 8, 2026

#cs.LG #stat.ML

arXiv:2604.05829v1. Abstract: Approaches to bivariate causal discovery based on the minimum description length (MDL) principle approximate the (uncomputable) Kolmogorov complexity of the models in each causal direction, selecting the one with the lower total complexity. The premise is that nature's mechanisms are simpler in their true causal order. Inherently, the description length (complexity) in each direction includes the description of the cause variable and that of the causal mechanism. In this work, we argue that current state-of-the-art MDL-based methods do not correctly address the problem of estimating the description length of the cause variable, effectively leaving the decision to the description length of the causal mechanism. Based on rate-distortion theory, we propose a new way to measure the description length of the cause, corresponding to the minimum rate required to achieve a distortion level representative of the underlying distribution. This distortion level is deduced using rules from histogram-based density estimation, while the rate is computed using the related concept of information dimension, based on an asymptotic approximation. Combining it with a traditional approach for the causal mechanism, we introduce a new bivariate causal discovery method, termed rate-distortion MDL (RDMDL). We show experimentally that RDMDL achieves competitive performance on the Tübingen dataset. All the code and experiments are publicly available at github.com/tiagobrogueira/Causal-Discovery-In-Exchangeable-Data.

Executive Summary

The article introduces a novel bivariate causal discovery method, Rate-Distortion MDL (RDMDL), which addresses limitations in existing Minimum Description Length (MDL)-based approaches to causal inference. The authors argue that prior MDL methods inadequately estimate the description length of the cause variable, skewing decisions toward the complexity of the causal mechanism. RDMDL leverages rate-distortion theory to more accurately measure the description length of the cause variable, using information dimension and histogram-based density estimation. The proposed method is validated on the Tübingen dataset, demonstrating competitive performance. The work bridges information theory, statistical learning, and causal inference, offering a theoretically grounded improvement over existing techniques.
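
The comparison summarized above can be written in two-part form. The notation below (L, R_X, D, d(X)) is introduced here for illustration and is not taken from the paper. The first two lines restate the MDL decision rule from the abstract. The last two lines give the standard definitions of the rate-distortion function under squared-error distortion and of the Rényi information dimension, together with their known leading-order relation at small distortion. How exactly RDMDL chooses the distortion level and evaluates the rate is not specified in the abstract, so reading L(X) as R_X(D) at a histogram-derived D is an assumption of this sketch.

```latex
\begin{align}
L_{X \to Y} &= L(X) + L(Y \mid X), &
L_{Y \to X} &= L(Y) + L(X \mid Y), \\
\text{infer } X \to Y &\iff L_{X \to Y} < L_{Y \to X}, \\
R_X(D) &= \min_{p(\hat{x} \mid x)\,:\; \mathbb{E}[(X - \hat{X})^2] \le D} I(X; \hat{X}), \\
d(X) &= \lim_{m \to \infty} \frac{H(\lfloor m X \rfloor)}{\log m}, &
R_X(D) &\approx \frac{d(X)}{2} \log \frac{1}{D} \quad \text{as } D \to 0 .
\end{align}
```

If L(X) and L(Y) are estimated poorly or cancel out, the inequality in the second line is decided by the mechanism terms alone, which is the weakness the authors attribute to earlier MDL methods.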

Key Points

▸ Current MDL-based causal discovery methods fail to properly account for the description length of the cause variable, focusing instead on the causal mechanism.
▸ RDMDL introduces a rate-distortion framework to estimate the description length of the cause variable using information dimension and distortion levels derived from histogram-based density estimation.
▸ The method achieves competitive results on the Tübingen dataset, validating its practical efficacy.
▸ The theoretical foundation combines rate-distortion theory with asymptotic approximations of information dimension.
▸ The approach is publicly available, including code and experiments, facilitating reproducibility.
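
The second key point can be made concrete with a small sketch. It is not the authors' RDMDL code. It assumes the Freedman-Diaconis rule as the histogram rule that sets the resolution, and it uses the plug-in entropy of the quantized sample as a stand-in for the information-dimension-based rate. All function names are hypothetical.

```python
import numpy as np


def freedman_diaconis_width(x):
    """Bin width from the Freedman-Diaconis rule: 2 * IQR / n^(1/3).

    Assumption: this rule stands in for whichever histogram rule the
    paper actually uses to choose the distortion level.
    """
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    width = 2.0 * (q75 - q25) / len(x) ** (1.0 / 3.0)
    if width <= 0.0:
        raise ValueError("interquartile range is zero; rule is undefined")
    return width


def quantized_entropy_bits(x, bin_width):
    """Plug-in entropy (bits per sample) of x after uniform quantization."""
    x = np.asarray(x, dtype=float)
    cells = np.floor((x - x.min()) / bin_width).astype(np.int64)
    _, counts = np.unique(cells, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def cause_description_length_bits(x):
    """Illustrative description length of a candidate cause: n * H([X]_width)."""
    width = freedman_diaconis_width(x)
    return len(x) * quantized_entropy_bits(x, width)


# Usage: compare the two candidate causes of a synthetic pair.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)
y = x ** 3 + 0.05 * rng.normal(size=1000)
print("L(X) in bits:", cause_description_length_bits(x))
print("L(Y) in bits:", cause_description_length_bits(y))
```

In a full method these cause terms would be added to a mechanism term for each direction before the two totals are compared; the mechanism term is deliberately left out here.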

Merits

Theoretical Rigor

The article presents a sophisticated theoretical framework combining rate-distortion theory, information dimension, and MDL to address a longstanding problem in causal discovery.

Empirical Validation

The method is empirically validated on a well-established dataset (Tübingen), demonstrating competitive performance and practical applicability.

Reproducibility

The authors provide open-source code and experiments, ensuring transparency and enabling further research and validation.

Demerits

Assumptions and Approximations

The method relies on asymptotic approximations of information dimension and histogram-based distortion levels, which may introduce limitations in finite-sample scenarios or high-dimensional settings.
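
The finite-sample concern can be illustrated with a short experiment. This is a sketch under stated assumptions, not the estimator used in the paper: it approximates the information dimension by the slope of the plug-in quantized entropy against log2 of the resolution, and the function names are hypothetical. At coarse resolutions the slope for a continuous variable is close to 1. Once the number of cells far exceeds the sample size, the entropy saturates near log2(n) and the slope collapses toward 0.

```python
import numpy as np


def entropy_at_resolution(x, m):
    """Plug-in entropy (bits) of floor(m * x), i.e. x quantized to cells of width 1/m."""
    cells = np.floor(m * np.asarray(x, dtype=float)).astype(np.int64)
    _, counts = np.unique(cells, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_dimension_slope(x, m_low, m_high):
    """Slope of the quantized entropy against log2(m) between two resolutions."""
    h_low = entropy_at_resolution(x, m_low)
    h_high = entropy_at_resolution(x, m_high)
    return (h_high - h_low) / (np.log2(m_high) - np.log2(m_low))


rng = np.random.default_rng(0)
n = 2000
continuous = rng.uniform(0.0, 1.0, size=n)            # information dimension 1
discrete = rng.integers(0, 4, size=n).astype(float)   # information dimension 0

# Resolutions well below n: the slope tracks the true dimension.
coarse = information_dimension_slope(continuous, 4, 32)
# Resolutions far above n: almost every sample sits alone in its cell.
too_fine = information_dimension_slope(continuous, 2**14, 2**17)
# Purely discrete data: refining the grid adds no entropy.
atoms = information_dimension_slope(discrete, 4, 32)

print("continuous, coarse grid:", coarse)
print("continuous, overly fine grid:", too_fine)
print("discrete:", atoms)
```

The usable range of resolutions is therefore bounded by the sample size, which is one reason an asymptotic formula may behave differently on small datasets.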

Bivariate Focus

The analysis is restricted to bivariate causal discovery, leaving open questions about scalability and applicability to multivariate or high-dimensional causal systems.

Dependence on Dataset Quality

Performance is validated on a specific dataset (Tübingen), which may not fully capture the diversity of real-world causal relationships.

Expert Commentary

The article presents a compelling, theoretically grounded approach to a limitation in MDL-based causal discovery: the estimation of the description length of the cause variable. Integrating rate-distortion theory with information dimension offers a fresh way to measure the complexity of the cause in causal models. The empirical validation on the Tübingen dataset shows competitive performance and supports the method's practical utility, although it rests on a single benchmark. The reliance on asymptotic approximations and histogram-based density estimation may also pose challenges in finite-sample or high-dimensional settings. Future work could extend the method to multivariate causal systems and test its robustness in real-world applications. The open-source release of the code and experiments is commendable and supports transparency in causal discovery research. Overall, this work is a valuable advance at the intersection of information theory, statistical learning, and causal inference.

Recommendations

✓ Future research should explore the scalability of RDMDL to multivariate causal discovery, extending its applicability beyond bivariate settings.
✓ The method's performance should be further validated on additional datasets, particularly those with high-dimensional or noisy data, to assess its robustness in diverse real-world scenarios.
✓ Researchers should investigate the theoretical properties of the information dimension approximation in finite-sample settings to better understand its limitations and potential refinements.
✓ Collaborations between theorists and practitioners in applied domains (e.g., economics, biology) could help refine the method and demonstrate its utility in solving real-world causal inference problems.

Sources

Original: arXiv - cs.LG

Bivariate Causal Discovery Using Rate-Distortion MDL: An Information Dimension Approach
