A Permutation-Based Kernel Conditional Independence Test ‡ Gary Doran ‡ Case Western Reserve University Step I: Learning a Permutation Conditional independence tests are required for algorithms (e.g., the PC algorithm) that attempt to infer Bayesian network structure or causal DAGs from data. We develop a test that works with continuous, real-valued data, only makes weak assumptions about the joint distribution over variables, works for high-dimensional datasets, and is wellcalibrated (has appropriate Type I error rates when H0 holds). Weak High Assumptions Dimensionality WellCalibrated CHSIC I I Experiments and Results In order to preserve Py|z , we must find a permutation P that minimally distorts Z , quantified by pairwise metric Dij = d(zi, zj ). Relaxing P to be a doubly stochastic matrix (whose rows and columns each sum to 1), we can use an LP: Baselines I I min Tr(PD). CHSIC (Fukumizu et al., 2008): Requires many permutations to estimate the null hypothesis; problematic with high dimensionality. KCIT (Zhang et al., 2011): Implicitly assumes that X and Y can be modeled as smooth functions of Z . Tr(P)=0 I For high-dimensional data, assuming that Y is a smooth function of Z plus noise, Py|z = N (f (z), Σ), we can find a good permutation using the distortion measure: Dij = kf (zi) − f (zj )k2. Metrics I I Area Under Power Curve: When Ha holds, the AUC of the fraction of the time across 300 trials that the test rejects the H0 as α varies from 0 to 1. Calibratedness: When H0 holds, a measure using the Kolmogorov test of how closely the test rejects H0 at a rate of α as α varies from 0 to 1. Datasets Step II: Kernel Two-Sample Test P Overview In our approach, we reduce the conditional independence testing problem to one of two-sample testing using a permutation: X { µ(P) b(X) µ φ(xi) xi x 0 x 0 k(x, x ) = hφ(x), φ(x )i φ:X →H µ b(X) = Test Statistic: kb µ [(x1, y1, z1)] − 1 |X| P x∈X φ(x) 2 ? µ b [(x2, Py2, z2)]kH ≈ 0 Using a characteristic kernel (e.g., a Gaussian kernel) and a kernelbased distortion measure Dij = kφ(zi) − φ(zj )kH leads to a consistent conditional independence test. Step III: Splitting the Sample and Bootstrapping To ensure independence of the samples given to the two-sample test, the original sample must be split. Splitting results in a loss of power, which can be recovered via bootstrapping. 1.0 0.8 0.6 0.4 0.2 KCIPT 0.0 1.00 0.95 0.90 0.85 0.80 0.75 0 10 100 10−20 10−40 10−60 10−80 10−100 0 10 AUPC Our Approach (KCIPT) I Post-Nonlinear Noise: X and Y are smooth, but noisy and nonlinear functions of Z1. The additional dimensions of Z are noise. Chaotic Timeseries: Values in the timeseries are chaotically related to values from the previous timestep, and γ controls the amount of dependence for Ha. 1 KCIT 2 3 CHSIC 4 5 Dimensionality of Z 101 101 Dimensionality of Z Figure 1: Results for the post-nonlinear noise dataset. 1.0 100 Calibratedness The kernel-based two-sample test (Gretton et al., 2012) uses the kernel mean map to embed samples from distributions into an RKHS. KCIT I Area Under Power Curve Continuous Data ∗ Bernhard Schölkopf Area Under Power Curve I ‡ Kun Zhang Max Planck Institute for Intelligent Systems Background & Motivation I Krikamol Muandet ‡ Calibratedness ∗‡ 0.8 0.6 0.4 0.2 0.0 0.0 0.1 0.2 0.3 γ 0.4 0.5 10−3 10−6 10−9 10−12 0.0 0.1 0.2 0.3 0.4 0.5 γ Figure 2: Results for the chaotic timeseries dataset. References I Fukumizu, K.; Gretton, A.; Sun, X.; and Schölkopf, B. 2008. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, 489–496. Gretton, A.; Borgwardt, K. M.; Rasch, M. J.; Schölkopf, B.; and Smola, A. 2012. A kernel two-sample test. The Journal of Machine Learning Research 13:723–773. Zhang, K.; Peters, J.; Janzing, D.; and Schölkopf, B. 2011. Kernel-based conditional independence test and application in causal discovery. In Proc. of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 804–813. I We observe that the CHSIC performance degrades quickly with the dimensionality of the problem. With high dimensionality or nonsmooth relationships between variables, the KCIT has superior power, but KCIPT remains wellcalibrated. Code at http://engr.case.edu/doran_gary/code.html.
© Copyright 2025 ExpyDoc