A Permutation-Based Kernel Conditional Independence Test

Gary Doran‡, Krikamol Muandet∗, Kun Zhang∗, Bernhard Schölkopf∗
‡Case Western Reserve University   ∗Max Planck Institute for Intelligent Systems

Background & Motivation
Conditional independence tests are required for algorithms (e.g.,
the PC algorithm) that attempt to infer Bayesian network structure
or causal DAGs from data.
We develop a test that works with continuous, real-valued data, makes only weak assumptions about the joint distribution over variables, works for high-dimensional datasets, and is well-calibrated (i.e., has appropriate Type I error rates when H0 holds).
[Table: CHSIC, KCIT, and our approach (KCIPT) compared on four criteria: continuous data, weak assumptions, high dimensionality, and well-calibratedness.]
Overview

In our approach, we reduce the conditional independence testing problem to one of two-sample testing using a permutation: under H0, permuting the y values among observations with similar z values approximately preserves the joint distribution, so the permuted sample should be indistinguishable from the original.

Step I: Learning a Permutation

In order to preserve P_{y|z}, we must find a permutation P that minimally distorts Z, quantified by a pairwise metric D_ij = d(z_i, z_j). Relaxing P to be a doubly stochastic matrix (whose rows and columns each sum to 1), we can use a linear program:
    min_P Tr(PD)   subject to   P1 = 1,   Pᵀ1 = 1,   P ≥ 0,   Tr(P) = 0.

The constraint Tr(P) = 0 forbids fixed points; without it, the identity permutation P = I would trivially achieve zero distortion.
For high-dimensional data, assuming that Y is a smooth function of Z plus noise, P_{y|z} = N(f(z), Σ), we can find a good permutation using the distortion measure D_ij = ‖f(z_i) − f(z_j)‖².
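As an illustration, here is a minimal Python sketch of Step I, assuming SciPy is available. It swaps the LP for the equivalent assignment problem (the doubly stochastic polytope has permutation matrices as its vertices, so the relaxation is tight) and enforces Tr(P) = 0 with a large diagonal cost; the helper name learn_permutation and this penalty trick are illustrative choices, not the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def learn_permutation(Z):
    """Sketch: a fixed-point-free permutation matrix P approximately
    minimizing Tr(PD) for the distortion D_ij = ||z_i - z_j||."""
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]                   # treat a vector as (n, 1)
    D = cdist(Z, Z)                      # pairwise Euclidean distances on Z
    # Tr(P) = 0: make any fixed point cost more than any full derangement,
    # so the optimal assignment never maps a sample to itself.
    np.fill_diagonal(D, D.max() * len(D) + 1.0)
    rows, cols = linear_sum_assignment(D)  # assignment problem = tight LP
    P = np.zeros_like(D)
    P[rows, cols] = 1.0                  # build the permutation matrix
    return P
```

Applying P to the sample's y values (P @ Y) then yields the permuted sample used in Step II.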
Step II: Kernel Two-Sample Test
The kernel-based two-sample test (Gretton et al., 2012) uses the kernel mean map to embed samples from distributions into an RKHS. Given a feature map φ : X → H with kernel k(x, x′) = ⟨φ(x), φ(x′)⟩, the empirical mean embedding of a sample X is

    μ̂(X) = (1/|X|) Σ_{x∈X} φ(x).

Test statistic: compare the embedding of the original sample with that of the permuted sample,

    ‖μ̂[(x1, y1, z1)] − μ̂[(x2, P y2, z2)]‖²_H  ≈?  0,

rejecting H0 when the two mean embeddings are far apart.
Using a characteristic kernel (e.g., a Gaussian kernel) and a kernel-based distortion measure D_ij = ‖φ(z_i) − φ(z_j)‖_H leads to a consistent conditional independence test.
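A minimal sketch of the two-sample statistic, using a Gaussian kernel on the concatenated (x, y, z) vectors; the bandwidth sigma and the biased MMD estimator are illustrative assumptions, and the paper's implementation may differ (e.g., in kernel and bandwidth selection).

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix of the Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2(S1, S2, sigma=1.0):
    """Biased estimate of ||mu_hat(S1) - mu_hat(S2)||_H^2 (squared MMD)."""
    K11 = gaussian_gram(S1, S1, sigma)
    K22 = gaussian_gram(S2, S2, sigma)
    K12 = gaussian_gram(S1, S2, sigma)
    return K11.mean() + K22.mean() - 2.0 * K12.mean()

# Usage: S1 stacks the original half-sample, S2 the permuted one, e.g.
#   S1 = np.hstack([x1, y1, z1])
#   S2 = np.hstack([x2, P @ y2, z2])
#   stat = mmd2(S1, S2)
```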
Step III: Splitting the Sample and Bootstrapping
To ensure independence of the samples given to the two-sample test,
the original sample must be split. Splitting results in a loss of power,
which can be recovered via bootstrapping.
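The following sketch shows the split-and-bootstrap structure, reusing the learn_permutation and mmd2 sketches above; the bootstrap count n_boot and the half/half split are illustrative defaults, and turning the averaged statistic into a p-value (via its null distribution) is omitted.

```python
import numpy as np

def kcipt_statistic(X, Y, Z, n_boot=100, seed=None):
    """Average the two-sample statistic over bootstrap splits: each split
    leaves one half unpermuted and applies a Z-preserving permutation to
    the Y values of the other half.  X, Y, Z are (n, d) arrays."""
    rng = np.random.default_rng(seed)
    n = len(X)
    stats = []
    for _ in range(n_boot):
        idx = rng.permutation(n)
        a, b = idx[: n // 2], idx[n // 2:]
        P = learn_permutation(Z[b])              # Step I
        S1 = np.hstack([X[a], Y[a], Z[a]])       # original half-sample
        S2 = np.hstack([X[b], P @ Y[b], Z[b]])   # permuted half-sample
        stats.append(mmd2(S1, S2))               # Step II
    return float(np.mean(stats))
```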
Experiments and Results

Baselines
• CHSIC (Fukumizu et al., 2008): requires many permutations to estimate the null distribution; problematic with high dimensionality.
• KCIT (Zhang et al., 2011): implicitly assumes that X and Y can be modeled as smooth functions of Z.

Metrics
• Area Under Power Curve (AUPC): when Ha holds, the area under the curve giving the fraction of 300 trials in which the test rejects H0, as α varies from 0 to 1.
• Calibratedness: when H0 holds, a measure based on the Kolmogorov test of how closely the test rejects H0 at rate α, as α varies from 0 to 1 (a sketch of both metrics follows the dataset list below).

Datasets
• Post-Nonlinear Noise: X and Y are smooth but noisy and nonlinear functions of Z1; the additional dimensions of Z are noise.
• Chaotic Timeseries: values in the timeseries are chaotically related to values from the previous timestep, and γ controls the amount of dependence under Ha.
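A minimal sketch of how the two metrics could be computed from per-trial p-values, assuming the test rejects when p ≤ α; reading calibratedness as the Kolmogorov-Smirnov p-value for uniformity is an assumption consistent with the plots, not the paper's exact formula.

```python
import numpy as np
from scipy.stats import kstest

def aupc(p_values):
    """Area under the power curve. With rejection when p <= alpha, the
    power curve is alpha -> mean(p <= alpha), and its integral over
    alpha in [0, 1] equals the mean of (1 - p)."""
    return float(np.mean(1.0 - np.asarray(p_values)))

def calibratedness(p_values):
    """Under H0 a well-calibrated test yields Uniform(0, 1) p-values;
    measure closeness to uniformity with a Kolmogorov-Smirnov test
    (larger p-value = better calibrated)."""
    return kstest(p_values, "uniform").pvalue
```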
[Figure 1: Results for the post-nonlinear noise dataset. Panels: AUPC and Calibratedness vs. dimensionality of Z (1 to 5), for KCIPT, KCIT, and CHSIC.]
[Figure 2: Results for the chaotic timeseries dataset. Panels: AUPC and Calibratedness vs. γ (0.0 to 0.5).]
Observations
• CHSIC's performance degrades quickly with the dimensionality of the problem.
• With high dimensionality or nonsmooth relationships between variables, KCIT has superior power, but KCIPT remains well-calibrated.

Code at http://engr.case.edu/doran_gary/code.html.
References

Fukumizu, K.; Gretton, A.; Sun, X.; and Schölkopf, B. 2008. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, 489–496.

Gretton, A.; Borgwardt, K. M.; Rasch, M. J.; Schölkopf, B.; and Smola, A. 2012. A kernel two-sample test. Journal of Machine Learning Research 13:723–773.

Zhang, K.; Peters, J.; Janzing, D.; and Schölkopf, B. 2011. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 804–813.