latent confounders.

Estimation of Causal Direction
in the Presence of Latent Confounders
Using a Bayesian LiNGAM Mixture Model
Naoki Tanaka,
Shohei Shimizu, Takashi Washio
The Institute of Scientific and Industrial Research,
Osaka University
Outline
1.
2.
3.
4.
5.
Motivation
Background
Our Approach
Our Model: Bayesian LiNGAM Mixture
Simulation Experiments
2
Motivation
• Recently, estimation of causal structure attracts
much attention in machine learning.
– Epidemiology
– Genetics
Cause
Sleep
problems
• The estimation results can be biased
if there are latent confounders.
Depression
mood
Latent confounder
𝑓
→ Unobserved variables that have
more than one observed child variables.
𝑥1
𝑥2
Observed variables
• We propose a new estimation approach
that can solve the problem.
3
Outline
1.
2.
3.
4.
5.
Motivation
Background
Our Approach
Our Model: Bayesian LiNGAM Mixture
Simulation Experiments
4
LiNGAM(Linear Non-Gaussian Acyclic Model)
[Shimizu et al., 2006]
• The relations between variables are linear.
• Observed variables are generated from a DAG
(Directed Acyclic Graphs).
𝑒3
x1  1.4 x3  e1
1.4
x2  0.8 x1  0.5 x3  e2
x3  e3
𝑒1
𝑥1
𝑥3
0.5
-0.8
𝑥2
• External influences 𝑒𝑖 are non-Gaussian.
• No latent confounders.
→𝑒𝑖 are mutually independent.
• LiNGAM is an identifiable causal model.
𝑒2
5
A Problem of LiNGAM
• Latent confounders make 𝑒𝑖 dependent.
→The estimation results can be biased.
𝑥3 = 𝑓
x1  1.4 x3  e1
x2  0.8 x1  0.5 x3  e2
Patients’ condition
mild
Medicine
A
Survival
rate
x1  e1 '
dependent
x2  0.8 x1  e2 '
Patients’ condition
serious
Medicine
A
Survival
rate
6
LiNGAM with Latent Confounders
[Hoyer et al., 2008]
• LvLiNGAM (Latent variable LiNGAM)
𝑥𝑖 =
𝑏𝑖𝑗 𝑥𝑗 +
𝑘 𝑗 <𝑘(𝑖)
λ𝑖𝑑 𝑓𝑑 + 𝑒𝑖
𝑑
𝑓𝑑 :Latent variables
・Independent
・Non-Gaussian
λ𝑖𝑑 :Represent effects of 𝑓𝑑 on 𝑥𝑖
7
A Problem in Estimation of LiNGAM
with Latent Confounders
• Existing methods:
• An estimation method using overcomplete ICA.
[Hoyer et al., 2008]
→Suffers from local optima and requires large sample sizes.
• Estimates unconfounded causal relations.
[Entner and Hoyer, 2011; Tashiro et al., 2012]
→Cannot estimate a causal direction of two observed variables
that are affected by latent confounders.
• We propose an alternative.
– Computationally simpler.
– Capable of finding a causal direction in the presence of
latent confounders.
8
Outline
1.
2.
3.
4.
5.
Motivation
Background
Our Approach
Our Model: Bayesian LiNGAM Mixture
Simulation Experiments
9
Basic Idea of Our Approach
• Assumption
– Continuous latent confounders can be
approximated by discrete variables.
→LiNGAM with latent confounders reduces to
LiNGAM mixture model. [Shimizu et al., 2008]
• Estimation
– Estimation of LiNGAM mixture. [Mollah et al., 2006]
• Also suffers from local optima.
– Propose to use Bayesian approach.
• Bayesian approach for basic LiNGAM. [Hoyer et al., 2009]
10
LiNGAM Mixture Model [Shimizu et al.,2008]
• A data generating model of observed variable 𝑥𝑖
within class 𝑐 is
𝑥𝑖 =
𝑏𝑖𝑗
𝑐
𝑥𝑗 − 𝜇𝑗 (𝑐) + 𝑒𝑖 (𝑐) + 𝜇𝑖 (𝑐)
𝑘 𝑗 <𝑘(𝑖)
Matrix form
𝐱 = 𝐁(𝑐) 𝐱 + 𝐈 − 𝐁
mean
Class 1
0
Class 2
7
𝑥1
𝑥1
𝑐
0.8
𝑥2
0.8
𝑥2
𝛍
𝑐
mean
0
+ 𝐞(𝑐)
+
++
+
++
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
++
++
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+++++
+
+
+
+
++++
+
+
6
• Existing estimation methods of LiNGAM mixture model
also suffer from local optima.[Mollah et al., 2006]
11
Relation of Latent Variable LiNGAM
and LiNGAM Mixture (1)
• We assume that continuous latent confounders
can be approximated by discrete variables
having several values with good precision.
– The combination of the discrete values determine
which “class” an observation belongs to.
→𝑒𝑖 within the same class are mutually independent.
→It is simpler than incorporating latent confounders
in LiNGAM directly.
independent
x1  1.4 f 3  e1
x2  0.8 x1  0.5 f 3  e2
𝑓3 → constant
x1  μ1( c )  e1
x2  0.8 x1 μ(2c )  e2
12
Relation of Latent Variable LiNGAM
and LiNGAM Mixture (2)
• A simple example
– If latent confounders 𝑓3 and 𝑓4 can be approximated by 0 and 1 …
Latent Variable LiNGAM
𝑥𝑖 =
𝑏𝑖𝑗
𝑐
λ𝑖𝑑 𝑓𝑑 + 𝑒𝑖 (𝑐)
𝑥𝑗 +
𝑘 𝑗 <𝑘(𝑖)
LiNGAM Mixture
𝑥𝑖 =
𝑏𝑖𝑗
𝑐
𝑘 𝑗 <𝑘(𝑖)
𝑑
reduces
𝑓3
0.7 0.9
0.3
𝜇1
0
𝑥1
𝑥𝑗 − 𝜇𝑗 (𝑐) + 𝑒𝑖 (𝑐) + 𝜇𝑖 (𝑐)
𝑓4
0
1
0.6
𝑥2
Class 2
4
1
3
𝜇2
0
𝜇1 (1) = 0
𝜇1 (2) = 0.9
(2)
(4)
(1)
(3)
𝜇1 (3) =𝜇0.3
1
𝟎.9
1.2
0.3
0
𝟎.3
𝟏.2
𝜇1 (4) =0.9
1.2
0.7 0.9
0.3
𝑥1
1
0
𝜇2 (1) = 0
0.6𝜇
2
(2)
= 0.6
(3) = 0.7
(4)
(1)
(3)
𝜇𝜇22(2)
𝑥2 𝟎.7
𝟎.6
𝟏.3
0.7
𝜇02 (4) = 1.3
𝑥
13
Outline
1.
2.
3.
4.
5.
Motivation
Background
Our Approach
Our Model: Bayesian LiNGAM Mixture
Simulation Experiments
14
Bayesian LiNGAM Mixture Model (1)
• The data within class 𝑐 are assumed to be generated
by the LiNGAM model.
→ 𝑏𝑖𝑗 and 𝑝𝑖 , the densities of 𝑒𝑖 , have no relation to latent confounders 𝑓𝑑 ,
so they are not different between classes. Although 𝑓
3
x1  1.4 f 3  e1
𝑏21 does
not change
• 𝑏𝑖𝑗
𝑐
changes …
x2  0.8 x1  0.5 f 3  e2
Density 𝑝𝑖
do not
change
and 𝑝𝑖 (𝑐) are the same between classes, so
we replace 𝑏𝑖𝑗
𝑐
and 𝑒𝑖 (𝑐) of the LiNGAM mixture model by 𝑏𝑖𝑗 and 𝑒𝑖 :
𝑏𝑖𝑗 𝑥𝑗 − 𝜇𝑗 (𝑐) + 𝑒𝑖 + 𝜇𝑖 (𝑐)
𝑥𝑖 =
𝑘 𝑗 <𝑘(𝑖)
• Then their probability density is
𝑝 𝑥𝑖 𝛉(𝑐) = 𝑝𝑖 (𝑥𝑖 − 𝜇𝑖 (𝑐) −
𝑏𝑖𝑗 (𝑥𝑗 − 𝜇𝑗 (𝑐) ))
𝑘 𝑗 <𝑘 𝑖
15
Bayesian LiNGAM Mixture Model (2)
• The probability density of the data 𝐱
within each class is mixed according to some weights.
𝑙
𝑝 𝐱 𝛉(𝑐) 𝑝(𝑐)
𝑝 𝐱|𝛉 =
𝑐=1
(𝑙: The number of classes)
• 𝑝(𝑐) : multinomial distribution.
• The parameters of the multinomial distribution:
Dirichlet distribution
– A typical prior for the parameters of the multinomial
distribution.
– Conjugate prior for multinomial distribution.
16
Compare Three LiNGAM Mixture Models
• Select the model
with the largest log-marginal likelihood.
• There are only three (𝐺1 , 𝐺2 and 𝐺3 ) models
between two observed variables because of
the assumption of acyclicity.
𝐺1
𝐺2
𝐺3
class
class
class
𝑥1
𝑥2
𝑥1
𝑥2
𝑥1
𝑥2
17
Log-marginal Likelihood of Our Model
• Bayes’ theorem
• 𝐃 = 𝐱1 , … , 𝐱 𝑁
P 𝐺𝑚 𝐃 =
𝑝(𝐃|𝐺𝑚 )𝑃 𝐺𝑚
𝑝 𝐃
(𝑁: sample size)
• Log-marginal likelihood is calculated as follows:
log𝑝 𝐃 𝐺𝑚 = log
𝑝 𝐃 𝛉, 𝐺𝑚 𝑝 𝛉 𝐺𝑚 𝑑𝛉
LiNGAM-mixture
Prior distribution
• We use Monte Carlo integration to compute the integral.
• The assumption of i.i.d. data,
𝑝 𝐃 𝛉, 𝐺𝑚
𝑙
𝑛
=
𝑝𝑖 𝑥𝑖 − 𝜇𝑖
𝑐=1
𝑖=1
𝑐
−
𝑏𝑖𝑗 𝑥𝑗 − 𝜇𝑗
𝑐
𝑝(𝑐)
𝑘 𝑗 <𝑘 𝑖
18
Distribution of 𝒆𝒊
• 𝑒𝑖 follows a generalized Gaussian distribution
with zero means.
→Includes Gaussian, Laplace, continuous uniform
and many non-Gaussian distributions.
– 𝑝𝑖 𝑒𝑖 =
𝛽𝑖
exp
1
2𝛼𝑖 Γ( 𝛽 )
𝑖
– 𝑉𝑎𝑟(𝑒𝑖 ) =
|𝑒𝑖 | 𝛽
−( ) 𝑖
𝛼𝑖
𝛼𝑖2 Γ(3 𝛽 )
𝑖
Γ(1 𝛽 )
𝑖
– Γ( ) is the Gamma function.
𝑉𝑎𝑟(𝑒𝑖 ) = 1
𝛽𝑖 = 1
𝛽𝑖 = 2
𝛽𝑖 = 10
19
Prior Distributions and the Number of Classes
• Prior distribution
– 𝑏𝑖𝑗 and 𝜇𝑖 (𝑐) ~𝑵(𝟎, 𝝈𝟐𝒊 )
– 𝑉𝑎𝑟(𝑒𝑖 ), 𝛽𝑖 and 𝜎𝑖2 ~𝐈𝐧𝐯 − 𝐆𝐚𝐦𝐦𝐚(𝟑, 𝟑)
– 𝛼𝑖 can be calculated by using the equation
of 𝑉𝑎𝑟(𝑒𝑖 ).
• How to select the number of classes.
Inv-Gamma(3,3)
– Note that ‘true 𝑙’ does not exist.
② Selects the best number of classes.
(painted in orange)
1
…
𝑙
…
2log𝑁
𝐺1
0.6
𝐺2
0.3
0.8
0.5
𝐺3
0.1
0.1
0.2
0.1
In a Dirichlet process
mixture model,
𝑁 → ∞ ⇒ 𝑙 → log𝑁
[Antoniak, 1974]
0.3
① Selects the best
model. (letter in red)
20
Outline
1.
2.
3.
4.
5.
Motivation
Background
Our Approach
Our Model: Bayesian LiNGAM Mixture
Simulation Experiments
21
Simulation Settings(1)
• Generated data using a LiNGAM with latent confounders.
[Hoyer et al., 2008]
• 100 trials.
𝑓3
0.7
𝑓4
0.9
-1
0.6
0.8
0.3
(This graph is 𝐺2 .)
𝑒1
𝑓5
𝑥1
0.8
𝑥2
𝑒2
• The distributions of latent variables (𝑒1 ,𝑒2 ,𝑓3 ,𝑓4 and 𝑓5 )
are randomly selected from the following three
non-Gaussian distributions:
Laplace distribution
Mixture of two
Gaussian distribution
(symmetric)
Mixture of two
Gaussian distribution
(asymmetric)
22
Simulation Settings(2)
• Two methods for comparison:
– Pairwise likelihood ratios for estimation of
non-Gaussian SEMs [Hyvärinen et al., 2013]
→Assumes no latent confounders.
– PairwiseLvLiNGAM [Entner et al., 2011]
→Finds variable pairs that are not affected by
latent confounders and then estimate
a causal ordering of one to the other.
23
Simulation Results
𝐺1 (𝑥1
𝑥2 )
100
The number of
correct answers
The number of
correct answers
100
𝐺2 (𝑥1 → 𝑥2 )
80
60
40
20
0
50 100 200
Sample size
𝐺3 (𝑥1 ← 𝑥2 )
100
The number of
correct answers
True:
80
60
40
20
0
50 100 200
Sample size
100Our method
80
60
40
20
0
Pairwise
50measure
PairwiseLv
LiNGAM
0(Number of
50
outputs)
50 100 200
Sample size
• “(Number of outputs)” is the number of estimation by PairwiseLvLiNGAM.
– For the details,
Correct answers / Number of outputs
𝐺1
𝐺2
𝐺3
50
64/64
6/12
6/16
100
52/52
7/20
5/24
200
42/42
0/14
2/14
• Our method is most robust against existing latent confounders. 24
Conclusions and Future Work
• A challenging problem: Estimation of causal direction
in the presence of latent confounders.
– Latent confounders violate the assumption of LiNGAM
and can bias the estimation results.
• Proposed a Bayesian LiNGAM mixture approach.
– Capable of finding causal direction in the presence of
latent confounders.
– Computationally simpler: no iterative estimation
in the parameter space.
• In this simulation, our method was better
than two existing methods.
• Future work
– Test our method on a wide variety of real datasets.
25
26
Histograms of 𝒍
20
20
20
10
10
10
0
0
0
1 2 3 4 5 6 7
1 2 3 4 5 6 7
1 2 3 4 5 6 7
G1, sample size:50
G2, sample size:50
G3, sample size:50
20
20
20
10
10
10
0
0
0
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
G1, sample size:100
G2, sample size:100
G3, sample size:100
20
20
20
10
10
10
0
0
0
1 2 3 4 5 6 7 8 9 10
G1, sample size:200
1 2 3 4 5 6 7 8 9 10
G2, sample size:200
1 2 3 4 5 6 7 8 9 10
G3, sample size:200
27
Density of a Transformation
[Hyvärinen et al., 2001]
• e.g.)𝒙 = (𝑥1 , … , 𝑥𝑛 )𝑇 , 𝒆 = (𝑒1 , … , 𝑒𝑛 )𝑇
• 𝑝𝒙 is the density of 𝒙 and 𝑝𝒆 is the density of 𝒆.
– 𝑥𝑖 is i.i.d data, so 𝑝𝒙 =
𝒊 𝑝𝑥𝑖 .
Similarly, 𝑝𝒆 =
𝒊 𝑝𝑒𝑖
• We can rewrite LiNGAM in a matrix form.
𝒙 = 𝑩𝒙 + 𝒆 ⇔ 𝒆 = (𝑰 − 𝑩)−1 𝒙
• 𝑝𝒙 𝒙 =
1
det 𝑰−𝑩 −1
𝑝𝒆 𝐞 =
1
det 𝑰−𝑩 −1
𝑝𝒆 (𝑰 − 𝑩)−1 𝒙
• 𝑩 could be permuted by simultaneous equal row and column
permutations to be strictly lower triangular due to the acyclicity
assumption. [Bollen, 1989]
→ (𝑰 − 𝑩)−1 is lower triangular whose diagonal elements are all 1.
• A determinant of lower triangular equals the product of
its diagonal elements.
→ |(𝑰 − 𝑩)−1 | = 1
28
Gaussian vs. Non-Gaussian
𝑥2
Gaussian
𝑥2
Non-Gaussian (uniform)
(𝑥1 → 𝑥2 )
𝑥2
𝑥1
𝑥2
𝑥1
(𝑥1 ← 𝑥2 )
𝑥1
𝑥1
29