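Ψ-Maximum Likelihood and Its Applications
PCA, ICA, Gaussian Mixture
The Institute of Statistical Mathematics
Presenter: Shinto Eguchi
Collaborators: Hidehiko Kamiya, Yutaka Kano, Mihoko Minami, Hironori Fujisawa
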
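Outline
1. Derivation of the Ψ-maximum likelihood method: Ψ-divergence
2. Basic properties: unbiasedness, consistency, efficiency, influence function
3. Normal model: simulation with the bivariate normal distribution
4. PCA: projective Euclidean distance
5. ICA: independence and the β-likelihood
6. Gaussian mixture: boundedness of the β-likelihood
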
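Fisher's maximum likelihood method
Model
$$M = \{ f(x,\theta) : \theta \in \Theta \}$$
Data
Let $(x_1, \dots, x_n)$ be from $f(x,\theta)$
Log density
$$\ell(x,\theta) = \log f(x,\theta)$$
Log-likelihood
$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell(x_i,\theta)$$
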
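Ψ version of maximum likelihood estimation
Ψ-likelihood (general form)
$$L_\Psi(\theta) = \frac{1}{n}\sum_{i=1}^{n}\{\Psi(\ell(x_i,\theta)) - b_\Psi(\theta)\}$$
where $\Psi(\cdot)$ is a monotonically increasing function.
β-power likelihood
$$L_\beta(\theta) = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i,\theta)^\beta - 1}{\beta} - b_\beta(\theta)$$
η-sigmoid likelihood
$$L_\eta(\theta) = \frac{1}{n}\sum_{i=1}^{n}\log\{f(x_i,\theta) + \eta\} - b_\eta(\theta)$$
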
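Wedderburn's quasi-likelihood
• Variance function: $V = V(\mu)$
• Quasi-score: $\dfrac{\partial L}{\partial\mu} = \dfrac{z - \mu}{V(\mu)}$
• Normal: $V(\mu) = 1$
• Poisson: $V(\mu) = \mu$
• Gamma: $V(\mu) = \mu^2$
• Binomial: $V(\mu) = \mu(1 - \mu)$
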
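From the Poisson likelihood to the KL divergence
Poisson variance function
$$V(\mu) = \mu$$
Quasi-likelihood
$$L^{(po)}(z,\mu) = \int_\mu^z \frac{z-s}{s}\,ds = z(\log z - \log\mu) - z + \mu$$
KL divergence
$$D_{KL}(g,f) = \int g(\log g - \log f)\,d\Lambda$$
Relation
$$D_{KL}(g,f) = \int L^{(po)}(g(y), f(y))\,d\Lambda(y)$$
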
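The relation between the Poisson quasi-likelihood and the KL divergence can be checked numerically. The following is a minimal sketch, assuming Lebesgue measure as the carrier measure and two normal densities chosen only for illustration (they do not appear in the slides); the helper names are hypothetical.

```python
import numpy as np

def poisson_quasi_likelihood(z, mu):
    """Pointwise L^(po)(z, mu) = z (log z - log mu) - z + mu, for z, mu > 0."""
    return z * (np.log(z) - np.log(mu)) - z + mu

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2)."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Illustrative densities (an assumption of this sketch): g = N(0, 1), f = N(1, 1.5^2).
y = np.linspace(-12.0, 14.0, 200001)
dy = y[1] - y[0]
g = normal_pdf(y, 0.0, 1.0)
f = normal_pdf(y, 1.0, 1.5)

# Integrate the pointwise quasi-likelihood with a Riemann sum.
kl_from_quasi = np.sum(poisson_quasi_likelihood(g, f)) * dy

# Closed-form KL divergence between the two normal densities, for comparison.
kl_closed_form = np.log(1.5) + (1.0 + 1.0) / (2.0 * 1.5 ** 2) - 0.5
```

The two quantities agree because the extra terms $-g + f$ integrate to zero when both densities have total mass one.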
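From quasi-likelihood to Ψ-divergence
Quasi-likelihood function
$$L(z,\mu) = \int_\mu^z \frac{z-s}{V(s)}\,ds \ge 0$$
$$\Psi(z) = \int^{\exp z}\frac{ds}{V(s)}, \qquad \Psi^*(z) = \int^{\exp z}\frac{s}{V(s)}\,ds \qquad \big(\Psi^{*\prime}(z) = \exp(z)\,\Psi'(z)\big)$$
Ψ-divergence
$$D_\Psi(g,f) = \int L(g,f)\,d\Lambda \ge 0$$
$$D_\Psi(g,f) = \int g\{\Psi(\log g) - \Psi(\log f)\}\,d\Lambda - \int\{\Psi^*(\log g) - \Psi^*(\log f)\}\,d\Lambda$$
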
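As an illustration of the Ψ-divergence, the β-power case can be evaluated on a grid. This sketch assumes $\Psi(z) = \{\exp(\beta z) - 1\}/\beta$ and $\Psi^*(z) = \exp\{(\beta+1)z\}/(\beta+1)$, which satisfy $\Psi^{*\prime}(z) = \exp(z)\Psi'(z)$, and uses two normal densities that are not from the slides. Function names are hypothetical.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2)."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def beta_divergence(g, f, dy, beta):
    """Psi-divergence for the beta-power choice, by a Riemann sum on a grid.

    D(g, f) = int g {Psi(log g) - Psi(log f)} - int {Psi*(log g) - Psi*(log f)}
    with Psi(z) = (exp(beta z) - 1) / beta and Psi*(z) = exp((beta + 1) z) / (beta + 1).
    """
    psi_term = np.sum(g * (g ** beta - f ** beta) / beta) * dy
    star_term = np.sum((g ** (beta + 1.0) - f ** (beta + 1.0)) / (beta + 1.0)) * dy
    return psi_term - star_term

# Illustrative densities (an assumption of this sketch).
y = np.linspace(-12.0, 14.0, 200001)
dy = y[1] - y[0]
g = normal_pdf(y, 0.0, 1.0)
f = normal_pdf(y, 1.0, 1.5)

d_half = beta_divergence(g, f, dy, beta=0.5)      # a robust choice of beta
d_small = beta_divergence(g, f, dy, beta=1e-4)    # close to the KL divergence
```

For small β the value approaches the KL divergence, which is the sense in which the Ψ-divergence extends it.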
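From Ψ-divergence to Ψ-likelihood
• Ψ-divergence, up to a term that does not depend on $f$:
$$D_\Psi(g,f) \equiv -\int g\,\Psi(\log f)\,d\Lambda + \int \Psi^*(\log f)\,d\Lambda$$
• Relation
$$D_\Psi(g, f(\cdot,\theta)) \equiv -\int\{\Psi(\ell(y,\theta)) - b_\Psi(\theta)\}\,dG(y) \approx -\frac{1}{n}\sum_{i=1}^{n}\{\Psi(\ell(x_i,\theta)) - b_\Psi(\theta)\} = -L_\Psi(\theta)$$
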
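Ψ bias potential
General form
$$L_\Psi(\theta) = \frac{1}{n}\sum_{i=1}^{n}\{\Psi(\ell(x_i,\theta)) - b_\Psi(\theta)\}$$
Bias potential
$$b_\Psi(\theta) = \int \Psi^*(\ell(y,\theta))\,d\Lambda(y)$$
β-power likelihood
$$L_\beta(\theta) = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i,\theta)^\beta - 1}{\beta} - \frac{1}{\beta+1}\int f(y,\theta)^{\beta+1}\,dy$$
η-sigmoid likelihood (dropping the constant $\int f(y,\theta)\,dy = 1$)
$$L_\eta(\theta) = \frac{1}{n}\sum_{i=1}^{n}\log\{f(x_i,\theta) + \eta\} + \eta\int\log\{f(y,\theta) + \eta\}\,dy$$
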
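For a concrete model the bias potential has a closed form. The sketch below evaluates the β-power likelihood for a univariate normal model $N(\mu, \sigma^2)$, for which $\int f^{\beta+1}dy = (2\pi\sigma^2)^{-\beta/2}(1+\beta)^{-1/2}$. The univariate normal model and the function names are assumptions made for illustration.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2)."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def beta_bias_normal(sigma, beta):
    """Bias potential b_beta = (1 / (beta + 1)) * int f^(beta + 1) dy for N(mu, sigma^2)."""
    return (2.0 * np.pi * sigma ** 2) ** (-beta / 2.0) * (1.0 + beta) ** (-1.5)

def beta_likelihood_normal(x, mu, sigma, beta):
    """beta-power likelihood: mean of (f^beta - 1) / beta minus the bias potential."""
    f = normal_pdf(np.asarray(x, dtype=float), mu, sigma)
    data_term = np.mean((f ** beta - 1.0) / beta)
    return data_term - beta_bias_normal(sigma, beta)
```

As β tends to 0 the data term tends to the mean log-likelihood and the bias potential tends to 1, so the criterion reduces to the ordinary log-likelihood up to a constant.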
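Ψ-maximum likelihood estimation as M-estimation
• Huber's M-estimation
$$\min_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^{n}\rho(x_i,\theta)$$
• Location estimation, with Huber's clipped score
$$\psi_k(y,\theta) = \begin{cases} y - \theta & \text{if } |y - \theta| \le k \\ k\,\mathrm{sgn}(y - \theta) & \text{otherwise} \end{cases}$$
• Relation
$$\rho(y,\theta) = -\{\Psi(\ell(y,\theta)) - b_\Psi(\theta)\}$$
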
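Huber's location estimate solves $\sum_i \psi_k(x_i, \theta) = 0$ for the clipped score above. The following is a minimal sketch using iteratively reweighted averaging; it assumes a known unit scale and a default cutoff `k=1.345`, neither of which is stated in the slides.

```python
import numpy as np

def huber_psi(r, k):
    """Huber's clipped score: r inside [-k, k], k * sign(r) outside."""
    return np.clip(r, -k, k)

def huber_location(x, k=1.345, tol=1e-10, max_iter=200):
    """Solve sum_i psi(x_i - theta) = 0 by iteratively reweighted averaging."""
    x = np.asarray(x, dtype=float)
    theta = np.median(x)  # robust starting value
    for _ in range(max_iter):
        r = np.abs(x - theta)
        # Weight psi(r) / r: 1 for small residuals, k / |r| for large ones.
        w = np.where(r <= k, 1.0, k / np.maximum(r, 1e-300))
        new_theta = np.sum(w * x) / np.sum(w)
        converged = abs(new_theta - theta) < tol
        theta = new_theta
        if converged:
            break
    return theta
```

A single gross outlier moves the sample mean arbitrarily far, while its contribution to the Huber estimating equation is capped at k.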
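Unbiasedness
Ψ-likelihood equation $g_\Psi(\theta) = 0$, with
$$g_\Psi(\theta) = \frac{1}{n}\sum_{i=1}^{n}\Big[\psi(x_i,\theta)S(x_i,\theta) - \frac{\partial}{\partial\theta}b_\Psi(\theta)\Big]$$
where
$$\psi(x,\theta) = \frac{\partial\Psi}{\partial\ell}(\ell(x,\theta)), \qquad S(x,\theta) = \frac{\partial}{\partial\theta}\ell(x,\theta)$$
Unbiasedness
$$E_\theta\{g_\Psi(\theta)\} = 0 \quad (\forall\theta)$$
because
$$\frac{\partial}{\partial\theta}b_\Psi(\theta) = E_\theta\{\psi(X,\theta)S(X,\theta)\}$$
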
Weight functions
[Figure: two panels plotting the weight function $\psi(\ell)$ against $\ell$ for several values of the tuning parameter. β-power: $\psi(\ell) = \exp(\beta\ell)$. η-sigmoid: $\psi(\ell) = \exp(\ell)/\{\eta + \exp(\ell)\}$.]

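The two weight functions are simple to evaluate. This short sketch shows how both down-weight observations with a small log density $\ell$; the function names are hypothetical.

```python
import numpy as np

def weight_beta(ell, beta):
    """beta-power weight psi(l) = exp(beta * l); identically 1 when beta = 0."""
    return np.exp(beta * np.asarray(ell, dtype=float))

def weight_eta(ell, eta):
    """eta-sigmoid weight psi(l) = exp(l) / (eta + exp(l)); identically 1 when eta = 0."""
    e = np.exp(np.asarray(ell, dtype=float))
    return e / (eta + e)
```

Both weights increase with $\ell$. The η-sigmoid weight is bounded above by 1 and equals 1/2 where the density equals η.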
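Consistency
If $(x_1, \dots, x_n)$ is from $f_\theta(x)$, then
$$\theta = \arg\max_{\theta^*} E_\theta\{L_\Psi(\theta^*)\}$$
since
$$E_\theta[L_\Psi(\theta) - L_\Psi(\theta^*)] = D_\Psi(f_\theta, f_{\theta^*}) \ge 0$$
Hence
$$\hat\theta_\Psi \xrightarrow{\text{a.s.}} \theta \quad\Longleftarrow\quad L_\Psi(\theta) \xrightarrow{\text{a.s.}} E_\theta\{L_\Psi(\theta)\}$$
Cf. Wald (1949)
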
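Influence function
• Statistical functional
$$T_\Psi(G) = \arg\max_{\theta\in\Theta}\int\{\Psi(\ell(y,\theta)) - b_\Psi(\theta)\}\,dG(y)$$
• Influence function
$$\mathrm{IF}(T_\Psi, x) = \frac{\partial}{\partial\varepsilon}T_\Psi(G_\varepsilon)\Big|_{\varepsilon=0}, \qquad G_\varepsilon = (1-\varepsilon)F_\theta + \varepsilon\,\delta_x$$
$$\mathrm{IF}(T_\Psi, x) = J_\Psi(\theta)^{-1}\Big\{\psi(x,\theta)S(x,\theta) - \frac{\partial}{\partial\theta}b_\Psi(\theta)\Big\}$$
$$\mathrm{GES}(T_\Psi) = \sup_x\|\mathrm{IF}(T_\Psi, x)\|$$
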
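Efficiency
Asymptotic variance
$$\sqrt{n}\,(\hat\theta_\Psi - \theta) \xrightarrow{D} N\big(0,\ J_\Psi(\theta)^{-1}H_\Psi(\theta)J_\Psi(\theta)^{-1}\big)$$
where
$$J_\Psi(\theta) = E_\theta\big(\psi(X,\theta)S(X,\theta)S(X,\theta)^T\big), \qquad H_\Psi(\theta) = \mathrm{Var}_\theta\big(\psi(X,\theta)S(X,\theta)\big)$$
Information inequality
$$I(\theta)^{-1} \le J_\Psi(\theta)^{-1}H_\Psi(\theta)J_\Psi(\theta)^{-1}$$
(equality holds only when $\Psi(\ell) = \ell$)
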
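From log-likelihood to Ψ-likelihood (Summary 1)
KL divergence → Ψ-divergence
(log-likelihood → Ψ-likelihood)
Basic properties of Ψ-maximum likelihood estimation
Consistency, asymptotic normality, influence function, efficiency, GES
(Ψ-maximum likelihood confidence regions, Ψ-maximum likelihood tests)
Points to note for Ψ-maximum likelihood estimation
Equivariance under data transformations
Decomposability of the Ψ-likelihood
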
Estimation of the mean of a normal distribution
Influence functions
[Figure: influence functions of the mean estimators in two panels. β-power estimates for β = 0, 0.01, 0.05, 0.075, 0.1, 0.125. η-sigmoid estimates for η = 0, 0.015, 0.15, 0.4, 0.8, 2.5.]

Gross Error Sensitivity
β-power estimates
  β            0      0.01    0.05    0.075   0.1     0.125
  Efficiency   1      0.97    0.861   0.799   0.742   0.689
  GES          ∞      6.16    2.92    2.47    2.21    2.04
η-sigmoid estimates
  η            0      0.015   0.15    0.4     0.8     2.5
  Efficiency   1      0.972   0.873   0.802   0.753   0.694
  GES          ∞      1.9     1.04    0.678   0.455   0.197

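The GES values for the β-power estimates can be reproduced from the influence function. The sketch below assumes a normal mean with known unit variance, a setup the slides do not state explicitly. Under that assumption $\psi S = f^\beta (x - \mu)$ and $J_\beta = (2\pi)^{-\beta/2}(1+\beta)^{-3/2}$, so $\mathrm{IF}(x) = (1+\beta)^{3/2}(x-\mu)\exp\{-\beta(x-\mu)^2/2\}$, whose supremum is $(1+\beta)^{3/2}/\sqrt{\beta e}$.

```python
import numpy as np

def influence_beta_mean(x, beta, mu=0.0):
    """Influence function of the beta-power mean estimate under N(mu, 1)."""
    z = np.asarray(x, dtype=float) - mu
    return (1.0 + beta) ** 1.5 * z * np.exp(-0.5 * beta * z ** 2)

def ges_closed_form(beta):
    """Supremum of |IF|, attained at |x - mu| = 1 / sqrt(beta)."""
    return (1.0 + beta) ** 1.5 / np.sqrt(beta * np.e)

def ges_numeric(beta, half_width=30.0, n_grid=600001):
    """Supremum of |IF| found on a grid, as a check of the closed form."""
    x = np.linspace(-half_width, half_width, n_grid)
    return np.max(np.abs(influence_beta_mean(x, beta)))
```

For β = 0 the influence function is unbounded, which corresponds to the entry ∞ for the maximum likelihood estimate.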
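Multivariate normal model
Normal density function
$$f(y,\mu,\Sigma) = \{(2\pi)^p\det\Sigma\}^{-1/2}\exp\Big\{-\frac{(y-\mu)^T\Sigma^{-1}(y-\mu)}{2}\Big\}$$
Likelihood equations, with $\theta = (\mu,\Sigma)$ and $\ell_j(\theta) = \ell(x_j,\theta)$
$$\frac{1}{n}\sum_{j=1}^{n}\psi(\ell_j(\theta))(x_j - \mu) = 0$$
$$\frac{1}{n}\sum_{j=1}^{n}\psi(\ell_j(\theta))\{(x_j - \mu)(x_j - \mu)^T - \Sigma\} = c_\Psi(\Sigma)$$
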
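Algorithm
$$\theta_k = (\mu_k, \Sigma_k) \ \to\ \theta_{k+1} = (\mu_{k+1}, \Sigma_{k+1})$$
Iteratively reweighted mean and variance
$$\mu_{k+1} = \frac{\sum_j \psi(\ell_j(\theta_k))\,x_j}{\sum_j \psi(\ell_j(\theta_k))}$$
$$\Sigma_{k+1} = \frac{\sum_j \psi(\ell_j(\theta_k))(x_j - \mu_k)(x_j - \mu_k)^T}{\sum_j \psi(\ell_j(\theta_k)) - c_\Psi(\det\Sigma_k)}$$
Under certain conditions,
$$L_\Psi(\theta_{k+1}) \ge L_\Psi(\theta_k) \quad (k = 1, 2, \dots)$$
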
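The iteratively reweighted update can be sketched for the β-power case. This is an illustration under stated assumptions, not the authors' implementation: the weight is taken as $\psi = f^\beta$, and the correction in the denominator, $n\beta(1+\beta)^{-p/2-1}\{(2\pi)^p\det\Sigma_k\}^{-\beta/2}$, is derived here from the bias potential $b_\beta = (1+\beta)^{-1}\int f^{1+\beta}dy$ of the normal model. The function name and the fixed iteration count are hypothetical.

```python
import numpy as np

def beta_normal_fit(x, beta, n_iter=100):
    """Iteratively reweighted mean and covariance for the beta-power likelihood."""
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    mu = x.mean(axis=0)                           # start from the ordinary MLE
    sigma = np.cov(x, rowvar=False, bias=True)
    for _ in range(n_iter):
        diff = x - mu
        q = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
        log_f = -0.5 * (q + p * np.log(2.0 * np.pi) + np.linalg.slogdet(sigma)[1])
        w = np.exp(beta * log_f)                  # weights psi = f^beta
        # Correction term from the derivative of the bias potential (derived above).
        correction = (n * beta * (1.0 + beta) ** (-p / 2.0 - 1.0)
                      * ((2.0 * np.pi) ** p * np.linalg.det(sigma)) ** (-beta / 2.0))
        new_mu = w @ x / w.sum()
        scatter = (diff * w[:, None]).T @ diff    # weighted scatter about the current mean
        sigma = scatter / (w.sum() - correction)
        mu = new_mu
    return mu, sigma
```

Observations far from the bulk of the data receive weights close to zero, so they have little effect on the updated mean and covariance.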
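Simulation (1)
ε-contamination models
$$G_\varepsilon^{(1)} = (1-\varepsilon)\,N\Big(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & 0.5\\ 0.5 & 2\end{pmatrix}\Big) + \varepsilon\,N\Big(\begin{pmatrix}5\\-5\end{pmatrix}, \begin{pmatrix}4 & 0\\ 0 & 1\end{pmatrix}\Big)$$
$$G_\varepsilon^{(2)} = (1-\varepsilon)\,N\Big(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}\Big) + \varepsilon\,N\Big(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}9 & 9\\ 9 & 9\end{pmatrix}\Big)$$
KL error $D_{KL}(\hat\theta, \theta_0)$ of the maximum likelihood estimator
                  ε = 0     ε = 0.05
  under G(1)      3.03      39.24
  under G(2)      2.70      16.65
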
β-power estimates vs. η-sigmoid estimates
Under G(1)
  β       KL error     η          KL error
  0       39.24        0          39.24
  0.01    35.30        0.0001     23.04
  0.05    20.93        0.00025    6.70
  0.10    8.91         0.0005     4.46
  0.20    12.40        0.00075    4.64
  0.30    31.64        0.001      6.04
Under G(2)
  β       KL error     η          KL error
  0       16.5         0          16.5
  0.01    13.91        0.0005     3.36
  0.05    6.65         0.00075    3.19
  0.10    5.14         0.001      3.9
  0.20    12.20        0.002      3.04
  0.30    29.68        0.003      3.07

Plot of the Ψ-estimates of the variance (no outliers)
[Figure: estimates of $(\sigma_{11}, \sigma_{12}, \sigma_{22})$ from 100 replications with sample size 100 under the normal distribution. Legend: η-MLE (η = 0.0025), β-MLE (β = 0.1), MLE, true value (1, 0, 1).]

Plot of the Ψ-estimates of the variance (with outliers)
[Figure: estimates of the variance parameters from 100 replications with sample size 100 under $G^{(2)}$. Legend: η-MLE (η = 0.0025), β-MLE (β = 0.1), MLE, true value (1, 0, 1).]

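Choosing the tuning parameter β
Squared loss function
$$\mathrm{Loss}(\beta) = \frac{1}{2}\int\{f(y,\hat\theta_\beta) - g(y)\}^2\,dy$$
$$\mathrm{CV}(\beta) = -\frac{1}{n}\sum_{i=1}^{n}f(x_i,\hat\theta_\beta^{(-i)}) + \frac{1}{2}\int f(y,\hat\theta_\beta)^2\,dy$$
$$\hat\beta = \arg\min_\beta \mathrm{CV}(\beta)$$
Approximation
$$\hat\theta_\beta^{(-i)} \approx \hat\theta_\beta - \frac{1}{n-1}\mathrm{IF}(x_i,\hat\theta_\beta)$$
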
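Choosing the tuning parameter β (2)
$$\mathrm{CV}(\beta) = -\frac{1}{n}\sum_{i=1}^{n}f(x_i,\hat\theta_\beta) + \frac{1}{2}\int f(y,\hat\theta_\beta)^2\,dy + \frac{1}{n(n-1)}\sum_{i=1}^{n}\mathrm{IF}(x_i,\hat\theta_\beta)^T\frac{\partial f(x_i,\hat\theta_\beta)}{\partial\theta}$$
When there are few outliers, the third term dominates and β is pulled toward 0.
When there are many outliers, the first and second terms dominate and β is pulled toward 1.
Cf. Konishi & Kitagawa (1996)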

Example: signal from a normal distribution with mean (0, 0) and variance ((.26, -.1), (-.1, .26)), with contamination
[Figure: CV(β) plotted against β for β between 0.025 and 0.175; the selected value is $\hat\beta = 0.07$.]
  Estimator               Mean               Variance
  MLE (signal)            (0.054, -0.081)    ((.228, -.126), (-.126, .261))
  MLE (contaminated)      (0.204, -0.184)    ((1.059, -.263), (-.263, .383))
  β-MLE (β = 0.07)        (0.086, -0.132)    ((.293, -.134), (-.134, .286))

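Ψ-PCA
Classical PCA
$$\min_\gamma\sum_{i=1}^{n}r(x_i - \bar{x}, \gamma) = \mathrm{tr}(S) - \max_\gamma\frac{\gamma^T S\gamma}{\gamma^T\gamma}$$
$$r(y,\gamma) = \|y\|^2 - \frac{(\gamma^T y)^2}{\|\gamma\|^2}$$
[Figure: a point $y$, its projection onto the direction $\gamma$, and the residual $r$.]
Ψ quasi-likelihood function
$$L_\Psi(\mu,\gamma) = \sum_{i=1}^{n}\Psi(r(x_i - \mu, \gamma))$$
$$\hat\gamma = \arg\min_\gamma\{\min_\mu L_\Psi(\mu,\gamma)\}$$
η-sigmoid
$$\Psi(z) = \log(\eta + \exp(z)), \qquad \Psi^*(z) = \exp(z) - \eta\,\Psi(z)$$
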
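The classical PCA identity can be verified numerically. This sketch assumes that $S$ is the unnormalized scatter matrix $\sum_i (x_i - \bar{x})(x_i - \bar{x})^T$, which the slides do not define; the function names are hypothetical.

```python
import numpy as np

def projection_residual(y, gamma):
    """r(y, gamma) = ||y||^2 - (gamma^T y)^2 / ||gamma||^2, evaluated for each row of y."""
    return np.sum(y * y, axis=1) - (y @ gamma) ** 2 / (gamma @ gamma)

def total_residual(y, gamma):
    """Sum of squared distances from the rows of y to the line spanned by gamma."""
    return np.sum(projection_residual(y, gamma))

def first_pc(y):
    """Leading eigenvector of the scatter matrix S = y^T y and the value tr(S) - lambda_max."""
    scatter = y.T @ y
    eigvals, eigvecs = np.linalg.eigh(scatter)    # eigenvalues in ascending order
    return eigvecs[:, -1], np.trace(scatter) - eigvals[-1]
```

Passing centered data to `first_pc` gives the direction that minimizes `total_residual`, and the minimum equals the trace of the scatter matrix minus its largest eigenvalue.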
Ψ algorithm 2
Update $(\mu, \gamma)$ into $(\mu^*, \gamma^*)$:
$$\mu^* = \sum_{i=1}^{n}w(x_i,\mu,\gamma)\,x_i, \qquad S(\mu^*,\gamma)\,\gamma^* = \lambda^*\gamma^*$$
where
$$w(x,\mu,\gamma) = \frac{\psi(r(x - \mu, \gamma))}{\sum_i\psi(r(x_i - \mu, \gamma))}$$
$$S(\mu,\gamma) = \sum_i w(x_i,\mu,\gamma)(x_i - \mu)(x_i - \mu)^T$$

Non-robustness of classical PCA
[Figure: scatter plots of the data with the estimated principal component direction. One panel is labeled Pc vector = (.55, .82, .01, .07, .01, .032, .10); the other is labeled Pc vector = (.00, .01, .05, .04, .02, .99, .00).]

Weights of Ψ-PCA
[Figure: plot of the Ψ weight function values, with the vertical axis up to 1 and the horizontal axis from 0 to 140.]
Pc vector = (.55, .82, .01, .07, .01, .032, .10)
ΨPc vector = (.64, .75, .01, .09, .01, .03, .09)

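Tube neighborhood of Ψ-PCA
[Figure: tube around the estimated axis $\hat\gamma_\Psi$, showing an observation $x_i$ and $z(x_i - \hat\mu_\Psi, \hat\gamma_\Psi)$.]
[Figure: weight $\psi(r)$ plotted against the radius $r$, for $r$ from 0 to 15.]
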
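What is ICA?
Assumption 1 (independence of the signals): $s = (s_1, \dots, s_m) \sim p(s) = p_1(s_1)\cdots p_m(s_m)$, with $E(S_1) = 0, \dots, E(S_m) = 0$
Assumption 2 (affine transformation): there exist $W \in \mathbb{R}^{m\times m}$, $\mu \in \mathbb{R}^m$ such that $x = W^{-1}s$
$$f(x,W,p) = |\det(W)|\,p_1(w_1 x)\cdots p_m(w_m x)$$
where $w_i$ is the $i$-th row of $W$.
Problem: estimate $W$ from $(x_1, \dots, x_n)$
Semiparametric model
Parametric component: $W$
Nonparametric component: $p(s)$
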
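ICA by maximum likelihood
Likelihood function
$$\ell(W,p) = \frac{1}{n}\sum_{i=1}^{n}\ell(x_i,W,p), \qquad \ell(x,W,p) = \sum_{i=1}^{m}\log p_i(w_i x) + \log|\det(W)|$$
$$\frac{\partial\ell(x,W,p)}{\partial W} = \big(I_m - h(Wx)(Wx)^T\big)W^{-T}$$
$$h(s) = -\Big(\frac{\partial\log p_1(s_1)}{\partial s_1}, \dots, \frac{\partial\log p_m(s_m)}{\partial s_m}\Big)^T$$
Likelihood equation
$$\frac{1}{n}\sum_{i=1}^{n}F(x_i,W,p) = 0, \qquad F(x,W,p) = I_m - h(Wx)(Wx)^T$$
Semiparametric consistency
$$E\{F_{ij}(x,W,p) \mid W, p^*\} = c_i\,E(S_j \mid p_j^*) = 0 \quad (i \ne j)$$
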
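The gradient formula for the ICA log-likelihood can be checked against finite differences. The sketch below assumes hyperbolic-secant source densities $p_i(s) = 1/\{\pi\cosh(s)\}$ and the sign convention $h_i(s) = -\partial\log p_i(s)/\partial s = \tanh(s)$, both chosen for illustration. The matrix and data point are arbitrary and the function names are hypothetical.

```python
import numpy as np

def log_density(x, W):
    """log f(x, W, p) = sum_i log p_i(w_i x) + log|det W| with p_i(s) = 1 / (pi cosh s)."""
    s = W @ x
    return np.sum(-np.log(np.cosh(s)) - np.log(np.pi)) + np.log(abs(np.linalg.det(W)))

def analytic_gradient(x, W):
    """(I - h(Wx) (Wx)^T) W^{-T} with h(s) = tanh(s)."""
    s = W @ x
    h = np.tanh(s)
    return (np.eye(len(s)) - np.outer(h, s)) @ np.linalg.inv(W).T

def numeric_gradient(x, W, eps=1e-6):
    """Central finite differences of log_density with respect to each entry of W."""
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            step = np.zeros_like(W)
            step[i, j] = eps
            grad[i, j] = (log_density(x, W + step) - log_density(x, W - step)) / (2.0 * eps)
    return grad
```

The agreement follows from $\partial\log|\det W|/\partial W = W^{-T}$ and $x^T = (Wx)^T W^{-T}$.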
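Ψ-ICA
β-power likelihood equation
$$\frac{1}{n}\sum_{i=1}^{n}f(x_i,W,\mu)^\beta\,F(x_i,W,\mu) = B_\beta(W,\mu)$$
Decomposability: for $s \ne t$, the $(s,t)$ component of the equation is
$$\prod_{q\ne s,\,q\ne t}E[\{p_q(w_q X - \mu_q)\}^\beta] \times E[\{p_s(w_s X - \mu_s)\}^\beta\,h_s(w_s X - \mu_s)] \times E[\{p_t(w_t X - \mu_t)\}^\beta\,w_t X] = 0$$
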
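ICA by maximum likelihood (example)
150 uniform random numbers from U(0,1) × U(0,1)
[Figure: scatter plot of the data after linear mixing, with both axes from -1.5 to 1.5. The mixing matrix shown on the slide has rows (1, 2) and (1, 0.5).]
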
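Failure of the maximum likelihood method
Addition of 50 normal noise points from N(0, 1) × N(0, 1)
[Figure: two scatter plots of the contaminated data with the directions estimated by maximum likelihood.]
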
β-ICA (β = 0.2)
[Figure: two scatter plots of the contaminated data with the directions estimated by the β-maximum likelihood method.]

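Normal mixture model
Mixture model
$$f(x,\theta) = \pi_1 h(x,\theta_1) + \cdots + \pi_R h(x,\theta_R) = \sum_{r=1}^{R}\pi_r\,h(x,\theta_r), \qquad \theta = \{(\pi_r,\theta_r) : r = 1, \dots, R\}$$
Non-regularity
$\pi_r = 0 \Rightarrow \theta_r$ is a redundant parameter
$\theta_r = \theta_s \Rightarrow$ one of $\pi_r, \pi_s$ is a redundant parameter
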
Unbounded likelihood
• Two-component normal mixture model
[Figure: mixture densities plotted for x between -4 and 4.]
0.3 N(0, 1) + 0.7 N(3, 0.5)
0.32 N(0.11, 1.61) + 0.68 N(3.04, 0.42)
0.67 N(0.34, 0.02) + 0.93 N(2.23, 2.64)
Two convergence points of the EM algorithm

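The unboundedness of the ordinary likelihood is easy to see numerically: center one component on a data point and let its standard deviation shrink. The data values and the fixed second component below are hypothetical, chosen only to illustrate the effect.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2)."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def mixture_loglik(x, pi1, mu1, sd1, mu2, sd2):
    """Log-likelihood of the two-component mixture pi1 N(mu1, sd1^2) + (1 - pi1) N(mu2, sd2^2)."""
    dens = pi1 * normal_pdf(x, mu1, sd1) + (1.0 - pi1) * normal_pdf(x, mu2, sd2)
    return np.sum(np.log(dens))

# Hypothetical sample; the first component is centered exactly on x[0].
x = np.array([-1.2, -0.5, 0.1, 0.7, 1.5, 2.8, 3.1, 3.4])
logliks = [mixture_loglik(x, 0.3, x[0], sd, 3.0, 2.0) for sd in (1e-2, 1e-4, 1e-6)]
```

Each reduction of the standard deviation by a factor of 100 raises the log-likelihood by about $\log 100$, so the likelihood has no finite maximum over the full parameter space.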
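Boundedness of the β-likelihood
In the normal mixture model
$$f(x,\theta) = \sum_{r=1}^{R}\pi_r\,\phi(x; \mu_r, \sigma_r^2),$$
if
$$\min_{1\le r\le R}\pi_r \ge \frac{(1+\beta)^{3/2}}{n} \qquad (n: \text{sample size}),$$
then the β-likelihood function is bounded.
[Figure: plot with the horizontal axis from 0.2 to 1 and the vertical axis from 0.05 to 0.35.]
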
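Selection of β
Cramér-von Mises divergence, with the sample ordered as $x_1 \le \cdots \le x_n$
$$D(g, f_\theta) = \int\{G(x) - F(x,\theta)\}^2 g(x)\,dx \approx \frac{1}{n}\sum_{i=1}^{n}\Big\{\frac{i - 0.5}{n} - F(x_i,\theta)\Big\}^2$$
Cross-validation
$$\hat\beta = \arg\min_{\beta\ge 0}\sum_{i=1}^{n}\Big\{\frac{i - 0.5}{n} - F(x_i,\hat\theta_\beta^{(-i)})\Big\}^2$$
Approximation
$$\hat\theta_\beta^{(-i)} \approx \hat\theta_\beta - \frac{1}{n-1}\mathrm{IF}(x_i,\hat\theta_\beta)$$
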
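Ψ-likelihood (Summary 2)
The η-maximum likelihood method appears to give the better bias-variance balance.
PCA was carried out with the Euclidean projection distance.
Empirically, the η-maximum likelihood method appears to work better.
For ICA, the decomposability of the estimating equation makes
the β-maximum likelihood method easy to compute.
For Gaussian mixtures, the boundedness of the likelihood makes
the β-maximum likelihood method preferable.
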
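Open problems for the Ψ-maximum likelihood method
Regression analysis
Under generalized linear models?
Discriminant analysis
Relation to boosting?
Discrete distribution models
Robustness for contingency tables?
Graphical models
Stability of interpretation and computational cost?
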
References
Higuchi, I. and Eguchi, S. (1998). The influence function of principal
component analysis by self-organizing rule. Neural Computation, 10,
1435-1444.
Kamiya, H. and Eguchi, S. (2001). A class of robust principal component
vectors. J. Mult. Anal., 77, 239-269.
Eguchi, S. and Y. Kano. (2001). Robustifing maximum likelihood estimation
by psi-divergence. ISM Research Memo 802.
Minami, M. and S. Eguchi. (2001). Robust blind source separation by
beta-Divergence. ISM Research Memo. 799.
Fujisawa, H. and S. Eguchi. (2001). Robust estimation in normal mixture model.
(to be submitted).
Eguchi, S. and J. Copas. (2002). A class of logistic type discriminations.
Biometrika 89, in press.

References (general models)
Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. (1998). Robust and efficient
estimation by minimizing a density power divergence. Biometrika 85, 549-559.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics.
Phil. Trans. R. Soc. 222 309-368.
Hampel, F. R. (1974). The influence curve and its role in robust estimation.
J. Amer. Statist. Ass., 69, 383-393.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986).
Robust Statistics: The Approach Based on Influence Functions. Wiley & Sons.
Huber, P. J. (1964). Robust estimation of a location parameter.
Annals of Math. Statist. 35, 73-101.
Huber, P. J. (1981). Robust Statistics. New York: Wiley & Sons.
Jones, M. C., Hjort, N. L. , Harris, I. R. and Basu, A. (2001). A comparison of
related density-based minimum divergence estimators. Biometrika, 88, 865-873.
Konishi, S., and Kitagawa, G. (1996), Generalised information criteria in model
selection. Biometrika, 83, 875--890.
Wald, A., (1949). Note on the consistency of the maximum likelihood estimate.
Ann. Math. Statist. 20, 595-601.

References (PCA)
Amari, S.-I. (1977). Neural theory of association and concept formation.
Biol. Cybernetics, 26, 175--85.
Critchley, F. (1985). Influence in Principal components analysis.
Biometrika, 72, 627--36.
Haykin, S. (1999). Neural Networks. Toronto: Prentice Hall.
Maronna, R. A. (1976). Robust M-estimators of multivariate location and scatter.
Ann. Statist., 4, 51--67.
Xu, L. and Yuille, A. L. (1995). Robust principal component analysis by
self-organising rules based on statistical physics approach. IEEE Trans. on
Neural Networks, 6, 131--43.

References (ICA)
Amari, S. (1998). Natural gradient works efficiently in learning.
Neural Computation, 10, 254-276.
Amari, S. & Cardoso, J. F. (1997). Blind source separation: Semiparametric
statistical approach. IEEE Trans. on Signal Processing, 45, 2692-2700.
Amari, S., Chen, T. & Cichocki, A. (1997). Stability analysis of learning
algorithm for blind source separation. Neural Networks, 10(8), 1345-1351.
Bell, A.J. & Sejnowski, T.J. (1995). An information-maximization approach to
blind separation and blind deconvolution. Neural Computation 7, 1129-1159.
Cardoso, J.F. & Souloumiac, A. (1993). Blind beamforming for non-Gaussian
signals. Proc. IEEE on SP, 140, 362-370.
Hyvarinen, A. (1999). Fast and robust fixed-point algorithms for independent
component analysis. IEEE Trans. on Neural Network, 10(3), 626-634.
Jutten, C. & Herault, J. (1991). Blind separation of sources, Part I: An adaptive
algorithm based on neuromimetic architecture. Signal Processing, 24, 1-20.

References (Gaussian mixture)
Chen, H., Chen, J., and Kalbfleisch, J. D. (2001). A modified likelihood ratio
test for homogeneity in finite mixture models. J. Royal Statist. Soc. B, 63, 19--29.
Cutler, A., and Cordero-Braña, O. I. (1996). Minimum Hellinger distance estimation
for finite mixture models, J. Amer. Statist. Assoc., 91, 1716--1723.
Hathaway, R. J. (1985). A constrained formulation of maximum-likelihood estimation
for normal mixture distributions. Ann. Statist., 13, 795--800.
McLachlan, G., and Peel, D. (2000), Finite Mixture Models. New York: Wiley.
Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). Statistical analysis of
finite mixture distributions. Chichester: Wiley.
Woodward, W. A., Parr, W. C., Schucany, W. R., and Lindsey, H. (1984). A comparison
of minimum distance and maximum likelihood estimation of a mixture proportion.
J. Amer. Statist. Assoc. 79, 590--598.