
2004 Open Lecture at ISM
Recent topics in machine learning:
Boosting
November 24 (Wed) - 26 (Fri), 2004
Open Lecture, Essentials of Statistical Mathematics: "Recent Topics in Machine Learning"
Boost Learning
Shinto Eguchi
(The Institute of Statistical Mathematics; Department of Statistical Science, The Graduate University for Advanced Studies)
1
Course outline
Boost learning:
This course reviews AdaBoost, a method of statistical pattern recognition, and discusses its strengths and weaknesses.
Applications to gene expression and remote-sensing data are introduced.
2
Boost Learning (I)
10:00-12:30 November 25 Thu
Boost learning algorithm
AdaBoost
AsymAdaBoost
AsymLearning
EtaBoost
Robust learning
GroupBoost
Group learning
3
Boost Learning (II)
13:30-16:00 November 26 Fri
BridgeBoost
Meta learning
LocalBoost
Local learning
Statistical discussion
Probabilistic framework
Bayes Rule, Fisher’s LDF, Logistic regression
Optimal classifier by AdaBoost
4
Acknowledgements
Much of the material presented in this course comes from joint work with the following collaborators, to whom I am grateful:
Noboru Murata (Waseda University, Science and Engineering)
Ryuei Nishii (Kyushu University, Mathematics)
Takafumi Kanamori (Tokyo Institute of Technology, Mathematical and Computing Sciences)
Takashi Takenouchi (The Institute of Statistical Mathematics)
Masanori Kawakita (The Graduate University for Advanced Studies, Statistical Science)
John B. Copas (Dept Stats, Univ of Warwick)
5
The strength of the weak learnability.
Schapire, R. (1990)
Strong learnability
A concept class is strongly learnable if, given access to a source of examples of the unknown concept, the learner with high probability can output a hypothesis that is correct on all but an arbitrarily small fraction of the instances.
Weak learnability
The concept class is weakly learnable if the learner can produce a hypothesis that performs only slightly better than random guessing.
6
Web-page on Boost
Boosting Research Site: http://www.boosting.org/
Robert Schapire’s home page:
http://www.cs.princeton.edu/~schapire/
Yoav Freund's home page :
http://www1.cs.columbia.edu/~freund/
John Lafferty
http://www-2.cs.cmu.edu/~lafferty/
7
Statistical pattern recognition
Recognition for
Character, Image, Speaker, Signal, Face, Language,…
Prediction for
Weather, earthquake, disaster, finance, interest rates,
company bankruptcy, credit, default, infection, disease, adverse
effect
Classification for
Species, parentage, genomic type, gene expression,
protein expression, system failure, machine trouble
8
Multi-class classification
Feature vector
Class label
Discriminant function
Classification rule
9
Binary classification
label
0-normalization
Classification rule
Learn a training dataset
Make a classification
10
Statistical learning theory
Boost learning
Boost by filter (Schapire, 1990)
Bagging, Arcing (bootstrap)
(Breiman, Friedman, Hastie)
AdaBoost (Schapire, Freund, Bartlett, Lee)
Support vector machines
Maximize margin
Kernel space
(Vapnik, Schölkopf)
11
Class of weak machines
Stump class
Linear class
ANN class
SVM class
kNN class
Point: weak machines with distinctive (colorful) character rather than universal character
12
AdaBoost
1. Initial setting: $w_1(i) = \dfrac{1}{n}\ (i = 1,\dots,n)$, $F_0(x) = 0$
2. For $t = 1,\dots,T$
(a) $\epsilon_t(f) = \sum_{i=1}^{n} I(y_i \neq f(x_i))\,\dfrac{w_t(i)}{\sum_{i'} w_t(i')}$, $\quad \epsilon_t(f_{(t)}) = \min_{f\in\mathcal F}\epsilon_t(f)$
(b) $\alpha_t = \dfrac{1}{2}\log\dfrac{1-\epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}$
(c) $w_{t+1}(i) = w_t(i)\exp(-\alpha_t f_{(t)}(x_i)\,y_i)$
3. Output $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T}\alpha_t f_{(t)}(x)$
13
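As a concrete illustration of the three steps above, here is a minimal AdaBoost sketch in Python. It assumes decision stumps $a\,\mathrm{sgn}(x_j-b)$ as the weak machines and an exhaustive threshold search; the function names are mine, not code from the lecture.

```python
# Minimal AdaBoost sketch with decision stumps as weak machines.
import numpy as np

def stump_pred(X, j, b, a):
    """Weak machine f(x) = a * sign(x_j - b), with sign(0) mapped to -1."""
    return a * np.where(X[:, j] > b, 1.0, -1.0)

def adaboost(X, y, T=100):
    n, p = X.shape
    w = np.full(n, 1.0 / n)                    # step 1: w_1(i) = 1/n
    machines = []                              # [(j, b, a, alpha), ...]
    for t in range(T):
        best = None
        for j in range(p):                     # step 2(a): minimise the weighted error
            for b in np.unique(X[:, j]):
                for a in (+1.0, -1.0):
                    eps = np.sum(w * (stump_pred(X, j, b, a) != y)) / w.sum()
                    if best is None or eps < best[0]:
                        best = (eps, j, b, a)
        eps, j, b, a = best
        eps = np.clip(eps, 1e-12, 1 - 1e-12)   # avoid log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)  # step 2(b)
        w *= np.exp(-alpha * stump_pred(X, j, b, a) * y)   # step 2(c)
        machines.append((j, b, a, alpha))
    return machines

def predict(machines, X):
    F = sum(alpha * stump_pred(X, j, b, a) for j, b, a, alpha in machines)
    return np.sign(F)                          # step 3: sign(F_T(x))
```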
Learning algorithm
w1 (1), , w1 (n)
f (1) ( x )
1
w2 (1),, w2 (n)
f( 2) ( x)
2
Data
1
2
T

t 1
t
f(t ) ( x)
T
wT (1),, wT (n)
f (T ) ( x )
T
Final machine sign( FT ( x ) ), where FT ( x )  t f ( t ) ( x ) t 1
14
Simulation (complete separation)
Feature space: $[-1,1]\times[-1,1]$
Decision boundary
[Figure: the true decision boundary on the feature space]
15
Set of weak machines
Linear classification machines
Random generation
[Figure: randomly generated linear classifiers on $[-1,1]\times[-1,1]$]
16
Learning process (I)
[Six panels showing the decision regions as learning proceeds: Iter = 1, train err = 0.21; Iter = 13, train err = 0.18; Iter = 17, train err = 0.10; Iter = 23, train err = 0.10; Iter = 31, train err = 0.095; Iter = 47, train err = 0.08]
17
Learning process (II)
[Three panels: Iter = 55, train err = 0.061; Iter = 99, train err = 0.032; Iter = 155, train err = 0.016]
18
Final stage
[Two panels: contour of $F(x)$ and $\mathrm{sign}(F(x))$]
19
Learning curve
[Plot: training error (falling from about 0.2 to about 0.05) against iteration, Iter = 1,...,277]
20
Characteristics
Update $w_t(i)\ \to\ w_{t+1}(i)$:
if $f_{(t)}(x_i) \neq y_i$, multiply by the factor $e^{\alpha_t}$;
if $f_{(t)}(x_i) = y_i$, multiply by the factor $e^{-\alpha_t}$.
Weighted error rates: $\epsilon_t(f_{(t)}) \le \epsilon_{t+1}(f_{(t)})$ and $\epsilon_{t+1}(f_{(t+1)}) \le \epsilon_{t+1}(f_{(t)})$, with
$\epsilon_{t+1}(f_{(t)}) = \dfrac{1}{2}$ (least favorable)
21
$\epsilon_{t+1}(f_{(t)}) = \sum_{i=1}^{n} I(f_{(t)}(x_i)\neq y_i)\,\dfrac{w_{t+1}(i)}{\sum_{i'} w_{t+1}(i')}$
$= \dfrac{\sum_{i=1}^{n} I(f_{(t)}(x_i)\neq y_i)\exp\{-\alpha_t y_i f_{(t)}(x_i)\}\,w_t(i)}{\sum_{i=1}^{n}\exp\{-\alpha_t y_i f_{(t)}(x_i)\}\,w_t(i)}$
$= \dfrac{\exp\{\alpha_t\}\sum_{i=1}^{n} I(f_{(t)}(x_i)\neq y_i)\,w_t(i)}{\exp\{\alpha_t\}\sum_{i=1}^{n} I(f_{(t)}(x_i)\neq y_i)\,w_t(i) + \exp\{-\alpha_t\}\sum_{i=1}^{n} I(f_{(t)}(x_i)= y_i)\,w_t(i)}$
$= \dfrac{\sqrt{\dfrac{1-\epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}}\;\epsilon_t(f_{(t)})}{\sqrt{\dfrac{1-\epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}}\;\epsilon_t(f_{(t)}) + \sqrt{\dfrac{\epsilon_t(f_{(t)})}{1-\epsilon_t(f_{(t)})}}\;\{1-\epsilon_t(f_{(t)})\}} = \dfrac{1}{2}$
22
Exponential loss
Exponential loss: $L_{\exp}(F) = \dfrac{1}{n}\sum_{i=1}^{n}\exp\{-y_i F(x_i)\}$
Update by $F(x)\ \to\ F(x) + \alpha f(x)$:
$L_{\exp}(F+\alpha f) = \dfrac{1}{n}\sum_{i=1}^{n}\exp\{-y_i F(x_i)\}\exp\{-\alpha y_i f(x_i)\}$
$= \dfrac{1}{n}\sum_{i=1}^{n}\exp\{-y_i F(x_i)\}\big[\,e^{\alpha} I(f(x_i)\neq y_i) + e^{-\alpha} I(f(x_i)=y_i)\,\big]$
$= L_{\exp}(F)\{\,e^{\alpha}\epsilon(f) + e^{-\alpha}(1-\epsilon(f))\,\}$
23
Sequential minimization
$L_{\exp}(F+\alpha f) = L_{\exp}(F)\{\,\epsilon(f)\,e^{\alpha} + (1-\epsilon(f))\,e^{-\alpha}\,\}$,
where $\epsilon(f) = \dfrac{1}{n\,L_{\exp}(F)}\sum_{i=1}^{n} I(f(x_i)\neq y_i)\exp\{-y_i F(x_i)\}$
$\epsilon(f)\,e^{\alpha} + (1-\epsilon(f))\,e^{-\alpha} = \left(\sqrt{\epsilon(f)\,e^{\alpha}} - \sqrt{\dfrac{1-\epsilon(f)}{e^{\alpha}}}\right)^{2} + 2\sqrt{\epsilon(f)\{1-\epsilon(f)\}} \;\ge\; 2\sqrt{\epsilon(f)\{1-\epsilon(f)\}}$
Equality holds iff $\alpha_{\rm opt} = \dfrac{1}{2}\log\dfrac{1-\epsilon(f)}{\epsilon(f)}$
24
AdaBoost = minimum exp-loss
$\min_{\alpha\in\mathbb R} L_{\exp}(F_{t-1}+\alpha f_{(t)}) = 2\,L_{\exp}(F_{t-1})\sqrt{\epsilon(f_{(t)})\{1-\epsilon(f_{(t)})\}}$
$\alpha_{\rm opt} = \dfrac{1}{2}\log\dfrac{1-\epsilon(f_{(t)})}{\epsilon(f_{(t)})}$
Hence the AdaBoost steps are a sequential minimization of the exponential loss:
(a) $f_{(t)} = \arg\min_{f\in\mathcal F}\epsilon_t(f)$
(b) $\alpha_t = \dfrac{1}{2}\log\dfrac{1-\epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}$
(c) $w_{t+1}(i) = w_t(i)\exp\{-\alpha_t y_i f_{(t)}(x_i)\}$
25
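A quick numerical check of the minimization above (a sketch, not code from the lecture): for a fixed weighted error $\epsilon$, the value $\alpha = \frac12\log\frac{1-\epsilon}{\epsilon}$ minimizes $\epsilon e^{\alpha} + (1-\epsilon)e^{-\alpha}$, and the minimum is $2\sqrt{\epsilon(1-\epsilon)}$.

```python
# Verify the optimal alpha and the minimum value numerically for eps = 0.3.
import numpy as np

eps = 0.3
alphas = np.linspace(-3, 3, 10001)
values = eps * np.exp(alphas) + (1 - eps) * np.exp(-alphas)

alpha_num = alphas[np.argmin(values)]
alpha_opt = 0.5 * np.log((1 - eps) / eps)

print(alpha_num, alpha_opt)                        # both ~ 0.4236
print(values.min(), 2 * np.sqrt(eps * (1 - eps)))  # both ~ 0.9165
```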
Simulation (complete random)
[Figure: training data with completely random labels on $[-1,1]\times[-1,1]$]
26
Overlearning of AdaBoost
[Three panels: Iter = 51, train err = 0.21; Iter = 151, train err = 0.06; Iter = 301, train err = 0.0]
27
Drawbacks of AdaBoost
1. Unbalanced learning
AsymAdaBoost: balances the false negatives/positives
2. Over-learning, even for noisy datasets
EtaBoost: robustifies against mislabelled examples
GroupBoost: relaxes the p >> n problem
LocalBoost: extracts spatial information
BridgeBoost: combines different datasets
28
AsymBoost
A small modification of AdaBoost: replace step 2(b) by
2(b)' $\quad \alpha_t^{*} = \arg\min_{\alpha\in\mathbb R} L_{\rm asym}(\alpha f_t + F_{t-1})$
The selection of $k$: the default choice is
$k = \dfrac{\sum_{i=1}^{n} I(y_i = +1)}{\sum_{i=1}^{n} I(y_i = -1)}$
29
Weighted errors by k
30
[Slides 31-33: figures of the weighted errors for different values of k]
Result of AsymBoost
34
Eta-loss function
[Figure: the eta-loss function, a regularized loss]
35
EtaBoost
1. Initial setting: $w_1(i) = \dfrac{1}{n}\ (i=1,\dots,n)$, $F_0(x)=0$
2. For $m = 1,\dots,T$
(a) $\epsilon_m(f) = \sum_{i=1}^{n} I(y_i\neq f(x_i))\,w_m(i)$, $\quad \epsilon_m(f_{(m)}) = \min_{f}\epsilon_m(f)$
(b) $\alpha_m^{*}$: the coefficient obtained by minimizing the eta-loss (in place of the AdaBoost formula)
(c) $w_{m+1}(i) = w_m(i)\exp(-\alpha_m^{*} f_{(m)}(x_i)\,y_i)$
3. $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T}\alpha_t^{*} f_{(t)}(x)$
36
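A hedged sketch of a robustified weight update in the spirit of EtaBoost. The exact eta-loss is only shown as a figure on the earlier slide, so the mixture form below (exponential weights blended with uniform weights through a parameter eta) is an illustrative assumption of mine, not the lecture's definition; it shows the key effect of keeping a few (possibly mislabelled) examples from dominating the next round.

```python
# Robustified weight update: blend AdaBoost's exponential weights with
# uniform weights. eta = 0 recovers plain AdaBoost; larger eta bounds the
# influence of suspicious (possibly mislabelled) examples.
import numpy as np

def robust_weights(F, y, eta=0.1):
    """F: current scores F_m(x_i); y: labels in {-1,+1}; eta in [0,1]."""
    w_exp = np.exp(-y * F)
    w_exp /= w_exp.sum()
    w_uni = np.full_like(w_exp, 1.0 / len(w_exp))
    w = (1.0 - eta) * w_exp + eta * w_uni
    return w / w.sum()
```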
A toy example
37
AdaBoost vs Eta-Boost
38
Overlearning of AdaBoost
Simulation (complete random)
[Panels repeated from slide 27: Iter = 51, train err = 0.21; Iter = 301, train err = 0.0]
39
EtaBoost
[Three panels: Iter = 51, train err = 0.25; Iter = 51, train err = 0.15; Iter = 351, train err = 0.18]
40
Mis-labeled examples
Mis-labeled
41
Comparison
AdaBoost
EtaBoost
42
GroupBoost
Relax the over-learning of AdaBoost by group learning.
Idea: in AdaBoost step 2(a),
the best machine is singly selected;
other good machines are cast off.
Is there any wise way of grouping the G best machines?
43
Grouping machines
Let $f_t^{\,j}(\cdot\,; b_j) = \arg\min_{f\in\mathcal F_j}\epsilon_t(f)$, where $\mathcal F_j = \{\,a\,\mathrm{sgn}(x_j - b) : b\in\mathbb R,\ a\in\{-1,+1\}\,\}$
Select a set $\{\,f_{t(1)}(\cdot\,; b_{(1)}),\dots,f_{t(G)}(\cdot\,; b_{(G)})\,\}$
such that $\epsilon_t(f_{t(1)}(\cdot\,;b_{(1)})) \le\cdots\le \epsilon_t(f_{t(G)}(\cdot\,;b_{(G)}))$
$f_t(x) = \dfrac{1}{G}\sum_{g=1}^{G}\alpha_{t,(g)}\,f_{t(g)}(x_{(g)};\,b_{(g)})$
44
GroupBoost
(a) $\epsilon_t(f) = \sum_{i=1}^{N} I(f(x_i)\neq y_i)\,w_t(i)$; select the group $\{\,f_t(x_{(1)};b_{(1)}),\dots,f_t(x_{(G)};b_{(G)})\,\}$
(b) $\alpha_{t,(g)} = \dfrac{1}{2}\log\dfrac{1-\epsilon_t(f_t(\cdot\,;b_{(g)}))}{\epsilon_t(f_t(\cdot\,;b_{(g)}))}\quad(g=1,\dots,G)$,
$\quad f_{(t)}(x) = \dfrac{1}{G}\sum_{g=1}^{G}\alpha_{t,(g)}\,f_t(x_{(g)};b_{(g)})$
(c) $w_{t+1}(i) = \dfrac{w_t(i)\exp(-f_{(t)}(x_i)\,y_i)}{Z_{t+1}}$
45
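A minimal sketch of one GroupBoost round, assuming one stump per coordinate as in $\mathcal F_j$ above and the same exhaustive threshold search as the AdaBoost sketch; it is an illustration, not the lecture's code.

```python
# One GroupBoost round: pick the best stump per coordinate, keep the G best,
# average their alpha-weighted votes, and update the weights.
import numpy as np

def group_round(X, y, w, G=5):
    n, p = X.shape
    per_gene = []
    for j in range(p):                         # best stump within each F_j
        best = None
        for b in np.unique(X[:, j]):
            for a in (+1.0, -1.0):
                pred = a * np.where(X[:, j] > b, 1.0, -1.0)
                eps = np.sum(w * (pred != y)) / w.sum()
                if best is None or eps < best[0]:
                    best = (eps, j, b, a)
        per_gene.append(best)
    per_gene.sort(key=lambda t: t[0])          # order by weighted error
    group = per_gene[:G]                       # keep the G best machines
    f_t = np.zeros(n)
    for eps, j, b, a in group:
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        f_t += alpha * a * np.where(X[:, j] > b, 1.0, -1.0)
    f_t /= G                                   # f_(t) = (1/G) sum_g alpha_g f_g
    w_new = w * np.exp(-f_t * y)
    return f_t, w_new / w_new.sum()            # normalised by Z_{t+1}
```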
Grouping jumps for the next
[Diagram: from the current combined machine $F_0$, the candidate machines $f_1, f_2, f_3, f_4$ with coefficients $\alpha_1,\dots,\alpha_4$ are grouped for the next jump]
46
Learning architecture
[Diagram: Data with weights $w_1(1),\dots,w_1(n)$ gives the group $f_{(1,1)}(x),\dots,f_{(1,G)}(x)$ with coefficients $\alpha_{(1,1)},\dots,\alpha_{(1,G)}$, combined into $f_{(1)}(x)$; similarly $w_2(1),\dots,w_2(n)$ gives $f_{(2,1)}(x),\dots,f_{(2,G)}(x)$, $\alpha_{(2,1)},\dots,\alpha_{(2,G)}$, $f_{(2)}(x)$; up to $w_T(1),\dots,w_T(n)$, $f_{(T,1)}(x),\dots,f_{(T,G)}(x)$, $\alpha_{(T,1)},\dots,\alpha_{(T,G)}$, $f_{(T)}(x)$.]
Each round groups G machines: $f_{(t)} = \dfrac{1}{G}\big(\alpha_{(t,1)}f_{(t,1)} + \cdots + \alpha_{(t,G)}f_{(t,G)}\big)$, and the final machine combines $\sum_{t=1}^{T} f_{(t)}(x)$.
47
AdaBoost and GroupBoost
Update of the weights (multiplicative factor):
AdaBoost: $\exp(\alpha_t)$
GroupAdaBoost: $\exp\Big(\dfrac{1}{G}\sum_{g=1}^{G}\alpha_{t,g}\Big)$
・AdaBoost: $L_{\exp}(F_t + \alpha_{t+1} f_{t+1}) \le L_{\exp}(F_t)$
・GroupAdaBoost: $L_{\exp}(F_t + \alpha_{t+1} f_{t+1}) \le L_{\exp}(F_t + f_{t+1}) \le L_{\exp}(F_t)$
48
From microarray
Contest program from bioinformatics
(BIP2003)
http://contest.genome.ad.jp/
Microarray data
Number of genes
p = 1000~100000
Size of individuals n = 10~100
49
Output
http://genome-www.stanford.edu/cellcycle/
50
Goal
Disease and gene expressions
y is a label for clinical information
x = (x_1, ..., x_p) = feature vector
The gene expression for the i-th individual is observed as x_i = log (…)
(BIP2003) Problem 2: n = 27+11 = 38 << p = 7109
51
Microarray data
cDNA microarray
52
Prediction from gene expressions
Feature vector: dimension = number of genes p; components = quantities of gene expression; $x = (x_1,\dots,x_p)$
Class label (disease, adverse effect): $y \in \{-1,+1\}$
Classification machine $f : x \to y$, based on the training dataset $\{(x_i, y_i) : 1\le i\le n\}$
53
Leukemic diseases, Golub et al
http://www.broad.mit.edu/cgi-bin/cancer/publications/
54
AdaBoost
1. Initial setting: $w_1(i) = \dfrac{1}{n}\ (i=1,\dots,n)$, $F_0(x)=0$
2. For $t=1,\dots,T$
(a) $\epsilon_t(f) = \sum_{i=1}^{n} I(y_i\neq f(x_i))\,w_t(i)$, $\quad \epsilon_t(f_{(t)}) = \min_f \epsilon_t(f)$
(b) $\alpha_t = \dfrac{1}{2}\log\dfrac{1-\epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}$
(c) $w_{t+1}(i) = w_t(i)\exp(-\alpha_t f_{(t)}(x_i)\,y_i)$
3. $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T}\alpha_t f_{(t)}(x)$
55
One-gene classifier
$f_j(x) = \begin{cases} +1 & \text{if } x_j > b_j \\ -1 & \text{if } x_j \le b_j \end{cases}$
Let $x_{j1},\dots,x_{jn}$ be the expressions of the j-th gene.
[Figure: the error count as the cut-off $b_j$ moves along the sorted expressions $x_j$ (e.g. 5, 6, 5, 6, 5, 6, 5, 4, 5, 6, 5)]
$b_j = \arg\min_b \Big\{\sum_i I\big(y_i\neq\mathrm{sgn}(x_{ji}-b)\big)\Big\}$
56
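A plain implementation of the arg-min above (the variable names are mine, not from the lecture): scan the observed expressions of one gene as candidate cut-offs and keep the one with the fewest training errors.

```python
# One-gene classifier: choose the cut-off b_j minimising the error count.
import numpy as np

def one_gene_classifier(xj, y):
    """xj: expressions of the j-th gene; y: labels in {-1,+1}."""
    best_b, best_err = None, np.inf
    for b in np.sort(xj):
        pred = np.where(xj > b, 1, -1)
        err = np.sum(pred != y)                # unweighted error count
        if err < best_err:
            best_b, best_err = b, err
    return best_b, best_err
```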
The second training
Update the weight: $\alpha_1 = 0.5\log\Big(\dfrac{\text{nb. of correct ans.}}{\text{nb. of false ans.}}\Big) = 0.5\log\dfrac{16}{4} = \log 2$
Weight up (×2) for the misclassified examples, weight down (×0.5) for the correctly classified ones.
[Figure: weighted error counts along the candidate cut-offs $b_j$ (e.g. 4.5, 4, 5.5, 6, 7.5, 7, 9, 8.5, 8)]
$b_j = \arg\min_b \Big\{\sum_i w(i)\,I\big(y_i\neq\mathrm{sgn}(x_{ji}-b)\big)\Big\}$
57
Web microarray data
Dataset    p      n    y = +1   y = -1
ALLAML     7129   72   37       35
Colon      2000   62   40       22
Estrogen   7129   49   25       24
http://microarray.princeton.edu/oncology/
http://mgm.duke.edu/genome/dna micro/work/
p >> n
58
10-fold validation
$D = \{(x_i, y_i)\}_{i=1}^{N}$ is split into 10 folds: 1, 2, 3, ..., 10.
In turn, fold k is held out for validation and the other 9 folds are used for training, giving errors $\hat\epsilon(1), \hat\epsilon(2), \hat\epsilon(3), \dots$
Averaging: $\dfrac{1}{10}\sum_{k=1}^{10}\hat\epsilon(k)$
59
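A sketch of the 10-fold validation scheme just described. The `train` and `classify` arguments stand for any learner (for example the AdaBoost sketch earlier); they are placeholders of mine, not functions defined in the lecture.

```python
# 10-fold cross-validation: train on 9 folds, validate on 1, average errors.
import numpy as np

def ten_fold_cv(X, y, train, classify, K=10):
    n = len(y)
    folds = np.array_split(np.random.permutation(n), K)
    errs = []
    for k in range(K):
        val = folds[k]
        trn = np.setdiff1d(np.arange(n), val)
        model = train(X[trn], y[trn])
        errs.append(np.mean(classify(model, X[val]) != y[val]))
    return np.mean(errs)                       # (1/10) * sum_k err(k)
```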
ALLAML
Training error
10-fold CV error
60
Colon
Training error
10-fold CV error
61
Estrogen
Training error
10-fold CV error
62
Gene score
・$F_T(x) = \sum_{t=1}^{T} f_{(t)}(x) = \dfrac{1}{G}\sum_{t=1}^{T}\sum_{g=1}^{G}\alpha_{t,(g)}\,f_t(x_{(g)};b_{(g)})$
Confidence factor for the j-th gene:
$S(j) = \dfrac{1}{T}\sum_{t=1}^{T}\sum_{g=1}^{G} I\big((g)=j\big)\,\alpha_{t,(g)}$
S(j) suggests the degree of association with the class label.
63
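A small sketch of the score $S(j)$: accumulate the coefficients $\alpha_{t,(g)}$ given to gene j over all rounds and group members, divided by T. The `rounds` structure (a list per round of (gene_index, alpha) pairs, as a GroupBoost run might record) is an assumption of mine.

```python
# Gene score S(j): total alpha mass assigned to gene j, averaged over rounds.
import numpy as np

def gene_scores(rounds, p):
    S = np.zeros(p)
    T = len(rounds)
    for group in rounds:                       # one group of selected stumps per round
        for j, alpha in group:
            S[j] += alpha
    return S / T                               # larger S(j): stronger association
```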
Genes associated with disease
Compare our result with the sets of genes suggested by Golub et al. and West et al.
AdaBoost can detect only 15 genes in the case of ALLAML.
GroupAdaBoost detects 30 genes, compatible with their results.
64
Boost Learning (II)
13:30-16:00 November 26 Fri
BridgeBoost
the p >> n problem
LocalBoost
contextual information
Statistical discussion
Probabilistic framework
Bayes Rule, Fisher’s LDF, Logistic regression
Optimal classifier by AdaBoost
65
Problem: p >> n
A fundamental issue in bioinformatics:
p is the dimension of the biomarker
(SNPs, proteome, microarray, …)
n is the number of individuals
(informed consent, institutional protocols, … bioethics)
66
An approach by combining
Let B be a biomarker space and let $I_1,\dots,I_K$ be K experimental facilities with datasets
$D_k = \{(z_i) : i = 1,\dots,n_k\}$, $\quad p \gg n_k\quad(k=1,\dots,K)$
Rapid expansion of genomic data: as K grows larger, $\sum_{k=1}^{K} n_k$ becomes comparable to p.
67
BridgeBoost
CAMDA (Critical Assessment of Microarray Data Analysis)
DDBJ (DNA Data Bank of Japan, NIG)
[Diagram: datasets $D_1, D_2, \dots, D_K$ give separate machines $f(D_1), f(D_2),\dots,f(D_K)$, evaluated as $f(D_1\,|\,D_1),\dots,f(D_K\,|\,D_K)$, and combined into a single result]
68
CAMDA 2003
4 datasets for Lung Cancer
Institution   Publication         Platform
Harvard       PNAS, 2001          Affymetrix
Michigan      Nature Med, 2002    Affymetrix
Stanford      PNAS, 2001          cDNA
Ontario       Cancer Res, 2001    cDNA
http://www.camda.duke.edu/camda03/datasets/
69
Some problems
1. Heterogeneity in feature space
cDNA, Affymetrix
2. Heterogeneous class-labeling
Differences in covariates, medical diagnosis
3. Heterogeneous generalization powers
Uncertainty in microarray experiments
4. Publication bias
A vast number of unpublished studies
70
Machine learning
Learnability: boosting weak learners?
AdaBoost: Freund & Schapire (1997)
Weak classifiers $\{f_1(x),\dots,f_p(x)\}$ → a strong classifier $f(x)$, built stagewise:
$\alpha_1 f_{(1)}(x)\ \to\ \alpha_1 f_{(1)}(x) + \cdots + \alpha_t f_{(t)}(x)$
71
Learning algorithm
[Diagram: data D with weights $w_1(1),\dots,w_1(n)$ gives $f_{(1)}(x)$, $\alpha_1$; then $w_2(1),\dots,w_2(n)$, $f_{(2)}(x)$, $\alpha_2$; up to $w_T(1),\dots,w_T(n)$, $f_{(T)}(x)$, $\alpha_T$.]
Final machine: $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T}\alpha_t f_{(t)}(x)$
72
Different datasets
$D = \bigcup_{k=1}^{K} D_k$, where $D_k = \{(x_i^{(k)}, y_i^{(k)}) : 1\le i\le n_k\}$
$x_i^{(k)}\in\mathbb R^{p}$: expression vector of the same genes; $y_i^{(k)}$: label of the same clinical item
Normalization: $\mathbb R^{p}\ni x_i^{(k)}\ \mapsto\ \tilde x_i^{(k)}\in[0,1]^{p}$,
$\tilde x_{ij}^{(k)} = \dfrac{x_{ij}^{(k)} - \min_{i'} x_{i'j}^{(k)}}{\max_{i'} x_{i'j}^{(k)} - \min_{i'} x_{i'j}^{(k)}}\quad(j=1,\dots,p)$
73
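A sketch of this normalisation step: each dataset $D_k$ is mapped gene-wise into $[0,1]^p$ using its own minima and maxima, so different platforms become comparable before combining (the guard for constant genes is my addition).

```python
# Gene-wise min-max normalisation of one dataset into [0,1]^p.
import numpy as np

def normalise(Xk):
    """Xk: (n_k, p) expression matrix of one dataset."""
    lo = Xk.min(axis=0)
    hi = Xk.max(axis=0)
    return (Xk - lo) / np.where(hi > lo, hi - lo, 1.0)  # guard constant genes
```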
Weighted Errors
The k-th weighted error:
$\epsilon_t^{(k)}(f) = \sum_{i=1}^{n_k}\dfrac{w_t^{(k)}(i)}{\sum_{i'=1}^{n_k} w_t^{(k)}(i')}\,I\big(y_i^{(k)}\neq f(x_i^{(k)})\big)$
The combined weighted error:
$\epsilon_t(f) = \sum_{k=1}^{K}\pi^{(k)}\,\epsilon_t^{(k)}(f)$, where $\pi^{(k)} = \dfrac{\sum_{i=1}^{n_k} w_t^{(k)}(i)}{\sum_{h=1}^{K}\sum_{i=1}^{n_h} w_t^{(h)}(i)}$
74
BridgeBoost
Weights $\{w_t^{(k)}(i) : 1\le i\le n_k\}_{1\le k\le K}$
(a) $f_t^{(k)} = \arg\min_{f}\epsilon_t^{(k)}(f)$
(b) $\alpha_t^{(k)} = \dfrac{1}{2}\log\dfrac{1-\epsilon_t(f_t^{(k)})}{\epsilon_t(f_t^{(k)})}$
(c) $f_{(t)}(x) = \dfrac{1}{K}\sum_{k=1}^{K}\alpha_t^{(k)} f_t^{(k)}(x)$
(d) $w_{t+1}^{(k)}(i) = w_t^{(k)}(i)\exp\big(-\alpha_t^{(k)} f_t^{(k)}(x_i^{(k)})\,y_i^{(k)}\big)$
75
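A minimal sketch of one BridgeBoost round over K datasets. The weak-learner search `best_machine` is a placeholder (for example the stump search shown earlier); the combined weighted error pools the weights of all datasets as on the previous slide. This is an illustration under those assumptions, not the lecture's code.

```python
# One BridgeBoost round: per-dataset weak machine, combined error, combined
# machine, and per-dataset weight update.
import numpy as np

def bridge_round(datasets, weights, best_machine):
    """datasets: list of (X_k, y_k); weights: list of w^(k) arrays;
    best_machine(X, y, w) returns a callable f_k mapping X to {-1,+1}."""
    total = sum(w.sum() for w in weights)
    fs, alphas = [], []
    for k, (Xk, yk) in enumerate(datasets):
        f_k = best_machine(Xk, yk, weights[k])               # (a)
        # combined weighted error of f_k over all datasets
        eps = sum(np.sum(w * (f_k(X) != y))
                  for (X, y), w in zip(datasets, weights)) / total
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        alphas.append(0.5 * np.log((1 - eps) / eps))         # (b)
        fs.append(f_k)
    K = len(datasets)
    f_t = lambda X: sum(a * f(X) for a, f in zip(alphas, fs)) / K   # (c)
    new_w = [w * np.exp(-alphas[k] * fs[k](X) * y)                   # (d)
             for k, ((X, y), w) in enumerate(zip(datasets, weights))]
    return f_t, [w / w.sum() for w in new_w]
```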
Learning flow
[Diagram. Stage t: each dataset $D_k$ ($k=1,\dots,K$) carries weights $\{w_t^{(k)}(i)\}$ and selects $f_t^{(k)}(x)$ with coefficient $\alpha_t^{(k)}$; the combined machine is $f_{(t)} = \dfrac{1}{K}\big(\alpha_t^{(1)} f_t^{(1)} + \cdots + \alpha_t^{(K)} f_t^{(K)}\big)$; the weights are then updated to $\{w_{t+1}^{(k)}(i)\}$ for stage t+1.]
76
Mean exponential loss
Exponential loss: $L_k(F) = \dfrac{1}{n_k}\sum_{i=1}^{n_k}\exp\{-y_i^{(k)} F(x_i^{(k)})\}$
Mean exponential loss: $\bar L(F) = \sum_{k=1}^{K} L_k(F)$
$\alpha_t^{(k)} = \dfrac{1}{2}\log\dfrac{1-\epsilon_t(f_t^{(k)})}{\epsilon_t(f_t^{(k)})} = \arg\min_{\alpha}\bar L(\alpha f_t^{(k)} + F_{t-1})$
$\bar L(F_{t-1} + f_t) \;\le\; \dfrac{1}{K}\sum_{k=1}^{K}\bar L(F_{t-1} + \alpha_t^{(k)} f_t^{(k)}) \;\le\; \bar L(F_{t-1})$
Note: convexity of the exponential loss.
77
Meta-learning
For $h\neq k$, $L_h(\alpha f_t^{(k)} + F_{t-1})$ is cross-validatory: $L_h(\cdot)$ is evaluated on $D_h$, while $f_t^{(k)}$ is built on $D_k$.
$\bar L(\alpha f_t^{(k)} + F_{t-1}) = \underbrace{L_k(\alpha f_t^{(k)} + F_{t-1})}_{\text{separate learning}} + \underbrace{\sum_{h=1,\,h\neq k}^{K} L_h(\alpha f_t^{(k)} + F_{t-1})}_{\text{meta-learning}}$
78
Simulation
3 datasets $\{D_1, D_2, D_3\}$, with $p = 100$, $n_1 = n_2 = n_3 = 50$.
Data 1, data 2 ($D_1$, $D_2$): test error 0 (ideal).
Data 3 ($D_3$): test error 0.5 (ideal).
[Figure: training and test error for the collapsed dataset]
79
Comparison
[Figures: training and test error curves for Separate AdaBoost and for BridgeBoost]
80
Test errors
Collapsed AdaBoost: min = 15%
Separate AdaBoost: min = 43%, 3%, 4% (one per dataset)
BridgeBoost: min = 4%
81
Conclusion
[Diagram: datasets $D_1, D_2,\dots,D_K$. Separate learning gives $f(D_1), f(D_2),\dots,f(D_K)$, evaluated as $f(D_k\,|\,D_k)$; meta-learning combines them into a single result.]
82
Unsolved problems in BridgeBoost
1. Which datasets should be joined or deleted in BridgeBoost?
2. How to predict the class label for a given new x?
3. What to do with the information in the unmatched genes when combining datasets?
4. Heterogeneity is OK, but what about publication bias?
83
Markov random field
Markov random field (MRF)
ICM algorithm (Besag, 1986)
84
Neighbor sites
85
Space AdaBoost
1. Estimate the posterior p(k | x) using
only non-contextual information.
2. Extract the contextual information
based on the estimated posterior p(k | x)
3. Make a hierarchical set of weak machines, and
start the algorithm from the noncontextual result.
$\mathcal F_r = \{\,p_r(k\,|\,i)\,\}$
86
Analysis comparison
[Figure: the true image (2 labels) and classification results for several neighborhood sizes r]
87
LocalBoost
Let S be the set of possible sites; we observe a training dataset over S.
Local exponential loss:
$L_{\rm local}(F, h) = \dfrac{1}{n}\sum_{i=1}^{n} K_h(s, s(i))\exp\{-y_{s(i)} F(x_{s(i)})\}$
88
Algorithm
$s_t$ is uniformly sampled from $\{s(i) : i=1,\dots,n\}$
(a) $\epsilon_t(f, s_t) = \sum_{i=1}^{n} K_h(s(i), s_t)\,I(y_i\neq f(x_i))\,w_t(i)$, $\quad f_{(t)s_t} = \arg\min_{f\in\mathcal F}\epsilon_t(f, s_t)$
(b) $\alpha_t = \dfrac{1}{2}\log\dfrac{1-\epsilon_t(f_{(t)s_t}, s_t)}{\epsilon_t(f_{(t)s_t}, s_t)}$
(c) $w_{t+1}(i) = w_t(i)\exp(-\alpha_t f_{(t)s_t}(x_i)\,y_i)$
Note: the selected machine $f_{(t)s_t}$ carries local information around $s_t$.
89
Locally weighted combination
Prediction rule: $h(x, s) = \mathrm{sgn}(F(x, s))$, where
$F(x, s) = \sum_{t=1}^{T} K_h(s, s_t)\,\alpha_t\,f_{(t)s_t}(x)$
We use this second weighting to combine strongly only those machines $f_{(t)s_t}$ with $s_t$ near $s$.
90
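A sketch of the locally weighted combination above: the score at site s down-weights machines whose sampled site $s_t$ is far from s. The `kernel` function (for example a Gaussian in the site coordinates) and the list `machines` of $(s_t, \alpha_t, f_t)$ triples are assumptions of mine, not structures defined in the lecture.

```python
# Locally weighted combination F(x, s) = sum_t K_h(s, s_t) * alpha_t * f_t(x).
import numpy as np

def local_score(x, s, machines, kernel):
    return sum(kernel(s, s_t) * alpha_t * f_t(x) for s_t, alpha_t, f_t in machines)

def local_predict(x, s, machines, kernel):
    return np.sign(local_score(x, s, machines, kernel))
```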
LocalBoost vs. AdaBoost
91
Statistical discussion
Bayes rule
ROC curve
Neyman-Pearson Lemma
Model-misspecification
92
Error rate
Two types of errors (misclassification probabilities), from the 2×2 table of True Positive, False Positive, False Negative, True Negative:
the False Negative rate and the False Positive rate.
u is a cut-off point.
93
Neyman-Pearson Lemma (1)
Null hypothesis
Alternative
Log-likelihood ratio
Neyman-Pearson lemma
94
Neyman-Pearson Lemma (2)
Bayes rule
95
Loss functions by NP lemma
where
96
A class of loss functions
Exponential
Log loss
97
Bayes rule equivalence
Theorem 1
Equality holds if and only if
98
Error rate
Error rate
is of the class
where H(s) is the Heaviside function,
In general
99
Important examples
Credit scoring
Medical screening
100
Minimum Exponential Loss
Empirical exp loss
Expected exp loss
Theorem.
101
Variational discussion
102
ROC curve
Area Under the Curve
Gini index
103
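A standard way to compute the AUC and the Gini index numerically (this is textbook material, not code from the lecture): the AUC is the probability that a positive case scores above a negative one, and Gini = 2·AUC − 1.

```python
# AUC from the rank statistic over positive/negative score pairs.
import numpy as np

def auc(scores, y):
    pos = scores[y == +1]
    neg = scores[y == -1]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties            # ties counted as 1/2

def gini(scores, y):
    return 2 * auc(scores, y) - 1
```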
ROC analysis
[ROC figure: true positive rate (TP) against false positive rate (FP)]
104
Logistic type discriminant
For a given training dataset
For a given function
the empirical loss is
suggests the decision rule
105
Log loss
Conditional expected likelihood
where
106
Estimation equation
expected
empirical
where
logistic
IRLS (Iteratively reweighted least squares)
glm(formula, family = binomial, data, weights = W, ...)
107
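For reference, a plain NumPy sketch of IRLS, the algorithm behind calls such as glm(..., family = binomial) above; it is an illustration, not the lecture's code. The optional prior weights W mirror the `weights = W` argument.

```python
# IRLS (iteratively reweighted least squares) for logistic regression.
import numpy as np

def irls_logistic(X, y01, W=None, n_iter=25):
    """X: (n, d) design matrix; y01: labels in {0, 1}; W: prior weights."""
    n, d = X.shape
    W = np.ones(n) if W is None else W
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # fitted probabilities
        v = W * p * (1.0 - p)                  # IRLS working weights
        z = X @ beta + (y01 - p) / np.clip(p * (1 - p), 1e-12, None)  # working response
        A = X.T @ (v[:, None] * X)             # weighted least squares step:
        b = X.T @ (v * z)                      # beta = (X^T V X)^{-1} X^T V z
        beta = np.linalg.solve(A, b)
    return beta
```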
Fisher consistency
Theorem 3.
If the distribution of (x, y) is
is asymptotically consistent for
108
Asymptotic efficiency
Asymptotic variance:
$\mathrm{var}_A(\hat\beta) = \dfrac{1}{n}\,J^{-1}(\beta)\,V(\beta)\,J^{-1}(\beta)$
A Cramér-Rao type inequality gives
$\mathrm{var}_A(\hat\beta) \;\ge\; \dfrac{1}{n}\Big[E\{\,p(1-p)\,f(x) f(x)^{T}\,\}\Big]^{-1}$
Equality holds if and only if
or equivalently the logistic regression.
109
Expected loss under parametrics
Expected D-loss: $\mathrm{Risk}(\hat\beta, U) = E\{L_U(\hat\beta)\}$
Under the parametric assumption,
$\mathrm{Risk}(\hat\beta_{U^*}, U) - \mathrm{Risk}(\hat\beta_{\log}, U) = o\Big(\dfrac{1}{n}\Big)\quad(\forall\,U, U^*)$
110
Expected loss under misspecified model
Under a nearly parametric setting,
$\eta(x) = \beta^{T} f(x) + O(n^{-\frac{1}{2}})$
Then,
$\mathrm{Risk}(\hat\beta_{U^*}, U) = D(\beta_{U^*}) + \dfrac{1}{2}\,\mathrm{tr}\{\mathrm{var}_A(\hat\beta_{U^*})\,\mathrm{Hesse}(L_U)\} + o(n^{-1})$
where $\beta_{U^*} = \arg\min_{\beta'} L_{U^*}(\beta')$
111
Sampling scheme
Conditional sampling (cohort study, prospective study)
Mixture sampling
Separate sampling (case-control study, retrospective study)
112
Adams, N.M. and Hand, D.J. (1999). Comparing classifiers when the
misclassification costs are uncertain. Pattern Recognition 32, 1139-1147.
Adams, N.M. and Hand, D.J. (2000). Improving the practice of classifier
performance assessment. Neural Computation 12, 305-311.
Begg, C. B., Satogopan, J. M. and Berwick, M. (1998). A new strategy for evaluating
the impact of epidemiologic risk factors for cancer with applications to melanoma.
J. Amer. Statist. Assoc. 93, 415-426.
Berwick, M, Begg, C. B., Fine, J. A., Roush, G. C. and Barnhill, R. L. (1996).
Screening for cutaneous melanoma by self skin examination. J. National Cancer
Inst., 88, 17-23.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
113
Domingo, C. and Watanabe, O. (2000). MadaBoost: A modification of AdaBoost. In
Proc. of the 13th Conference on Computational Learning Theory.
Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. J. Amer. Statist. Assoc. 70, 892-898.
Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Ann. Statist. 28, 337-407.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems.
Annals of Eugenics, 7, 179-188.
Hand, D. J. and Henley, W. E. (1997). Statistical classification methods in consumer
credit scoring: a review. J. Roy. Statist. Soc., A, 160, 523-541.
Nishii, R. and Eguchi, S. (2004). Supervised image classification by contextual AdaBoost based on posteriors in neighborhoods. To be submitted.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning.
Springer, New York.
114
Lebanon, G. and Lafferty, J. (2001). Boosting and maximum likelihood for
exponential models. NIPS, 14, 2001.
McLachlan, G. J. (1992). Discriminant analysis and statistical pattern recognition.
Wiley, New York.
Pepe, M.S. and Thompson, M.L. (2000). Combining diagnostic test results to
increase accuracy. Biostatistics 1, 123-140.
Rätsch, G., Onoda, T. and Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning 42(3), 287-320.
Schapire, R. (1990). The strength of the weak learnability. Machine Learning 5,
197-227.
Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26, 1651-1686.
Vapnik, V. N. (1999). The Nature of Statistical Learning Theory. Springer: New
York.
115
Eguchi, S. and Copas, J. (2001). Recent developments in discriminant analysis from an information geometric point of view. J. Korean Statist. Soc. 30, 247-264. (Special issue for the 30th anniversary of the Korean Statistical Society.)
Eguchi, S. and Copas, J. (2002). A class of logistic-type discriminant functions. Biometrika 89, 1-22.
Eguchi, S. (2002). U-boosting method for classification and information geometry. Invited talk at the International Statistical Workshop, Statistical Research Center for Complex Systems, Seoul National University.
Kanamori, T., Takenouchi, T., Eguchi, S. and Murata, N. (2004). The most robust loss function for boosting. Lecture Notes in Computer Science 3316, 496-501, Springer.
Murata, N., Takenouchi, T., Kanamori, T. and Eguchi, S. (2004). Information geometry of U-Boost and Bregman divergence. Neural Computation 16, 1437-1481.
116
Takenouchi, T. and Eguchi, S. (2004). Robustifying AdaBoost by adding the naive error rate. Neural Computation 16, 767-787.
Murata, N., Takenouchi, T., Kanamori, T. and Eguchi, S. Geometry of U-Boost algorithms. Hayashibara Forum "Chance and Necessity: Mathematics, Information and Economics", session "Physics of Information" (in Japanese).
Eguchi, S. (2004). Information geometry and statistical pattern recognition. Sugaku expository article (in Japanese).
Eguchi, S. (2004). Information geometry of statistical pattern recognition: the U-Boost learning algorithm. Suri Kagaku (Mathematical Sciences) No. 489, 53-59 (in Japanese).
Eguchi, S. (2002). On methods of statistical discrimination: from logistic discrimination to AdaBoost. Invited lecture, the 24th Symposium of the Japanese Society of Applied Statistics, "New Developments in Multivariate Analysis" (in Japanese).
117
Future paradigm
The concept of learning will continue to offer highly productive ideas for computational algorithms.
Is it an imitation of the biological brain?
Meta learning?
Human life?
And Statistics?
118