
AdaBoost

Reference: Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
“Two heads are better than one.”
(Japanese equivalent: 三人寄れば文殊の知恵, “three people together have the wisdom of Manjushri.”)
Boosting is a technique for constructing a strong classifier by combining a collection of weak classifiers.
Training dataset

Each record is a pair $(x_i, y_i)$: $x_i$ is the vector of attribute values (T1, T2, T3, T4) and $y_i$ is the objective attribute. The dataset is $(x_1, y_1), \ldots, (x_N, y_N)$, here with $N = 16$:

 #   T1  T2  T3  T4 | Ob (objective attribute)
 1    1   0   1   1 |  1
 2    1   0   1   1 |  1
 3    1   1   1   1 |  1
 4    1   1   1   0 |  0
 5    1   0   1   0 |  0
 6    1   1   0   1 |  0
 7    1   0   0   1 |  0
 8    1   1   0   1 |  0
 9    0   1   0   1 |  1
10    0   0   1   1 |  1
11    0   1   0   1 |  1
12    0   1   0   1 |  1
13    0   0   0   1 |  1
14    0   0   1   0 |  0
15    0   1   0   0 |  0
16    0   0   1   0 |  0
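For later reference, the table can be encoded as arrays (a minimal sketch; the names X and y are ours, not from the slides):

```python
import numpy as np

# The toy table above as arrays: row i of X is the attribute vector x_i
# (T1..T4); y[i] is the objective attribute y_i.
X = np.array([
    [1, 0, 1, 1], [1, 0, 1, 1], [1, 1, 1, 1], [1, 1, 1, 0],
    [1, 0, 1, 0], [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 1],
    [0, 1, 0, 1], [0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 0, 1],
    [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0],
])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0])
N = len(y)  # N = 16 records
```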
Basic Idea of AdaBoost
• Initially assign an equal weight to each record.
• Iterate the following two steps (see the sketch after this list):
  1. Generate a hypothesis whose error ratio is smaller than that of random guessing.
  2. Increase the weights of the mispredicted records relative to the others.
We call a classifier “a hypothesis,” following the terminology of the cited paper.
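As a minimal sketch of this loop (our own illustration: weak_learn is a placeholder, and the reweighting factor 2 merely stands in for the exact rule given later under “Details of AdaBoost”):

```python
def boosting_idea(X, y, weak_learn, T):
    """Shape of the loop above, not the exact AdaBoost update."""
    N = len(y)
    w = [1.0 / N] * N                        # equal initial weights
    hypotheses = []
    for _ in range(T):
        h = weak_learn(X, y, w)              # error under w beats 1/2
        hypotheses.append(h)
        w = [wi * (2.0 if h(xi) != yi else 1.0)  # up-weight the misses
             for wi, xi, yi in zip(w, X, y)]     # (factor 2 illustrative)
        total = sum(w)
        w = [wi / total for wi in w]         # renormalize
    return hypotheses
```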
hypothesis

An eight-record example table:

 #   T1  T2  T3  T4 | Ob
 1    1   0   1   1 |  1
 2    1   0   1   1 |  1
 3    1   0   1   1 |  1
 4    1   1   1   0 |  0
 5    1   1   1   0 |  0
 6    1   0   0   1 |  0
 7    1   1   0   1 |  0
 8    1   0   0   1 |  0

First hypothesis:  if T1=1 then Ob=0 else Ob=1
Its predictions: 0, 0, 0, 0, 0, 0, 0, 0 — records 1–3 are mispredicted.
(In the slide, the size of the dot next to each record represents the degree of its weight.)
New weights: the mispredicted records 1–3 receive relatively higher weight.
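To make this concrete, the first hypothesis's error under the uniform distribution can be checked directly (a sketch; the arrays transcribe the table above):

```python
import numpy as np

X8 = np.array([
    [1, 0, 1, 1], [1, 0, 1, 1], [1, 0, 1, 1], [1, 1, 1, 0],
    [1, 1, 1, 0], [1, 0, 0, 1], [1, 1, 0, 1], [1, 0, 0, 1],
])
y8 = np.array([1, 1, 1, 0, 0, 0, 0, 0])

def h1(x):                        # if T1=1 then Ob=0 else Ob=1
    return 0 if x[0] == 1 else 1

p = np.full(8, 1 / 8)             # uniform initial distribution
pred = np.array([h1(x) for x in X8])
print(p[pred != y8].sum())        # 0.375: records 1-3 are mispredicted
```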
another hypothesis

(Same eight-record table; records 1–3 now carry higher weight.)

Second hypothesis:  if T3=1 then Ob=1 else Ob=0
Its predictions: 1, 1, 1, 1, 1, 0, 0, 0 — records 4 and 5 are mispredicted.
New weights: records 4 and 5 receive relatively higher weight.
another hypothesis

(Same table; records 4 and 5 now carry higher weight.)

Third hypothesis:  if T4=1 then Ob=1 else Ob=0
Its predictions: 1, 1, 1, 0, 0, 1, 1, 1 — records 6–8 are mispredicted.
New weights: records 6–8 receive relatively higher weight.
Hypotheses

h1: if T1=1 then Ob=0 else Ob=1
h2: if T3=1 then Ob=1 else Ob=0
h3: if T4=1 then Ob=1 else Ob=0

Simple majority voting among h1, h2, h3 classifies every record correctly:

 #   T1  T2  T3  T4 | Ob | h1 h2 h3 | Vote
 1    1   0   1   1 |  1 |  0  1  1 |  1
 2    1   0   1   1 |  1 |  0  1  1 |  1
 3    1   0   1   1 |  1 |  0  1  1 |  1
 4    1   1   1   0 |  0 |  0  1  0 |  0
 5    1   1   1   0 |  0 |  0  1  0 |  0
 6    1   0   0   1 |  0 |  0  0  1 |  0
 7    1   1   0   1 |  0 |  0  0  1 |  0
 8    1   0   0   1 |  0 |  0  0  1 |  0

AdaBoost performs weighted majority voting by hypotheses.
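The vote column can be reproduced in a few lines (a sketch over the same eight records):

```python
import numpy as np

X8 = np.array([
    [1, 0, 1, 1], [1, 0, 1, 1], [1, 0, 1, 1], [1, 1, 1, 0],
    [1, 1, 1, 0], [1, 0, 0, 1], [1, 1, 0, 1], [1, 0, 0, 1],
])
y8 = np.array([1, 1, 1, 0, 0, 0, 0, 0])

h1 = lambda x: 0 if x[0] == 1 else 1   # if T1=1 then Ob=0 else Ob=1
h2 = lambda x: 1 if x[2] == 1 else 0   # if T3=1 then Ob=1 else Ob=0
h3 = lambda x: 1 if x[3] == 1 else 0   # if T4=1 then Ob=1 else Ob=0

votes = np.array([[h(x) for h in (h1, h2, h3)] for x in X8])
vote = (votes.sum(axis=1) >= 2).astype(int)    # simple majority
print((vote == y8).all())   # True: every record is classified correctly
```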
Details of AdaBoost

Input:
• Training dataset $(x_1, y_1), \ldots, (x_N, y_N)$
• Initial weights $w_i^1 = D(i) = \frac{1}{N}$ for each $i = 1, 2, \ldots, N$
• A weak learner WeakLearn that always outputs a hypothesis whose error ratio is less than $\frac{1}{2}$
• $T$: the number of iterations, i.e., the number of weak hypotheses to generate

For each $t = 1, \ldots, T$, iterate:

1: Compute the distribution $p_i^t$ of each record by normalizing the weights:
$$p_i^t = \frac{w_i^t}{\sum_{j=1}^{N} w_j^t}$$

2: Call WeakLearn to generate a weak hypothesis $h_t$ such that
$$\varepsilon_t = \sum_{i=1}^{N} p_i^t \,\bigl|h_t(x_i) - y_i\bigr| < \frac{1}{2}, \qquad \bigl|h_t(x_i) - y_i\bigr| = \begin{cases} 0 & \text{if } h_t \text{ is correct on } x_i \\ 1 & \text{otherwise} \end{cases}$$

3: Revise the weights:
$$\beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t}, \qquad w_i^{t+1} = w_i^t \, \beta_t^{\,1 - |h_t(x_i) - y_i|}$$
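A runnable sketch of these three steps, using exhaustive decision stumps as a stand-in for WeakLearn (the paper leaves WeakLearn abstract; the sketch assumes each round achieves $0 < \varepsilon_t < \frac{1}{2}$):

```python
import numpy as np

def weak_learn(X, y, p):
    """Stand-in WeakLearn: choose the stump 'if attribute j = 1 then c
    else 1 - c' with the smallest error under the distribution p."""
    best_h, best_eps = None, None
    for j in range(X.shape[1]):
        for c in (0, 1):
            pred = np.where(X[:, j] == 1, c, 1 - c)
            eps = p[pred != y].sum()
            if best_eps is None or eps < best_eps:
                best_h = lambda x, j=j, c=c: c if x[j] == 1 else 1 - c
                best_eps = eps
    return best_h, best_eps

def adaboost_fit(X, y, T):
    N = len(y)
    w = np.full(N, 1.0 / N)               # w_i^1 = D(i) = 1/N
    hypotheses, betas = [], []
    for t in range(T):
        p = w / w.sum()                   # step 1: p_i^t
        h, eps = weak_learn(X, y, p)      # step 2: requires eps_t < 1/2
        beta = eps / (1.0 - eps)          # step 3: beta_t
        correct = (np.array([h(x) for x in X]) == y).astype(float)
        w = w * beta ** correct           # shrink weights of correct records
        hypotheses.append(h)
        betas.append(beta)
    return hypotheses, betas
```

Multiplying the correct records by $\beta_t < 1$ is equivalent, after the normalization in step 1, to raising the relative weight of the mispredicted ones.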
Revision of weights

Unwinding the update rule over all $T$ rounds:
$$w_i^{T+1} = w_i^T \, \beta_T^{\,1-|h_T(x_i)-y_i|} = \cdots = D(i) \prod_{t=1}^{T} \beta_t^{\,1-|h_t(x_i)-y_i|}$$
Here
$$\beta_t^{\,1-|h_t(x_i)-y_i|} = \begin{cases} \beta_t & \text{if } h_t \text{ is correct on } x_i \\ 1 & \text{otherwise} \end{cases}$$
and $\beta_t < 1$ because $\varepsilon_t < \frac{1}{2}$. Hence the weight of each record never increases:
$$w_i^1 \ge w_i^2 \ge \cdots \ge w_i^{T+1}$$
Output: the final hypothesis $h_f$ (weighted majority voting among the $h_t$):
$$h_f(x) = \begin{cases} 1 & \text{if } \displaystyle\sum_{t=1}^{T} \left(\log\frac{1}{\beta_t}\right) h_t(x) \;\ge\; \frac{1}{2} \sum_{t=1}^{T} \log\frac{1}{\beta_t} \qquad (*) \\ 0 & \text{otherwise} \end{cases}$$
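Continuing the adaboost_fit sketch above, the weighted vote of $(*)$ reads:

```python
import numpy as np

def adaboost_predict(hypotheses, betas, x):
    """h_f(x) per (*): vote weight log(1/beta_t) for each h_t
    (assumes 0 < beta_t < 1, i.e. 0 < eps_t < 1/2)."""
    alphas = [np.log(1.0 / b) for b in betas]
    score = sum(a * h(x) for a, h in zip(alphas, hypotheses))
    return 1 if score >= 0.5 * sum(alphas) else 0
```

For example: hypotheses, betas = adaboost_fit(X, y, T=3), then adaboost_predict(hypotheses, betas, X[0]).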
Lemma 1

If $h_f(x_i) \ne y_i$, then
$$\prod_{t=1}^{T} \beta_t^{\,1-|h_t(x_i)-y_i|} \;\ge\; \left(\prod_{t=1}^{T} \beta_t\right)^{\!1/2}$$

Proof.

Case $h_f(x_i) = 1$ and $y_i = 0$: inequality $(*)$ holds, i.e. $-\sum_{t=1}^{T} (\log\beta_t)\, h_t(x_i) \ge -\frac{1}{2}\sum_{t=1}^{T} \log\beta_t$. Adding $\sum_{t=1}^{T} \log\beta_t$ to both sides of $(*)$ gives
$$\sum_{t=1}^{T} (\log\beta_t)\,(1 - h_t(x_i)) \;\ge\; \frac{1}{2} \sum_{t=1}^{T} \log\beta_t$$
Observe that $1 - h_t(x_i) = 1 - |h_t(x_i) - y_i|$ since $y_i = 0$; exponentiating both sides yields the claim.

Case $h_f(x_i) = 0$ and $y_i = 1$: inequality $(*)$ fails, so
$$\sum_{t=1}^{T} \left(\log\frac{1}{\beta_t}\right) h_t(x_i) \;<\; \frac{1}{2} \sum_{t=1}^{T} \log\frac{1}{\beta_t}, \quad\text{i.e.}\quad \sum_{t=1}^{T} (\log\beta_t)\, h_t(x_i) \;>\; \frac{1}{2} \sum_{t=1}^{T} \log\beta_t$$
Observe that $h_t(x_i) = 1 - |h_t(x_i) - y_i|$ since $y_i = 1$; exponentiating both sides yields the claim.
Definition
$$\varepsilon = \sum_{\{i \,\mid\, h_f(x_i) \ne y_i\}} D(i)$$
— the error of $h_f$ with respect to the initial distribution $D$.
T 1
βt 
i1 wi  ε 
i 1

N
Lemma2
T

1 ht ( x j )  y j
t
wiT 1  D(i)  β
j 1
T 1
w
i 1 i 
N
The error of hf for the initial distribution.
T
1
2
 T

 D(i)   βt 
 j 1 
T 1
w
 i

{i | h f ( x i )  y i }






D (i )   βt 

 {i | h ( x )  y }   i 1 
 f i i 
T
1
2
1
2
By Lemma1.
Theorem
$$\varepsilon \;\le\; 2^T \prod_{t=1}^{T} \sqrt{\varepsilon_t (1 - \varepsilon_t)}$$

Proof. For each round $t$,
$$\sum_{i=1}^{N} w_i^{t+1} = \sum_{i=1}^{N} w_i^t \, \beta_t^{\,1-|h_t(x_i)-y_i|} \;\le\; \sum_{i=1}^{N} w_i^t \left(1 - (1-\beta_t)\bigl(1 - |h_t(x_i)-y_i|\bigr)\right)$$
using $\alpha^\gamma \le 1 - (1-\alpha)\gamma$ for $\alpha \ge 0$ and $\gamma \in [0,1]$, here with $\alpha = \beta_t$ and $\gamma = 1 - |h_t(x_i)-y_i|$. The right-hand side equals
$$\sum_{i=1}^{N} w_i^t \;-\; (1-\beta_t) \sum_{i=1}^{N} w_i^t \,\bigl(1 - |h_t(x_i)-y_i|\bigr)$$
Since
$$1 - \varepsilon_t = \sum_{i=1}^{N} p_i^t - \sum_{i=1}^{N} p_i^t\,|h_t(x_i)-y_i| = \sum_{i=1}^{N} p_i^t\,\bigl(1 - |h_t(x_i)-y_i|\bigr) = \frac{\sum_{i=1}^{N} w_i^t\,\bigl(1-|h_t(x_i)-y_i|\bigr)}{\sum_{i=1}^{N} w_i^t}$$
it follows that
$$\sum_{i=1}^{N} w_i^{t+1} \;\le\; \left(\sum_{i=1}^{N} w_i^t\right)\left(1 - (1-\beta_t)(1-\varepsilon_t)\right) \;=\; \left(\sum_{i=1}^{N} w_i^t\right) \cdot 2\varepsilon_t$$
because
$$1 - \beta_t = 1 - \frac{\varepsilon_t}{1-\varepsilon_t} = \frac{1 - 2\varepsilon_t}{1 - \varepsilon_t}, \qquad\text{so}\qquad (1-\beta_t)(1-\varepsilon_t) = 1 - 2\varepsilon_t$$
Applying this bound for $t = 1, \ldots, T$, with $\sum_{i=1}^{N} w_i^1 = 1$:
$$\sum_{i=1}^{N} w_i^{T+1} \;\le\; \prod_{t=1}^{T} 2\varepsilon_t$$
Combining with Lemma 2:
$$\varepsilon \left(\prod_{t=1}^{T} \beta_t\right)^{\!1/2} \;\le\; \prod_{t=1}^{T} 2\varepsilon_t \quad\Longrightarrow\quad \varepsilon \;\le\; \prod_{t=1}^{T} \frac{2\varepsilon_t}{\sqrt{\beta_t}} \;=\; 2^T \prod_{t=1}^{T} \sqrt{\varepsilon_t (1-\varepsilon_t)}$$
because $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$. Q.E.D.
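For instance, if every weak hypothesis satisfies $\varepsilon_t \le \frac{1}{4}$ (a value chosen purely for illustration), the per-round factor is at most $2\sqrt{\frac{1}{4}\cdot\frac{3}{4}} = \frac{\sqrt{3}}{2}$, so
$$\varepsilon \;\le\; \left(\frac{\sqrt{3}}{2}\right)^{T} \approx 0.866^{T},$$
and the training error of $h_f$ decreases exponentially in the number of rounds $T$.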