AdaBoost

Reference: Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

"Two heads are better than one." (Japanese proverb 三人寄れば文殊の知恵: "Three people together have the wisdom of Manjushri.")

Boosting is a technique for constructing a strong classifier by combining a collection of weak classifiers.

Training dataset

The running example is a set of sixteen records $(x_1, y_1), \ldots, (x_N, y_N)$, each with four binary attributes T1-T4 and a binary Objective Attribute (目標属性) $y_i$:

         T1  T2  T3  T4 | Objective Attribute
  x1  :   1   0   1   1 |  1
  x2  :   1   0   1   1 |  1
  x3  :   1   1   1   1 |  1
  x4  :   1   1   1   0 |  0
  x5  :   1   0   1   0 |  0
  x6  :   1   1   0   1 |  0
  x7  :   1   0   0   1 |  0
  x8  :   1   1   0   1 |  0
  x9  :   0   1   0   1 |  1
  x10 :   0   0   1   1 |  1
  x11 :   0   1   0   1 |  1
  x12 :   0   1   0   1 |  1
  x13 :   0   0   0   1 |  1
  x14 :   0   0   1   0 |  0
  x15 :   0   1   0   0 |  0
  x16 :   0   0   1   0 |  0

Basic Idea of AdaBoost
• Initially assign an equal weight to each record.
• Iterate the following steps:
  1. Generate a hypothesis whose error ratio is smaller than that of random guessing.
  2. Increase the weights of mis-predicted records relative to the others.
We call a classifier "a hypothesis", following the terminology of the cited paper.

Example: weak hypotheses and weight updates
[Figures: the training records are shown with their weights drawn as bars; the size of a bar represents the degree of the weight. After each hypothesis is generated, the records it mis-predicts receive relatively larger new weights.]
• Hypothesis 1: if T1 = 1 then Ob = 0 else Ob = 1
• Hypothesis 2: if T3 = 1 then Ob = 1 else Ob = 0
• Hypothesis 3: if T4 = 1 then Ob = 1 else Ob = 0
Combining the three hypotheses by simple majority voting classifies the training records; AdaBoost itself performs weighted majority voting by hypotheses.

Details of AdaBoost

Input:
• Training dataset $(x_1, y_1), \ldots, (x_N, y_N)$ with $y_i \in \{0, 1\}$.
• Initial weights $w_i^1 = D(i) = 1/N$ for each $i = 1, 2, \ldots, N$.
• A weak learner WeakLearn that always outputs a hypothesis whose error ratio is less than 1/2.
• $T$: the number of iterations, i.e. the number of weak hypotheses to generate.

For each $t = 1, \ldots, T$, iterate:

1. Compute the distribution $p_i^t$ of each record by normalizing the weights:
$$p_i^t = \frac{w_i^t}{\sum_{i=1}^{N} w_i^t}.$$

2. Call WeakLearn to generate a weak hypothesis $h_t$ whose error satisfies
$$\varepsilon_t = \sum_{i=1}^{N} p_i^t\,|h_t(x_i) - y_i| < \frac{1}{2}, \qquad |h_t(x_i) - y_i| = \begin{cases} 0 & \text{if } h_t \text{ is correct on } x_i \\ 1 & \text{otherwise.} \end{cases}$$

3. Revise the weights using $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$:
$$w_i^{t+1} = w_i^t\,\beta_t^{\,1 - |h_t(x_i) - y_i|}.$$
Since $\varepsilon_t < \frac{1}{2}$ implies $\beta_t < 1$, the factor $\beta_t^{\,1 - |h_t(x_i) - y_i|}$ equals $\beta_t$ (a decrease) if $h_t$ is correct on $x_i$ and $1$ otherwise, so mis-predicted records become relatively heavier. Unrolling the revision of weights from $w_i^1$ through $w_i^{T+1}$ gives
$$w_i^{T+1} = D(i)\prod_{t=1}^{T}\beta_t^{\,1 - |h_t(x_i) - y_i|}.$$

Output: the final hypothesis $h_f$ (weighted majority voting among the $h_t$):
$$h_f(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T}\left(\log\frac{1}{\beta_t}\right)h_t(x) \;\ge\; \frac{1}{2}\sum_{t=1}^{T}\log\frac{1}{\beta_t} \qquad (*) \\ 0 & \text{otherwise.} \end{cases}$$

Lemma 1. If $h_f(x_i) \ne y_i$, then
$$\prod_{t=1}^{T}\beta_t^{\,1 - |h_t(x_i) - y_i|} \;\ge\; \left(\prod_{t=1}^{T}\beta_t\right)^{1/2}.$$

Proof. If $h_f(x_i) = 1$ and $y_i = 0$: condition (*) holds; adding $\sum_{t=1}^{T}\log\beta_t$ to both sides of (*) gives
$$\sum_{t=1}^{T}(\log\beta_t)\bigl(1 - h_t(x_i)\bigr) \;\ge\; \frac{1}{2}\sum_{t=1}^{T}\log\beta_t.$$
Observe that $1 - h_t(x_i) = 1 - |h_t(x_i) - y_i|$ because $y_i = 0$; exponentiating gives the claim.
If $h_f(x_i) = 0$ and $y_i = 1$: condition (*) fails, so
$$\sum_{t=1}^{T}\left(\log\frac{1}{\beta_t}\right)h_t(x_i) \;<\; \frac{1}{2}\sum_{t=1}^{T}\log\frac{1}{\beta_t},$$
equivalently $\sum_{t=1}^{T}(\log\beta_t)\,h_t(x_i) > \frac{1}{2}\sum_{t=1}^{T}\log\beta_t$. Observe that $h_t(x_i) = 1 - |h_t(x_i) - y_i|$ because $y_i = 1$; exponentiating gives the claim.

Lemma 2. Define $\varepsilon = \sum_{i:\,h_f(x_i)\ne y_i} D(i)$, the error of $h_f$ with respect to the initial distribution $D$. Then
$$\sum_{i=1}^{N} w_i^{T+1} \;\ge\; \varepsilon\left(\prod_{t=1}^{T}\beta_t\right)^{1/2}.$$

Proof.
$$\sum_{i=1}^{N} w_i^{T+1} \;\ge\; \sum_{i:\,h_f(x_i)\ne y_i} w_i^{T+1} \;=\; \sum_{i:\,h_f(x_i)\ne y_i} D(i)\prod_{t=1}^{T}\beta_t^{\,1-|h_t(x_i)-y_i|} \;\ge\; \sum_{i:\,h_f(x_i)\ne y_i} D(i)\left(\prod_{t=1}^{T}\beta_t\right)^{1/2} \;=\; \varepsilon\left(\prod_{t=1}^{T}\beta_t\right)^{1/2},$$
where the last inequality is by Lemma 1.
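As a concrete companion to the procedure above, here is a minimal Python sketch, assuming binary labels in {0, 1} and single-attribute decision stumps playing the role of WeakLearn; the names (stump_learner, adaboost_train, adaboost_predict) are illustrative and not from the cited paper.

```python
import numpy as np

def stump_learner(X, y, p):
    """Assumed weak learner: pick the rule 'predict 1 iff X[:, j] == v'
    with the smallest weighted error under the distribution p."""
    best = None
    for j in range(X.shape[1]):
        for v in (0, 1):
            pred = (X[:, j] == v).astype(int)
            err = float(np.sum(p * np.abs(pred - y)))
            if best is None or err < best[0]:
                best = (err, j, v)
    err, j, v = best
    return (lambda Xq, j=j, v=v: (Xq[:, j] == v).astype(int)), err

def adaboost_train(X, y, T):
    y = np.asarray(y)
    N = len(y)
    w = np.full(N, 1.0 / N)                 # w_i^1 = D(i) = 1/N
    hyps, betas = [], []
    for _ in range(T):
        p = w / w.sum()                     # step 1: distribution p^t
        h, eps = stump_learner(X, y, p)     # step 2: weak hypothesis, error eps_t
        if eps >= 0.5:                      # WeakLearn must beat random guessing
            break
        eps = max(eps, 1e-12)               # guard against a perfect stump
        beta = eps / (1.0 - eps)            # step 3: beta_t = eps_t / (1 - eps_t)
        w = w * beta ** (1.0 - np.abs(h(X) - y))   # shrink correctly predicted records
        hyps.append(h)
        betas.append(beta)
    return hyps, betas

def adaboost_predict(hyps, betas, X):
    """Final hypothesis h_f: weighted majority vote with weights log(1/beta_t)."""
    alpha = np.log(1.0 / np.array(betas))
    votes = sum(a * h(X) for a, h in zip(alpha, hyps))
    return (votes >= 0.5 * alpha.sum()).astype(int)
```

Running adaboost_train on the sixteen-record table above with a small T would typically recover single-attribute stumps similar to the three hypotheses shown earlier, and adaboost_predict combines them by weighted majority voting.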
Theorem.
$$\varepsilon \;\le\; 2^{T}\prod_{t=1}^{T}\sqrt{\varepsilon_t(1-\varepsilon_t)}.$$

Proof. By Lemma 2,
$$\varepsilon \;\le\; \left(\prod_{t=1}^{T}\beta_t\right)^{-1/2}\sum_{i=1}^{N} w_i^{T+1}.$$

To bound the total weight after each round, use the inequality $\alpha^{\gamma} \le 1 - (1-\alpha)\gamma$ for $\alpha \ge 0$ and $\gamma \in [0, 1]$:
$$\sum_{i=1}^{N} w_i^{t+1} = \sum_{i=1}^{N} w_i^{t}\,\beta_t^{\,1-|h_t(x_i)-y_i|} \;\le\; \sum_{i=1}^{N} w_i^{t}\bigl(1 - (1-\beta_t)(1-|h_t(x_i)-y_i|)\bigr) = \sum_{i=1}^{N} w_i^{t} \;-\; (1-\beta_t)\sum_{i=1}^{N} w_i^{t}\bigl(1-|h_t(x_i)-y_i|\bigr).$$

Since
$$1 - \varepsilon_t = \sum_{i=1}^{N} p_i^{t} - \sum_{i=1}^{N} p_i^{t}\,|h_t(x_i)-y_i| = \sum_{i=1}^{N} p_i^{t}\bigl(1-|h_t(x_i)-y_i|\bigr) = \frac{\sum_{i=1}^{N} w_i^{t}\bigl(1-|h_t(x_i)-y_i|\bigr)}{\sum_{i=1}^{N} w_i^{t}},$$
it follows that
$$\sum_{i=1}^{N} w_i^{t+1} \;\le\; \left(\sum_{i=1}^{N} w_i^{t}\right)\bigl(1 - (1-\beta_t)(1-\varepsilon_t)\bigr) = \left(\sum_{i=1}^{N} w_i^{t}\right)2\varepsilon_t,$$
because $\beta_t = \dfrac{\varepsilon_t}{1-\varepsilon_t}$ gives $1-\beta_t = \dfrac{1-2\varepsilon_t}{1-\varepsilon_t}$ and hence $1-(1-\beta_t)(1-\varepsilon_t) = 2\varepsilon_t$.

Applying this bound for $t = 1, \ldots, T$ and using $\sum_{i=1}^{N} w_i^{1} = 1$,
$$\sum_{i=1}^{N} w_i^{T+1} \;\le\; \prod_{t=1}^{T} 2\varepsilon_t.$$

Therefore, by Lemma 2,
$$\varepsilon \;\le\; \left(\prod_{t=1}^{T}\beta_t\right)^{-1/2}\prod_{t=1}^{T} 2\varepsilon_t = \prod_{t=1}^{T}\frac{2\varepsilon_t}{\sqrt{\beta_t}} = 2^{T}\prod_{t=1}^{T}\sqrt{\varepsilon_t(1-\varepsilon_t)},$$
because $\beta_t = \varepsilon_t/(1-\varepsilon_t)$. Q.E.D.
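As an illustrative sanity check on this bound (the numbers here are assumed, not taken from the slides): if every weak hypothesis achieved $\varepsilon_t = 0.1$, the theorem gives
$$\varepsilon \;\le\; 2^{T}\prod_{t=1}^{T}\sqrt{\varepsilon_t(1-\varepsilon_t)} \;=\; \bigl(2\sqrt{0.1\cdot 0.9}\bigr)^{T} \;=\; 0.6^{T},$$
so after $T = 10$ rounds the error of $h_f$ under the initial distribution is at most $0.6^{10} \approx 0.006$. Since $2\sqrt{\varepsilon_t(1-\varepsilon_t)} < 1$ whenever $\varepsilon_t < \frac{1}{2}$, the bound decreases exponentially in the number of weak hypotheses.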