Linear Programming Boosting by Column and Row Generation
Kohei Hatano and Eiji Takimoto
Kyushu University, Japan
DS 2009

1. Introduction
2. Preliminaries
3. Our algorithm
4. Experiment
5. Summary

1. Introduction

Example
Given a web page, predict whether its topic is "DS 2009".
Hypothesis set = words: each word hypothesis answers +1 if the word appears (y) and -1 otherwise (n).
Weighted majority vote: 0.5*(DS 2009?) + 0.3*(ALT?) + 0.2*(Porto?)

Modern Machine Learning Approach
Find a weighted majority of hypotheses (i.e., a hyperplane) which enlarges the margin.

1-norm soft margin optimization (aka 1-norm SVM)
A popular formulation, like 2-norm soft margin optimization: "find a hyperplane which separates positive and negative instances well".

$\max_{\alpha,\rho,\xi}\;\; \rho - \frac{1}{\nu}\sum_{i=1}^{m}\xi_i$   (the second term is the loss; $\nu$ is a constant with $1 \le \nu \le m$)
sub. to $\; y_i \sum_{j=1}^{n}\alpha_j h_j(x_i) \ge \rho - \xi_i,\;\; \xi_i \ge 0 \quad (i=1,\dots,m,\; y_i\in\{+1,-1\})$,
$\sum_{j=1}^{n}\alpha_j = 1,\;\; \alpha_j \ge 0$   (normalized with the 1-norm)

Note
・Linear Program
・good generalization guarantee [Schapire et al., 98]

1-norm soft margin optimization (2)
2-norm soft margin opt.: non-sparse hyperplane, e.g.
0.2*(DS 2009?) + 0.1*(ALT?) + 0.1*(Porto?) + 0.1*(wine?) + 0.05*(tasting?) + 0.05*(discovery?) + 0.03*(science?) + 0.02*(...) + ...
1-norm soft margin opt.: sparse hyperplane, e.g.
0.5*(DS 2009?) + 0.3*(ALT?) + 0.2*(Porto?)
Advantage of 1-norm soft margin opt.: the solution is likely to be sparse ⇒ useful for feature selection.

Recent Results
2-norm soft margin optimization
・Quadratic Programming
・SMO [Platt, 1999]
・SVMlight [Joachims, 1999]
・SVMPerf [Joachims, 2006]
・Pegasos [Shalev-Shwartz et al., 2007]
⇒ state-of-the-art solvers exist.
1-norm soft margin optimization
・Linear Programming
・LPBoost [Demiriz et al., 2003]
・Entropy Regularized LPBoost [Warmuth et al., 2008]
・others [Mangasarian, 2006][Sra, 2006]
⇒ not efficient enough for large data; room for improvements.

Our result
A new algorithm for 1-norm soft margin optimization.

2. Preliminaries

Boosting
Classification task: frog "+1", others "-1". Hypotheses: color and size.
[Figure: toy data plotted by size and color; frog instances labeled +1, others -1.]
1. $d_1$: uniform distribution.
2. For t = 1, ..., T:
   (i) choose the hypothesis $h_t$ maximizing the edge w.r.t. $d_t$;
   (ii) update the distribution $d_t$ to $d_{t+1}$ (more weight on misclassified, i.e. "difficult", instances).
3. Output a weighting of the chosen hypotheses, e.g. $0.4\,h_1 + 0.6\,h_2$.

Edge of a hypothesis h w.r.t. a distribution d:
$\mathrm{edge}_d(h) = \sum_{i=1}^{m} d_i\, y_i\, h(x_i) \in [-1, +1]$
(since $y_i h(x_i) > 0$ iff h is correct on $x_i$, the edge measures the accuracy of h).

Boosting & 1-norm soft margin optimization
Primal:
$\max_{\alpha,\rho,\xi}\;\; \rho - \frac{1}{\nu}\sum_{i=1}^{m}\xi_i \quad (\nu \ge 1)$
sub. to $\; y_i \sum_{j=1}^{n}\alpha_j h_j(x_i) \ge \rho - \xi_i \quad (i=1,\dots,m,\; y_i\in\{+1,-1\})$, $\;\sum_{j=1}^{n}\alpha_j = 1,\; \alpha_j \ge 0$
"find the large-margin hyperplane which separates positive and negative instances as much as possible"
Equivalent, by LP duality, to the Dual:
$\min_{\gamma, d}\;\; \gamma$
sub. to $\; \mathrm{edge}_d(h_j) \le \gamma \;\;(j=1,\dots,n),\quad 0 \le d_i \le \frac{1}{\nu},\quad \sum_{i=1}^{m} d_i = 1$
"find the distribution which minimizes the maximum edge of the hypotheses" (i.e., the most "difficult" distribution w.r.t. the hypotheses) ≈ solvable using Boosting.

LPBoost [Demiriz et al., 2003]
Update: solve the dual problem restricted to the hypotheses chosen so far $\{h_1,\dots,h_t\}$:
$(\gamma_{t+1}, d_{t+1}) = \arg\min_{\gamma,d}\; \gamma$ sub. to $\mathrm{edge}_d(h_j) \le \gamma \;(j=1,\dots,t),\; 0 \le d_i \le \frac{1}{\nu},\; \sum_{i=1}^{m} d_i = 1$.
Output: the convex combination $f(x) = \sum_{t=1}^{T}\alpha_t h_t(x)$, where α is the solution of the corresponding primal problem.
Theorem [Demiriz et al.]: Given ε > 0, LPBoost outputs an ε-approximation of the optimal solution.
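To make the column-generation step concrete, here is a minimal sketch of an LPBoost-style loop in Python. It assumes the hypothesis outputs are precomputed in an m × n matrix H (an illustrative assumption; the names H, y, nu and lpboost are not from the slides), and it uses scipy.optimize.linprog as a stand-in for the CPLEX solver used in the experiments. It is a sketch of the idea, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def lpboost(H, y, nu, eps=1e-3, max_iter=200):
    """LPBoost-style column generation sketch.

    H  : (m, n) array, H[i, j] = h_j(x_i) in [-1, +1]   (assumed precomputed)
    y  : (m,) array of labels in {-1, +1}
    nu : soft-margin parameter, 1 <= nu <= m
    Returns the distribution d, the value gamma, and the chosen columns.
    """
    m, n = H.shape
    d = np.full(m, 1.0 / m)          # d_1: uniform distribution
    gamma = -1.0                     # no edge can be smaller than -1
    chosen = []                      # indices of hypotheses picked so far

    for _ in range(max_iter):
        # (i) column generation: hypothesis maximizing the edge w.r.t. d
        edges = (d * y) @ H          # edge_d(h_j) for every j
        j = int(np.argmax(edges))
        if edges[j] <= gamma + eps:  # no column violates the dual constraints
            break
        chosen.append(j)

        # (ii) solve the restricted dual LP over variables (d, gamma):
        #   min gamma  s.t.  sum_i d_i y_i h_j(x_i) <= gamma  (j chosen),
        #                    0 <= d_i <= 1/nu,  sum_i d_i = 1
        c = np.r_[np.zeros(m), 1.0]
        A_ub = np.c_[(y[:, None] * H[:, chosen]).T, -np.ones(len(chosen))]
        b_ub = np.zeros(len(chosen))
        A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1.0 / nu)] * m + [(None, None)],
                      method="highs")
        d, gamma = res.x[:m], res.x[m]

    return d, gamma, chosen
```

The hypothesis weights α in $f(x) = \sum_t \alpha_t h_t(x)$ are, by LP duality, the optimal multipliers of the edge constraints in this restricted LP; they can equivalently be obtained by solving the restricted primal over the chosen columns.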
Properties of the optimal solution
Let $(\alpha^*, \rho^*, \xi^*)$ be the solution of the primal problem and $(d^*, \gamma^*)$ the solution of the dual problem. The KKT conditions imply:
・instance i with margin > $\rho^*$ : $d_i^* = 0$
・instance i with margin = $\rho^*$ : $0 \le d_i^* \le 1/\nu$
・instance i with margin < $\rho^*$ : $d_i^* = 1/\nu$
Note: the optimal solution can be recovered using only the instances with positive weights.

Properties of the optimal solution (2)
Also by the KKT conditions:
・hypothesis with edge < $\gamma^*$ w.r.t. $d^*$ : $\alpha_j^* = 0$
・hypothesis with edge = $\gamma^*$ w.r.t. $d^*$ : $\alpha_j^* \ge 0$
Note: the optimal solution can be recovered using only the hypotheses with positive coefficients.

Sparseness of the optimal solution
1. Sparseness w.r.t. the hypothesis weighting
2. Sparseness w.r.t. the instances

3. Our algorithm

Our idea: Sparse LPBoost
Take advantage of the sparseness w.r.t. both hypotheses and instances.
For t = 1, ...:
   (i) pick up instances with margin < $\rho_t$;
   (ii) solve the dual problem w.r.t. the instances chosen so far by Boosting ($\rho_{t+1}$: the solution).
Output the solution of the primal problem.
Theorem: Given ε > 0, Sparse LPBoost outputs an ε-approximation of the optimal solution.

Our idea (matrix form)
The inequality constraints of the dual problem can be written with the m × n matrix $A$, $A_{ij} = y_i h_j(x_i)$, where row i corresponds to instance i and column j corresponds to hypothesis j:
$\sum_{i=1}^{m} d_i\, y_i\, h_j(x_i) \le \gamma \quad (j=1,\dots,n)$, i.e. $\; d^\top A \le \gamma \mathbf{1}^\top$.
・LP solver: uses the whole matrix.
・LPBoost: uses columns (column generation).
・Sparse LPBoost: uses intersections of columns and rows.
The "effective" constraints for the optimal solution are exactly such intersections of columns and rows.

How to choose examples (hypotheses)?
Assumptions: the number of hypotheses is constant, and the time complexity of an LP solver on m instances is $m^k$.
1st attempt: add instances one by one. Computation time:
$\sum_{t=1}^{\nu} t^k = \Omega\!\left(\frac{\nu^{k+1}}{k+1}\right)$,
which is $\Omega(m^{k+1})$ when ν is a constant fraction of m — less efficient than the LP solver!
Our method: choose at most $2^t$ instances with margin < ρ (t: # of iterations). If the algorithm terminates after it has chosen cm instances (0 < c < 1, c not known in advance), then the computation time is
$\sum_{t=1}^{\log_2(cm)} (2^t)^k = O\!\left((cm)^k\right) = O(m^k)$.
Note: the same argument holds for hypotheses.
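Below is a rough sketch of the row-generation outer loop just described, under the same assumptions as the earlier sketch (precomputed matrix H, scipy.optimize.linprog in place of CPLEX). For brevity the restricted subproblem is solved here as a direct primal LP rather than by the inner Boosting (LPBoost) procedure the slides call for, and the random initial working set and the "add at most |rows| new instances" rule are illustrative choices approximating the "at most $2^t$ instances" schedule; the function and variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def solve_primal_on_rows(H, y, rows, nu):
    """Solve the 1-norm soft margin primal restricted to the instances in `rows`.

    Variables (alpha_1..alpha_n, rho, xi_i for i in rows):
        max  rho - (1/nu) * sum_i xi_i
        s.t. y_i * sum_j alpha_j h_j(x_i) >= rho - xi_i   (i in rows)
             sum_j alpha_j = 1,  alpha >= 0,  xi >= 0
    (A direct LP call stands in for the inner Boosting/LPBoost solver here.)
    """
    m_s, n = len(rows), H.shape[1]
    c = np.r_[np.zeros(n), -1.0, np.full(m_s, 1.0 / nu)]   # linprog minimizes
    # margin constraints:  rho - xi_i - y_i * sum_j alpha_j h_j(x_i) <= 0
    A_ub = np.c_[-(y[rows][:, None] * H[rows]), np.ones(m_s), -np.eye(m_s)]
    b_ub = np.zeros(m_s)
    A_eq = np.r_[np.ones(n), 0.0, np.zeros(m_s)].reshape(1, -1)
    bounds = [(0, None)] * n + [(None, None)] + [(0, None)] * m_s
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    return res.x[:n], res.x[n]                  # alpha, rho

def sparse_lpboost(H, y, nu, eps=1e-3):
    """Row-generation outer loop: grow a working set of low-margin instances."""
    m = H.shape[0]
    rng = np.random.default_rng(0)
    # the restricted problem needs at least nu instances (each dual weight is <= 1/nu)
    rows = list(rng.choice(m, size=min(m, int(np.ceil(nu))), replace=False))
    alpha, rho = solve_primal_on_rows(H, y, rows, nu)
    while True:
        margins = y * (H @ alpha)               # margin of every instance
        new = [i for i in np.where(margins < rho - eps)[0] if i not in set(rows)]
        if not new:                             # every margin >= rho - eps: done
            return alpha, rho, rows
        # add the lowest-margin violators, at most doubling the working set
        rows += sorted(new, key=lambda i: margins[i])[:len(rows)]
        alpha, rho = solve_primal_on_rows(H, y, rows, nu)
```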
4. Experiment

Experiments (new experiments, not in the proceedings)
Data sets:
  Reuters-21578: # of examples m = 10,170; # of hypotheses n = 30,389
  RCV1:          m = 20,242;  n = 47,237
  news20:        m = 19,996;  n = 1,335,193
Parameters: ν = 0.2m, ε = 0.001. Each algorithm was implemented in C++ with the LP solver CPLEX 11.0.

Experimental Results
[Table of computation times in seconds omitted in this extraction.]
Sparse LPBoost improves the computation time by a factor of 3 to 100.

5. Summary

Summary & Open problems
Our result
• Sparse LPBoost: a provable decomposition algorithm which ε-approximates the 1-norm soft margin optimization.
• 3 to 100 times faster than an LP solver or LPBoost.
Open problems
• Theoretical guarantee on the number of iterations.
• Better methods for choosing instances (hypotheses).