Linear Programming Boosting by Column and Row Generation
Kohei Hatano and Eiji Takimoto
Kyushu University, Japan
DS 2009
1. Introduction
2. Preliminaries
3. Our algorithm
4. Experiment
5. Summary
Example
Given a web page, predict whether the topic is "DS 2009".
Hypothesis set = words: each word gives a simple rule that outputs +1 if the word appears on the page (y) and -1 if it does not (n).
Weighted majority vote:
0.5*(DS 2009?) + 0.3*(ALT?) + 0.2*(Porto?)
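To make the vote concrete, here is a minimal sketch of such a weighted majority over word hypotheses; the `word_stump` helper and the example pages are illustrative, not from the talk.

```python
# Minimal sketch: a weighted majority vote of "word" hypotheses.
# Each hypothesis answers +1 if its word occurs in the page, -1 otherwise.

def word_stump(word):
    return lambda page: 1 if word in page else -1

hypotheses = [word_stump("DS 2009"), word_stump("ALT"), word_stump("Porto")]
alphas = [0.5, 0.3, 0.2]           # convex combination: non-negative, sums to 1

def predict(page):
    vote = sum(a * h(page) for a, h in zip(alphas, hypotheses))
    return 1 if vote >= 0 else -1  # sign of the weighted vote

print(predict("DS 2009 will be held in Porto"))  # +1
print(predict("a page about something else"))    # -1
```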
Modern Machine Learning Approach
Find a weighted majority of hypotheses (or hyperplane) which enlarges the margin.
1-norm soft margin optimization (aka 1-norm SVM)
A popular formulation, alongside 2-norm soft margin optimization: "find a hyperplane which separates positive and negative instances well."

$$
\begin{aligned}
\max_{\alpha,\rho,\xi}\quad & \underbrace{\rho}_{\text{margin}} \;-\; \frac{1}{\nu}\underbrace{\sum_{i=1}^{m}\xi_i}_{\text{loss}} \qquad (\nu:\ \text{constant s.t. } 1 \le \nu \le m)\\
\text{sub.to}\quad & y_i \sum_{j=1}^{n} \alpha_j h_j(x_i) \;\ge\; \rho - \xi_i \quad (i=1,\dots,m,\ y_i \in \{+1,-1\})\\
& \sum_{j=1}^{n} \alpha_j = 1,\quad \alpha_j \ge 0 \qquad (\alpha\ \text{normalized with the 1-norm})
\end{aligned}
$$

Note
・Linear Program
・good generalization guarantee [Schapire et al. 98]
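As a reference point for the later slides, here is a small sketch that hands this primal LP to an off-the-shelf solver (scipy's `linprog`; the talk's own experiments use CPLEX instead). The matrix `U` with `U[i, j] = y_i * h_j(x_i)`, the toy sizes and the random data are assumptions made purely for illustration.

```python
# 1-norm soft margin optimization as a generic LP (sketch).
import numpy as np
from scipy.optimize import linprog

def solve_primal(U, nu):
    """U[i, j] = y_i * h_j(x_i).  Returns (alpha, rho, xi)."""
    m, n = U.shape
    # variables x = [alpha (n), rho (1), xi (m)]; minimize -(rho - (1/nu) * sum(xi))
    c = np.concatenate([np.zeros(n), [-1.0], np.full(m, 1.0 / nu)])
    # y_i * sum_j alpha_j h_j(x_i) >= rho - xi_i   <=>   rho - xi_i - (U alpha)_i <= 0
    A_ub = np.hstack([-U, np.ones((m, 1)), -np.eye(m)])
    b_ub = np.zeros(m)
    A_eq = np.concatenate([np.ones(n), [0.0], np.zeros(m)]).reshape(1, -1)  # sum_j alpha_j = 1
    bounds = [(0, None)] * n + [(None, None)] + [(0, None)] * m             # alpha, rho, xi
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:n], res.x[n], res.x[n + 1:]

rng = np.random.default_rng(0)
U = rng.choice([-1.0, 1.0], size=(50, 20))      # toy data: m=50 instances, n=20 hypotheses
alpha, rho, xi = solve_primal(U, nu=10)
print("rho:", rho, " nonzero alphas:", int(np.sum(alpha > 1e-8)))
```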
1-norm soft margin optimization (2)

2-norm soft margin opt. → non-sparse hyperplane:
0.2*(DS 2009?) + 0.1*(ALT?) + 0.1*(Porto?) + 0.1*(wine?) + 0.05*(tasting?) + 0.05*(discovery?) + 0.03*(science?) + 0.02*(…) + …

1-norm soft margin opt. → sparse hyperplane:
0.5*(DS 2009?) + 0.3*(ALT?) + 0.2*(Porto?)

Advantage of 1-norm soft margin opt.: the solution is likely to be sparse ⇒ useful for feature selection.
Recent Results

2-norm soft margin optimization:
・Quadratic Programming
・state-of-the-art solvers exist: SMO [Platt, 1999], SVMlight [Joachims, 1999], SVMPerf [Joachims, 2006], Pegasos [Shalev-Shwartz et al., 2007]

1-norm soft margin optimization:
・Linear Programming
・LPBoost [Demiriz et al., 2003], Entropy Regularized LPBoost [Warmuth et al., 2008], others [Mangasarian 2006][Sra 2006]
・not efficient enough for large data → room for improvements
Our result
A new algorithm for 1-norm soft margin optimization.
1. Introduction
2. Preliminaries
3. Our algorithm
4. Experiment
5. Summary
Boosting
Classification task: frog → "+1", others → "-1".
Hypotheses: color and size.
[Figure: instances plotted by color and size, labeled +1 / -1.]

1. d1: uniform distribution
2. For t = 1, ..., T
   (i) Choose hypothesis ht maximizing the edge w.r.t. dt
   (ii) Update distribution dt to dt+1
3. Output weighting of chosen hypotheses
Boosting (2)
Edge of hypothesis h w.r.t. distribution d:

$$\mathrm{edge}_d(h) = \sum_{i=1}^{m} d_i\, y_i\, h(x_i) \in [-1,+1]$$

Since h(x_i) ∈ {-1,+1} and y_i h(x_i) > 0 iff h is correct on x_i, the edge measures the accuracy of h under d.
[Figure: h1 splits the instances by size; the frog instances lie on the +1 side.]
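In code the edge is a single weighted sum; a minimal sketch with made-up labels and predictions (the variable names are illustrative):

```python
# edge_d(h) = sum_i d_i * y_i * h(x_i)  in [-1, +1]
import numpy as np

def edge(d, y, h_values):
    """d: distribution over instances, y: labels in {-1,+1}, h_values: h(x_i) for each i."""
    return float(np.sum(d * y * h_values))

m = 4
d = np.full(m, 1.0 / m)                 # d_1: uniform distribution
y = np.array([1, 1, -1, -1])
h_values = np.array([1, -1, -1, -1])    # h is correct on instances 1, 3 and 4
print(edge(d, y, h_values))             # 0.5
```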
Boosting (2) (cont.)
After choosing h1, the distribution is updated: more weight goes to the instances misclassified by h1.
[Figure: the same instances with the updated weights dt+1.]
Boosting (3)
Note: more weight is put on "difficult" instances.
[Figure: w.r.t. the updated distribution, the next hypothesis h2 (a split on color) is chosen.]
Boosting (4)
[Figure: the final classifier combines the chosen hypotheses.]
3. Output weighting of chosen hypotheses: f(x) = 0.4 h1(x) + 0.6 h2(x)
Boosting & 1-norm soft margin optimization

Primal: "find the large-margin hyperplane which separates pos. and neg. instances as much as possible"

$$
\begin{aligned}
\max_{\alpha,\rho,\xi}\quad & \rho - \frac{1}{\nu}\sum_{i=1}^{m}\xi_i \qquad (\nu \ge 1)\\
\text{sub.to}\quad & y_i \sum_{j=1}^{n}\alpha_j h_j(x_i) \ge \rho - \xi_i \quad (i=1,\dots,m,\ y_i \in \{+1,-1\})\\
& \sum_{j=1}^{n}\alpha_j = 1,\quad \alpha_j \ge 0
\end{aligned}
$$

Dual (equivalent by LP duality): "find the distribution which minimizes the maximum edge of the hypotheses", i.e. the most "difficult" distribution w.r.t. the hypotheses ≈ solvable using Boosting.

$$
\begin{aligned}
\min_{\gamma, d}\quad & \gamma\\
\text{sub.to}\quad & \mathrm{edge}_d(h_j) \le \gamma \quad (j=1,\dots,n)\\
& 0 \le d_i \le \frac{1}{\nu}, \quad \sum_{i=1}^{m} d_i = 1
\end{aligned}
$$
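A sketch of this dual LP fed to scipy's `linprog`, following the same `U[i, j] = y_i * h_j(x_i)` convention and synthetic data as the primal sketch above (again an illustration, not the talk's implementation):

```python
# Dual: minimize the maximum edge over a capped distribution d (sketch).
import numpy as np
from scipy.optimize import linprog

def solve_dual(U, nu):
    m, n = U.shape
    # variables x = [d (m), gamma]; minimize gamma
    c = np.concatenate([np.zeros(m), [1.0]])
    # edge_d(h_j) = sum_i d_i * U[i, j] <= gamma   for every hypothesis j
    A_ub = np.hstack([U.T, -np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)   # sum_i d_i = 1
    bounds = [(0, 1.0 / nu)] * m + [(None, None)]               # 0 <= d_i <= 1/nu, gamma free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:m], res.x[m]                                  # (d*, gamma*)

rng = np.random.default_rng(0)
U = rng.choice([-1.0, 1.0], size=(50, 20))
d_star, gamma_star = solve_dual(U, nu=10)
print("gamma*:", gamma_star, " instances with positive weight:", int(np.sum(d_star > 1e-8)))
```

By strong LP duality, the optimal value γ* coincides with the optimal primal objective ρ* - (1/ν)Σξ_i*.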
LPBoost [Demiriz et al., 2003]

Update: solve the dual problem w.r.t. the hypotheses chosen so far, {h1, ..., ht}:

$$
\begin{aligned}
(\gamma_{t+1}, d_{t+1}) = \arg\min_{\gamma, d}\quad & \gamma\\
\text{sub.to}\quad & \mathrm{edge}_d(h_j) \le \gamma \quad (j=1,\dots,t)\\
& 0 \le d_i \le \frac{1}{\nu}, \quad \sum_{i=1}^{m} d_i = 1
\end{aligned}
$$

Output: the convex combination

$$f(x) = \sum_{t=1}^{T} \alpha_t h_t(x),$$

where α is the solution of the primal problem.

Theorem [Demiriz et al.]
Given ε > 0, LPBoost outputs an ε-approximation of the optimal solution.
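A sketch of the LPBoost loop described above, reusing the `solve_dual` helper from the previous snippet; the stopping test and the iteration cap are illustrative choices rather than the exact procedure of [Demiriz et al., 2003].

```python
# LPBoost-style column generation (sketch): add the hypothesis with the largest
# edge under the current distribution, then re-solve the restricted dual.
import numpy as np

def lpboost(U, nu, eps=1e-3, max_iter=100):
    m, _ = U.shape
    d = np.full(m, 1.0 / m)             # start from the uniform distribution
    chosen = []                         # indices of hypotheses picked so far
    gamma = -1.0
    for _ in range(max_iter):
        edges = d @ U                   # edge of every hypothesis w.r.t. d
        j = int(np.argmax(edges))
        if edges[j] <= gamma + eps:     # no hypothesis violates the dual constraints by more than eps
            break
        chosen.append(j)
        d, gamma = solve_dual(U[:, chosen], nu)   # restricted dual over the chosen columns
    return chosen, d, gamma
```

The final hypothesis weighting α is obtained from the restricted primal problem over the chosen columns (e.g. with the `solve_primal` sketch above).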
Properties of the optimal solution

Let (α*, ρ*, ξ*) be the solution of the primal problem and (d*, γ*) the solution of the dual problem. The KKT conditions imply:

・instance i with margin > ρ*  ⇒  d_i* = 0
・instance i with margin = ρ*  ⇒  0 ≤ d_i* ≤ 1/ν
・instance i with margin < ρ*  ⇒  d_i* = 1/ν

Note: the optimal solution can be recovered using only the instances with positive weights.
Properties of the optimal solution (2)

By the KKT conditions:

・hypothesis with edge < γ* w.r.t. d*  ⇒  α_j* = 0
・hypothesis with edge = γ* w.r.t. d*  ⇒  α_j* ≥ 0

Note: the optimal solution can be recovered using only the hypotheses with positive coefficients.
Sparseness of the optimal solution
1. Sparseness w.r.t. hypotheses weighting
2. Sparseness w.r.t. instances
1. Introduction
2. Preliminaries
3. Our algorithm
4. Experiment
5. Summary
Our idea: Sparse LPBoost
Take advantage of the sparseness w.r.t. both hypotheses and instances.

2. For t = 1, ...
   (i) Pick up instances with margin < ρt
   (iii) Solve the dual problem w.r.t. the instances chosen so far by Boosting (ρt+1: the solution)
3. Output the solution of the primal problem.

Theorem
Given ε > 0, Sparse LPBoost outputs an ε-approximation of the optimal solution.
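Step (i) in code form, assuming the matrix `U` (with `U[i, j] = y_i * h_j(x_i)`) and the current weighting `alpha` from the earlier sketches; a fuller loop is sketched after the complexity argument below.

```python
# The instances the algorithm "picks up" are exactly the rows whose margin
# under the current weighting alpha falls below rho_t (sketch).
import numpy as np

def instances_below_margin(U, alpha, rho):
    margins = U @ alpha                  # margin of instance i: y_i * sum_j alpha_j h_j(x_i)
    return np.flatnonzero(margins < rho)
```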
Our idea (matrix form)

Inequality constraints of the dual problem, written over the matrix whose (i, j) entry is y_i h_j(x_i). Each row i corresponds to instance i; each column j corresponds to hypothesis j. For every column j, its inner product with d must be at most γ:

$$
d^{\top}
\begin{pmatrix}
y_1 h_1(x_1) & \cdots & y_1 h_j(x_1) & \cdots & y_1 h_n(x_1)\\
\vdots & & \vdots & & \vdots\\
y_i h_1(x_i) & \cdots & y_i h_j(x_i) & \cdots & y_i h_n(x_i)\\
\vdots & & \vdots & & \vdots\\
y_m h_1(x_m) & \cdots & y_m h_j(x_m) & \cdots & y_m h_n(x_m)
\end{pmatrix}
\le (\gamma, \dots, \gamma)
$$
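For concreteness, this is how such a matrix could be assembled for the word-stump example from the introduction; the toy pages and labels below are made up for illustration.

```python
# Build U[i, j] = y_i * h_j(x_i) from labels and hypothesis predictions (sketch).
import numpy as np

pages = ["DS 2009 in Porto", "ALT and DS 2009", "wine tasting"]
y = np.array([1, 1, -1])
words = ["DS 2009", "ALT", "Porto"]

H = np.array([[1 if w in p else -1 for w in words] for p in pages])  # H[i, j] = h_j(x_i)
U = y[:, None] * H                                                   # U[i, j] = y_i * h_j(x_i)
print(U)   # rows = instances, columns = hypotheses
# Dual constraints: for every column j, sum_i d_i * U[i, j] <= gamma.
```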
[Figure: which entries of the matrix each method uses]
・LP solver: the whole matrix.
・LPBoost: only the columns of the chosen hypotheses (column generation).
・Sparse LPBoost: only the intersections of the chosen columns and chosen rows (column and row generation).
・The constraints that are "effective" at the optimal solution also form intersections of a few columns and rows.
How to choose examples (hypotheses)?

Assumptions:
・# of hypotheses is constant
・time complexity of an LP solver over m instances: m^k

1st attempt: add instances one by one.

$$(\text{computation time}) \ge \sum_{t=1}^{\nu} t^k \ge \frac{\nu^{k+1}}{k+1} = \Omega(m^{k+1}) \qquad (\text{for } \nu = \Theta(m))$$

→ less efficient than running the LP solver once!

Our method: choose at most 2^t instances with margin < ρ at iteration t. If the algorithm terminates after choosing cm instances (0 < c < 1),

$$(\text{computation time}) \le \sum_{t=1}^{\log(cm)} (\nu + 2^t)^k \le (4c + 0.2)^k m^k = O(m^k).$$

Note: the same argument holds for hypotheses.
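Putting the pieces together, a sketch of the row-generation loop with this doubling rule, reusing `solve_primal` from the earlier snippet. The initial working set, the violation tolerance and the choice of which violating rows to add are assumptions, and hypothesis (column) generation is omitted for brevity.

```python
# Row generation with a doubling rule (sketch): at round t, add at most 2**t of
# the instances whose margin is below the current rho, then re-solve the
# restricted LP on the chosen rows.
import numpy as np

def sparse_lpboost_rows(U, nu, eps=1e-3):
    m, _ = U.shape
    chosen = np.arange(int(np.ceil(nu)))             # initial working set (assumption)
    alpha, rho = None, None
    for t in range(1, m + 1):
        alpha, rho, _ = solve_primal(U[chosen, :], nu)
        margins = U @ alpha                           # margins of *all* instances
        violated = np.setdiff1d(np.flatnonzero(margins < rho - eps), chosen)
        if violated.size == 0:
            break                                     # no instance outside the set violates the margin
        chosen = np.union1d(chosen, violated[: 2 ** t])   # add at most 2**t new rows
    return alpha, rho
```

The loop stops once no instance outside the working set has margin below ρ - eps, mirroring the talk's rule of picking up instances with margin < ρt.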
1. Introduction
2. Preliminaries
3. Our algorithm
4. Experiment
5. Summary
Experiments
(new experiments, not in the proceedings)

Data sets:

Data set         # of examples m    # of hypotheses n
Reuters-21578    10,170             30,389
RCV1             20,242             47,237
news20           19,996             1,335,193

Parameters: ν = 0.2m, ε = 0.001.
Each algorithm is implemented in C++ with the LP solver CPLEX 11.0.
Experimental Results (sec.)
Sparse LPBoost improves the computation time by 3 to 100 times.
1. Introduction
2. Preliminaries
3. Our algorithm
4. Experiment
5. Summary
Summary & Open problems
Our result
• Sparse LPBoost: a provable decomposition algorithm which ε-approximates 1-norm soft margin optimization
• 3 to 100 times faster than an LP solver or LPBoost
Open problems
• Theoretical guarantee on the # of iterations
• Better methods for choosing instances (hypotheses)