CPTS 570 Machine Learning, Fall 2014
Homework #1
Due Date: Sep 18
1. (5 points) Answer the following questions with yes or no, along with a proper justification.
a. Is the decision boundary of voted perceptron linear?
b. Is the decision boundary of averaged perceptron linear?
2. (10 points) In class, we saw the Passive-Aggressive (PA) update that tries to achieve a
margin equal to one after each update. Derive the PA weight update for achieving margin M.
3. (15 points) The standard perceptron algorithm we discussed in class performs a weight update whenever there is a prediction error (ŷ_t ≠ y_t) as follows:

w_{t+1} = w_t + y_t · x_t        (1)
Suppose we know beforehand that the training data is separable with margin value γ > 0.
So we decide to incorporate this knowledge into the perceptron algorithm by changing the
weights whenever there is a margin error:

0 < y_t · (w_t / ‖w_t‖) · x_t ≤ γ        (2)
Let B(x_t, γ) be a ball of radius γ centered at x_t. Whenever there is a margin error, we want
to update the weights such that all the points in the ball B(x_t, γ) have the same label. Derive
the weight update to achieve this criterion.
4. (70 points) Programming and empirical analysis question.
Implement a multi-class online learning algorithm with both the perceptron and the passive-aggressive
(PA) weight update as shown below. Employ the single weight-vector representation (representation II,
as discussed in class). This representation is defined as follows. Each training example
is of the form (x_t, y_t), where x_t ∈ ℝ^d is the input and y_t ∈ {1, 2, · · · , k} is the class (output)
label. In this representation, you will have a single weight vector w ∈ ℝ^{k·d}, and the augmented
feature function F(x_t, y) ∈ ℝ^{k·d} will have k blocks of size d, with zeroes everywhere
except for the y-th block, which will contain x_t.
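
For concreteness, a minimal NumPy sketch of this feature function is given below; the helper name feature_fn and the 0-based class indexing are illustrative choices, not part of the assignment.

    import numpy as np

    def feature_fn(x, y, k):
        # Augmented feature F(x, y): k blocks of size d, with x copied into
        # the y-th block and zeroes everywhere else.  Labels are assumed
        # 0-based here (0, ..., k-1); shift by one if you keep 1-based labels.
        d = x.shape[0]
        F = np.zeros(k * d)
        F[y * d:(y + 1) * d] = x
        return F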
Algorithm 1 Online Multi-Class Learning Algorithm
Input: D = training examples, k = number of classes, T = maximum number of training iterations
Output: w, the final weight vector
 1: Initialize the weights w = 0
 2: for each training iteration itr ∈ {1, 2, · · · , T} do
 3:     for each training example (x_t, y_t) ∈ D do
 4:         ŷ_t = arg max_{y ∈ {1, 2, · · · , k}} w · F(x_t, y)    // predict using the current weights
 5:         if mistake then
 6:             w = w + τ · (F(x_t, y_t) − F(x_t, ŷ_t))    // update the weights
 7:         end if
 8:     end for
 9: end for
10: return the final weight vector w
For the standard perceptron, you will use τ = 1, and for the Passive-Aggressive (PA) algorithm, you
will compute the learning rate τ as follows:

τ = (1 − (w · F(x_t, y_t) − w · F(x_t, ŷ_t))) / ‖F(x_t, y_t) − F(x_t, ŷ_t)‖²        (3)
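
As a reference point, a minimal sketch of Algorithm 1 with both update rules is given below, assuming NumPy, the feature_fn helper above, 0-based labels, and D as a list of (x, y) pairs; all names are illustrative, not prescribed by the assignment.

    def train(D, k, T, update="perceptron"):
        # Online multi-class training (Algorithm 1) with representation II.
        d = D[0][0].shape[0]
        w = np.zeros(k * d)                        # single weight vector
        mistakes_per_iter = []
        for itr in range(T):
            mistakes = 0
            for x, y in D:
                scores = [w.dot(feature_fn(x, c, k)) for c in range(k)]
                y_hat = int(np.argmax(scores))     # predict with current weights
                if y_hat != y:
                    delta = feature_fn(x, y, k) - feature_fn(x, y_hat, k)
                    if update == "pa":
                        tau = (1 - w.dot(delta)) / delta.dot(delta)  # Eq. (3)
                    else:
                        tau = 1.0                  # standard perceptron
                    w = w + tau * delta
                    mistakes += 1
            mistakes_per_iter.append(mistakes)
        return w, mistakes_per_iter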
You are provided with a training set, a development (validation) set, and a testing set of classification examples. Run the Perceptron and PA algorithms on the training data to compute
the following.
a. Compute the online learning curve for both the Perceptron and the PA algorithm by plotting the
number of training iterations (1 to 100) on the x-axis and the number of mistakes on the
y-axis. Compare the two curves and list your observations.
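
For part (a), assuming a training loop like the train() sketch above that records per-iteration mistake counts, the curve could be plotted with matplotlib roughly as follows (D_train is an assumed list of (x, y) pairs):

    import matplotlib.pyplot as plt

    _, p_mistakes = train(D_train, k, T=100, update="perceptron")
    _, pa_mistakes = train(D_train, k, T=100, update="pa")

    plt.plot(range(1, 101), p_mistakes, label="Perceptron")
    plt.plot(range(1, 101), pa_mistakes, label="PA")
    plt.xlabel("Training iteration")
    plt.ylabel("Number of mistakes")
    plt.legend()
    plt.show()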
b. Compute the accuracy of both the Perceptron and the PA algorithm on the training data and the validation data as a function of the training iterations (1 to 100). You will thus have two accuracy
curves for the Perceptron and another two for the PA algorithm. Compare the four
curves and list your observations.
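
For part (b), an accuracy helper along the following lines may be useful; to obtain accuracy as a function of the training iterations, evaluate it on the training and validation sets after each pass (e.g., by having train() record those values every iteration):

    def accuracy(w, D, k):
        # Fraction of examples in D that w classifies correctly.
        correct = 0
        for x, y in D:
            scores = [w.dot(feature_fn(x, c, k)) for c in range(k)]
            if int(np.argmax(scores)) == y:
                correct += 1
        return correct / len(D)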
c. Tune the hyper-parameter (the number of training iterations) on the validation data,
and compute the test accuracy of the Perceptron and PA algorithms with the corresponding
best weights. Which algorithm achieves the better test accuracy?
d. Repeat the Perceptron experiments (a, b, and c) by shuffling the training data during each
iteration. Now compare the experiments with and without shuffling. What did you observe?
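
For part (d), the only change to the training loop is to reorder the examples at the start of each iteration; a sketch using the standard-library random module (the helper name shuffled is illustrative):

    import random

    rng = random.Random(0)        # fixed seed for reproducible runs

    def shuffled(D):
        # Return a reshuffled copy of the training data; call once per
        # training iteration, e.g. `for x, y in shuffled(D):`.
        D = list(D)
        rng.shuffle(D)
        return D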
e. Repeat the Perceptron experiments (a, b, and c) with a variable learning rate (1/iteration)
during each iteration. Now compare the experiments with the fixed learning rate and the variable
learning rate. What did you observe?
f. Repeat experiment (c) with the averaged perceptron (remember the efficient implementation
trick I mentioned in class). Compare the test accuracies of the plain perceptron and the averaged
perceptron. What did you observe?
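
For part (f), one common efficient implementation of the averaged perceptron keeps a second accumulator vector so the average of all intermediate weight vectors can be recovered at the end, without summing a full copy of w after every example. The sketch below follows that accumulator trick; it may or may not be the exact trick mentioned in class.

    def train_averaged(D, k, T):
        # Averaged perceptron with the accumulator ('lazy averaging') trick.
        d = D[0][0].shape[0]
        w = np.zeros(k * d)
        u = np.zeros(k * d)   # accumulates example-count-weighted updates
        c = 1                 # counts examples processed so far
        for itr in range(T):
            for x, y in D:
                scores = [w.dot(feature_fn(x, cls, k)) for cls in range(k)]
                y_hat = int(np.argmax(scores))
                if y_hat != y:
                    delta = feature_fn(x, y, k) - feature_fn(x, y_hat, k)
                    w = w + delta
                    u = u + c * delta
                c += 1
        return w - u / c      # average of the intermediate weight vectors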
g. Compute the general learning curve (vary the number of training examples starting from
1000, in increments of 1000) for a fixed number of training iterations (100). Plot the
number of training examples on the x-axis and the testing accuracy on the y-axis. List your
observations from this curve.
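
For part (g), a sketch of the outer loop, reusing the train() and accuracy() helpers and the matplotlib import from above, and assuming D_train and D_test are lists of (x, y) pairs:

    sizes = list(range(1000, len(D_train) + 1, 1000))
    test_acc = []
    for n in sizes:
        w, _ = train(D_train[:n], k, T=100)
        test_acc.append(accuracy(w, D_test, k))

    plt.plot(sizes, test_acc)
    plt.xlabel("Number of training examples")
    plt.ylabel("Testing accuracy")
    plt.show()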