Introduction to Machine Learning (89511) Exercise 5

Submission: December 31, 2014
Submission instructions:
1. Theoretical Part
(a) If needed, explain or present an example with your answer to make sure your idea is
fully understood.
(b) Write your full name and ID at the top of your solution.
(c) The following submission methods are allowed:
i. At the end of the Tirgul lesson
ii. PDF version via email to:
danny dot karmon at biu dot ac dot il
The title should be: Exercise 5: Theoretical part - [Full Name] - [ID]
iii. To Danny Karmon’s mailbox (10)
2. Practical Part
(a) You are not allowed to use any machine learning packages or tools (e.g. scikit-learn,
PyBrain, PyML, etc.).
(b) You are allowed to use the numpy package
(c) Use Python 2.7
(d) In order to submit your solution, please send an email to:
danny dot karmon at biu dot ac dot il
The title should be: Exercise 5: Practical part - [Full Name] - [ID]
Your email should consist of:
i. Your Full Name and ID
ii. knn.py - your code file.
iii. knn_report.txt - a full performance overview of the different kNN models.
Good Luck!
Questions:
1. Theoretical Part
Let X be the instance domain, and assume the label set Y is {0, 1}. The loss is the
zero-one loss (i.e. 1[ŷ ≠ y]). Given any probability distribution D over X × {0, 1}, the
Bayes classifier fD is defined by:
fD(x) = 1 if P[y = 1|x] ≥ 1/2, and fD(x) = 0 otherwise.
Show that for every probability distribution D, the Bayes optimal predictor fD is optimal,
in the sense that for every classifier (function) g from X to {0, 1}, LD (fD ) ≤ LD (g).
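As a sanity check (not a substitute for the proof you are asked to write), the pointwise comparison behind the Bayes rule can be illustrated numerically. The value p below is a hypothetical conditional probability P[y = 1 | x] for a single point x:

```python
# Hypothetical conditional probability P[y = 1 | x] at some fixed point x.
p = 0.7

# Under zero-one loss, the expected loss at x of always predicting 1 is
# P[y = 0 | x], and of always predicting 0 it is P[y = 1 | x].
loss_predict_1 = 1.0 - p
loss_predict_0 = p

# The Bayes classifier picks the label with the smaller expected loss,
# i.e. predicts 1 exactly when p >= 1/2.
bayes_prediction = 1 if p >= 0.5 else 0
assert bayes_prediction == 1
assert min(loss_predict_0, loss_predict_1) == loss_predict_1
```

The proof amounts to showing that this pointwise argument holds for every x, and that a pointwise-optimal rule is optimal in expectation over D.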
2. Practical Part
In this part you will implement the kNN algorithm with different setups (i.e. # of neighbors
and distance measurement methods), and evaluate their performance.
(a) Your code file should be called knn.py.
(b) Download the Wine Data Set.
(c) Read the Data Set Description.
(d) Your code should receive two input parameters: [k - # of neighbors] [distance measurement method code] (e.g. knn.py 7 2).
Measurement method codes:
• 1 - Taxicab geometry
• 2 - Euclidean distance
• 3 - L-infinity distance
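For illustration, the three distance measures could be implemented as a single dispatch function (a sketch using numpy; the function and parameter names are illustrative, not required by the exercise):

```python
import numpy as np

def distance(a, b, method):
    """Distance between feature vectors a and b.

    method: 1 = taxicab (L1), 2 = Euclidean (L2), 3 = L-infinity (max norm).
    """
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if method == 1:
        return d.sum()            # sum of absolute coordinate differences
    if method == 2:
        return np.sqrt((d ** 2).sum())  # straight-line distance
    if method == 3:
        return d.max()            # largest coordinate difference
    raise ValueError("unknown distance measurement method code: %r" % method)
```

For example, `distance([0, 0], [3, 4], 2)` gives 5.0, while codes 1 and 3 give 7.0 and 4.0 on the same pair.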
(e) Shuffle the data instances so the folds do not reflect the original ordering of the data.
(f) Use 10-fold cross-validation to evaluate the performance of your model (partition the data into 10 sets; on each iteration use a different set for validation and the rest as training data).
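The shuffle-and-partition step might be sketched as follows (the function name, the fixed seed, and the generator interface are illustrative choices, not requirements of the exercise):

```python
import random

def cross_validation_folds(instances, num_folds=10, seed=0):
    """Shuffle the instances and split them into num_folds roughly equal parts.

    On iteration i, part i is the validation set and the remaining
    parts together form the training set.
    """
    data = list(instances)
    random.Random(seed).shuffle(data)  # fixed seed keeps runs reproducible
    parts = [data[i::num_folds] for i in range(num_folds)]
    for i in range(num_folds):
        validation = parts[i]
        training = [x for j, part in enumerate(parts) if j != i for x in part]
        yield training, validation
```

Each instance appears in the validation set exactly once across the 10 iterations, so the average validation loss uses every instance.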
(g) Implement the kNN algorithm, as presented at the tirgul.
(h) Run the algorithm according to the input parameters over the data while evaluating the performance of the model.
(i) Pick 3 values for the k (# of neighbors) input parameter.
(j) Run your code on all 9 possible combinations of # of neighbors and distance measurement method values.
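The prediction step of kNN can be sketched in a few lines (a minimal illustration, assuming `train` is a list of (features, label) pairs and `dist` is any distance function; ties in the majority vote are broken arbitrarily here, and you may want a deterministic rule):

```python
from collections import Counter

def knn_predict(train, query, k, dist):
    """Predict the label of `query` by majority vote of its k nearest neighbors.

    train: list of (features, label) pairs.
    dist:  function mapping two feature vectors to a distance.
    """
    # Sort training instances by distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    labels = [label for _, label in neighbors]
    # Majority vote among the k nearest labels.
    return Counter(labels).most_common(1)[0][0]
```

For instance, with four 1D training points `[((0,), 'a'), ((1,), 'a'), ((10,), 'b'), ((11,), 'b')]` and k = 3, a query near 0 is labeled 'a' and one near 10 is labeled 'b'.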
(k) The results should be written to a report file named knn_report.txt, one line per model combination, in the following format:
k:XX; distance measurement method code:X; loss:XXX
...
k:XX; distance measurement method code:X; loss:XXX