Introduction to Machine Learning (89511)
Exercise 5
Submission: December 31, 2014

1 Submission instructions:

1. Theoretical Part

   (a) If needed, explain or present an example for any given answer, to ensure your idea is fully understood.
   (b) Write your full name and ID at the top of your solution.
   (c) The following submission methods are allowed:
       i.   At the end of the Tirgul lesson.
       ii.  PDF version via email to: danny dot karmon at biu dot ac dot il
            The title should be: Exercise 4: Theoretical part - [Full Name] - [ID]
       iii. To Danny Karmon's mailbox (10).

2. Practical Part

   (a) You are not allowed to use any machine learning packages or tools (e.g. scikit-learn, PyBrain, PyML, etc.).
   (b) You are allowed to use the numpy package.
   (c) Use Python 2.7.
   (d) In order to submit your solution, send an email to: danny dot karmon at biu dot ac dot il
       i.   The title should be: Exercise 4: Practical part - [Full Name] - [ID]
       ii.  Your email should contain your Full Name and ID.
       iii. knn.py - your code file.
       iv.  knn_report.txt - a full performance overview of the different kNN models.

Good Luck!

2 Questions:

1. Theoretical Part

Let X be the instance domain, and assume the label set Y is {0, 1}. The loss is the zero-one loss, i.e. 1[y_hat != y]. Given any probability distribution D over X x {0, 1}, the Bayes classifier f_D is defined by:

    f_D(x) = 1  if P[y = 1 | x] >= 1/2
             0  otherwise

Show that for every probability distribution D, the Bayes optimal predictor f_D is optimal, in the sense that for every classifier (function) g from X to {0, 1}, L_D(f_D) <= L_D(g).

2. Practical Part

In this part you will implement the kNN algorithm with different setups (i.e. # of neighbors and distance measurement methods) and evaluate their performance.

   (a) Your code file should be called knn.py.
   (b) Download the Wine Data Set.
   (c) Read the Data Set Description.
   (d) Your code should receive two input parameters: [k - # of neighbors] [distance measurement method code] (e.g. knn.py 7 2). Measurement method codes:
       • 1 - Taxicab geometry (L1)
       • 2 - Euclidean distance (L2)
       • 3 - L-infinity distance
   (e) Shuffle the data instances to avoid overfitting.
   (f) Use 10-fold cross-validation to evaluate the performance of your model (partition the data into 10 sets; on each iteration use a different set for validation and the rest as training data).
   (g) Implement the kNN algorithm, as presented in the tirgul.
   (h) Run the algorithm according to the input parameters over the data, evaluating the performance of the model.
   (i) Pick 3 values for the k (# of neighbors) input parameter.
   (j) Run your code on all 9 possible combinations of # of neighbors and distance measurement method values.
   (k) Write the results to a report file named knn_report.txt, in the following format (one line per model combination):

       k:XX; distance measurement method code:X; loss:XXX
       ...
       k:XX; distance measurement method code:X; loss:XXX
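The steps above can be sketched as follows. This is an illustrative sketch only, not a submission: it substitutes a small synthetic two-class dataset for the Wine data, hard-codes three example values of k instead of reading them from the command line, and prints the report lines instead of writing knn_report.txt. All function names here are my own; the sketch uses only the standard library.

```python
import random

def distance(a, b, method):
    """method: 1 - taxicab (L1), 2 - Euclidean (L2), 3 - L-infinity."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if method == 1:
        return sum(diffs)
    if method == 2:
        return sum(d * d for d in diffs) ** 0.5
    return max(diffs)

def knn_predict(train, x, k, method):
    """Majority vote among the k training instances nearest to x."""
    neighbors = sorted(train, key=lambda inst: distance(inst[0], x, method))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

def cross_validate(data, k, method, folds=10):
    """Average zero-one loss over `folds` train/validation splits."""
    data = list(data)
    random.shuffle(data)                      # shuffle before partitioning
    fold_size = len(data) // folds
    losses = []
    for i in range(folds):
        val = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        errors = sum(1 for x, y in val if knn_predict(train, x, k, method) != y)
        losses.append(errors / float(len(val)))
    return sum(losses) / len(losses)

if __name__ == "__main__":
    random.seed(0)
    # Synthetic stand-in for the Wine data: two well-separated classes.
    data = ([([random.gauss(0, 1), random.gauss(0, 1)], 0) for _ in range(50)] +
            [([random.gauss(4, 1), random.gauss(4, 1)], 1) for _ in range(50)])
    for k in (1, 5, 9):                       # three example values of k
        for method in (1, 2, 3):              # all 9 combinations
            loss = cross_validate(data, k, method)
            print("k:%d; distance measurement method code:%d; loss:%.3f"
                  % (k, method, loss))
```

The real knn.py would read k and the method code from sys.argv (e.g. knn.py 7 2) and load the Wine instances, but the distance, prediction, and cross-validation logic would stay the same shape.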