Exam #1 study topics

CS 491/591 – Machine Learning – Spring 2014
Review Subjects for Exam #1
Be sure to review the following:
• LFD Lectures: 1–7, 9 (don’t worry about VC details)
Note that lecture videos + slides are available on the LFD website
• LFD Text: chapters 1–3 (don’t worry about: VC details, bias-variance details)
• Carver Slides: ML Intro, Perceptrons, Prob & Stats, Concept Learning, Linear Models
• Homework #1 (PLA and Delta Rule)
Topics to be covered on Exam #1:
1. Introduction to machine learning
• What is machine learning?
• LFD's three conditions: (1) a pattern exists, (2) we cannot pin it down mathematically,
(3) we have data on it.
• Key issue: the function (etc.) to be learned is unknown and unknowable, so how can we
judge that one has learned well?
• Five characteristics used to “categorize” ML techniques: e.g., target format,
information/feedback, etc.
• Inductive learning: what it is (i.e., inducing function from examples).
• Supervised learning: labeled training data, need for teacher/expert.
• Supervised vs. unsupervised learning vs. reinforcement learning.
• Offline vs. online learning.
• Other “types” of learning: decision tree learning, neural network learning, etc.
2. Supervised/Inductive Learning Basics
• Inductive learning terminology: input set/space, training data, examples,
attributes/features, target function, hypothesis set/space, hypotheses,
classification, regression, etc.
• Learning as search through hypothesis space.
• Training vs. testing, learning curves (see the sketch after this list).
• Learning functions (y = f(x)) vs. distributions (p(y|x)) (noisy/stochastic data).
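A minimal, hypothetical sketch (toy data, simple linear fit; not from the course materials) of training vs. testing: the training set is used to choose a hypothesis g, and a held-out test set estimates its out-of-sample error. Repeating the fit for growing training-set sizes traces out a learning curve.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))   # noisy target

    X_train, y_train = X[:150], y[:150]   # training data: used to choose the hypothesis g
    X_test,  y_test  = X[150:], y[150:]   # test data: used only to estimate out-of-sample error

    def fit_linear(Xs, ys):
        """Simple linear classifier via a least-squares (pseudo-inverse) fit."""
        Xb = np.hstack([np.ones((len(Xs), 1)), Xs])   # prepend x0 = 1 for the bias weight
        return np.linalg.pinv(Xb) @ ys

    def error(w, Xs, ys):
        """Fraction of misclassified points (binary classification error)."""
        preds = np.sign(np.hstack([np.ones((len(Xs), 1)), Xs]) @ w)
        return float(np.mean(preds != ys))

    w = fit_linear(X_train, y_train)
    print("training error:", error(w, X_train, y_train))
    print("test-set estimate of out-of-sample error:", error(w, X_test, y_test))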
3. Perceptron Learning
• What is the hypothesis space for perceptrons? (Hyperplanes)
• Notion of linearly separable data/problems.
• Perceptron Learning Algorithm: basics, issues (see the sketch after this list).
• Delta Rule: differences with PLA, advantages/disadvantages.
• Gradient descent: basic approach, differentiability, termination.
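A minimal sketch of PLA and the Delta Rule (the Homework #1 algorithms), assuming numpy, labels y in {-1, +1}, and a bias weight via x0 = 1; the function names and parameters below are illustrative, not the official homework code.

    import numpy as np

    def pla(X, y, max_iters=1000):
        """Perceptron Learning Algorithm: pick any misclassified point and update
        w <- w + y*x. Terminates only if the data are linearly separable."""
        Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1 for the bias weight
        w = np.zeros(Xb.shape[1])
        for _ in range(max_iters):
            wrong = np.where(np.sign(Xb @ w) != y)[0]
            if len(wrong) == 0:
                break                               # all points classified correctly
            i = np.random.choice(wrong)             # any misclassified point will do
            w = w + y[i] * Xb[i]
        return w

    def delta_rule(X, y, eta=0.01, epochs=100):
        """Delta rule: gradient descent on the (differentiable) squared error of the
        linear output w.x, so it still moves toward a best fit even when the data are
        not separable; it stops after a fixed number of epochs rather than at zero error."""
        Xb = np.hstack([np.ones((len(X), 1)), X])
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):
            grad = 2 * Xb.T @ (Xb @ w - y) / len(y)  # gradient of the mean squared error
            w = w - eta * grad                       # step opposite the gradient
        return w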
4. Linear Learning Models
• Classification vs. regression.
• PLA and Pocket algorithm.
• Linear regression: closed-form solution (see the sketch after this list).
• Logistic regression (classification): sigmoid, MLE, gradient descent, probabilities.
• Error measures: classification/binary error, squared error, cross-entropy error (MLE).
• Maximum likelihood estimation: choose θ to maximize the likelihood p(D|θ).
• Gradient descent: convex error functions.
• Using linear models for nonlinear functions: nonlinear transforms.
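A minimal sketch, assuming numpy and labels y in {-1, +1}, of the two fitting procedures above: the closed-form (pseudo-inverse) linear-regression solution and logistic regression trained by gradient descent on the cross-entropy (maximum-likelihood) error; names are illustrative.

    import numpy as np

    def add_bias(X):
        return np.hstack([np.ones((len(X), 1)), X])    # prepend x0 = 1

    def linear_regression(X, y):
        """Closed form: w = pinv(X) y, i.e., w = (X^T X)^{-1} X^T y."""
        return np.linalg.pinv(add_bias(X)) @ y

    def logistic_regression(X, y, eta=0.1, epochs=1000):
        """Minimize Ein(w) = (1/N) sum ln(1 + exp(-y_n w.x_n)) by gradient descent;
        this is the cross-entropy error, equivalent to maximum likelihood."""
        Xb = add_bias(X)
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):
            grad = -(Xb * (y / (1 + np.exp(y * (Xb @ w))))[:, None]).mean(axis=0)
            w = w - eta * grad
        return w

    def predict_prob(w, X):
        """P(y = +1 | x) from the logistic model, via the sigmoid 1 / (1 + e^(-s))."""
        return 1 / (1 + np.exp(-(add_bias(X) @ w)))

A nonlinear transform fits the same framework: replace x with Φ(x) (e.g., append x1², x2², x1·x2) before calling either routine; the model stays linear in the weights.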
5. Computational Learning Theory
• Is learning feasible? What is the problem, and why is it an issue?
• Ein vs. Eout.
• Two questions of learning: (1) Eout(g) ≈ Ein(g) and (2) Ein(g) ≈ 0.
• PAC learning: probably approximately correct.
• Hoeffding’s Inequality: what it is, how it applies to learning, importance.
• Terminology: dichotomies, growth function, break point.
• Break point leads to polynomial growth function mH(N).
• VC dimension: what it is, vs. break point.
• Vapnik-Chervonenkis (VC) Inequality: P[|Ein(g) − Eout(g)| > ε] ≤ 4 mH(2N) e^(−ε²N/8);
what it tells us about learning feasibility, training set size, etc.
• VC dimension vs. degrees of freedom, VC dim. for the perceptron, etc.
• Generalization bound (VC inequality in terms of ε and δ):
let δ be the maximum probability of a bad hypothesis, i.e., P[|Ein(g) − Eout(g)| > ε] ≤ δ;
then ε = sqrt( (8/N) ln(4 mH(2N) / δ) ) (see the sketch after this list).
• Trade-offs: Ein vs. Eout, model complexity vs. ability to generalize.
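A small worked sketch of the generalization bound above, assuming the polynomial bound mH(N) ≤ N^dvc + 1 and dvc = 3 (the perceptron in two dimensions); the numbers are illustrative only.

    import numpy as np

    def vc_epsilon(N, dvc, delta):
        """epsilon = sqrt((8/N) * ln(4 * mH(2N) / delta)), with mH(2N) bounded by (2N)^dvc + 1."""
        mH = (2.0 * N) ** dvc + 1.0
        return np.sqrt(8.0 / N * np.log(4.0 * mH / delta))

    # With probability at least 1 - delta, |Eout(g) - Ein(g)| <= epsilon.
    for N in (100, 1000, 10000, 100000):
        print(N, round(vc_epsilon(N, dvc=3, delta=0.05), 3))

The bound is loose for small N but shrinks roughly like sqrt(ln N / N), which is why, for any fixed VC dimension, enough data guarantees Ein ≈ Eout.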
6. Concept Learning and Version Space Search
• What concept learning is and how it relates to classification (essentially the same task).
• Concept representations: constraint vectors vs. logic expressions,
? for “don’t care” (any value is OK) (see the sketch after this list).
• Learning as search through hypothesis space.
• Hypothesis space is structured by a general-to-specific ordering, enabling efficient search.
• Version space approach to represent subsets of hypothesis space: S and G.
• Inductive bias: limited representational power of hypotheses, trade-offs.
• Problems with CL algorithms: they require consistency with the data, so noise/errors
are a problem.
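A minimal sketch of the Find-S flavor of concept learning, assuming hypotheses are constraint vectors with '?' meaning “don’t care” (any value is OK) and noise-free training data; the attributes and values below are invented for illustration.

    def find_s(examples):
        """Return the most specific constraint-vector hypothesis consistent with
        the positive examples; negative examples are ignored by Find-S."""
        h = None
        for x, label in examples:
            if label != '+':
                continue
            if h is None:
                h = list(x)                    # first positive example: most specific h
            else:
                # generalize each constraint just enough to cover x
                h = [hi if hi == xi else '?' for hi, xi in zip(h, x)]
        return h

    examples = [
        (('sunny', 'warm', 'normal', 'strong'), '+'),
        (('sunny', 'warm', 'high',   'strong'), '+'),
        (('rainy', 'cold', 'high',   'strong'), '-'),
    ]
    print(find_s(examples))   # -> ['sunny', 'warm', '?', 'strong']

The full version-space (candidate-elimination) approach additionally maintains the general boundary G by specializing it on negative examples; because these algorithms require consistency with every example, a single noisy label can eliminate the true concept.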