Case Study 4: Collaborative Filtering — Cold-Start Problem
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Emily Fox, February 18th, 2014
©Emily Fox 2014

Cold-Start Problem
- Challenge: the cold-start problem (a new movie or a new user has no ratings)
- Methods: use features of the movie/user

Cold-Start Problem More Formally
- Consider a new user u' and predicting that user's ratings
  - No previous observations
  - Objective considered so far:
      min_{L,R} (1/2) Σ_{r_uv} (L_u · R_v − r_uv)² + (λ_u/2) ||L||_F² + (λ_v/2) ||R||_F²
  - Optimal user factor: with no observed ratings for u', only the regularizer acts on L_u', so the minimizer is L_u' = 0
  - Predicted user ratings: L_u' · R_v = 0 for every movie v — the model has nothing to say about a new user

An Alternative Formulation
- A simpler model for collaborative filtering
  - We would not have this issue if we assumed all users were identical
  - What about new movies? What if we had side information, e.g. a feature vector φ(v) for each movie?
  - Fit a linear model: r̂_uv = w · φ(v)
  - What dimension should w be? The same as φ(v): one weight per movie feature, shared across all users
  - Minimize: (1/2) Σ_{r_uv} (w · φ(v) − r_uv)² + (λ/2) ||w||₂²

Personalization
- If we don't have any observations about a user, use the wisdom of the crowd
  - Addresses the cold-start problem
- Clearly, not all users are the same
- Just as in personalized click prediction, consider a model with global and user-specific parameters: r̂_uv = (w + w_u) · φ(v)
- As we gain more information about the user, forget the crowd

User Features…
- In addition to movie features, we may have information about the user (e.g., demographics)
- Combine with the features of the movie into a joint feature vector φ(u, v)
- Unified linear model: r̂_uv = (w + w_u) · φ(u, v)

Feature-Based Approach versus Matrix Factorization
- Feature-based approach:
  - Feature representations of users and movies are fixed
  - Can address the cold-start problem
- Matrix factorization approach:
  - User & movie features are learned from data
  - Suffers from the cold-start problem
- A unified model: combine the learned factors with the fixed features

Unified Collaborative Filtering via SGD
- Objective:
      min_{L,R,w,{w_u}} (1/2) Σ_{r_uv} (L_u · R_v + (w + w_u) · φ(u, v) − r_uv)²
          + (λ_u/2) ||L||_F² + (λ_v/2) ||R||_F² + (λ_w/2) ||w||₂² + Σ_u (λ_{w_u}/2) ||w_u||₂²
- Gradient step on observing r_uv, with residual ε_t = L_u^(t) · R_v^(t) + (w^(t) + w_u^(t)) · φ(u, v) − r_uv:
  - For L, R:
      L_u^(t+1) ← (1 − η_t λ_u) L_u^(t) − η_t ε_t R_v^(t)
      R_v^(t+1) ← (1 − η_t λ_v) R_v^(t) − η_t ε_t L_u^(t)
  - For w and w_u:
      w^(t+1) ← (1 − η_t λ_w) w^(t) − η_t ε_t φ(u, v)
      w_u^(t+1) ← (1 − η_t λ_{w_u}) w_u^(t) − η_t ε_t φ(u, v)
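These gradient steps can be sketched in code. The following is a minimal sketch on synthetic data, assuming a hypothetical feature map `phi[u, v]` and, for simplicity, a single shared regularization strength `lam` for all parameter blocks (the slides allow separate λ's); the step size and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k, d = 30, 40, 5, 3
L = 0.1 * rng.normal(size=(n_users, k))        # user factors
R = 0.1 * rng.normal(size=(n_movies, k))       # movie factors
w = np.zeros(d)                                # global feature weights
W_u = np.zeros((n_users, d))                   # per-user feature weights
phi = rng.normal(size=(n_users, n_movies, d))  # hypothetical feature map phi(u, v)

# synthetic observed ratings (u, v, r_uv) in 1..5
ratings = [(int(u), int(v), float(rng.integers(1, 6)))
           for u, v in zip(rng.integers(0, n_users, 200),
                           rng.integers(0, n_movies, 200))]

eta, lam = 0.05, 0.1  # step size and shared regularization strength (illustrative)

def sse(ratings):
    """Sum of squared prediction errors over the observed ratings."""
    return sum((L[u] @ R[v] + (w + W_u[u]) @ phi[u, v] - r) ** 2
               for u, v, r in ratings)

before = sse(ratings)
for _ in range(40):                  # epochs over the observed ratings
    for u, v, r in ratings:
        x = phi[u, v]
        eps = L[u] @ R[v] + (w + W_u[u]) @ x - r   # residual epsilon_t
        Lu_old = L[u].copy()                        # use the pre-update L_u for R_v's step
        L[u] = (1 - eta * lam) * L[u] - eta * eps * R[v]
        R[v] = (1 - eta * lam) * R[v] - eta * eps * Lu_old
        w[:] = (1 - eta * lam) * w - eta * eps * x
        W_u[u] = (1 - eta * lam) * W_u[u] - eta * eps * x

assert sse(ratings) < before  # training error decreases
```

Note the `Lu_old` copy: both factor updates at step t are written in terms of the iterates at step t, so R_v's update must not see the already-updated L_u.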
What you need to know…
- The cold-start problem
- Feature-based methods for collaborative filtering
  - Help address the cold-start problem
- The unified approach

Case Study 4: Collaborative Filtering — Connections with Probabilistic Matrix Factorization

Matrix Completion Problem
- X_ij known for black cells, unknown for white cells (rows index users, columns index movies)
- How do we fill in the missing data?

Coordinate Descent for Matrix Factorization: Alternating Least-Squares
- Objective:
      min_{L,R} Σ_{(u,v): r_uv ≠ ?} (L_u · R_v − r_uv)²
- Fix movie factors, optimize for user factors
  - Independent least-squares over users:
      min_{L_u} Σ_{v ∈ V_u} (L_u · R_v − r_uv)²
- Fix user factors, optimize for movie factors
  - Independent least-squares over movies:
      min_{R_v} Σ_{u ∈ U_v} (L_u · R_v − r_uv)²
- The system may be underdetermined (e.g., a user with fewer than k ratings does not pin down L_u)
- Converges to a local optimum (not necessarily the global one)

Probabilistic Matrix Factorization (PMF)
- A generative process:
  - Pick user factors L_u
  - Pick movie factors R_v
  - For each (user, movie) pair observed:
    - Pick the rating as L_u · R_v + noise
- Joint probability:
      p(L, R, X) = p(L) p(R) ∏_{r_uv} p(r_uv | L_u, R_v)

PMF Graphical Model
- P(L, R | X) ∝ P(L) P(R) P(X | L, R)

Maximum A Posteriori for Matrix Completion
      P(L, R | X) ∝ P(L, R, X) = p(L) p(R) p(X | L, R)
                 ∝ exp(−(1/(2σ_u²)) Σ_{u=1}^n Σ_{i=1}^k L_ui²)
                   · exp(−(1/(2σ_v²)) Σ_{v=1}^m Σ_{i=1}^k R_vi²)
                   · exp(−(1/(2σ_r²)) Σ_{r_uv} (L_u · R_v − r_uv)²)

MAP versus Regularized Least-Squares for Matrix Completion
- MAP under the Gaussian model:
      max_{L,R} log P(L, R | X) = −(1/(2σ_u²)) Σ_u Σ_i L_ui² − (1/(2σ_v²)) Σ_v Σ_i R_vi²
          − (1/(2σ_r²)) Σ_{r_uv} (L_u · R_v − r_uv)² + const
- Least-squares matrix completion with L2 regularization:
      min_{L,R} (1/2) Σ_{r_uv} (L_u · R_v − r_uv)² + (λ_u/2) ||L||_F² + (λ_v/2) ||R||_F²
- The two coincide with λ_u = σ_r²/σ_u² and λ_v = σ_r²/σ_v²
- Understanding this as a probabilistic model is very useful!
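The MAP / regularized least-squares equivalence is easy to verify numerically. A small sketch, with assumed illustrative values for the prior and noise standard deviations σ_u, σ_v, σ_r, and the regularizers set to λ_u = σ_r²/σ_u², λ_v = σ_r²/σ_v²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 4, 5, 2
L = rng.normal(size=(n, k))
R = rng.normal(size=(m, k))
ratings = [(0, 1, 3.0), (2, 4, 1.5), (3, 0, 4.0)]  # observed (u, v, r_uv), made up
s_u, s_v, s_r = 1.0, 2.0, 0.5                      # assumed std devs of priors / noise

def neg_log_posterior(L, R):
    """-log P(L, R | X) up to an additive constant, under the Gaussian model."""
    sq = sum((L[u] @ R[v] - r) ** 2 for u, v, r in ratings)
    return (sq / (2 * s_r**2)
            + (L**2).sum() / (2 * s_u**2)
            + (R**2).sum() / (2 * s_v**2))

def reg_least_squares(L, R, lam_u, lam_v):
    """L2-regularized least-squares matrix-completion objective."""
    sq = sum((L[u] @ R[v] - r) ** 2 for u, v, r in ratings)
    return 0.5 * sq + 0.5 * lam_u * (L**2).sum() + 0.5 * lam_v * (R**2).sum()

# With lam_u = s_r^2/s_u^2 and lam_v = s_r^2/s_v^2, the regularized objective
# equals s_r^2 times the negative log posterior, so the minimizers coincide.
a = neg_log_posterior(L, R)
b = reg_least_squares(L, R, s_r**2 / s_u**2, s_r**2 / s_v**2) / s_r**2
assert abs(a - b) < 1e-9
```

The overall 1/σ_r² scaling does not change the minimizer, which is why MAP estimation and regularized least-squares give the same (L, R).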
- For example:
  - Change the priors
  - Incorporate other sources of information or dependencies

What you need to know…
- Probabilistic model for collaborative filtering
  - Models, choice of priors
  - MAP is equivalent to the optimization view of matrix completion

Case Study 4: Collaborative Filtering — Gibbs Sampling for Bayesian Inference

Posterior Computations
- MAP estimation focuses on point estimation:
      θ̂_MAP = arg max_θ p(θ | x)
- What if we want a full characterization of the posterior?
  - Maintain a measure of uncertainty
  - Estimators other than the posterior mode (different loss functions)
  - Predictive distributions for future observations
- Often there is no closed-form characterization (e.g., mixture models, PMF, etc.)

Bayesian PMF Example
- Latent user and movie factors: L_u (u = 1, …, n) and R_v (v = 1, …, m)
- Observations: ratings r_uv
- Hyperparameters: Θ (the prior and noise parameters)
- Want to predict a new movie rating r*_uv

Bayesian PMF Example
      p(r*_uv | X, Θ) = ∫ p(r*_uv | L_u, R_v) p(L, R | X, Θ) dL dR
- Monte Carlo methods:
      p(r*_uv | X, Θ) ≈ (1/K) Σ_{k=1}^K p(r*_uv | L_u^(k), R_v^(k))
- Ideally: draw the samples (L^(k), R^(k)) directly from the posterior p(L, R | X, Θ)

[Graphical model: plates over users u = 1, …, n and movies v = 1, …, m, with factors L_u and R_v jointly generating each observed rating r_uv]

Bayesian PMF Example
- Want posterior samples (L^(k), R^(k)) ~ p(L, R | X, Θ)
- What can we sample from?
  - Hint: the same reasoning as behind ALS, but sampling rather than maximizing

Bayesian PMF Example
- For user u:
      p(L_u | X, R, Θ_u) ∝ p(L_u | Θ_u) ∏_{v ∈ V_u} p(r_uv | L_u, R_v, σ_r)
- Symmetrically for R_v conditioned on L (breaks down over movies)
- Luckily, we can use these conditionals to get our desired posterior samples

Gibbs Sampling
- Want draws from a target distribution p(θ | x)
- Construct a Markov chain whose steady-state distribution is p(θ | x)
- Then the draws are asymptotically correct
- Simplest case: cycle through the components θ_i, resampling each from its full conditional p(θ_i | θ_{−i}, x)

Bayesian PMF Gibbs Sampler
- Outline: alternate between resampling each L_u from its full conditional given R and the data, and each R_v from its full conditional given L and the data

Bayesian PMF Results
- From Salakhutdinov and Mnih, ICML 2008
- Netflix data:
  - Training set: 100,480,507 ratings from 480,189 users on 17,770 movie titles
  - Validation set: 1,408,395 ratings
  - Test set: 2,817,131 user/movie pairs with the ratings withheld

[Figure 2. Left panel: Performance of SVD, PMF, logistic PMF, and Bayesian PMF using 30D feature vectors, on the Netflix validation data. The y-axis displays RMSE (root mean squared error), and the x-axis shows the number of epochs, or passes, through the entire training set. Right panel: RMSE for the Bayesian PMF models on the validation set as a function of the number of samples generated. The two curves are for the models with 30D and 60D feature vectors.]

Excerpt from the paper (Section 4.2, Training PMF models): "For comparison, we have trained a variety of linear PMF models using MAP, choosing their regularization parameters using the validation set."
"In addition to linear PMF models, we also trained logistic PMF models, in which we pass the dot product between user- and movie-specific feature vectors through the logistic function σ(x) = 1/(1 + exp(−x)) to bound the range of predictions:

      p(R | U, V, α) = ∏_{i=1}^N ∏_{j=1}^M [ N(R_ij | σ(U_iᵀ V_j), α⁻¹) ]^{I_ij}.   (15)

[…] precision α was fixed at 2. The predictive distribution was computed using Eq. 10 by running the Gibbs sampler, with samples {U_i^(k), V_j^(k)} collected after each full Gibbs step."

(Section 4.4, Results): "In our first experiment, we compared a Bayesian PMF model to an SVD model, a linear PMF model, and a logistic PMF model, all using 30D feature vectors. The SVD model was trained to minimize the sum-squared distance to the observed entries of the target matrix, with no regularization applied to the feature vectors."

[Figure 3 (partial): histograms of posterior samples (frequency counts over factor dimensions) for Movie X (5 ratings) and Movie Y (142 ratings)]

Table 1. Performance of Bayesian PMF (BPMF) and linear PMF on the Netflix validation and test sets:

      D    | Valid. RMSE: PMF | BPMF   | % Inc. | Test RMSE: PMF | BPMF   | % Inc.
      30   | 0.9154           | 0.8994 | 1.74   | 0.9188         | 0.9029 | 1.73
      40   | 0.9135           | 0.8968 | 1.83   | 0.9170         | 0.9002 | 1.83
      60   | 0.9150           | 0.8954 | 2.14   | 0.9185         | 0.8989 | 2.13
      150  | 0.9178           | 0.8931 | 2.69   | 0.9211         | 0.8965 | 2.67
      300  | 0.9231           | 0.8920 | 3.37   | 0.9265         | 0.8954 | 3.36

"We then trained larger PMF models with D = 40 and D = 60. Capacity control for such models becomes a rather challenging task. For example, a PMF model with D = 60 has approximately 30 million parameters. Searching for appropriate values of the regularization coefficients becomes a very computationally expensive task. Table 1 further shows that for the 60-dimensional feature vectors, Bayesian PMF outperforms its MAP counterpart by over 2%."
"We should also point out that even the simplest possible Bayesian extension of the PMF model, where Gamma priors are placed over the precision hyperparameters α_U and α_V (see Fig. 1, left panel), significantly outperforms the MAP-trained PMF models, even though it does not perform as well as the Bayesian PMF models. It is interesting to observe that as the feature dimensionality grows, the performance accuracy for the MAP-trained PMF models does not improve, and controlling overfitting becomes a critical issue. The predictive accuracy of the Bayesian PMF models, however, steadily improves as the model complexity grows. Inspired by this result, we experimented with Bayesian PMF models with D = 150 and D = 300 feature vectors. Note that these models have about 75 and 150 million parameters, and running the Gibbs sampler becomes computationally much more expensive. Nonetheless, the validation set RMSEs for these models were 0.8931 and 0.8920. Table 1 shows that these models not only significantly outperform their MAP counterparts but also outperform Bayesian PMF models that have fewer parameters."

[Figure 3. Samples from the posterior over the user and movie feature vectors generated by each step of the Gibbs sampler. The two dimensions with the highest variance are shown for two users and two movies. The first 800 samples were discarded as "burn-in".]

- Takeaway: the Bayesian model better controls for overfitting by averaging over possible parameters (instead of committing to one)
"These results clearly show that the Bayesian approach does not require limiting the complexity of the model based on the number of the training samples. In practice, however, we will be limited by the available computer resources. For completeness, we also report the performance results on the Netflix test set. These numbers were …"

What you need to know…
- The idea of full posterior inference vs. MAP estimation
- Gibbs sampling as an MCMC approach
- An example of inference in the Bayesian probabilistic matrix factorization model
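To make the Gibbs sampler outline concrete, here is a simplified sketch on synthetic data. Unlike the Bayesian PMF model of Salakhutdinov and Mnih, the precision hyperparameters are held fixed (assumed constants `lam_L`, `lam_R` for the spherical Gaussian factor priors and `alpha` for the rating noise) rather than given hyperpriors and resampled:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 15, 12, 3
lam_L, lam_R, alpha = 2.0, 2.0, 4.0  # assumed fixed precisions (not sampled here)

# synthetic observed ratings (u, v, r_uv), ~30% of the matrix
ratings = [(u, v, float(rng.integers(1, 6)))
           for u in range(n) for v in range(m) if rng.random() < 0.3]
by_user = {u: [(v, r) for uu, v, r in ratings if uu == u] for u in range(n)}
by_movie = {v: [(u, r) for u, vv, r in ratings if vv == v] for v in range(m)}

def sample_factor(others, obs, lam):
    """Draw one factor row from its Gaussian full conditional."""
    prec = lam * np.eye(k) + alpha * sum(np.outer(others[j], others[j]) for j, _ in obs)
    cov = np.linalg.inv(prec)
    mean = cov @ (alpha * sum(r * others[j] for j, r in obs))
    return rng.multivariate_normal(mean, cov)

L = rng.normal(size=(n, k))
R = rng.normal(size=(m, k))
samples = []
for it in range(200):
    # resample each L_u given R, then each R_v given L; rows with no
    # observations keep their current draw (their conditional is just the prior)
    for u in range(n):
        if by_user[u]:
            L[u] = sample_factor(R, by_user[u], lam_L)
    for v in range(m):
        if by_movie[v]:
            R[v] = sample_factor(L, by_movie[v], lam_R)
    if it >= 100:                   # discard burn-in
        samples.append(L @ R.T)     # one posterior sample of the rating matrix

pred = np.mean(samples, axis=0)     # Monte Carlo posterior-mean prediction
```

Averaging the post-burn-in samples approximates the posterior mean of L_u · R_v, which is exactly the "averaging over possible parameters" that the results above credit for the improved control of overfitting.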