PMF and Gibbs annotated slides

Case Study 4: Collaborative Filtering
Review: Cold-Start Problem

Machine Learning for Big Data
CSE547/STAT548, University of Washington
Emily Fox
February 18th, 2014
©Emily Fox 2014
Cold-Start Problem
- Challenge: cold-start problem (new movie or user)
- Methods: use features of movie/user
Cold-Start Problem More Formally
- Consider a new user u' and predicting that user's ratings
  - No previous observations
  - Objective considered so far:

    \min_{L,R} \; \frac{1}{2}\sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \frac{\lambda_u}{2}\|L\|_F^2 + \frac{\lambda_v}{2}\|R\|_F^2

  - Optimal user factor:
  - Predicted user ratings:
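The slide leaves the "optimal user factor" blank to be filled in lecture; a minimal numerical sketch of the point it sets up is below. With movie factors fixed, each user's subproblem in the objective above is a ridge regression, and a brand-new user with zero observations gets the all-zero factor, so every predicted rating is 0. The function name and λ value are illustrative assumptions.

```python
import numpy as np

def optimal_user_factor(R_rated, ratings, lam):
    """Ridge solution for one user's factor L_u with movie factors fixed.

    R_rated: (num_rated, k) rows R_v for the movies this user rated
    ratings: (num_rated,) observed ratings r_uv
    lam:     regularization strength lambda_u
    """
    k = R_rated.shape[1]
    # Minimize 1/2 * sum_v (L_u . R_v - r_uv)^2 + lam/2 * ||L_u||^2
    A = R_rated.T @ R_rated + lam * np.eye(k)
    b = R_rated.T @ ratings
    return np.linalg.solve(A, b)

# A brand-new user has no ratings: the regularized optimum is the zero
# vector, so every prediction L_u . R_v is 0 -- the cold-start problem.
L_new = optimal_user_factor(np.zeros((0, 3)), np.zeros(0), lam=0.1)
```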
An Alternative Formulation
- A simpler model for collaborative filtering
  - We would not have this issue if we assumed all users were identical
  - What about for new movies? What if we had side information?
  - Fit linear model:
  - What dimension should w be?
  - Minimize:
Personalization
- If we don't have any observations about a user, use wisdom of the crowd
  - Addresses the cold-start problem
- Clearly, not all users are the same
- Just as in personalized click prediction, consider a model with global and user-specific parameters
- As we gain more information about the user, forget the crowd
User Features…
- In addition to movie features, may have information about the user:
- Combine with features of the movie:
- Unified linear model:
Feature-based Approach versus Matrix Factorization
- Feature-based approach:
  - Feature representation of users and movies is fixed
  - Can address the cold-start problem
- Matrix factorization approach:
  - User & movie features are learned from data
  - Suffers from the cold-start problem
- A unified model:
Unified Collaborative Filtering via SGD

\min_{L,R,w,\{w_u\}_u} \; \frac{1}{2}\sum_{r_{uv}} \big(L_u \cdot R_v + (w + w_u)\cdot\phi(u,v) - r_{uv}\big)^2 + \frac{\lambda_u}{2}\|L\|_F^2 + \frac{\lambda_v}{2}\|R\|_F^2 + \frac{\lambda_w}{2}\|w\|_2^2 + \frac{\lambda_{w_u}}{2}\sum_u \|w_u\|_2^2

- Gradient step observing r_uv:
  - For L, R:

    L_u^{(t+1)} \leftarrow (1 - \eta_t \lambda_u)\, L_u^{(t)} - \eta_t\, \epsilon_t\, R_v^{(t)}
    R_v^{(t+1)} \leftarrow (1 - \eta_t \lambda_v)\, R_v^{(t)} - \eta_t\, \epsilon_t\, L_u^{(t)}

  - For w and w_u:
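The slide leaves the w and w_u updates to be filled in; the sketch below implements the L, R steps as shown and fills in the feature-weight steps by analogy (an assumption). The feature map phi, the data, and the decaying step size are all illustrative; ε_t denotes the current prediction residual.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k, d = 5, 6, 3, 4

# Hypothetical setup: phi(u, v) is a d-dimensional feature vector per pair.
phi = rng.normal(size=(n_users, n_movies, d))
L = 0.1 * rng.normal(size=(n_users, k))
R = 0.1 * rng.normal(size=(n_movies, k))
w = np.zeros(d)                      # global feature weights
w_u = np.zeros((n_users, d))         # user-specific feature weights

lam_u = lam_v = lam_w = lam_wu = 0.1
observations = [(0, 1, 4.0), (2, 3, 2.0), (0, 3, 5.0)]

for t, (u, v, r) in enumerate(observations, start=1):
    eta = 0.1 / np.sqrt(t)           # decaying step size (an assumption)
    # Residual eps_t for this observation
    eps = L[u] @ R[v] + (w + w_u[u]) @ phi[u, v] - r
    Lu_old = L[u].copy()
    # Updates mirror the slide: shrink by (1 - eta*lambda), step along gradient
    L[u] = (1 - eta * lam_u) * L[u] - eta * eps * R[v]
    R[v] = (1 - eta * lam_v) * R[v] - eta * eps * Lu_old
    w = (1 - eta * lam_w) * w - eta * eps * phi[u, v]
    w_u[u] = (1 - eta * lam_wu) * w_u[u] - eta * eps * phi[u, v]
```

Note that only the factors and weights touched by the observed (u, v) pair are updated, which is what makes the method cheap per rating.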
What you need to know…
- Cold-start problem
- Feature-based methods for collaborative filtering
  - Help address cold-start problem
- Unified approach
Case Study 4: Collaborative Filtering
Connections with Probabilistic Matrix Factorization
Matrix Completion Problem
- X_ij known for black cells; X_ij unknown for white cells
- Rows index users; columns index movies
- X =
- Filling in missing data?
Coordinate Descent for Matrix Factorization: Alternating Least-Squares

\min_{L,R} \sum_{(u,v):\, r_{uv} \neq ?} (L_u \cdot R_v - r_{uv})^2

- Fix movie factors, optimize for user factors
  - Independent least-squares over users:

    \min_{L_u} \sum_{v \in V_u} (L_u \cdot R_v - r_{uv})^2

- Fix user factors, optimize for movie factors
  - Independent least-squares over movies:

    \min_{R_v} \sum_{u \in U_v} (L_u \cdot R_v - r_{uv})^2

- System may be underdetermined:
- Converges to
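The alternating scheme above can be sketched in a few lines. This is a minimal sketch, not the lecture's reference implementation: a small ridge term lam is an added assumption that keeps each per-user/per-movie solve well-determined (addressing the "system may be underdetermined" point) even when a user or movie has fewer than k ratings.

```python
import numpy as np

def als(ratings, n_users, n_movies, k=2, lam=0.1, n_iters=20, seed=0):
    """Alternating least-squares for matrix completion.

    ratings: list of (u, v, r_uv) observed entries.
    """
    rng = np.random.default_rng(seed)
    L = rng.normal(scale=0.1, size=(n_users, k))
    R = rng.normal(scale=0.1, size=(n_movies, k))
    by_user = {u: [(v, r) for uu, v, r in ratings if uu == u] for u in range(n_users)}
    by_movie = {v: [(u, r) for u, vv, r in ratings if vv == v] for v in range(n_movies)}
    for _ in range(n_iters):
        for u, obs in by_user.items():        # fix R, solve for each L_u
            if obs:
                Rv = np.array([R[v] for v, _ in obs])
                r = np.array([r for _, r in obs])
                L[u] = np.linalg.solve(Rv.T @ Rv + lam * np.eye(k), Rv.T @ r)
        for v, obs in by_movie.items():       # fix L, solve for each R_v
            if obs:
                Lu = np.array([L[u] for u, _ in obs])
                r = np.array([r for _, r in obs])
                R[v] = np.linalg.solve(Lu.T @ Lu + lam * np.eye(k), Lu.T @ r)
    return L, R

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
L, R = als(ratings, n_users=3, n_movies=3)
```

Each inner solve is an independent least-squares problem, which is why the updates parallelize trivially over users (and then over movies).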
Probabilistic Matrix Factorization (PMF)
- A generative process:
  - Pick user factors
  - Pick movie factors
  - For each (user, movie) pair observed:
    - Pick rating as L_u · R_v + noise
- Joint probability:
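The generative process above can be simulated directly. A minimal sketch, assuming Gaussian priors on the factors and Gaussian rating noise; the σ values and the set of observed pairs are illustrative, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_movies, k = 4, 5, 2
sigma_u, sigma_v, sigma_r = 1.0, 1.0, 0.5  # illustrative assumptions

L = rng.normal(0, sigma_u, size=(n_users, k))   # pick user factors
R = rng.normal(0, sigma_v, size=(n_movies, k))  # pick movie factors

observed_pairs = [(0, 1), (2, 4), (3, 0)]
ratings = {}
for u, v in observed_pairs:
    # rating = L_u . R_v + Gaussian noise
    ratings[(u, v)] = L[u] @ R[v] + rng.normal(0, sigma_r)
```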
PMF Graphical Model

P(L, R \mid X) \propto P(L)\, P(R)\, P(X \mid L, R)

- Graphically:
Maximum A Posteriori for Matrix Completion

P(L, R \mid X) \propto P(L, R, X) = p(L)\, p(R)\, p(X \mid L, R)

\propto e^{-\frac{1}{2\sigma_u^2} \sum_{u=1}^{n} \sum_{i=1}^{k} L_{ui}^2} \; e^{-\frac{1}{2\sigma_v^2} \sum_{v=1}^{m} \sum_{i=1}^{k} R_{vi}^2} \; e^{-\frac{1}{2\sigma_r^2} \sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2}
MAP versus Regularized Least-Squares for Matrix Completion
- MAP under Gaussian model:

  \max_{L,R} \log P(L, R \mid X) = -\frac{1}{2\sigma_u^2} \sum_u \sum_i L_{ui}^2 - \frac{1}{2\sigma_v^2} \sum_v \sum_i R_{vi}^2 - \frac{1}{2\sigma_r^2} \sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \text{const}

- Least-squares matrix completion with L2 regularization:

  \min_{L,R} \; \frac{1}{2}\sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \frac{\lambda_u}{2}\|L\|_F^2 + \frac{\lambda_v}{2}\|R\|_F^2

- Understanding this as a probabilistic model is very useful! E.g.,
  - Change priors
  - Incorporate other sources of information or dependencies
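Matching the two objectives term by term makes the correspondence explicit: negating the log-posterior and multiplying through by σ_r² (which does not change the optimizer) identifies the regularization weights as noise-to-prior variance ratios.

```latex
\max_{L,R} \log P(L,R \mid X)
  \;\Longleftrightarrow\;
\min_{L,R} \; \frac{1}{2}\sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2
  + \frac{\lambda_u}{2}\|L\|_F^2 + \frac{\lambda_v}{2}\|R\|_F^2,
\qquad
\lambda_u = \frac{\sigma_r^2}{\sigma_u^2},
\quad
\lambda_v = \frac{\sigma_r^2}{\sigma_v^2}
```

So a tighter prior (smaller σ_u) corresponds to stronger regularization, which is one concrete payoff of the probabilistic reading.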
What you need to know…
- Probabilistic model for collaborative filtering
  - Models, choice of priors
  - MAP equivalent to optimization for matrix completion
Case Study 4: Collaborative Filtering
Gibbs Sampling for Bayesian Inference
Posterior Computations
- MAP estimation focuses on point estimation:

  \hat{\theta}_{MAP} = \arg\max_\theta \; p(\theta \mid x)

- What if we want a full characterization of the posterior?
  - Maintain a measure of uncertainty
  - Estimators other than the posterior mode (different loss functions)
  - Predictive distributions for future observations
- Often no closed-form characterization (e.g., mixture models, PMF, etc.)
Bayesian PMF Example
- Latent user and movie factors: L_u (u = 1, …, n), R_v (v = 1, …, m)
- Observations: r_uv
- Hyperparameters:
- Want to predict new movie rating:
Bayesian PMF Example

p(r^*_{uv} \mid X, \Theta) = \int p(r^*_{uv} \mid L_u, R_v)\, p(L, R \mid X, \Theta)\, dL\, dR

[Graphical model: L_u (u = 1, …, n) and R_v (v = 1, …, m), with observed ratings r_uv]

- Monte Carlo methods:
- Ideally:
Bayesian PMF Example
- Want posterior samples (L^{(k)}, R^{(k)}) \sim p(L, R \mid X, \Theta)
- What can we sample from?
  - Hint: same reasoning as behind ALS, but sampling rather than maximization
Bayesian PMF Example
- For user u:

  p(L_u \mid X, R, \Theta_u) \propto p(L_u \mid \Theta_u) \prod_{v \in V_u} p(r_{uv} \mid L_u, R_v, \sigma_r)

- Symmetrically for R_v conditioned on L (breaks down over movies)
- Luckily, we can use this to get our desired posterior samples
Gibbs Sampling
- Want draws:
- Construct a Markov chain whose steady-state distribution is
- Then, asymptotically correct
- Simplest case:
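A concrete "simplest case" (an illustration chosen here, not taken from the slides): sampling a bivariate Gaussian with correlation ρ by alternating draws from its two exact full conditionals. The empirical correlation of the draws recovers ρ, showing the chain's steady state is the target distribution.

```python
import numpy as np

# Gibbs sampling a bivariate standard Gaussian with correlation rho:
#   x | y ~ N(rho * y, 1 - rho^2),   y | x ~ N(rho * x, 1 - rho^2)
rng = np.random.default_rng(0)
rho = 0.8
x, y = 0.0, 0.0
samples = []
for t in range(20000):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    if t >= 1000:                 # discard burn-in draws
        samples.append((x, y))
samples = np.array(samples)
emp_rho = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
```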
Bayesian PMF Gibbs Sampler
- Outline of Bayesian PMF sampler
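The outline above can be sketched in code. This is a simplified sketch, not the lecture's full sampler: the Gaussian prior is conjugate to the Gaussian likelihood, so each L_u (and symmetrically each R_v) has a Gaussian full conditional that can be drawn exactly. The full Bayesian PMF of Salakhutdinov and Mnih also samples the hyperparameters; here they are held fixed for brevity (an assumption), and all numeric values are illustrative.

```python
import numpy as np

def gibbs_bpmf(ratings, n_users, n_movies, k=2, sigma_r=0.5,
               sigma_u=1.0, sigma_v=1.0, n_samples=200, seed=0):
    """Gibbs sampler for PMF with fixed Gaussian priors (sketch)."""
    rng = np.random.default_rng(seed)
    L = rng.normal(0, sigma_u, (n_users, k))
    R = rng.normal(0, sigma_v, (n_movies, k))
    draws = []
    for _ in range(n_samples):
        for u in range(n_users):            # sample L_u | R, data
            prec = np.eye(k) / sigma_u**2   # prior precision
            mean_term = np.zeros(k)
            for uu, v, r in ratings:
                if uu == u:
                    prec += np.outer(R[v], R[v]) / sigma_r**2
                    mean_term += r * R[v] / sigma_r**2
            cov = np.linalg.inv(prec)
            L[u] = rng.multivariate_normal(cov @ mean_term, cov)
        for v in range(n_movies):           # sample R_v | L, data
            prec = np.eye(k) / sigma_v**2
            mean_term = np.zeros(k)
            for u, vv, r in ratings:
                if vv == v:
                    prec += np.outer(L[u], L[u]) / sigma_r**2
                    mean_term += r * L[u] / sigma_r**2
            cov = np.linalg.inv(prec)
            R[v] = rng.multivariate_normal(cov @ mean_term, cov)
        draws.append((L.copy(), R.copy()))
    return draws

ratings = [(0, 0, 4.0), (0, 1, 1.0), (1, 0, 3.5)]
draws = gibbs_bpmf(ratings, n_users=2, n_movies=2)
# Monte Carlo approximation of the predictive mean for pair (0, 0),
# averaging over post-burn-in samples
pred = np.mean([L[0] @ R[0] for L, R in draws[100:]])
```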
Bayesian PMF Results
- Netflix data with:
  - Training set = 100,480,507 ratings from 480,189 users on 17,770 movie titles
  - Validation set = 1,408,395 ratings
  - Test set = 2,817,131 user/movie pairs with the ratings withheld
- From Salakhutdinov and Mnih, "Bayesian Probabilistic Matrix Factorization using MCMC," ICML 2008

[Figure 2 plots omitted; per-model training times annotated on the panels are not reproduced.]

Figure 2. Left panel: Performance of SVD, PMF, logistic PMF, and Bayesian PMF using 30D feature vectors, on the Netflix validation data. The y-axis displays RMSE (root mean squared error), and the x-axis shows the number of epochs, or passes, through the entire training set. Right panel: RMSE for the Bayesian PMF models on the validation set as a function of the number of samples generated. The two curves are for the models with 30D and 60D feature vectors.
4.2. Training PMF models
For comparison, we have trained a variety of linear PMF models using MAP, choosing their regularization parameters using the validation set. In addition to linear PMF models, we also trained logistic PMF models, in which we pass the dot product between user- and movie-specific feature vectors through the logistic function σ(x) = 1/(1 + exp(−x)) to bound the range of predictions:

p(R \mid U, V, \alpha) = \prod_{i=1}^{N} \prod_{j=1}^{M} \big[\mathcal{N}(R_{ij} \mid \sigma(U_i^T V_j), \alpha^{-1})\big]^{I_{ij}} \quad (15)

The precision α was fixed at 2. The predictive distribution was computed using Eq. 10 by running the Gibbs sampler with samples {U_i^{(k)}, V_j^{(k)}} collected after each full Gibbs step.

4.4. Results
In our first experiment, we compared a Bayesian PMF model to an SVD model, a linear PMF model, and a logistic PMF model, all using 30D feature vectors. The SVD model was trained to minimize the sum-squared distance to the observed entries of the target matrix, with no regularization applied to the feature vectors.
Bayesian PMF Results
- From Salakhutdinov and Mnih, ICML 2008
- Bayesian model better controls for overfitting by averaging over possible parameters (instead of committing to one)

[Figure 3 panels omitted: posterior-sample scatter plots and frequency-count histograms for two users and two movies (Movie X, 5 ratings; Movie Y, 142 ratings).]

Figure 3. Samples from the posterior over the user and movie feature vectors generated by each step of the Gibbs sampler. The two dimensions with the highest variance are shown for two users and two movies. The first 800 samples were discarded as "burn-in".

Table 1. Performance of Bayesian PMF (BPMF) and linear PMF on Netflix validation and test sets.

D     Valid. RMSE        %      Test RMSE          %
      PMF      BPMF      Inc.   PMF      BPMF      Inc.
30    0.9154   0.8994    1.74   0.9188   0.9029    1.73
40    0.9135   0.8968    1.83   0.9170   0.9002    1.83
60    0.9150   0.8954    2.14   0.9185   0.8989    2.13
150   0.9178   0.8931    2.69   0.9211   0.8965    2.67
300   0.9231   0.8920    3.37   0.9265   0.8954    3.36

We then trained larger PMF models with D = 40 and D = 60. Capacity control for such models becomes a rather challenging task. For example, a PMF model with D = 60 has approximately 30 million parameters. Searching for appropriate values of the regularization coefficients becomes a very computationally expensive task. Table 1 further shows that for the 60-dimensional feature vectors, Bayesian PMF outperforms its MAP counterpart by over 2%. We should also point out that even the simplest possible Bayesian extension of the PMF model, where Gamma priors are placed over the precision hyperparameters αU and αV (see Fig. 1, left panel), significantly outperforms the MAP-trained PMF models, even though it does not perform as well as the Bayesian PMF models.

It is interesting to observe that as the feature dimensionality grows, the performance accuracy for the MAP-trained PMF models does not improve, and controlling overfitting becomes a critical issue. The predictive accuracy of the Bayesian PMF models, however, steadily improves as the model complexity grows. Inspired by this result, we experimented with Bayesian PMF models with D = 150 and D = 300 feature vectors. Note that these models have about 75 and 150 million parameters, and running the Gibbs sampler becomes computationally much more expensive. Nonetheless, the validation set RMSEs for these models were 0.8931 and 0.8920. Table 1 shows that these models not only significantly outperform their MAP counterparts but also outperform Bayesian PMF models that have fewer parameters. These results clearly show that the Bayesian approach does not require limiting the complexity of the model based on the number of the training samples. In practice, however, we will be limited by the available computer resources. For completeness, we also report the performance results on the Netflix test set. These numbers were …

What you need to know…
- Idea of full posterior inference vs. MAP estimation
- Gibbs sampling as an MCMC approach
- Example of inference in Bayesian probabilistic matrix factorization model