Prediction - Department of Statistics

Author(s): Kerby Shedden, Ph.D., 2010
License: Unless otherwise noted, this material is made available under the
terms of the Creative Commons Attribution Share Alike 3.0 License:
http://creativecommons.org/licenses/by-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your
ability to use, share, and adapt it. The citation key on the following slide provides information about how you
may share and adapt this material.
Copyright holders of content included in this material should contact [email protected] with any
questions, corrections, or clarification regarding the use of content.
For more information about how to cite these materials visit http://open.umich.edu/privacy-and-terms-use.
Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis
or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please
speak to your physician if you have questions about your medical condition.
Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.
Prediction
Kerby Shedden
Department of Statistics, University of Michigan
November 3, 2014
Prediction analysis
In a prediction analysis, we are interested in fitting a model $f_{\hat\theta}$ to
capture the mean relationship between independent variables $X$ and a
dependent variable $Y$, and then using $f_{\hat\theta}$ to make predictions on an
independent data set.

It is helpful to think in terms of training data $(Y, X)$ that are used
to fit the model, so $\hat\theta = \hat\theta(Y, X)$, and testing data $(Y^*, X^*)$ on
which predictions are made, or on which the model is evaluated.
Quantifying prediction error
Prediction analysis focuses on prediction errors, for example
through the mean squared prediction error (MSPE):

$$E\|Y^* - f_{\hat\theta}(X^*)\|^2 / n^*,$$

where $n^*$ is the size of the testing set.

Prediction analysis does not usually focus on properties of the
parameter estimates themselves, e.g. the bias $E[\hat\theta] - \theta$, or the
parameter MSE $E[(\hat\theta - \theta)^2]$.
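A minimal numpy sketch of this setup, assuming a simulated Gaussian design (the sample sizes, coefficients, and noise level are arbitrary illustrative choices): fit OLS on the training data and estimate the MSPE on an independent testing set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_star, p, sigma = 200, 100, 3, 1.0
beta = np.array([1.0, 2.0, -1.0, 0.5])          # intercept plus p slopes

def design(m):
    # design matrix with an intercept column
    return np.column_stack([np.ones(m), rng.normal(size=(m, p))])

X = design(n)
Y = X @ beta + sigma * rng.normal(size=n)                  # training data
X_star = design(n_star)
Y_star = X_star @ beta + sigma * rng.normal(size=n_star)   # testing data

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)       # fit on the training data
mspe_hat = np.mean((Y_star - X_star @ beta_hat) ** 2)  # ||Y* - X* beta_hat||^2 / n*
print(mspe_hat)
```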
MSPE for OLS analysis
The mean squared prediction error for OLS regression is easy to
derive. The testing data follow $Y^* = X^*\beta + \epsilon^*$. Let $\hat Y^* = X^*\hat\beta$
denote the predicted values in the test set. Then

$$
\begin{aligned}
E\|Y^* - \hat Y^*\|^2 &= E\|X^*\beta + \epsilon^* - X^*\hat\beta\|^2 \\
&= E\|X^*(\beta - \hat\beta)\|^2 + E\|\epsilon^*\|^2 \\
&= E(\beta - \hat\beta)'(X^{*\prime}X^*)(\beta - \hat\beta) + n^*\sigma^2 \\
&= \mathrm{tr}\big(X^{*\prime}X^* \cdot E(\beta - \hat\beta)(\beta - \hat\beta)'\big) + n^*\sigma^2 \\
&= \mathrm{tr}\big(X^{*\prime}X^* \cdot \Sigma_{\hat\beta}\big) + n^*\sigma^2,
\end{aligned}
$$

where $\Sigma_{\hat\beta}$ is the covariance matrix of $\hat\beta$. Note the requirement for
$\hat Y^*$ and $Y^*$ to be independent (given $X$ and $X^*$).
MSPE for OLS analysis
The MSPE for OLS is

$$\mathrm{tr}\big((X^{*\prime}X^*/n^*) \cdot \Sigma_{\hat\beta}\big) + \sigma^2.$$

If $X$ is the training set design matrix, then $\Sigma_{\hat\beta} = \sigma^2(X'X)^{-1}$, so if
$X = X^*$, then

$$E\|Y^* - \hat Y^*\|^2 = \sigma^2(p + 1 + n^*),$$

and the MSPE in this case is

$$\sigma^2(p+1)/n^* + \sigma^2 = \sigma^2(p+1)/n + \sigma^2.$$
MSPE for OLS analysis
More generally, suppose $X'X/n = X^{*\prime}X^*/n^*$. Then

$$\Sigma_{\hat\beta} = \sigma^2(X'X)^{-1} = \sigma^2 n^*(X^{*\prime}X^*)^{-1}/n.$$

Thus the MSPE is

$$\mathrm{tr}\big((X^{*\prime}X^*/n^*) \cdot \Sigma_{\hat\beta}\big) + \sigma^2 = \sigma^2(p+1)/n + \sigma^2.$$
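A small simulation sketch checking this value, assuming a fixed Gaussian design reused as the test design (so $X = X^*$); the averaged squared prediction error should be close to $\sigma^2(p+1)/n + \sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 4, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # X* = X here
beta = rng.normal(size=p + 1)

errs = []
for _ in range(5000):
    Y = X @ beta + sigma * rng.normal(size=n)          # training response
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)       # OLS fit
    Y_star = X @ beta + sigma * rng.normal(size=n)     # independent test response
    errs.append(np.mean((Y_star - X @ beta_hat) ** 2))

print(np.mean(errs))                         # Monte Carlo estimate of the MSPE
print(sigma**2 * (p + 1) / n + sigma**2)     # theoretical value
```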
Ridge regression
Ridge regression uses the minimizer of a penalized squared error
loss function to estimate the regression coefficients:

$$\hat\beta \equiv \operatorname{argmin}_\beta\; \|Y - X\beta\|^2 + \lambda\beta'D\beta.$$

Typically $D$ is a diagonal matrix with 0 in the 1,1 position and
ones on the rest of the diagonal. In this case,

$$\beta'D\beta = \sum_{j \ge 1} \beta_j^2.$$

This makes most sense when the covariates have been
standardized, so it is reasonable to penalize the $\beta_j$ equally.
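A minimal sketch of this objective in numpy, assuming the coefficient vector stores the intercept in position 0 (so it is left unpenalized):

```python
import numpy as np

def ridge_loss(beta, X, Y, lam):
    # D has 0 in the (1,1) position and 1 elsewhere on the diagonal,
    # so the penalty is lam * sum_{j >= 1} beta_j^2.
    D = np.diag([0.0] + [1.0] * (len(beta) - 1))
    return np.sum((Y - X @ beta) ** 2) + lam * beta @ D @ beta
```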
Ridge regression
Ridge regression is a compromise between fitting the data as well
as possible (by making $\|Y - X\beta\|^2$ small) and not allowing any
one fitted coefficient to become very large (which would cause
$\beta'D\beta$ to become large).
Ridge regression and collinearity
Suppose $X_1, X_2 \in \mathbb{R}^n$ are standardized and strongly positively
collinear, and their population slopes are $\beta_1$ and $\beta_2$ respectively.
Fits of the form

$$(\beta_1 + \gamma)X_1 + (\beta_2 - \gamma)X_2 = EY + \gamma(X_1 - X_2)$$

have similar MSE values as $\gamma$ varies, since $X_1 - X_2$ is small when
$X_1$ and $X_2$ are strongly positively associated.

In other words, OLS can't easily distinguish among these fits.

For example, if $X_1 \approx X_2$, then $3X_1 + 3X_2$, $4X_1 + 2X_2$, $5X_1 + X_2$,
etc. all have very similar MSE values.
Ridge regression and collinearity
For large $\lambda$, ridge regression favors the fits that minimize

$$(\beta_1 + \gamma)^2 + (\beta_2 - \gamma)^2.$$

This expression is minimized at $\gamma = (\beta_2 - \beta_1)/2$, giving the fit

$$(\beta_1 + \beta_2)X_1/2 + (\beta_1 + \beta_2)X_2/2.$$

⇒ Ridge regression favors coefficient estimates for which strongly
positively correlated covariates have similar estimated effects.
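To spell out the minimization step, set the derivative with respect to $\gamma$ equal to zero:

$$\frac{d}{d\gamma}\Big[(\beta_1 + \gamma)^2 + (\beta_2 - \gamma)^2\Big] = 2(\beta_1 + \gamma) - 2(\beta_2 - \gamma) = 0 \quad\Longrightarrow\quad \gamma = \frac{\beta_2 - \beta_1}{2},$$

and substituting this $\gamma$ makes both coefficients equal to $(\beta_1 + \beta_2)/2$.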
Calculation of ridge regression estimates
For a given value $\lambda > 0$, ridge regression is no more difficult
computationally than ordinary least squares, since

$$\frac{\partial}{\partial\beta}\Big(\|Y - X\beta\|^2 + \lambda\beta'D\beta\Big) = -2X'Y + 2X'X\beta + 2\lambda D\beta,$$

so the ridge estimate $\hat\beta$ solves the system of linear equations

$$(X'X + \lambda D)\beta = X'Y.$$

This equation can have a unique solution even when $X'X$ is
singular. Thus one application of ridging is to produce regression
estimates for singular design matrices.
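A minimal numpy sketch of this computation, applied to a deliberately collinear design (the data-generating choices are arbitrary); the ridge fit gives the two nearly identical covariates similar coefficients, while OLS need not.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    # Solve (X'X + lam*D) beta = X'Y, leaving the intercept (column 0) unpenalized.
    D = np.diag([0.0] + [1.0] * (X.shape[1] - 1))
    return np.linalg.solve(X.T @ X + lam * D, X.T @ Y)

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)            # strongly positively collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
Y = 3 * x1 + 3 * x2 + rng.normal(size=n)

print(np.linalg.lstsq(X, Y, rcond=None)[0])    # OLS: x1 and x2 coefficients can differ a lot
print(ridge_fit(X, Y, lam=10.0))               # ridge: x1 and x2 coefficients are close
```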
Ridge regression bias and variance
Ridge regression estimates are biased, but may be less variable
than OLS estimates. If $X'X$ is non-singular, the ridge estimator
can be written

$$
\begin{aligned}
\hat\beta_\lambda &= (X'X + \lambda D)^{-1}X'Y \\
&= (I + \lambda(X'X)^{-1}D)^{-1}(X'X)^{-1}X'Y \\
&= (I + \lambda(X'X)^{-1}D)^{-1}(X'X)^{-1}X'(X\beta + \epsilon) \\
&= (I + \lambda(X'X)^{-1}D)^{-1}\beta + (I + \lambda(X'X)^{-1}D)^{-1}(X'X)^{-1}X'\epsilon.
\end{aligned}
$$

Thus the bias is

$$E\hat\beta_\lambda - \beta = \big((I + \lambda(X'X)^{-1}D)^{-1} - I\big)\beta.$$
Ridge regression bias and variance
The variance of the ridge regression estimates is

$$\mathrm{var}\,\hat\beta_\lambda = \sigma^2(I + \lambda(X'X)^{-1}D)^{-1}(X'X)^{-1}(I + \lambda(X'X)^{-1}D)^{-T}.$$
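A numpy sketch evaluating the bias and variance formulas, assuming a simulated design (the design, coefficients, noise level, and $\lambda$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, lam = 100, 3, 1.0, 5.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
D = np.diag([0.0] + [1.0] * p)

G = X.T @ X
A = np.linalg.inv(np.eye(p + 1) + lam * np.linalg.solve(G, D))   # (I + lam (X'X)^{-1} D)^{-1}

bias = (A - np.eye(p + 1)) @ beta                  # E beta_hat_lambda - beta
var_ridge = sigma**2 * A @ np.linalg.inv(G) @ A.T  # variance of the ridge estimates
var_ols = sigma**2 * np.linalg.inv(G)              # variance of the OLS estimates

print(bias)
print(np.diag(var_ridge))   # no larger than the corresponding OLS variances
print(np.diag(var_ols))
```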
Ridge regression bias and variance
Next we will show that $\mathrm{var}\,\hat\beta \ge \mathrm{var}\,\hat\beta_\lambda$, in the sense that
$\mathrm{var}\,\hat\beta - \mathrm{var}\,\hat\beta_\lambda$ is a non-negative definite matrix.

First let $M = \lambda(X'X)^{-1}D$, and note that

$$
\begin{aligned}
v'(\mathrm{var}\,\hat\beta - \mathrm{var}\,\hat\beta_\lambda)v
&\propto v'\big((X'X)^{-1} - (I + M)^{-1}(X'X)^{-1}(I + M)^{-T}\big)v \\
&= u'\big((I + M)(X'X)^{-1}(I + M)' - (X'X)^{-1}\big)u \\
&= u'\big(M(X'X)^{-1} + (X'X)^{-1}M' + M(X'X)^{-1}M'\big)u \\
&= u'\big(2\lambda(X'X)^{-1}D(X'X)^{-1} + \lambda^2(X'X)^{-1}D(X'X)^{-1}D(X'X)^{-1}\big)u,
\end{aligned}
$$

where $u = (I + M)^{-T}v$. Both terms in the last line are quadratic forms
in non-negative definite matrices, so the expression is $\ge 0$.

We can conclude that for any fixed vector $\theta$,

$$\mathrm{var}(\theta'\hat\beta_\lambda) \le \mathrm{var}(\theta'\hat\beta).$$
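A quick numerical check of this ordering, assuming the same kind of toy design as above: all eigenvalues of $\mathrm{var}\,\hat\beta - \mathrm{var}\,\hat\beta_\lambda$ should be non-negative (up to rounding error).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma, lam = 100, 3, 1.0, 5.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
D = np.diag([0.0] + [1.0] * p)

G_inv = np.linalg.inv(X.T @ X)
A = np.linalg.inv(np.eye(p + 1) + lam * G_inv @ D)   # (I + M)^{-1}

var_ols = sigma**2 * G_inv
var_ridge = sigma**2 * A @ G_inv @ A.T

print(np.linalg.eigvalsh(var_ols - var_ridge))       # all >= 0, up to rounding
```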
Ridge regression effective degrees of freedom
Like OLS, the fitted values under ridge regression are linear
functions of the observed values:

$$\hat Y_\lambda = X(X'X + \lambda D)^{-1}X'Y.$$

In OLS regression, the degrees of freedom is the number of free
parameters in the model, which is equal to the trace of the
projection matrix $P$ that satisfies $\hat Y = PY$.

Fitted values in ridge regression are not a projection of $Y$, but the
matrix

$$X(X'X + \lambda D)^{-1}X'$$

plays an analogous role to $P$.
Ridge regression effective degrees of freedom
The effective degrees of freedom for ridge regression is defined as

$$\mathrm{EDF}_\lambda = \mathrm{tr}\big(X(X'X + \lambda D)^{-1}X'\big).$$

The trace can be easily computed using the identity

$$\mathrm{tr}\big(X(X'X + \lambda D)^{-1}X'\big) = \mathrm{tr}\big((X'X + \lambda D)^{-1}X'X\big).$$
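A numpy sketch of this calculation using the right-hand form of the identity, evaluated on an arbitrary simulated design over a few values of $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
D = np.diag([0.0] + [1.0] * p)

def edf(X, D, lam):
    # tr((X'X + lam*D)^{-1} X'X), the (p+1) x (p+1) form of the trace
    G = X.T @ X
    return np.trace(np.linalg.solve(G + lam * D, G))

for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    print(lam, edf(X, D, lam))   # starts at p + 1 = 5 and decreases with lambda
```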
Ridge regression effective degrees of freedom
$\mathrm{EDF}_\lambda$ is monotonically decreasing in $\lambda$. To see this we will use the
following fact about matrix derivatives:

$$\partial\,\mathrm{tr}(A^{-1}B)/\partial A = -A^{-T}B'A^{-T}.$$

By the chain rule, letting $A = X'X + \lambda D$, we have

$$
\begin{aligned}
\partial\,\mathrm{tr}\big(A^{-1}X'X\big)/\partial\lambda
&= \sum_{ij} \frac{\partial\,\mathrm{tr}(A^{-1}X'X)}{\partial A_{ij}} \cdot \frac{\partial A_{ij}}{\partial\lambda} \\
&= -\sum_{ij} \big[A^{-T}(X'X)A^{-T}\big]_{ij} \cdot D_{ij} \\
&= -\sum_i \big[A^{-T}(X'X)A^{-T}\big]_{ii} \cdot D_{ii} \\
&\le 0.
\end{aligned}
$$
Ridge regression effective degrees of freedom
$\mathrm{EDF}_\lambda$ equals $\mathrm{rank}(X)$ when $\lambda = 0$. To see what happens as
$\lambda \to \infty$, we can apply the Sherman-Morrison-Woodbury identity

$$(A + UCV)^{-1} = A^{-1} - A^{-1}U\big(C^{-1} + VA^{-1}U\big)^{-1}VA^{-1}.$$

Let $G = X'X$, and write $D = FF'$, where $F$ has independent
columns (usually $F$ will be $(p+1) \times p$, as we do not penalize the
intercept).
Ridge regression effective degrees of freedom
Applying the SMW identity and letting $\lambda \to \infty$ we get

$$
\begin{aligned}
\mathrm{tr}\big((G + \lambda D)^{-1}G\big)
&= \mathrm{tr}\Big(\big(G^{-1} - G^{-1}F(I/\lambda + F'G^{-1}F)^{-1}F'G^{-1}\big)G\Big) \\
&= \mathrm{tr}\Big(I_{p+1} - G^{-1}F(I/\lambda + F'G^{-1}F)^{-1}F'\Big) \\
&\to \mathrm{tr}\,I_{p+1} - \mathrm{tr}\big(G^{-1}F(F'G^{-1}F)^{-1}F'\big) \\
&= \mathrm{tr}\,I_{p+1} - \mathrm{tr}\big((F'G^{-1}F)^{-1}F'G^{-1}F\big) \\
&= p + 1 - \mathrm{rank}(F).
\end{aligned}
$$

Therefore in the usual case where $F$ has rank $p$, $\mathrm{EDF}_\lambda$ converges
to 1 as $\lambda$ grows large, reflecting the fact that all coefficients other
than the intercept are forced to zero.
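A quick numeric check of both endpoints, assuming the same kind of simulated design as before:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
G = X.T @ X
D = np.diag([0.0] + [1.0] * p)

print(np.trace(np.linalg.solve(G + 0.0 * D, G)))   # lambda = 0: equals p + 1 = 5
print(np.trace(np.linalg.solve(G + 1e9 * D, G)))   # very large lambda: close to 1
```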
Ridge regression and the SVD
Suppose we are fitting a ridge regression with $D = I$, and we factor
$X = USV'$ using the singular value decomposition (SVD), so that
$U$ and $V$ are orthogonal matrices and $S$ is a diagonal matrix with
non-negative diagonal elements.

The fitted coefficients are

$$
\begin{aligned}
\hat\beta_\lambda &= (X'X + \lambda I)^{-1}X'Y \\
&= (VS^2V' + \lambda VV')^{-1}VSU'Y \\
&= V(S^2 + \lambda I)^{-1}SU'Y.
\end{aligned}
$$

Note that for OLS ($\lambda = 0$), we get $\hat\beta = VS^{-1}U'Y$. The effect of
ridging is to replace $S^{-1}$ in this expression with $(S^2 + \lambda I)^{-1}S$, whose
diagonal values are uniformly smaller when $\lambda > 0$.
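A numpy check that the SVD form matches the direct solve, assuming $D = I$ as on this slide (so every coefficient is penalized):

```python
import numpy as np

rng = np.random.default_rng(7)
n, q, lam = 60, 4, 2.5
X = rng.normal(size=(n, q))
Y = rng.normal(size=n)

# Direct solve with D = I
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(q), X.T @ Y)

# SVD form: beta = V (S^2 + lam I)^{-1} S U' Y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ (s / (s**2 + lam) * (U.T @ Y))

print(np.allclose(beta_direct, beta_svd))   # True
```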
Ridge regression tuning parameter
There are various ways to set the ridge parameter $\lambda$.

Cross-validation can be used to estimate the MSPE for any
particular value of $\lambda$. This estimated MSPE can then be
minimized by checking its value at a finite set of $\lambda$ values.

Generalized cross-validation, which minimizes the following
criterion over $\lambda$, is a simpler and more commonly used approach:

$$\mathrm{GCV}(\lambda) = \frac{\|Y - \hat Y_\lambda\|^2}{(n - \mathrm{EDF}_\lambda)^2}.$$
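A numpy sketch of choosing $\lambda$ by minimizing this criterion over a grid (the data, grid, and penalty matrix are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 80, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.concatenate([[1.0], rng.normal(size=p)]) + rng.normal(size=n)
D = np.diag([0.0] + [1.0] * p)

def gcv(lam):
    H = X @ np.linalg.solve(X.T @ X + lam * D, X.T)   # smoother matrix X (X'X + lam D)^{-1} X'
    resid = Y - H @ Y
    edf = np.trace(H)                                  # EDF_lambda
    return np.sum(resid ** 2) / (n - edf) ** 2

lam_grid = np.exp(np.linspace(np.log(0.01), np.log(100.0), 50))
lam_best = lam_grid[np.argmin([gcv(lam) for lam in lam_grid])]
print(lam_best)
```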