
14: Generalized Least Squares
ECON 837
Prof. Simon Woodcock, Spring 2014
We now return to our linear regression model, but depart from the assumption of spherical
errors. We call this the generalized regression model. We retain the linearity assumption,
$E[y|X] = X\beta$, but we now assume $Var[y|X] = Var[\varepsilon|X] = V$, an $n \times n$ positive definite (symmetric) matrix. Sometimes we will let $V = \sigma^2 \Omega$, with $tr(\Omega) = n$ (a normalization). This is appropriate when we know the pattern of the covariance matrix but not the scale. Clearly, spherical errors $Var[\varepsilon|X] = \sigma^2 I_n$ are a special case. Heteroskedastic and/or serially correlated errors are also special cases.
We know that the least squares estimator is $\hat{\beta} = (X'X)^{-1}X'y$. In the generalized regression model, what is $E[\hat{\beta}|X]$? Is $\hat{\beta}$ BLUE? Does $\hat{\beta}$ minimize $e'e$? Is it consistent? Note that the variance of the least squares estimator applied to the generalized regression model is:
$$Var[\hat{\beta}|X] = E\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)'|X\right] = E\left[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}|X\right] = (X'X)^{-1}X'VX(X'X)^{-1} \neq \sigma^2 (X'X)^{-1}$$
unless $V = \sigma^2 I_n$.
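To see the difference numerically, here is a minimal sketch in Python (using numpy and an illustrative heteroskedastic $V$; the design is an assumption for illustration, not part of the notes) comparing the sandwich variance $(X'X)^{-1}X'VX(X'X)^{-1}$ with the usual spherical formula.

```python
# Sketch: variance of OLS under non-spherical errors (illustrative V; not from the notes).
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# Heteroskedastic example: V is diagonal with variances that depend on the second regressor.
v = 0.5 + X[:, 1] ** 2
V = np.diag(v)

XtX_inv = np.linalg.inv(X.T @ X)

# Correct conditional variance of the OLS estimator: (X'X)^{-1} X'VX (X'X)^{-1}.
var_ols_correct = XtX_inv @ X.T @ V @ X @ XtX_inv

# The usual "spherical" formula sigma^2 (X'X)^{-1}, using the average variance as sigma^2.
var_ols_naive = v.mean() * XtX_inv

print(np.diag(var_ols_correct))
print(np.diag(var_ols_naive))   # generally differs unless V = sigma^2 I
```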
We'll focus on a generalization of the least squares estimator called, appropriately enough, generalized least squares (GLS). The basic idea behind GLS is to transform the data matrix $[y \; X]$ so that the variance of the transformed model is $I_n$ (or $\sigma^2 I_n$).

Since $V$ is positive definite, $V^{-1}$ is positive definite also. Therefore there exists a nonsingular $n \times n$ matrix $P$ such that $V^{-1} = P'P$. Transforming the regression model $y = X\beta + \varepsilon$ by $P$ yields
$$Py = PX\beta + P\varepsilon. \qquad (1)$$
Note that $E[P\varepsilon|X] = P E[\varepsilon|X] = 0$ and
$$Var[P\varepsilon|X] = E[P\varepsilon\varepsilon'P'|X] = P E[\varepsilon\varepsilon'|X] P' = PVP' = P(P'P)^{-1}P' = PP^{-1}(P')^{-1}P' = I_n.$$
That is, the transformed model has spherical errors and hence satisfies the conditions under which we derived the least squares estimator. Therefore, the least squares estimator applied to the transformed model is BLUE. This is called the GLS estimator of $\beta$.

Proposition 1 The GLS estimator of $\beta$ is $\hat{\beta}_G = (X'V^{-1}X)^{-1}X'V^{-1}y$.
Proof. Easy. Just apply the least squares formula to the transformed model:
$$\hat{\beta}_G = (X'P'PX)^{-1}X'P'Py = (X'V^{-1}X)^{-1}X'V^{-1}y.$$

Proposition 2 $Var[\hat{\beta}_G|X] = (X'V^{-1}X)^{-1}$.
Proof. Easy. Just apply the same method of proof as we did for ordinary least squares.
That is, note that
$$\hat{\beta}_G = (X'V^{-1}X)^{-1}X'V^{-1}y = (X'V^{-1}X)^{-1}X'V^{-1}(X\beta + \varepsilon) = \beta + (X'V^{-1}X)^{-1}X'V^{-1}\varepsilon$$
so that
$$Var[\hat{\beta}_G|X] = E\left[(\hat{\beta}_G - \beta)(\hat{\beta}_G - \beta)'|X\right] = E\left[(X'V^{-1}X)^{-1}X'V^{-1}\varepsilon\varepsilon'V^{-1}X(X'V^{-1}X)^{-1}|X\right]$$
$$= (X'V^{-1}X)^{-1}X'V^{-1}VV^{-1}X(X'V^{-1}X)^{-1} = (X'V^{-1}X)^{-1}.$$
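A short sketch of the transformation argument, assuming numpy and taking $P = L^{-1}$ from the Cholesky factorization $V = LL'$ as one convenient choice satisfying $P'P = V^{-1}$ (the AR(1)-style $V$ is illustrative only):

```python
# Sketch: GLS as OLS on the transformed model Py = PX b + Pe, with P'P = V^{-1}.
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Illustrative known V (AR(1)-style correlation); not from the notes.
rho = 0.6
V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(V)              # V = L L'
y = X @ beta + L @ rng.normal(size=n)  # errors with Var = V

P = np.linalg.inv(L)                   # one choice of P: P'P = V^{-1}
Py, PX = P @ y, P @ X

# OLS on the transformed data ...
b_gls_transformed = np.linalg.solve(PX.T @ PX, PX.T @ Py)
# ... equals the closed-form GLS estimator (X'V^{-1}X)^{-1} X'V^{-1}y.
Vinv = np.linalg.inv(V)
b_gls_direct = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

print(np.allclose(b_gls_transformed, b_gls_direct))  # True (up to rounding)

# Conditional variance from Proposition 2: (X'V^{-1}X)^{-1}.
var_gls = np.linalg.inv(X.T @ Vinv @ X)
print(np.diag(var_gls))
```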
Theorem 3 (Aitken) The GLS estimator is BLUE. (This just follows from the GMT, but
we’ll give a direct proof anyway).
Proof. Let b be an alternative linear unbiased estimator such that
$$b = \left[(X'V^{-1}X)^{-1}X'V^{-1} + A\right] y.$$
(what's the dimension of $A$?) Unbiasedness implies $AX = 0$. Then we have
$$Var[b|X] = \left[(X'V^{-1}X)^{-1}X'V^{-1} + A\right] V \left[(X'V^{-1}X)^{-1}X'V^{-1} + A\right]'$$
$$= (X'V^{-1}X)^{-1} + AVA' + (X'V^{-1}X)^{-1}X'A' + AX(X'V^{-1}X)^{-1}$$
$$= (X'V^{-1}X)^{-1} + AVA'$$
$$\geq (X'V^{-1}X)^{-1} \quad \text{(why?)}$$
$$= Var[\hat{\beta}_G|X].$$
What Does the GLS Estimator Minimize?
Recall that the least squares estimator $\hat{\beta}$ minimizes $e'e = (y - X\hat{\beta})'(y - X\hat{\beta})$. That is, it minimizes the length of $y - X\hat{\beta}$. We know the GLS estimator is just least squares applied to $Py = PX\beta + P\varepsilon$. Therefore, $\hat{\beta}_G$ minimizes the length of $P(y - X\hat{\beta}_G)$. That is, it minimizes
$$(Py - PX\hat{\beta}_G)'(Py - PX\hat{\beta}_G) = (y - X\hat{\beta}_G)'P'P(y - X\hat{\beta}_G) = (y - X\hat{\beta}_G)'V^{-1}(y - X\hat{\beta}_G)$$
which is just a weighted sum of squared residuals. After weighting, each observation contributes about the same amount of information to the estimation of $\hat{\beta}_G$.
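A quick numerical check of this minimization property (Python/numpy, with an illustrative diagonal $V$ and simulated data; the helper `weighted_ssr` is hypothetical, introduced just for this sketch):

```python
# Sketch: the GLS estimate minimizes (y - Xb)'V^{-1}(y - Xb).
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
V = np.diag(0.5 + np.arange(n) / n)          # illustrative heteroskedastic V
Vinv = np.linalg.inv(V)
y = X @ np.array([1.0, -1.0]) + np.sqrt(np.diag(V)) * rng.normal(size=n)

def weighted_ssr(b):
    r = y - X @ b
    return r @ Vinv @ r

b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# First-order condition: X'V^{-1}(y - X b_gls) = 0.
print(X.T @ Vinv @ (y - X @ b_gls))          # ~ 0

# The weighted SSR is (weakly) larger at any other candidate, e.g. OLS or a perturbation.
print(weighted_ssr(b_gls) <= weighted_ssr(b_ols))                  # True
print(weighted_ssr(b_gls) <= weighted_ssr(b_gls + [0.01, -0.02]))  # True
```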
Estimating $\sigma^2$
Let $Var[y|X] = Var[\varepsilon|X] = \sigma^2\Omega$, where $tr(\Omega) = n$. Choosing $P$ so that $P'P = \Omega^{-1}$, the error variance in the transformed model (1) is therefore $\sigma^2 I_n$. Define $e_G = Py - PX\hat{\beta}_G$ (note that the GLS residuals satisfy $X'V^{-1}(y - X\hat{\beta}_G) = 0$, but $X'(y - X\hat{\beta}_G) \neq 0$ in general). We know that an unbiased estimator of $\sigma^2$ in the transformed model is just
$$e_G'e_G / (n - k).$$
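A small sketch of this estimator, assuming numpy and an illustrative skedastic pattern with $tr(\Omega) = n$ imposed by construction:

```python
# Sketch: unbiased estimate of sigma^2 from the transformed residuals (illustrative data).
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 500, 2, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
omega = np.diag(np.exp(X[:, 1]) / np.mean(np.exp(X[:, 1])))   # tr(Omega) = n normalization
V = sigma2 * omega

L = np.linalg.cholesky(V)
y = X @ np.array([0.5, 1.5]) + L @ rng.normal(size=n)          # errors with Var = sigma2 * Omega

P = np.linalg.inv(np.linalg.cholesky(omega))   # P'P = Omega^{-1}, so Var(P e) = sigma2 * I
Py, PX = P @ y, P @ X
b_gls = np.linalg.solve(PX.T @ PX, PX.T @ Py)

e_g = Py - PX @ b_gls
s2 = e_g @ e_g / (n - k)
print(s2)        # close to sigma2 = 2.0 in a large sample
```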
Maximum Likelihood Estimation Under Normality
We've seen that the GLS estimator is BLUE. Under normality, it is also the MLE (and hence consistent, asymptotically normal, invariant, and asymptotically efficient). The normality assumption is:
$$y|X \sim N\left(X\beta, \sigma^2\Omega\right).$$
The log likelihood is
$$l\left(\beta, \sigma^2\right) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2}\ln|\Omega| - \frac{1}{2\sigma^2}(y - X\beta)'\Omega^{-1}(y - X\beta).$$

Proposition 4 The GLS estimator $\hat{\beta}_G$ is also the MLE of $\beta$. (why?)

Proposition 5 The maximum likelihood estimator of $\sigma^2$ is $\hat{\sigma}^2_{ML} = e_G'e_G/n$.
Finite sample inference under normality is the same as in the ordinary linear regression
model. It is based on
$$\hat{\beta}_G|X \sim N\left(\beta, \sigma^2\left(X'\Omega^{-1}X\right)^{-1}\right).$$
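A sketch that evaluates this log likelihood and checks that, for a given $\sigma^2$, it is maximized over $\beta$ at the GLS estimate (numpy; the design and the `loglik` helper are illustrative assumptions):

```python
# Sketch: Gaussian log likelihood l(b, s2) evaluated at the GLS/ML estimates (illustrative data).
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
omega = np.diag(1.0 + (np.arange(n) % 4))
omega *= n / np.trace(omega)                    # normalize tr(Omega) = n
sigma2_true = 1.5
y = X @ np.array([1.0, 0.5]) + np.linalg.cholesky(sigma2_true * omega) @ rng.normal(size=n)

omega_inv = np.linalg.inv(omega)
sign, logdet_omega = np.linalg.slogdet(omega)

def loglik(b, s2):
    r = y - X @ b
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(s2)
            - 0.5 * logdet_omega - 0.5 * (r @ omega_inv @ r) / s2)

b_gls = np.linalg.solve(X.T @ omega_inv @ X, X.T @ omega_inv @ y)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
s2_ml = (y - X @ b_gls) @ omega_inv @ (y - X @ b_gls) / n   # as in Proposition 5

print(loglik(b_gls, s2_ml) >= loglik(b_ols, s2_ml))   # True: GLS maximizes the likelihood over b
```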
Asymptotic Properties of the GLS Estimator
Since GLS is just least squares on the transformed model, the GLS estimator inherits the
asymptotic properties of the least squares estimator, suitably modified. If we assume that
$$\text{plim}\, \frac{X'\Omega^{-1}X}{n} = Q \text{ (positive definite), and} \qquad (2)$$
$$\text{plim}\, \frac{X'\Omega^{-1}\varepsilon}{n} = 0 \qquad (3)$$
then it is straightforward to show that (you should try this!)
$$\text{plim}\, \hat{\beta}_G = \beta \qquad (4)$$
$$\sqrt{n}\left(\hat{\beta}_G - \beta\right) \xrightarrow{d} N\left(0, \sigma^2 Q^{-1}\right). \qquad (5)$$
That is, the GLS estimator is consistent and asymptotically normal.
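A Monte Carlo sketch of (4) and (5), assuming numpy, independent heteroskedastic errors, and an illustrative $\Omega$ with $tr(\Omega) = n$:

```python
# Sketch: Monte Carlo check of consistency and asymptotic normality of GLS (illustrative design).
import numpy as np

rng = np.random.default_rng(5)
n, reps, sigma2 = 400, 2000, 1.0
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
omega = np.diag((0.3 + x ** 2) / np.mean(0.3 + x ** 2))   # tr(Omega) = n
omega_inv = np.linalg.inv(omega)
beta = np.array([1.0, 2.0])
A = np.linalg.solve(X.T @ omega_inv @ X, X.T @ omega_inv)  # maps y -> b_gls
sd = np.sqrt(sigma2 * np.diag(omega))

draws = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + sd * rng.normal(size=n)
    draws[r] = A @ y

# Consistency: the Monte Carlo mean is close to beta.
print(draws.mean(axis=0))
# Asymptotic normality: Var of sqrt(n)(b_gls - beta) is close to sigma2 * (X'Omega^{-1}X / n)^{-1}.
print(n * np.cov(draws.T))
print(sigma2 * np.linalg.inv(X.T @ omega_inv @ X / n))
```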
When is OLS equivalent to GLS?
That is, when does $\hat{\beta} = \hat{\beta}_G$?

1. They are equivalent in the trivial case where $V = \sigma^2 I_n$.

2. In general, since $\hat{\beta} = (X'X)^{-1}X'y$ and $\hat{\beta}_G = (X'V^{-1}X)^{-1}X'V^{-1}y$, we have:
$$\hat{\beta} = \hat{\beta}_G \Leftrightarrow (X'X)^{-1}X' = (X'V^{-1}X)^{-1}X'V^{-1}$$
$$\Leftrightarrow (X'X)^{-1}X'V = (X'V^{-1}X)^{-1}X'$$
$$\Leftrightarrow X'V = (X'X)(X'V^{-1}X)^{-1}X'$$
$$\Leftrightarrow VX = X(X'V^{-1}X)^{-1}(X'X) = XR$$
where $R = (X'V^{-1}X)^{-1}(X'X)$ is a $k \times k$ matrix. How do we interpret this result?

Suppose that $k = 1$ so that the data $x$ is an $n \times 1$ vector and $r$ is a scalar. We have
$$\hat{\beta} = \hat{\beta}_G \Leftrightarrow Vx = rx.$$
Does this look familiar? It is the characteristic equation for $V$. It implies that $r$ is an eigenvalue of $V$, and $x$ the corresponding eigenvector. This is unlikely to be the case in general. When $k > 1$, if the columns of $X$ are each linear combinations of the same $k$ eigenvectors of $V$, then $\hat{\beta} = \hat{\beta}_G$. This is difficult to verify and would typically be a bad assumption.
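A numerical illustration of the eigenvector condition, assuming numpy: we build a $V$ with known eigenvectors, let the columns of $X$ be combinations of $k$ of them, and check that OLS and GLS coincide (the construction is illustrative, not a claim about real data):

```python
# Sketch: OLS equals GLS when the columns of X lie in the span of k eigenvectors of V.
import numpy as np

rng = np.random.default_rng(6)
n, k = 50, 2

# Build a V with known eigenvectors, then make X a linear combination of two of them.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))      # orthonormal eigenvectors
eigvals = np.linspace(1.0, 5.0, n)
V = Q @ np.diag(eigvals) @ Q.T
X = Q[:, :k] @ rng.normal(size=(k, k))            # columns are combinations of the same k eigenvectors

y = rng.normal(size=n)                            # any y: the equality is algebraic
Vinv = np.linalg.inv(V)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
print(np.allclose(b_ols, b_gls))                  # True

# With a generic X the two estimators differ.
X2 = rng.normal(size=(n, k))
print(np.allclose(np.linalg.solve(X2.T @ X2, X2.T @ y),
                  np.linalg.solve(X2.T @ Vinv @ X2, X2.T @ Vinv @ y)))   # False in general
```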
The Case of Unknown $\Omega$
In the above, we acted as though $V$ (or $\Omega$) were known. Of course, this is not the situation we usually face in practice. How do we proceed when the error covariance is unknown? First off, notice that there is no hope of estimating a completely unstructured $\Omega$, since it has $n(n+1)/2 > n$ elements, but there are only $n$ observations from which to estimate it. Thus we usually make some parametric restriction $\Omega = \Omega(\theta)$ with $\theta$ being a relatively low-dimensional parameter vector. Then we can hope to estimate $\theta$ consistently using squares and cross-products of least squares residuals (or by maximum likelihood, which usually amounts to the same thing). It doesn't make sense to try and estimate $\Omega$ consistently since it grows with the sample size. Thus "consistency" refers to the estimate of $\theta$.
Definition 6 We say that $\hat{\Omega} = \Omega(\hat{\theta})$ is a consistent estimator of $\Omega$ iff $\hat{\theta} \xrightarrow{p} \theta$.

When $\Omega$ is unknown, we use an estimation method known as feasible GLS (FGLS). It is the same as GLS, except we use an estimate $\hat{\Omega}$ in place of $\Omega$.
Proposition 7 $\hat{\beta}_{FG} = \left(X'\hat{\Omega}^{-1}X\right)^{-1}X'\hat{\Omega}^{-1}y = \beta + \left(X'\hat{\Omega}^{-1}X\right)^{-1}X'\hat{\Omega}^{-1}\varepsilon$.
Proof. This is obvious.
Proposition 8 Sufficient conditions for $\hat{\beta}_{FG}$ to be consistent are
$$\text{plim}\, \frac{X'\hat{\Omega}^{-1}X}{n} = Q \text{ where } Q \text{ is positive definite and finite; and} \qquad (6)$$
$$\text{plim}\, \frac{X'\hat{\Omega}^{-1}\varepsilon}{n} = 0. \qquad (7)$$
Proof. This is obvious.
Alternately, assume conditions (2) and (3) are satisfied, so that the GLS estimator is consistent. Provided $\hat{\Omega}$ is a consistent estimator of $\Omega$, the Slutsky Theorem tells us that (6) and (7) are satisfied.
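As one concrete (and purely illustrative) example of the two-step idea, here is a sketch of FGLS under an assumed exponential skedastic function estimated from log squared OLS residuals (numpy; the variance model and tuning choices are assumptions for the sketch, not a recipe from the notes):

```python
# Sketch: two-step FGLS under an assumed parametric skedastic form Var(e_i) = exp(t0 + t1*x_i).
import numpy as np

rng = np.random.default_rng(7)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 2.0])
true_var = np.exp(0.5 + 1.0 * x)
y = X @ beta + np.sqrt(true_var) * rng.normal(size=n)

# Step 1: OLS residuals, then estimate theta from log squared residuals.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_ols
theta = np.linalg.solve(X.T @ X, X.T @ np.log(e ** 2))   # slope consistent; intercept off by a constant

# Step 2: FGLS with the fitted variances (the scale of Omega-hat does not affect b_FG).
w = 1.0 / np.exp(X @ theta)                              # Omega-hat^{-1} is diagonal with these weights
b_fgls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

print(b_ols)    # consistent but inefficient
print(b_fgls)   # close to beta, typically with smaller sampling variance
```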
It takes a little more work to derive an asymptotic distribution for the FGLS estimator. We saw the asymptotic distribution of $\hat{\beta}_G$ ($\Omega$ known) already. How about when $\Omega$ is unknown?
Proposition 9 Sufficient conditions for $\hat{\beta}_{FG}$ and $\hat{\beta}_G$ to have the same asymptotic distribution are that
$$\text{plim}\, \frac{X'\left(\hat{\Omega}^{-1} - \Omega^{-1}\right)X}{n} = 0 \quad \text{and} \qquad (8)$$
$$\text{plim}\, \frac{X'\left(\hat{\Omega}^{-1} - \Omega^{-1}\right)\varepsilon}{\sqrt{n}} = 0. \qquad (9)$$
Proof. Notice that
$$\sqrt{n}\left(\hat{\beta}_G - \hat{\beta}_{FG}\right) = \left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt{n}} - \left(\frac{X'\hat{\Omega}^{-1}X}{n}\right)^{-1}\frac{X'\hat{\Omega}^{-1}\varepsilon}{\sqrt{n}}.$$
Therefore, if
$$\text{plim}\, \frac{X'\hat{\Omega}^{-1}X}{n} = \text{plim}\, \frac{X'\Omega^{-1}X}{n} \quad \text{and} \quad \text{plim}\, \frac{X'\hat{\Omega}^{-1}\varepsilon}{\sqrt{n}} = \text{plim}\, \frac{X'\Omega^{-1}\varepsilon}{\sqrt{n}}$$
then $\text{plim}\, \sqrt{n}\left(\hat{\beta}_G - \hat{\beta}_{FG}\right) = 0$, and we are done. (Recall that if $\text{plim}(x - y) = 0$ then $x$ and $y$ have the same asymptotic distribution.)
In summary, if (6) and (7) are satisfied, the FGLS estimator is consistent. A sufficient condition for this is that (2) and (3) are satisfied (so that GLS is consistent) and $\hat{\Omega}$ is a consistent estimator of $\Omega$ in the sense of Definition 6. In this case, (8) and (9) will also be satisfied, and the FGLS estimator will have the same asymptotic distribution as the GLS estimator. When this is true, the FGLS estimator is asymptotically efficient, in the sense that it has the same asymptotic distribution as the GLS estimator, which we know is BLUE. Note that asymptotic efficiency of the FGLS estimator does not require an efficient estimate of $\Omega$, only a consistent one. Note also that this is a different notion of asymptotic efficiency than we were concerned with in discussing the Cramer-Rao lower bound (although they coincide under normality).
Finite Sample Properties of FGLS Estimators
In general, we do not know the small sample properties of the FGLS estimator. We cannot rely on the GMT, since $\hat{\beta}_{FG}$ is not a linear function of $y$! That is, $\hat{\Omega}$ is a function of $y$.
The following is an easily obtained, but not always useful result.
Proposition 10 Suppose that $\hat{\Omega}$ is an even function of $\varepsilon$ (i.e., $\hat{\Omega}(\varepsilon) = \hat{\Omega}(-\varepsilon)$; note this is true if $\hat{\Omega}$ is a function of squares and cross products of residuals). Suppose further that $\varepsilon$ has a symmetric distribution around zero. Then $E[\hat{\beta}_{FG}|X] = \beta$ if the expectation exists.
Proof. The sampling error $\hat{\beta}_{FG} - \beta = \left(X'\hat{\Omega}^{-1}X\right)^{-1}X'\hat{\Omega}^{-1}\varepsilon$ has a symmetric distribution around zero since $\varepsilon$ and $-\varepsilon$ yield the same value of $\hat{\Omega}$. Therefore if the mean exists, it is zero.
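A Monte Carlo sketch of Proposition 10, assuming numpy, symmetric ($t$-distributed) errors, and an $\hat{\Omega}$ built from squared OLS residuals so that it is an even function of the errors (the skedastic model is illustrative):

```python
# Sketch: Monte Carlo illustration of Proposition 10 (unbiased FGLS under symmetry).
import numpy as np

rng = np.random.default_rng(8)
n, reps = 200, 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), x ** 2])      # skedastic regressors: Var(e_i) modeled as a + b*x_i^2
beta = np.array([1.0, -1.0])
sd = np.sqrt(0.5 + x ** 2)                     # true heteroskedastic standard deviations

est = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + sd * rng.standard_t(df=6, size=n)       # symmetric errors
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)          # OLS residuals
    var_hat = np.clip(Z @ np.linalg.solve(Z.T @ Z, Z.T @ e ** 2), 0.05, None)
    w = 1.0 / var_hat                                      # diagonal Omega-hat^{-1}, even in the errors
    est[r] = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

print(est.mean(axis=0))   # close to beta = (1, -1): unbiased even though b_FG is not linear in y
```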