Handout - University of Warwick

Probability and Statistics
EC961
University of Warwick
September 2014
This version: September 24, 2014.
1 / 73
Introduction
Reading:
- Main reference:
  - Introduction to Statistics and Econometrics, by Amemiya, 1994.
- Suitable textbooks:
  - Statistical Inference, by Casella and Berger, second edition.
  - Probability and Statistics, by DeGroot and Schervish, fourth edition.
  - Introduction to Mathematical Statistics, by Hogg, McKean, and Craig, seventh edition.
  - Introduction to Probability and Mathematical Statistics, by Bain and Engelhardt, second edition.
2 / 73
Introduction
Outline:
- Probability (Ch. 2);
- Random variables and probability distributions (Ch. 3);
- Moments (Ch. 4);
- Point estimation (Ch. 7 and 10);
- Large sample theory (Ch. 6);
- Tests of hypotheses (Ch. 9).
3 / 73
Section 1
Probability
4 / 73
Set theory
Definition
The set S of all possible outcomes of a particular experiment is
called the sample space for the experiment.
Definition
An event is any collection of possible outcomes of an experiment,
that is, any subset of S (including S itself).
Definition (Axioms of probability, not fully rigorous)
Given a sample space S, a probability function is a function that
satisfies
1. Pr (A) ≥ 0 for any event A.
2. Pr (S) = 1.
3. If {Ai }, i = 1, · · · , are mutually exclusive (that is,
Ai ∩ Aj = ∅ for all i ≠ j), then
Pr (A1 ∪ A2 ∪ · · · ) = Pr (A1 ) + Pr (A2 ) + · · · .
5 / 73
Conditional probability
Definition
If A and B are events in S, and Pr (B) > 0, then the conditional
probability of A given B, written Pr (A | B), is defined as
Pr (A | B) = Pr (A ∩ B) / Pr (B).
Theorem (2.4.2 Bayes, skip)
Let events A1 , A2 , · · · , An be mutually exclusive such that
Pr (A1 ∪ A2 ∪ · · · ∪ An ) = 1 and Pr (Ai ) > 0 for each i. Let E
be an arbitrary event such that Pr (E) > 0. Then
Pr (Ai | E) = Pr (E | Ai) Pr (Ai) / Σ_{j=1}^n Pr (E | Aj) Pr (Aj),
i = 1, · · · , n.
6 / 73
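As a quick numerical illustration of Bayes' theorem, the sketch below computes the posteriors Pr (Ai | E) for two mutually exclusive events. All numbers (priors and likelihoods) are made up for illustration; they are not from the slides.

```python
# Hypothetical setup: two mutually exclusive "causes" A1, A2 covering the
# sample space, and an observed event E.  All numbers are illustrative.
priors = [0.3, 0.7]        # Pr(A1), Pr(A2)
likelihoods = [0.9, 0.2]   # Pr(E | A1), Pr(E | A2)

def posterior(i, priors, likelihoods):
    """Bayes: Pr(Ai | E) = Pr(E | Ai) Pr(Ai) / sum_j Pr(E | Aj) Pr(Aj)."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return likelihoods[i] * priors[i] / evidence

post = [posterior(i, priors, likelihoods) for i in range(len(priors))]
# post[0] = 0.27 / 0.41, post[1] = 0.14 / 0.41
```

Note that the denominator is the same for every i, so the posteriors automatically sum to one.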
Independence
Definition
Two events, A and B, are statistically independent if
Pr (A ∩ B) = Pr (A) Pr (B) .
Note that independence could have been equivalently defined by
either Pr (A) = Pr (A|B) or Pr (B) = Pr (B|A) (as long as
either Pr (A) > 0 or Pr (B) > 0).
7 / 73
Section 2
Random variables and probability
distributions
8 / 73
Definition (3.1.1)
A random variable is a variable that takes values according to a
certain probability distribution.
Definition (3.1.2)
A random variable is a function from a sample space S into the
real numbers.
Definition (3.2.1)
A discrete random variable is a variable that takes a countable
number of real numbers with certain probabilities.
The probability distribution of a discrete random variable is
completely characterized by the equation
Pr (X = xi ) = pi
i = 1, · · · , n,
where xi ∈ X = {x1 , x2 , . . . }.
9 / 73
A word on notation
Suppose we have a sample space
S = {s1 , . . . , sn }
with probability function Pr(·) and we define a random variable
X with range X = {x1 , . . . , xm }. We can define the probability
function of X as
PrX (X = xi) = Pr ({sj ∈ S : X(sj) = xi}) ,
that is, we will observe X = xi if and only if the outcome of the
random experiment is an sj ∈ S such that X(sj ) = xi .
In order to avoid too heavy notation, we will write Pr(X = xi )
instead of PrX (X = xi ).
10 / 73
Another word on notation
In the statistics literature, random variables are usually denoted
by uppercase letters and the realized values of a variable are
usually denoted by the corresponding lowercase letter. For
instance, the random variable X can take the value xi.
This is not the case in the econometric literature, where
uppercase letters (in boldface) are reserved for matrices. Recall
the well known formula of the OLS estimator
β̂ = (X′X)⁻¹X′y.
Unless stated otherwise, in these notes we will follow the
notation used in statistics (X random variable, x its realization;
X random vector, x its realization).
11 / 73
Definition (3.2.2)
A bivariate discrete random variable is a variable that takes a
countable number of points on the plane with certain
probabilities.
The probability distribution of a bivariate discrete random
variable is determined by the equations
Pr (X = xi, Y = yj) = pij,   i = 1, . . . , n; j = 1, . . . , m.
The marginal probability of X is defined as
Pr (X = xi) = Σ_{j=1}^m Pr (X = xi, Y = yj),   i = 1, . . . , n,
and the conditional probability of X = xi given Y = yj is
(assuming Pr (Y = yj) > 0)
Pr (X = xi | Y = yj) = Pr (X = xi, Y = yj) / Pr (Y = yj).
12 / 73
Definition (3.2.3)
Two discrete random variables X and Y are said to be
independent if the event (X = xi ) and the event (Y = yj ) are
independent for all i, j. That is to say,
Pr (X = xi , Y = yj ) = Pr (X = xi ) Pr (Y = yj ) for all i, j, and
we write X ⊥⊥ Y.
Example (3.2.3)
Let the joint probability distribution of X and Y be given by

               Y = 1   Y = 0   Pr (X = xi)
  X = 1        2/8     1/8     3/8
  X = 0        2/8     3/8     5/8
  Pr (Y = yj)  4/8     4/8     1

Are X and Y independent?
13 / 73
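The independence check in Example 3.2.3 can be carried out mechanically: compute both marginals from the joint table and compare Pr (X = x, Y = y) with Pr (X = x) Pr (Y = y) cell by cell. A small sketch using exact fractions (note: it does reveal the example's answer):

```python
from fractions import Fraction as F

# Joint distribution of Example 3.2.3: joint[x][y] = Pr(X=x, Y=y).
joint = {1: {1: F(2, 8), 0: F(1, 8)},
         0: {1: F(2, 8), 0: F(3, 8)}}

# Marginals: sum the joint probabilities over the other variable.
px = {x: sum(row.values()) for x, row in joint.items()}
py = {y: sum(joint[x][y] for x in joint) for y in (1, 0)}

# Independent iff Pr(X=x, Y=y) = Pr(X=x) Pr(Y=y) for every cell.
independent = all(joint[x][y] == px[x] * py[y]
                  for x in joint for y in (1, 0))
```

Exact `Fraction` arithmetic avoids any floating-point ambiguity in the equality test.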
Definition (3.3.1)
If there is a nonnegative function fX (x) defined over R such
that
Pr (x1 ≤ X ≤ x2) = ∫_{x1}^{x2} fX (x) dx
for any x1 and x2 satisfying x1 ≤ x2, then X is a continuous
random variable and fX (x) is called its density function.
By Axiom (2) of probability, ∫_{−∞}^{∞} fX (x) dx = 1.
Note: when X is continuous, the following expressions are
equivalent: Pr (x1 ≤ X ≤ x2), Pr (x1 < X ≤ x2),
Pr (x1 ≤ X < x2), and Pr (x1 < X < x2).
14 / 73
Definition (3.4.1)
If there is a nonnegative function fX,Y (x, y) defined over R²
such that
Pr (x1 ≤ X ≤ x2, y1 ≤ Y ≤ y2) = ∫_{y1}^{y2} ∫_{x1}^{x2} fX,Y (x, y) dx dy
for any x1, x2, y1, y2 satisfying x1 ≤ x2, y1 ≤ y2, then (X, Y)′ is
a bivariate continuous random variable and fX,Y (x, y) is called
their joint density function.
In order for fX,Y (x, y) to be a joint density function, it must be
nonnegative and
∫_{−∞}^{∞} ∫_{−∞}^{∞} fX,Y (x, y) dx dy = 1.

Example (3.4.1)
If fX,Y (x, y) = xy e^{−(x+y)} for x > 0, y > 0, and 0 otherwise, what is
Pr (X > 1, Y < 1)?
15 / 73
Marginal densities
Definition
Let (X, Y)′ be a continuous bivariate random vector with joint
probability density function fX,Y (x, y). The marginal pdfs
fX (x) and fY (y) are defined as
fX (x) = ∫_{−∞}^{∞} fX,Y (x, y) dy,   −∞ < x < ∞,
fY (y) = ∫_{−∞}^{∞} fX,Y (x, y) dx,   −∞ < y < ∞.
16 / 73
Conditional densities
Definition
Let (X, Y )0 be a continuous bivariate random vector with joint
probability density function fX,Y (x, y) and marginal pdfs fX (x)
and fY (y). For any x such that fX (x) > 0, the conditional pdf
of Y given X = x is the function of y denoted by fY |X (y|x) and
defined by
fY |X (y|x) = fY,X (y, x) / fX (x).
17 / 73
Definition (3.4.6)
Continuous random variables X and Y are said to be
independent if, for all x and y,
fX,Y (x, y) = fX (x)fY (y),
and we write X ⊥⊥ Y.
18 / 73
Example (3.4.5)
Suppose
fX,Y (x, y) = (3/2)(x² + y²) for 0 < x < 1 and 0 < y < 1, and 0 otherwise.
Calculate
Pr (0 < X < 1/2, 0 < Y < 1/2)
and determine if X and Y are independent.
19 / 73
Distribution function
Definition
The cumulative distribution function of a random variable X,
denoted by FX (·), is defined by
FX (x) = Pr (X ≤ x) ,
x ∈ R.
Example (3.5.1)
Suppose
fX (x) = (1/2) e^{−x/2} if x > 0, and 0 otherwise.
Find FX (x).
20 / 73
Example (3.5.2)
Suppose
fX (x) = 2(1 − x) if 0 < x < 1, and 0 otherwise.
Find FX (x).

Example (3.5.3 Mixture random variable)
Suppose
FX (x) = 0 if x ≤ 0;  0.5 if 0 < x ≤ 0.5;  x if 0.5 < x ≤ 1;  1 if x > 1.
Draw FX (x).
21 / 73
Change of variables
Theorem (3.6.1)
Let fX (x) be the density of X and let Y = g(X), where g(·) is a
monotonic and differentiable function. Then the density fY (y)
of Y is given by
fY (y) = fX [g⁻¹(y)] · |dg⁻¹(y)/dy|,
where g⁻¹(·) denotes the inverse function of g(·) (do not
mistake it for 1 over g).

Example
Let X be a continuous random variable with pdf
fX (x) = (1/√(2π)) e^{−x²/2}, x ∈ R, and let Y = µ + σX, where µ
and σ are two constants, with σ > 0. Determine fY (y).
22 / 73
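The example above can be sanity-checked numerically: applying Theorem 3.6.1 to Y = µ + σX, with g⁻¹(y) = (y − µ)/σ and |dg⁻¹(y)/dy| = 1/σ, should reproduce the N(µ, σ²) density exactly. A sketch with arbitrary illustrative values of µ and σ:

```python
import math

def phi(x):
    """Standard normal pdf f_X."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def f_Y_change_of_vars(y, mu, sigma):
    """Theorem 3.6.1 with g(x) = mu + sigma*x, so g^{-1}(y) = (y - mu)/sigma
    and |d g^{-1}(y)/dy| = 1/sigma."""
    return phi((y - mu) / sigma) / sigma

def f_Y_direct(y, mu, sigma):
    """The N(mu, sigma^2) density written out directly."""
    z = (y - mu) / sigma
    return math.exp(-z * z / 2) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 1.5, 2.0  # arbitrary illustrative values

# The two formulas should agree pointwise (up to floating-point error).
max_gap = max(abs(f_Y_change_of_vars(y, mu, sigma) - f_Y_direct(y, mu, sigma))
              for y in [mu + 0.5 * k for k in range(-8, 9)])
```

The agreement is exact analytically; the numerical gap is pure floating-point noise.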
Section 3
Moments
23 / 73
Expected value
Definition (4.1.1)
Let X be a discrete random variable taking value xi ∈ X with
probability Pr (X = xi), i = 1, . . . . The expected value of X,
denoted by E (X), is defined by
E (X) = Σ_{xi ∈ X} xi Pr (X = xi).

Definition (4.1.2)
Let X be a continuous random variable with density fX (x).
The expected value of X, denoted by E (X), is defined by
E (X) = ∫_{−∞}^{∞} x fX (x) dx.
24 / 73
Definition
A median of a distribution is a value m such that
Pr (X ≤ m) ≥ 1/2 and Pr (X ≥ m) ≥ 1/2. If X is continuous, m
satisfies
∫_{−∞}^{m} fX (x) dx = ∫_{m}^{∞} fX (x) dx = 1/2.

Example (4.1.1)
Let fX (x) = 2x⁻³. Find the expected value and the median of X.

Example (4.1.2)
Let fX (x) = x⁻². Find the expected value and the median of X.
25 / 73
Theorem (4.1.1)
Let X be a discrete random variable taking value xi ∈ X with
probability Pr (X = xi), i = 1, · · · , and let φ(·) be an arbitrary
function. Then
E[φ(X)] = Σ_{xi ∈ X} φ(xi) Pr (X = xi).

Theorem (4.1.2)
Let X be a continuous random variable with density fX (x) and
let φ(·) be a function for which the integral below is defined.
Then
E[φ(X)] = ∫_{−∞}^{∞} φ(x) fX (x) dx.
26 / 73
Higher moments
Definition
- For each integer n, the n-th moment of X is defined as E(Xⁿ).
- The n-th central moment of X is defined as E{[X − E(X)]ⁿ}.
- The variance of a random variable X, written var(X), is its second central moment, var(X) = E{[X − E(X)]²}. Sometimes it is denoted by σX².
- The standard deviation of X is the positive square root of var(X). Sometimes it is denoted by σX.
27 / 73
Theorem (4.1.3)
Let (X, Y)′ be a bivariate discrete random variable taking value
(xi, yj), xi ∈ X and yj ∈ Y, with probability
Pr (X = xi, Y = yj), i, j = 1, · · · , and let φ(·, ·) be an arbitrary
function. Then
E[φ(X, Y)] = Σ_{xi ∈ X} Σ_{yj ∈ Y} φ(xi, yj) Pr (X = xi, Y = yj).

Theorem (4.1.4)
Let (X, Y)′ be a bivariate continuous random variable with
joint density fX,Y (x, y) and let φ(·, ·) be an arbitrary function
for which the integral below is defined. Then
E[φ(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} φ(x, y) fX,Y (x, y) dx dy.
28 / 73
Expected value
Some useful properties
Theorem (4.1.5)
If α is a constant, E(α) = α.
Theorem (4.1.6)
If X and Y are random variables and α and β are constants,
E(αX + βY ) = α E(X) + β E(Y ).
Theorem
If X is a random variable with finite variance, then
1. var(X) = E(X²) − [E(X)]².
2. var(αX + β) = α² var(X) for any constants α and β.
Theorem (4.1.7)
If X and Y are independent random variables,
E(XY ) = E(X) E(Y ).
29 / 73
Mixture random variable
Theorem (4.1.8)
Let X be a mixture random variable taking discrete values xi ∈ X,
i = 1, · · · , with probability Pr (X = xi) = pi, and a continuum of
values in the interval (a, b) according to the density fX (x). Then
E(X) = Σ_{xi ∈ X} xi pi + ∫_a^b x fX (x) dx.
Note that we must have
Σ_{xi ∈ X} pi + ∫_a^b fX (x) dx = 1.
30 / 73
Covariance
Definition (4.3.1)
A measure of the relationship between two random variables X
and Y is the covariance which is defined as
cov(X, Y ) = E{[X − E(X)][Y − E(Y )]}
= E(XY ) − [E(X) E(Y )].
Example (4.3.1)
               Y = 1       Y = −1      Pr (X = xi)
  X = 1        α/2         (1 − α)/2   1/2
  X = −1       (1 − α)/2   α/2         1/2
  Pr (Y = yj)  1/2         1/2         1

Compute cov(X, Y).
31 / 73
Theorem (4.3.1)
If X and Y are independent, then cov(X, Y ) = 0.
Example (4.3.2)
Let the joint probability distribution of (X, Y) be given by

               Y = −1   Y = 0   Y = 1   Pr (X = xi)
  X = 1        1/6      1/12    1/6     5/12
  X = 0        1/12     0       1/12    1/6
  X = −1       1/6      1/12    1/6     5/12
  Pr (Y = yj)  5/12     1/6     5/12    1

Compute cov(X, Y). Are X and Y independent?
32 / 73
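Example 4.3.2 illustrates a key point: zero covariance does not imply independence. The sketch below computes cov(X, Y) and tests independence directly from the joint table, using exact fractions (note: it gives away the example's answer):

```python
from fractions import Fraction as F

# Joint distribution of Example 4.3.2: joint[(x, y)] = Pr(X=x, Y=y).
joint = {(-1, 1): F(1, 6), (0, 1): F(1, 12), (1, 1): F(1, 6),
         (-1, 0): F(1, 12), (0, 0): F(0), (1, 0): F(1, 12),
         (-1, -1): F(1, 6), (0, -1): F(1, 12), (1, -1): F(1, 6)}

# cov(X, Y) = E(XY) - E(X) E(Y), computed exactly.
e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov = e_xy - e_x * e_y

# Marginals, then the cell-by-cell independence check.
px = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (-1, 0, 1)}
py = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (-1, 0, 1)}
independent = all(joint[(x, y)] == px[x] * py[y]
                  for x in (-1, 0, 1) for y in (-1, 0, 1))
```

The cell (0, 0) alone already breaks independence, even though the covariance vanishes by symmetry.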
Example (4.3.4)
Let the joint density be
fX,Y (x, y) = x + y if 0 < x < 1 and 0 < y < 1, and 0 otherwise.
Calculate cov(X, Y).
33 / 73
Conditional mean and variance
Definition
Let (X, Y)′ be a bivariate discrete random variable taking
values (xi, yj)′, i, j = 1, · · · . Let Pr (Y = yj |X = x) be the
probability of Y = yj given X = x. The conditional expectation
of Y given X = x, denoted by E(Y |X = x), is defined by
E(Y |X = x) = Σ_{yj ∈ Y} yj Pr (Y = yj |X = x).
The conditional variance of Y given X = x, denoted by
var(Y |X = x), is defined by
var(Y |X = x) = Σ_{yj ∈ Y} [yj − E(Y |X = x)]² Pr (Y = yj |X = x).
34 / 73
Conditional mean and variance
Definition
Let (X, Y)′ be a bivariate continuous random variable and let
fY |X (y|x) be the density function of Y given X = x. Then the
conditional expectation of Y given X = x is defined by
E(Y |X = x) = ∫_{−∞}^{∞} y fY |X (y|x) dy.
The conditional variance of Y given X = x, denoted by
var(Y |X = x), is defined by
var(Y |X = x) = ∫_{−∞}^{∞} [y − E(Y |X = x)]² fY |X (y|x) dy.
35 / 73
Theorem (4.4.1 Law of iterated expectation — LIE)
If X and Y are any random variables, then
E(Y ) = E[E(Y |X)],
(1)
provided that the expectations exist.
Remark
Equation (1) contains an abuse of notation, since the "E's"
used stand for different expectations in the same equation. The
"E" on the left-hand side of (1) is the expectation wrt the marginal
distribution of Y. The first "E" on the right-hand side of (1) is
the expectation wrt the marginal distribution of X, while
the second one stands for the expectation wrt the conditional
distribution of Y |X = x. Some people write
E(Y) = EX [EY |X (Y |X)].
36 / 73
Proof of Th. 4.4.1.
Proof for the continuous case only (discrete case is similar).
E(Y) = ∫_{−∞}^{∞} y fY (y) dy = ∫_{−∞}^{∞} y [∫_{−∞}^{∞} fY,X (y, x) dx] dy
     = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y fY,X (y, x) dx dy
     = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y fY |X (y|x) fX (x) dx dy
     = ∫_{−∞}^{∞} [∫_{−∞}^{∞} y fY |X (y|x) dy] fX (x) dx
     = ∫_{−∞}^{∞} E(Y |X = x) fX (x) dx = ∫_{−∞}^{∞} g(x) fX (x) dx = E[g(X)]
     = E[E(Y |X)],
where g(x) = E(Y |X = x).
37 / 73
Theorem (4.4.2)
var(Y ) = E[var(Y |X)] + var[E(Y |X)].
Proof.
varY (Y) = EY (Y²) − [EY (Y)]² = EX [EY |X (Y²|X)] − [EY (Y)]²
= EX {varY |X (Y |X) + [EY |X (Y |X)]²} − [EY (Y)]²
= EX [varY |X (Y |X)] + EX {[EY |X (Y |X)]²} − [EY (Y)]²
= EX [varY |X (Y |X)] + EX {[EY |X (Y |X)]²} − {EX [EY |X (Y |X)]}²
= EX [varY |X (Y |X)] + EX {[g(X)]²} − {EX [g(X)]}²
= EX [varY |X (Y |X)] + varX [g(X)]
= EX [varY |X (Y |X)] + varX [EY |X (Y |X)].
38 / 73
Example (4.4.1)
Suppose fX (x) = 1 for 0 < x < 1 and equal to zero otherwise,
and fY |X (y|x) = 1/x for 0 < y < x and equal to zero otherwise.
Calculate E(Y ).
Example (4.4.2)
The marginal density of X is given by fX (x) = 1, 0 < x < 1.
The conditional probability of Y given X = x is given by
Pr (Y = 1|X = x) = x
Pr (Y = 0|X = x) = 1 − x.
Find E(Y ) and var(Y ).
39 / 73
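The LIE can also be checked by simulation. Using the setup of Example 4.4.1 (X uniform on (0, 1) and Y | X = x uniform on (0, x)), the iterated expectation gives E(Y) = E[E(Y | X)] = E(X/2) = 1/4, so a Monte Carlo sample mean should land close to that value. The sample size and seed below are arbitrary choices:

```python
import random

random.seed(0)
n = 200_000

# Example 4.4.1 setup: X ~ U(0, 1) and Y | X = x ~ U(0, x).
# By the LIE, E(Y) = E[E(Y | X)] = E(X / 2) = 1/4.
total = 0.0
for _ in range(n):
    x = random.random()          # draw X
    total += random.uniform(0.0, x)  # draw Y given X = x

mean_y = total / n  # should be close to 0.25
```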
Section 4
Some special distributions
40 / 73
Bernoulli
Definition
A Bernoulli experiment is a random experiment, the outcome of
which can be classified in but one of two mutually exclusive and
exhaustive ways (e.g., success or failure, female or male, life or
death).
Definition
Let X be a random variable associated with a Bernoulli trial
defined as follows
X(success) = 1
X(failure) = 0.
Let p be any real number between 0 and 1, 0 ≤ p ≤ 1. The
probability mass function of X can be written as
Pr (X = x) = px (1 − p)1−x
x = 0, 1,
and we say that X has a Bernoulli distribution and we write
X ∼ Bernoulli (p).
41 / 73
Exercise
Suppose X ∼ Bernoulli (p), where 0 ≤ p ≤ 1. Compute E (X)
and var(X).
Exercise
Suppose Y |X = x ∼ Bernoulli (F (α + βx)), where F (·) is a
function which maps R → [0, 1]. Compute E (Y |X = x) and
var(Y |X = x).
42 / 73
Binomial distribution
Definition (5.1.1)
Let X1, · · · , Xn be n random variables, independently and
identically distributed as Bernoulli (p). Let Y = Σ_{i=1}^n Xi. The
distribution of Y is called the binomial distribution, and we
write Y ∼ Bin(n, p).

Theorem (5.1.1)
Let Y ∼ Bin(n, p). Then
Pr(Y = k) = (n choose k) p^k (1 − p)^{n−k},   where (n choose k) = n! / [k!(n − k)!],
E(Y) = np,   var(Y) = np(1 − p),
where 0 ≤ k ≤ n, k! = 1 × · · · × k, and 0! = 1.
43 / 73
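Theorem 5.1.1 can be verified by brute force for small n: compute the pmf from the binomial formula, then evaluate the mean and variance as explicit sums over k. The values of n and p below are purely illustrative:

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(Y = k) = C(n, k) p^k (1 - p)^(n - k) for Y ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative values
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# E(Y) and var(Y) computed directly from the pmf.
mean = sum(k * pmf[k] for k in range(n + 1))               # should equal n p
var = sum(k * k * pmf[k] for k in range(n + 1)) - mean**2  # n p (1 - p)
```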
Normal distribution
Definition (5.2.1)
Let µ and σ be any two real numbers, with σ > 0. We say a
random variable X has a normal distribution if its pdf is
fX (x|µ, σ) = [1/(σ√(2π))] exp[−(1/2)((x − µ)/σ)²],   x ∈ R,
and we write X ∼ N(µ, σ²).
44 / 73
Uniform distribution
Definition
Let a and b be any two real numbers, with a < b. We say a random
variable X has a uniform distribution if its pdf is
fX (x|a, b) = 1/(b − a) if x ∈ [a, b], and 0 otherwise,
and we write X ∼ U(a, b).
45 / 73
Chi-square distribution
Definition (1 in Appendix)
Let X1, · · · , Xn be n random variables, independently and
identically distributed as N(0, 1). Let Y = Σ_{i=1}^n Xi². The
distribution of Y is called the chi-square distribution with n
degrees of freedom, and we write Y ∼ χ²n.
Exercise
Let Y ∼ χ2n . Show that E(Y ) = n and var(Y ) = 2n.
Hint: if X ∼ N(0, 1), then E(X 4 ) = 3.
Exercise
Let X ∼ χ2n and Y ∼ χ2m . Suppose X and Y are independent.
Show that X + Y ∼ χ2n+m .
46 / 73
Student’s t distribution
Definition (2 in Appendix)
Let Y and X be two independent random variables with
Y ∼ N(0, 1) and X ∼ χ²n. Let Z be defined as Z = √n Y / √X.
The distribution of Z is called the Student's t distribution with
n degrees of freedom, and we write Z ∼ tn.
47 / 73
Multivariate normal distribution
Definition
Let µ be a vector of constants and let Σ be a positive definite
matrix. We say that a random vector X has a multivariate
normal distribution if, for x ∈ Rⁿ, its pdf is
fX (x) = [1 / ((2π)^{n/2} |Σ|^{1/2})] exp[−(1/2)(x − µ)′Σ⁻¹(x − µ)],
where |Σ| is the determinant of Σ. A square root Σ^{1/2} can be
defined as
Σ^{1/2} = S′Λ^{1/2}S,   so that   Σ = S′ΛS = S′Λ^{1/2}SS′Λ^{1/2}S = Σ^{1/2}Σ^{1/2},
where Λ^{1/2} = diag(√λ1, . . . , √λn), the λi are the eigenvalues of Σ,
and S is the orthogonal matrix having as columns the eigenvectors of Σ.
48 / 73
Section 5
Statistical inference
49 / 73
Statistical inference
In a simple econometric problem, we are dealing with a random
variable X of interest, but its probability density function
fX (x) is unknown. Our ignorance about fX (x) can be roughly
classified in two ways:
(a) fX (x) is not known but we make some assumptions about
some characteristics of it (for instance, we specify the first
and second moment).
(b) the form of fX (x) is known down to a parameter θ (or a
vector of parameters, in which case we write θ). Example:
X ∼ N(µ, σ 2 ), where θ = (µ, σ)0 is unknown.
For simplicity, let us start from (b): we (assume we) know
fX (x) but we don’t know θ, so we want to estimate it.
50 / 73
Sampling
Our information about the unknown parameter θ comes from a
sample of X. The sample observations are assumed to have the
same distribution as X, and we denote them as random
variables X1, X2, · · · , Xn, where n is an integer which denotes
the sample size. When the sample is actually drawn, we use
lowercase letters x1, x2, · · · , xn for the values, or realizations, of
the sample.
Definition
If the random variables X1 , · · · , Xn are independent and
identically distributed (i.i.d.), then these random variables
constitute a random sample of size n from the common
distribution.
51 / 73
Maximum likelihood estimator
Definition (7.3.2, random sample, basic case)
Let X1, · · · , Xn be n independent and identically distributed
random variables (a random sample), with probability density
function fX (x|θ), where θ is an unknown parameter, and let xi
denote the observed value of Xi. Then we call
L(θ|x1, · · · , xn) = Π_{i=1}^n fX (xi |θ)
the likelihood function of θ given x1, · · · , xn. The value that
maximizes L(·|x1, · · · , xn) is called the maximum likelihood
estimator of θ, and it is written as θ̂ or θ̂ML.
Example (7.7.3, Normal distribution)
Let X1, · · · , Xn be a random sample from N(µ, σ²) and let
x1 , · · · , xn be their observed values. Compute the maximum
likelihood estimator of µ and σ.
52 / 73
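The definition says the MLE is whatever value maximizes L (equivalently, log L). For the normal mean this can be seen numerically: a crude grid search over µ (with σ held fixed) should land on the sample mean, which is the known analytic maximizer. The data below are made up for illustration:

```python
import math

data = [2.1, 1.7, 3.4, 2.9, 2.2, 1.5, 2.8]  # hypothetical observed sample

def log_likelihood(mu, sigma, xs):
    """log L(mu, sigma | x_1..x_n): sum of log N(mu, sigma^2) densities."""
    n = len(xs)
    return (-n * math.log(sigma) - n * 0.5 * math.log(2 * math.pi)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

# Crude grid search for the maximizer over mu (sigma fixed at 1 for simplicity;
# for the normal mean, the maximizer in mu does not depend on sigma).
grid = [i / 1000 for i in range(1000, 4000)]
mu_hat = max(grid, key=lambda m: log_likelihood(m, 1.0, data))

xbar = sum(data) / len(data)  # the known analytic answer for the normal mean
```

The grid maximizer agrees with x̄ up to the grid resolution of 0.001.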
Example (7.7.1, Binomial distribution)
Let X1, · · · , Xn be a random sample from Bin(n, p) and let
x1 , · · · , xn be their observed values. Compute the maximum
likelihood estimator of p.
Example (Uniform distribution)
Let X1, · · · , Xn be i.i.d. with the uniform (0, θ) density, that is,
fX (x|θ) = 1/θ if 0 < x < θ, and 0 otherwise.
Find the maximum likelihood estimator of θ.
53 / 73
ML estimator: from statistics to econometrics
Let us consider n continuous random vectors Z1, · · · , Zn, where
Zi = (Yi, Xi)′, i = 1, · · · , n. Suppose the following assumptions
hold:
A1 (independence) for all i ≠ j, Zi is independent of Zj:
    Zi ⊥⊥ Zj, i ≠ j ⇒ fZi,Zj (·, ·) = fZi (·) fZj (·).
A2 (identically distributed)
    fZ1 (·|θ) = fZ2 (·|θ) = · · · = fZn (·|θ),
    so the subscript i can be dropped and we can refer generally
    to fZ (·|θ).
A3 (distributional assumptions)
    A3.a fZ (z|θ) = fY,X (y, x|θ) = fY |X (y|x; θ) fX (x), with
         fX (x) > 0, x ∈ R.
    A3.b Y |X = x ∼ N(α + βx, σ²), that is,
         fY |X (y|x; θ) = [1/(σ√(2π))] exp[−(1/2)((y − α − βx)/σ)²],
         where θ = (α, β, σ)′, with σ > 0.
54 / 73
ML estimator
L(θ; z1, . . . , zn) = fZ1,...,Zn (z1, . . . , zn |θ)
  = Π_{i=1}^n fZi (zi |θ)                  (by A1)
  = Π_{i=1}^n fZ (zi |θ)                   (by A2)
  = Π_{i=1}^n fY |X (yi |xi ; θ) fX (xi)   (by A3.a)
Note: As fX (·) > 0 and it does not depend on θ, then
θ̂n(ML) = arg max_{θ∈Θ} L(θ; z1, . . . , zn) = arg max_{θ∈Θ} Π_{i=1}^n fY |X (yi |xi ; θ).
55 / 73
ML estimator
We can further simplify the problem by taking logs:
θ̂n(ML) = arg max_θ Π_{i=1}^n fY |X (yi |xi ; θ)
  = arg max_θ log[Π_{i=1}^n fY |X (yi |xi ; θ)]
  = arg max_θ Σ_{i=1}^n log fY |X (yi |xi ; θ)
  = arg max_{α,β,σ} Σ_{i=1}^n log{[1/(σ√(2π))] exp[−(yi − α − βxi)²/(2σ²)]}
  = arg max_{α,β,σ} [−n log σ − (1/(2σ²)) Σ_{i=1}^n (yi − α − βxi)²].
56 / 73
ML estimator
FOC with respect to α, β, and σ:
−(1/(2σ̂²)) Σ_{i=1}^n 2(yi − α̂ − β̂xi)(−1) = 0 ⇔ α̂ = ȳ − β̂x̄,
−(1/(2σ̂²)) Σ_{i=1}^n 2(yi − α̂ − β̂xi)(−xi) = 0 ⇔ β̂ = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²,
−n/σ̂ + (1/σ̂³) Σ_{i=1}^n (yi − α̂ − β̂xi)² = 0 ⇔ σ̂² = (1/n) Σ_{i=1}^n (yi − α̂ − β̂xi)²,
where the result for β̂ uses Σ_{i=1}^n (xi − x̄)x̄ = 0 and Σ_{i=1}^n (yi − ȳ)x̄ = 0.
57 / 73
Method of moments (MM ) estimator
Let us (again) consider n continuous random vectors Z1, · · · , Zn,
where Zi = (Yi, Xi)′, i = 1, · · · , n. Suppose the following
assumptions hold:
B1 (independence) The same as Assumption A1.
B2 (identically distributed) The same as Assumption A2.
B3 (distributional assumptions)
    B3.a E(Y |X = x) = α + βx.
    B3.b var(Y |X = x) = E{[Y − E(Y |X = x)]²} = σ².
Note:
- B3 is a much weaker assumption than A3. In A3, we assume we know exactly the entire shape of the distribution up to some unknown parameters we want to estimate. In B3 we suppose we know only the mean and variance (up to some unknown parameters we want to estimate).
- The name comes from the fact that we use (the first and second) moments.
58 / 73
Note:
If you have never heard of the method of moments estimator,
just think of it as the ordinary least squares method, and
substitute MM with LS in these slides.
59 / 73
MM estimator
Define
U = Y − E(Y |X) = Y − α − βX
and note that
E(U |X) = E[Y − E(Y |X)|X] = E(Y |X) − E(Y |X) = 0, which
can be used to prove the following moment conditions:
E(U ) = E[E(U |X)] = 0
E(XU ) = E[E(XU |X)] = E[X E(U |X)] = 0
var(U ) = E[var(U |X)] + var[E(U |X)] = E(σ 2 ) + 0 = σ 2 ,
that is
E(Y − α − βX) = 0
E[X(Y − α − βX)] = 0
E[(Y − α − βX)2 ] − σ 2 = 0.
60 / 73
MM estimator
To estimate α, β and σ, replace (i) expectations with sample
averages, and (ii) random variables with their realized values:¹
(1/n) Σ_{i=1}^n (yi − α̂ − β̂xi) = 0 ⇔ α̂ = ȳ − β̂x̄,
(1/n) Σ_{i=1}^n [xi (yi − α̂ − β̂xi)] = 0 ⇔ β̂ = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²,
(1/n) Σ_{i=1}^n [(yi − α̂ − β̂xi)²] − σ̂² = 0 ⇔ σ̂² = (1/n) Σ_{i=1}^n (yi − α̂ − β̂xi)².

¹ This estimation method (doing in the sample what you would have
done in the population) is called the analog estimation method. It is good for
you to know, but you don't need to remember this.
61 / 73
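These sample moment conditions can be checked on a toy data set: after computing α̂, β̂, and σ̂² from the formulas above, the fitted residuals satisfy the first two empirical moment conditions exactly (up to floating-point error). The (xi, yi) pairs below are hypothetical:

```python
# Hypothetical paired observations (x_i, y_i).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 3.9, 6.1, 8.0, 9.8]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# MM (equivalently, OLS) estimates from the closed-form solutions.
beta_hat = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))
alpha_hat = ybar - beta_hat * xbar
resid = [y - alpha_hat - beta_hat * x for x, y in zip(xs, ys)]
sigma2_hat = sum(u ** 2 for u in resid) / n

# Empirical versions of E(U) = 0 and E(XU) = 0, which hold by construction.
m1 = sum(resid) / n
m2 = sum(x * u for x, u in zip(xs, resid)) / n
```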
Remember
- An estimator is a random variable.
- Being a function of the data, the estimator has a probability distribution, a mean, a variance, and so on.
- A particular realization of this random variable is called the estimate.
Definition (Included for completeness, but you can skip it)
Let X1, . . . , Xn be a random sample of size n from a population
and let T(x1, . . . , xn) be a real-valued or vector-valued function
whose domain includes the sample space of (X1, . . . , Xn). Then
the random variable or random vector Y = T(X1, . . . , Xn) is
called a statistic. The probability distribution of a statistic Y is
called the sampling distribution of Y.
62 / 73
Section 6
Large sample theory
63 / 73
Convergence concepts
Definition (6.1.2)
A sequence of random variables, X1, X2, · · · , converges in
probability to a random variable X if, for every ε > 0,
lim_{n→∞} Pr (|Xn − X| ≥ ε) = 0
or, equivalently,
lim_{n→∞} Pr (|Xn − X| < ε) = 1.
We write Xn →p X or plim Xn = X. The last equality reads
"the probability limit of Xn is X".

Theorem
Suppose that X1, X2, · · · converges in probability to a variable
X and that h is a continuous function. Then h(X1), h(X2), · · ·
converges in probability to h(X).
64 / 73
Definition (6.1.4)
A sequence of random variables, X1, X2, . . ., converges in
distribution to a random variable X if
lim_{n→∞} FXn (x) = FX (x)
at all points x where FX (x) is continuous. We write Xn →d X,
and we call FX (·) the limit distribution of the sequence
X1, X2, . . ..

Theorem (6.1.2)
If Xn →p X, then Xn →d X.

Theorem (6.1.4, Slutsky's theorem)
If Xn →d X and Yn →p a, a constant, then
(a) Xn + Yn →d X + a;
(b) Xn Yn →d aX.
65 / 73
Theorem (Weak law of large numbers)
Let X1, X2, · · · be a sequence of independent and identically
distributed (i.i.d.) random variables with E(Xi) = µ and
var(Xi) = σ² < ∞. Define X̄n := (1/n) Σ_{i=1}^n Xi. Then, for
every ε > 0,
lim_{n→∞} Pr (|X̄n − µ| < ε) = 1,
that is, X̄n →p µ.
66 / 73
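A short simulation illustrates the WLLN: the deviation |X̄n − µ| of the sample mean of U(0, 1) draws from µ = 1/2 becomes small as n grows. The seed and sample sizes are arbitrary choices:

```python
import random

random.seed(42)
mu = 0.5  # mean of a U(0, 1) draw

def sample_mean(n):
    """X̄_n for n i.i.d. U(0, 1) draws."""
    return sum(random.random() for _ in range(n)) / n

# One realization of |X̄_n - mu| for each sample size.
devs = {n: abs(sample_mean(n) - mu) for n in (10, 1_000, 100_000)}
```

For a single realization the deviations need not decrease monotonically, but at n = 100,000 the deviation is reliably tiny.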
Theorem (Central limit theorem)
Let X1, X2, · · · be a sequence of independent and identically
distributed (i.i.d.) random variables with E(Xi) = µ and
var(Xi) = σ², both µ and σ² finite. Define
X̄n := (1/n) Σ_{i=1}^n Xi. Let Gn (x) be the cumulative distribution
function of √n (X̄n − µ)/σ. Then, for any x, −∞ < x < ∞,
lim_{n→∞} Gn (x) = ∫_{−∞}^{x} (1/√(2π)) e^{−y²/2} dy,
that is, √n (X̄n − µ)/σ has a limiting standard normal
distribution, and we write (equivalently):
√n (X̄n − µ)/σ →d N(0, 1),   or   √n (X̄n − µ) →d N(0, σ²),
or   (1/√n) Σ_{i=1}^n (Xi − µ) →d N(0, σ²).
67 / 73
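The CLT can be illustrated by simulating many replications of √n(X̄n − µ)/σ for U(0, 1) data and checking that roughly 68% of the standardized means fall within one unit of zero, as a standard normal variable would. All tuning constants below are arbitrary:

```python
import math
import random

random.seed(1)
n, reps = 50, 4_000
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and sd of a U(0, 1) draw

# Distribution of sqrt(n) * (X̄_n - mu) / sigma across many replications.
zs = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append(math.sqrt(n) * (xbar - mu) / sigma)

# For a standard normal, Pr(|Z| <= 1) ≈ 0.6827.
share_within_1 = sum(abs(z) <= 1 for z in zs) / reps
```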
Two last tools from asymptotic theory
This slide can be skipped
Theorem (continuous mapping theorem)
Let X1, X2, · · · be a sequence of random variables and let h(·)
be a continuous function.
1. If X1, X2, · · · converges in probability to a variable X,
then h(X1), h(X2), · · · converges in probability to h(X).
2. If X1, X2, · · · converges in distribution to a variable X,
then h(X1), h(X2), · · · converges in distribution to h(X).

Theorem (Delta method)
Let X1, . . . , Xn be a sequence of random variables that satisfies
√n (Xn − θ) →d N(0, σ²). For a given function g(·) and a
specific value θ, suppose that g′(θ) exists and is not 0. Then
√n [g(Xn) − g(θ)] →d N(0, σ² [g′(θ)]²).
68 / 73
Asymptotic normality of OLS
Simple regression model
Let us focus on the MM estimator only and suppose
Assumptions B1–B3 hold. It is possible to prove (see
Wooldridge, 2013, Chapter 5 Appendix, p. 684) that
√n (β̂n − β) →d N(0, σ²/var(X)),
where σ² = var(Y |X) = var(U) and
(1/n) Σ_{i=1}^n (xi − x̄)² →p E{[X − E(X)]²} = var(X).
For simplicity, let us use σX² as shorthand notation for var(X). It
follows from the results above that
√n (β̂n − β) / (σ/σX) →d N(0, 1).
69 / 73
Section 7
Hypothesis testing
70 / 73
Hypothesis testing
Definition
A hypothesis is a statement about a population parameter.
Definition
The two complementary hypotheses in a hypothesis testing
problem are called the null hypothesis and the alternative
hypothesis. They are denoted by H0 and H1 (or HA ),
respectively.
Example
Let θ denote the expected change in a patient’s blood pressure
after taking a drug. You might be interested in testing
H0 : θ = 0 versus H1 : θ ≠ 0.
71 / 73
Hypothesis testing
Definition
A hypothesis testing procedure or hypothesis test is a rule that
specifies:
(i) For which sample values the decision is made to accept H0
as true.
(ii) For which sample values H0 is rejected and H1 is accepted
as true.
The subset of the sample space for which H0 will be rejected is
called rejection region. The complement of the rejection region
is called acceptance region.
72 / 73
Hypothesis testing
Table: Two types of errors in hypothesis testing

                    Decision
  Truth        Accept H0           Reject H0
  H0           Correct decision    Type I error
  H1           Type II error       Correct decision
73 / 73
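A simulation ties the table to the earlier asymptotics: under H0 : θ = 0, a two-sided test that rejects when the t-statistic exceeds the N(0, 1) critical value 1.96 should commit a Type I error roughly 5% of the time (slightly more here, since n is finite and a normal critical value is used in place of the exact t one). All constants below are illustrative:

```python
import math
import random

random.seed(7)
n, reps = 30, 4_000
crit = 1.96  # two-sided 5% critical value of the N(0, 1) distribution

# Under H0: theta = 0, draw N(0, 1) samples and reject when |t| > crit.
# The empirical rejection rate estimates the Type I error probability.
rejections = 0
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # sample variance
    t_stat = math.sqrt(n) * xbar / math.sqrt(s2)
    if abs(t_stat) > crit:
        rejections += 1

type1_rate = rejections / reps  # close to the nominal 5% level
```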