Probability and Statistics
EC961, University of Warwick
September 2014. This version: September 24, 2014.

Introduction

Reading:
- Main reference: Introduction to Statistics and Econometrics, by Amemiya, 1994.
- Suitable textbooks:
  - Statistical Inference, by Casella and Berger, second edition.
  - Probability and Statistics, by DeGroot and Schervish, fourth edition.
  - Introduction to Mathematical Statistics, by Hogg, McKean, and Craig, seventh edition.
  - Introduction to Probability and Mathematical Statistics, by Bain and Engelhardt, second edition.

Outline:
- Probability (Ch. 2);
- Random variables and probability distributions (Ch. 3);
- Moments (Ch. 4);
- Point estimation (Ch. 7 and 10);
- Large sample theory (Ch. 6);
- Tests of hypotheses (Ch. 9).

Section 1: Probability

Set theory

Definition. The set S of all possible outcomes of a particular experiment is called the sample space for the experiment.

Definition. An event is any collection of possible outcomes of an experiment, that is, any subset of S (including S itself).

Definition (Axioms of probability, not fully rigorous). Given a sample space S, a probability function is a function Pr(·) that satisfies
1. Pr(A) ≥ 0 for any event A.
2. Pr(S) = 1.
3. If {A_i}, i = 1, 2, …, are mutually exclusive (that is, A_i ∩ A_j = ∅ for all i ≠ j), then
   Pr(A_1 ∪ A_2 ∪ · · ·) = Pr(A_1) + Pr(A_2) + · · ·.

Conditional probability

Definition. If A and B are events in S, and Pr(B) > 0, then the conditional probability of A given B, written Pr(A | B), is defined as
   Pr(A | B) = Pr(A ∩ B) / Pr(B).

Theorem (2.4.2, Bayes; skip). Let events A_1, A_2, …, A_n be mutually exclusive such that Pr(A_1 ∪ A_2 ∪ · · · ∪ A_n) = 1 and Pr(A_i) > 0 for each i. Let E be an arbitrary event such that Pr(E) > 0. Then
   Pr(A_i | E) = Pr(E | A_i) Pr(A_i) / Σ_{j=1}^n Pr(E | A_j) Pr(A_j),   i = 1, …, n.
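Bayes' theorem above is just a short computation once the priors Pr(A_i) and the likelihoods Pr(E | A_i) are given. A minimal sketch, with made-up numbers (a two-event partition, e.g. "ill"/"healthy", and E = "test positive" — these figures are illustrative assumptions, not from the slides):

```python
# Hypothetical numbers for illustration: partition {A1, A2} with priors
# 0.01 / 0.99, and an event E with Pr(E|A1) = 0.95, Pr(E|A2) = 0.05.
priors = [0.01, 0.99]
likelihoods = [0.95, 0.05]  # Pr(E | A_i)

def bayes(priors, likelihoods):
    """Posterior Pr(A_i | E) via Bayes' theorem (Theorem 2.4.2)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # Pr(E), by the law of total probability
    return [j / total for j in joint]

posterior = bayes(priors, likelihoods)
print(posterior)  # the posteriors sum to 1
```

The denominator is the same for every i, which is why the posteriors always sum to one.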
Independence

Definition. Two events, A and B, are statistically independent if
   Pr(A ∩ B) = Pr(A) Pr(B).
Note that independence could have been equivalently defined by either Pr(A) = Pr(A | B) or Pr(B) = Pr(B | A) (provided Pr(B) > 0 for the former and Pr(A) > 0 for the latter).

Section 2: Random variables and probability distributions

Definition (3.1.1). A random variable is a variable that takes values according to a certain probability distribution.

Definition (3.1.2). A random variable is a function from a sample space S into the real numbers.

Definition (3.2.1). A discrete random variable is a variable that takes a countable number of real numbers with certain probabilities. The probability distribution of a discrete random variable is completely characterized by the equations
   Pr(X = x_i) = p_i,   i = 1, …, n,
where x_i ∈ 𝒳 = {x_1, x_2, …}.

A word on notation

Suppose we have a sample space S = {s_1, …, s_n} with probability function Pr(·) and we define a random variable X with range 𝒳 = {x_1, …, x_m}. We can define the probability function of X as
   Pr_X(X = x_i) = Pr({s_j ∈ S : X(s_j) = x_i}),
that is, we will observe X = x_i if and only if the outcome of the random experiment is an s_j ∈ S such that X(s_j) = x_i. To avoid overly heavy notation, we write Pr(X = x_i) instead of Pr_X(X = x_i).

Another word on notation

In the statistics literature, random variables are usually denoted by uppercase letters and the realized values of a variable by the corresponding lowercase letters. For instance, the random variable X can take value x_i. This is not the case in the econometrics literature, where uppercase letters (in boldface) are reserved for matrices. Recall the well-known formula of the OLS estimator β̂ = (X′X)⁻¹X′y. Unless stated otherwise, these notes follow the notation used in statistics (X random variable, x its realization; X random vector, x its realization).
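Definition 3.1.2 and the Pr_X construction can be made concrete with a tiny experiment. A sketch, assuming (hypothetically) two fair coin tosses as the sample space and X = number of heads:

```python
from fractions import Fraction

# Hypothetical experiment: two fair coin tosses. X maps each outcome in S
# to a real number (Definition 3.1.2: a random variable is a function on S).
S = ["HH", "HT", "TH", "TT"]
prob = {s: Fraction(1, 4) for s in S}
X = lambda s: s.count("H")  # number of heads

def pr_X(x):
    """Pr_X(X = x) = Pr({s in S : X(s) = x})."""
    return sum(prob[s] for s in S if X(s) == x)

print([str(pr_X(x)) for x in (0, 1, 2)])  # masses 1/4, 1/2, 1/4
```

The induced probabilities are computed entirely from Pr(·) on S, which is exactly why the shorthand Pr(X = x) is harmless.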
Definition (3.2.2). A bivariate discrete random variable is a variable that takes a countable number of points on the plane with certain probabilities. The probability distribution of a bivariate discrete random variable is determined by the equations
   Pr(X = x_i, Y = y_j) = p_ij,   i = 1, …, n; j = 1, …, m.
The marginal probability of X is defined as
   Pr(X = x_i) = Σ_{j=1}^m Pr(X = x_i, Y = y_j),   i = 1, …, n,
and the conditional probability of X = x_i given Y = y_j is (assuming Pr(Y = y_j) > 0)
   Pr(X = x_i | Y = y_j) = Pr(X = x_i, Y = y_j) / Pr(Y = y_j).

Definition (3.2.3). Two discrete random variables X and Y are said to be independent if the event (X = x_i) and the event (Y = y_j) are independent for all i, j. That is,
   Pr(X = x_i, Y = y_j) = Pr(X = x_i) Pr(Y = y_j)   for all i, j,
and we write X ⊥⊥ Y.

Example (3.2.3). Let the joint probability distribution of X and Y be given by

                 Y = 1    Y = 0    Pr(X = x_i)
   X = 1         2/8      1/8      3/8
   X = 0         2/8      3/8      5/8
   Pr(Y = y_j)   4/8      4/8      1

Are X and Y independent?

Definition (3.3.1). If there is a nonnegative function f_X(x) defined over ℝ such that
   Pr(x_1 ≤ X ≤ x_2) = ∫_{x_1}^{x_2} f_X(x) dx
for any x_1 and x_2 satisfying x_1 ≤ x_2, then X is a continuous random variable and f_X(x) is called its density function. By Axiom (2) of probability, ∫_{−∞}^{∞} f_X(x) dx = 1. Note: when X is continuous, the following expressions are equivalent: Pr(x_1 ≤ X ≤ x_2), Pr(x_1 < X ≤ x_2), Pr(x_1 ≤ X < x_2), and Pr(x_1 < X < x_2).

Definition (3.4.1). If there is a nonnegative function f_{X,Y}(x, y) defined over ℝ² such that
   Pr(x_1 ≤ X ≤ x_2, y_1 ≤ Y ≤ y_2) = ∫_{y_1}^{y_2} ∫_{x_1}^{x_2} f_{X,Y}(x, y) dx dy
for any x_1, x_2, y_1, y_2 satisfying x_1 ≤ x_2 and y_1 ≤ y_2, then (X, Y)′ is a bivariate continuous random variable and f_{X,Y}(x, y) is called the (joint) density function. In order for f_{X,Y}(x, y) to be a joint density function, it must be nonnegative and
   ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1.
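The independence question in Example 3.2.3 can be checked mechanically: compute both marginals from the joint table and compare every cell with the product of its marginals (Definition 3.2.3 requires equality for all cells). A sketch with exact fractions:

```python
from fractions import Fraction

F = Fraction
# Joint distribution of Example 3.2.3: keys are (x, y), X and Y in {1, 0}.
joint = {(1, 1): F(2, 8), (1, 0): F(1, 8),
         (0, 1): F(2, 8), (0, 0): F(3, 8)}

# Marginals by summing over the other coordinate.
pX = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
pY = {y: sum(p for (_, yj), p in joint.items() if yj == y) for y in (0, 1)}

# Independent iff joint = product of marginals in EVERY cell.
independent = all(joint[(x, y)] == pX[x] * pY[y] for (x, y) in joint)
print(pX[1], pX[0], pY[1], independent)  # 3/8, 5/8, 1/2, and False
```

Here Pr(X = 1, Y = 1) = 1/4 while Pr(X = 1)Pr(Y = 1) = 3/16, so a single failing cell already settles the answer: X and Y are not independent.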
Example (3.4.1). If f_{X,Y}(x, y) = xy e^{−(x+y)} for x > 0, y > 0, and 0 otherwise, what is Pr(X > 1, Y < 1)?

Marginal densities

Definition. Let (X, Y)′ be a continuous bivariate random vector with joint probability density function f_{X,Y}(x, y). The marginal pdfs f_X(x) and f_Y(y) are defined as
   f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy,   −∞ < x < ∞,
   f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx,   −∞ < y < ∞.

Conditional densities

Definition. Let (X, Y)′ be a continuous bivariate random vector with joint probability density function f_{X,Y}(x, y) and marginal pdfs f_X(x) and f_Y(y). For any x such that f_X(x) > 0, the conditional pdf of Y given X = x is the function of y denoted by f_{Y|X}(y|x) and defined by
   f_{Y|X}(y|x) = f_{Y,X}(y, x) / f_X(x).

Definition (3.4.6). Continuous random variables X and Y are said to be independent if, for all x and y,
   f_{X,Y}(x, y) = f_X(x) f_Y(y),
and we write X ⊥⊥ Y.

Example (3.4.5). Suppose
   f_{X,Y}(x, y) = (3/2)(x² + y²) for 0 < x < 1 and 0 < y < 1, and 0 otherwise.
Calculate Pr(0 < X < 1/2, 0 < Y < 1/2) and determine whether X and Y are independent.

Distribution function

Definition. The cumulative distribution function of a random variable X, denoted by F_X(·), is defined by
   F_X(x) = Pr(X ≤ x),   x ∈ ℝ.

Example (3.5.1). Suppose f_X(x) = (1/2)e^{−x/2} if x > 0, and 0 otherwise. Find F_X(x).

Example (3.5.2). Suppose f_X(x) = 2(1 − x) if 0 < x < 1, and 0 otherwise. Find F_X(x).

Example (3.5.3, mixture random variable). Suppose
   F_X(x) = 0 if x ≤ 0;  0.5 if 0 < x ≤ 0.5;  x if 0.5 < x ≤ 1;  1 if x > 1.
Draw F_X(x).

Change of variables

Theorem (3.6.1). Let f_X(x) be the density of X and let Y = g(X), where g(·) is a monotonic and differentiable function. Then the density f_Y(y) of Y is given by
   f_Y(y) = f_X[g⁻¹(y)] · |d g⁻¹(y)/dy|,
where g⁻¹(·) denotes the inverse function of g(·) (do not mistake it for 1 over g).
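Theorem 3.6.1 can be sanity-checked numerically. A minimal sketch under an assumption chosen for convenience: X ∼ U(0, 1) and the monotone map Y = g(X) = X², so g⁻¹(y) = √y and the theorem gives f_Y(y) = 1/(2√y) on (0, 1); integrating this density up to y should reproduce the exact CDF Pr(Y ≤ y) = Pr(X ≤ √y) = √y:

```python
import math

# Density implied by Theorem 3.6.1 for Y = X^2, X ~ U(0,1):
# f_Y(y) = f_X(sqrt(y)) * |d sqrt(y)/dy| = 1 / (2 sqrt(y)),  0 < y < 1.
f_Y = lambda y: 1.0 / (2.0 * math.sqrt(y))

def integrate(f, a, b, n=100_000):
    """Midpoint rule; accurate enough for this sanity check."""
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

# Pr(Y <= y) from the transformed density vs. the exact CDF sqrt(y).
for y in (0.25, 0.5, 0.81):
    print(integrate(f_Y, 0.0, y), math.sqrt(y))
```

The two columns agree to roughly three decimals; the small gap is numerical error from the integrable singularity of f_Y at 0, not from the theorem.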
Example. Let X be a continuous random variable with pdf
   f_X(x) = (1/√(2π)) e^{−x²/2},   x ∈ ℝ,
and let Y = µ + σX, where µ and σ are two constants, with σ > 0. Determine f_Y(y).

Section 3: Moments

Expected value

Definition (4.1.1). Let X be a discrete random variable taking value x_i ∈ 𝒳 with probability Pr(X = x_i), i = 1, 2, …. The expected value of X, denoted by E(X), is defined by
   E(X) = Σ_{x_i ∈ 𝒳} x_i Pr(X = x_i).

Definition (4.1.2). Let X be a continuous random variable with density f_X(x). The expected value of X, denoted by E(X), is defined by
   E(X) = ∫_{−∞}^{∞} x f_X(x) dx.

Definition. A median of a distribution is a value m such that Pr(X ≤ m) ≥ 1/2 and Pr(X ≥ m) ≥ 1/2. If X is continuous, m satisfies
   ∫_{−∞}^{m} f_X(x) dx = ∫_{m}^{∞} f_X(x) dx = 1/2.

Example (4.1.1). Let f_X(x) = 2x⁻³, x > 1. Find the expected value and the median of X.

Example (4.1.2). Let f_X(x) = x⁻², x > 1. Find the expected value and the median of X.

Theorem (4.1.1). Let X be a discrete random variable taking value x_i ∈ 𝒳 with probability Pr(X = x_i), i = 1, 2, …, and let φ(·) be an arbitrary function. Then
   E[φ(X)] = Σ_{x_i ∈ 𝒳} φ(x_i) Pr(X = x_i).

Theorem (4.1.2). Let X be a continuous random variable with density f_X(x) and let φ(·) be a function for which the integral below is defined. Then
   E[φ(X)] = ∫_{−∞}^{∞} φ(x) f_X(x) dx.

Higher moments

Definition.
- For each integer n, the n-th moment of X is defined as E(Xⁿ).
- The n-th central moment of X is defined as E{[X − E(X)]ⁿ}.
- The variance of a random variable X, written var(X), is its second central moment, var(X) = E{[X − E(X)]²}. Sometimes it is denoted by σ²_X.
- The standard deviation of X is the positive square root of var(X). Sometimes it is denoted by σ_X.

Theorem (4.1.3). Let (X, Y)′ be a bivariate discrete random variable taking value (x_i, y_j), x_i ∈ 𝒳 and y_j ∈ 𝒴, with probability Pr(X = x_i, Y = y_j), i, j = 1, 2, …, and let φ(·, ·) be an arbitrary function. Then
   E[φ(X, Y)] = Σ_{x_i ∈ 𝒳} Σ_{y_j ∈ 𝒴} φ(x_i, y_j) Pr(X = x_i, Y = y_j).
Theorem (4.1.4). Let (X, Y)′ be a bivariate continuous random variable with joint density f_{X,Y}(x, y) and let φ(·, ·) be an arbitrary function for which the integral below is defined. Then
   E[φ(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} φ(x, y) f_{X,Y}(x, y) dx dy.

Expected value: some useful properties

Theorem (4.1.5). If α is a constant, E(α) = α.

Theorem (4.1.6). If X and Y are random variables and α and β are constants, E(αX + βY) = α E(X) + β E(Y).

Theorem. If X is a random variable with finite variance, then
1. var(X) = E(X²) − [E(X)]².
2. var(αX + β) = α² var(X) for any constants α and β.

Theorem (4.1.7). If X and Y are independent random variables, E(XY) = E(X) E(Y).

Mixture random variable

Theorem (4.1.8). Let X be a mixture random variable taking discrete values x_i ∈ 𝒳, i = 1, 2, …, with probability Pr(X = x_i) = p_i, and a continuum of values in the interval (a, b) according to density f_X(x). Then
   E(X) = Σ_{x_i ∈ 𝒳} x_i p_i + ∫_a^b x f_X(x) dx.
Note that we must have
   Σ_{x_i ∈ 𝒳} p_i + ∫_a^b f_X(x) dx = 1.

Covariance

Definition (4.3.1). A measure of the relationship between two random variables X and Y is the covariance, defined as
   cov(X, Y) = E{[X − E(X)][Y − E(Y)]} = E(XY) − E(X) E(Y).

Example (4.3.1). Let the joint probability distribution of X and Y be given by

                 Y = 1        Y = −1       Pr(X = x_i)
   X = 1         α/2          (1 − α)/2    1/2
   X = −1        (1 − α)/2    α/2          1/2
   Pr(Y = y_j)   1/2          1/2          1

Compute cov(X, Y).

Theorem (4.3.1). If X and Y are independent, then cov(X, Y) = 0.

Example (4.3.2). Let the joint probability distribution of (X, Y) be given by

                 Y = −1   Y = 0   Y = 1   Pr(X = x_i)
   X = 1         1/6      1/12    1/6     5/12
   X = 0         1/12     0       1/12    1/6
   X = −1        1/6      1/12    1/6     5/12
   Pr(Y = y_j)   5/12     1/6     5/12    1

Compute cov(X, Y). Are X and Y independent?

Example (4.3.4). Let the joint density be
   f_{X,Y}(x, y) = x + y if 0 < x < 1 and 0 < y < 1, and 0 otherwise.
Calculate cov(X, Y).

Conditional mean and variance

Definition. Let (X, Y)′ be a bivariate discrete random variable taking values (x_i, y_j)′, i, j = 1, 2, …. Let Pr(Y = y_j | X = x) be the probability of Y = y_j given X = x.
The conditional expectation of Y given X = x, denoted by E(Y | X = x), is defined by
   E(Y | X = x) = Σ_{y_j ∈ 𝒴} y_j Pr(Y = y_j | X = x).
The conditional variance of Y given X = x, denoted by var(Y | X = x), is defined by
   var(Y | X = x) = Σ_{y_j ∈ 𝒴} [y_j − E(Y | X = x)]² Pr(Y = y_j | X = x).

Conditional mean and variance

Definition. Let (X, Y)′ be a bivariate continuous random variable and let f_{Y|X}(y|x) be the density function of Y given X = x. Then the conditional expectation of Y given X = x is defined by
   E(Y | X = x) = ∫_{−∞}^{∞} y f_{Y|X}(y|x) dy.
The conditional variance of Y given X = x, denoted by var(Y | X = x), is defined by
   var(Y | X = x) = ∫_{−∞}^{∞} [y − E(Y | X = x)]² f_{Y|X}(y|x) dy.

Theorem (4.4.1, law of iterated expectations — LIE). If X and Y are any random variables, then
   E(Y) = E[E(Y | X)],   (1)
provided that the expectations exist.

Remark. Equation (1) contains an abuse of notation, since the “E’s” used stand for different expectations in the same equation. The “E” on the left-hand side of (1) is the expectation wrt the marginal distribution of Y. The first “E” on the right-hand side of (1) is the expectation wrt the marginal distribution of X, while the second one stands for the expectation wrt the conditional distribution of Y | X = x. Some people write E(Y) = E_X[E_{Y|X}(Y | X)].

Proof of Th. 4.4.1. Proof for the continuous case only (the discrete case is similar).
   E(Y) = ∫_{−∞}^{∞} y f_Y(y) dy = ∫_{−∞}^{∞} y [∫_{−∞}^{∞} f_{Y,X}(y, x) dx] dy
        = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f_{Y,X}(y, x) dx dy
        = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f_{Y|X}(y|x) f_X(x) dx dy
        = ∫_{−∞}^{∞} [∫_{−∞}^{∞} y f_{Y|X}(y|x) dy] f_X(x) dx
        = ∫_{−∞}^{∞} E(Y | X = x) f_X(x) dx = ∫_{−∞}^{∞} g(x) f_X(x) dx = E[g(X)]
        = E[E(Y | X)],
where g(x) = E(Y | X = x).

Theorem (4.4.2). var(Y) = E[var(Y | X)] + var[E(Y | X)].

Proof.
   var_Y(Y) = E_Y(Y²) − [E_Y(Y)]²
            = E_X[E_{Y|X}(Y² | X)] − [E_Y(Y)]²
            = E_X{var_{Y|X}(Y | X) + [E_{Y|X}(Y | X)]²} − [E_Y(Y)]²
            = E_X[var_{Y|X}(Y | X)] + E_X{[E_{Y|X}(Y | X)]²} − [E_Y(Y)]²
            = E_X[var_{Y|X}(Y | X)] + E_X{[E_{Y|X}(Y | X)]²} − {E_X[E_{Y|X}(Y | X)]}²
            = E_X[var_{Y|X}(Y | X)] + E_X{[g(X)]²} − {E_X[g(X)]}²
            = E_X[var_{Y|X}(Y | X)] + var_X[g(X)]
            = E_X[var_{Y|X}(Y | X)] + var_X[E_{Y|X}(Y | X)].

Example (4.4.1). Suppose f_X(x) = 1 for 0 < x < 1 and zero otherwise, and f_{Y|X}(y|x) = 1/x for 0 < y < x and zero otherwise. Calculate E(Y).

Example (4.4.2). The marginal density of X is given by f_X(x) = 1, 0 < x < 1. The conditional probability of Y given X = x is given by
   Pr(Y = 1 | X = x) = x,   Pr(Y = 0 | X = x) = 1 − x.
Find E(Y) and var(Y).

Section 4: Some special distributions

Bernoulli

Definition. A Bernoulli experiment is a random experiment, the outcome of which can be classified in but one of two mutually exclusive and exhaustive ways (e.g., success or failure, female or male, life or death).

Definition. Let X be a random variable associated with a Bernoulli trial, defined as follows: X(success) = 1, X(failure) = 0. Let p be any real number between 0 and 1, 0 ≤ p ≤ 1. The probability mass function of X can be written as
   Pr(X = x) = pˣ(1 − p)^{1−x},   x = 0, 1,
and we say that X has a Bernoulli distribution and write X ∼ Bernoulli(p).

Exercise. Suppose X ∼ Bernoulli(p), where 0 ≤ p ≤ 1. Compute E(X) and var(X).

Exercise. Suppose Y | X = x ∼ Bernoulli(F(α + βx)), where F(·) is a function which maps ℝ → [0, 1]. Compute E(Y | X = x) and var(Y | X = x).

Binomial distribution

Definition (5.1.1). Let X_1, …, X_n be n random variables, independently and identically distributed as Bernoulli(p). Let Y = Σ_{i=1}^n X_i. The distribution of Y is called the binomial distribution, and we write Y ∼ Bin(n, p).

Theorem (5.1.1). Let Y ∼ Bin(n, p). Then
   Pr(Y = k) = (n choose k) pᵏ(1 − p)^{n−k},   (n choose k) = n! / [k!(n − k)!],
   E(Y) = np,
   var(Y) = np(1 − p),
where 0 ≤ k ≤ n, k! = 1 × 2 × · · · × k, and 0! = 1.

Normal distribution

Definition (5.2.1). Let µ and σ be any two real numbers, with σ > 0. We say a random variable X has a normal distribution if its pdf is
   f_X(x | µ, σ) = (1/(σ√(2π))) exp{−(1/2)[(x − µ)/σ]²},   x ∈ ℝ,
and we write X ∼ N(µ, σ²).

Uniform distribution

Definition. Let a and b be any two real numbers, with a < b. We say a random variable X has a uniform distribution if its pdf is
   f_X(x | a, b) = 1/(b − a) if x ∈ [a, b], and 0 otherwise,
and we write X ∼ U(a, b).

Chi-square distribution

Definition (1 in Appendix). Let X_1, …, X_n be n random variables, independently and identically distributed as N(0, 1). Let Y = Σ_{i=1}^n X_i². The distribution of Y is called the chi-square distribution with n degrees of freedom, and we write Y ∼ χ²_n.

Exercise. Let Y ∼ χ²_n. Show that E(Y) = n and var(Y) = 2n. Hint: if X ∼ N(0, 1), then E(X⁴) = 3.

Exercise. Let X ∼ χ²_n and Y ∼ χ²_m. Suppose X and Y are independent. Show that X + Y ∼ χ²_{n+m}.

Student’s t distribution

Definition (2 in Appendix). Let Y and X be two independent random variables with Y ∼ N(0, 1) and X ∼ χ²_n. Let Z be defined as Z = Y/√(X/n) = √n · Y/√X. The distribution of Z is called the Student’s t distribution with n degrees of freedom, and we write Z ∼ t_n.

Multivariate normal distribution

Definition. Let µ be a vector of constants and let Σ be a positive definite matrix. We say that a random vector X has a multivariate normal distribution if, for x ∈ ℝⁿ, its pdf is
   f_X(x) = (1/((2π)^{n/2} |Σ|^{1/2})) exp{−(1/2)(x − µ)′ Σ⁻¹ (x − µ)},
where |Σ| is the determinant of Σ. The matrix square root Σ^{1/2} is defined via the spectral decomposition:
   Σ^{1/2} = S Λ^{1/2} S′,   so that   Σ = S Λ S′ = S Λ^{1/2} S′ S Λ^{1/2} S′ = Σ^{1/2} Σ^{1/2},
where Λ^{1/2} = diag(√λ_1, …, √λ_n), the λ_i are the eigenvalues of Σ, and S is the orthogonal matrix having as columns the eigenvectors of Σ.
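The Σ^{1/2} construction can be verified by hand for a small case. A minimal sketch, assuming a hypothetical 2×2 positive definite matrix Σ = [[2, 1], [1, 2]] whose eigen-decomposition is known in closed form (eigenvalues 3 and 1, orthonormal eigenvectors (1, 1)/√2 and (1, −1)/√2), and using the convention that the columns of S are the eigenvectors:

```python
import math

# Hypothetical Sigma = [[2, 1], [1, 2]] with eigenvalues 3, 1.
r = 1 / math.sqrt(2)
S = [[r, r], [r, -r]]                 # columns = eigenvectors of Sigma
L_half = [[math.sqrt(3), 0], [0, 1]]  # Lambda^{1/2} = diag(sqrt(3), sqrt(1))

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

# Sigma^{1/2} = S Lambda^{1/2} S'
half = matmul(matmul(S, L_half), transpose(S))
# Multiplying the square root by itself should recover Sigma.
print(matmul(half, half))
```

The product reproduces Σ up to floating-point error, confirming Σ^{1/2}Σ^{1/2} = Σ for this construction.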
Section 5: Statistical inference

In a simple econometric problem, we are dealing with a random variable X of interest, but its probability density function f_X(x) is unknown. Our ignorance about f_X(x) can be roughly classified in two ways:
(a) f_X(x) is not known, but we make some assumptions about some of its characteristics (for instance, we specify the first and second moments);
(b) the form of f_X(x) is known up to a parameter θ (or a vector of parameters, in which case we write θ in boldface). Example: X ∼ N(µ, σ²), where θ = (µ, σ)′ is unknown.
For simplicity, let us start from (b): we (assume we) know f_X(x) but we do not know θ, so we want to estimate it.

Sampling

Our information about the unknown parameter θ comes from a sample of X. The sample observations are assumed to have the same distribution as X, and we denote them as random variables X_1, X_2, …, X_n, where n is an integer which denotes the sample size. When the sample is actually drawn, we use lowercase letters x_1, x_2, …, x_n for the values or realizations of the sample.

Definition. If the random variables X_1, …, X_n are independent and identically distributed (i.i.d.), then these random variables constitute a random sample of size n from the common distribution.

Maximum likelihood estimator

Definition (7.3.2, random sample, basic case). Let X_1, …, X_n be n independent and identically distributed random variables (a random sample), with probability density function f_X(x|θ), where θ is an unknown parameter, and let x_i denote the observed value of X_i. Then we call
   L(θ | x_1, …, x_n) = Π_{i=1}^n f_X(x_i | θ)
the likelihood function of θ given x_1, …, x_n. The value of θ that maximizes L(· | x_1, …, x_n) is called the maximum likelihood estimator of θ, and it is written θ̂ or θ̂_ML.

Example (7.7.3, normal distribution). Let X_1, …, X_n be a random sample from N(µ, σ²) and let x_1, …, x_n be their observed values.
Compute the maximum likelihood estimators of µ and σ.

Example (7.7.1, binomial distribution). Let X_1, …, X_n be a random sample on Bin(n, p) and let x_1, …, x_n be their observed values. Compute the maximum likelihood estimator of p.

Example (uniform distribution). Let X_1, …, X_n be i.i.d. with the uniform (0, θ) density, that is,
   f_X(x|θ) = 1/θ if 0 < x < θ, and 0 otherwise.
Find the maximum likelihood estimator of θ.

ML estimator: from statistics to econometrics

Let us consider n continuous random vectors Z_1, …, Z_n, where Z_i = (Y_i, X_i)′, i = 1, …, n. Suppose the following assumptions hold:

A1 (independence): for all i ≠ j, Z_i is independent of Z_j:
   Z_i ⊥⊥ Z_j, i ≠ j  ⟹  f_{Z_i,Z_j}(·, ·) = f_{Z_i}(·) f_{Z_j}(·).
A2 (identically distributed): f_{Z_1}(·|θ) = f_{Z_2}(·|θ) = · · · = f_{Z_n}(·|θ), so the subscript i can be dropped and we can refer generally to f_Z(·|θ).
A3 (distributional assumptions):
   A3.a  f_Z(z|θ) = f_{Y,X}(y, x|θ) = f_{Y|X}(y|x; θ) f_X(x), with f_X(x) > 0, x ∈ ℝ.
   A3.b  Y | X = x ∼ N(α + βx, σ²), that is,
         f_{Y|X}(y|x; θ) = (1/(σ√(2π))) exp{−(1/2)[(y − α − βx)/σ]²},
   where θ = (α, β, σ)′, with σ > 0.

ML estimator

   L(θ; z_1, …, z_n) = f_{Z_1,…,Z_n}(z_1, …, z_n | θ)
                     = Π_{i=1}^n f_{Z_i}(z_i | θ)                  (by A1)
                     = Π_{i=1}^n f_Z(z_i | θ)                      (by A2)
                     = Π_{i=1}^n f_{Y|X}(y_i | x_i; θ) f_X(x_i)    (by A3.a)

Note: as f_X(·) > 0 and it does not depend on θ,
   θ̂_n^{(ML)} = arg max_{θ∈Θ} L(θ; z_1, …, z_n) = arg max_{θ∈Θ} Π_{i=1}^n f_{Y|X}(y_i | x_i; θ).

ML estimator

We can further simplify the problem by taking logs:
   θ̂_n^{(ML)} = arg max_θ Π_{i=1}^n f_{Y|X}(y_i | x_i; θ)
             = arg max_θ log[Π_{i=1}^n f_{Y|X}(y_i | x_i; θ)]
             = arg max_θ Σ_{i=1}^n log f_{Y|X}(y_i | x_i; θ)
             = arg max_{α,β,σ} Σ_{i=1}^n log{(1/(σ√(2π))) exp[−(y_i − α − βx_i)²/(2σ²)]}
             = arg max_{α,β,σ} {−n log σ − (1/(2σ²)) Σ_{i=1}^n (y_i − α − βx_i)²}.

ML estimator

First-order conditions with respect to α, β, and σ:
   −(1/(2σ̂²)) Σ_{i=1}^n 2(y_i − α̂ − β̂x_i)(−1) = 0  ⟺  α̂ = ȳ − β̂x̄,
   −(1/(2σ̂²)) Σ_{i=1}^n 2(y_i − α̂ − β̂x_i)(−x_i) = 0  ⟺  β̂ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²,
   −n/σ̂ + (1/σ̂³) Σ_{i=1}^n (y_i − α̂ − β̂x_i)² = 0  ⟺  σ̂² = (1/n) Σ_{i=1}^n (y_i − α̂ − β̂x_i)²,
where the result for β̂ uses Σ_{i=1}^n (x_i − x̄)x̄ = 0 and Σ_{i=1}^n (y_i − ȳ)x̄ = 0.

Method of moments (MM) estimator

Let us (again) consider n continuous random vectors Z_1, …, Z_n, where Z_i = (Y_i, X_i)′, i = 1, …, n. Suppose the following assumptions hold:

B1 (independence): the same as Assumption A1.
B2 (identically distributed): the same as Assumption A2.
B3 (distributional assumptions):
   B3.a  E(Y | X = x) = α + βx.
   B3.b  var(Y | X = x) = E{[Y − E(Y | X = x)]²} = σ².

Note:
- B3 is a much weaker assumption than A3. In A3, we assume we know exactly the entire shape of the distribution up to some unknown parameters we want to estimate. In B3 we suppose we know only the mean and variance (up to some unknown parameters we want to estimate).
- The name comes from the fact that we use (the first and second) moments.

Note: if you have never heard of the method of moments estimator, just think of it as the ordinary least squares method, and substitute MM with LS in these slides.

MM estimator

Define U = Y − E(Y | X) = Y − α − βX and note that
   E(U | X) = E[Y − E(Y | X) | X] = E(Y | X) − E(Y | X) = 0,
which can be used to prove the following moment conditions:
   E(U) = E[E(U | X)] = 0,
   E(XU) = E[E(XU | X)] = E[X E(U | X)] = 0,
   var(U) = E[var(U | X)] + var[E(U | X)] = E(σ²) + 0 = σ²,
that is,
   E(Y − α − βX) = 0,
   E[X(Y − α − βX)] = 0,
   E[(Y − α − βX)²] − σ² = 0.
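Replacing the expectations in these moment conditions with sample averages and solving gives the familiar simple-regression estimates. A minimal sketch on made-up toy data (the x and y values below are illustrative assumptions, not from the slides):

```python
# Sample analogue of the moment conditions: replace E(.) with sample
# averages and solve for alpha-hat, beta-hat, sigma2-hat. Toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

beta_hat = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))
alpha_hat = ybar - beta_hat * xbar
resid = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
sigma2_hat = sum(u ** 2 for u in resid) / n

# The first two sample moment conditions hold exactly at the estimates:
print(sum(resid) / n, sum(xi * u for xi, u in zip(x, resid)) / n)
print(alpha_hat, beta_hat, sigma2_hat)
```

By construction the residuals average to zero and are orthogonal to x in the sample, mirroring E(U) = 0 and E(XU) = 0 in the population.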
MM estimator

To estimate α, β, and σ², replace (i) expectations with sample averages, and (ii) random variables with their realized values:¹
   (1/n) Σ_{i=1}^n (y_i − α̂ − β̂x_i) = 0  ⟺  α̂ = ȳ − β̂x̄,
   (1/n) Σ_{i=1}^n [x_i(y_i − α̂ − β̂x_i)] = 0  ⟺  β̂ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²,
   (1/n) Σ_{i=1}^n [(y_i − α̂ − β̂x_i)²] − σ̂² = 0  ⟺  σ̂² = (1/n) Σ_{i=1}^n (y_i − α̂ − β̂x_i)².

¹ This estimation method — doing in the sample what you would have done in the population — is called the analog estimation method. It is good for you to know, but you don’t need to remember this.

Remember:
- An estimator is a random variable.
- Being a function of the data, the estimator has a probability distribution, a mean, a variance, and so on.
- A particular realization of this random variable is called the estimate.

Definition (included for completeness, but you can skip it). Let X_1, …, X_n be a random sample of size n from a population and let T(x_1, …, x_n) be a real-valued or vector-valued function whose domain includes the sample space of (X_1, …, X_n). Then the random variable or random vector Y = T(X_1, …, X_n) is called a statistic. The probability distribution of a statistic Y is called the sampling distribution of Y.

Section 6: Large sample theory

Convergence concepts

Definition (6.1.2). A sequence of random variables X_1, X_2, … converges in probability to a random variable X if, for every ε > 0,
   lim_{n→∞} Pr(|X_n − X| ≥ ε) = 0
or, equivalently,
   lim_{n→∞} Pr(|X_n − X| < ε) = 1.
We write X_n →p X or plim X_n = X. The last equality reads “the probability limit of X_n is X”.

Theorem. Suppose that X_1, X_2, … converges in probability to a variable X and that h is a continuous function. Then h(X_1), h(X_2), … converges in probability to h(X).

Definition (6.1.4). A sequence of random variables X_1, X_2, … converges in distribution to a random variable X if
   lim_{n→∞} F_{X_n}(x) = F_X(x)
at all points x where F_X(x) is continuous.
We write X_n →d X, and we call F_X(·) the limit distribution of the sequence X_1, X_2, ….

Theorem (6.1.2). If X_n →p X, then X_n →d X.

Theorem (6.1.4, Slutsky’s theorem). If X_n →d X and Y_n →p a, a constant, then
(a) X_n + Y_n →d X + a;
(b) X_n Y_n →d aX.

Theorem (weak law of large numbers). Let X_1, X_2, … be a sequence of independent and identically distributed (i.i.d.) random variables with E(X_i) = µ and var(X_i) = σ² < ∞. Define X̄_n := (1/n) Σ_{i=1}^n X_i. Then, for every ε > 0,
   lim_{n→∞} Pr(|X̄_n − µ| < ε) = 1,
that is, X̄_n →p µ.

Theorem (central limit theorem). Let X_1, X_2, … be a sequence of independent and identically distributed (i.i.d.) random variables with E(X_i) = µ and var(X_i) = σ², both µ and σ² finite. Define X̄_n := (1/n) Σ_{i=1}^n X_i. Let G_n(x) be the cumulative distribution function of √n(X̄_n − µ)/σ. Then, for any x, −∞ < x < ∞,
   lim_{n→∞} G_n(x) = ∫_{−∞}^x (1/√(2π)) e^{−y²/2} dy,
that is, √n(X̄_n − µ)/σ has a limiting standard normal distribution, and we write (equivalently):
   √n(X̄_n − µ)/σ →d N(0, 1),   or   √n(X̄_n − µ) →d N(0, σ²),
   or   (1/√n) Σ_{i=1}^n (X_i − µ) →d N(0, σ²).

Two last tools from asymptotic theory (this slide can be skipped)

Theorem (continuous mapping theorem). Let X_1, X_2, … be a sequence of random variables and let h(·) be a continuous function.
1. If X_1, X_2, … converges in probability to a variable X, then h(X_1), h(X_2), … converges in probability to h(X).
2. If X_1, X_2, … converges in distribution to a variable X, then h(X_1), h(X_2), … converges in distribution to h(X).

Theorem (delta method). Let X_1, …, X_n be a sequence of random variables that satisfies √n(X_n − θ) →d N(0, σ²). For a given function g(·) and a specific value θ, suppose that g′(θ) exists and is not 0. Then
   √n[g(X_n) − g(θ)] →d N(0, σ²[g′(θ)]²).
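The WLLN and CLT are easy to see in a small Monte Carlo experiment. A sketch, assuming (hypothetically) i.i.d. U(0, 1) draws, so µ = 1/2 and σ² = 1/12; the seed and sample sizes are arbitrary choices:

```python
import math
import random

random.seed(42)
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and sd of U(0, 1)

def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

# WLLN: the sample mean settles near mu as n grows.
print(abs(sample_mean(100_000) - mu))  # small

# CLT: standardized means sqrt(n)(Xbar_n - mu)/sigma are roughly N(0, 1),
# so about 95% of them should fall inside (-1.96, 1.96).
n, reps = 500, 2_000
z = [math.sqrt(n) * (sample_mean(n) - mu) / sigma for _ in range(reps)]
coverage = sum(abs(v) < 1.96 for v in z) / reps
print(coverage)  # close to 0.95
```

Nothing about the uniform distribution is special here; any i.i.d. draws with finite variance give the same limiting behaviour, which is the content of the theorem.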
Asymptotic normality of OLS (simple regression model)

Let us focus on the MM estimator only and suppose Assumptions B1–B3 hold. It is possible to prove (see Wooldridge, 2013, Chapter 5 Appendix, p. 684) that
   √n(β̂_n − β) →d N(0, σ²/var(X)),
where σ² = var(Y | X) = var(U) and
   (1/n) Σ_{i=1}^n (x_i − x̄)² →p E{[X − E(X)]²} = var(X).
For simplicity, let us use σ²_X as shorthand notation for var(X). It follows from the results above that
   √n(β̂_n − β)/(σ/σ_X) →d N(0, 1).

Section 7: Hypothesis testing

Definition. A hypothesis is a statement about a population parameter.

Definition. The two complementary hypotheses in a hypothesis testing problem are called the null hypothesis and the alternative hypothesis. They are denoted by H₀ and H₁ (or H_A), respectively.

Example. Let θ denote the expected change in a patient’s blood pressure after taking a drug. You might be interested in testing H₀: θ = 0 versus H₁: θ ≠ 0.

Hypothesis testing

Definition. A hypothesis testing procedure, or hypothesis test, is a rule that specifies:
(i) for which sample values the decision is made to accept H₀ as true;
(ii) for which sample values H₀ is rejected and H₁ is accepted as true.
The subset of the sample space for which H₀ will be rejected is called the rejection region. The complement of the rejection region is called the acceptance region.

Table: two types of errors in hypothesis testing

                  Truth: H₀            Truth: H₁
   Accept H₀      Correct decision     Type II error
   Reject H₀      Type I error         Correct decision
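Putting the last two sections together, a large-sample two-sided test of H₀: θ = 0 rejects when a standardized statistic falls outside (−1.96, 1.96), the approximate 5% rejection region under the normal limit. A minimal sketch for the blood-pressure example; the data are made-up illustration numbers, not real measurements:

```python
import math

# Hypothetical observed changes in blood pressure for 10 patients.
changes = [-2.0, 5.0, 3.5, 6.0, 1.0, 4.5, 2.0, 3.0, 5.5, 0.5]

n = len(changes)
mean = sum(changes) / n
s2 = sum((c - mean) ** 2 for c in changes) / (n - 1)  # sample variance
z = mean / math.sqrt(s2 / n)  # standardized statistic under H0: theta = 0

# Rejection region at the (approximate) 5% level: |z| > 1.96.
reject = abs(z) > 1.96
print(round(z, 3), reject)
```

With these numbers the statistic is well outside the acceptance region, so H₀ would be rejected; with only n = 10 observations the normal critical value is of course a rough approximation (a t critical value would be more honest), which is why the test is labelled approximate.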