
STA 732: Statistical Inference
HW 2
Due Wed 02/26/2014 in class
1. Suppose $(X_i, Y_i)$, $i = 1, 2, \ldots \overset{\text{IID}}{\sim} \mathrm{BVN}(0, 0, 1, 1, \rho)$.
(a) Show that the sample correlation coefficient based on the first $n$ observations
$$ r_n = \frac{\sum_{i=1}^n (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\left\{\sum_{i=1}^n (X_i - \bar{X}_n)^2 \sum_{i=1}^n (Y_i - \bar{Y}_n)^2\right\}^{1/2}} \sim \mathrm{AN}\!\left(\rho,\ \frac{(1 - \rho^2)^2}{n}\right), $$
i.e., show that
$$ \sqrt{n}(r_n - \rho) \overset{L}{\Rightarrow} N(0, (1 - \rho^2)^2), $$
where $\bar{X}_n = (X_1 + \cdots + X_n)/n$ and $\bar{Y}_n = (Y_1 + \cdots + Y_n)/n$.
[Hint: Define $T_n = (\bar{X}_n, \bar{Y}_n, \frac{1}{n}\sum_{i=1}^n X_i^2, \frac{1}{n}\sum_{i=1}^n Y_i^2, \frac{1}{n}\sum_{i=1}^n X_i Y_i)^T$. By the standard multivariate CLT, $\sqrt{n}(T_n - \mu_0) \overset{L}{\Rightarrow} N_5(0, \Sigma)$. Identify $\mu_0$ and $\Sigma$ and then apply the Delta theorem. The real work lies in getting $\Sigma$ correct. Use the fact that $(X, Y) \sim \mathrm{BVN}(0, 0, 1, 1, \rho)$ means we could write $Y = \sqrt{1 - \rho^2}\, Z + \rho X$ where $Z \sim N(0, 1)$ and is independent of $X$. Also, for $X \sim N(0, 1)$, $EX^k = 0$ if $k$ is an odd integer and $EX^k = 1 \times 3 \times 5 \times \cdots \times (k - 1)$ if $k$ is even.]
(b) Does the same result hold if $(X_i, Y_i) \sim \mathrm{BVN}(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$ for some arbitrary $\mu_1, \mu_2 \in \mathbb{R}$, $\sigma_1, \sigma_2 \in \mathbb{R}^+$? Clearly describe which parts hold and which break down; both analytical calculations and simulations are welcome.
(c) Does the same result hold if $(X_i, Y_i)$, $i = 1, 2, \ldots$ were IID from some bivariate non-Gaussian pdf with $EX_i = EY_i = 0$, $\mathrm{Var}\,X_i = \mathrm{Var}\,Y_i = 1$ and $\mathrm{Corr}(X_i, Y_i) = \rho$? Clearly describe which parts hold and which break down; both analytical calculations and simulations are welcome.
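A minimal R simulation sketch for checking the limit in part (a), adaptable to the settings of (b) and (c); the values of rho, n and the number of replicates are illustrative choices, not prescribed by the problem:

## simulate standardized sqrt(n)(r_n - rho)/(1 - rho^2) and compare with N(0, 1)
set.seed(1)
rho <- 0.6; n <- 500; nrep <- 2000
zstat <- replicate(nrep, {
  x <- rnorm(n)
  y <- sqrt(1 - rho^2) * rnorm(n) + rho * x   ## the representation from the hint
  sqrt(n) * (cor(x, y) - rho) / (1 - rho^2)
})
hist(zstat, freq = FALSE, breaks = 40)
curve(dnorm(x), add = TRUE)                   ## should track the N(0, 1) pdf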
2. (Error-in-variables regression) Let observations $X_i = (Z_i, Y_i) \in \mathbb{R} \times \mathbb{R}$, $i = 1, 2, \ldots$ denote paired univariate measurements modeled as
$$ Z_i = U_i + \phi_i, \qquad Y_i = \alpha + \beta U_i + \epsilon_i, \qquad (U_i, \phi_i, \epsilon_i) \overset{\text{IID}}{\sim} g_U \times N(0, \sigma^2) \times N(0, \sigma^2). $$
The distribution $g_U$ of the latent predictor is unknown and of no interest. Primary interest concerns the regression coefficients $\theta = (\alpha, \beta, \sigma^2)$, especially the slope parameter $\beta$.
(a) You might substitute $U_i = Z_i - \phi_i$ in the linear model and rewrite it as
$$ Y_i = \alpha + \beta Z_i + \nu_i \tag{1} $$
where $\nu_i = \epsilon_i - \beta\phi_i \sim N(0, \sigma^2(1 + \beta^2))$. Show that the least squares estimate is inconsistent; it converges in probability to a wrong value. Give a brief explanation of why we have inconsistency despite the nice linear model (1). (A numerical illustration is sketched below.)
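A minimal numerical sketch of the inconsistency in part (a); the parameter values (alpha = 5, beta = 2, sigma = 0.5, g_U = Uniform(0, 1)) are assumed purely for illustration, and the "limit" below is a candidate probability limit for you to verify analytically:

set.seed(2)
n <- 1e5; beta <- 2; sig <- 0.5
u <- runif(n)
z <- u + rnorm(n, sd = sig)
y <- 5 + beta * u + rnorm(n, sd = sig)
c(ls.slope = cov(z, y) / var(z),             ## least squares slope
  limit = beta * var(u) / (var(u) + sig^2))  ## conjectured limit: far from beta = 2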
(b) Consider a Z-estimator $\hat{\theta}_n$ based on the following score function
$$ \psi(X_i, \theta) = \begin{pmatrix} Y_i - \alpha - \beta Z_i \\ Z_i(Y_i - \alpha - \beta Z_i) + \beta\sigma^2 \\ (Y_i - \alpha - \beta Z_i)^2 - \sigma^2(1 + \beta^2) \end{pmatrix}. $$
Show that for any arbitrary $\theta_0$, the map $\theta \mapsto E\{\psi(X_i, \theta) \mid \theta_0\}$ has a unique zero at $\theta = \theta_0$ and $E\{\dot{\psi}(X_i, \theta_0) \mid \theta_0\}$ is non-singular. Also give an expression for $\frac{1}{n}\dot{\psi}_n(\theta)$.
(c) Setting up the estimating equations $\psi_n(\theta) = 0$, it is easy to see that the Z-estimator $\hat{\theta}_n = (\hat{\alpha}, \hat{\beta}, \hat{\sigma}^2)$ satisfies
$$ \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{Z}, \qquad \hat{\beta} = \frac{s_{zy}}{s_z^2 - \hat{\sigma}^2}, \qquad \hat{\sigma}^2 = \frac{\frac{1}{n}\sum_{i=1}^n (Y_i - \hat{\alpha} - \hat{\beta} Z_i)^2}{1 + \hat{\beta}^2}, $$
where $s_{zy} = \frac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})(Z_i - \bar{Z})$ and $s_z^2 = \frac{1}{n}\sum_{i=1}^n (Z_i - \bar{Z})^2$. The solution vector could be obtained by an iterative approach where we start with initial estimates
$$ \hat{\beta}_0 = \hat{\beta}_{LS}, \qquad \hat{\alpha}_0 = \bar{Y} - \hat{\beta}_0\bar{Z}, \qquad \hat{\sigma}_0^2 = \min\left\{ \frac{\frac{1}{n}\sum_{i=1}^n (Y_i - \hat{\alpha}_0 - \hat{\beta}_0 Z_i)^2}{1 + \hat{\beta}_0^2},\ \frac{s_z^2}{2} \right\} $$
and iteratively update for $t = 1, 2, \ldots$
• $\hat{\beta}_t \leftarrow s_{zy} / (s_z^2 - \hat{\sigma}_{t-1}^2)$
• $\hat{\alpha}_t \leftarrow \bar{Y} - \hat{\beta}_t\bar{Z}$
• $\hat{\sigma}_t^2 \leftarrow \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{\alpha}_t - \hat{\beta}_t Z_i)^2 / (1 + \hat{\beta}_t^2)$
until $|\hat{\beta}_t - \hat{\beta}_{t-1}|$ is smaller than a pre-specified tolerance (or we reach a maximum iteration limit, or the system becomes infeasible with $\hat{\sigma}^2 > s_z^2$; note that $\mathrm{Var}\,Z_i = \mathrm{Var}\,U_i + \sigma^2 \geq \sigma^2$).
Use the R code provided below, or write your own, to carry out the following simulation study:
i. Create a synthetic data set of size $n = 100$ with observations $(Z_i, Y_i)$ simulated from the truth $\theta_0 = (\alpha = 5, \beta = 2, \sigma^2 = 0.5^2)$ and $g_U = \mathrm{Uniform}(0, 1)$, and show the output of your code (or of the code provided below; a usage sketch is also given after part iii). Your output must show 4 things: the parameter estimates; the estimated variance matrix $\hat{\Sigma}/n$; the number of iterations used; and the convergence status of the iterative approach.
ii. Now replicate the above 1000 times, and compare the (histogram) distributions of $(\hat{\alpha} - \alpha_0)/\hat{\mathrm{se}}_\alpha$, $(\hat{\beta} - \beta_0)/\hat{\mathrm{se}}_\beta$ and $(\hat{\sigma}^2 - \sigma_0^2)/\hat{\mathrm{se}}_{\sigma^2}$ against the $N(0, 1)$ pdf [here $\hat{\mathrm{se}}_\alpha^2$ is the estimated variance of $\hat{\alpha}$, i.e., the first diagonal element of $\hat{\Sigma}/n$, and so on].
iii. For testing $H_0: \beta = 2$ vs. $H_1: \beta \neq 2$, what is the power of the level-5% Z-estimator-based test at $\theta = (5, 3, 0.5^2)$, $g_U = \mathrm{Uniform}(0, 1)$?
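A minimal sketch of part (c)i, assuming the est.mme function from the code listing at the end of this assignment has been sourced into the R session:

set.seed(3)
n <- 100
u <- runif(n)                        ## g_U = Uniform(0, 1)
z <- u + rnorm(n, sd = 0.5)          ## Z_i = U_i + phi_i
y <- 5 + 2 * u + rnorm(n, sd = 0.5)  ## Y_i = alpha + beta U_i + eps_i
fit <- est.mme(z, y)
fit$par         ## parameter estimates
fit$vcov        ## estimated variance matrix Sigma-hat / n
fit$iterations  ## number of iterations used
fit$code        ## convergence status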
3. (Slope of the t-test for testing location shift) Consider two samples of observations $W_1, \ldots, W_m$ and $V_1, \ldots, V_k$ modeled as
$$ W_i \overset{\text{IID}}{\sim} g(w_i - \theta), \qquad V_j \overset{\text{IID}}{\sim} g(v_j), \qquad W_i\text{'s and } V_j\text{'s independent}, $$
where $\theta \in \mathbb{R}$ and $g \in \{\text{all densities on } \mathbb{R}\}$ are unknown. We are interested in testing $H_0: \theta = 0$ vs. $H_1: \theta \neq 0$. The two-sample t-test is given by the statistic
$$ T_n = \frac{\sqrt{n}\,(\bar{W} - \bar{V})}{S_n}, \quad \text{with } S_n^2 = \begin{cases} \left\{\dfrac{\sum_{i=1}^m (W_i - \bar{W})^2 + \sum_{j=1}^k (V_j - \bar{V})^2}{m + k - 2}\right\}\left(\dfrac{n}{m} + \dfrac{n}{k}\right) & \text{``equal variance''} \\[2ex] n\left\{\dfrac{\sum_{i=1}^m (W_i - \bar{W})^2}{m(m - 1)} + \dfrac{\sum_{j=1}^k (V_j - \bar{V})^2}{k(k - 1)}\right\} & \text{``unequal variance'',} \end{cases} $$
where $n = m + k$. For the asymptotic study, assume both $m, k \to \infty$ such that $m/n \to \lambda \in (0, 1)$.
(a) Argue that for any fixed $\theta$, either choice of $S_n$ satisfies $S_n^2 \overset{P}{\to} \frac{\sigma_g^2}{\lambda(1 - \lambda)}$ under $p_{n,\theta}$, where $\sigma_g^2 = \mathrm{Var}(V_1 \mid g) = \int v^2 g(v)\,dv - \left\{\int v g(v)\,dv\right\}^2$. (A numerical check is sketched after part (d).)
(b) Next, argue that for any sequence $(\theta_n, n = 1, 2, \ldots)$, $S_n^2 \overset{P}{\to} \frac{\sigma_g^2}{\lambda(1 - \lambda)}$ under $p_{n,\theta_n}$. [Hint: does the distribution of $W_i - \bar{W}$ depend on $\theta$?]
(c) Next, argue that for any sequence $(\theta_n, n = 1, 2, \ldots)$, $\sqrt{n}(\bar{W} - \bar{V} - \theta_n) \overset{L}{\Rightarrow} N\!\left(0, \frac{\sigma_g^2}{\lambda(1 - \lambda)}\right)$ under $p_{n,\theta_n}$.
(d) Finally, establish that $T_n$ satisfies the regularity condition (5) of Handout 4 for some $\mu(\theta), \sigma(\theta)$. Give simple expressions for both functions and calculate the slope of the t-test for testing $H_0: \theta = 0$.
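A minimal numerical check of part (a), assuming for illustration g = standard normal (so sigma_g^2 = 1) and theta = 1; both variance estimates should come out close to sigma_g^2/(lambda(1 - lambda)):

set.seed(4)
m <- 4000; k <- 6000; n <- m + k; lambda <- m / n
w <- rnorm(m) + 1                    ## W_i ~ g(w - theta) with theta = 1
v <- rnorm(k)                        ## V_j ~ g(v)
s.eq <- (sum((w - mean(w))^2) + sum((v - mean(v))^2)) / (m + k - 2) * (n/m + n/k)
s.uneq <- n * (sum((w - mean(w))^2) / (m * (m - 1)) +
               sum((v - mean(v))^2) / (k * (k - 1)))
c(s.eq, s.uneq, 1 / (lambda * (1 - lambda)))   ## all three should be close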
4. Traffic accident counts $X_1, \ldots, X_n$ of $n = 1000$ drivers from a county are modeled by the following zero-inflated Poisson distribution: $X_i \overset{\text{IID}}{\sim} g(x_i \mid \mu, \pi)$, $\mu > 0$, $\pi \in [0, 1]$, where
$$ g(x_i \mid \mu, \pi) = \begin{cases} (1 - \pi) + \pi e^{-\mu} & x_i = 0 \\ \pi e^{-\mu} \dfrac{\mu^{x_i}}{x_i!} & x_i = 1, 2, \ldots, \end{cases} $$
which is the same as saying the $X_i$'s are IID and each $X_i$ is zero with probability $1 - \pi$ and is drawn from $\mathrm{Poisson}(\mu)$ with probability $\pi$. For this discussion we focus on testing $H_0: \pi = 1$, i.e., that there is no zero-inflation.
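A minimal sketch for drawing from this zero-inflated Poisson model, reused in the sketches for the later parts; rzip is a hypothetical helper name introduced here, not part of the problem:

## zero with probability 1 - pi, otherwise Poisson(mu)
rzip <- function(n, mu, pi) rpois(n, mu) * rbinom(n, 1, pi)
x <- rzip(1000, mu = 1, pi = 0.95)
c(n0 = sum(x == 0), xbar = mean(x))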
(a) Give a simplified expression for the log-likelihood $\ell_X(\mu, \pi)$. Your expression, up to an additive constant, should involve the data only through the statistics $n_0(X) = $ the number of $X_i$ equaling zero, and $\bar{X}$.
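As a check on part (a), the unsimplified log-likelihood can be evaluated numerically and compared against your closed-form expression (up to an additive constant) on simulated data; a minimal sketch:

## direct evaluation of the zero-inflated Poisson log-likelihood
llik.zip <- function(x, mu, pi)
  sum(log(ifelse(x == 0, (1 - pi) + pi * exp(-mu), pi * dpois(x, mu))))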
(b) Some algebra shows that a unique solution $(\hat{\mu}, \hat{\pi})$ exists to the first-order equations
$$ \frac{\partial}{\partial\mu} \ell_X(\mu, \pi) = 0, \qquad \frac{\partial}{\partial\pi} \ell_X(\mu, \pi) = 0 $$
whenever $\bar{X} > 0$ (i.e., not all $X_i$ are zero), and that these $\hat{\mu}, \hat{\pi}$ also satisfy
$$ \hat{\pi} = \frac{\bar{X}}{\hat{\mu}}, \qquad \hat{\mu} = h_X(\hat{\mu}), $$
where
$$ h_X(\mu) = \frac{\bar{X}(1 - e^{-\mu})}{1 - \frac{n_0(X)}{n}}. $$
It is simple to check that whenever $\bar{X} > 0$, the function $h_X(\mu)$ is concave in $\mu$ with $h_X(0) = 0$, $\dot{h}_X(0) > 1$, and consequently a graph of $h_X(\mu)$ looks like the curve in Figure 1 (it cuts the 45-degree line at precisely two points, one being 0 and the other a positive number, and stays above the line only in between these two points).
Argue why the solution $(\hat{\mu}, \hat{\pi})$ cannot be the MLE whenever $\frac{n_0(X)}{n} < e^{-\bar{X}}$ [however, the MLE does exist in this case]. (A numerical sketch of this fixed point follows.)
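A minimal numerical sketch of this fixed point, reusing the hypothetical rzip helper sketched above; it plots h_X and locates the positive solution of mu = h_X(mu):

set.seed(5)
hX <- function(mu, xbar, p0) xbar * (1 - exp(-mu)) / (1 - p0)
x <- rzip(1000, mu = 1, pi = 0.9)
xbar <- mean(x); p0 <- mean(x == 0)
curve(hX(mu, xbar, p0), from = 0, to = 5, xname = "mu")
abline(0, 1, lty = 2)                ## the 45-degree line
mu.hat <- uniroot(function(mu) hX(mu, xbar, p0) - mu, c(1e-8, 20))$root
c(mu.hat = mu.hat, pi.hat = xbar / mu.hat)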
(c) When $X_i \overset{\text{IID}}{\sim} \mathrm{Poisson}(\mu)$, it follows from the multivariate CLT that
$$ \sqrt{n}\begin{pmatrix} \frac{n_0(X)}{n} - e^{-\mu} \\ \bar{X} - \mu \end{pmatrix} \overset{L}{\Rightarrow} N_2\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} e^{-\mu}(1 - e^{-\mu}) & -\mu e^{-\mu} \\ -\mu e^{-\mu} & \mu \end{pmatrix} \right). $$
Argue that when $X_i \overset{\text{IID}}{\sim} \mathrm{Poisson}(\mu)$ we must have
$$ \sqrt{n}\left(\frac{n_0(X)}{n} - e^{-\bar{X}}\right) \overset{L}{\Rightarrow} N(0, \sigma(\mu)^2) $$
for some $\sigma(\mu) > 0$. Simplify your expression for $\sigma(\mu)^2$.
[Figure 1: Plot of $h_X(\mu)$ against $\mu$ for an $X$ with $\bar{X} > 0$. The dashed line is the 45-degree line.]
(d) What is the implication of the result in part (4c) on ML tests for H0 : π = 1?
(e) In (4c), the quantity $\sigma(\mu)$ is continuous in $\mu$, and so whenever $X_i \overset{\text{IID}}{\sim} \mathrm{Poisson}(\mu)$,
$$ Z(X) = \frac{\sqrt{n}\left(\frac{n_0(X)}{n} - e^{-\bar{X}}\right)}{\sigma(\bar{X})} \overset{L}{\Rightarrow} N(0, 1) $$
by the fact that $\bar{X} \overset{P}{\to} \mu$ (coupled with Slutsky's theorem). Figure 2 confirms this through a simulation study where $Z(X)$ is calculated for 10,000 samples of $X = (X_1, \ldots, X_n)$, each with $n = 1000$, simulated from a zero-inflated Poisson distribution with $\pi = 1$ and $\mu$ set to either 1/3, 1 or 3.
Figure 2: Distribution of $Z$ under 3 choices of $(\mu, \pi)$ ($\mu = 1/3, 1, 3$), each satisfying $H_0: \pi = 1$. The solid line is the pdf of $N(0, 1)$.
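A minimal sketch reproducing one panel of Figure 2. The sigma.sq function below is one candidate simplification of sigma(mu)^2 from part (4c); treat it as an assumption to be verified against your own derivation:

sigma.sq <- function(mu) exp(-mu) * (1 - (1 + mu) * exp(-mu))  ## assumed form
n <- 1000; mu <- 1
Z <- replicate(10000, {
  x <- rpois(n, mu)   ## pi = 1: no zero-inflation
  sqrt(n) * (mean(x == 0) - exp(-mean(x))) / sqrt(sigma.sq(mean(x)))
})
hist(Z, freq = FALSE, breaks = 40)
curve(dnorm(x), add = TRUE)          ## compare with the N(0, 1) pdf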
We could think of two (approximately) size-α tests for H0 : π = 1:
i. Test 1: reject $H_0$ if $|Z(X)| \geq \Phi^{-1}(1 - \alpha/2)$, or
ii. Test 2: reject $H_0$ if $Z(X) \geq \Phi^{-1}(1 - \alpha)$.
Justify why Test 2 is more appropriate. Write your answer with clear logic, but no technical proof is required. [Hint: what happens to $Z$ when $H_0$ is not true? You may find this inequality useful: $1 - \pi + \pi e^{-\mu} > e^{-\pi\mu}$ whenever $0 < \pi < 1$ and $\mu > 0$.]
(f) Another approximately size-$\alpha$ test for $H_0: \pi = 1$ is the so-called over-dispersion test given by:
$$ \text{reject } H_0 \text{ if } O(X) = \sqrt{\frac{n - 1}{2}}\left(\frac{s_X^2}{\bar{X}} - 1\right) \geq \Phi^{-1}(1 - \alpha), $$
which again relies on the result that when $X_i \overset{\text{IID}}{\sim} \mathrm{Poisson}(\mu)$, $O(X) \overset{L}{\Rightarrow} N(0, 1)$. Simulations of $O(X)$ under the null give pictures very similar to those for $Z(X)$ in part (4e).
However, simulating $Z(X)$ and $O(X)$ under $(\mu, \pi)$ taken from outside the null shows some differences. Figure 3 reports histograms of $Z(X)$ and $O(X)$ simulated under a zero-inflated Poisson distribution with $\pi = 0.95$ and $\mu \in \{1/3, 1, 3\}$.
Figure 3: Distributions of $Z(X)$ and $O(X)$ under 3 choices of $(\mu, \pi)$, each with $\pi = 0.95$. The gray solid histogram is for $Z(X)$ and the histogram with black outline and white interior is for $O(X)$. The solid line is the pdf of $N(0, 1)$.
Which test would you prefer using: the test based on $Z(X)$ [Test 2 from part (4e)] or the test based on $O(X)$? Explain your choice. [No proof needed; give a clear logical argument.] (A simulation sketch comparing the two follows.)
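A minimal sketch comparing the two one-sided tests under the alternative of Figure 3 (pi = 0.95, mu = 1), reusing the hypothetical rzip and sigma.sq helpers sketched earlier:

Z.stat <- function(x){
  n <- length(x)
  sqrt(n) * (mean(x == 0) - exp(-mean(x))) / sqrt(sigma.sq(mean(x)))
}
O.stat <- function(x) sqrt((length(x) - 1) / 2) * (var(x) / mean(x) - 1)
sims <- replicate(10000, {x <- rzip(1000, mu = 1, pi = 0.95); c(Z.stat(x), O.stat(x))})
rowMeans(sims >= qnorm(0.95))        ## empirical powers of the Z- and O-based tests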
(g) Could you point out any reason for the difference we see in part (4f)? Does one statistic make better use of the data than the other? Justify your answer.
5. Consider testing for trend in annual tropical cyclone (TC) counts based on count data $X = (X_1, \ldots, X_n)$ from $n$ consecutive years. A useful test statistic providing evidence against $H_0$: "no trend" is
$$ T(X) = \frac{\#\{X_{t+1} > X_t : t = 1, \ldots, n - 1\}}{n - 1}. $$
Indeed, if $X_t \overset{\text{IID}}{\sim} g(x_t)$ for some common pmf $g(x)$ on $\{0, 1, \ldots\}$, then $T(X)$ should be around $\mu(g) = \frac{1}{2}\{1 - \sum_x g(x)^2\}$, which gives the probability of $X_2 > X_1$ when $X_1, X_2 \overset{\text{IID}}{\sim} g$ [this probability equals $\frac{1}{2}P(X_1 \neq X_2)$], a number that is typically close to half. An increasing trend will tend to produce larger $T(X)$ values and a decreasing trend will tend to produce $T(X)$ closer to 0.
To turn the heuristic choice of $T(X)$ into a rigorous test we need to figure out its distribution under $H_0$. That is not easy to achieve in general. Notice that $\#\{X_{t+1} > X_t : t = 1, \ldots, n - 1\}$ is not binomial, because the events $\{X_2 > X_1\}$ and $\{X_3 > X_2\}$ are not independent.
However, we could construct a "permutation" test. For any $x = (x_1, \ldots, x_n)$ with each $x_i$ a non-negative integer, define $P_x$ to be the set of all permutations of $x$. If $x$ consists of $k$ unique values $x_1^* < \cdots < x_k^*$, and $x_1^*$ is repeated $n_1$ times, $x_2^*$ is repeated $n_2$ times, and so on, then $P_x$ has
$$ M = \binom{n}{n_1, n_2, \ldots, n_k} $$
many elements. Denote the elements of $P_x$ as $x^m = (x_1^m, \ldots, x_n^m)$, $m = 1, \ldots, M$. For each $x^m \in P_x$, calculate $t^m = T(x^m)$. Find a rearrangement $\{i(1), \ldots, i(M)\}$ such that $t^{i(1)} \leq \cdots \leq t^{i(M)}$.
Fix an $\alpha \in (0, 1)$. Find $\ell = $ the smallest $m$ such that $m > M\alpha/2$ and $u = $ the largest $m$ such that $m < M(1 - \alpha/2)$, and define $c_\ell(x, \alpha) = t^{i(\ell)}$ and $c_u(x, \alpha) = t^{i(u)}$. [In practice, for a given $x$ we do not enumerate the entire $P_x$ to evaluate $c_\ell(x, \alpha)$ and $c_u(x, \alpha)$; instead a large number of random permutations of $x$ are used to approximate these quantities, as sketched below.]
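A minimal sketch of this random-permutation approximation; the sample quantiles below stand in for the order statistics t^{i(l)} and t^{i(u)} defined above:

T.stat <- function(x) mean(diff(x) > 0)       ## T(X) as defined in this problem
perm.cutoffs <- function(x, alpha = 0.05, B = 1e4){
  t.perm <- replicate(B, T.stat(sample(x)))   ## T over random permutations of x
  quantile(t.perm, c(alpha / 2, 1 - alpha / 2))
}
## testing rule: reject H0 if T.stat(x) falls outside perm.cutoffs(x, alpha)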
Consider the testing rule:
"reject $H_0$: $X_t \overset{\text{IID}}{\sim} g$ for some common pmf $g$ if $T(X) \notin [c_\ell(X, \alpha), c_u(X, \alpha)]$"
and show that it has size $\leq \alpha$.
[Hint: Despite the long description, the mathematics of checking the size is fairly straightforward. BUT notice that in calculating $P(T(X) \notin [c_\ell(X, \alpha), c_u(X, \alpha)])$ both the test statistic and the two bounds depend on $X$.]
[A real hint: Let $\mathrm{sort}(x)$ denote the sorting function which rearranges the elements of a vector $x = (x_1, \ldots, x_n)$ in non-decreasing order. Argue that for any $s = (s_1, \ldots, s_n)$ with $s_1 \leq \cdots \leq s_n$, the conditional distribution of $X$ given $\mathrm{sort}(X) = s$ under $H_0$ is precisely the (discrete) uniform distribution over $P_s$. Extend this argument to the size calculation.]
## Iterative Z-estimator finder for error-in-variables regression
est.mme <- function(z, y, rel.tol = 1e-6, maxit = 1e3){
  ## basic statistics
  n <- length(y)
  z.bar <- mean(z)
  y.bar <- mean(y)
  s.zz <- mean(z^2) - mean(z)^2
  s.zy <- mean(z * y) - mean(z) * mean(y)
  ## initialization
  beta.hat <- s.zy / s.zz
  alpha.hat <- y.bar - beta.hat * z.bar
  sigsq.hat <- min(mean((y - alpha.hat - beta.hat * z)^2) / (1 + beta.hat^2), s.zz / 2)
  ## stopping tolerance and flags
  tol <- abs(beta.hat) * rel.tol
  flag <- TRUE
  bad.sig <- FALSE
  ## iteration
  iter <- 0
  while(flag){
    iter <- iter + 1
    diff <- s.zy / (s.zz - sigsq.hat) - beta.hat  ## this is beta[t] - beta[t - 1]
    beta.hat <- beta.hat + diff
    alpha.hat <- y.bar - beta.hat * z.bar
    resid <- y - alpha.hat - beta.hat * z
    sigsq.hat <- mean(resid^2) / (1 + beta.hat^2)
    if(sigsq.hat > s.zz){
      ## infeasible system
      flag <- FALSE
      bad.sig <- TRUE
    } else {
      ## check stopping criterion
      flag <- (iter < maxit) & (abs(diff) > tol)
    }
  }
  ## Sigma.hat construction: sandwich A^{-1} B A^{-T} with
  ## A = average of psi-dot and B = average of psi psi^T
  psit.dot.bar <- matrix(c(-1, -z.bar, 0,
                           -z.bar, sigsq.hat - mean(z^2), beta.hat,
                           -2*mean(resid), -2*mean(z*resid) - 2*beta.hat*sigsq.hat, -(1 + beta.hat^2)),
                         3, 3, byrow = TRUE)
  psi.hat <- cbind(resid, z*resid + beta.hat*sigsq.hat, resid^2 - sigsq.hat*(1 + beta.hat^2))
  psi.psit.bar <- matrix(rowMeans(apply(psi.hat, 1, tcrossprod)), 3, 3)
  Sigma <- tcrossprod(solve(psit.dot.bar, psi.psit.bar), solve(psit.dot.bar))
  ## return estimates theta-hat = (alpha, beta, sigma^2) -- on the sigma^2
  ## scale so that par matches vcov -- the parameter var-cov matrix Sigma / n,
  ## and convergence status (0 = converged, 1 = infeasible, 2 = reached max iterations)
  return(list(par = c(alpha.hat, beta.hat, sigsq.hat),
              iterations = iter,
              vcov = Sigma / n,
              code = bad.sig + 2 * (iter == maxit)))
}