STATISTICAL METHODS FOR FINANCE: EXAM
SOLUTIONS 2013
Q1. Fisher information.
The joint density is f = f (x1 , . . . , xn ; θ): f = f (x; θ). The likelihood is
L = L(θ) := f (x; θ), with the data here as x = (x1 , . . . , xn ) (so L is a statistic
– can be calculated from the data). The log-likelihood is ℓ = ℓ(θ) := log L(θ).
The (Fisher) score function is s(θ) := ℓ′ (θ).
[2]
The (Fisher) information is defined by either $I(\theta) := E[s(\theta)^2]$ (so $I \ge 0$), or $I(\theta) := -E[s'(\theta)]$ (so $I$ is additive) (these are equivalent – see below). [2]
The density integrates to 1: $\int f(x; \theta)\,dx = 1$: $\int f = 1$. We assume $f$ smooth enough to differentiate under the integral sign w.r.t. $\theta$, twice. Then
$$\int \frac{\partial f}{\partial\theta} = \frac{\partial}{\partial\theta}\int f = \frac{\partial}{\partial\theta} 1 = 0: \qquad \int \Big(\frac{1}{f}\frac{\partial f}{\partial\theta}\Big) f = 0: \qquad \int \Big(\frac{\partial}{\partial\theta}\log f\Big) f = 0.$$
Now $E[g(X)] = \int g(x) f(x; \theta)\,dx = \int g f$, so $E[\partial \log L/\partial\theta] = 0$: $E[\partial\ell/\partial\theta] = 0$: $E[\ell'(\theta)] = 0$: $E[s(\theta)] = 0$.
[6]
Differentiate under the integral sign w.r.t. $\theta$ again:
$$\frac{\partial}{\partial\theta}\int\Big(\frac{1}{f}\frac{\partial f}{\partial\theta}\Big) f = 0, \quad \int\frac{\partial}{\partial\theta}\Big[\Big(\frac{1}{f}\frac{\partial f}{\partial\theta}\Big) f\Big] = 0:$$
$$\int\Big[\Big(\frac{1}{f}\frac{\partial f}{\partial\theta}\Big)\frac{\partial f}{\partial\theta} + f\,\frac{\partial}{\partial\theta}\Big(\frac{1}{f}\frac{\partial f}{\partial\theta}\Big)\Big] = 0.$$
As the bracket in the second term is ∂ log f /∂θ, this says
$$\int\Big[\Big(\frac{1}{f}\frac{\partial f}{\partial\theta}\Big)^2 + \frac{\partial}{\partial\theta}\Big(\frac{\partial \log f}{\partial\theta}\Big)\Big] f = 0, \quad \int\Big[\Big(\frac{\partial \log f}{\partial\theta}\Big)^2 + \frac{\partial^2}{\partial\theta^2}(\log f)\Big] f = 0:$$
$$E\Big[\Big(\frac{\partial}{\partial\theta}\log L\Big)^2 + \frac{\partial^2}{\partial\theta^2}\log L\Big] = 0: \quad E[\{\ell'(\theta)\}^2 + \ell''(\theta)] = 0: \quad E[s(\theta)^2 + s'(\theta)] = 0.$$
[6]
As above, write $I(\theta) = E[s^2(\theta)] = -E[s'(\theta)]$, and call $I(\theta)$ the (Fisher) information on $\theta$ (in the sample $(x_1, \ldots, x_n)$). So, combining:
The score function $s(\theta) := \ell'(\theta)$ has mean 0 and variance $I(\theta)$.
[2]
Application.
The Fisher information appears in the large-sample theory of maximum-likelihood estimation (MLE): in the regular case, with $\theta_0$ the true value of the parameter $\theta$, $\sqrt{nI(\theta_0)}\,(\hat{\theta} - \theta_0) \to \Phi = N(0,1)$ ($n \to \infty$). [2]
[Seen – lectures]
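As a quick numerical illustration (not part of the exam solution; the model and sample sizes below are chosen purely for the example), the identities $E[s(\theta)] = 0$ and $E[s(\theta)^2] = -E[s'(\theta)] = I(\theta)$ can be checked by simulation for the $N(\theta, 1)$ model, where $s(\theta) = \sum_i (x_i - \theta)$ and $I(\theta) = n$:

# Minimal Monte Carlo sketch (illustrative): score identities for the N(theta, 1) model,
# where s(theta) = sum_i (x_i - theta), s'(theta) = -n, and I(theta) = n.
import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 2.0, 50, 100_000           # example values (assumed)

x = rng.normal(theta0, 1.0, size=(reps, n))
score = (x - theta0).sum(axis=1)             # s(theta0) for each simulated sample
print(score.mean())                          # approx 0:  E[s(theta0)] = 0
print(score.var(), n)                        # approx n:  E[s^2] = I(theta0) = -E[s'] = n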
Q2. Lognormal distribution; normal means.
X has the log-normal distribution with parameters µ and σ, X ∼ LN (µ, σ),
if Y := log X ∼ N (µ, σ 2 ).
[2]
The MGF of $Y$ is $M_Y(t) := E[e^{tY}] = \exp\{\mu t + \frac{1}{2}\sigma^2 t^2\}$: $M_Y(1) = E[e^Y] = \exp\{\mu + \frac{1}{2}\sigma^2\}$.
But $e^Y = X$: $E[X] = \exp\{\mu + \frac{1}{2}\sigma^2\}$: $LN(\mu, \sigma)$ has mean $\exp\{\mu + \frac{1}{2}\sigma^2\}$. [3]
In geometric Brownian motion (GBM), as in the Black-Scholes model, the
price process S = (St ) of a risky asset is driven by the SDE
$$dS_t/S_t = \mu\,dt + \sigma\,dW_t, \qquad (GBM)$$
with W = (Wt ) Brownian motion/the Wiener process. This has solution
$$S_t = S_0 \exp\{(\mu - \tfrac{1}{2}\sigma^2)t + \sigma W_t\}:$$
$\log S_t$ is normally distributed: $S_t$ is lognormally distributed. [5]
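As an illustrative check (not part of the exam; the parameter values are assumed for the example), simulating the exact solution confirms that $\log S_t$ is normal with mean $\log S_0 + (\mu - \frac{1}{2}\sigma^2)t$ and that, by the lognormal mean formula above, $E[S_t] = S_0 e^{\mu t}$:

# Simulation sketch (illustrative): exact GBM solution and the lognormal mean formula.
import numpy as np

rng = np.random.default_rng(1)
S0, mu, sigma, t, reps = 100.0, 0.05, 0.2, 1.0, 200_000    # example values (assumed)

W_t = rng.normal(0.0, np.sqrt(t), size=reps)               # Brownian motion at time t
S_t = S0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W_t)

print(np.log(S_t).mean(), np.log(S0) + (mu - 0.5 * sigma**2) * t)   # mean of log S_t (normal)
print(S_t.mean(), S0 * np.exp(mu * t))                              # E[S_t] = S0 exp(mu t)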
For a normal population $N(\mu, \sigma)$ with $\sigma$ known: to test $H_0: \mu = \mu_0$ v. $H_1: \mu < \mu_0$. First, take any $\mu_1 < \mu_0$. To test $H_0$ v. $\mu = \mu_1$, by the Neyman-Pearson Lemma (NP), the best (most powerful) test uses as test statistic the likelihood ratio (LR) $\lambda := L_0/L_1 = L(\mu_0)/L(\mu_1)$, where with data $x_1, \ldots, x_n$
$$L(\mu) = \sigma^{-n}(2\pi)^{-\frac{1}{2}n}\exp\Big\{-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2/\sigma^2\Big\},$$
and critical region $R$ of the form $\lambda \le \text{const}$: reject $H_0$ if $\lambda$ is too small. Here $\lambda = \exp\{-\frac{1}{2}[\sum(x_i - \mu_0)^2 - \sum(x_i - \mu_1)^2]/\sigma^2\}$. Forming the LR $\lambda$, the constants cancel, so $R$ has the form $\log\lambda \le \text{const}$, or $-2\log\lambda \ge \text{const}$. Expanding the squares, the $\sum x_i^2$ terms cancel, so (as $\sum x_i = n\bar{x}$) this is
$$-2\mu_0 n\bar{x} + n\mu_0^2 + 2\mu_1 n\bar{x} - n\mu_1^2 \ge \text{const}: \qquad 2(\mu_1 - \mu_0)\bar{x} + (\mu_0^2 - \mu_1^2) \ge \text{const}.$$
As $\mu_1 < \mu_0$, this is $\bar{x} \le c$. At significance level $\alpha$, $c$ is the lower $\alpha$-point of the distribution of $\bar{x}$ under $H_0$. Then $\bar{x} \sim N(\mu_0, \sigma^2/n)$, so $Z := (\bar{x} - \mu_0)\sqrt{n}/\sigma \sim \Phi = N(0,1)$. If $c_\alpha$ is the lower $\alpha$-point of $\Phi = N(0,1)$, i.e. of $Z := (\bar{x} - \mu_0)\sqrt{n}/\sigma$, then $c_\alpha = (c - \mu_0)\sqrt{n}/\sigma$: $c = \mu_0 + \sigma c_\alpha/\sqrt{n}$.
[7]
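An illustrative sketch (not from the exam; the numbers are assumed) of the resulting test, computing $c = \mu_0 + \sigma c_\alpha/\sqrt{n}$ from the lower $\alpha$-point of $N(0,1)$ and rejecting $H_0$ when $\bar{x} \le c$:

# One-sided z-test sketch (illustrative): critical value c = mu0 + sigma * c_alpha / sqrt(n).
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 10.0, 2.0, 25, 0.05   # example values (assumed)

c_alpha = norm.ppf(alpha)                    # lower alpha-point of N(0,1), approx -1.645
c = mu0 + sigma * c_alpha / np.sqrt(n)       # reject H0: mu = mu0 when xbar <= c

xbar = 9.2                                   # hypothetical observed sample mean
print(round(c, 3), xbar <= c)                # critical value and the test decision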
But this holds for all µ1 < µ0 . So R is uniformly most powerful (UMP)
for H0 : µ = µ0 (simple null) v. H1 : µ < µ0 (composite alternative).
[3]
[Seen – lectures]
Q3. Sufficiency and the factorisation criterion.
We give a Bayesian treatment of sufficiency, as this is easier than the
classical one (for which see e.g. I.4, Day 2, Course website).
If x = (x1 , x2 ), where x1 is informative about θ: we call x1 sufficient for
θ if x2 is uninformative, i.e. x2 cannot affect our views on θ, i.e.
(i) f (θ|x) = f (θ|x1 , x2 ) does not depend on x2 , i.e.
f (θ|x1 , x2 ) = f (θ|x1 ),
or
$$\frac{f(\theta, x_1, x_2)}{f(x_1, x_2)} = \frac{f(\theta, x_1)}{f(x_1)},$$
i.e.
$$\frac{f(\theta, x_1, x_2)}{f(\theta, x_1)} = \frac{f(x_1, x_2)}{f(x_1)}: \qquad f(x_2|x_1, \theta) = f(x_2|x_1):$$
(ii) $f(x_2|x_1, \theta)$ does not depend on $\theta$.
Either of (i), (ii) can be used as the definition of sufficiency in a Bayesian
treatment. [Notice that (i) is essentially a Bayesian statement: it is meaningless in classical statistics, as there θ cannot have a density.]
[5]
Now recall the classical Fisher-Neyman Factorisation Criterion for sufficiency: a statistic x1 is sufficient for the parameter θ iff the likelihood f (x|θ)
factorises as
(iii) f (x|θ), or f (x1 , x2 |θ), = g(x1 , θ)h(x1 , x2 ),
for some functions g, h.
[5]
Proposition. x1 is sufficient for θ iff the Factorisation Criterion (iii) holds.
Proof. (ii) ⇒ (iii):
$$f(x|\theta) = f(x_1, x_2|\theta) = f(x_1|\theta)\,f(x_2|x_1, \theta) \quad \text{(as in 2 above)}$$
$$= f(x_1|\theta)\,f(x_2|x_1) \quad \text{(by (ii))},$$
giving (iii).
(iii) ⇒ (i): By Bayes’ Theorem in the form ‘posterior proportional to prior
times likelihood’, the factor h(x1 , x2 ) in (iii) can be absorbed into the constant of proportionality [which is unimportant: it can be recovered from the
remaining terms, its role being merely to make these integrate to one]. Then
x2 drops out, so does not appear in the posterior, giving (i).
[10]
[Seen – lectures]
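As a quick illustration of the Factorisation Criterion (a standard textbook example, not part of the question): for a Bernoulli($\theta$) sample $x = (x_1, \ldots, x_n)$ with $x_i \in \{0, 1\}$,
$$f(x|\theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{t}(1-\theta)^{n-t}, \qquad t := \sum_{i=1}^n x_i,$$
which is of the form $g(t, \theta)\,h(x)$ in (iii) with $h \equiv 1$, so the sample total $t$ is sufficient for $\theta$.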
Q4. Regression plane.
With two regressors u and v and response variable y, given a sample of
size n of points (u1 , v1 , y1 ), . . . , (un , vn , yn ) we have to fit a least-squares plane
– that is, choose parameters a, b, c to minimise the sum of squares
$$SS := \sum_{i=1}^n (y_i - c - a u_i - b v_i)^2.$$
Taking ∂SS/∂c = 0 gives
$$\sum_{i=1}^n (y_i - c - a u_i - b v_i) = 0: \qquad c = \bar{y} - a\bar{u} - b\bar{v}.$$
Substituting for $c$,
$$SS = \sum_{i=1}^n [(y_i - \bar{y}) - a(u_i - \bar{u}) - b(v_i - \bar{v})]^2.$$
Then $\partial SS/\partial a = 0$ and $\partial SS/\partial b = 0$ give
$$\sum_{i=1}^n (u_i - \bar{u})[(y_i - \bar{y}) - a(u_i - \bar{u}) - b(v_i - \bar{v})] = 0,$$
$$\sum_{i=1}^n (v_i - \bar{v})[(y_i - \bar{y}) - a(u_i - \bar{u}) - b(v_i - \bar{v})] = 0.$$
Multiply out, divide by $n$ to turn the sums into averages, and re-arrange:
$$a s_{uu} + b s_{uv} = s_{yu}, \qquad a s_{uv} + b s_{vv} = s_{yv}.$$
These are the normal equations (NE) for $a$ and $b$. [10]
Condition for non-degeneracy. The determinant is
$$s_{uu} s_{vv} - s_{uv}^2 = s_{uu} s_{vv}(1 - r_{uv}^2)$$
(as $r_{uv} := s_{uv}/(s_u s_v)$), which is $\ne 0$ iff $r_{uv} \ne \pm 1$, i.e. iff the $(u_i, v_i)$ are not collinear, and this is the condition for (NE) to have a unique solution.
[4]
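An illustrative sketch (not from the exam; the synthetic data and the "true" plane are assumed) solving the normal equations (NE) for $a$, $b$ and recovering $c = \bar{y} - a\bar{u} - b\bar{v}$, cross-checked against an ordinary least-squares fit:

# Normal-equations sketch (illustrative): fit a least-squares plane to synthetic data.
import numpy as np

rng = np.random.default_rng(2)
n = 200
u, v = rng.normal(size=n), rng.normal(size=n)
y = 1.5 + 2.0 * u - 0.5 * v + rng.normal(scale=0.3, size=n)    # assumed 'true' plane + noise

S = np.cov(np.vstack([u, v, y]), bias=True)                    # sample (co)variances, divided by n
s_uu, s_vv, s_uv = S[0, 0], S[1, 1], S[0, 1]
s_yu, s_yv = S[2, 0], S[2, 1]

a, b = np.linalg.solve([[s_uu, s_uv], [s_uv, s_vv]], [s_yu, s_yv])   # the normal equations (NE)
c = y.mean() - a * u.mean() - b * v.mean()
print(a, b, c)                                                 # approx 2.0, -0.5, 1.5

X = np.column_stack([np.ones(n), u, v])                        # cross-check: OLS with intercept
print(np.linalg.lstsq(X, y, rcond=None)[0])                    # approx [1.5, 2.0, -0.5]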
Application: Grain futures.
The two principal factors affecting grain yields (apart from the weather
near harvest – unpredictable!) are sunshine (in hours) and rainfall (in mm)
during the spring growing season (known in advance). Using these as predictor variables u, v gives a best (linear unbiased) estimator of grain yield y.
The volumes of grain traded yearly are enormous. So, the ability to predict as accurately as possible the size of the summer harvest (and so, by
supply and demand, its price), given information available in the spring, is
very valuable. Such predictions can be used to form trading strategies for grain futures and grain options, etc. (for example, the Great Grain Steal of 1972, by the then USSR, on the USA and Canada).
[6]
[Seen: problem sheets]
Q5. Yule-Walker equations and AR(2).
The AR(p) model is (with (ϵt ) white noise W N (σ))
Xt = ϕ1 Xt−1 + ϕ2 Xt−2 + · · · + ϕp Xt−p + ϵt .
[2]
Multiply by Xt−k and take E: as E[Xt−k Xt−i ] = ρ(|k − i|) = ρ(k − i),
$$\rho(k) = \phi_1 \rho(k-1) + \cdots + \phi_p \rho(k-p) \qquad (k > 0). \qquad (YW)$$
These are the Yule-Walker equations.
[4]
They give a difference equation of order $p$, with characteristic polynomial
$$\lambda^p - \phi_1 \lambda^{p-1} - \cdots - \phi_p = 0.$$
If the roots are $\lambda_1, \ldots, \lambda_p$, the trial solution $\rho(k) = \lambda^k$ is a solution iff $\lambda$ is one of the roots $\lambda_i$. Since the equation is linear,
$$\rho(k) = c_1 \lambda_1^k + \cdots + c_p \lambda_p^k$$
(for $k \ge 0$; use $\rho(-k) = \rho(k)$ for $k < 0$) is a solution for all choices of constants $c_i$ – the general solution of (YW) if all the roots $\lambda_i$ are distinct. [4]
Example of an AR(2) process.
$$X_t = \tfrac{1}{3} X_{t-1} + \tfrac{2}{9} X_{t-2} + \epsilon_t, \qquad (\epsilon_t)\ WN. \qquad (1)$$
The Yule-Walker equations here are ρ(k) = ϕ1 ρ(k − 1) + ϕ2 ρ(k − 2).
The characteristic polynomial is
$$\lambda^2 - \tfrac{1}{3}\lambda - \tfrac{2}{9} = 0: \quad (\lambda - 2/3)(\lambda + 1/3) = 0; \quad \lambda_1 = 2/3, \ \lambda_2 = -1/3.$$
So as the roots are distinct, the autocorrelation is $\rho(k) = a\lambda_1^k + b\lambda_2^k$.
[5]
$k = 0$: $\rho(0) = 1$ gives $a + b = 1$: $b = 1 - a$. So $\rho(k) = a(2/3)^k + (1-a)(-1/3)^k$.
$k = 1$: $\rho(1) = \phi_1\rho(0) + \phi_2\rho(-1)$; as $\rho(0) = 1$ and $\rho(-1) = \rho(1)$, $\rho(1) = \phi_1/(1 - \phi_2)$. As here $\phi_1 = 1/3$ and $\phi_2 = 2/9$, this gives $\rho(1) = 3/7$. So
$$\rho(1) = 3/7 = a\cdot(2/3) + (1-a)\cdot(-1/3).$$
That is,
$$\frac{3}{7} + \frac{1}{3} = a\Big(\frac{2}{3} + \frac{1}{3}\Big) = a:$$
$a = (9+7)/21 = 16/21$. Thus
$$\rho(k) = \frac{16}{21}\Big(\frac{2}{3}\Big)^k + \frac{5}{21}\Big(-\frac{1}{3}\Big)^k.$$
[5]
[Seen, lectures]
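An illustrative simulation (not part of the exam; the sample size and seed are assumed) of the AR(2) process (1), comparing sample autocorrelations with $\rho(k) = \frac{16}{21}(2/3)^k + \frac{5}{21}(-1/3)^k$:

# AR(2) simulation sketch (illustrative): X_t = (1/3) X_{t-1} + (2/9) X_{t-2} + eps_t.
import numpy as np

rng = np.random.default_rng(3)
T, burn = 100_000, 1_000
eps = rng.normal(size=T + burn)

x = np.zeros(T + burn)
for t in range(2, T + burn):
    x[t] = (1/3) * x[t-1] + (2/9) * x[t-2] + eps[t]
x = x[burn:]                                   # drop the burn-in

def sample_acf(series, k):
    """Lag-k sample autocorrelation."""
    s = series - series.mean()
    return np.dot(s[:-k], s[k:]) / np.dot(s, s) if k > 0 else 1.0

for k in range(5):
    theory = (16/21) * (2/3)**k + (5/21) * (-1/3)**k
    print(k, round(sample_acf(x, k), 3), round(theory, 3))   # sample vs Yule-Walker rho(k)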
Q6. The Bayes linear estimator.
If d(z) is a linear function, a+b′ z, where z and b are vectors, the quadratic
loss is
D = E[(a + b′ z − θ)2 ]
= E[a2 + 2ab′ z + b′ zz ′ b − 2aθ − 2b′ zθ + θ2 ]
= a2 + 2ab′ Ez + b′ E(zz ′ )b − 2aEθ − 2b′ E(zθ) + E(θ2 ).
[4]
Add and subtract [E(θ)]2 , (b′ Ez)2 = b′ EzEz ′ b and 2b′ EzEθ. Write V :=
var z = E(zz ′ ) − EzEz ′ for the covariance matrix of z, c := cov(θ, z) =
E(zθ) − EzEθ for the covariance vector between θ and the elements of the
vector z.
D = (a + b′ Ez − Eθ)2 + b′ (varz)b − 2b′ cov(z, θ) + varθ :
D = (a + b′ Ez − Eθ)2 + b′ V b − 2b′ c + varθ.
[4]
Write $b^* := V^{-1}c$, $D^* := \mathrm{var}(\theta) - c'V^{-1}c$. Then this becomes
$$D = (a + b'Ez - E\theta)^2 + (b - b^*)'V(b - b^*) + D^* \qquad (*)$$
(the quadratic terms check as $b^{*\prime}Vb^* = c'V^{-1}VV^{-1}c = c'V^{-1}c$, the linear terms as $c = Vb^*$).
[4]
The third term on the right in $(*)$ does not involve $a$, $b$, while the first two are non-negative (the first is a square, the second a quadratic form with matrix $V$, non-negative definite as $V$ is a covariance matrix). So the expected quadratic loss $D$ is minimised by choosing $b = b^*$, $a = -b^{*\prime}Ez + E\theta$. This choice gives
$$d(z) = E\theta + c'V^{-1}(z - Ez), \qquad c := \mathrm{cov}(z, \theta), \quad V := \mathrm{var}(z).$$
This gives the Bayes linear estimator of $\theta$ based on data $z = z(x)$.
[4]
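A toy numerical sketch (illustrative, not from the exam; the joint distribution of $(\theta, z)$ below is assumed) of $d(z) = E\theta + c'V^{-1}(z - Ez)$, with the moments estimated from simulated draws:

# Bayes linear estimator sketch (illustrative): d(z) = E[theta] + c' V^{-1} (z - E[z]).
import numpy as np

rng = np.random.default_rng(4)
reps = 200_000

theta = rng.normal(0.0, 1.0, size=reps)                  # assumed: theta ~ N(0, 1)
noise = rng.normal(0.0, 0.5, size=(reps, 2))
z = np.column_stack([theta, 0.8 * theta]) + noise        # z depends linearly on theta

E_theta, E_z = theta.mean(), z.mean(axis=0)
V = np.cov(z, rowvar=False)                              # var(z), a 2x2 matrix
c = np.array([np.cov(theta, z[:, j])[0, 1] for j in range(2)])   # cov(theta, z)

def bayes_linear(z_obs):
    """Bayes linear estimate of theta given an observed 2-vector z."""
    return E_theta + c @ np.linalg.solve(V, z_obs - E_z)

print(bayes_linear(np.array([1.0, 0.8])))                # approx 0.87: shrunk towards E[theta] = 0

Only the first and second moments of $(\theta, z)$ enter, in line with the remarks on distributional assumptions below.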
Distributional assumptions.
The Bayes linear estimator depends only on first and second moments:
Eθ, Ez, c = cov(z, θ), V = var(z). So we do not need to know the full
likelihood, just the first and second moments of (θ, z(x)), the parameter and
the function z in which we want the estimator to be linear.
[2]
Application.
The Bayes linear estimator is used in the construction of the Kalman
filter – state-space models for Time Series.
[2]
[Seen, lectures]