Statistics for High-Dimensional Data:
Selected Topics

Peter Bühlmann
Seminar für Statistik, ETH Zürich
June 2014
High-dimensional linear model:
basic concepts in methodology, theory and
computation
High-dimensional data
Riboflavin production with Bacillus Subtilis
(in collaboration with DSM (Switzerland))
goal: improve riboflavin production rate of Bacillus Subtilis
using clever genetic engineering
response variable Y ∈ R: riboflavin (log-) production rate
covariates X ∈ R^p: expressions from p = 4088 genes
sample size n = 115, p ≫ n
[Figure: gene expression data, Y versus 9 "reasonable" genes; scatterplots of
the (log-) production rate y against selected gene expressions xsel]
general framework:
Z_1, ..., Z_n (with some "i.i.d. components"), dim(Z_i) ≫ n

for example:
Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ Y ⊆ R: regression with p ≫ n
Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ {0, 1}: classification with p ≫ n

numerous applications:
biology, imaging, economics, environmental sciences, ...
High-dimensional linear models

Y_i = Σ_{j=1}^p β_j⁰ X_i^(j) + ε_i,  i = 1, ..., n,  p ≫ n

in short: Y = Xβ⁰ + ε
goals:
- prediction, e.g. w.r.t. squared prediction error
- estimation of β⁰, e.g. w.r.t. ‖β̂ − β⁰‖_q (q = 1, 2)
- variable selection,
  i.e. estimating the active set of the effective variables
  (those with corresponding coefficient ≠ 0)
we need to regularize...
and there are many proposals:
- Bayesian methods for regularization
- greedy algorithms, aka forward selection or boosting
- preliminary dimension reduction
- ...
e.g. 4'160'000 entries on Google Scholar for
"high dimensional linear model" ...
Penalty-based methods

if the true β⁰ is sparse w.r.t.
- ‖β⁰‖₀⁰ = number of non-zero coefficients
  ⇝ regularize with the ‖·‖₀-penalty:
  argmin_β ( n⁻¹‖Y − Xβ‖₂² + λ‖β‖₀⁰ ),  e.g. AIC, BIC
  ⇝ computationally infeasible if p is large (2^p sub-models)
- ‖β⁰‖₁ = Σ_{j=1}^p |β_j⁰|
  ⇝ penalize with the ‖·‖₁-norm, i.e. the Lasso:
  argmin_β ( n⁻¹‖Y − Xβ‖₂² + λ‖β‖₁ )
  ⇝ convex optimization:
  computationally feasible and very fast for large p

as we will see:
regularization with the ℓ₁-norm is good ("nearly optimal")
even if the truth is sparse w.r.t. ‖β⁰‖₀⁰
⇝ can exploit the computational advantage of ‖·‖₁-norm regularization
even if the problem has a different sparsity structure
The Lasso (Tibshirani, 1996)

Lasso for linear models:

β̂(λ) = argmin_β ( n⁻¹‖Y − Xβ‖₂² + λ‖β‖₁ ),  λ ≥ 0,  ‖β‖₁ = Σ_{j=1}^p |β_j|

⇝ convex optimization problem
- the Lasso does variable selection:
  some of the β̂_j(λ) = 0 (because of the "ℓ₁-geometry")
- β̂(λ) is a shrunken LS-estimate
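As an illustration (a minimal sketch on simulated toy data, not the riboflavin
analysis): fitting the Lasso path with the glmnet package in R, which also
appears later in these slides. Note that glmnet minimizes
‖Y − Xβ‖₂²/(2n) + λ‖β‖₁ (with an intercept and standardization by default),
so its λ differs from the λ above by a factor of 2.

## minimal Lasso sketch with glmnet on simulated toy data
library(glmnet)

set.seed(1)
n <- 100; p <- 500
x <- matrix(rnorm(n * p), n, p)        # toy design with p >> n
beta0 <- c(rep(2, 5), rep(0, p - 5))   # sparse truth: s0 = 5 active variables
y <- drop(x %*% beta0 + rnorm(n))

fit <- glmnet(x, y, alpha = 1)         # alpha = 1: ell_1 (Lasso) penalty
plot(fit, xvar = "lambda")             # coefficient paths beta_hat_j(lambda)
coef(fit, s = 0.5)                     # sparse coefficient vector at lambda = 0.5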
more about the "ℓ₁-geometry"

equivalence to the primal problem:

β̂_primal(R) = argmin_{β: ‖β‖₁ ≤ R} ‖Y − Xβ‖₂²/n,

with a correspondence between λ and R which depends on the data
(X₁, Y₁), ..., (X_n, Y_n)

[such an equivalence holds since
- ‖Y − Xβ‖₂²/n is convex in β
- the constraint ‖β‖₁ ≤ R is convex
see e.g. Bertsekas (1995)]

[Figure: contour lines of the residual sum of squares and the constraint
regions for p = 2; left: the ℓ₁-"world"]
the residual sum of squares reaches its minimal value (for certain
constellations of the data) if its contour lines hit the ℓ₁-ball in a corner
⇝ β̂₁ = 0
the ℓ₂-"world" is different:
Ridge regression,

β̂_Ridge(λ) = argmin_β ( ‖Y − Xβ‖₂²/n + λ‖β‖₂² ),

has the equivalent primal solution

β̂_Ridge;primal(R) = argmin_{β: ‖β‖₂ ≤ R} ‖Y − Xβ‖₂²/n,

with a one-to-one correspondence between λ and R
A note on the Bayesian approach

model:
β₁, ..., β_p i.i.d. ∼ p(β)dβ,
given β: Y ∼ N_n(Xβ, σ²I_n) with density f(y|σ², β)

posterior density:

p(β|Y, σ²) = f(Y|β, σ²)p(β) / ∫ f(Y|β, σ²)p(β)dβ ∝ f(Y|β, σ²)p(β)

and hence for the MAP (Maximum A-Posteriori) estimator:

β̂_MAP = argmax_β p(β|Y, σ²) = argmin_β ( −log[f(Y|β, σ²)p(β)] )
      = argmin_β ( (2σ²)⁻¹‖Y − Xβ‖₂² − Σ_{j=1}^p log p(β_j) )

examples:
1. Double-Exponential prior DExp(τ):
   p(β) = (τ/2) exp(−τ|β|)
   ⇝ β̂_MAP equals the Lasso with penalty parameter λ = n⁻¹2σ²τ
2. Gaussian prior N(0, τ²):
   p(β) = (1/(√(2π)τ)) exp(−β²/(2τ²))
   ⇝ β̂_MAP equals the Ridge estimator with penalty parameter λ = n⁻¹σ²/τ²

but we will argue that the Lasso (i.e., the MAP estimator under the
Double-Exponential prior) is also good if the truth is sparse with respect to
‖β⁰‖₀⁰, e.g. if the prior is (much) more spiky around zero than the
Double-Exponential distribution
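spelling out example 1: with p(β_j) = (τ/2) exp(−τ|β_j|) we have
−Σ_{j=1}^p log p(β_j) = τ‖β‖₁ + const, so

β̂_MAP = argmin_β ( (2σ²)⁻¹‖Y − Xβ‖₂² + τ‖β‖₁ )
      = argmin_β ( n⁻¹‖Y − Xβ‖₂² + n⁻¹2σ²τ ‖β‖₁ ),

i.e. the Lasso with λ = n⁻¹2σ²τ (and analogously for the Gaussian prior and
Ridge)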
Orthonormal design

Y = Xβ + ε,  n⁻¹XᵀX = I

the Lasso is the soft-thresholding estimator:

β̂_j(λ) = sign(Z_j)(|Z_j| − λ/2)₊,  Z_j = (n⁻¹XᵀY)_j  (= OLS),

i.e. β̂_j(λ) = g_soft(Z_j)
[Exercise]
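A small numerical sketch of the Exercise (toy data; xo, yo and the other names
are ad hoc, not from the slides): the soft-thresholding formula on an
orthonormal design.

## soft-thresholding g_soft and a check on a toy orthonormal design (n^{-1} X^T X = I)
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda / 2, 0)

set.seed(2)
n_o <- 100; p_o <- 5
xo <- qr.Q(qr(matrix(rnorm(n_o * p_o), n_o, p_o))) * sqrt(n_o)   # columns scaled so that t(xo) %*% xo / n = I
beta0_o <- c(3, -2, 0, 0, 1)
yo <- drop(xo %*% beta0_o + rnorm(n_o))

z <- drop(crossprod(xo, yo)) / n_o     # Z_j = (n^{-1} X^T Y)_j, which equals OLS here
soft_threshold(z, lambda = 0.5)        # Lasso solution beta_hat_j(lambda)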
[Figure: threshold functions g(z) for z ∈ [−3, 3]: Adaptive Lasso,
Hard-thresholding, Soft-thresholding]
Prediction

goal: predict a new observation Y_new

consider the expected (w.r.t. new data; and random X) squared error loss:

E_{X_new,Y_new}[(Y_new − X_new β̂)²] = σ² + E_{X_new}[(X_new(β⁰ − β̂))²]
                                    = σ² + (β̂ − β⁰)ᵀ Σ (β̂ − β⁰),  Σ = Cov(X)

⇝ terminology "prediction error":
for random design X: (β̂ − β⁰)ᵀΣ(β̂ − β⁰) = E_{X_new}[(X_new(β̂ − β⁰))²]
for fixed design X:  (β̂ − β⁰)ᵀΣ̂(β̂ − β⁰) = ‖X(β̂ − β⁰)‖₂²/n
binary lymph node classification using gene expressions:
a high noise problem

n = 49 samples, p = 7130 gene expressions

despite being a classification problem: P[Y = 1|X = x] = E[Y|X = x]
⇝ estimate p̂(x) via a linear model; can then do classification

cross-validated misclassification error (2/3 training; 1/3 test):

  Lasso        21.1%
  L2Boosting   17.7%
  FPLR         35.25%
  Pelora       27.8%
  1-NN         43.25%
  DLDA         36.12%
  SVM          36.88%

(with variable selection / best 200 genes (Wilcoxon test) /
no additional variable selection, depending on the method)

Lasso selected on CV-average 13.12 out of p = 7130 genes

from a practical perspective:
if you trust cross-validation, you can validate how good we are,
i.e. prediction may be a black box, but we can evaluate it!
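A hedged sketch of such an evaluation scheme (not the original analysis; x and
y here are placeholders for a gene expression matrix and 0/1 labels): repeated
2/3 training, 1/3 test splits, a Lasso fit as a linear model for p̂(x), and
classification by thresholding at 1/2.

## sketch of cross-validated misclassification error with a Lasso linear model
library(glmnet)

cv_misclass <- function(x, y, B = 50) {
  n <- nrow(x)
  errs <- replicate(B, {
    train <- sample(n, size = floor(2 * n / 3))
    cvfit <- cv.glmnet(x[train, ], y[train])          # Lasso for the linear model of E[Y|X=x]
    phat  <- predict(cvfit, x[-train, ], s = "lambda.min")
    mean((phat > 0.5) != y[-train])                   # classify by thresholding p_hat at 1/2
  })
  mean(errs)
}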
and in fact, we will hear that
- the Lasso is consistent for prediction assuming "essentially nothing"
- the Lasso is optimal for prediction assuming the "compatibility condition"
  for X
Estimation of regression coefficients

Y = Xβ⁰ + ε,  p ≫ n,  with fixed (deterministic) design X

problem of identifiability:
for p > n: Xβ⁰ = Xθ for any θ = β⁰ + ξ with ξ in the null space of X
⇝ cannot say anything about ‖β̂ − β⁰‖ without further assumptions!
⇝ we will work with the compatibility assumption (see later)

and we will explain: under the compatibility condition,

‖β̂ − β⁰‖₁ ≤ C (s₀/φ₀²) √(log(p)/n),  s₀ = |supp(β⁰)| = |{j: β_j⁰ ≠ 0}|

⟹ let's turn to the blackboard!
various conditions and their relations (van de Geer & PB, 2009)
⇝ oracle inequalities for prediction and estimation

[Diagram: implications among design conditions: RIP, weak (S,2s)-RIP,
coherence, (S,2s)-irrepresentable, weak (S,2s)-irrepresentable,
(S,s)-uniform irrepresentable, adaptive (S,2s)- and (S,s)-restricted
regression, (S,2s)- and (S,s)-restricted eigenvalue, S-compatibility;
with side conditions such as |S*\S| ≤ s, |S*\S| = 0, S* = S]
Variable selection

Example: motif regression,
for finding HIF1α transcription factor binding sites in DNA sequences
Müller, Meier, PB & Ricci (never published...)

Y_i ∈ R: univariate response measuring the binding intensity of HIF1α on
coarse DNA segment i (from CHIP-chip experiments)
X_i = (X_i^(1), ..., X_i^(p)) ∈ R^p:
X_i^(j) = abundance score of candidate motif j in DNA segment i
(using sequence data and computational biology algorithms, e.g. MDSCAN)

question: what is the relation between the binding intensity Y and the
abundance of short candidate motifs?
⇝ a linear model is often reasonable
"motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)

Y = Xβ + ε,  n = 287, p = 195

goal: variable selection
⇝ find the relevant motifs among the p = 195 candidates
Lasso for variable selection:

Ŝ(λ) = {j: β̂_j(λ) ≠ 0}  as an estimator of  S₀ = {j: β_j⁰ ≠ 0}

no significance testing involved,
it's convex optimization only!
(and that can be a problem... see later)
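A minimal sketch of reading off Ŝ(λ) in R (reusing fit and the toy data from
the earlier glmnet sketch; the value λ = 0.5 is arbitrary):

## the selected set S_hat(lambda) = {j: beta_hat_j(lambda) != 0} from a glmnet fit
bhat  <- drop(as.matrix(coef(fit, s = 0.5)))[-1]   # beta_hat(lambda = 0.5), intercept dropped
S_hat <- which(bhat != 0)                          # indices of the selected variables
length(S_hat)                                      # size of the selected set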
Motif regression
for finding HIF1α transcription factor binding sites in DNA sequences

Y_i ∈ R: univariate response measuring binding intensity on coarse DNA
segment i (from CHIP-chip experiments)
X_i^(j) = abundance score of candidate motif j in DNA segment i

variable selection in the linear model

Y_i = β₀ + Σ_{j=1}^p β_j X_i^(j) + ε_i,  i = 1, ..., n = 287, p = 195

⇝ the Lasso selects 26 covariates and R² ≈ 50%,
i.e. 26 interesting candidate motifs
[Figure: motif regression (original data): estimated coefficients β̂(λ̂_CV)
versus variable index 1, ..., 195]
"Theory" for variable selection with the Lasso

for the (fixed design) linear model Y = Xβ⁰ + ε with
active set S₀ = {j: β_j⁰ ≠ 0}

two key assumptions:
1. neighborhood stability condition for the design X
   ⇔ irrepresentable condition for the design X
2. beta-min condition:
   min_{j∈S₀} |β_j⁰| ≥ C √(s₀ log(p)/n),  C suitably large

both conditions are sufficient and "essentially" necessary for

Ŝ(λ) = S₀ with high probability,  λ ≍ √(log(p)/n)  (larger than for prediction)

already proved in Meinshausen & PB, 2004 (published 2006)
and both assumptions are restrictive!
neighborhood stability condition ⇔ irrepresentable condition
(Zhao & Yu, 2006)

Σ̂ = n⁻¹XᵀX; the active set S₀ = {j: β_j⁰ ≠ 0} = {1, ..., s₀} consists of the
first s₀ variables; partition

Σ̂ = ( Σ̂_{S₀,S₀}    Σ̂_{S₀,S₀ᶜ} )
    ( Σ̂_{S₀ᶜ,S₀}   Σ̂_{S₀ᶜ,S₀ᶜ} )

irrepresentable condition:  ‖Σ̂_{S₀ᶜ,S₀} Σ̂_{S₀,S₀}⁻¹ sign(β₁⁰, ..., β_{s₀}⁰)ᵀ‖_∞ < 1
not very realistic assumptions... what can we expect?
recall: under the compatibility condition,

‖β̂ − β⁰‖₁ ≤ C (s₀/φ₀²) √(log(p)/n)

consider the relevant active variables

S_relev = {j: |β_j⁰| > C (s₀/φ₀²) √(log(p)/n)}

then, clearly,
Ŝ ⊇ S_relev with high probability
(if some j ∈ S_relev had β̂_j = 0, then already |β̂_j − β_j⁰| = |β_j⁰| would
exceed the ℓ₁-error bound above)

⇝ screening for detecting the relevant variables is possible!
without the beta-min condition, assuming the compatibility condition only

in addition, assuming the beta-min condition

min_{j∈S₀} |β_j⁰| > C (s₀/φ₀²) √(log(p)/n):

Ŝ ⊇ S₀ with high probability
⇝ screening for detecting the true variables
Tibshirani (1996):
LASSO = Least Absolute Shrinkage and Selection Operator
new translation:
LASSO = Least Absolute Shrinkage and Screening Operator
Practical perspective

choice of λ: λ̂_CV from cross-validation

empirical and theoretical indications (Meinshausen & PB, 2006) that

Ŝ(λ̂_CV) ⊇ S₀ (or S_relev)

moreover,

|Ŝ(λ̂_CV)| ≤ min(n, p)  (= n if p ≫ n)

⇝ huge dimensionality reduction (in the original covariates)
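In R (a minimal sketch on the toy data from the first glmnet example), λ̂_CV
and the size of Ŝ(λ̂_CV) could be obtained as follows:

## lambda_hat_CV by 10-fold cross-validation and the size of S_hat(lambda_hat_CV)
cv_fit <- cv.glmnet(x, y)                          # CV curve over the lambda grid
plot(cv_fit)                                       # marks lambda.min and the more conservative lambda.1se
b_cv <- drop(as.matrix(coef(cv_fit, s = "lambda.min")))[-1]
sum(b_cv != 0)                                     # |S_hat(lambda_hat_CV)|, at most min(n, p)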
[Figure: motif regression (original data): estimated coefficients β̂(λ̂_CV)
versus variable index 1, ..., 195]
which variables in Ŝ are false positives?
(p-values would be very useful!)

recall:
Ŝ(λ̂_CV) ⊇ S₀ (or S_relev)

and we would then use a second stage to reduce the number of false positive
selections
⇝ re-estimation on a much smaller model with the variables from Ŝ:
- OLS on Ŝ with e.g. BIC variable selection
- thresholding coefficients and OLS re-estimation
- adaptive Lasso (Zou, 2006)
- ...
Adaptive Lasso (Zou, 2006)

re-weighting the penalty function:

β̂ = argmin_β ( ‖Y − Xβ‖₂²/n + λ Σ_{j=1}^p |β_j| / |β̂_init,j| ),

with β̂_init,j from the Lasso in a first stage (or from OLS if p < n)
for orthogonal design, if β̂_init = OLS: Adaptive Lasso = NN-garrote
⇝ less bias than the Lasso

[Figure: threshold functions on [−3, 3]: Adaptive Lasso, Hard-thresholding,
Soft-thresholding]
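One way to compute such a two-stage adaptive Lasso in R (a hedged sketch,
again on the toy data, reusing cv_fit from the λ̂_CV sketch above) is glmnet's
penalty.factor argument; glmnet rescales the penalty factors internally, which
only reparametrizes λ, and λ is chosen by CV here anyway.

## adaptive Lasso sketch: second-stage Lasso with weights 1 / |beta_init_j|
b_init <- drop(as.matrix(coef(cv_fit, s = "lambda.min")))[-1]   # first stage: plain Lasso
keep   <- which(b_init != 0)                  # restrict to the initially selected variables
w      <- 1 / abs(b_init[keep])               # adaptive weights
fit_ad <- cv.glmnet(x[, keep, drop = FALSE], y, penalty.factor = w)
b_ad   <- numeric(ncol(x))
b_ad[keep] <- drop(as.matrix(coef(fit_ad, s = "lambda.min")))[-1]
c(lasso = length(keep), adaptive = sum(b_ad != 0))   # adaptive Lasso typically selects fewer variables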
motif regression

[Figure: estimated coefficients versus variable index 1, ..., 195;
one panel for the Lasso, one for the Adaptive Lasso]

the Lasso selects 26 variables; the Adaptive Lasso selects 16 variables
KKT conditions and computation

characterization of the solution(s) β̂ as minimizer(s) of the criterion function

Q_λ(β) = ‖Y − Xβ‖₂²/n + λ‖β‖₁

since Q_λ(·) is a convex function:
it is necessary and sufficient that the subdifferential ∂Q_λ(β) at β̂ contains
the zero element

Lemma (first part)
Denote by G(β) = −2Xᵀ(Y − Xβ)/n the gradient vector of ‖Y − Xβ‖₂²/n.
Then β̂ is a solution if and only if
G_j(β̂) = −sign(β̂_j)λ  if β̂_j ≠ 0,
|G_j(β̂)| ≤ λ  if β̂_j = 0.

Lemma (second part)
If the solution of argmin_β Q_λ(β) is not unique (e.g. if p > n), and if
|G_j(β̂)| < λ for some solution β̂, then β̂_j = 0 for all (other) solutions β̂
in argmin_β Q_λ(β).
The zeroes are "essentially" unique
("essentially" refers to the situation β̂_j = 0 and |G_j(β̂)| = λ)

Proof: Exercise (optional), or see the book by Bühlmann and van de Geer
(2011; Lemma 2.1)
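A small R sketch checking the Lemma numerically for a candidate solution (an
ad hoc helper, not from the slides; the tolerance absorbs the finite precision
of whatever solver produced β̂). It can be applied, for example, to the output
of the coordinate-descent sketch further below.

## numerical check of the KKT characterization for a candidate Lasso solution
kkt_check <- function(x, y, beta, lambda, tol = 1e-4) {
  G <- drop(-2 * crossprod(x, y - x %*% beta) / nrow(x))   # gradient of ||Y - X beta||_2^2 / n
  active <- beta != 0
  ok_active   <- abs(G[active] + sign(beta[active]) * lambda) <= tol   # G_j(beta_hat) = -sign(beta_hat_j) lambda
  ok_inactive <- abs(G[!active]) <= lambda + tol                       # |G_j(beta_hat)| <= lambda
  all(ok_active, ok_inactive)
}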
Coordinate descent algorithm for computation

general idea: compute a solution β̂(λ_{grid,k}) and use it as a starting value
for the computation of β̂(λ_{grid,k−1}), where λ_{grid,k−1} < λ_{grid,k}

β^(0) ∈ R^p an initial parameter vector; set m = 0.
REPEAT:
  increase m by one: m ← m + 1
  for j = 1, ..., p:
    if |G_j(β_{−j}^(m−1))| ≤ λ: set β_j^(m) = 0,
    otherwise: β_j^(m) = argmin_{β_j} Q_λ(β_{+j}^(m−1))
UNTIL numerical convergence

where β_{−j} denotes the parameter vector with the jth component set to zero,
and β_{+j}^(m−1) the parameter vector which equals β^(m−1) except for the jth
component equalling β_j

for squared error loss: explicit updating formulae (Exercise):

G_j(β) = −2X_jᵀ(Y − Xβ)/n,
β_j^(m) = sign(Z_j)(|Z_j| − λ/2)₊ / Σ̂_jj,  Z_j = X_jᵀ(Y − Xβ_{−j})/n,  Σ̂ = n⁻¹XᵀX

⇝ componentwise soft-thresholding

this is very fast if the true problem is sparse
active set strategy: can do non-systematic cycling, visiting mainly the active
(non-zero) components
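A runnable sketch of these updates (assumptions: plain systematic cycling and
a simple max-change stopping rule, i.e. no warm starts over a λ-grid and no
active-set strategy, unlike glmnet):

## plain coordinate descent for the Lasso criterion Q_lambda above (didactic sketch)
lasso_cd <- function(x, y, lambda, beta = rep(0, ncol(x)), maxit = 1000, tol = 1e-8) {
  n <- nrow(x)
  Sigma_hat_diag <- colSums(x^2) / n                    # diagonal of Sigma_hat = X^T X / n
  for (m in seq_len(maxit)) {
    beta_old <- beta
    for (j in seq_len(ncol(x))) {
      r_j <- y - x[, -j, drop = FALSE] %*% beta[-j]     # Y - X beta_{-j}
      z_j <- sum(x[, j] * r_j) / n                      # Z_j = X_j^T (Y - X beta_{-j}) / n
      beta[j] <- sign(z_j) * max(abs(z_j) - lambda / 2, 0) / Sigma_hat_diag[j]
    }
    if (max(abs(beta - beta_old)) < tol) break          # numerical convergence
  }
  beta
}

## e.g. on the toy data from the first glmnet sketch:
## b <- lasso_cd(x, y, lambda = 0.5); kkt_check(x, y, b, lambda = 0.5)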
riboflavin example, n = 71, p = 4088:
0.33 secs. CPU using the glmnet package in R
(Friedman, Hastie & Tibshirani, 2008)

the coordinate descent algorithm converges to a stationary point
(Paul Tseng, ≈ 2000)
⇝ convergence to a global optimum, due to the convexity of the problem
main assumption:
objective function = smooth function + separable penalty
here, "separable" means "additive", i.e. pen(β) = Σ_{j=1}^p p_j(β_j)

failure of the coordinate descent algorithm: the Fused Lasso,

β̂ = argmin_β ( ‖Y − Xβ‖₂²/n + λ₁ Σ_{j=2}^p |β_j − β_{j−1}| + λ₂‖β‖₁ ),

but Σ_{j=2}^p |β_j − β_{j−1}| is non-separable
[Figure: contour lines of the penalties for p = 2, in the (β₁, β₂)-plane:
|β₁| + |β₂| and |β₁ − β₂|]
Thank you!
References: most of the material is covered in

Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data:
Methodology, Theory and Applications. Springer.