Statistics for High-Dimensional Data: Selected Topics
Peter Bühlmann
Seminar für Statistik, ETH Zürich
June 2014

High-dimensional linear model: basic concepts in methodology, theory and computation

High-dimensional data
Riboflavin production with Bacillus Subtilis (in collaboration with DSM (Switzerland))
goal: improve the riboflavin production rate of Bacillus Subtilis using clever genetic engineering
response variable Y ∈ R: riboflavin (log-)production rate
covariates X ∈ R^p: expressions from p = 4088 genes
sample size n = 115, p ≫ n
[Figure: gene expression data, Y versus 9 "reasonable" genes.]

general framework: Z_1, ..., Z_n (with some "i.i.d. components"), dim(Z_i) ≫ n
for example:
- Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ Y ⊆ R: regression with p ≫ n
- Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ {0, 1}: classification with p ≫ n
numerous applications: biology, imaging, economy, environmental sciences, ...

High-dimensional linear models
Y_i = Σ_{j=1}^p β⁰_j X_i^{(j)} + ε_i,  i = 1, ..., n,  p ≫ n
in short: Y = Xβ⁰ + ε
goals:
- prediction, e.g. w.r.t. squared prediction error
- estimation of β⁰, e.g. w.r.t. ‖β̂ − β⁰‖_q (q = 1, 2)
- variable selection, i.e. estimating the active set of effective variables (those with coefficient ≠ 0)

we need to regularize... and there are many proposals
- Bayesian methods for regularization
- greedy algorithms: aka forward selection or boosting
- preliminary dimension reduction
- ...
e.g. 4'160'000 entries on Google Scholar for "high dimensional linear model" ...

Penalty-based methods
if the true β⁰ is sparse w.r.t.
- ‖β⁰‖₀ = number of non-zero coefficients
  ⇝ regularize with the ℓ₀-penalty: argmin_β ( ‖Y − Xβ‖²₂/n + λ‖β‖₀ ), e.g. AIC, BIC
  ⇝ computationally infeasible if p is large (2^p sub-models)
- ‖β⁰‖₁ = Σ_{j=1}^p |β⁰_j|
  ⇝ penalize with the ℓ₁-norm, i.e. the Lasso: argmin_β ( ‖Y − Xβ‖²₂/n + λ‖β‖₁ )
  ⇝ convex optimization: computationally feasible and very fast for large p
as we will see: regularization with the ℓ₁-norm is good ("nearly optimal") even if the truth is sparse w.r.t. ‖β⁰‖₀
⇝ we can exploit the computational advantage of ℓ₁-norm regularization even if the problem has a different sparsity structure

The Lasso (Tibshirani, 1996)
Lasso for linear models:
β̂(λ) = argmin_β ( ‖Y − Xβ‖²₂/n + λ‖β‖₁ ),  λ ≥ 0,  ‖β‖₁ = Σ_{j=1}^p |β_j|
⇝ convex optimization problem
- the Lasso does variable selection: some of the β̂_j(λ) = 0 (because of the "ℓ₁-geometry")
- β̂(λ) is a shrunken LS-estimate
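As a concrete complement to this definition, here is a minimal R sketch that fits the Lasso with the glmnet package (used later in these slides) on simulated sparse data. The dimensions, the sparsity s0 and the chosen value of lambda are purely illustrative assumptions, and note that glmnet scales the squared-error term as ‖Y − Xβ‖²₂/(2n), so its lambda differs from the one above by a constant factor.

## Minimal illustration of the Lasso with glmnet on simulated sparse data.
## All dimensions, the sparsity and the lambda value below are arbitrary choices.
library(glmnet)

set.seed(1)
n <- 100; p <- 500; s0 <- 5                 # p >> n, sparse truth
X <- matrix(rnorm(n * p), n, p)
beta0 <- c(rep(2, s0), rep(0, p - s0))      # only the first s0 coefficients are non-zero
y <- drop(X %*% beta0 + rnorm(n))

fit  <- glmnet(X, y, alpha = 1)             # alpha = 1: l1 (Lasso) penalty, whole lambda path
bhat <- as.numeric(coef(fit, s = 0.5))[-1]  # coefficients at one lambda value (intercept dropped)
sum(bhat != 0)                              # many coefficients are set exactly to zero
max(abs(bhat[1:s0]))                        # the non-zero estimates are shrunken towards zero

Evaluating coef(fit, s = ...) at several values of lambda traces how the selected set grows as the penalty decreases.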
more about the "ℓ₁-geometry": equivalence to the primal problem
β̂_primal(R) = argmin_{β; ‖β‖₁ ≤ R} ‖Y − Xβ‖²₂/n,
with a correspondence between λ and R which depends on the data (X₁, Y₁), ..., (X_n, Y_n)
such an equivalence holds since
- ‖Y − Xβ‖²₂/n is convex in β
- the constraint ‖β‖₁ ≤ R is convex
(see e.g. Bertsekas, 1995)

[Figure, p = 2: contour lines of the residual sum of squares and the ℓ₁-ball resp. ℓ₂-ball.]
ℓ₁-"world": the residual sum of squares reaches its minimal value (for certain constellations of the data) when its contour lines hit the ℓ₁-ball in a corner ⇝ β̂₁ = 0
the ℓ₂-"world" is different: Ridge regression,
β̂_Ridge(λ) = argmin_β ( ‖Y − Xβ‖²₂/n + λ‖β‖²₂ ),
has the equivalent primal solution
β̂_Ridge;primal(R) = argmin_{β; ‖β‖₂ ≤ R} ‖Y − Xβ‖²₂/n,
with a one-to-one correspondence between λ and R

A note on the Bayesian approach
model: β₁, ..., β_p i.i.d. ∼ p(β)dβ; given β: Y ∼ N_n(Xβ, σ²I_n) with density f(y|β, σ²)
posterior density:
p(β|Y, σ²) = f(Y|β, σ²) p(β) / ∫ f(Y|β, σ²) p(β) dβ ∝ f(Y|β, σ²) p(β)
and hence for the MAP (Maximum A-Posteriori) estimator:
β̂_MAP = argmax_β p(β|Y, σ²) = argmin_β ( −log[ f(Y|β, σ²) p(β) ] )
       = argmin_β ( (1/(2σ²)) ‖Y − Xβ‖²₂ − Σ_{j=1}^p log p(β_j) )
examples:
1. Double-Exponential prior DExp(τ): p(β) = (τ/2) exp(−τ|β|)
   ⇝ β̂_MAP equals the Lasso with penalty parameter λ = n⁻¹ 2σ²τ
2. Gaussian prior N(0, τ²): p(β) = (1/(√(2π) τ)) exp(−β²/(2τ²))
   ⇝ β̂_MAP equals the Ridge estimator with penalty parameter λ = n⁻¹ σ²/τ²
but we will argue that the Lasso (i.e., the MAP estimator under the Double-Exponential prior) is also good if the truth is sparse with respect to ‖β⁰‖₀, e.g. if the prior is (much) more spiky around zero than the Double-Exponential distribution

Orthonormal design
Y = Xβ + ε,  n⁻¹XᵀX = I
Lasso = soft-thresholding estimator:
β̂_j(λ) = sign(Z_j)(|Z_j| − λ/2)₊,  Z_j = (n⁻¹XᵀY)_j (= OLS),
i.e. β̂_j(λ) = g_soft(Z_j)  [Exercise]
[Figure: threshold functions: adaptive Lasso, hard-thresholding, soft-thresholding.]

Prediction
goal: predict a new observation Y_new
consider the expected (w.r.t. new data; and random X) squared error loss:
E_{X_new, Y_new}[(Y_new − X_new β̂)²] = σ² + E_{X_new}[(X_new(β⁰ − β̂))²] = σ² + (β̂ − β⁰)ᵀ Σ (β̂ − β⁰),  Σ = Cov(X)
⇝ terminology "prediction error":
for random design X: (β̂ − β⁰)ᵀ Σ (β̂ − β⁰) = E_{X_new}[(X_new(β̂ − β⁰))²]
for fixed design X: (β̂ − β⁰)ᵀ Σ̂ (β̂ − β⁰) = ‖X(β̂ − β⁰)‖²₂/n

binary lymph node classification using gene expressions: a high-noise problem
n = 49 samples, p = 7130 gene expressions
even though this is classification: P[Y = 1|X = x] = E[Y|X = x]
⇝ estimate p̂(x) via a linear model; can then do classification
cross-validated misclassification error (2/3 training; 1/3 test):

  Lasso        21.1%
  L2Boosting   17.7%
  FPLR         35.25%
  Pelora       27.8%
  1-NN         43.25%
  DLDA         36.12%
  SVM          36.88%

(the classical methods 1-NN, DLDA and SVM use a preliminary variable selection of the best 200 genes (Wilcoxon test); the other methods use no additional variable selection)
the Lasso selected on CV-average 13.12 out of the p = 7130 genes
from a practical perspective: if you trust in cross-validation, we can validate how good we are;
i.e. prediction may be a black box, but we can evaluate it!
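The "if you trust in cross-validation" point can be made concrete with cv.glmnet. A minimal sketch, with two caveats: the lymph node data are not reproduced here, so simulated stand-in data are used (all dimensions and signals below are illustrative assumptions), and the fit is the penalized logistic variant (family = "binomial") rather than the linear-model fit of the slide, purely for convenience.

## Cross-validated evaluation of an l1-penalized classifier on simulated data
## (stand-in for the lymph node example; all numbers below are illustrative).
library(glmnet)

set.seed(2)
n <- 100; p <- 1000
X <- matrix(rnorm(n * p), n, p)
prob <- drop(plogis(X[, 1:3] %*% c(2, -2, 1)))   # only 3 covariates carry signal
y <- rbinom(n, 1, prob)

cvfit <- cv.glmnet(X, y, family = "binomial",
                   type.measure = "class", nfolds = 5)
min(cvfit$cvm)        # cross-validated misclassification error
cvfit$lambda.min      # penalty level minimizing the CV error

Whatever happens inside the fitted rule, the cross-validated error plays the role of the misclassification rates in the table above: the predictor may be a black box, but its performance can be validated.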
and in fact, we will hear that
- the Lasso is consistent for prediction assuming "essentially nothing"
- the Lasso is optimal for prediction assuming the "compatibility condition" for X

Estimation of regression coefficients
Y = Xβ⁰ + ε,  p ≫ n,  with fixed (deterministic) design X
problem of identifiability: for p > n, Xβ⁰ = Xθ for any θ = β⁰ + ξ with ξ in the null space of X
⇝ cannot say anything about ‖β̂ − β⁰‖ without further assumptions!
⇝ we will work with the compatibility assumption (see later)
and we will explain: under the compatibility condition,
‖β̂ − β⁰‖₁ ≤ C (s₀/φ₀²) √(log(p)/n),  s₀ = |supp(β⁰)| = |{j; β⁰_j ≠ 0}|
⇒ let's turn to the blackboard!

[Diagram (van de Geer & PB, 2009): relations among the various conditions used for oracle inequalities for prediction and estimation: RIP, coherence, (weak/adaptive) irrepresentable and restricted regression conditions, restricted eigenvalue conditions, S-compatibility.]

Variable selection
Example: motif regression for finding HIF1α transcription factor binding sites in DNA sequences (Müller, Meier, PB & Ricci; never published...)
Y_i ∈ R: univariate response measuring the binding intensity of HIF1α on coarse DNA segment i (from CHIP-chip experiments)
X_i = (X_i^{(1)}, ..., X_i^{(p)}) ∈ R^p:
X_i^{(j)} = abundance score of candidate motif j in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN)
question: what is the relation between the binding intensity Y and the abundance of short candidate motifs?
⇝ a linear model is often reasonable: "motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)
Y = Xβ + ε,  n = 287,  p = 195
goal: variable selection ⇝ find the relevant motifs among the p = 195 candidates

Lasso for variable selection:
Ŝ(λ) = {j; β̂_j(λ) ≠ 0} as an estimator of S₀ = {j; β⁰_j ≠ 0}
no significance testing involved; it's convex optimization only! (and that can be a problem... see later)

back to the motif regression: variable selection in the linear model
Y_i = β₀ + Σ_{j=1}^p β_j X_i^{(j)} + ε_i,  i = 1, ..., n = 287,  p = 195
⇝ the Lasso selects 26 covariates and R² ≈ 50%, i.e. 26 interesting candidate motifs
[Figure: motif regression, estimated coefficients β̂(λ̂_CV), original data.]
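To make the selection step concrete, here is a minimal sketch of how Ŝ(λ̂_CV) = {j; β̂_j(λ̂_CV) ≠ 0} is read off from a cross-validated Lasso fit. The motif data are not included here, so simulated data with the same n and p stand in, and the signal below is an arbitrary assumption.

## Reading off the selected set S_hat(lambda_CV) from a cross-validated Lasso fit.
## Simulated stand-in data with the same n and p as the motif regression.
library(glmnet)

set.seed(3)
n <- 287; p <- 195
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:6] %*% rep(1, 6) + rnorm(n))            # 6 "relevant motifs" (illustrative)

cvfit <- cv.glmnet(X, y)                                # Gaussian family, 10-fold CV by default
bhat  <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # drop the intercept
Shat  <- which(bhat != 0)                               # S_hat(lambda_CV)
length(Shat)                                            # number of selected covariates

Up to the choice of software and tuning, this is how a selected set like the 26 candidate motifs above is obtained.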
"Theory" for variable selection with the Lasso
for the (fixed design) linear model Y = Xβ⁰ + ε with active set S₀ = {j; β⁰_j ≠ 0}
two key assumptions:
1. neighborhood stability condition for the design X ⇔ irrepresentable condition for the design X
2. beta-min condition: min_{j∈S₀} |β⁰_j| ≥ C √(s₀ log(p)/n),  C suitably large
both conditions are sufficient and "essentially" necessary for
Ŝ(λ) = S₀ with high probability, for λ ≍ √(log(p)/n) (larger than for prediction)
already proved in Meinshausen & PB, 2004 (publ.: 2006)
and both assumptions are restrictive!

neighborhood stability condition ⇔ irrepresentable condition (Zhao & Yu, 2006)
Σ̂ = n⁻¹XᵀX; assume the active set S₀ = {j; β⁰_j ≠ 0} = {1, ..., s₀} consists of the first s₀ variables, and partition
Σ̂ = [ Σ̂_{S₀,S₀}  Σ̂_{S₀,S₀ᶜ} ; Σ̂_{S₀ᶜ,S₀}  Σ̂_{S₀ᶜ,S₀ᶜ} ]
irrepresentable condition: ‖ Σ̂_{S₀ᶜ,S₀} Σ̂_{S₀,S₀}⁻¹ sign(β⁰_1, ..., β⁰_{s₀})ᵀ ‖_∞ < 1

not very realistic assumptions... what can we expect?
recall: under the compatibility condition,
‖β̂ − β⁰‖₁ ≤ C (s₀/φ₀²) √(log(p)/n)
consider the relevant active variables
S_relev = {j; |β⁰_j| > C (s₀/φ₀²) √(log(p)/n)}
then, clearly, Ŝ ⊇ S_relev with high probability
(if some j ∈ S_relev had β̂_j = 0, the ℓ₁-bound above would already be violated by that single coefficient)
⇝ screening for detecting the relevant variables is possible, without a beta-min condition and assuming the compatibility condition only
in addition, assuming the beta-min condition min_{j∈S₀} |β⁰_j| > C (s₀/φ₀²) √(log(p)/n):
Ŝ ⊇ S₀ with high probability
⇝ screening for detecting the true variables
Tibshirani (1996): LASSO = Least Absolute Shrinkage and Selection Operator
new translation: LASSO = Least Absolute Shrinkage and Screening Operator

Practical perspective
choice of λ: λ̂_CV from cross-validation
empirical and theoretical indications (Meinshausen & PB, 2006) that Ŝ(λ̂_CV) ⊇ S₀ (or S_relev)
moreover, |Ŝ(λ̂_CV)| ≤ min(n, p) (= n if p ≫ n)
⇝ huge dimensionality reduction (in the original covariates)
[Figure: motif regression, estimated coefficients β̂(λ̂_CV), original data.]
which variables in Ŝ are false positives? (p-values would be very useful!)

recall: Ŝ(λ̂_CV) ⊇ S₀ (or S_relev),
and we would then use a second stage to reduce the number of false positive selections
⇝ re-estimation on a much smaller model with the variables from Ŝ, for example (see the sketch after this list):
- OLS on Ŝ with e.g. BIC variable selection
- thresholding coefficients and OLS re-estimation
- adaptive Lasso (Zou, 2006)
- ...
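A sketch of the two-stage idea: Lasso screening followed by re-estimation on the much smaller selected model. The second stage below is a plain OLS refit on Ŝ (a BIC search among submodels of Ŝ, thresholding, or the adaptive Lasso of the next slide could be used instead); the data are simulated and all dimensions are illustrative assumptions.

## Two-stage procedure: (1) Lasso screening via cross-validation,
## (2) OLS re-estimation using only the screened variables.
library(glmnet)

set.seed(4)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:5] %*% rep(2, 5) + rnorm(n))

cvfit <- cv.glmnet(X, y)
bhat  <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]
Shat  <- which(bhat != 0)                       # screened set, typically containing S_0

refit <- lm(y ~ X[, Shat, drop = FALSE])        # second stage: OLS on the screened variables
coef(refit)                                     # re-estimated (less shrunken) coefficients

Thresholding the Lasso coefficients before the refit, or a BIC search among submodels of Shat, would slot in between the two stages.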
Adaptive Lasso (Zou, 2006)
re-weighting the penalty function:
β̂ = argmin_β ( ‖Y − Xβ‖²₂/n + λ Σ_{j=1}^p |β_j| / |β̂_init,j| ),
with β̂_init,j from the Lasso in a first stage (or from OLS if p < n)
⇝ less bias than the Lasso
for orthogonal design, if β̂_init = OLS: adaptive Lasso = NN-garrote
[Figure: threshold functions: adaptive Lasso, hard-thresholding, soft-thresholding.]

motif regression: Lasso versus adaptive Lasso
[Figure: estimated coefficients for the motif regression, Lasso (left) and adaptive Lasso (right).]
the Lasso selects 26 variables, the adaptive Lasso selects 16 variables

KKT conditions and computation
characterization of the solution(s) β̂ as minimizer(s) of the criterion function
Q_λ(β) = ‖Y − Xβ‖²₂/n + λ‖β‖₁
since Q_λ(·) is a convex function, it is necessary and sufficient that the subdifferential of Q_λ(β) at β̂ contains the zero element
Lemma (first part). Denote by G(β) = −2Xᵀ(Y − Xβ)/n the gradient vector of ‖Y − Xβ‖²₂/n. Then β̂ is a solution if and only if
G_j(β̂) = −sign(β̂_j) λ if β̂_j ≠ 0,
|G_j(β̂)| ≤ λ if β̂_j = 0
Lemma (second part). If the solution of argmin_β Q_λ(β) is not unique (e.g. if p > n), and if |G_j(β̂)| < λ for some solution β̂, then β̂_j = 0 for all (other) solutions β̂ in argmin_β Q_λ(β). The zeroes are "essentially" unique ("essentially" refers to the situation β̂_j = 0 and |G_j(β̂)| = λ)
Proof: Exercise (optional), or see the book by Bühlmann and van de Geer (2011; Lemma 2.1)

Coordinate descent algorithm for computation
a general idea is to work along a grid of λ-values with warm starts: compute β̂(λ_{grid,k}) and use it as starting value for the computation of β̂(λ_{grid,k−1}), where λ_{grid,k−1} < λ_{grid,k}
β^(0) ∈ R^p an initial parameter vector; set m = 0
REPEAT:
  increase m by one: m ← m + 1
  for j = 1, ..., p:
    if |G_j(β^(m−1)_{−j})| ≤ λ: set β_j^(m) = 0,
    otherwise: β_j^(m) = argmin_{β_j} Q_λ(β^(m−1)_{+j}),
    where β^(m−1)_{−j} is the parameter vector with its jth component set to zero,
    and β^(m−1)_{+j} equals β^(m−1) except for the jth component, which equals β_j
UNTIL numerical convergence

for the squared error loss: explicit updating formulae (Exercise)
G_j(β) = −2X_jᵀ(Y − Xβ)/n,
β_j^(m) = sign(Z_j)(|Z_j| − λ/2)₊ / Σ̂_{jj},  Z_j = X_jᵀ(Y − Xβ_{−j})/n,  Σ̂ = n⁻¹XᵀX
⇝ componentwise soft-thresholding
this is very fast if the true problem is sparse
active set strategy: can do non-systematic cycling, visiting mainly the active (non-zero) components
riboflavin example, n = 71, p = 4088: 0.33 secs. CPU using the glmnet package in R (Friedman, Hastie & Tibshirani, 2008)

the coordinate descent algorithm converges to a stationary point (Paul Tseng, ≈ 2000)
⇝ convergence to a global optimum, due to convexity of the problem
main assumption: objective function = smooth function + separable penalty
here: "separable" means "additive", i.e. pen(β) = Σ_{j=1}^p p_j(β_j)

failure of the coordinate descent algorithm: Fused Lasso
β̂ = argmin_β ( ‖Y − Xβ‖²₂/n + λ₁ Σ_{j=2}^p |β_j − β_{j−1}| + λ₂ ‖β‖₁ )
but Σ_{j=2}^p |β_j − β_{j−1}| is non-separable
[Figure, p = 2: contour lines of the separable penalty |β₁| + |β₂| and of the non-separable penalty |β₁ − β₂|.]
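To tie the computational part together, here is a small self-contained R implementation of coordinate descent for the separable Lasso penalty (not the fused Lasso), using the componentwise soft-thresholding update above. It is written for clarity rather than speed, uses immediate (Gauss-Seidel style) updates, and the data at the end are simulated purely as a sanity check; glmnet remains the tool of choice in practice.

## Coordinate descent for the Lasso objective ||Y - X beta||_2^2 / n + lambda * ||beta||_1,
## via componentwise soft-thresholding; updated components are used immediately.
lasso_cd <- function(X, y, lambda, max_iter = 200, tol = 1e-8) {
  n <- nrow(X); p <- ncol(X)
  sigma_diag <- colSums(X^2) / n               # diagonal of Sigma_hat = X^T X / n
  beta <- rep(0, p)
  r <- drop(y - X %*% beta)                    # current residual Y - X beta
  for (m in seq_len(max_iter)) {
    beta_old <- beta
    for (j in seq_len(p)) {
      r <- r + X[, j] * beta[j]                # residual with the j-th contribution removed
      z <- sum(X[, j] * r) / n                 # Z_j = X_j^T (Y - X beta_{-j}) / n
      beta[j] <- sign(z) * max(abs(z) - lambda / 2, 0) / sigma_diag[j]   # soft-thresholding
      r <- r - X[, j] * beta[j]                # put the updated j-th contribution back
    }
    if (max(abs(beta - beta_old)) < tol) break # numerical convergence
  }
  beta
}

## sanity check on simulated data with 3 active variables
set.seed(5)
n <- 100; p <- 50
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X[, 1:3] %*% c(3, -2, 1) + rnorm(n))
round(lasso_cd(X, y, lambda = 0.5)[1:6], 3)    # first three entries non-zero (shrunken), rest ~ 0

For the fused Lasso penalty above, the same componentwise scheme can get stuck, since that penalty is no longer a sum of functions of the individual β_j.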
Thank you!
References: most of the material is covered in
Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methodology, Theory and Applications. Springer.