When Y is a numerical variable
Multiple Regression
Rosner, Section 11.9
Outcome Y is a continuous variable:
- Only one X, and X is continuous: Simple Linear Regression
- Only one X, and X is categorical: ANOVA
- All X's are continuous: Multiple Linear Regression
- All X's are categorical (use of dummy variables): Two-way ANOVA, three-way ANOVA, multiway ANOVA
- Some X's are continuous, and some are categorical: Analysis of Covariance (ANCOVA), Linear Models
Multiple Linear Regression
- Examines the linear relationship between one or more observed variables (design variables, or independent variables) and an outcome variable.
- Outcome → dependent variable: must be continuous.
- Design → independent variables:
  - mostly continuous;
  - if categorical, they need special handling (dummy variables);
  - the X's must be independent of one another.
Page 106, Pearson and Turton: Statistical methods in environmental health
Multiple Linear Regression: Y (continuous)
$y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_k X_k$
- β1, ..., βk: regression coefficients
- Dependent variable: Y
- Independent variables: X1, X2, ..., Xk
Multiple Linear Regression: estimated values
$\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \dots + \hat{\beta}_k X_k$
- $\hat{y}_i$ is the estimated value, obtained by plugging in the parameter estimates.
- The parameter estimates are obtained after X1, X2, ..., Xk have been adjusted for one another.
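As a minimal sketch of how parameter estimates are obtained and then plugged in, the Python example below fits a two-predictor model by ordinary least squares with NumPy. The data are hypothetical and noise-free (generated from y = 1 + 2·X1 − 0.5·X2), so the fit recovers those coefficients; real data would not behave this neatly.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 1 + 2*x1 - 0.5*x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.0 + 2.0 * x1 - 0.5 * x2

# Design matrix: a column of ones for the intercept, then one column per X.
design = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares estimates all coefficients jointly, so each beta-hat
# is adjusted for the other X in the model.
estimates, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha_hat, beta1_hat, beta2_hat = estimates

# Plug the estimates back in to obtain the estimated values y-hat.
y_hat = alpha_hat + beta1_hat * x1 + beta2_hat * x2

print("alpha-hat, beta1-hat, beta2-hat:", np.round(estimates, 6))
```

Because all coefficients are solved for together, the estimate for X1 already accounts for X2, and vice versa.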
Meaning of βj
For each βj, j = 1, 2, ..., k:
- βj is the average change in y per unit increase in Xj, with all other variables held constant (that is, after adjusting for all other variables in the model).
- Hypothesis testing of βj (p-value): H0: βj = 0. The larger |βj| is relative to its standard error, the more readily the test reaches significance.
- βj > 0: positive direction; βj < 0: negative direction.
Rosner example 11.39
- Birthweight in oz (X1)
- Age in days (X2)
- Systolic blood pressure in mmHg (y)
- k = 2

Response: SBP

Summary of Fit
RSquare                      0.88091
RSquare Adj                  0.862589
Root Mean Square Error       2.479173
Mean of Response             88.0625
Observations (or Sum Wgts)   16

Parameter Estimates (relationship of each X vs Y; significance of each β)
Term                Estimate     Std Error   t Ratio   Prob>|t|
Intercept           53.450194    4.531889    11.79     <.0001
birthwgt(oz) (β1)   0.1255833    0.034336    3.66      0.0029
age(days) (β2)      5.8877191    0.680205    8.66      <.0001

Analysis of Variance (tests whether all the X variables together are significantly related to Y)
Source     DF   Sum of Squares   Mean Square   F Ratio    Prob>F
Model      2    591.03564        295.518       48.0806    <0.0001
Error      13   79.90186         6.146
C.Total    15   670.93750
R² = 88.1%: birthweight and age together explain 88.1% of the variance of Y (SBP).
- β1: after adjusting for age, a 1 oz increase in birthweight is associated with a 0.13 mmHg increase in SBP; this differs significantly from 0 (p = 0.0029).
- β2: after adjusting for birthweight, a 1-day increase in age is associated with a 5.89 mmHg increase in SBP; this is statistically significant (p < 0.0001).
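The summary statistics in this output are all linked by simple arithmetic. The sketch below recomputes R², adjusted R², the F ratio, the root mean square error, and the t ratios from the sums of squares, estimates, and standard errors reported for Rosner example 11.39; only the variable names are introduced here.

```python
import math

# Values copied from the Analysis of Variance table.
ss_model, df_model = 591.03564, 2
ss_error, df_error = 79.90186, 13
ss_total = ss_model + ss_error
df_total = df_model + df_error

ms_model = ss_model / df_model
ms_error = ss_error / df_error

r_square = ss_model / ss_total                       # share of variance explained
r_square_adj = 1 - ms_error / (ss_total / df_total)  # penalized for k = 2 predictors
f_ratio = ms_model / ms_error                        # overall test of all X's together
rmse = math.sqrt(ms_error)                           # root mean square error

# Values copied from the Parameter Estimates table: (estimate, standard error).
terms = {
    "Intercept": (53.450194, 4.531889),
    "birthwgt(oz)": (0.1255833, 0.034336),
    "age(days)": (5.8877191, 0.680205),
}
# Each t ratio is the estimate divided by its standard error.
t_ratios = {name: est / se for name, (est, se) in terms.items()}

print(f"RSquare = {r_square:.5f}, RSquare Adj = {r_square_adj:.6f}")
print(f"F Ratio = {f_ratio:.4f}, RMSE = {rmse:.6f}")
for name, t in t_ratios.items():
    print(f"{name}: t = {t:.2f}")
```

The t ratio depends on the estimate relative to its standard error, which is why the small birthweight coefficient can still be significant.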
TABLE 7.1
Sample Table for Reporting a Multiple Linear Regression Model with Three Explanatory Variables.

Sample presentation:
We developed a model to predict a score of overall function, Y, for patients with multiple sclerosis based on disease severity, X1 (level 1 being least severe and level 15 being most severe); ambulatory ability (measured as the rate of walking in laps per minute), X2; and number of lesions, X3:

Y = 40.8 + 3.98X1 + 1.22X2 - 2.09X3

Variable    Coefficient (β)   Standard Error   95% CI            Wald χ2   P
Intercept   40.79             2.55             ─                 ─         ─
X1          3.98              2.37             -0.67 to 5.63     1.68      0.100
X2          1.23              0.29             0.66 to 1.80      4.20      <0.001
X3          -2.09             0.28             -2.64 to -1.54    -7.43     <0.001

where
intercept       = a mathematical constant; no clinical interpretation
X1 to X3        = the three explanatory variables
coefficient     = the mathematical weightings of the explanatory variables in the equation
standard error  = estimated precision of the coefficients
95% CI          = 95% confidence intervals for the coefficients
Wald χ2         = the Wald test statistic calculated from the data, to be compared with the chi-square distribution with 1 degree of freedom
P value         = variables 2 and 3 are statistically significant predictors of the response variable

From: Lang & Secic, How to report statistics in medicine. 2nd (2006)
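To illustrate how a reported equation of this kind is read, the sketch below evaluates the Table 7.1 equation as printed (Y = 40.8 + 3.98X1 + 1.22X2 - 2.09X3) for a hypothetical patient and then changes one predictor by one unit. The function name and the patient values are assumptions made for illustration only.

```python
def predict_function_score(x1, x2, x3):
    """Predicted overall function score from the equation printed with Table 7.1."""
    return 40.8 + 3.98 * x1 + 1.22 * x2 - 2.09 * x3

# Hypothetical patient: severity level 5, 2 laps per minute, 3 lesions.
baseline = predict_function_score(5, 2.0, 3)

# Same patient with one additional lesion, all else held constant.
one_more_lesion = predict_function_score(5, 2.0, 4)

# The difference equals the coefficient of X3.
change = one_more_lesion - baseline
print(f"baseline = {baseline:.2f}, change per extra lesion = {change:.2f}")
```

Holding X1 and X2 fixed, the predicted score moves by exactly the X3 coefficient, which is the "adjusted for the other variables" reading of a regression coefficient.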
When do we use regression?
- Characterize the relationship between the dependent and independent variables by determining the extent, direction, and strength of the association.
- Seek a quantitative formula or equation to describe (e.g., predict) the dependent variable Y as a function of the independent variables.
From pages 34-35, Kleinbaum, Kupper, Muller, Nizam: Applied Regression Analysis and Multivariable Methods. Duxbury, CA, USA
When do we use regression? (continued)
- Describe quantitatively or qualitatively the relationship between the X's and Y while controlling for the effects of still other variables.
- Determine which of several independent variables are important, and which are not, for describing or predicting a dependent variable.
- Determine the best mathematical model for describing the relationship between Y and the X's.
When do we use regression? (continued)
- Compare several derived regression relationships.
- Assess the interactive effects of 2 or more independent variables.
- Obtain a valid and precise estimate of one or more regression coefficients.

Association vs causality
- A "statistically significant" association in a particular study does not establish a causal relationship.
- To evaluate claims of causality, one must consider criteria that are external to the specific characteristics and results of any single study.
- Experimental proof: a change in X always produces a change in Y.
- Combined results from several studies.
From pages 36-37, Kleinbaum, Kupper, Muller, Nizam: Applied Regression Analysis and Multivariable Methods. Duxbury, CA, USA
7 criteria:
1. Strength of association
2. Dose-response effect
3. Lack of temporal ambiguity
4. Consistency of the findings
5. Biological and theoretical plausibility of the hypothesis
6. Coherence of the evidence
7. Specificity of the association
Statistical vs deterministic
- A statistical model, unlike a deterministic one, always involves error.
From Kleinbaum, Kupper, Muller, Nizam: Applied Regression Analysis and Multivariable Methods. Duxbury, CA, USA
Basic assumptions of regression (KKM, pages 115-117)
- Existence: for every value of X in the data there is a corresponding y, whose mean and standard deviation exist and are finite.
- Independence: the observations are independent of one another.
- Linearity: y is linearly related to x.
- Homoscedasticity: the variance of the residuals (what remains of y after adjusting for the x variables) is the same at every value of x.
- Normality: the residuals of y after adjusting for x are normally distributed.
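In the usual textbook formulation (stated here as a summary, not copied from the slides), the linearity, independence, homoscedasticity, and normality assumptions can be written in one line. The error term ε_i, its common variance σ², the observation index i, and the sample size n are symbols introduced here for illustration:

```latex
y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik} + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2), \quad i = 1, \dots, n
```

The linear form of the mean expresses linearity, "iid" expresses independence, the single σ² expresses homoscedasticity, and N(·) expresses normality.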
Identifying Confounding Factors
Definition 13.9 (Rosner): a confounder is a factor that is associated with both the Y and X variables. Such a variable must usually be controlled for before looking at the original Y-X relationship.
Confounding
Y = SBP, C = AGE
To check whether variable C is a confounding factor for the relationship between Y and X1, compare the coefficients of X1 in two models:
- Model 1, adjusted (multiple, multivariable, or multivariate), Y = X1 C:
  SBP = α1 + β1 × X1 + β3 × AGE
- Model 2, crude (univariate), Y = X1:
  SBP = α2 + β2 × X1

Interpretation:
- β1 = β2: the relationship of X1 and SBP is NOT affected by AGE
- β1 > β2: the relationship of X1 and SBP is affected by AGE
- β1 < β2: the relationship of X1 and SBP is affected by AGE

Example 1 (the two coefficients of X1 are nearly equal):
SBP = α1 + 4.1 × X1 + β3 × AGE
SBP = α2 + 4.2 × X1

Example 2 (the two coefficients of X1 differ markedly):
SBP = α1 + 4.1 × X1 + β3 × AGE
SBP = α2 + 15.9 × X1
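The crude-versus-adjusted comparison can be sketched in a few lines of Python. The data below are hypothetical and noise-free, built so that AGE is correlated with X1 and also drives SBP; the helper `ols` is an illustrative name, not part of any library.

```python
import numpy as np

def ols(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Hypothetical data: SBP = 2 + 1*X1 + 3*AGE, with X1 tracking AGE closely.
age = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x1 = age + np.array([0.5, -0.5, 0.5, -0.5, 0.5, -0.5])
sbp = 2.0 + 1.0 * x1 + 3.0 * age

# Model 1 (adjusted): SBP on X1 and AGE; element [1] is the X1 coefficient.
adjusted_beta = ols([x1, age], sbp)[1]

# Model 2 (crude): SBP on X1 alone; X1 now also absorbs the effect of AGE.
crude_beta = ols([x1], sbp)[1]

print(f"adjusted beta for X1 = {adjusted_beta:.2f}")
print(f"crude beta for X1    = {crude_beta:.2f}")
```

In this constructed example the crude coefficient is several times the adjusted one, the pattern that signals confounding by AGE.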
Example: He, et al., American Journal of Hypertension 1999; 12:555-562 (pp. 558-559)
- Crude model: SBP = PlasmaRA
- Adjusted model: SBP = PlasmaRA + age + gender + BMI + alcohol + HR + race
Plasma renin activity was inversely related to both systolic and diastolic BP in the total sample, independent of age, gender, race, BMI, alcohol consumption, and heart rate.

How big is DIFFERENT?
- Judge objectively, and from a clinical point of view.
- Max/min > 2: definitely different
- Max/min < 1.5: definitely not different
- Between 1.5 and 2: up to the authors
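The max/min rule of thumb can be written as a small helper. This is a sketch under the assumption that the ratio is taken between the absolute values of the crude and adjusted coefficients of the same X; the function name and the returned labels are illustrative.

```python
def compare_coefficients(beta_a, beta_b):
    """Classify two coefficients of the same X with the max/min rule of thumb."""
    larger = max(abs(beta_a), abs(beta_b))
    smaller = min(abs(beta_a), abs(beta_b))
    if smaller == 0:
        raise ValueError("ratio is undefined when a coefficient is zero")
    ratio = larger / smaller
    if ratio > 2:
        return "definitely different"
    if ratio < 1.5:
        return "definitely not different"
    return "up to the authors"

# Coefficient pairs from the confounding examples in these slides.
print(compare_coefficients(4.1, 4.2))
print(compare_coefficients(4.1, 15.9))
```

Ratios that land between 1.5 and 2 are deliberately left to judgment, so the helper only labels them rather than deciding.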
Interactions (Definition 12.8)
$y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \dots + \beta_k X_k$
- β1, ..., βk: regression coefficients; β3 multiplies the interaction (product) term X1 × X2.
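A product term can be fitted like any other column of the design matrix. The sketch below uses hypothetical, noise-free data generated from y = 1 + 2·X1 + 3·X2 + 0.5·X1·X2, with X2 coded 0/1, to show how the interaction coefficient changes the slope of X1 between the two levels of X2.

```python
import numpy as np

# Hypothetical data: X2 is a 0/1 group indicator, X1 is continuous.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0])
x2 = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 0.5 * x1 * x2

# The interaction enters the model as the product column x1 * x2.
design = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha_hat, beta1_hat, beta2_hat, beta3_hat = coef

# With an interaction, the slope of X1 depends on the level of X2.
slope_x1_when_x2_is_0 = beta1_hat
slope_x1_when_x2_is_1 = beta1_hat + beta3_hat

print("slope of X1 when X2 = 0:", round(slope_x1_when_x2_is_0, 6))
print("slope of X1 when X2 = 1:", round(slope_x1_when_x2_is_1, 6))
```

A nonzero coefficient on the product term means no single slope describes the effect of X1, which is why interaction is examined before confounding.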
Analysis with multiple regression indicated a statistically significant difference among the age levels (p=0.0211; Table 1), but not the group effects (p=0.1665). A significant interaction effect (p=0.0101) was also found between age and group. Hence, the age differences were distributed differently among groups.
From: Lang & Secic, How to report statistics in medicine. 2nd (2006)
Assess interaction first, then confounding.
Dummy variables (Rosner, pages 585-588)
- Dummy variables are indicator variables for the categories of a categorical variable.
- Example: dietary group (DIET) has three categories: SV, LV, and NOR.
- For k categories, generate (k-1) dummy variables. Choosing NOR as the reference group, make dummy variables for SV and LV:
  XSV = 1 if DIET = 'SV'; = 0 otherwise
  XLV = 1 if DIET = 'LV'; = 0 otherwise
ID   diet   XSV   XLV
1    LV     0     1
2    NOR    0     0
3    SV     1     0
4    SV     1     0
5    LV     0     1
6    NOR    0     0
7    SV     1     0

[Figure: regression output listing the terms Intercept, Dummy SV, and Dummy LV]
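The coding shown in the table can be generated mechanically. The sketch below builds the k-1 = 2 dummy variables for the seven DIET values listed above, with NOR as the reference group; the helper name `make_dummies` is illustrative.

```python
def make_dummies(values, reference):
    """Return one 0/1 indicator list per category, skipping the reference category."""
    levels = sorted(set(values) - {reference})
    return {
        f"X{level}": [1 if value == level else 0 for value in values]
        for level in levels
    }

# DIET values for IDs 1 to 7, as in the table.
diet = ["LV", "NOR", "SV", "SV", "LV", "NOR", "SV"]
dummies = make_dummies(diet, reference="NOR")

for name, column in dummies.items():
    print(name, column)
```

Subjects in the reference group have 0 on every dummy variable, so each dummy coefficient is read as a difference from NOR.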
Collinearity
- The X variables should be independent of one another.
- Avoid collinearity among the X variables.
- Check with the variance inflation factor (VIF) or a collinearity analysis.
From Kleinbaum, Kupper, Muller, Nizam: Applied Regression Analysis and Multivariable Methods. Duxbury, CA, USA, page 240
Condition number (CN) = the maximum of the condition indices
- Collinearity is present when CN >= 30
- Collinearity is present when VIF > 10
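A VIF can be computed without a dedicated package by regressing each X on the remaining X's and applying the standard definition VIF_j = 1 / (1 - R_j²). The sketch below does this with NumPy on two small hypothetical predictor sets; the function name `vif` and the data are assumptions for illustration, and the condition number is not computed here.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_square = 1.0 - ss_res / ss_tot
        factors.append(1.0 / (1.0 - r_square))
    return factors

# Hypothetical predictors: moderately correlated vs nearly collinear.
mild = np.column_stack([[1, 2, 3, 4, 5], [2, 1, 4, 3, 5]])
strong = np.column_stack([[1, 2, 3, 4, 5], [1.1, 1.9, 3.0, 4.1, 4.9]])

vif_mild = vif(mild)
vif_strong = vif(strong)
print("moderately correlated:", np.round(vif_mild, 2))
print("nearly collinear:     ", np.round(vif_strong, 2))
```

Only the nearly collinear pair exceeds the VIF > 10 threshold given above.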
Centering: centered variable = variable - mean
Scaling: scaled variable = variable / s, where s = 10, 100, or …
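One common situation where centering helps is a model that contains both a variable and its square. The sketch below, on hypothetical values 1 to 5, applies the two transformations above and compares the correlation between x and x² before and after centering; `pearson_r` is a small helper written for this example.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical values

# Centering: subtract the mean.
mean_x = sum(x) / len(x)
centered = [v - mean_x for v in x]

# Scaling: divide by a convenient constant s (here s = 10).
s = 10
scaled = [v / s for v in x]

# Correlation between the variable and its square, raw vs centered.
raw_r = pearson_r(x, [v ** 2 for v in x])
centered_r = pearson_r(centered, [v ** 2 for v in centered])

print(f"corr(x, x^2) before centering: {raw_r:.3f}")
print(f"corr(x, x^2) after centering:  {centered_r:.3f}")
```

Scaling by a constant does not change correlations; it only changes the units, and therefore the size, of the regression coefficient.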
Any questions?
Sources of quoted figures and text:
Rosner: Fundamentals of Biostatistics, 6th ed. Wadsworth Publishing Company.
KKM: Kleinbaum, Kupper, Muller, Nizam: Applied Regression Analysis and Multivariable Methods. Duxbury, CA, USA.
Lang & Secic: How to Report Statistics in Medicine. 2nd ed. (2006)
Pearson & Turton: Statistical Methods in Environmental Health. Chapman and Hall.