
LAB 8: Multiple Linear Regression (Chapter 11)

Multiple linear regression (MLR) uses more than one explanatory variable to predict the response variable. Our response variable is still represented by y, and we now have p predictor variables: x1, x2, ..., xp. The goal of MLR is to get a better prediction equation by using more explanatory variables. The statistical model for MLR is found on pg 609 of your textbook.

Objectives
1. Plots and preliminary analysis
2. Multiple linear regression in PROC REG
3. Check assumptions
4. MODEL statement options for confidence intervals and prediction intervals
5. "Missing y trick" to obtain confidence and prediction intervals

The Data
We will use the same data as we did for SLR. The data contains body measurements of 48 female sparrows. The variables are listed below.

Sparrows.txt
total     total length of the bird
head      length of beak and head
humerus   length of humerus

Use the INFILE statement to create the SAS data set:

data sparrows;
infile 'sparrows.txt' firstobs=2;
input total head humerus;
run;

Objective 1: Plots and Preliminary Analysis
In Lab 7 we used one explanatory variable at a time to try to predict the response variable. Here we will use head and humerus simultaneously to predict total. As with SLR, it is useful to have an understanding of the relationships within the data before starting analysis. We will use PROC GPLOT to create scatterplots for all paired combinations of the variables (i.e., total*head, total*humerus, head*humerus). Recall that we can use the SYMBOL statement to have more control over the display of our graph. Next we will use PROC CORR to obtain numerical values for the relationships between the variables.

symbol value=dot cv=blue;
proc gplot data=sparrows;
plot total*head;
plot total*humerus;
plot head*humerus;
run;
proc corr data=sparrows;
var total head humerus;
run;
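As a reminder (this is the standard formula, not something specific to this lab), the correlation that PROC CORR reports for each pair of variables is the Pearson correlation, and the accompanying p-value tests H0: ρ = 0 against the two-sided alternative:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

A value of r near +1 indicates a strong positive linear relationship, which is what the scatterplots suggest for each pair here.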
Results
We examined the first two scatterplots in Lab 7. There appears to be a positive linear relationship between head and total, as well as between humerus and total. We have not previously considered whether there is a relationship between head and humerus; there also appears to be a positive relationship between these two variables. Looking at the output from PROC CORR, we see that there is statistical evidence that each pair of variables has a positive linear relationship.

Objective 2: Multiple Linear Regression in PROC REG
In PROC REG we again use the MODEL statement to fit our regression. The code does not change much from the code for SLR; we are only adding an extra explanatory variable to our model. Therefore, we only add humerus to our PROC REG code.

proc reg data=sparrows;
model total=head humerus;
run;
quit;
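For reference, the model the MODEL statement above fits is the standard MLR model (the one referenced on pg 609); this is a restatement in the lab's notation, not additional SAS output:

```latex
% General MLR model with p predictors
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,
\qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)

% For this lab, p = 2:
\text{total}_i = \beta_0 + \beta_1\,\text{head}_i + \beta_2\,\text{humerus}_i + \varepsilon_i

% ANOVA F statistic; with p = 2 predictors and n = 48 sparrows:
F = \frac{\text{MS(Model)}}{\text{MS(Error)}},
\qquad df = (p,\; n-p-1) = (2,\; 45)
```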
Results
The output for MLR is very similar to that from SLR. The output for PROC REG starts with the ANOVA table, which includes the sources of variation, DF, SS, MS, the F-value, and its associated p-value. However, there are some differences between the ANOVA table for SLR and the ANOVA table for MLR. The DF for Model changed from 1 to 2 since we have increased the overall number of parameters from 2 to 3. Note that in MLR the DF for Model is (p+1)-1 = p, since we have (p+1) parameters: the intercept plus p coefficients for the p predictor variables.

The F-value for MLR is calculated in the same way as in SLR, by dividing the model mean square by the error mean square. However, it is not testing the same hypothesis as in SLR. Here the F-value is testing H0: β1 = β2 = 0 vs. HA: β1 ≠ 0 or β2 ≠ 0. In words, the F-test is testing whether or not both partial slopes are zero; in this case, whether either head or humerus is useful in predicting total.

The change in interpretation for r2 is similar: r2 is the fraction of variation in y explained by x1, x2, ..., xp.

Give an interpretation of r2 in the context of this example.

The Parameter Estimates section is also very similar to the output from SLR. The output includes estimates for the intercept parameter (β0), the partial slope parameter for head (β1), and the partial slope parameter for humerus (β2), along with the associated standard errors and t-values with their p-values. These t-values are not testing exactly the same hypotheses as in SLR: they now test the significance of each parameter given that all other predictors are in the model. For this model, the t-value 2.45 with p-value 0.018 tests whether the parameter for head, β1, is zero given that humerus is in the model.

The interpretations of the parameter estimates are also different. Now b1 is interpreted as the expected change in y for a one-unit increase in x1, holding x2, ..., xp fixed. Note that the parameter estimates in MLR are different from those obtained when we fit an SLR with each variable separately. Both of the partial slope estimates are smaller than they were in the SLR models.

Can you give a reason why this might be so?

Give an interpretation of b0, b1, and b2.

State the results of the ANOVA F-test and the individual t-tests.

Objective 3: Check Assumptions
Again, it is important to verify the assumptions for regression. The assumptions for the MLR model are as follows:
i. Linearity between y and x1, ..., xp
ii. Constant variance for the residuals
iii. Normality for the residuals

We have already examined the plots of total*head and total*humerus in Objective 1. While they appear to be linear, we can plot residuals*head and residuals*humerus to get a better look. In SLR we used residuals*x to examine the variance of the residuals; in MLR we plot the residuals against the predicted values of y to check for constant variance. For normality, we still use the normal quantile plot of the residuals. Recall that there are two ways to obtain the diagnostic plots. Here we will add code to our PROC REG.

proc reg data=sparrows;
model total=head humerus;
plot residual.*head; *checking for linearity;
plot residual.*humerus; *checking for linearity;
plot residual.*predicted.; *checking for constant variance;
plot nqq.*residual.; *checking for normality;
run;

Results
Do the assumptions appear to be met? Explain.

Objective 4: MODEL statement options for confidence intervals and prediction intervals
The options to obtain confidence intervals are the same as in SLR. The CLB option gives confidence intervals for our parameters β0, β1, and β2. The CLM and CLI options give confidence and prediction intervals for data points already in our data set. Notice that SAS uses a default significance level of α=0.05. We can change the significance level using the ALPHA= option. The code below uses an α=0.1 significance level.

proc reg data=sparrows alpha=0.1;
model total=head humerus/clb clm cli;
run;

Results
Notice that the labels for the intervals now read 90% CL. If we ran the above code using the default α=0.05, would we expect our intervals to be larger or smaller? Why?

Objective 5: "Missing y trick" to obtain confidence and prediction intervals
The "missing y trick" also works in MLR when we would like predictions for values not in our data set. Again, be aware that it is never a good idea to try to predict outside the range of values in the data set: we have no way of knowing whether the assumptions are met beyond the values we have. The code follows almost exactly that of SLR; we now also supply values for humerus in our missing data set. In the code below we create confidence and prediction intervals for a female sparrow with head length 30.2 and humerus length 17.7, and for a female sparrow with head length 32.4 and humerus length 19.6. We still have the ALPHA=0.1 option in PROC REG, so the intervals use a significance level of 0.10. See the SLR lab for code details.

data missing;
input total head humerus;
datalines;
. 30.2 17.7
. 32.4 19.6
;
data sparrows;
set sparrows missing;
run;
proc reg data=sparrows alpha=0.1;
model total=head humerus;
output out=out1 r=resid p=pred ucl=pihigh lcl=pilow uclm=cihigh
lclm=cilow stdp=stdmean;
run;
data intervals;
set out1;
if total=.;
run;
proc print data=intervals;
run;

Results
Interpret the confidence and prediction intervals given in the output. Why are there missing values for the residuals?
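When interpreting the output, it may help to recall the general forms of the two intervals (these are the standard formulas, with the standard-error terms as computed by SAS; here ŷ_h is the predicted value and μ̂_h the estimated mean response at the new x values):

```latex
% 100(1-\alpha)\% confidence interval for the mean response at x_h:
\hat{y}_h \pm t_{\alpha/2,\; n-p-1}\, \widehat{SE}(\hat{\mu}_h)

% 100(1-\alpha)\% prediction interval for a new observation at x_h:
\hat{y}_h \pm t_{\alpha/2,\; n-p-1}\, \sqrt{\widehat{SE}(\hat{\mu}_h)^2 + MSE}
```

The prediction interval is always wider because it accounts for both the uncertainty in estimating the mean response and the variation of an individual sparrow around that mean; with ALPHA=0.1 these are 90% intervals.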