Model Selection for Prestige Data Using AIC and BIC Model Selection Criteria
Nikita Shivdikar
Graduate Student
Department of Computer Science
Rochester Institute of Technology
[email protected]
Abstract: This report presents an analysis of the Prestige dataset. The analysis comprises a log
transformation of a predictor variable, model building, and model selection using criteria such as the
Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The final
model obtained is then used for prediction.
Keywords: Multiple Linear Regression, Akaike Information Criterion (AIC), Bayesian Information
Criterion (BIC), Prediction Error, Root Mean Squared Error.
1: Introduction
The Prestige dataset analysed in this report comes from the library ‘car’ in R. It has
102 observations and 6 attributes, where each observation is a Canadian occupation. One of the 6
attributes, ‘type’, which records the type of occupation, has been removed from the dataset for the
purposes of this linear regression analysis. The remaining attributes are ‘education’, the average
education of occupational incumbents; ‘income’, the average income of incumbents; ‘women’, the
percentage of incumbents who are women; ‘census’, the Canadian Census occupational code; and
‘prestige’, the response variable, which gives the prestige score of the occupation. In this project, I
applied a log transformation to the attribute ‘income’; the reason for this transformation is explained
later in the report. Using the transformed data, the goal is to find the multiple linear regression model
that best fits the Prestige data, i.e. the model containing the significant variables deemed the best
predictors of the response ‘prestige’. I used model selection criteria, namely the Akaike Information
Criterion (AIC) and the Bayesian Information Criterion (BIC), to select the best model. Finally, I used
the selected model for prediction and analysed how accurately it predicts the prestige score for new
observations.
2: Analysis of the Prestige dataset
With this dataset, the goal is to study the response variable ‘prestige’. The following histogram shows
the distribution of the response variable ‘prestige’.
The p-value of the normality test is less than alpha (0.05), so the response variable is not normally
distributed. However, we are far more concerned with the distribution of the noise term, which will be
tested later.
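The check above can be sketched as follows, assuming the ‘car’ package is installed; the report does not name the normality test it used, so `shapiro.test()` is shown here as a common default.

```r
# Distribution and normality check of the response 'prestige'.
# Assumes the 'car' package is installed; shapiro.test() is an
# assumption -- the report does not name the normality test it used.
library(car)
data(Prestige)

hist(Prestige$prestige, main = "Distribution of prestige", xlab = "prestige")
shapiro.test(Prestige$prestige)  # a p-value below 0.05 rejects normality
```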
Pairwise Scatter Plots:
The following diagram illustrates pairwise scatter plots of the four predictor variables and the
response variable.
R output:
The pairwise scatter plots show that the predictor variable ‘education’ exhibits a strong positive
linear relationship with the response variable ‘prestige’.
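A minimal sketch of these plots, assuming the ‘type’ column has already been dropped as described in the introduction:

```r
# Pairwise scatter plots of the four predictors and the response.
# Assumes the 'car' package is installed and 'type' has been dropped.
library(car)
dat <- Prestige[, c("education", "income", "women", "census", "prestige")]
pairs(dat, main = "Pairwise scatter plots of the Prestige data")
```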
Correlation Matrix:
To confirm what the scatter plots suggest, the following correlation matrix is generated.
The predictor variable ‘education’ indeed has a high correlation with the response variable ‘prestige’;
the second most highly correlated variable is ‘income’.
Cor (education, prestige) = 0.85
Cor (income, prestige) = 0.71
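The matrix can be computed as below; the rounding is only for readability, and the ‘car’ package is assumed to be installed.

```r
# Correlation matrix of the numeric variables; values rounded for display.
library(car)
dat <- Prestige[, c("education", "income", "women", "census", "prestige")]
round(cor(dat), 2)
```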
3: Model Fitting
First, I fitted the model with all of the variables and studied the significance of the model and of each
variable.
Fitting the Model with all variables:
R output:
From the summary obtained above, the p-value indicates that the model as a whole is significant. The
variables education and income are also significant, since their p-values are essentially 0 (smaller than
any reasonable alpha). However, the model also contains insignificant variables, women and census,
as their large p-values clearly show.
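The full fit above can be sketched as follows; `summary()` reports the overall F-test p-value and the per-variable t-test p-values discussed here.

```r
# Full model with all four predictors (assumes the 'car' package).
library(car)
full <- lm(prestige ~ education + income + women + census, data = Prestige)
summary(full)
```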
Plot of Residuals:
To check whether the model fits the data well, the plot of residuals is generated as shown below.
The plot of residuals versus fitted values shows that the noise has a structure/pattern. Hence the
model is not a good fit.
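A residuals-versus-fitted plot of the kind described can be produced as follows; visible curvature or pattern in this plot is what signals the poor fit.

```r
# Residuals vs fitted values for the full model (assumes 'car' package).
library(car)
full <- lm(prestige ~ education + income + women + census, data = Prestige)
plot(fitted(full), resid(full),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)  # reference line at zero
```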
Fitting the model with the two significant explanatory variables, ‘education’ and ‘income’:
Since the model above is clearly inadequate, I refit the model with only the significant variables and
examine its plot of residuals.
R output:
The model summary shows that the model is significant, as are the variables ‘education’ and
‘income’. We then study the plot of its residuals.
Plot of Residuals:
This plot again shows structure in the noise, so this model does not fit the data well either.
Fitting the above model without the intercept:
R output:
In this model, I eliminated the intercept, since in the previous model the intercept was not very
significant. As before, I study the plot of residuals.
Plot of Residuals
The plot shows that the noise term still has a structure, so we conclude that this model is also not good.
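Both reduced fits can be sketched in R as follows; in an `lm()` formula, appending `- 1` drops the intercept.

```r
# Reduced model and its intercept-free variant (assumes 'car' package).
library(car)
reduced <- lm(prestige ~ education + income, data = Prestige)
no_int  <- lm(prestige ~ education + income - 1, data = Prestige)  # '- 1' removes the intercept
summary(reduced)
summary(no_int)
```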
4: Log Transformation
Having studied the models above, we observe that none of them fit well. We therefore apply a
transformation to a predictor variable. Log transformations are among the most commonly used
transformations; they are also a convenient means of turning a highly skewed variable into one that is
approximately normal. In this project, I applied a log transformation to the predictor variable
‘income’. The following diagrams are histograms of the ‘income’ variable before and after the
transformation.
The transformed data is called logPrestige, and all further analysis is performed on it.
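The transformed data and the before/after histograms can be built as follows; the name `logPrestige` follows the report.

```r
# Build the transformed data set; only 'income' is log-transformed.
library(car)
logPrestige <- Prestige[, c("education", "income", "women", "census", "prestige")]
logPrestige$income <- log(logPrestige$income)

par(mfrow = c(1, 2))  # show the two histograms side by side
hist(Prestige$income,    main = "income",      xlab = "income")
hist(logPrestige$income, main = "log(income)", xlab = "log(income)")
```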
Fitting full model
R output:
In this model, the two variables education and income, as well as the intercept, are significant. I will
use the model selection criteria described in the next section to select the optimal model.
5: Model Selection Criteria - AIC & BIC on transformed data
Model selection using AIC
The following model was selected by AIC.
The model summary shows that AIC retained the insignificant variable ‘women’. A model containing
an insignificant variable is not preferable, so instead of proceeding to plot the residuals, we look at
the model selected by BIC.
Model selection using BIC
The following model was selected by BIC.
The model selected by BIC is indeed significant: the predictor variables as well as the intercept are
significant. We therefore test how well this model fits the data by studying its plot of residuals.
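One way to reproduce both selections is with `step()`, where the penalty `k = 2` gives AIC and `k = log(n)` gives BIC. This is a sketch, since the report does not show its selection code.

```r
# Stepwise selection on the transformed data (assumes 'car' package).
library(car)
logPrestige <- Prestige[, c("education", "income", "women", "census", "prestige")]
logPrestige$income <- log(logPrestige$income)

full_log <- lm(prestige ~ education + income + women + census, data = logPrestige)
n <- nrow(logPrestige)

aic_model <- step(full_log, direction = "both", k = 2,      trace = 0)  # AIC penalty
bic_model <- step(full_log, direction = "both", k = log(n), trace = 0)  # BIC penalty
summary(bic_model)
```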
Plot of Residuals
The plot of residuals versus fitted values is a purely random scatter, i.e. the noise no longer shows the
structure/pattern seen before. Hence the model is good: it fits the data very well, and we can use it for
further analysis.
Normality test of residuals:
The normality test of the residuals gives a large p-value, so we can conclude that the residuals are
normally distributed. The Q-Q plot shown above illustrates this.
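These diagnostics can be sketched as follows. The formula below assumes the BIC-selected model kept ‘education’ and the log-transformed ‘income’, and `shapiro.test()` is again an assumed choice of normality test.

```r
# Residual diagnostics for the (assumed) BIC-selected model.
library(car)
logPrestige <- Prestige[, c("education", "income", "prestige")]
logPrestige$income <- log(logPrestige$income)

bic_model <- lm(prestige ~ education + income, data = logPrestige)
shapiro.test(resid(bic_model))   # a large p-value supports normality
qqnorm(resid(bic_model))
qqline(resid(bic_model))
```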
6: Prediction using the BIC selected model
It is interesting to study how accurately the model predicts the prestige score for new observations.
For the purpose of prediction, I split the dataset into a 70% training set and a 30% test set. The model
is learned on the training data and evaluated on the test data. The following results were obtained.
Calculating Prediction Error:
Prediction Error e(i) = Actual Value a(i) − Predicted Value p(i)
Root Mean Squared Error = sqrt( (1/n) Σ (a(i) − p(i))² ) = 6.05309
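The split-and-score procedure can be sketched as below. The seed and the model formula are assumptions, so the RMSE from this sketch will not match exactly; the report's own split gave about 6.05.

```r
# 70/30 train-test split and RMSE, following the report's recipe.
library(car)
logPrestige <- Prestige[, c("education", "income", "prestige")]
logPrestige$income <- log(logPrestige$income)

set.seed(42)  # assumption: the report's split is not reproducible from the text
n         <- nrow(logPrestige)
train_idx <- sample(n, floor(0.7 * n))
train     <- logPrestige[train_idx, ]
test      <- logPrestige[-train_idx, ]

fit  <- lm(prestige ~ education + income, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$prestige - pred)^2))
rmse
```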
7: Conclusion
Starting from basic model fitting and proceeding through a log transformation and model selection
using AIC and BIC, one important conclusion is that, on this dataset, BIC performs better than AIC:
the BIC-selected model is the optimal model, and it fits the data well. The analysis also illustrates the
importance of log transformations when building models with linear regression; without the
transformation, it was not possible to obtain an optimal model, as this project clearly shows.