Model Selection for Prestige data using AIC and BIC model selection criteria

Nikita Shivdikar
Graduate Student
Department of Computer Science
Rochester Institute of Technology

Abstract: This report is an analysis of the Prestige dataset. The analysis comprises a log transformation of a predictor variable, model building, and model selection using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The final model obtained is then used for prediction.

Keywords: Multiple Linear Regression, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Prediction Error, Root Mean Squared Error.

1: Introduction

The Prestige dataset analysed in this report has been taken from the library ‘car’ in R. This dataset has 102 observations and 6 attributes. The observations are Canadian occupations. Of the 6 attributes, one attribute, ‘type’, which represents the type of occupation, has been eliminated from the dataset for the purpose of analysis using the Linear Regression technique. The other attributes are ‘education’, the average education of occupational incumbents; ‘income’, the average income of incumbents; ‘women’, the percentage of incumbents who are women; ‘census’, the Canadian Census occupational code; and lastly ‘prestige’, the response variable, which is the prestige score for the occupation.

In this project, I have applied a log transformation to the attribute ‘income’. The reason behind this transformation is explained later in this report. Using the transformed data, the goal is to find the Multiple Linear Regression model that best fits the Prestige data. This model consists of the significant variables that are deemed the best predictors of the response variable ‘prestige’. I have used the model selection criteria Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) to assist in selecting the best model.
Next, I have used the selected model for prediction and analysed how accurately it predicts the prestige score for new observations.

2: Analysis of the Prestige dataset

With this dataset, the goal is to study the response variable ‘prestige’. The following histogram gives the distribution of the response variable ‘prestige’.

The p-value of the normality test is less than alpha (0.05). Thus the response variable is not normally distributed. However, we are much more concerned with the distribution of the noise term, which will be tested later.

Pairwise Scatter Plots:
The following diagram illustrates pairwise scatter plots of the four predictor variables and the one response variable.

R output:

From the pairwise scatter plots, we can see that the predictor variable ‘education’ exhibits a strong positive linear relationship with the response variable ‘prestige’.

Correlation Matrix:
To confirm the result obtained from the scatter plots, the following correlation matrix is generated. The predictor variable ‘education’ indeed has a high correlation with the response variable ‘prestige’. The second most highly correlated variable is ‘income’.

Cor (education, prestige) = 0.85
Cor (income, prestige) = 0.71

3: Model Fitting

Firstly, I have fitted the model with all variables and studied the significance of the model and of the variables.

Fitting the Model with all variables:

R output:

From the summary obtained above, it can be inferred from its p-value that the model is significant. The variables ‘education’ and ‘income’ are also significant because their p-values are effectively 0 (less than any conventional alpha). But the model also contains the insignificant variables ‘women’ and ‘census’, as can be clearly seen from their large p-values.

Plot of Residuals:
To find out whether the model fits the data well, the plot of residuals is generated as shown below.

From the plot of residuals vs fitted values, it is seen that the noise has a structure/pattern. Hence the model does not fit the data well.
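To illustrate what a structured residual plot means in practice, here is a minimal, self-contained Python sketch (the analysis in this report was done in R). It does not use the Prestige data: it assumes synthetic, noise-free data in which the response is the logarithm of a hypothetical predictor x, and the helper names `fit_line` and `residuals` are made up for this illustration.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x with a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def residuals(x, y):
    """Residuals (observed minus fitted) of the least-squares line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Synthetic data, assumed for illustration only: y depends on log(x).
x = [float(i) for i in range(1, 21)]
y = [math.log(xi) for xi in x]

# Fit on the raw predictor: residuals are negative at both ends and
# positive in the middle, i.e. they show a curved pattern.
raw_res = residuals(x, y)

# Fit on the log-transformed predictor: the pattern disappears.
log_res = residuals([math.log(xi) for xi in x], y)

print([round(r, 2) for r in raw_res])
print(max(abs(r) for r in log_res))
```

A curved band of this kind in a residuals-vs-fitted plot suggests that a predictor enters the model on the wrong scale, which is the situation a transformation is meant to address.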
Fitting the model with two significant explanatory variables: ‘education’ and ‘income’:
It is quite clear that the above model is not good. Therefore I refit the model with only the significant variables and observe its plot of residuals.

R output:

From the model summary, it is seen that the model is significant, as are the variables ‘education’ and ‘income’. We then study the plot of its residuals.

Plot of Residuals:
From this plot, it is seen that the noise has a structure. Hence the model does not fit the data well.

Fitting the above model without the intercept:

R output:

In this model, I have eliminated the intercept, since in the previous model the intercept was not very significant. As before, I study the plot of residuals.

Plot of Residuals:
From the plot, we can conclude that the model is not good. The noise term has a structure.

4: Log Transformation

From the models studied above, it is observed that none of them fits the data well. Thus we need to apply a transformation to a predictor variable. Log transformations are among the most commonly used transformations. They are also a convenient means of transforming a highly skewed variable into one that is more approximately normal. In this project, I have applied a log transformation to the predictor variable ‘income’. The following diagrams are the histograms of the ‘income’ variable before and after transformation.

The transformed data is called logPrestige. Further analysis is done on this transformed data.

Fitting full model

R output:

We have two significant variables (education and income) as well as a significant intercept in this model. I will use the model selection criteria described in the next section to select the optimal model.

5: Model Selection Criteria - AIC & BIC on transformed data

Model selection using AIC
The following model was selected by AIC.

From the model summary, it is observed that AIC has selected the insignificant variable ‘women’.
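One way to see why AIC can retain a weak predictor such as ‘women’ is to compare the penalty each criterion places on an additional parameter. The following is a minimal Python sketch (the analysis in this report was done in R), assuming the usual Gaussian linear-model forms AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), each up to an additive constant shared by all models fitted to the same data. Only n = 102 comes from the dataset; the residual sums of squares are hypothetical numbers chosen for illustration, not values from the Prestige fit.

```python
import math

def aic(rss, n, k):
    """AIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k

def bic(rss, n, k):
    """BIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + k * math.log(n)

n = 102  # number of occupations in the Prestige data

# Hypothetical residual sums of squares (NOT taken from the report).
rss_small = 5000.0  # smaller model, k = 3 coefficients
rss_large = 4880.0  # same model plus one weak predictor, k = 4

# Lower is better for both criteria.
aic_prefers_large = aic(rss_large, n, 4) < aic(rss_small, n, 3)
bic_prefers_large = bic(rss_large, n, 4) < bic(rss_small, n, 3)

print(aic_prefers_large, bic_prefers_large)  # True False
```

With n = 102, each extra parameter costs 2 under AIC but ln(102), roughly 4.6, under BIC, so a small reduction in RSS can be enough for AIC to keep a variable that BIC drops. In R, this difference is typically controlled through the `k` argument of `step()`, where the default `k = 2` corresponds to AIC and `k = log(n)` to BIC.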
It is not preferable to have a model containing an insignificant variable. Therefore, instead of moving ahead to plot the residuals, we would like to see the model selected by BIC.

Model selection using BIC
The following model was selected by BIC.

Thus it can be inferred that the model selected by BIC is indeed significant. The predictor variables as well as the intercept are significant. Therefore, we now go ahead and test how well the model fits the data by studying the plot of residuals.

Plot of Residuals:
From the plot of residuals vs fitted values, it is observed that the residuals form a pure random scatter, i.e. the noise does not show a structure/pattern as before. Hence the model fits the data well, and we can use it for our further analysis.

Normality test of residuals:
From the normality test of residuals, we obtain a large p-value. Hence we cannot reject normality, and we conclude that the residuals are consistent with a normal distribution. The Q-Q plot shown above supports this conclusion.

6: Prediction using the BIC selected model

It is quite interesting to study how accurately the model predicts the prestige score for new observations. For the purpose of prediction, I have divided the dataset into a 70% training set and a 30% test set. The model is learned on the training data and tested on the test data. The following results were obtained.

Calculating Prediction Error:
Prediction Error = Actual Value a(i) − Predicted Value p(i)
Root Mean Squared Error = $\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a(i) - p(i)\right)^{2}}$ = 6.05309, where n is the number of observations in the test set.

7: Conclusion

Thus, starting with basic model fitting and proceeding with log transformation and model selection using AIC and BIC, one important conclusion that can be drawn is that, for this dataset, BIC performs better than AIC: the BIC-selected model contains only significant terms, whereas the AIC-selected model retained the insignificant variable ‘women’. The BIC-selected model is taken as the optimal model, and it fits the data well. One can also learn the importance of log transformations of the data while building models using the Linear Regression technique.
Without transformation, it was not possible to get an optimal model, as we can clearly see from this project.

References:
1: Ernest Fokoue, Principles of Statistical Mining Lecture Notes [Basic]: Basic Statistics with R.
2: Ernest Fokoue, Principles of Statistical Mining Lecture Notes [Intro Regression]: Introduction to Regression with R.