Application of Data Mining Techniques like Linear Discriminant Analysis (LDA), K-Means Clustering, Multiple Linear Regression, Principal Component Analysis (PCA) and Logistic Regression on Datasets

Ankit Ghosalkar, Nikita Shivdikar, Pallavi Tilak and Rohan Dalvi

Abstract—Data mining techniques are used for a wide variety of purposes. Techniques such as classification, association and clustering are widely used for finding interesting patterns in data. In this project, we have implemented data mining techniques, namely Logistic Regression, Linear Discriminant Analysis (LDA), K-Means Clustering, Principal Component Analysis (PCA) and Multiple Linear Regression, on three datasets: the Wine Quality dataset, the Spambase dataset and the Communities and Crimes dataset. We have evaluated the performance of these techniques on separate training and test sets. The experimental results were analysed to determine the drawbacks of each implemented technique on its dataset.

Index Terms—Data Mining, Classification, Clustering, Logistic Regression, Linear Discriminant Analysis.

I. Introduction

The amount of data generated from various sources is increasing by leaps and bounds. This data is of no use if it does not reflect any useful information, so there is a need to extract knowledge from it. This need led to the evolution of the concept of data mining. Data mining refers to extracting knowledge, or finding interesting patterns, in data. The knowledge obtained is stored in a knowledge base for future use and is utilized by business analysts for decision making. Traditionally, the task of data mining was complex and computationally expensive: data analysts had to manually dredge through the data to find interesting patterns. With small datasets the task was simple enough, but with larger datasets it was very time consuming, and experts were required to mine such huge datasets and extract useful knowledge. Thus arose the need for new data mining techniques that are more efficient and automated. Techniques such as classification, clustering, association and regression have made the task of data mining much simpler and more efficient.

Classification is used for supervised learning, where the class labels are known and we have to predict the class of new instances. Various algorithms carry out the task of classification, namely 1R, Naive Bayes, decision trees, etc., each with its own advantages and disadvantages. Clustering is used for unsupervised learning, where we do not know the class labels in advance and have to group together instances that are similar to each other, so that at the end of clustering we obtain distinct clusters. The algorithms used for clustering include K-Means clustering, hierarchical clustering, spectral clustering, etc. Again, depending on the size and characteristics of the data being analysed, we have to choose an appropriate algorithm. We can also use association to discover relevant rules from a dataset. Before applying any of these algorithms, we need to process the raw data: data collected from various sources might contain missing values, or its format might not be proper. This constitutes the data pre-processing stage, in which the data is cleaned (outliers are detected) and prepared for mining.
After implementing data mining techniques on the data, we evaluate the performance of the models built and analyse the results of the experiments, that is, we use the built model to test new instances and measure the accuracy and error rate of the classifiers. This gives insight into how well a particular classifier performs on a particular dataset and what measures need to be taken to tune its parameters. It also tells us which classifier would perform better on a given dataset. Such experimental results serve many purposes; for example, advertising companies can analyse customer data and purchasing trends for targeted marketing. Other domains in which data mining is widely used are banking, sports, the mobile industry, networking, education, business and government.

In this project, we have mainly focussed on classification, namely binary classification using Logistic Regression and Linear Discriminant Analysis, along with clustering, Principal Component Analysis and Multiple Linear Regression. We obtained our datasets from the UCI Machine Learning Repository. The description of these datasets is as follows.

Wine Quality dataset: Wine is increasingly enjoyed by a large number of customers, so wine industries are inventing new strategies to increase the production and sale of wine. A key aspect to be taken into consideration during wine production is the quality of the wine. Evaluating the quality of wine is an important task for wine industries because they have to ensure that the wine produced is unadulterated and safe for human health. Wine is assessed by various physicochemical tests, which include determining its density, alcohol content and pH value, and by sensory tests conducted by human experts. This dataset is of particular interest because it holds valuable information with respect to wine quality assessment. It contains 4898 instances/samples of white wine from the north of Portugal, tested for quality. The first 11 attributes describe the various parameters of wine assessment, and the 12th attribute gives the wine quality on a scale from 0 (very bad) to 10 (excellent). We have used the Linear Discriminant Analysis (LDA) technique for the classification of wine samples, and we have also used Principal Component Analysis (PCA) and K-Means clustering on this dataset. Applying data mining techniques to this raw data helps extract useful knowledge that wine businesses can put to good use. Evaluating quality improves decision making, such as identifying the factors that are most relevant, setting prices accordingly and gaining customer satisfaction. It would also be interesting to know how this knowledge could help control several parameters during the wine production process to improve quality and customer satisfaction, for example increasing or decreasing the concentration of residual sugar, alcohol or other attributes, and measuring how the concentration of each ingredient affects the popularity of the wine.

Spambase dataset: The number of spam emails an internet user receives every day has increased tremendously. This dataset consists of spam and non-spam emails collected from various sources. This multivariate dataset has 4601 instances and 57 attributes; the attribute characteristics are integer or real.
The descriptions of the various predictor variables are: word_freq_WORD - frequency of occurrences of WORD, where a word is any string of alphanumeric characters bounded by non-alphanumeric characters; char_freq_CHAR - number of occurrences of CHAR relative to the total number of characters in the email; capital_run_length_average - average length of uninterrupted sequences of capital letters; capital_run_length_longest - length of the longest uninterrupted sequence of capital letters; capital_run_length_total - total number of capital letters in the email; Class - 1 if the email is spam and 0 if it is non-spam. For classification we have used Logistic Regression, a well-known binary classification technique for the case where there are only two class labels. The application of data mining to the Spambase dataset can be used in the design of spam filters.

Communities and Crimes dataset: This data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. The dataset has 1994 instances with 128 attributes and contains a large number of missing values. It is a multivariate dataset, and the attribute characteristics are real. It contains attributes that help predict crime or have some connection with crime. We have implemented the Multiple Linear Regression technique on this dataset to build a model that correctly predicts the value of the response variable, that is, the total number of violent crimes per 100K population.
The information on the various attributes of this dataset is as follows [3]:
state: US state (by number)
county: numeric code for county - contains many missing values (numeric)
community: numeric code for community - not predictive and many missing values (numeric)
communityname: community name - not predictive, for information only (string)
fold: fold number for non-random 10-fold cross validation, potentially useful for debugging, paired tests - not predictive (numeric)
population: population for community (numeric - decimal)
householdsize: mean people per household (numeric - decimal)
racepctblack: percentage of population that is african american (numeric - decimal)
racePctWhite: percentage of population that is caucasian (numeric - decimal)
racePctAsian: percentage of population that is of asian heritage (numeric - decimal)
racePctHisp: percentage of population that is of hispanic heritage (numeric - decimal)
agePct12t21: percentage of population that is 12-21 in age (numeric - decimal)
agePct12t29: percentage of population that is 12-29 in age (numeric - decimal)
agePct16t24: percentage of population that is 16-24 in age (numeric - decimal)
agePct65up: percentage of population that is 65 and over in age (numeric - decimal)
numbUrban: number of people living in areas classified as urban (numeric - decimal)
pctUrban: percentage of people living in areas classified as urban (numeric - decimal)
medIncome: median household income (numeric - decimal)
pctWWage: percentage of households with wage or salary income in 1989 (numeric - decimal)
pctWFarmSelf: percentage of households with farm or self employment income in 1989 (numeric - decimal)
pctWInvInc: percentage of households with investment / rent income in 1989 (numeric - decimal)
pctWSocSec: percentage of households with social security income in 1989 (numeric - decimal)
pctWPubAsst: percentage of households with public assistance income in 1989 (numeric - decimal)
pctWRetire: percentage of households with retirement income in 1989 (numeric - decimal)
medFamInc: median family income (differs from household income for non-family households) (numeric - decimal)
perCapInc: per capita income (numeric - decimal)
whitePerCap: per capita income for caucasians (numeric - decimal)
blackPerCap: per capita income for african americans (numeric - decimal)
indianPerCap: per capita income for native americans (numeric - decimal)
AsianPerCap: per capita income for people with asian heritage (numeric - decimal)
OtherPerCap: per capita income for people with 'other' heritage (numeric - decimal)
HispPerCap: per capita income for people with hispanic heritage (numeric - decimal)
NumUnderPov: number of people under the poverty level (numeric - decimal)
PctPopUnderPov: percentage of people under the poverty level (numeric - decimal)
PctLess9thGrade: percentage of people 25 and over with less than a 9th grade education (numeric - decimal)
PctNotHSGrad: percentage of people 25 and over that are not high school graduates (numeric - decimal)
PctBSorMore: percentage of people 25 and over with a bachelors degree or higher education (numeric - decimal)
PctUnemployed: percentage of people 16 and over, in the labor force, and unemployed (numeric - decimal)
PctEmploy: percentage of people 16 and over who are employed (numeric - decimal)
PctEmplManu: percentage of people 16 and over who are employed in manufacturing (numeric - decimal)
PctEmplProfServ: percentage of people 16 and over who are employed in professional services (numeric - decimal)
PctOccupManu: percentage of people 16 and over who are employed in manufacturing (numeric - decimal)
PctOccupMgmtProf: percentage of people 16 and over who are employed in management or professional occupations (numeric - decimal)
MalePctDivorce: percentage of males who are divorced (numeric - decimal)
MalePctNevMarr: percentage of males who have never married (numeric - decimal)
FemalePctDiv: percentage of females who are divorced (numeric - decimal)
TotalPctDiv: percentage of population who are divorced (numeric - decimal)
PersPerFam: mean number of people per family (numeric - decimal)
PctFam2Par: percentage of families (with kids) that are headed by two parents (numeric - decimal)
PctKids2Par: percentage of kids in family housing with two parents (numeric - decimal)
PctYoungKids2Par: percent of kids 4 and under in two parent households (numeric - decimal)
PctTeen2Par: percent of kids age 12-17 in two parent households (numeric - decimal)
PctWorkMomYoungKids: percentage of moms of kids 6 and under in labor force (numeric - decimal)
PctWorkMom: percentage of moms of kids under 18 in labor force (numeric - decimal)
NumIlleg: number of kids born to never married (numeric - decimal)
PctIlleg: percentage of kids born to never married (numeric - decimal)
NumImmig: total number of people known to be foreign born (numeric - decimal)
PctImmigRecent: percentage of immigrants who immigrated within last 3 years (numeric - decimal)
PctImmigRec5: percentage of immigrants who immigrated within last 5 years (numeric - decimal)
PctImmigRec8: percentage of immigrants who immigrated within last 8 years (numeric - decimal)
PctImmigRec10: percentage of immigrants who immigrated within last 10 years (numeric - decimal)
PctRecentImmig: percent of population who have immigrated within the last 3 years (numeric - decimal)
PctRecImmig5: percent of population who have immigrated within the last 5 years (numeric - decimal)
PctRecImmig8: percent of population who have immigrated within the last 8 years (numeric - decimal)
PctRecImmig10: percent of population who have immigrated within the last 10 years (numeric - decimal)
PctSpeakEnglOnly: percent of people who speak only English (numeric - decimal)
PctNotSpeakEnglWell: percent of people who do not speak English well (numeric - decimal)
PctLargHouseFam: percent of family households that are large (6 or more) (numeric - decimal)
PctLargHouseOccup: percent of all occupied households that are large (6 or more people) (numeric - decimal)
PersPerOccupHous: mean persons per household (numeric - decimal)
PersPerOwnOccHous: mean persons per owner occupied household (numeric - decimal)
PersPerRentOccHous: mean persons per rental household (numeric - decimal)
PctPersOwnOccup: percent of people in owner occupied households (numeric - decimal)
PctPersDenseHous: percent of persons in dense housing (more than 1 person per room) (numeric - decimal)
PctHousLess3BR: percent of housing units with less than 3 bedrooms (numeric - decimal)
MedNumBR: median number of bedrooms (numeric - decimal)
HousVacant: number of vacant households (numeric - decimal)
PctHousOccup: percent of housing occupied (numeric - decimal)
PctHousOwnOcc: percent of households owner occupied (numeric - decimal)
PctVacantBoarded: percent of vacant housing that is boarded up (numeric - decimal)
PctVacMore6Mos: percent of vacant housing that has been vacant more than 6 months (numeric - decimal)
MedYrHousBuilt: median year housing units built (numeric - decimal)
PctHousNoPhone: percent of occupied housing units without phone (in 1990, this was rare!) (numeric - decimal)
PctWOFullPlumb: percent of housing without complete plumbing facilities (numeric - decimal)
OwnOccLowQuart: owner occupied housing - lower quartile value (numeric - decimal)
OwnOccMedVal: owner occupied housing - median value (numeric - decimal)
OwnOccHiQuart: owner occupied housing - upper quartile value (numeric - decimal)
RentLowQ: rental housing - lower quartile rent (numeric - decimal)
RentMedian: rental housing - median rent (Census variable H32B from file STF1A) (numeric - decimal)
RentHighQ: rental housing - upper quartile rent (numeric - decimal)
MedRent: median gross rent (Census variable H43A from file STF3A - includes utilities) (numeric - decimal)
MedRentPctHousInc: median gross rent as a percentage of household income (numeric - decimal)
MedOwnCostPctInc: median owners cost as a percentage of household income - for owners with a mortgage (numeric - decimal)
MedOwnCostPctIncNoMtg: median owners cost as a percentage of household income - for owners without a mortgage (numeric - decimal)
NumInShelters: number of people in homeless shelters (numeric - decimal)
NumStreet: number of homeless people counted in the street (numeric - decimal)
PctForeignBorn: percent of people foreign born (numeric - decimal)
PctBornSameState: percent of people born in the same state as currently living (numeric - decimal)
PctSameHouse85: percent of people living in the same house as in 1985 (5 years before) (numeric - decimal)
PctSameCity85: percent of people living in the same city as in 1985 (5 years before) (numeric - decimal)
PctSameState85: percent of people living in the same state as in 1985 (5 years before) (numeric - decimal)
LemasSwornFT: number of sworn full time police officers (numeric - decimal)
LemasSwFTPerPop: sworn full time police officers per 100K population (numeric - decimal)
LemasSwFTFieldOps: number of sworn full time police officers in field operations (on the street as opposed to administrative etc.) (numeric - decimal)
LemasSwFTFieldPerPop: sworn full time police officers in field operations (on the street as opposed to administrative etc.) per 100K population (numeric - decimal)
LemasTotalReq: total requests for police (numeric - decimal)
LemasTotReqPerPop: total requests for police per 100K population (numeric - decimal)
PolicReqPerOffic: total requests for police per police officer (numeric - decimal)
PolicPerPop: police officers per 100K population (numeric - decimal)
RacialMatchCommPol: a measure of the racial match between the community and the police force; high values indicate that the proportions in the community and the police force are similar (numeric - decimal)
PctPolicWhite: percent of police that are caucasian (numeric - decimal)
PctPolicBlack: percent of police that are african american (numeric - decimal)
PctPolicHisp: percent of police that are hispanic (numeric - decimal)
PctPolicAsian: percent of police that are asian (numeric - decimal)
PctPolicMinor: percent of police that are minority of any kind (numeric - decimal)
OfficAssgnDrugUnits: number of officers assigned to special drug units (numeric - decimal)
NumKindsDrugsSeiz: number of different kinds of drugs seized (numeric - decimal)
PolicAveOTWorked: police average overtime worked (numeric - decimal)
LandArea: land area in square miles (numeric - decimal)
PopDens: population density in persons per square mile (numeric - decimal)
PctUsePubTrans: percent of people using public transit for commuting (numeric - decimal)
PolicCars: number of police cars (numeric - decimal)
PolicOperBudg: police operating budget (numeric - decimal)
LemasPctPolicOnPatr: percent of sworn full time police officers on patrol (numeric - decimal)
LemasGangUnitDeploy: gang unit deployed (numeric - decimal, but really ordinal: 0 means no, 1 means yes, 0.5 means part time)
LemasPctOfficDrugUn: percent of officers assigned to drug units (numeric - decimal)
PolicBudgPerPop: police operating budget per population (numeric - decimal)
ViolentCrimesPerPop: total number of violent crimes per 100K population (numeric - decimal) - the response variable

The tools that we have used for data mining are WEKA, Matlab and R. These tools make the task of data mining much easier, faster and more efficient. The results generated were numeric values, confusion matrices, or graphs and plots, which were analysed and evaluated for performance and accuracy as well as compared with other approaches.

The rest of the article is organized as follows. Section II describes data collection and pre-processing. Section III describes in detail the techniques used for data mining on the Wine Quality, Spambase and Communities and Crimes datasets, as well as the results of the experiments performed on them. Section IV compares these techniques with respect to various performance measures. Section V concludes the paper and gives directions for future work.

II. Data Collection and Pre-Processing

The Wine Quality, Spambase and Communities and Crimes datasets were obtained from the UCI Machine Learning Repository. The Spambase dataset was in the expected format and did not contain any missing values, so we did not carry out any pre-processing on it. The Wine Quality dataset, however, was not in the proper format: all the values of each instance were held in a single cell of the Excel spreadsheet, separated by semicolons, which made the file hard to read. To bring it into a proper readable format, we wrote the following R code, which copies the semicolon-separated values into a new comma-separated file.
#R code
wine <- read.table('winequality-white.csv', sep = ';', header = TRUE)
write.table(wine, 'winequality-whitewine.csv', sep = ',')

Further, the Wine Quality dataset did not contain any missing values, so we could now use this cleaned dataset for data mining. The Communities and Crimes dataset had many missing values across many attributes. Since we are performing Multiple Linear Regression on this dataset, we eliminated the nominal attribute county, and we also eliminated every attribute with more than 85% of its values missing. The eliminated attributes were county, community, communityname, LemasSwornFT, LemasSwFTPerPop, LemasSwFTFieldOps, LemasSwTotalReq, LemasTotReqPerPop, PolicReqPerOffic, PolicePerPop, RacialMatchCommPol, PctPolicWhite, PctPoliceBlack, PctPoliceHisp, PctPoliceAsian, PoliceCars, PoliceOperBudg, LemasPctOnPatr, LemasGangUnitDeploy and PoliceBudgPerPop. Finally, all the datasets were cleaned and ready to use for data mining.

III. Application of Data Mining Techniques on the Wine Quality, Spambase and Communities and Crimes Datasets

In this section, we describe the analyses that were carried out, along with our findings.

A. Logistic Regression on Spambase Dataset

Logistic Regression is a binary classification technique. We use this technique when the response variable is binary (i.e. 0 or 1) and we have a collection of real-valued explanatory variables; given a vector X, the task is to predict its class as accurately as possible. The response variable is related to the predictor variables through a relationship of the form

log(p / (1 − p)) = β0 + β1x1 + β2x2 + ... + βkxk,

where p is the probability that the instance belongs to class 1.

Analysis of the Spambase dataset. Building the model: applying the Logistic Regression technique using R, we first fit the full model, that is, the model that includes all the explanatory variables of the Spambase dataset. The function glm in R is used for model fitting in Logistic Regression. We use the following commands:

#R code
spam <- read.csv('spambase.csv', header = T)
glm.spam <- glm(Class ~ ., data = spam, family = binomial('logit'))
summary(glm.spam)

The R output after fitting the full model is shown in Fig. 1. After studying this model, we observed that it is significant, since its residual deviance is 1815.8. However, the model contains many noise (insignificant) variables, as can be inferred from their large p-values. Hence we need a model that contains only significant variables. For this, we use a technique called variable selection, in which only the significant variables are retained in the model while the rest are eliminated. The following R code selects the variables that have high correlation with the response variable:

#R code
XY <- spam
p <- ncol(XY)
good.var <- c()
for (i in 1:(p - 1)) {
  if (abs(cor(XY[, i], XY[, p])) >= 0.30)
    if (cor.test(XY[, i], XY[, p])$p.value < 0.05)
      good.var <- c(good.var, colnames(XY)[i])
}

This step deemed four variables significant, namely word_freq_remove, word_freq_your, word_freq_000 and char_freq_..4. Hence we rebuild the model using these variables:

#R code
glm.newmodel <- glm(Class ~ word_freq_remove + word_freq_your + word_freq_000 + char_freq_..4, data = spam, family = binomial('logit'))
summary(glm.newmodel)

The summary of the refitted model reveals that it is significant, since its residual deviance is less than its degrees of freedom. Also, all the variables are significant, since their p-values are close to zero. Hence we can use this model for our further analysis.

Prediction with Logistic Regression: the model obtained above can now be used for prediction. We first predict the log-odds ratio for each instance, and then the probability of the class (0 for non-spam, 1 for spam) of each instance. We have used the following R commands for our predictions:

#R code
logs.odds.ratio <- predict(glm.newmodel, spam[, -57])
probabilities <- predict(glm.newmodel, spam[, -57], type = 'response')
predicted <- rep(0, 4601)
positive <- which(probabilities > 0.5)
predicted[positive] <- 1

Performance of Logistic Regression on the Spambase dataset: we have obtained the confusion matrix of actual versus predicted values.

#R code
Actual <- spam$Class
confmat <- table(Actual, predicted)
confmat

From this confusion matrix, we have computed various parameters. True Positives (TP): 1042 emails were correctly predicted as spam. False Positives (FP): 137 emails predicted as spam were actually non-spam. True Negatives (TN): 2651 emails predicted as non-spam were non-spam. False Negatives (FN): 771 emails predicted as non-spam were actually spam.
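These quantities, and the metrics reported below, can be computed directly from the confmat object built above. The following minimal sketch is our addition, assuming the 0/1 row and column labels that table() produces for this data:

#R code
TP <- confmat["1", "1"]  # spam correctly predicted as spam
FP <- confmat["0", "1"]  # non-spam wrongly predicted as spam
TN <- confmat["0", "0"]  # non-spam correctly predicted as non-spam
FN <- confmat["1", "0"]  # spam wrongly predicted as non-spam
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f.measure <- 2 * precision * recall / (precision + recall)
pcc <- (TP + TN) / sum(confmat)  # percentage of correct classification
error.rate <- 1 - pcc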
Precision = TP / (TP + FP) = 0.8837 = 88.37%
Recall = TP / (TP + FN) = 0.5747 = 57.47%
F-measure = 2 × (Precision × Recall) / (Precision + Recall) = 0.6964
Percentage of correct classification (PCC) = (TN + TP) / number of instances = 80.27%
Error rate (misclassification rate) = 1 − PCC = 19.73%

ROC (Receiver Operating Characteristic) curve: the ROC curve is widely used to measure the quality of a classification technique. It plots the true positive rate against the false positive rate. The plot obtained for the new model illustrates the performance of the Spambase classifier.
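Such a curve can be produced in R with the ROCR package; the sketch below is our illustration (the original plot may have been generated differently), reusing the probabilities vector computed earlier:

#R code
library(ROCR)
pred <- prediction(probabilities, spam$Class)  # predicted scores and true labels
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)        # ROC curve: true positive rate vs. false positive rate
abline(0, 1, lty = 2)  # reference line for a random classifier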
B. Linear Discriminant Analysis (LDA) on Wine Quality Dataset

LDA is similar to Principal Component Analysis; the difference is that PCA is oriented towards describing the attributes, whereas LDA is oriented towards classifying the data. Like PCA, LDA performs dimensionality reduction while preserving much of the information. The models obtained using LDA sometimes show higher accuracy than more complex models. The LDA technique is used for classification: it finds a discriminant function of the predictors X and Y that results in a new set of transformed values giving more accurate discrimination than any single predictor alone. It tries to find the directions along which the classes are best separated, considering not only the scatter within the classes but also the scatter between the classes.

The LDA method:
1: Let N be the number of classes.
2: Let μi be the mean vector of class i, i = 1, 2, ..., N.
3: Let ni be the number of samples within class i, i = 1, 2, ..., N.
4: Let L be the total number of samples.

We compute the scatter matrix within the classes using the formula

Sw = Σ_{i=1..N} Σ_{x ∈ class i} (x − μi)(x − μi)^T,

and the scatter matrix between the classes as

Sb = Σ_{i=1..N} ni (μi − μ)(μi − μ)^T,

where μ = (1/L) Σ x is the mean of the entire dataset. LDA tries to minimize the scatter within the classes while maximizing the scatter between the classes, reducing the variation due to other sources while retaining class separability:

maximize det(Sb) / det(Sw).

Linear transformation: the linear transformation is given by the matrix U whose columns are the eigenvectors of inv(Sw)Sb. There are at most N − 1 generalized eigenvectors, given by Sb u = λ Sw u; if Sw is non-singular, the eigenvalues are obtained from inv(Sw)Sb u = λ u.

This LDA technique is implemented on the Wine Quality dataset to classify the instances based on wine quality. The library MASS in R is used to perform LDA:

#R code
library(MASS)
wine.lda <- lda(quality ~ ., data = wine)
wine.lda

We select the first two significant discriminant functions, as shown in the R output. From this output, we can infer that the first discriminant function is the following linear combination of the variables:

LD1 = (1.864)fixed.acidity − (4.755)volatile.acidity − (7.046)citric.acid + (1.89)residual.sugar − (5.294)chlorides + (1.0608)free.sulfur.dioxide − (1.229)total.sulfur.dioxide − (3.445)density + (1.698)pH + (1.61)sulphates + (5.3604)alcohol.

Next, we calculate the values of the first discriminant function for each instance in the dataset:

#R code
wine.lda.vals <- predict(wine.lda, wine[1:11])
wine.lda.vals$x[, 1]

The R output shows these values for the first few instances. Also, from the proportion of trace, we get the percentage of separation between the groups achieved by each discriminant function: the first discriminant function achieves a separation of 83.12%, while the second achieves 11.83%. Therefore, to achieve a good separation between the groups, we need to use both discriminant functions.

Results of Linear Discriminant Analysis: the results of applying LDA on the Wine Quality dataset are shown by the stacked histogram of the values of the discriminant function for the instances of the different groups. We use the function ldahist() in R to make a stacked histogram of the values of the first discriminant function.

Prediction using LDA: to understand the accuracy of the discriminator in classification, we have taken two new instances and used LDA to classify them.

#R code
disc <- lda(quality ~ ., data = testwine)
predict(disc)
predict(disc)$class

newInstance1: fixed.acidity=7.1, volatile.acidity=0.24, citric.acid=0.41, residual.sugar=17.8, chlorides=0.046, free.sulfur.dioxide=39, total.sulfur.dioxide=145, density=0.9998, pH=3.32, sulphates=0.39, alcohol=8.7. Calling predict(disc, newInstance1) gives predicted class = 5; the actual class is 5.

newInstance2: fixed.acidity=8.1, volatile.acidity=0.27, citric.acid=0.41, residual.sugar=1.45, chlorides=0.033, free.sulfur.dioxide=11, total.sulfur.dioxide=63, density=0.9908, pH=2.99, sulphates=0.56, alcohol=12. The predicted class is 5; the actual class is 4.

Cross-validation using LDA: the classifier is trained on part of the data and used to predict the rest.

#R code
trainSet <- sample(1:2020, 1010)
table(wine$quality[trainSet])
classifier <- lda(quality ~ ., data = wine, subset = trainSet)
Predicted <- predict(classifier, wine[-trainSet, ])$class
Actual <- wine$quality[-trainSet]
table(Actual, Predicted)

The confusion matrix obtained shows the number of instances that were predicted correctly and the number that were misclassified. Thus the LDA technique is widely used for classification, especially when there are many classes.

C. K-Means Clustering on Wine Quality Dataset

Clustering is defined as the grouping of similar things together. It is often confused with classification; clustering is an unsupervised exploration procedure, whereas classification is supervised and used for prediction. K-Means is an unsupervised clustering algorithm: it assigns the data to one of k clusters, where k is decided a priori. The first step is to choose the k initial centroids. The next step is to iteratively associate each value of the dataset with the nearest centroid by computing distances. The main objective of clustering here is to optimise the squared-error objective function

J = Σ_{j=1..k} Σ_{xi ∈ Cj} ||xi − cj||²,

where cj is the centroid of cluster Cj. Clustering can also be employed for detecting oddities in the data, for example fraudulent transactions or scams. We have used the K-Means clustering technique on the Wine Quality dataset to cluster the wine samples into seven clusters representing the quality levels of the wine samples (ranging from 3 to 9). The following is the pseudo code of the K-Means clustering algorithm:

1: assign each tuple to a randomly chosen cluster
2: calculate the centroid for each cluster
3: loop until no new centroid is obtained
4:   assign each instance to the closest cluster
5:   (the cluster with the centroid closest to the tuple)
6:   update each centroid of the cluster
7:   (based on the new cluster assignments)
8: end loop
9: return clusters
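This pseudo code can be reproduced in R with the built-in kmeans() function. The sketch below is our illustration (the project itself used Matlab and rattle); the quality column is dropped so that the clustering remains unsupervised, and the attributes are standardised because they lie on very different scales:

#R code
set.seed(42)                               # k-means starts from random centroids
X <- scale(wine[, 1:11])                   # the 11 physicochemical attributes, standardised
km <- kmeans(X, centers = 7, nstart = 25)  # 7 clusters, 25 random restarts
table(Cluster = km$cluster, Quality = wine$quality)  # compare clusters with quality labels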
We have implemented K-Means clustering in Matlab and obtained the output shown in Fig. 2.

Fig. 2. K-Means clustering.

The plot obtained did not clearly indicate the seven clusters. In order to get a better visualization of the seven clusters, we used a library called rattle in R. The plot in Fig. 3 was obtained after running K-Means with rattle.

Fig. 3. K-Means clustering.

D. Principal Component Analysis (PCA) on Wine Quality Dataset

PCA stands for Principal Component Analysis. It is the process of transforming a set of observations of possibly correlated variables into sets of values called principal components. The number of principal components is less than or equal to the number of initial variables. A principal component is a linear combination of the original variables; the transformation is such that the first principal component possesses the largest variance and explains most of the variation in the data, the second principal component explains most of the remaining variance, and so on. PCA identifies patterns in data on the basis of the variance in their similarities and differences. It is popular because it works well on high-dimensional data; moreover, compressing the data by reducing its dimensions incurs little loss of information when the discarded components explain only a small share of the variance.

Standard deviation (s): s = sqrt( Σ (x − x̄)² / (n − 1) ). Variance: the spread of the data in the dataset; Variance = s · s.

Pseudo code of PCA:
Step 1: Calculate the mean of the data set.
Step 2: Subtract the mean from each original value to obtain (X − XMEAN), and compute (X − XMEAN)(X − XMEAN). This step performs the centering of the data.
Step 3: Calculate the standard deviation (s) and variance from the above values.
Step 4: Calculate the covariance of the data: cov(X, Y) = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1).
Step 5: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step 6: Choose the principal components: select the components with the largest eigenvalues (here, the first two) as the principal components.
Step 7: Derive the new data set.

We have implemented PCA on the Wine Quality dataset in Matlab; the plot in Fig. 4 was obtained.

Fig. 4. First two principal components.

In the case of the Wine Quality dataset, the attributes alcohol content and fixed acidity are observed to best describe the wine quality, so these two features are used as the dimensions of the plot. The lower the acidity, the better the quality, whereas alcohol content between 10 and 12 percent is considered good; the wine is considered very good if the alcohol content is more than 12 percent and the acidity is as low as possible. PCA is widely used for exploratory data analysis because it finds the most significant variables, those that explain most of the variance in the data; when the dataset is huge, PCA can make the task much easier. PCA is used in diverse fields: it reduces a complex dataset to a lower dimension to reveal the interesting patterns in the data.
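The same analysis can be carried out in R with the prcomp() function; the sketch below is our illustration (the project used Matlab). Centering and scaling put the eleven attributes on a common footing before the components are extracted:

#R code
pca <- prcomp(wine[, 1:11], center = TRUE, scale. = TRUE)
summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # scores of the first two principal components
biplot(pca)         # plot the instances against the first two components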
E. Multiple Linear Regression on Communities and Crimes Dataset

Linear Regression is an approach for modelling the relationship between the response (dependent) variable and one or more predictor (independent) variables. There are two types of linear regression: Simple Linear Regression, in which there is only one predictor variable, and Multiple Linear Regression, in which there are several. The Simple Linear Regression model is

Y = β0 + β1x + ε,

where Y is the response variable, β0 and β1 are the intercept and the slope respectively, x is the predictor variable and ε is the noise term. The Multiple Linear Regression model is

Y = β0 + β1x1 + β2x2 + ... + βpxp + ε;

in this model we have more than one predictor variable, each with its own slope coefficient. The Multiple Linear Regression technique applied to the Communities and Crimes dataset has helped in finding the significant attributes for predicting the response (i.e. ViolentCrimesPerPop) more accurately.

Analysis of the Communities and Crimes dataset. Building the model: first we built the model using all the attributes of the dataset, with the following R commands:

#R code
lm.mo <- lm(ViolentCrimesPerPop ~ ., data = crimes)
summary(lm.mo)

From the R output of this model, we can infer that the model has a significant number of noise variables. In order to eliminate them, we implemented the variable selection algorithm, selecting only the significant variables that have high correlation with the response variable: we perform the correlation test on the predictor variables and then rebuild the model with the significant variables alone.

Re-building the model with significant variables: the R output shows that the model obtained is significant, as its p-value is 0, as are those of all the predictor variables. We have also obtained a large value of the F-statistic, which indicates significance.

Prediction using the above model: given the values of a new instance, state = 1, racepctblack = 0.48, PctEmploy = 0.57, MalePctNevMarr = 0.45, PctWorkMom = 0.54, NumStreet = 0.09, we can predict the value of the response variable using the multiple linear regression equation obtained from the model:

ViolentCrimesPerPop = 0.3208 − (0.00214)state + (0.4989)racepctblack − (0.1681)PctEmploy + (0.1072)MalePctNevMarr − (0.1592)PctWorkMom + (0.4610)NumStreet = 0.46607.

The predicted value of ViolentCrimesPerPop for the new instance is 0.466, whereas the actual value is 0.5. Hence we can say that the model does fairly well on a new instance. Thus, using Multiple Linear Regression, we have obtained one optimal prediction model, though there can be other optimal models similar to the one we have derived. From this model we can infer that ViolentCrimesPerPop depends on the predictor variables state, racepctblack, PctEmploy, MalePctNevMarr, PctWorkMom and NumStreet.

Multiple linear regression is a very flexible method: the predictor variables can be numeric or categorical, and interactions between variables can be incorporated. This technique makes use of the data very efficiently, and the models obtained are used for the prediction of new instances. One of the disadvantages of Multiple Linear Regression is that it is sensitive to outliers.
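The prediction above can also be obtained with predict() instead of by hand. The sketch below is our illustration; lm.reduced is an assumed name for the re-built model with the six significant predictors, since no variable name is given for it above:

#R code
lm.reduced <- lm(ViolentCrimesPerPop ~ state + racepctblack + PctEmploy +
                 MalePctNevMarr + PctWorkMom + NumStreet, data = crimes)
newInstance <- data.frame(state = 1, racepctblack = 0.48, PctEmploy = 0.57,
                          MalePctNevMarr = 0.45, PctWorkMom = 0.54, NumStreet = 0.09)
predict(lm.reduced, newdata = newInstance)  # about 0.466; the actual value is 0.5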
IV. Performance of the Data Mining Techniques Discussed Above

Beginning with the Spambase dataset, the performance of Logistic Regression on this dataset is 80.27%, which is quite good. The reason for this high performance is that Logistic Regression is designed for binary classification and performs well on datasets with binary class labels (0 or 1), as is the case for Spambase. Other classification techniques might not give such a high classification performance. Logistic Regression is also a fairly robust technique: the predictor variables do not have to be normally distributed or have equal variance in each group. However, there are several drawbacks. Logistic regression can perform well on a dataset with a large number of predictor variables, but there are many situations where it does not: its parameter estimation procedure relies heavily on having an adequate number of samples for each combination of predictor variables, so small sample sizes can lead to wildly inaccurate parameter estimates. Thus, before using logistic regression, we should first make sure that the sample size is large enough.

Data mining techniques like K-Means clustering, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) were performed on the Wine Quality dataset. From the performance of all these techniques, it was observed that LDA performs better than the other two. K-Means clustering does not perform well; as we have seen, the clusters obtained at first were not distinct, because the data points overlap and do not have a perfect separation from one another. PCA and LDA are quite similar to each other, but PCA is not always optimal for classification, and LDA is better in this respect. In PCA, the shape and location of the original dataset change when the dimensions are reduced, whereas LDA does not change the location of the original dataset and instead provides a better separation between the classes, as we clearly saw in the analysis of wine quality using LDA, which gave 83.12% separation. The main advantage of PCA is that it is completely non-parametric: it can produce results for any dataset, and it serves to represent the data in a simpler, reduced form. But in this case, given the nature and characteristics of the Wine Quality dataset, the LDA technique best suits the classification of the wine samples.

Lastly, we analyse the implementation of Multiple Linear Regression (MLR) on the Communities and Crimes dataset. The model built using MLR provides good prediction with a low error rate. The MLR technique is widely used for building models with more than one predictor variable. While implementing MLR, we first need to make sure that the class we are predicting is numeric and continuous. One of the drawbacks of MLR is that all the attributes must be numeric; to satisfy this condition, we had to eliminate the nominal attribute from the dataset before applying the technique. On average, this technique is better suited to datasets whose attributes are numeric.

V. Conclusion and Future Work

We have studied the application of data mining techniques, namely K-Means clustering, Logistic Regression, Principal Component Analysis, Multiple Linear Regression and Linear Discriminant Analysis, on the UCI machine learning datasets Wine Quality, Spambase, and Communities and Crimes.
We have analysed the performance of these algorithms on the datasets and were able to draw various inferences regarding the selection of a specific data mining technique for a particular dataset, depending on its characteristics. Thus, with the selection of appropriate techniques, we were able to obtain the expected results. We also compared the performance of three techniques, K-Means clustering, PCA and LDA, on the Wine Quality dataset and concluded that LDA performs better than the other two. Our future work would include the analysis of more datasets and the application of data mining techniques beyond the ones described in this project.

References
[1] http://archive.ics.uci.edu/ml/datasets/Wine
[2] http://archive.ics.uci.edu/ml/datasets/Spambase
[3] http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
[4] http://www.mathworks.com/help/stats/princomp.html
[5] http://udel.edu/~mcdonald/statlogistic.html
[6] http://www.wikipedia.org/