Project Report

Application of Data Mining Techniques like Linear Discriminant Analysis (LDA), K-Means Clustering, Multiple Linear Regression, Principal Component Analysis (PCA) and Logistic Regression on Datasets
Ankit Ghosalkar, Nikita Shivdikar, Pallavi Tilak and Rohan Dalvi
Abstract—Data mining techniques are used for a wide variety of purposes. Techniques such as Classification, Association and Clustering are widely used for finding interesting patterns in data. In this project, we have applied data mining techniques, namely Logistic Regression, Linear Discriminant Analysis (LDA), K-Means Clustering, Principal Component Analysis (PCA) and Multiple Linear Regression, to three datasets: the Wine Quality dataset, the Spambase dataset and the Communities and Crimes dataset. We have evaluated the performance of these techniques on separate training and test sets, and analysed the experimental results to determine the drawbacks of each technique on its dataset.
Index Terms—Data Mining, Classification, Clustering, Logistic Regression, Linear Discriminant Analysis.
I. Introduction
The amount of data generated from various sources is increasing by leaps and bounds. This data is of no use if it does not reflect any useful information, so there is a need to extract knowledge from it. This led to the evolution of the concept of data mining. Data mining refers to extracting knowledge or finding interesting patterns in data. The knowledge obtained is then stored in a knowledge database for future use, where it can be utilized by business analysts for decision making. Traditionally, the task of data mining was complex and computationally expensive: data analysts had to manually dredge through the data to find interesting patterns. With small datasets the task was simple enough, but with larger datasets it was very time consuming, and experts were required to mine such huge datasets and extract useful knowledge. Thus there arose a need for new data mining techniques.
The new techniques for data mining are more efficient and automated. Techniques such as Classification, Clustering, Association and Regression have made the task of data mining much simpler and more efficient. Classification is used for supervised learning, where we know the class labels and have to predict the class of new instances. Various algorithms have been developed to carry out classification, such as 1R, Naive Bayes and decision trees, each with its own advantages and disadvantages. Clustering is used for unsupervised learning, where we do not know the class labels in advance and have to group together instances that are similar to each other, so that at the end of clustering we obtain distinct clusters. Algorithms used for clustering include K-Means clustering, hierarchical clustering and spectral clustering. Again, depending on the size and characteristics of the data being analysed, we have to choose an appropriate algorithm. We can also use Association for discovering relevant rules from a dataset. Before applying any of these algorithms, we need to process the raw data. Raw data collected from various sources might contain missing values, or its format might not be proper. This constitutes the data pre-processing stage, in which the data is cleaned (e.g. outliers are detected) and prepared for mining. After applying data mining techniques to the data, we evaluate the performance of the models built and analyse the results of the experiments, that is, we use the built model to classify new instances and measure the accuracy and error rate of the classifiers. This gives insight into how well a particular classifier performs on a particular dataset, what measures need to be taken to tune its parameters, and which classifier would perform better on the given dataset. The experimental results serve various purposes; for example, advertising companies can analyse customer data and purchasing trends for targeted marketing. Other domains in which data mining is widely used include banking, sports, the mobile industry, networking, education, business and government.
In this project, we have mainly focused on binary classification using Logistic Regression and Linear Discriminant Analysis, along with Clustering, Principal Component Analysis and Multiple Linear Regression. We have obtained the datasets from the UCI Machine Learning Repository website. The description of these datasets is as follows:
Wine Quality dataset: Wine is increasingly enjoyed by a large number of customers, so wine industries are inventing new strategies to increase the production and sale of wine. A key aspect to be taken into consideration during wine production is quality. Evaluating the quality of wine is an important task for wine industries because they have to ensure that the wine produced is unadulterated and safe for human health. Wine is assessed by various physicochemical tests, which include determining its density, alcohol or pH values, and by sensory tests conducted by human experts. This dataset is of particular interest because it holds valuable information with respect to wine quality assessment. It contains 4898 instances/samples of white wine from the north of Portugal, each tested for quality. The first 11 attributes capture the various parameters of wine assessment, and the 12th attribute gives the wine quality on a scale ranging from 0 (very bad) to 10 (excellent). We have used the Linear Discriminant Analysis (LDA) technique for the classification of wine samples, and have also applied Principal Component Analysis (PCA) and K-Means Clustering to this dataset. Applying data mining techniques to this raw data helps extract useful knowledge that wine businesses can put to good use. Evaluating quality improves decision making, such as identifying the factors that are most relevant, setting prices accordingly and gaining customer satisfaction. It is also interesting to see how this knowledge can help control parameters during the wine production process to improve quality, for example increasing or decreasing the concentration of residual sugar, alcohol or other attributes, and measuring how the concentration of each ingredient affects the popularity of the wine.
Spambase dataset: The number of spam emails that an internet user receives every day has increased tremendously. This dataset consists of spam and non-spam emails collected from various sources. It is a multivariate dataset with 4601 instances and 57 attributes, whose characteristics are integer or real. The predictor variables are: word_freq_WORD - frequency of the word WORD, where a word is a string of alphanumeric characters bounded by non-alphanumeric characters; char_freq_CHAR - ratio of CHAR occurrences to the total number of characters in the email; capital_run_length_average - average length of continuous sequences of capital letters; capital_run_length_longest - length of the longest continuous sequence of capital letters; capital_run_length_total - total number of capital letters in the email; Class - 1 if the email is considered spam and 0 if it is non-spam. For the purpose of classification, we have used Logistic Regression, a well known binary classification technique applicable when there are only two class labels. The application of data mining to the Spambase dataset can be used in the design of spam filters.
Communities and Crimes dataset: This data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. The dataset has 1994 instances with 128 attributes, a large number of missing values, and real attribute characteristics. It is a multivariate dataset containing attributes that help predict crime or have some connection with it. We have implemented the Multiple Linear Regression technique on this dataset to build a model that correctly predicts the value of the response variable, that is, the total number of violent crimes per 100K population. The information on the various attributes of this dataset is as follows [3]:
state: US state (by number)
county: numeric code for county - contains many missing values (numeric)
community: numeric code for community - not predictive and many missing values (numeric)
communityname: community name - not predictive, for information only (string)
fold: fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive (numeric)
population: population for community (numeric - decimal)
householdsize: mean people per household (numeric decimal)
racepctblack: percentage of population that is african
american (numeric - decimal)
racePctWhite: percentage of population that is caucasian (numeric - decimal)
racePctAsian: percentage of population that is of asian
heritage (numeric - decimal)
racePctHisp: percentage of population that is of hispanic heritage (numeric - decimal)
agePct12t21: percentage of population that is 12-21 in
age (numeric - decimal)
agePct12t29: percentage of population that is 12-29 in
age (numeric - decimal)
agePct16t24: percentage of population that is 16-24 in
age (numeric - decimal)
agePct65up: percentage of population that is 65 and
over in age (numeric - decimal)
numbUrban: number of people living in areas classified as urban (numeric - decimal)
pctUrban: percentage of people living in areas classified as urban (numeric - decimal)
medIncome: median household income (numeric - decimal)
pctWWage: percentage of households with wage or salary income in 1989 (numeric - decimal)
pctWFarmSelf : percentage of households with farm or
self employment income in 1989 (numeric - decimal)
pctWInvInc: percentage of households with investment
/ rent income in 1989 (numeric - decimal)
pctWSocSec: percentage of households with social
security income in 1989 (numeric - decimal)
pctWPubAsst: percentage of households with public
assistance income in 1989 (numeric - decimal)
pctWRetire: percentage of households with retirement
income in 1989 (numeric - decimal)
medFamInc: median family income (differs from household income for non-family households) (numeric - decimal)
perCapInc: per capita income (numeric - decimal)
whitePerCap: per capita income for caucasians (numeric
- decimal)
blackPerCap: per capita income for african americans
(numeric - decimal)
indianPerCap: per capita income for native americans
(numeric - decimal)
AsianPerCap: per capita income for people with asian
heritage (numeric - decimal)
OtherPerCap: per capita income for people with ’other’
heritage (numeric - decimal)
HispPerCap: per capita income for people with hispanic
heritage (numeric - decimal)
NumUnderPov: number of people under the poverty
level (numeric - decimal)
PctPopUnderPov: percentage of people under the
poverty level (numeric - decimal)
PctLess9thGrade: percentage of people 25 and over
with less than a 9th grade education (numeric - decimal)
PctNotHSGrad: percentage of people 25 and over that
are not high school graduates (numeric - decimal)
PctBSorMore: percentage of people 25 and over with a bachelors degree or higher education (numeric - decimal)
PctUnemployed: percentage of people 16 and over, in
the labor force, and unemployed (numeric - decimal)
PctEmploy: percentage of people 16 and over who are
employed (numeric - decimal)
PctEmplManu: percentage of people 16 and over who
are employed in manufacturing (numeric - decimal)
PctEmplProfServ: percentage of people 16 and over who are employed in professional services (numeric - decimal)
PctOccupManu: percentage of people 16 and over who
are employed in manufacturing (numeric - decimal)
PctOccupMgmtProf : percentage of people 16 and
over who are employed in management or professional
occupations (numeric - decimal)
MalePctDivorce: percentage of males who are divorced
(numeric - decimal)
MalePctNevMarr: percentage of males who have never
married (numeric - decimal)
FemalePctDiv: percentage of females who are divorced
(numeric - decimal)
TotalPctDiv: percentage of population who are divorced
(numeric - decimal)
PersPerFam: mean number of people per family (numeric - decimal)
PctFam2Par: percentage of families (with kids) that are
headed by two parents (numeric - decimal)
PctKids2Par: percentage of kids in family housing with
two parents (numeric - decimal)
PctYoungKids2Par: percent of kids 4 and under in two
parent households (numeric - decimal)
PctTeen2Par: percent of kids age 12-17 in two parent
households (numeric - decimal)
PctWorkMomYoungKids: percentage of moms of kids
6 and under in labor force (numeric - decimal)
PctWorkMom: percentage of moms of kids under 18 in
labor force (numeric - decimal)
NumIlleg: number of kids born to never married (numeric - decimal)
PctIlleg: percentage of kids born to never married
(numeric - decimal)
NumImmig: total number of people known to be foreign
born (numeric - decimal)
PctImmigRecent: percentage of immigrants who immigrated within the last 3 years (numeric - decimal)
PctImmigRec5: percentage of immigrants who immigrated within the last 5 years (numeric - decimal)
PctImmigRec8: percentage of immigrants who immigrated within the last 8 years (numeric - decimal)
PctImmigRec10: percentage of immigrants who immigrated within the last 10 years (numeric - decimal)
PctRecentImmig: percent of population who have
immigrated within the last 3 years (numeric - decimal)
PctRecImmig5: percent of population who have
immigrated within the last 5 years (numeric - decimal)
PctRecImmig8: percent of population who have
immigrated within the last 8 years (numeric - decimal)
PctRecImmig10: percent of population who have
immigrated within the last 10 years (numeric - decimal)
PctSpeakEnglOnly: percent of people who speak only
English (numeric - decimal)
PctNotSpeakEnglWell: percent of people who do not
speak English well (numeric - decimal)
PctLargHouseFam: percent of family households that
are large (6 or more) (numeric - decimal)
PctLargHouseOccup: percent of all occupied households that are large (6 or more people) (numeric - decimal)
PersPerOccupHous: mean persons per household (numeric - decimal)
PersPerOwnOccHous: mean persons per owner occupied household (numeric - decimal)
PersPerRentOccHous: mean persons per rental household (numeric - decimal)
PctPersOwnOccup: percent of people in owner occupied households (numeric - decimal)
PctPersDenseHous: percent of persons in dense housing (more than 1 person per room) (numeric - decimal)
PctHousLess3BR: percent of housing units with less
than 3 bedrooms (numeric - decimal)
MedNumBR: median number of bedrooms (numeric - decimal)
HousVacant: number of vacant households (numeric - decimal)
PctHousOccup: percent of housing occupied (numeric - decimal)
PctHousOwnOcc: percent of households owner occupied (numeric - decimal)
PctVacantBoarded: percent of vacant housing that is
boarded up (numeric - decimal)
PctVacMore6Mos: percent of vacant housing that has
been vacant more than 6 months (numeric - decimal)
MedYrHousBuilt: median year housing units built
(numeric - decimal)
PctHousNoPhone: percent of occupied housing units without phone (in 1990, this was rare!) (numeric - decimal)
PctWOFullPlumb: percent of housing without complete
plumbing facilities (numeric - decimal)
OwnOccLowQuart: owner occupied housing - lower
quartile value (numeric - decimal)
OwnOccMedVal: owner occupied housing - median value (numeric - decimal)
OwnOccHiQuart: owner occupied housing - upper quartile value (numeric - decimal)
RentLowQ: rental housing - lower quartile rent (numeric - decimal)
RentMedian: rental housing - median rent (Census variable H32B from file STF1A) (numeric - decimal)
RentHighQ: rental housing - upper quartile rent (numeric - decimal)
MedRent: median gross rent (Census variable H43A from file STF3A - includes utilities) (numeric - decimal)
MedRentPctHousInc: median gross rent as a percentage of household income (numeric - decimal)
MedOwnCostPctInc: median owners cost as a percentage of household income - for owners with a mortgage
(numeric - decimal)
MedOwnCostPctIncNoMtg: median owners cost as
a percentage of household income - for owners without a
mortgage (numeric - decimal)
NumInShelters: number of people in homeless shelters
(numeric - decimal)
NumStreet: number of homeless people counted in the
street (numeric - decimal)
PctForeignBorn: percent of people foreign born (numeric - decimal)
PctBornSameState: percent of people born in the same
state as currently living (numeric - decimal)
PctSameHouse85: percent of people living in the same
house as in 1985 (5 years before) (numeric - decimal)
PctSameCity85: percent of people living in the same
city as in 1985 (5 years before) (numeric - decimal)
PctSameState85: percent of people living in the same
state as in 1985 (5 years before) (numeric - decimal)
LemasSwornFT: number of sworn full time police
officers (numeric - decimal)
LemasSwFTPerPop: sworn full time police officers per
100K population (numeric - decimal)
LemasSwFTFieldOps: number of sworn full time police
officers in field operations (on the street as opposed to
administrative etc.) (numeric - decimal)
LemasSwFTFieldPerPop: sworn full time police officers in field operations (on the street as opposed to administrative etc.) per 100K population (numeric - decimal)
LemasTotalReq: total requests for police (numeric - decimal)
LemasTotReqPerPop: total requests for police per
100K population (numeric - decimal)
PolicReqPerOffic: total requests for police per police
officer (numeric - decimal)
PolicPerPop: police officers per 100K population (numeric - decimal)
RacialMatchCommPol: a measure of the racial match
between the community and the police force. High values
indicate proportions in community and police force are
similar (numeric - decimal)
PctPolicWhite: percent of police that are caucasian
(numeric - decimal)
PctPolicBlack: percent of police that are african american (numeric - decimal)
PctPolicHisp: percent of police that are hispanic
(numeric - decimal)
PctPolicAsian: percent of police that are asian (numeric
- decimal)
PctPolicMinor: percent of police that are minority of
any kind (numeric - decimal)
OfficAssgnDrugUnits: number of officers assigned to
special drug units (numeric - decimal)
NumKindsDrugsSeiz: number of different kinds of
drugs seized (numeric - decimal)
PolicAveOTWorked: police average overtime worked
(numeric - decimal)
LandArea: land area in square miles (numeric - decimal)
PopDens: population density in persons per square mile
(numeric - decimal)
PctUsePubTrans: percent of people using public transit
for commuting (numeric - decimal)
PolicCars: number of police cars (numeric - decimal)
PolicOperBudg: police operating budget (numeric - decimal)
LemasPctPolicOnPatr: percent of sworn full time
police officers on patrol (numeric - decimal)
LemasGangUnitDeploy: gang unit deployed (numeric
- decimal - but really ordinal - 0 means NO, 1 means YES,
0.5 means Part Time)
LemasPctOfficDrugUn: percent of officers assigned to
drug units (numeric - decimal)
PolicBudgPerPop: police operating budget per population (numeric - decimal)
ViolentCrimesPerPop: total number of violent crimes per 100K population (numeric - decimal) - the response variable
The various tools that we have used for data mining are WEKA, Matlab and R. These tools make the task of data mining much easier, faster and more efficient. The results generated were numeric values, confusion matrices, or graphs and plots, which were analysed and evaluated for performance and accuracy as well as compared with other approaches. The rest of the article is organized as follows: Section II describes data collection and pre-processing. Section III describes in detail the techniques used for data mining on the Wine Quality, Spambase and Communities and Crimes datasets, as well as the results of the experiments performed on them. Section IV compares these techniques with respect to various performance measures. Section V concludes the paper and gives directions for future work.
II. Data Collection and Pre-Processing
The Wine Quality, Spambase and Communities and Crimes datasets were obtained from the UCI Machine Learning Repository website. The Spambase dataset obtained from the site was in the proper format, as expected, and did not contain any missing values, so we did not carry out any pre-processing on it. The Wine Quality dataset, however, was not in the proper format: it contained all the values of each instance in a single cell of an Excel spreadsheet, separated by semicolons, and thus was hard to read. In order to bring it into a proper readable format, we wrote the following code in R, which copies all the semicolon-separated values into a new comma-separated file:
wine <- read.table('winequality-white.csv', sep=';')
write.table(wine, 'winequality-whitewine.csv', sep=',')
Furthermore, the Wine Quality dataset did not contain any missing values, so we could use this cleaned dataset for data mining. The Communities and Crimes dataset had many missing values for many attributes. Since we are doing Multiple Linear Regression on this dataset, we eliminated the nominal attribute named county from the dataset, and we also eliminated all attributes whose missing values exceeded 85% of the instances. The eliminated attributes were county, community, communityname, LemasSwornFT, LemasSwFTPerPop, LemasSwFTFieldOps, LemasTotalReq, LemasTotReqPerPop, PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicBlack, PctPolicHisp, PctPolicAsian, PolicCars, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy and PolicBudgPerPop. Finally, all the datasets were cleaned and ready to use for data mining.
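As a hedged illustration of this pre-processing, the following R sketch drops the nominal attributes and any column with more than 85% missing values; the file name and the use of '?' as the missing-value marker are assumptions based on the UCI distribution of the dataset.
#R code
crimes <- read.csv('communities.csv', header = TRUE, na.strings = '?')  # assumed file name
crimes$county <- NULL                       # nominal attribute removed before MLR
crimes$communityname <- NULL                # non-predictive string attribute
miss.frac <- colMeans(is.na(crimes))        # fraction of missing values per column
crimes <- crimes[, miss.frac <= 0.85]       # drop attributes with >85% missing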
III. Application of Data Mining Techniques on Wine Quality, Spambase and Communities and Crimes Datasets
In this section, we describe the analyses that were carried out, along with our findings.
A. Logistic Regression on Spambase Dataset
Logistic Regression is a binary classification technique. We use this technique when the response variable is binary (i.e. 0 or 1) and we have a collection of real valued explanatory variables. Given a vector X, the task is to predict its class as accurately as possible. The response variable is related to the predictor variables through a relationship of the form
log(p/(1 − p)) = β0 + β1x1 + · · · + βpxp,
where p is the probability that the response equals 1.
Fig. 1. Analysis of the Spambase dataset: Building the Model
Applying the Logistic Regression technique in R, we first fit the full model, that is, the model that includes all the explanatory variables of the Spambase dataset. The function glm in R is used for model fitting in Logistic Regression. We use the following commands:
spam <- read.csv('spambase.csv', header = T)
glm.spam <- glm(Class ~ ., data = spam, family = binomial('logit'))
summary(glm.spam)
R output after fitting the full model:
After studying this model, we observed that the model is significant, because its residual deviance is 1815.8. However, the model contains many noise (insignificant) variables, as can be inferred from their large p-values. Hence we need a model that contains only significant variables. For this, we use variable selection, wherein only significant variables are retained in the model while the rest are eliminated. The following R code selects the variables that have high correlation with the response variable:
XY <- spam
p <- ncol(XY)
good.var <- c()
for (i in 1:(p-1)) {
  if (abs(cor(XY[, i], XY[, p])) >= 0.30)
    if (cor.test(XY[, i], XY[, p])$p.value < 0.05)
      good.var <- c(good.var, colnames(XY)[i])
}
The output obtained after applying this technique: four variables, namely word_freq_remove, word_freq_your, word_freq_000 and char_freq_..4, were deemed significant. Hence we build the model again using these variables:
# family added here for consistency with the full model fit above
glm.newmodel <- glm(Class ~ word_freq_remove + word_freq_your + word_freq_000 + char_freq_..4, data = spam, family = binomial('logit'))
summary(glm.newmodel)
R output after refitting the model:
Performance of Logistic Regression on the Spambase dataset:
We have obtained the confusion matrix showing actual and predicted values.
Actual <- spam$Class
confmat <- table(Actual, predicted)
confmat
From the above confusion matrix, we have computed various parameters:
True Positives (TP): 1042 emails correctly predicted as spam.
False Positives (FP): 137 emails predicted as spam that were actually non-spam.
True Negatives (TN): 2651 emails predicted as non-spam that were non-spam.
False Negatives (FN): 771 emails predicted as non-spam that were actually spam.
Precision = TP/(TP+FP) = 0.8837 = 88.37%
Recall = TP/(TP+FN) = 0.5747 = 57.47%
F-measure = 2 × (Precision × Recall) / (Precision + Recall) = 0.6964
The summary of this model reveals that the model is significant, since its residual deviance is less than the degrees of freedom. Also, all the variables are significant since their p-values are essentially zero. Hence we can use this model for our further analysis.
Prediction with Logistic Regression
The model obtained above can now be used for prediction. We first predict the log odds ratio for each instance, and then predict the probability of the class (0 = non-spam, 1 = spam) of each instance. We have used the following R commands for our predictions:
logs.odds.ratio <- predict(glm.newmodel, spam[, -57])
probabilities <- predict(glm.newmodel, spam[, -57], type = 'response')
predicted <- rep(0, 4601)
positive <- which(probabilities > 0.5)
predicted[positive] <- 1
Percentage of Correct Classification (PCC) = (TP+TN)/number of instances = 80.27%
Error rate (misclassification rate) = 1 − PCC = 19.73%
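These measures can also be derived directly from the confusion matrix in R; the following is a minimal sketch, assuming confmat has the actual classes as rows and the predicted classes as columns:
#R code
TP <- confmat["1", "1"]; FP <- confmat["0", "1"]   # spam as spam / non-spam as spam
TN <- confmat["0", "0"]; FN <- confmat["1", "0"]   # non-spam as non-spam / spam as non-spam
precision <- TP / (TP + FP)                        # 0.8837
recall    <- TP / (TP + FN)                        # 0.5747
f.measure <- 2 * precision * recall / (precision + recall)  # 0.6964
pcc       <- (TP + TN) / sum(confmat)              # 0.8027
error.rate <- 1 - pcc                              # 0.1973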
ROC (Receiver Operating Characteristic) curve
The ROC curve is widely used to measure the quality of a classification technique. It plots the True Positive Rate against the False Positive Rate. The following plot is the ROC curve for the new model obtained.
This ROC curve illustrates the performance of the Spambase classifier.
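As a hedged sketch (the original plotting commands are not reproduced in the report), such a curve could be produced with the pROC package, which is an assumption on our part rather than part of the original analysis:
#R code
library(pROC)                               # assumed package for ROC plotting
roc.obj <- roc(spam$Class, probabilities)   # probabilities from the refitted model
plot(roc.obj)                               # ROC curve
auc(roc.obj)                                # area under the curve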
B. Linear Discriminant Analysis (LDA) on Wine Quality dataset
LDA is similar to Principal Component Analysis. The difference is that PCA finds directions of maximal variance without using the class labels, whereas LDA finds directions that best discriminate the classes. Like PCA, LDA performs dimensionality reduction while preserving much of the information. The models obtained using LDA sometimes show higher accuracy than more complex models. The LDA technique is used for classification: it finds a discriminant function of the predictors that yields a new set of transformed values giving more accurate discrimination than any single predictor variable alone. It tries to find the directions along which the classes are best separated, taking into account not only the scatter within the classes but also the scatter between the classes.
This LDA technique is applied to the Wine Quality dataset to classify the instances based on wine quality. The library MASS in R is used to perform LDA:
library(MASS)
wine.lda <- lda(quality ~ ., data = wine)
wine.lda
We select the first two significant components, as shown in the R output below.
R output:
LDA Method
1: Let N be the number of classes
2: Let μi be the mean vector of class i, i = 1, 2, ..., N
3: Let ni be the number of samples within class i, i = 1, 2, ..., N
4: Let L be the total number of samples
We now compute the scatter matrix within the classes using the following formula:
Sw = Σi Σ{x in class i} (x − μi)(x − μi)^T
Scatter matrix between the classes:
Sb = Σi ni (μi − μ)(μi − μ)^T
Next we find the mean of the entire dataset:
μ = (1/L) Σ x
LDA tries to minimize the scatter within the classes while maximizing the scatter between the classes, reducing the variation due to noise while retaining class separability:
maximize det(Sb)/det(Sw)
Linear Transformation:
The linear transformation is given by the matrix U whose columns are the eigenvectors of inverse(Sw) Sb. There are at most N − 1 generalized eigenvectors with nonzero eigenvalues. The generalized eigenvector u is given by:
Sb u = λ Sw u
If Sw is non-singular, the eigenvalue problem becomes:
inverse(Sw) Sb u = λ u
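As a small illustrative sketch (not part of the original analysis), these scatter matrices can be computed directly in R for the wine data, assuming the data frame wine loaded above:
#R code
X  <- as.matrix(wine[, 1:11])              # the 11 predictors
cl <- wine$quality                         # class labels
mu <- colMeans(X)                          # overall mean vector
d  <- ncol(X)
Sw <- matrix(0, d, d); Sb <- matrix(0, d, d)
for (g in unique(cl)) {
  Xg  <- X[cl == g, , drop = FALSE]
  mug <- colMeans(Xg)
  Sw  <- Sw + crossprod(sweep(Xg, 2, mug))       # sum of (x - mu_i)(x - mu_i)^T
  Sb  <- Sb + nrow(Xg) * tcrossprod(mug - mu)    # n_i (mu_i - mu)(mu_i - mu)^T
}
eig <- eigen(solve(Sw) %*% Sb)             # discriminant directions (Sw non-singular)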
From the R output above, we can infer that the first discriminant function is the following linear combination of the variables:
LD1 = (1.864)fixed.acidity − (4.755)volatile.acidity − (7.046)citric.acid + (1.89)residual.sugar − (5.294)chlorides + (1.0608)free.sulfur.dioxide − (1.229)total.sulfur.dioxide − (3.445)density + (1.698)pH + (1.61)sulphates + (5.3604)alcohol
Next, we need to calculate the values of the first discriminant function for each instance in the dataset:
wine.lda.vals <- predict(wine.lda, wine[, 1:11])
wine.lda.vals$x[, 1]
R output for the first few instances:
Also, from the proportion of trace we get the percentage of separation between the groups achieved by each discriminant function. For example, the first discriminant function achieves a separation of 83.12%, while the second achieves a separation of 11.83%. Therefore, to achieve a good separation between the groups, we need to use both discriminant functions.
Results of Linear Discriminant Analysis:
The results of applying LDA to the Wine Quality dataset are shown by a stacked histogram of the values of the discriminant function for the instances of the different groups. We use the function ldahist() in R to make a stacked histogram of the values of the first discriminant function.
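The exact call is not shown in the report; a minimal sketch, assuming the wine.lda.vals object computed above, would be:
#R code
# Stacked histogram of the first discriminant function, grouped by quality
ldahist(data = wine.lda.vals$x[, 1], g = wine$quality)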
Prediction using LDA
To understand the accuracy of the discriminator in classification, we have taken two new instances and used LDA to classify them.
#R code
disc <- lda(quality ~ ., data = testwine)
predict(disc)
predict(disc)$class
newInstance1: fixed.acidity=7.1, volatile.acidity=0.24, citric.acid=0.41, residual.sugar=17.8, chlorides=0.046, free.sulfur.dioxide=39, total.sulfur.dioxide=145, density=0.9998, pH=3.32, sulphates=0.39, alcohol=8.7
predict(disc, newInstance1)
class predicted = 5; actual class = 5
newInstance2: fixed.acidity=8.1, volatile.acidity=0.27, citric.acid=0.41, residual.sugar=1.45, chlorides=0.033, free.sulfur.dioxide=11, total.sulfur.dioxide=63, density=0.9908, pH=2.99, sulphates=0.56, alcohol=12
predict(disc, newInstance2)
class predicted = 5; actual class = 4
Cross Validation using LDA:
The classifier is trained on part of the data and used to predict the rest of the data.
#R code
trainSet <- sample(1:2020, 1010)
table(wine$quality[trainSet])
classifier <- lda(quality ~ ., data = wine, subset = trainSet)
Predicted <- predict(classifier, wine[-trainSet, ])$class
Actual <- wine$quality[-trainSet]
table(Actual, Predicted)
Output obtained:
The confusion matrix obtained above shows the number of instances that were predicted correctly and the number that were misclassified. Thus the LDA technique is widely used for classification, especially when there is a larger number of classes.
C. K-Means Clustering on Wine Quality dataset
Clustering is defined as the grouping of similar things together. It is often confused with classification, but clustering is an unsupervised exploration procedure, whereas classification is supervised and used for prediction. K-Means is an unsupervised clustering algorithm: given k clusters, decided a priori, it assigns each data point to one of them. The first step is to decide on the initial choices for the k centroids. The next step is to associate each value in the dataset with the nearest centroid, iteratively, by computing distances. The main objective of clustering is to minimise the squared-error objective function. Clustering can be employed to detect anomalies in the data, for example fraudulent transactions or scams. We have used the K-Means clustering technique on the Wine Quality dataset to cluster the wine samples into seven clusters representing the quality level of the wine sample (ranging from 3 to 9). The following is the pseudo code of the K-Means clustering algorithm; a brief R sketch follows the pseudo code.
Pseudo code:
1: assign each tuple to a randomly chosen cluster
2: calculate the centroid of each cluster
3: loop until no centroid changes
4:   assign each instance to the closest cluster
5:   (the cluster whose centroid is closest to the tuple)
6:   update each cluster centroid
7:   (based on the new cluster assignments)
8: end loop
9: return clusters
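Although our implementation was in Matlab, the same procedure can be sketched in R with the built-in kmeans() function; the seed and the nstart value are assumptions added for reproducibility:
#R code
set.seed(1)                                     # reproducible starting centroids
km <- kmeans(wine[, 1:11], centers = 7, nstart = 25)
table(km$cluster, wine$quality)                 # compare clusters with quality labels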
We have implemented K-Means clustering in Matlab and obtained the following output:
Fig. 2. K-means clustering
The plot obtained did not clearly indicate the seven clusters. In order to get a better visualization of the seven clusters, we used the library rattle in R. The following plot was obtained after running K-Means in R:
Fig. 3. K-means clustering
D. Principal Component Analysis (PCA) on Wine Quality dataset
PCA stands for Principal Component Analysis. It is the process of transforming a set of observations of correlated variables into sets of values called principal components. The number of principal components is less than or equal to the number of initial variables, and the first principal component possesses the largest variance. PCA identifies patterns in data on the basis of the variance in their similarities and differences. It is popular because it works well on data of high dimension; moreover, compression of the data by reduction of dimensions incurs little loss of information, since the discarded components explain only a small fraction of the variance. The following quantities are used:
Principal Component: a linear combination of the original variables. The first principal component explains most of the variation in the data, the second explains most of the remaining variance, and so on.
Standard Deviation (s): s = sqrt(Σ(X − XMEAN)² / (n − 1))
Variance: the spread of the data in the data set. Variance = s²
Pseudo Code of PCA:
Step 1: Calculate the mean of the data set.
Step 2: Subtract the mean from each original value, giving (X − XMEAN), and form the products (X − XMEAN)(X − XMEAN).
Step 3: Calculate the Standard Deviation (s) and Variance from the above values. This step performs the centering of the data.
Step 4: Calculate the covariance of the data: cov(X, Y) = Σ(X − XMEAN)(Y − YMEAN) / (n − 1)
Step 5: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step 6: Choose the principal components: select the two largest eigenvalues and their eigenvectors as the principal components.
Step 7: Derive the new data set.
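These steps correspond to what R's prcomp() performs internally; the following is a hedged sketch of the same pipeline (our actual implementation was in Matlab, and the centering/scaling options are assumptions about the preprocessing):
#R code
pc <- prcomp(wine[, 1:11], center = TRUE, scale. = TRUE)  # steps 1-5
summary(pc)                # proportion of variance explained by each component
head(pc$x[, 1:2])          # steps 6-7: scores on the first two components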
We have implemented PCA on the Wine Quality dataset in Matlab. The following plot was obtained:
Fig. 4. First two principal components
In the case of the Wine Quality dataset, the attributes alcohol content and fixed acidity are observed to best describe the wine quality, so these two features are used as the dimensions of the plot. The lower the acidity, the better the quality, whereas alcohol content between 10 and 12 per cent is considered good; a wine is considered very good if the alcohol content is more than 12 per cent and the acidity is as low as possible. PCA is widely used for exploratory data analysis because it finds the most significant variables, those that explain most of the variance in the data, so when the dataset is huge, PCA can make the task much easier. PCA is used in diverse fields; it reduces a complex dataset to a lower dimension to reveal the interesting patterns in the data.
E. Multiple Linear Regression on Communities and Crimes Dataset
Linear Regression is an approach for modelling the relationship between the response (dependent) variable and one or more predictor (independent) variables. There are two types of Linear Regression techniques: Simple Linear Regression, in which there is only one predictor variable, and Multiple Linear Regression, in which there is more than one predictor variable. The Simple Linear Regression model is as follows:
Y = β0 + β1x + ε
where Y is the response variable, β0 and β1 are the intercept and the slope respectively, x is the predictor variable and ε is the noise term. The Multiple Linear Regression model is as follows:
Y = β0 + β1x1 + β2x2 + · · · + βpxp + ε
In this model we have more than one predictor variable, each with its own slope coefficient. The Multiple Linear Regression technique applied to the Communities and Crimes dataset has helped in finding the significant attributes for predicting the response (i.e. ViolentCrimesPerPop) more accurately.
Analysis of the Communities and Crimes Dataset: Building the model
First we built the model using all the attributes of the dataset. The following R commands were used:
lm.mo <- lm(ViolentCrimesPerPop ~ ., data = crimes)
summary(lm.mo)
R output:
From the model obtained above, we can infer that it contains a significant number of noise variables. In order to eliminate them, we implemented a variable selection procedure in which we select only the significant variables that have high correlation with the response variable: we perform a correlation test on the predictor variables and then rebuild the model with the significant variables alone.
Re-building the model with the significant variables:
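The refit command itself is not reproduced in the report; a hedged sketch, assuming the significant variables are those appearing in the final model below (lm.new is an assumed name), might be:
#R code
lm.new <- lm(ViolentCrimesPerPop ~ state + racepctblack + PctEmploy +
             MalePctNevMarr + PctWorkMom + NumStreet, data = crimes)  # assumed refit
summary(lm.new)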
R output:
The model obtained above is significant: the p-value of the model is essentially 0, as are those of all the predictor variables, and the F-statistic is large, which indicates significance.
Prediction using the above model:
Given the set of values of the new instance, state = 1, racepctblack = 0.48, PctEmploy = 0.57, MalePctNevMarr = 0.45, PctWorkMom = 0.54, NumStreet = 0.09, we can now predict the value of the response variable by using the following multiple linear regression equation obtained from the above model:
ViolentCrimesPerPop = 0.3208 − (0.00214)state + (0.4989)racepctblack − (0.1681)PctEmploy + (0.1072)MalePctNevMarr − (0.1592)PctWorkMom + (0.4610)NumStreet = 0.46607
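The same prediction can be sketched in R, assuming the refitted model lm.new from the earlier sketch:
#R code
new.inst <- data.frame(state = 1, racepctblack = 0.48, PctEmploy = 0.57,
                       MalePctNevMarr = 0.45, PctWorkMom = 0.54, NumStreet = 0.09)
predict(lm.new, newdata = new.inst)   # approximately 0.466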
The predicted value of ViolentCrimesPerPop for the new instance is 0.466, whereas the actual value is 0.5. Hence we can say that the model does fairly well when given a new instance. Thus, using Multiple Linear Regression, we have obtained one of the optimal prediction models, though there can be other optimal models similar to the one we have derived. From this model we can infer that ViolentCrimesPerPop depends on the predictor variables state, racepctblack, PctEmploy, MalePctNevMarr, PctWorkMom and NumStreet.
Multiple linear regression is a very flexible method.
The predictor variables can be numeric or categorical,
and interactions between variables can be incorporated.
This technique makes use of the data very efficiently.
The models obtained are used for the prediction of new
instances. One of the disadvantages of Multiple Linear
Regression is that it is sensitive to outliers.
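As a generic illustration of incorporating an interaction between two predictors (a hypothetical sketch, not a model fitted in this project), R's model notation allows:
#R code
# '*' expands to both main effects plus their interaction term
lm(ViolentCrimesPerPop ~ racepctblack * PctEmploy, data = crimes)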
IV. Performance of the data mining techniques
discussed above
Beginning with the Spambase dataset, the performance of Logistic Regression on this dataset is 80.27%, which is fairly good. The reason behind the high performance is that Logistic Regression is designed for binary classification and performs well on datasets that have binary class labels (0 or 1), as is the case for the Spambase dataset. Other classification techniques might not give such high classification performance. The Logistic Regression technique is also more robust: the predictor variables do not have to be normally distributed or have equal variance in each group. However, there are several drawbacks. Logistic Regression can perform well on a dataset that has a large number of predictor variables, but there are many situations where it does not: its parameter estimation procedure relies heavily on having an adequate number of samples for each combination of predictor variables, so small sample sizes can lead to wildly inaccurate parameter estimates. Thus, before using Logistic Regression we should first make sure that the sample size is large enough.
The data mining techniques K-Means clustering, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) were performed on the Wine Quality dataset. From the performance of these techniques, it was observed that LDA performs better than the other two. K-Means clustering does not perform well: as we have seen, the clusters obtained at first were not distinct, because the data points overlap and do not have a clean separation from one another. PCA and LDA are quite similar to each other; however, PCA is not always optimal for classification, and here LDA is better than PCA. In PCA, the shape and location of the original dataset change when dimensions are reduced, whereas LDA does not change the location of the original dataset but instead provides a better separation between the classes, as we have clearly seen in the analysis of wine quality using LDA, which gave 83.12% separation. The main advantage of PCA is that it is completely non-parametric, that is, it can produce results given any dataset, and it serves to represent the data in a simpler, reduced form. But in this case, from the very nature and characteristics of the Wine Quality dataset, the LDA technique happens to best suit the classification of wine samples.
Lastly, we analyse the implementation of Multiple Linear Regression (MLR) on the Communities and Crimes dataset. The model built using MLR provides good prediction with a low error rate. The MLR technique is widely used for building models that have more than one predictor variable. While implementing MLR, we first need to make sure that the class we are predicting is numeric and continuous. One of the drawbacks of MLR is that all the attributes must be numeric; to satisfy this condition we had to eliminate one nominal attribute from the dataset before applying the technique. On average, this technique is better suited for datasets with purely numeric attributes.
V. Conclusion and Future Work
We have studied the application of data mining techniques, namely K-Means clustering, Logistic Regression, Principal Component Analysis, Multiple Linear Regression and Linear Discriminant Analysis, to UCI machine learning datasets: the Wine Quality, Spambase and Communities and Crimes datasets. We have analysed the performance of the algorithms on these datasets, and were able to draw various inferences regarding the selection of a specific data mining technique for a particular dataset depending on its characteristics. Thus, with the selection of appropriate techniques, we were able to obtain the expected results. We also compared the performance of three techniques, K-Means clustering, PCA and LDA, on the Wine Quality dataset and concluded that LDA performs better than the other two. Our future work would include the analysis of more datasets and the application of data mining techniques in addition to the ones described in this project.
References
[1] http://archive.ics.uci.edu/ml/datasets/Wine
[2] http://archive.ics.uci.edu/ml/datasets/Spambase
[3] http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
[4] http://www.mathworks.com/help/stats/princomp.html
[5] http://udel.edu/~mcdonald/statlogistic.html
[6] http://www.wikipedia.org/