Pattern Recognition MM7 - Institut for Medicin og Sundhedsteknologi

Pattern Recognition MM7

- Feature evaluation
  - What to consider when choosing features
  - Is a feature robust?
  - How many samples do we need to represent a feature (mean and covariance)?
  - Is the feature normally distributed?
- Break
- Dimensionality reduction of the feature space

Number of samples

- How many samples do we need to describe a feature for a class?
  1) Scientific Table
  2) Variance analysis

Number of samples: Scientific Table

- Distribution-free tolerance limits
- Which number of samples, N, is required in order to ensure that βp % of the population is within the min. and max. values, with a confidence of βt %?
- K. Diem and C. Lentner. Scientific Tables. Ciba-Geigy Ltd., 1975
- Typically: 95 % for both => 93 samples
- Often the samples are normally distributed => fewer samples are required
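
The 93-sample figure can be reproduced from the standard formula for two-sided distribution-free tolerance limits based on the sample minimum and maximum, P(at least βp of the population is covered) = 1 - N·p^(N-1) + (N-1)·p^N. A minimal MATLAB sketch (the variable names are ours, not from the slides):

  p    = 0.95;                                  % required coverage, betap
  gam  = 0.95;                                  % required confidence, betat
  conf = @(n) 1 - n.*p.^(n-1) + (n-1).*p.^n;    % confidence that [min, max] covers betap of the population
  N    = 2;                                     % at least two samples are needed for [min, max]
  while conf(N) < gam
      N = N + 1;
  end
  fprintf('Required number of samples: N = %d\n', N);   % prints N = 93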

Number of samples: Variance analysis

- Plot the variance as a function of N
- Choose N so that the variance is stable
- Do this for each feature in each class
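
A minimal MATLAB sketch of such a variance-stability plot for one feature in one class (the data are simulated here, just to illustrate the idea):

  x = randn(500, 1);                    % samples of one feature for one class (simulated)
  N = 10:10:numel(x);                   % candidate numbers of samples
  v = arrayfun(@(n) var(x(1:n)), N);    % variance estimated from the first n samples
  plot(N, v, '-o');
  xlabel('Number of samples, N');
  ylabel('Estimated variance');
  title('Choose N where the curve has stabilised');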

Is a feature normally distributed?

- If the features defining a class are normally distributed, then Bayes' classifier reduces to the Mahalanobis distance
- In practice it is often assumed that all features are normally distributed, but how do we test this?
  1) Histogram inspection
  2) Goodness of fit
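
As a reminder of what this buys us, a minimal MATLAB sketch of a minimum-Mahalanobis-distance classifier for one sample (the class means and covariances are assumed to have been estimated already; the Statistics Toolbox function mahal could be used instead):

  mu1 = [0; 0];   C1 = eye(2);                  % estimated mean/covariance for class 1 (assumed)
  mu2 = [3; 1];   C2 = [2 0.5; 0.5 1];          % estimated mean/covariance for class 2 (assumed)
  x   = [1.5; 0.4];                             % sample to classify

  d2_1 = (x - mu1)' / C1 * (x - mu1);           % squared Mahalanobis distance to class 1
  d2_2 = (x - mu2)' / C2 * (x - mu2);           % squared Mahalanobis distance to class 2
  [~, label] = min([d2_1, d2_2]);
  fprintf('Assigned to class %d\n', label);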

Is a feature normally distributed?

- 1) Histogram inspection
  - Matlab: normplot
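
A minimal MATLAB sketch of the qualitative inspection (normplot requires the Statistics and Machine Learning Toolbox; x is simulated here):

  x = randn(200, 1);         % feature samples for one class (simulated)
  subplot(1, 2, 1);
  histogram(x, 20);          % does the histogram look bell-shaped?
  title('Histogram');
  subplot(1, 2, 2);
  normplot(x);               % normal probability plot: a straight line suggests normality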

Is a feature normally distributed?

- 2) "Goodness of fit" (χ²-test)
- Idea: Compare the data with a perfect normal distribution
- Algorithm:
  a) Divide the data into k intervals (k as small as possible)
     - Choose k so that the expected counts fi > 1 for all i, and fi > 5 for 80 % of the k intervals
     - Choose k so that each interval has approximately the same probability
  b) Compare the measured data with the expected data
     - Error measure: T
     - T is χ²-distributed
  c) If T < THα => normally distributed at significance level α (see a statistical table)
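
A minimal MATLAB sketch of the test with equiprobable intervals (norminv and chi2inv require the Statistics and Machine Learning Toolbox; alternatively the built-in chi2gof does all of this in one call). The threshold uses k - 3 degrees of freedom because the mean and variance are estimated from the data:

  x     = randn(200, 1);                        % feature samples (simulated)
  N     = numel(x);
  k     = 8;                                    % number of intervals
  alpha = 0.05;                                 % significance level

  edges = norminv((0:k)/k, mean(x), std(x));    % equiprobable interval borders under the fitted normal
  f     = histcounts(x, edges);                 % measured counts per interval
  e     = (N/k) * ones(1, k);                   % expected counts (equal by construction)

  T  = sum((f - e).^2 ./ e);                    % error measure, approximately chi^2 distributed
  TH = chi2inv(1 - alpha, k - 3);               % threshold TH_alpha
  if T < TH
      disp('Normality cannot be rejected at this significance level');
  else
      disp('Normality is rejected');
  end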

What to remember?

- Feature evaluation
  - Robustness (invariant w.r.t. the application)
  - Number of samples
    - Scientific Table
    - Variance analysis
  - Normally distributed (Bayes' rule)
    - Histogram inspection (qualitative analysis)
    - Goodness of fit (statistical analysis)

Break

Method for reducing the dimensionality of the feature-space

Reduce the number of features

- Why?
  - ”The curse of dimensionality”
  - Visualization
  - Remove ”noise” (10 dependent features + 1 independent)
  - Faster processing
- How?
  - If features are correlated => redundancy
  - Remove redundancy

Methods

- Hierarchical dimensionality reduction
- Principal Component Analysis (PCA)

Methods

- Unsupervised
  - Ignore that the samples come from different classes
  - Reduce the dimensionality (compression)

Hierarchical dimensionality reduction

- Correlation matrix
- Algorithm (a code sketch follows after the merge options on the next slide):
  1) Calc. the correlation matrix
  2) Find max Ckl, k ≠ l
  3) Merge features Fk and Fl
  4) Save the merged feature as Fk
  5) Delete Fl
  6) Stop or go to 1)
- Stop criterion:
  - Max Ckl is too small
  - Number of dimensions is OK
  - Others…

Merge features

- Keep Fk and delete Fl
- (Fk + Fl) / 2
- (w1·Fk + w2·Fl) / 2
- Others…
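
A minimal MATLAB sketch of the algorithm, using the simple (Fk + Fl)/2 merge rule and a correlation threshold as the stop criterion (X is a hypothetical N-by-d data matrix with one feature per column; the names are ours):

  X = randn(200, 5);  X(:, 5) = X(:, 1) + 0.1*randn(200, 1);   % hypothetical data with two correlated features
  minCorr = 0.8;                                % stop when the largest |correlation| drops below this

  while size(X, 2) > 1
      C = abs(corrcoef(X));                     % 1) correlation matrix (absolute values)
      C(logical(eye(size(C)))) = 0;             %    ignore the diagonal (k ~= l)
      [cmax, idx] = max(C(:));                  % 2) find max Ckl
      if cmax < minCorr, break; end             %    stop criterion: max Ckl is too small
      [k, l] = ind2sub(size(C), idx);
      X(:, k) = (X(:, k) + X(:, l)) / 2;        % 3)-4) merge Fk and Fl, save as Fk
      X(:, l) = [];                             % 5) delete Fl
  end                                           % 6) go to 1)
  fprintf('%d features left\n', size(X, 2));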

Principal Component Analysis (PCA)

- Combine features into new features, and then ignore some of the new features
- PCA is used a lot, especially when you have many dimensions
- Basic idea: features with a large variance separate the classes better
- If both features have large variances – then what?
- Transform the feature-space so that we get large variances and no correlation!
- Variance = Information!

PCA – Transform

[Figure: data in the original feature space (x1, x2) with the rotated principal axes y1 and y2]
- Ignore y2 without losing info when classifying
- y1 and y2 are the principal components

PCA – How to

1. Collect data (x)
2. Calc. the covariance matrix: Cx
   - Matlab: Cx = cov(x);
3. Solve the Eigen-value problem => A and Cy
   - Matlab: [Evec,Eval] = eig(Cx);
4. Transform x => y: y = A(x - µ)
5. Analyze (PCA)
   1. M-method
   2. J-measure
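
A minimal MATLAB sketch of steps 1-4 (X is a hypothetical N-by-d data matrix with one sample per row; the M-method / J-measure analysis of step 5 is not included):

  X  = randn(200, 3) * [2 0 0; 0.5 1 0; 0 0 0.2];   % hypothetical correlated data, one sample per row
  mu = mean(X);                                     % 1) data collected; estimate the mean

  Cx = cov(X);                                      % 2) covariance matrix
  [Evec, Eval] = eig(Cx);                           % 3) eigen-vectors/values of Cx
  [~, order] = sort(diag(Eval), 'descend');         %    sort by variance, largest first
  A  = Evec(:, order)';                             %    the rows of A are the principal directions

  Y  = (X - mu) * A';                               % 4) transform: y = A(x - mu) for every sample
  Cy = cov(Y);                                      %    Cy is (close to) diagonal => uncorrelated features
  disp(diag(Cy)');                                  %    variances of the new features y1, y2, ...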

What to remember

- Feature reduction
  - Unsupervised
  - Hierarchical dimensionality reduction
    - Correlation matrix
    - Merge features or delete features
  - Principal Component Analysis (PCA)
    - Combine features into new features
    - Ignore some of the new features
    - VARIANCE = INFORMATION
    - Transform the feature-space using the Eigen-vectors of the covariance matrix => uncorrelated features!
    - Analyze
      - M-method
      - J-measure

X-tra slides

Is a feature normally distributed?

- 2) Skewness and Kurtosis
  - One feature and one class
  - A distribution's i'th moment (m_i) can be expressed as:

    $m_i = \frac{1}{N}\sum_{j=1}^{N} (x_j - \mu)^i$

    where N is the number of samples, x_j is sample number j, and µ is the mean value of the feature.
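
A minimal MATLAB sketch using this moment formula; skewness = m3 / m2^(3/2) and kurtosis = m4 / m2^2, which are approximately 0 and 3 for a normal distribution (the Statistics Toolbox functions skewness and kurtosis compute the same quantities):

  x  = randn(1000, 1);                          % feature samples for one class (simulated)
  mu = mean(x);
  m  = @(i) mean((x - mu).^i);                  % i'th central moment: m_i = 1/N * sum_j (x_j - mu)^i

  skew = m(3) / m(2)^1.5;                       % ~0 for a normal distribution
  kurt = m(4) / m(2)^2;                         % ~3 for a normal distribution
  fprintf('skewness = %.2f, kurtosis = %.2f\n', skew, kurt);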

Is a feature normally distributed?

- 2) Skewness and Kurtosis
  - The methods are not used much any more, but can be seen in older reports/papers
  - BUT they do describe general aspects of a distribution AND can be used as features!

AAU signs billion-kroner deal with GE Healthcare (12 Oct 2005)
Aalborg University (AAU) has entered into a licensing and production agreement with GE Healthcare that is expected to generate revenues of between 0.5 and 1 billion DKK. The license concerns a new invention that makes it easier to detect the heart condition Long QT syndrome, which strikes millions of people worldwide every year. The measuring device was developed by a group of students from the Department of Health Science and Technology (Institut for Sundhedsteknologi), and the department will receive one third of the money from the agreement. AAU receives another third, while the three students and a number of teachers share the last third of the amount.

Multi-million deal for Aalborg University (21 Oct 2005)
Three newly graduated engineers from Aalborg University's biomedical engineering programme have patented a method for diagnosing a dangerous heart disease. One of the world's largest suppliers of hospital equipment, GE Healthcare, has signed a multi-million agreement to use the technology. The Minister of Science, Helge Sander, calls the agreement the largest ever between a university and a private company, writes Ingeniøren.

Methods where we DO use the class information

- Supervised
- Use info. about the classes and reduce the dimensionality
- Methods:
  - SEPCOR
  - Linear Discriminant Methods

SEPCOR

- Inspired by: Hierarchical dimensionality reduction
- A method to choose the X best (most discriminative) features
- Idea: combine Hierarchical dimensionality reduction with class info.
- SEPCOR = separability + correlation
- Principle:
  - Calc. a measure of how good (discriminative) each feature, xi, is w.r.t. classification:
    - Variability measure: V(xi)
  - Keep the most discriminative features that have a low correlation with the other features

SEPCOR – Variability measure

- V(xi) = (the variance of the class mean values on xi) / (the mean value of the class variances on xi)
- V(xi) large => good feature w.r.t. classification
  - That is: a large numerator and a small denominator
[Figure: two features, x1 with V ≈ 1 and x2 with V >> 1; V(x1) < V(x2) => x2 is the better feature]

SEPCOR – The algorithm

1. Make a list of the features, ordered by V-value
2. Repeat until we have the desired number of features or the list is empty:
   1. Remove and store the feature with the largest V-value
   2. Find the correlation between the removed feature and all the other features in the list
   3. Ignore all features with a correlation bigger than MAXCOR
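
A minimal MATLAB sketch of the selection loop (X is a hypothetical cell array with one N-by-d sample matrix per class; MAXCOR follows the slide, the other names are ours, and the correlations are computed over the pooled samples):

  X = {randn(100, 4) + 1, randn(100, 4) - 1};                    % hypothetical data: one matrix per class
  MAXCOR  = 0.7;                                                 % correlation threshold
  nWanted = 2;                                                   % desired number of features

  mus  = cell2mat(cellfun(@mean, X, 'UniformOutput', false)');   % class means,     one row per class
  vars = cell2mat(cellfun(@var,  X, 'UniformOutput', false)');   % class variances, one row per class
  V    = var(mus) ./ mean(vars);                                 % variability measure V(xi) per feature

  C = abs(corrcoef(cat(1, X{:})));                               % feature-feature correlations
  [~, list] = sort(V, 'descend');                                % 1) features ordered by V-value
  selected  = [];
  while ~isempty(list) && numel(selected) < nWanted              % 2) repeat until enough features / empty list
      best = list(1);  list(1) = [];                             % 2.1) remove the feature with the largest V
      selected(end+1) = best;                                    %      and store it
      list(C(best, list) > MAXCOR) = [];                         % 2.2)-2.3) drop features too correlated with it
  end
  disp(selected);                                                % indices of the chosen features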

Linear Discriminant Methods

- Transform the data to a new feature space
- Linear transform (rotation): y = Ax
- The transform is defined so that classification becomes as easy as possible =>
  - Info = discriminative power
- Fisher Linear Discriminant method
  - Map the data to one dimension
- Multiple Discriminant Analysis
  - Map the data to an M-dimensional space

Fisher Linear Discriminant

- Idea: Map the data onto a line, y
  - The orientation of the line is chosen so that the classes are as separated as possible
  - Transform: y = w^T x, where w is the direction of the line, y
  - PCA, in contrast, defines w as the 1st eigen-vector of the covariance matrix (shown as an exercise)

Fisher Linear Discriminant

- Example: 4 classes in 2D
[Figure: the 4 classes projected onto the PCA direction vs. the Fisher direction; the transformation is y = w^T x, and the task is to find w]

Fisher Linear Discriminant

- Transform: y = w^T x
- Find w so that the following criterion function is maximal:

  $J(\mathbf{w}) = \frac{\text{the variance of the means}}{\text{the mean of the variances}}$

  For two classes this becomes

  $J(\mathbf{w}) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}, \qquad \text{line direction} = \arg\max_{\mathbf{w}} J(\mathbf{w})$

  where $\tilde{m}_i$ is the mean of the i'th class, $D_i$, mapped onto w, and $\tilde{s}_i^2$ is the variance of the i'th class, $D_i$, mapped onto w.
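
For two classes the maximiser of J(w) has the well-known closed form w ∝ Sw^(-1) (m1 - m2), where Sw is the within-class scatter matrix. A minimal MATLAB sketch on simulated data (all names are ours):

  X1 = randn(100, 2) + [0 0];                   % hypothetical class 1 samples, one per row
  X2 = randn(100, 2) + [3 2];                   % hypothetical class 2 samples, one per row

  m1 = mean(X1)';   m2 = mean(X2)';             % class means as column vectors
  S1 = (X1 - m1')' * (X1 - m1');                % scatter of class 1 around its mean
  S2 = (X2 - m2')' * (X2 - m2');                % scatter of class 2 around its mean
  Sw = S1 + S2;                                 % within-class scatter matrix

  w = Sw \ (m1 - m2);                           % direction maximising J(w)
  w = w / norm(w);

  y1 = X1 * w;   y2 = X2 * w;                   % samples mapped onto the line y = w' * x
  J  = (mean(y1) - mean(y2))^2 / (var(y1) + var(y2));
  fprintf('J(w) = %.2f\n', J);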

Multiple Discriminant Analysis

- Generalized Fisher Linear Discriminant method
  - N classes
  - Mapped into an M-dimensional space (M < N)
  - E.g. 3 points will span a plane
- Example with 3 classes in 3D mapped into two different sub-spaces
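
A minimal MATLAB sketch of the generalisation: the projection directions are taken as eigen-vectors of the generalized eigen-problem Sb·w = λ·Sw·w (solved here with eig(Sb, Sw, 'chol')); three simulated classes in 3D are mapped to an M = 2 dimensional sub-space (all names are ours):

  X  = {randn(80, 3), randn(80, 3) + [4 0 1], randn(80, 3) + [0 4 2]};   % 3 hypothetical classes in 3D
  mu = mean(cat(1, X{:}));                          % overall mean

  Sw = zeros(3);  Sb = zeros(3);
  for c = 1:numel(X)
      mc = mean(X{c});                              % class mean
      Sw = Sw + (X{c} - mc)' * (X{c} - mc);         % within-class scatter
      Sb = Sb + size(X{c}, 1) * (mc - mu)' * (mc - mu);   % between-class scatter
  end

  [W, D] = eig(Sb, Sw, 'chol');                     % generalized eigen-problem Sb*w = lambda*Sw*w
  [~, order] = sort(diag(D), 'descend');
  W = W(:, order(1:2));                             % keep the M = 2 most discriminative directions

  Y = cat(1, X{:}) * W;                             % all samples mapped into the 2-D sub-space
  scatter(Y(:, 1), Y(:, 2), 15, repelem(1:3, 80)'); % colour the projected samples by class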

What to remember

- Feature reduction where we use the class info.
  - Discriminative power = information
- SEPCOR (ignore some of the features)
  - Hierarchical dimensionality reduction (correlation) + Variability measure:
    - The variance of the means / the mean of the variances
- Linear Discriminant Methods (make new features and ignore some)
  - Fisher Linear Discriminant (map onto a line)
    - Transform: y = w^T x, w is the direction of the line, y
    - Variability measure
  - Multiple Discriminant Analysis
    - Generalized Fisher Linear Discriminant method
      - N classes
      - Map the data into an M-dimensional space (M < N)