
Gaussian Mixture PDF Approximation and Fuzzy c-Means
Clustering with Entropy Regularization
Hidetomo ICHIHASHI, Katsuhiro HONDA and Naoki TANI
College of Engineering
Osaka Prefecture University
[email protected]
(presented at AFSS’2000 and revised on June 1)
Abstract
The EM algorithm is a popular density estimation method that uses the likelihood function as the measure of fit. Fuzzy c-Means (FCM) is likewise a popular clustering algorithm among the distance-based objective function methods. This paper discusses the similarity between the Gaussian mixture model fitted by the EM algorithm and FCM based on the Mahalanobis distance with entropy regularization. We show that the same algorithm as the EM algorithm can be derived from a modified FCM with regularization by K-L information, and we then apply it to Fuzzy c-Varieties (FCV), a clustering algorithm that obtains linear or ellipsoidal clusters.
1 Introduction
Data mining is the search for relationships and global patterns that exist in large databases but are hidden among the vast amounts of data. One important topic in data mining research is the discovery of association rules, which describe interesting association relationships among different attributes. Besides these new techniques, conventional multivariate analysis techniques have often been used for this purpose. In order to find local relationships and local patterns, the local principal components obtained by cluster analysis or by local probability density estimation can be considered a useful tool for data mining.
In density estimation, the likelihood function gives a measure of how well the probability density function (PDF) defined by the parameters fits the given data set. If a parameter set maximizes the likelihood, then these parameters are considered to define the best fitting PDF for the data set. A popular method for finding the values of the parameters is the EM algorithm. In the Gaussian mixture model, the PDF is approximated by a mixture of Gaussian PDFs. The EM algorithm alternately estimates the group membership of the data points using a previous estimate of the parameters of the model, and then updates this estimate of the parameters using the estimate of the group membership of the data points. This Gaussian mixture model [9] is sometimes referred to as the probabilistic neural network. The obtained Gaussian PDFs partition the given data set. A more direct approach to partitioning the set is clustering.
Clustering is a technique to partition a set of samples into exactly c disjoint subsets, such that samples in the same cluster are more similar to one another than to samples in other clusters. One way to make this problem well defined is to define a criterion function that measures the clustering quality of any partition of the data.
The Fuzzy c-Means [1] clustering algorithm is a popular member of the distance-based objective function methods (i.e., methods based on the within-group sum-of-squared-error (WGSS) criterion). Other approaches are driven by optimization of a generalized fuzzy c-prototypes functional defined by a measure of similarity (or dissimilarity) between each datum and a cluster center (i.e., prototype). In Bezdek et al. [2, 3] the fitting prototypes are straight lines and the measure is the orthogonal distance, or, more generally, the prototypes are convex combinations of points and lines.
The Mahalanobis distance between vectors x_1 and x_2, defined by the weighted inner product (x_1 - x_2)^T A (x_1 - x_2), is an important tool for pattern classification. Gustafson and Kessel [5] proposed a modified fuzzy c-means algorithm based on the Mahalanobis distance, which appears more natural through the use of a fuzzy covariance matrix. The local variation of the norm may identify clusters of various geometric shapes, including linear clusters.
The original FCM clustering algorithm is an extension of Hard c-Means [4], and real membership values in the unit interval [0,1] are obtained by the Lagrangian multiplier method. Singularity in the clustering, i.e., the case where a proper partition is not obtained by the Lagrangian multiplier method, is considered by Miyamoto et al. [8]. They introduced an entropy term to obtain fuzzy clusters. This approach is referred to as entropy regularization.
This paper discusses the similarity between the Gaussian mixture model fitted by the EM algorithm and FCM based on the Mahalanobis distance [5] with entropy regularization. We show that the same algorithm as the EM algorithm can be derived from a modified FCM with regularization by the Kullback-Leibler divergence (K-L information). We then apply it to Fuzzy c-Varieties (FCV) [2, 3], a clustering algorithm that obtains linear or ellipsoidal clusters, for developing a data mining tool with a local principal component analysis technique.
2 Gaussian Mixture Model and Entropy Regularization
The Gaussian mixture model is well recognized as
a statistical technique for density estimation, and the
EM algorithm is used to fit a fixed number of Gaussians
to a data set. The Gaussian mixture model regards
the data set as a collection of Gaussian distributions
whereas the FCM clustering regards it as a collection
of clusters.
The likelihood function is a function of the model
parameters and it gives a measure of how well the PDF
defined by the parameters fits the given data set. The
best fitting PDF for the data set will be defined by
a parameter set that maximizes the likelihood. There
is thus a need to find an estimate of these maximum
likelihood parameters. The EM algorithm is a well
known method for this purpose. The EM algorithm alternately estimates the group membership of the data
points using a previous estimate of the parameters of
the model, and then updates this estimate of the parameters using the estimate of the group membership
of the data points.
In the Gaussian mixture model, the PDF g(x) is approximated by a mixture of PDFs \pi_i p_i(x), that is,

g(x) = \sum_{i=1}^{c} \pi_i p_i(x)    (1)

p_i(x) = \frac{1}{(2\pi)^{n/2} |A_i|^{1/2}} \exp\left( -\frac{1}{2} (x - v_i)^T A_i^{-1} (x - v_i) \right)    (2)

The unknown parameters of the matrix A_i, the mean v_i, and the proportion \pi_i, which represents the contribution of the ith Gaussian PDF, are estimated by the maximum likelihood approach.
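To make Eqs.(1)-(2) concrete, the following is a minimal NumPy sketch that evaluates a mixture density g(x); the function and variable names (gaussian_pdf, mixture_pdf, weights, means, covs) and the toy parameter values are our own illustrative choices, not notation or data from the paper.

import numpy as np

def gaussian_pdf(x, mean, cov):
    # multivariate Gaussian p_i(x) of Eq.(2)
    d = x.shape[-1]
    diff = x - mean
    expo = -0.5 * np.einsum('...i,ij,...j->...', diff, np.linalg.inv(cov), diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(expo) / norm

def mixture_pdf(x, weights, means, covs):
    # g(x) = sum_i pi_i p_i(x) of Eq.(1)
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))

# toy example: two 2-D components
weights = np.array([0.4, 0.6])
means = [np.zeros(2), np.array([3.0, 1.0])]
covs = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]
print(mixture_pdf(np.array([1.0, 0.5]), weights, means, covs))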
By applying the EM algorithm, the log-likelihood

Q = \log \prod_{k=1}^{n} \sum_{i=1}^{c} \pi_i p_i(x_k)    (3)

is maximized. The E-step of the EM algorithm is defined to be

u_{ik} = \frac{ \pi_i \exp\left( -\frac{1}{2} (x_k - v_i)^T A_i^{-1} (x_k - v_i) \right) |A_i|^{-1/2} }{ \sum_{j=1}^{c} \pi_j \exp\left( -\frac{1}{2} (x_k - v_j)^T A_j^{-1} (x_k - v_j) \right) |A_j|^{-1/2} }    (4)

The M-step of the EM method maximizes the log-likelihood function Q. This maximization problem is solved by differentiating Q with respect to each parameter and setting the resulting partial derivatives equal to zero; the Lagrange multiplier method can be used if parameter constraints are necessary. The M-step updates are

A_i = \frac{1}{\sum_{k=1}^{n} u_{ik}} \sum_{k=1}^{n} u_{ik} (x_k - v_i)(x_k - v_i)^T    (5)

\pi_i = \frac{1}{n} \sum_{k=1}^{n} u_{ik}    (6)

v_i = \frac{\sum_{k=1}^{n} u_{ik} x_k}{\sum_{k=1}^{n} u_{ik}}    (7)
The algorithm is the repetition of the E-step and the M-step. The number of iterations required for the algorithm to converge increases with the number of Gaussian functions in the model. The algorithm must check that the covariance matrices are non-singular, and hence invertible, in order to confirm convergence to a maximum in the interior of the parameter space. If a covariance matrix has an entry on the diagonal which is approaching zero, then that trial of the EM algorithm is restarted with a different initialization. Since global optimality is not guaranteed, to be confident that the resulting parameters are at a global maximum of the likelihood function it is desirable to run the EM algorithm a number of times using different initializations.
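As a rough illustration of the E-step of Eq.(4) and the M-step of Eqs.(5)-(7), the following is a minimal sketch of the EM iteration for a Gaussian mixture, assuming NumPy; the initialization, the fixed iteration count, and the small ridge added to the initial covariances are our own choices and are not prescribed by the paper.

import numpy as np

def em_gmm(X, c, n_iter=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(c, 1.0 / c)                          # mixing proportions pi_i
    v = X[rng.choice(n, c, replace=False)]            # means v_i
    A = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(c)])  # covariances A_i
    for _ in range(n_iter):
        # E-step: memberships u_ik, Eq.(4) (the common (2*pi)^(d/2) factor cancels)
        u = np.empty((c, n))
        for i in range(c):
            diff = X - v[i]
            expo = -0.5 * np.einsum('kd,de,ke->k', diff, np.linalg.inv(A[i]), diff)
            u[i] = pi[i] * np.exp(expo) / np.sqrt(np.linalg.det(A[i]))
        u /= u.sum(axis=0, keepdims=True)
        # M-step: Eqs.(5)-(7)
        for i in range(c):
            w = u[i]
            pi[i] = w.mean()                          # Eq.(6)
            v[i] = w @ X / w.sum()                    # Eq.(7)
            diff = X - v[i]
            A[i] = (w[:, None] * diff).T @ diff / w.sum()  # Eq.(5)
    return pi, v, A, u

# usage: pi, v, A, u = em_gmm(np.random.randn(200, 2), c=2)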
Though the scope of application of the EM algorithm is not limited to mixture modeling problems, we confine ourselves to Gaussian mixture models here. Examples of other problems where the EM algorithm is applicable are given in [7, 6].
The fuzzy c-means clustering algorithm by J. C. Bezdek [1] partitions the data set by introducing memberships to fuzzy clusters. The p-dimensional vector v_i denotes the prototype parameter (i.e., cluster center), which is used instead of the mean of the Gaussian distribution, and u_ik denotes the membership of the kth datum in the ith cluster. Each feature vector consists of s real-valued measurements describing the features of the object represented by x. The clustering criterion used to define good clusters for fuzzy c-means partitions is the FCM objective function:
J_m = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^m d_{ik}    (8)

where m is the weighting exponent on each fuzzy membership; the larger m is, the fuzzier the partition. The nonnegative memberships u_ik sum to one over the c clusters for each object.
d_{ik} = (x - v_i)^T A_i^{-1} (x - v_i)    (9)

is a measure of the distance from x to the ith cluster prototype. The Euclidean distance metric is used when A is a diagonal matrix. In the modified FCM by D. E. Gustafson and W. C. Kessel [5], the matrices A_i are also decision variables.
The necessary conditions for the optimality of J_m with respect to u_ik are as follows. Let

I_k = \{ i \mid 1 \le i \le c ; \; d_{ik} = \| x_k - v_i \| = 0 \}, \qquad \tilde{I}_k = \{1, 2, \ldots, c\} - I_k .

When I_k = \emptyset,

u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik} / d_{jk} \right)^{1/(m-1)}}    (10)

When I_k \neq \emptyset,

u_{ik} = 0 \quad \forall i \in \tilde{I}_k \qquad \text{and} \qquad \sum_{i \in I_k} u_{ik} = 1    (11)

The prototype update is

v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m}    (12)

The optimal u_ik and v_i for all i and k are sought using an alternating optimization scheme of the type generally described in [1]. Similar to the EM algorithm, the scheme iterates through the necessary conditions of optimality, Eqs.(10)-(12).
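The following is a minimal sketch of this alternating optimization of Eqs.(10) and (12), assuming the squared Euclidean distance (A_i = I) for simplicity; the small epsilon used to sidestep the singular case of Eq.(11) is an implementation convenience, not part of the paper's treatment.

import numpy as np

def fcm(X, c, m=2.0, n_iter=100, eps=1e-12, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    v = X[rng.choice(n, c, replace=False)]            # cluster centers v_i
    for _ in range(n_iter):
        # squared distances d_ik, Eq.(9) with A_i = I; eps avoids division by zero
        dist = ((X[None, :, :] - v[:, None, :]) ** 2).sum(-1) + eps
        # membership update, Eq.(10): ratio[i, j, k] = (d_ik / d_jk)^(1/(m-1))
        ratio = (dist[:, None, :] / dist[None, :, :]) ** (1.0 / (m - 1.0))
        u = 1.0 / ratio.sum(axis=1)                   # shape (c, n)
        # prototype update, Eq.(12)
        um = u ** m
        v = um @ X / um.sum(axis=1, keepdims=True)
    return u, v

# usage: u, v = fcm(np.random.randn(300, 2), c=3)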
The original FCM clustering algorithm is an extension of Hard c-Means [4], and real membership values are introduced for fuzzy partitioning. There is one technical trick, given by Eq.(11); this condition is called singularity. The other singularity, namely the case where a proper partition is not obtained by the Lagrangian multiplier method, is considered by Miyamoto et al. [8]. They introduced an entropy term to obtain fuzzy clusters and call this approach entropy regularization. In their approach the trick in Eq.(11) is not needed; introducing a regularization term K and a positive parameter \lambda, the functional J_\lambda = J_1 + \lambda K is minimized:

J_\lambda = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} d_{ik} + \lambda \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} \log u_{ik}    (13)
With the Lagrangian multiplier method, the necessary condition with respect to u_ik is written as

u_{ik} = \frac{\exp\left( -\frac{1}{\lambda} d_{ik} \right)}{\sum_{j=1}^{c} \exp\left( -\frac{1}{\lambda} d_{jk} \right)}    (14)

and

v_i = \frac{\sum_{k=1}^{n} u_{ik} x_k}{\sum_{k=1}^{n} u_{ik}}    (15)
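A minimal sketch of the entropy-regularized updates of Eqs.(14)-(15) follows; the max-shift applied before exponentiation is only for numerical stability and is our own addition, as are the variable names.

import numpy as np

def entropy_fcm(X, c, lam=1.0, n_iter=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    v = X[rng.choice(n, c, replace=False)]
    for _ in range(n_iter):
        dist = ((X[None, :, :] - v[:, None, :]) ** 2).sum(-1)    # d_ik with A_i = I
        logits = -dist / lam                                      # Eq.(14) numerator, on the log scale
        logits -= logits.max(axis=0, keepdims=True)               # stability shift (cancels in the ratio)
        u = np.exp(logits)
        u /= u.sum(axis=0, keepdims=True)                         # Eq.(14)
        v = u @ X / u.sum(axis=1, keepdims=True)                  # Eq.(15)
    return u, v

# usage: u, v = entropy_fcm(np.random.randn(300, 2), c=3, lam=0.5)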
As is often the case with the Gaussian mixture model, algorithms that use calculus-based optimization methods, such as FCM, can be trapped by local extrema in the process of optimizing the clustering criterion. They are also sensitive to initialization.
3 FCM and FCV clustering with regularization by K-L information
In this section, we first show that the EM algorithm for the Gaussian mixture model can be derived from FCM clustering by introducing regularization with K-L information instead of entropy. Then we apply the K-L information regularization to FCV clustering, whose clustering results are expected to be similar to those of the EM algorithm for the Gaussian mixture model. FCV clustering is a modified FCM clustering method whose clusters form linear varieties, and it can be regarded as a method to find local principal components.
By replacing the entropy term in the objective function (13) of FCM clustering with a K-L information term and including constraint terms in a Lagrangian function, we consider the minimization of the following function:
J_{\lambda\tau} = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} d_{ik} + \lambda \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} \log \frac{u_{ik}}{\pi_i} + \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} \log |A_i|
    - \tau \Big( \sum_{i=1}^{c} \pi_i - 1 \Big) - \sum_{k=1}^{n} \eta_k \Big( \sum_{i=1}^{c} u_{ik} - 1 \Big)    (16)
where the sums, with respect to i, of the memberships u_ik and of the additional parameters \pi_i are each constrained to one. The entropy term in Eq.(13) is introduced to equalize memberships. In our technique, the second term of Eq.(16) represents the K-L information, which is minimized by equalizing u_ik with \pi_i for each i. All the memberships u_ik become close to \pi_i when the weight coefficient \lambda becomes large. In Gustafson and Kessel [5], each determinant |A_i| is associated with the volume of a cluster and is predetermined and fixed before the start of the algorithm, but in our approach it is optimized through the repetition of the algorithm, taking the membership values into account. In these fuzzy clustering methods, we do not aim to derive memberships based on statistical theories, but to derive a fuzzy partition by using the forms of entropy or K-L information that are well-known tools in statistics.
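For reference, the following small sketch evaluates the first three terms of Eq.(16) (the two constraint terms vanish when the memberships and the \pi_i each sum to one); the function name and array layout are our own assumptions, and such monitoring of the objective is not prescribed by the paper.

import numpy as np

def kl_fcm_objective(u, dist, pi, A, lam):
    # u: (c, n) memberships (assumed strictly positive), dist: (c, n) distances d_ik,
    # pi: (c,) cluster capacities, A: (c, d, d) fuzzy covariance matrices, lam: weight lambda
    fit = np.sum(u * dist)                                   # sum_i sum_k u_ik d_ik
    kl = lam * np.sum(u * np.log(u / pi[:, None]))           # lambda * K-L information term
    vol = np.sum(u.sum(axis=1) * np.log(np.linalg.det(A)))   # sum_i (sum_k u_ik) log|A_i|
    return fit + kl + vol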
The necessary condition for the optimality of the Lagrangian function Eq.(16),

\frac{\partial L}{\partial u_{ik}} = 0    (17)

yields

u_{ik} = \pi_i \exp\left( -\frac{1}{\lambda} d_{ik} \right) |A_i|^{-\frac{1}{\lambda}} \exp\left( \frac{\eta_k}{\lambda} - 1 \right)    (18)

and then the constraint

\sum_{i=1}^{c} u_{ik} = 1    (19)

yields

u_{ik} = \frac{ \pi_i \exp\left( -\frac{1}{\lambda} d_{ik} \right) |A_i|^{-\frac{1}{\lambda}} }{ \sum_{j=1}^{c} \pi_j \exp\left( -\frac{1}{\lambda} d_{jk} \right) |A_j|^{-\frac{1}{\lambda}} }    (20)

Furthermore,

\frac{\partial L}{\partial \pi_i} = 0    (21)

yields

\pi_i = \frac{-\lambda}{\tau} \sum_{k=1}^{n} u_{ik}    (22)

and from the constraint

\sum_{i=1}^{c} \pi_i = 1    (23)

we have

\pi_i = \frac{\sum_{k=1}^{n} u_{ik}}{\sum_{j=1}^{c} \sum_{k=1}^{n} u_{jk}} = \frac{1}{n} \sum_{k=1}^{n} u_{ik}    (24)

Likewise,

\frac{\partial L}{\partial A_i} = 0    (25)

yields

A_i = \frac{1}{\sum_{k=1}^{n} u_{ik}} \sum_{k=1}^{n} u_{ik} (x_k - v_i)(x_k - v_i)^T    (26)

and

\frac{\partial L}{\partial v_i} = 0    (27)

yields

v_i = \frac{\sum_{k=1}^{n} u_{ik} x_k}{\sum_{k=1}^{n} u_{ik}}    (28)
Therefore, the algorithm is a repetition through Eqs.(20), (24), (26) and (28). When the weight \lambda of the K-L information term equals two, this fuzzy clustering algorithm is equivalent to the EM algorithm for the Gaussian mixture model in Eqs.(4)-(7). Though these two algorithms have different objectives (density estimation or fuzzy clustering), it can be said that they are essentially quite similar.
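A minimal sketch of one sweep through Eqs.(20), (24), (26) and (28) is given below; with lam = 2 it reduces to the EM updates of Eqs.(4)-(7). The initialization, the fixed iteration count, and the small ridge added to A_i to ward off the singularity discussed next are our own choices, not part of the paper's formulation.

import numpy as np

def kl_fcm(X, c, lam=2.0, n_iter=100, ridge=1e-6, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    v = X[rng.choice(n, c, replace=False)]
    pi = np.full(c, 1.0 / c)
    A = np.array([np.cov(X.T) + ridge * np.eye(d) for _ in range(c)])
    for _ in range(n_iter):
        # membership update, Eq.(20)
        u = np.empty((c, n))
        for i in range(c):
            diff = X - v[i]
            dist = np.einsum('kd,de,ke->k', diff, np.linalg.inv(A[i]), diff)
            u[i] = pi[i] * np.exp(-dist / lam) * np.linalg.det(A[i]) ** (-1.0 / lam)
        u /= u.sum(axis=0, keepdims=True)
        # capacities, means and covariances, Eqs.(24), (28), (26)
        pi = u.mean(axis=1)
        for i in range(c):
            w = u[i]
            v[i] = w @ X / w.sum()
            diff = X - v[i]
            A[i] = (w[:, None] * diff).T @ diff / w.sum() + ridge * np.eye(d)
    return u, v, pi, A

# usage: u, v, pi, A = kl_fcm(np.random.randn(400, 2), c=3, lam=2.0)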
The membership function Eq.(20) differs from the one in FCM with entropy regularization in that Eq.(20) contains \pi_i in both the numerator and the denominator. As shown in Eq.(24), \pi_i represents the capacity of cluster i, so the K-L information regularization also attempts to optimize the clusters with respect to their capacities. When we adopt the matrices A_i as unknown parameters in our algorithm, this may cause singularity, as in the EM algorithm with Gaussian mixture models.
The Gaussian mixture models and the fuzzy clustering methods described above both derive partitions with ellipsoidal clusters, using probability density estimation or local Mahalanobis distances. FCV [2, 3] is also a clustering method that can derive ellipsoidal clusters, in which the prototypes are multi-dimensional linear varieties represented by local principal component vectors.
In this section, we introduce the K-L information regularization into FCV and compare it with the Gaussian mixture models. Assume that the prototypes are r-dimensional linear varieties V_r spanned by r linearly independent vectors p_j:

V_r = \left\{ z \mid z = v + \sum_{j=1}^{r} t_j p_j ; \; t_j \in R^1 \right\} = v + \mathrm{span}(\{p_j\})    (29)
For instance, in a 3-dimensional space, a plane through the point v is used as a prototype instead of a cluster center. The objective function is obtained by replacing the distance d_ik in Eq.(16) with the orthogonal distance from a data point to the ith prototypical linear variety. The orthogonal distance d_ik is as follows:

d_{ik} = \| x_k - v_i \|^2 - \sum_{j=1}^{r} \left( (x_k - v_i)^T p_{ij} \right)^2
       = (x_k - v_i)^T (x_k - v_i) - \sum_{j=1}^{r} p_{ij}^T (x_k - v_i)(x_k - v_i)^T p_{ij}    (30)

where p_ij, j = 1, ..., r, are linearly independent unit vectors. B_i = \sum_{k=1}^{n} (u_{ik})^m (x_k - v_i)(x_k - v_i)^T is a fuzzy scatter matrix. Under the constraint that p^T p = 1, the vector p_ij is derived as the jth unit eigenvector of the fuzzy scatter matrix B_i, corresponding to its jth largest eigenvalue. The membership u_ik is derived from Eq.(20) of FCM by replacing the distance d_ik with Eq.(30), and v_i is likewise derived from Eq.(28). The FCV algorithm is thus a repetition of calculating the eigenvectors of the fuzzy scatter matrices B_i and the updates in Eqs.(20) and (28). Fuzzy c-Elliptotypes (FCE) [3] uses a convex combination of the squared distance from a data point to the cluster center v_i and the squared distance from the point to the linear prototype. FCE is similar to the method proposed by D. E. Gustafson and W. C. Kessel in the sense that it derives ellipsoidal clusters.
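As a sketch (under our own naming conventions), the following computes the FCV quantities used above: the fuzzy scatter matrix B_i, its r leading unit eigenvectors p_ij, and the orthogonal distance of Eq.(30) to the linear variety of Eq.(29); the weighted center and the eigen-decomposition routine are standard NumPy choices, not code from the paper.

import numpy as np

def fcv_prototype(X, u_i, r, m=1.0):
    # return (v_i, P_i): the center and the r leading unit eigenvectors of the
    # fuzzy scatter matrix B_i; u_i is the (n,) membership vector of cluster i
    w = u_i ** m
    v_i = w @ X / w.sum()
    diff = X - v_i
    B_i = (w[:, None] * diff).T @ diff                 # fuzzy scatter matrix B_i
    eigval, eigvec = np.linalg.eigh(B_i)               # ascending eigenvalues
    P_i = eigvec[:, ::-1][:, :r]                       # columns = p_i1, ..., p_ir
    return v_i, P_i

def fcv_distance(X, v_i, P_i):
    # orthogonal distance d_ik of Eq.(30) from each x_k to the linear variety
    diff = X - v_i
    proj = diff @ P_i                                  # (x_k - v_i)^T p_ij
    return (diff ** 2).sum(axis=1) - (proj ** 2).sum(axis=1)

# usage with a crisp cluster (all memberships equal to one):
# v_i, P_i = fcv_prototype(X, np.ones(len(X)), r=1); d = fcv_distance(X, v_i, P_i)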
4 Conclusion
We showed that the same algorithm as the EM algorithm can be derived from a modified FCM with regularization by K-L information, and we applied it to FCV, a clustering algorithm that obtains linear or ellipsoidal clusters, for developing a data mining tool with a local principal component analysis technique. Density estimation with the Gaussian mixture model can be considered a probabilistic approach to clustering which is strictly confined to Gaussian distributions. On the other hand, fuzzy c-means clustering with regularization by K-L information can be seen as a non-Gaussian version of density estimation.
Our future work includes improving hard clustering results by applying an annealing technique to the parameter \lambda, and taking various distributions, not confined only to Gaussian ones, into consideration.
References

[1] J. C. Bezdek: Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press (1981).

[2] J. C. Bezdek, C. Coray, R. Gunderson and J. Watson: Detection and characterization of cluster substructure. I. Linear structure, fuzzy c-lines, SIAM J. Appl. Math., 40, 2, 339-357 (1981).

[3] J. C. Bezdek, C. Coray, R. Gunderson and J. Watson: Detection and characterization of cluster substructure. II. Fuzzy c-varieties and convex combinations thereof, SIAM J. Appl. Math., 40, 2, 358-372 (1981).

[4] R. O. Duda and P. E. Hart: Pattern Classification and Scene Analysis, Wiley, New York (1973).

[5] D. E. Gustafson and W. C. Kessel: Fuzzy clustering with a fuzzy covariance matrix, Proc. IEEE-CDC, 2, 761-766 (1979).

[6] A. P. Dempster, N. M. Laird and D. B. Rubin: Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc., Ser. B, 39, 1, 1-38 (1977).

[7] G. J. McLachlan and T. Krishnan: The EM Algorithm and Extensions, John Wiley and Sons (1997).

[8] S. Miyamoto and M. Mukaidono: Fuzzy c-means as a regularization and maximum entropy approach, Proc. of the 7th International Fuzzy Systems Association World Congress (IFSA'97), II, 86-92 (1997).

[9] R. L. Streit and T. E. Luginbuhl: Maximum likelihood training of probabilistic neural networks, IEEE Transactions on Neural Networks, 5, 5, 764-783 (1994).