Gaussian Mixture PDF Approximation and Fuzzy c-Means Clustering with Entropy Regularization

Hidetomo ICHIHASHI, Katsuhiro HONDA and Naoki TANI
College of Engineering, Osaka Prefecture University
[email protected]
(presented at AFSS'2000 and revised on June 1)

Abstract

The EM algorithm is a popular density estimation method that uses the likelihood function as the measure of fit. Fuzzy c-Means (FCM) is also a popular clustering algorithm among the distance-based objective function methods. This paper discusses the similarity between the Gaussian mixture model fitted by the EM algorithm and FCM based on the Mahalanobis distance with entropy regularization. We show that the same algorithm as the EM algorithm can be derived from a modified FCM with regularization by K-L information, and we then apply it to Fuzzy c-Varieties (FCV), a clustering algorithm that obtains linear or ellipsoidal clusters.

1 Introduction

Data mining is the search for relationships and global patterns that exist in large databases but are hidden among the vast amounts of data. One important topic in data mining research is the discovery of association rules, which describe interesting association relationships among different attributes. Besides these newer techniques, conventional multivariate analysis techniques have often been used for this purpose. In order to find local relationships and local patterns, the local principal components obtained by cluster analysis, or local probability density estimation, can be considered a useful tool for data mining.

In density estimation, the likelihood function gives a measure of how well the probability density function (PDF) defined by the parameters fits the given data set. If a parameter set maximizes the likelihood, then these parameters are considered to define the best fitting PDF for the data set. A popular method for finding the values of the parameters is the EM algorithm. In the Gaussian mixture model, the PDF is approximated by a mixture of Gaussian PDFs. The EM algorithm alternately estimates the group memberships of the data points using a previous estimate of the model parameters, and then updates the parameter estimate using the estimated group memberships. This Gaussian mixture model [9] is sometimes referred to as the probabilistic neural network. The obtained Gaussian PDFs partition the given data set.

A more direct approach to partitioning the set is clustering. Clustering is a technique that partitions a set of samples into exactly c disjoint subsets, such that samples in the same cluster are in some sense more similar to one another than to samples in other clusters. One way to make this problem well defined is to specify a criterion function that measures the clustering quality of any partition of the data. The Fuzzy c-Means [1] clustering algorithm is a popular one among the distance-based objective function methods (i.e., those based on the within-group sum-of-squared-error (WGSS) criterion). Other approaches are driven by optimization of a generalized fuzzy c-prototypes functional defined by a measure of similarity (or dissimilarity) between a datum and a cluster prototype. In Bezdek et al. [2, 3] the fitting prototypes are either straight lines, with the orthogonal distance as the measure, or, more generally, prototypes that are convex combinations of points and lines. The Mahalanobis distance between $x_1, x_2 \in \mathbb{R}^p$, defined by the weighted inner product $(x_1 - x_2)^T A (x_1 - x_2)$, is an important tool for pattern classification.
Gustafson and Kessel [5] proposed a modified fuzzy c-means algorithm based on the Mahalanobis distance, which appears more natural through the use of a fuzzy covariance matrix. The local variation of the norm may identify clusters of various geometric shapes, including linear clusters.

The original FCM clustering algorithm is an extension of Hard c-Means [4], and real membership values in the unit interval [0,1] are obtained by the Lagrangian multiplier method. Singularity in the clustering, i.e., the case where a proper partition is not obtained by the Lagrangian multiplier method, is considered by Miyamoto et al. [8]. They introduced an entropy term to obtain fuzzy clusters. This approach is referred to as entropy regularization.

This paper discusses the similarity between the Gaussian mixture model fitted by the EM algorithm and FCM based on the Mahalanobis distance [5] with entropy regularization. We show that the same algorithm as the EM algorithm can be derived from a modified FCM with regularization by the Kullback-Leibler divergence (K-L information). We then apply it to Fuzzy c-Varieties (FCV) [2, 3], a clustering algorithm that obtains linear or ellipsoidal clusters, for developing a data-mining tool based on local principal component analysis.

2 Gaussian Mixture Model and Entropy Regularization

The Gaussian mixture model is well recognized as a statistical technique for density estimation, and the EM algorithm is used to fit a fixed number of Gaussians to a data set. The Gaussian mixture model regards the data set as a collection of Gaussian distributions, whereas FCM clustering regards it as a collection of clusters. The likelihood function is a function of the model parameters and gives a measure of how well the PDF defined by the parameters fits the given data set. The best fitting PDF for the data set is defined by a parameter set that maximizes the likelihood, so an estimate of these maximum likelihood parameters is needed. The EM algorithm is a well known method for this purpose. It alternately estimates the group memberships of the data points using a previous estimate of the model parameters, and then updates the parameter estimate using the estimated group memberships.

In the Gaussian mixture model, the PDF $g(x)$ is approximated by a mixture of the PDFs $\pi_i p_i(x)$, that is,

$$ g(x) = \sum_{i=1}^{c} \pi_i p_i(x) \qquad (1) $$

$$ p_i(x) = \frac{1}{(2\pi)^{p/2} |A_i|^{1/2}} \exp\left( -\frac{1}{2} (x - v_i)^T A_i^{-1} (x - v_i) \right) \qquad (2) $$

The unknown parameters, namely the covariance matrix $A_i$, the mean $v_i$, and the proportion $\pi_i$ which represents the contribution of the $i$th Gaussian PDF, are estimated by the maximum likelihood approach. By applying the EM algorithm, the log-likelihood

$$ Q = \log \prod_{k=1}^{n} \sum_{i=1}^{c} \pi_i p_i(x_k) \qquad (3) $$

is maximized. The E-step of the EM algorithm is defined to be

$$ u_{ik} = \frac{\pi_i |A_i|^{-1/2} \exp\left( -\frac{1}{2} (x_k - v_i)^T A_i^{-1} (x_k - v_i) \right)}{\sum_{j=1}^{c} \pi_j |A_j|^{-1/2} \exp\left( -\frac{1}{2} (x_k - v_j)^T A_j^{-1} (x_k - v_j) \right)} \qquad (4) $$

The M-step of the EM method maximizes the log-likelihood function $Q$. This maximization problem is solved by differentiating $Q$ with respect to each parameter and setting the resulting partial derivatives equal to zero; the Lagrange multiplier method can be used when parameter constraints are necessary.

$$ \pi_i = \frac{1}{n} \sum_{k=1}^{n} u_{ik} \qquad (5) $$

$$ A_i = \frac{\sum_{k=1}^{n} u_{ik} (x_k - v_i)(x_k - v_i)^T}{\sum_{k=1}^{n} u_{ik}} \qquad (6) $$

$$ v_i = \frac{\sum_{k=1}^{n} u_{ik} x_k}{\sum_{k=1}^{n} u_{ik}} \qquad (7) $$

The algorithm is the repetition of the E-step and the M-step.
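To illustrate the alternation between the E-step (4) and the M-step (5)-(7), the following is a minimal NumPy sketch of EM for a Gaussian mixture. It is not part of the original paper; the function name, the fixed iteration count, and the random initialization are our own assumptions.

```python
import numpy as np

def em_gaussian_mixture(X, c, n_iter=100, seed=0):
    """Minimal sketch of EM for a c-component Gaussian mixture (Eqs. (1)-(7)).
    X: (n, p) data matrix. Returns mixing weights pi, means v, covariances A."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Initialize means at random data points, identity covariances, equal weights.
    v = X[rng.choice(n, size=c, replace=False)]
    A = np.array([np.eye(p) for _ in range(c)])
    pi = np.full(c, 1.0 / c)
    for _ in range(n_iter):
        # E-step (Eq. (4)): membership u[i, k] of point k in component i.
        # The constant (2*pi)^(p/2) cancels in the normalization and is omitted.
        u = np.empty((c, n))
        for i in range(c):
            diff = X - v[i]
            maha = np.einsum('kj,jl,kl->k', diff, np.linalg.inv(A[i]), diff)
            u[i] = pi[i] * np.exp(-0.5 * maha) / np.sqrt(np.linalg.det(A[i]))
        u /= u.sum(axis=0, keepdims=True)
        # M-step (Eqs. (5)-(7)): update weights, means, and covariances.
        for i in range(c):
            w = u[i]
            pi[i] = w.mean()
            v[i] = (w[:, None] * X).sum(axis=0) / w.sum()
            diff = X - v[i]
            A[i] = (w[:, None, None]
                    * np.einsum('kj,kl->kjl', diff, diff)).sum(axis=0) / w.sum()
    return pi, v, A
```

As the text below notes, a practical run would also monitor the covariance matrices for near-singularity and restart from a different initialization when necessary.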
The number of iterations required for the algorithm to converge increases with the number of Gaussian functions in the model. The algorithm must check that the covariance matrices are non-singular, and hence invertible, in order to confirm convergence to a maximum in the interior of the parameter space. If a covariance matrix has an entry on the diagonal that approaches zero, then that trial of the EM algorithm is restarted with a different initialization. Since global optimality is not guaranteed, to be confident that the resulting parameters are at a global maximum of the likelihood function it is desirable to run the EM algorithm a number of times with different initializations. Although the scope of application of the EM algorithm is not limited to mixture modeling problems, we confine ourselves to Gaussian mixture models here. Examples of other problems where the EM algorithm is applicable are given in [7, 6].

The fuzzy c-means clustering algorithm by J.C. Bezdek [1] partitions the data set by introducing memberships to fuzzy clusters. The $p$-dimensional vector $v_i$ denotes the prototype parameter (i.e., cluster center), which is used instead of the mean of the Gaussian distribution, and $u_{ik}$ denotes the membership of the $k$th data point in the $i$th cluster. Each feature vector consists of $p$ real-valued measurements describing the features of the object represented by $x$. The clustering criterion used to define good clusters for fuzzy c-means partitions is the FCM function:

$$ J_m = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^m d_{ik} \qquad (8) $$

where $m$ is the weighting exponent on each fuzzy membership; the larger $m$ is, the fuzzier the partition. The nonnegative memberships $u_{ik}$ sum to one over the $c$ clusters for each object.

$$ d_{ik} = (x_k - v_i)^T A_i^{-1} (x_k - v_i) \qquad (9) $$

is a measure of the distance from $x_k$ to the $i$th cluster prototype. The Euclidean distance metric is obtained when $A_i$ is the identity matrix. In the modified FCM by D.E. Gustafson and W.C. Kessel [5], the matrices $A_i$ are also decision variables.

The necessary conditions for the optimality of $J_m$ with respect to $u_{ik}$ are, when $I_k = \phi$,

$$ u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik} / d_{jk} \right)^{1/(m-1)}} \qquad (10) $$

where $I_k = \{ i \mid 1 \le i \le c;\ d_{ik} = \|x_k - v_i\|^2 = 0 \}$ and $\tilde{I}_k = \{1, 2, \dots, c\} - I_k$. When $I_k \ne \phi$,

$$ u_{ik} = 0 \ \ \forall i \in \tilde{I}_k \quad \text{and} \quad \sum_{i \in I_k} u_{ik} = 1 \qquad (11) $$

$$ v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m} \qquad (12) $$

The optimal $u_{ik}$ and $v_i$ for all $i$ and $k$ are sought by an alternating optimization scheme of the type generally described in [1]. Similar to the EM algorithm, the scheme iterates through the necessary conditions of optimality, Eqs. (10)-(12).

The original FCM clustering algorithm is an extension of Hard c-Means [4], and real membership values are introduced for fuzzy partitioning. There is one technical trick given by Eq. (11); this condition is called singularity. The other singularity is considered by Miyamoto et al. [8], namely the case where a proper partition is not obtained by the Lagrangian multiplier method. They introduced an entropy term to obtain fuzzy clusters, and they call it entropy regularization. In their approach the trick in Eq. (11) is not needed: by introducing a regularization term $K$ and a positive parameter $\lambda$, $J_\lambda = J_1 + \lambda K$ is minimized,

$$ J_\lambda = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} d_{ik} + \lambda \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} \log u_{ik} \qquad (13) $$

With the Lagrangian multiplier method, the necessary conditions with respect to $u_{ik}$ and $v_i$ are written as

$$ u_{ik} = \frac{\exp\left( -\frac{1}{\lambda} d_{ik} \right)}{\sum_{j=1}^{c} \exp\left( -\frac{1}{\lambda} d_{jk} \right)} \qquad (14) $$

and

$$ v_i = \frac{\sum_{k=1}^{n} u_{ik} x_k}{\sum_{k=1}^{n} u_{ik}} \qquad (15) $$

As is often the case, as with the Gaussian mixture model, an algorithm that uses a calculus-based optimization method such as FCM can be trapped by local extrema in the process of optimizing the clustering criterion, and is also sensitive to initialization.
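The following is a minimal NumPy sketch of the alternating updates (14) and (15) for entropy-regularized FCM, here specialized to the Euclidean distance. The function name, iteration count, and initialization are our own assumptions for illustration, not part of the original paper.

```python
import numpy as np

def entropy_regularized_fcm(X, c, lam=1.0, n_iter=100, seed=0):
    """Sketch of entropy-regularized FCM (Eqs. (13)-(15)) with Euclidean d_ik."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    v = X[rng.choice(n, size=c, replace=False)]  # initial cluster centers
    for _ in range(n_iter):
        # Squared Euclidean distances d_ik from every center to every point.
        d = ((X[None, :, :] - v[:, None, :]) ** 2).sum(axis=2)  # shape (c, n)
        # Membership update (Eq. (14)): softmax of -d_ik / lambda over clusters.
        e = np.exp(-d / lam)
        u = e / e.sum(axis=0, keepdims=True)
        # Center update (Eq. (15)): membership-weighted means.
        v = (u @ X) / u.sum(axis=1, keepdims=True)
    return u, v
```

A larger λ flattens the exponentials in Eq. (14) and therefore yields a fuzzier partition, in the same way that a larger m does in the standard FCM function (8).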
3 FCM and FCV Clustering with Regularization by K-L Information

In this section we first show that the EM algorithm for the Gaussian mixture model can be derived from FCM clustering by introducing regularization with K-L information instead of entropy. Then we apply the K-L information regularization to FCV clustering, whose clustering results are expected to be similar to those of the EM algorithm for the Gaussian mixture model. FCV clustering is a modified FCM clustering method whose clusters form linear varieties, and it can be regarded as a method for finding local principal components.

By replacing the entropy term in the objective function (13) of FCM clustering with a K-L information term and including the constraint terms in a Lagrangian function, we consider the minimization of the following function:

$$ J_{\lambda\tau} = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} d_{ik} + \lambda \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} \log \frac{u_{ik}}{\pi_i} + \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} \log |A_i| - \tau \left( \sum_{i=1}^{c} \pi_i - 1 \right) - \sum_{k=1}^{n} \eta_k \left( \sum_{i=1}^{c} u_{ik} - 1 \right) \qquad (16) $$

where the sums, with respect to $i$, of the memberships $u_{ik}$ and of the additional parameters $\pi_i$ are each constrained to one. The entropy term in Eq. (13) is introduced to equalize memberships. In our technique the second term of Eq. (16) represents K-L information, which is minimized by equalizing $u_{ik}$ with $\pi_i$ for each $i$; all the memberships $u_{ik}$ become close to $\pi_i$ when the weight coefficient $\lambda$ becomes large. In Gustafson and Kessel [5], each determinant $|A_i|$ is associated with the volume of a cluster and is predetermined and fixed before the start of the algorithm, but in our approach it is minimized through the repetition of the algorithm, taking membership values into account. In these fuzzy clustering methods we do not aim to derive memberships based on statistical theories, but to derive a fuzzy partition by using the forms of entropy or K-L information, which are well-known tools in statistics.

The necessary condition for the optimality of the Lagrangian function Eq. (16),

$$ \frac{\partial L}{\partial u_{ik}} = 0 \qquad (17) $$

yields

$$ u_{ik} = \pi_i \exp\left( -\frac{1}{\lambda} d_{ik} \right) |A_i|^{-\frac{1}{\lambda}} \exp\left( \frac{\eta_k}{\lambda} - 1 \right) \qquad (18) $$

and then the constraint

$$ \sum_{i=1}^{c} u_{ik} = 1 \qquad (19) $$

yields

$$ u_{ik} = \frac{\pi_i \exp\left( -\frac{1}{\lambda} d_{ik} \right) |A_i|^{-\frac{1}{\lambda}}}{\sum_{j=1}^{c} \pi_j \exp\left( -\frac{1}{\lambda} d_{jk} \right) |A_j|^{-\frac{1}{\lambda}}} \qquad (20) $$

Furthermore,

$$ \frac{\partial L}{\partial \pi_i} = 0 \qquad (21) $$

yields

$$ \pi_i = -\frac{\lambda}{\tau} \sum_{k=1}^{n} u_{ik} \qquad (22) $$

and from the constraint

$$ \sum_{i=1}^{c} \pi_i = 1 \qquad (23) $$

we have

$$ \pi_i = \frac{\sum_{k=1}^{n} u_{ik}}{\sum_{j=1}^{c} \sum_{k=1}^{n} u_{jk}} = \frac{1}{n} \sum_{k=1}^{n} u_{ik} \qquad (24) $$

Likewise,

$$ \frac{\partial L}{\partial A_i} = 0 \qquad (25) $$

yields

$$ A_i = \frac{1}{\sum_{k=1}^{n} u_{ik}} \sum_{k=1}^{n} u_{ik} (x_k - v_i)(x_k - v_i)^T \qquad (26) $$

and

$$ \frac{\partial L}{\partial v_i} = 0 \qquad (27) $$

yields

$$ v_i = \frac{\sum_{k=1}^{n} u_{ik} x_k}{\sum_{k=1}^{n} u_{ik}} \qquad (28) $$

Therefore, the algorithm is a repetition through Eqs. (20), (24), (26) and (28). When the weight of the K-L information term λ equals two, this fuzzy clustering algorithm is equivalent to the EM algorithm with the Gaussian mixture model in Eqs. (4)-(7). Although the two algorithms have different objectives (density estimation and fuzzy clustering), it can be said that they are essentially quite similar.

The membership function Eq. (20) differs from the one in FCM with entropy regularization in that Eq. (20) contains $\pi_i$ in both the numerator and the denominator. As shown in Eq. (24), $\pi_i$ represents the capacity of cluster $i$; K-L information regularization thus also attempts to optimize clusters with respect to their capacities. When we adopt the matrices $A_i$ as unknown parameters in our algorithm, this may cause singularity, as in the EM algorithm with Gaussian mixture models.
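To make the correspondence with the E-step and M-step concrete, the following is a minimal NumPy sketch of one sweep of the updates (20), (24), (26) and (28); with lam = 2 it coincides with one EM iteration (4)-(7). The function name and loop structure are our own assumptions, not the paper's notation.

```python
import numpy as np

def kl_regularized_fcm_step(X, u, lam):
    """One sweep of the K-L regularized FCM updates (Eqs. (20), (24), (26), (28)).
    X: (n, p) data, u: (c, n) current memberships, lam: weight of the K-L term."""
    c, n = u.shape
    # Updates (24), (28), (26): mixing weights, centers, covariance matrices.
    pi = u.sum(axis=1) / n
    v = (u @ X) / u.sum(axis=1, keepdims=True)
    A = []
    for i in range(c):
        diff = X - v[i]
        A.append((u[i][:, None, None]
                  * np.einsum('kj,kl->kjl', diff, diff)).sum(axis=0) / u[i].sum())
    A = np.array(A)
    # Membership update (20) with the Mahalanobis distance d_ik of Eq. (9).
    u_new = np.empty((c, n))
    for i in range(c):
        diff = X - v[i]
        d = np.einsum('kj,jl,kl->k', diff, np.linalg.inv(A[i]), diff)
        u_new[i] = pi[i] * np.exp(-d / lam) * np.linalg.det(A[i]) ** (-1.0 / lam)
    u_new /= u_new.sum(axis=0, keepdims=True)
    return u_new, pi, v, A
```

A run would initialize u, for example from a random partition with columns normalized over the c clusters, and repeat the step until the memberships stop changing.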
Gaussian mixture models and the fuzzy clustering methods described above both derive partitions with ellipsoidal clusters, using probability density estimation or local Mahalanobis distances. FCV [2, 3] is also a clustering method that can derive ellipsoidal clusters, in which the prototypes are multi-dimensional linear varieties represented by local principal component vectors. In this section, we introduce the K-L information regularization into FCV and compare it with Gaussian mixture models.

Assume that the prototypes are $r$-dimensional linear varieties $V_r$ spanned by $r$ linearly independent vectors $p_j$:

$$ V_r = \left\{ z \,\middle|\, z = v + \sum_{j=1}^{r} t_j p_j;\ t_j \in \mathbb{R}^1 \right\} = v + \mathrm{span}(\{p_j\}) \qquad (29) $$

For instance, in a 3-dimensional space, a plane through the point $v$ is used as a prototype instead of a cluster center. The objective function is obtained by replacing the distance $d_{ik}$ in Eq. (16) with the orthogonal distance from a data point to the $i$th prototypical linear variety. The orthogonal distance $d_{ik}$ is as follows:

$$ d_{ik} = \|x_k - v_i\|^2 - \sum_{j=1}^{r} \left( (x_k - v_i)^T p_{ij} \right)^2 = (x_k - v_i)^T (x_k - v_i) - \sum_{j=1}^{r} p_{ij}^T (x_k - v_i)(x_k - v_i)^T p_{ij} \qquad (30) $$

where $p_{ij}$, $j = 1, \dots, r$, are linearly independent unit vectors. $B_i = \sum_{k=1}^{n} (u_{ik})^m (x_k - v_i)(x_k - v_i)^T$ is a fuzzy scatter matrix. Under the constraint that $p^T p = 1$, the vector $p_{ij}$ is derived as the $j$th unit eigenvector of the fuzzy scatter matrix $B_i$, corresponding to its $j$th largest eigenvalue. The membership $u_{ik}$ is derived from Eq. (20) of FCM by replacing the distance $d_{ik}$ with Eq. (30), and $v_i$ is also derived from Eq. (28). The FCV algorithm thus alternates between calculating the eigenvectors of the fuzzy scatter matrices $B_i$ and updating Eqs. (20) and (28).

Fuzzy c-Elliptotypes (FCE) [3] uses a convex combination of the squared distance from a data point to the cluster center $v_i$ and the squared distance from the point to the linear prototype. FCE is similar to the method proposed by D.E. Gustafson and W.C. Kessel in the sense that it derives ellipsoidal clusters.
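As a minimal sketch of the FCV local principal component step described above, the following NumPy fragment computes, for a single cluster, the fuzzy scatter matrix $B_i$, its top-$r$ unit eigenvectors, and the orthogonal distances of Eq. (30). The function name is hypothetical, and the membership weights are passed in already exponentiated if the $(u_{ik})^m$ form is used.

```python
import numpy as np

def fcv_cluster_step(X, u_i, r):
    """For one cluster: center (Eq. (28)), fuzzy scatter matrix B_i, its top-r
    eigenvectors (local principal components), and the distances of Eq. (30).
    X: (n, p) data, u_i: (n,) membership weights of this cluster."""
    # Membership-weighted center.
    v_i = (u_i[:, None] * X).sum(axis=0) / u_i.sum()
    diff = X - v_i
    # Fuzzy scatter matrix B_i = sum_k u_ik (x_k - v_i)(x_k - v_i)^T.
    B_i = diff.T @ (u_i[:, None] * diff)
    # Unit eigenvectors of the symmetric matrix B_i, sorted by decreasing eigenvalue;
    # the first r columns span the prototypical linear variety.
    eigvals, eigvecs = np.linalg.eigh(B_i)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:r]]  # (p, r)
    # Orthogonal distance (Eq. (30)): squared norm minus the projected components.
    d_i = (diff ** 2).sum(axis=1) - ((diff @ P) ** 2).sum(axis=1)
    return v_i, P, d_i
```

The distances d_i returned here would replace $d_{ik}$ in the membership update (20), closing the FCV iteration.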
4 Conclusion

We showed that the same algorithm as the EM algorithm can be derived from a modified FCM with regularization by K-L information, and we applied it to FCV, a clustering algorithm that obtains linear or ellipsoidal clusters, for developing a data-mining tool based on local principal component analysis. Density estimation with the Gaussian mixture model can be considered a probabilistic approach to clustering that is strictly confined to Gaussian distributions. On the other hand, fuzzy c-means clustering with regularization by K-L information can be seen as a non-Gaussian version of density estimation. Our future work includes improving hard clustering results by applying an annealing technique to the parameter λ, and taking various distributions, not confined to Gaussian ones, into consideration.

References

[1] J. C. Bezdek: Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press (1981).

[2] J. C. Bezdek, C. Coray, R. Gunderson and J. Watson: Detection and characterization of cluster substructure. I. Linear structure, fuzzy c-lines, SIAM J. Appl. Math., 40, 2, 339-357 (1981).

[3] J. C. Bezdek, C. Coray, R. Gunderson and J. Watson: Detection and characterization of cluster substructure. II. Fuzzy c-varieties and convex combinations thereof, SIAM J. Appl. Math., 40, 2, 358-372 (1981).

[4] R. O. Duda and P. E. Hart: Pattern Classification and Scene Analysis, Wiley, New York (1973).

[5] D. E. Gustafson and W. C. Kessel: Fuzzy clustering with a fuzzy covariance matrix, Proc. IEEE-CDC, 2, 761-766 (1979).

[6] A. P. Dempster, N. M. Laird and D. B. Rubin: Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc., Ser. B, 39, 1, 1-38 (1977).

[7] G. J. McLachlan and T. Krishnan: The EM Algorithm and Extensions, John Wiley and Sons (1997).

[8] S. Miyamoto and M. Mukaidono: Fuzzy c-means as a regularization and maximum entropy approach, Proc. of the 7th International Fuzzy Systems Association World Congress (IFSA'97), II, 86-92 (1997).

[9] R. L. Streit and T. E. Luginbuhl: Maximum likelihood training of probabilistic neural networks, IEEE Transactions on Neural Networks, 5, 5, 764-783 (1994).