On the Bayesian mixture model and identifiability

Ramsés H. Mena
Universidad Nacional Autónoma de México, Mexico

Stephen G. Walker
University of Texas at Austin, USA

This paper is concerned with Bayesian mixture models and identifiability issues. There are two sources of unidentifiability: the well known likelihood invariance under label switching and the perhaps less well known parameter identifiability problem. When using latent allocation variables determined by the mixture model, these sources of unidentifiability create arbitrary labeling which renders estimation of the model very difficult. We endeavour to tackle these problems by proposing a prior distribution on the allocations which provides an explicit interpretation for the labeling by removing gaps with high probability. An MCMC estimation method is proposed and supporting illustrations are presented.

Keywords: Allocation variable; Gibbs sampler; Identifiability; Label switching; Mixture model.

1 Introduction

The random allocation of objects into clusters or groups is of fundamental interest in statistics and other areas that interact with it. A common way to assign a distribution able to disentangle the different features of observations into homogeneous clusters, e.g. characterized by the same parametric family K(· | θ), is through mixture models. Under such a framework, independent and identically distributed observations, y := (y_i)_{i=1}^n, are modeled by
\[
f(y_i \mid w, \theta) = \sum_{j=1}^{M} w_j\, K(y_i \mid \theta_j), \qquad M \ge 1, \tag{1}
\]
where the weights w := (w_j)_{j=1}^M, adding up to one, and the locations θ := (θ_j)_{j=1}^M are instrumental and might be considered as random under complete uncertainty. The literature backing up this approach is vast, ranging from its origins in Pearson (1894) to expositions such as the ones presented by Titterington, Smith and Makov (1985), McLachlan and Peel (2000) or Marin, Mengersen and Robert (2005). In particular, the case M = ∞ was first considered and studied by Lo (1984), where the weights and locations are derived from the Dirichlet process (Ferguson, 1973). Indeed, this constitutes the workhorse for Bayesian nonparametric inference, both for density estimation and for clustering problems. A recent review in this direction can be found in Hjort, Holmes, Müller and Walker (2010).

An alternative way to write such a model is via an allocation variable, which tells us which cluster each observation belongs to. That is,
\[
f(y_i \mid d_i) = K(y_i \mid \theta_{d_i}) \quad \text{with} \quad P(d_i = j \mid w) = w_j, \tag{2}
\]
independently for i = 1, ..., n. The advantage of this latter representation is that there is an explicit characterization of the distribution of the group membership of each observation. Under full uncertainty about the clustering structure underlying y, the natural choice is to have a symmetric distribution such as the multinomial. However, care should be taken as, under such a scenario, the likelihood is invariant under the many possible permutations of the indices, namely
\[
p(y_1, \ldots, y_n \mid d_1, \ldots, d_n) = p(y_1, \ldots, y_n \mid \rho(d_1), \ldots, \rho(d_n))
\]
for any permutation ρ of the allocation variables d := (d_i)_{i=1}^n. Furthermore, under symmetric priors, across components, for θ and w, this invariance would be inherited by posterior inference. Clearly, having such an invariance implies model unidentifiability.
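To make the allocation representation and the invariance concrete, here is a minimal numerical sketch (ours, not from the paper), using a two component normal kernel; the particular weights and locations are illustrative assumptions. The invariance is shown in its (w, θ) form: relabelling the components of (1) leaves the likelihood unchanged, which is the same symmetry that makes the allocation labels arbitrary.

```python
# A minimal sketch (not from the paper) of the mixture likelihood (1), its
# allocation representation (2), and the label-switching invariance: permuting
# the component labels of (w, theta) leaves f(y | w, theta) unchanged.
import numpy as np

rng = np.random.default_rng(0)

# A two-component normal mixture: K(. | theta_j) = N(. | mu_j, sigma_j^2).
w = np.array([0.7, 0.3])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 0.5])

# Simulate via the allocation variables d_i, as in (2).
n = 200
d = rng.choice(len(w), size=n, p=w)          # P(d_i = j | w) = w_j
y = rng.normal(mu[d], sigma[d])              # y_i | d_i ~ K(. | theta_{d_i})

def log_lik(y, w, mu, sigma):
    """Log of the mixture likelihood (1): prod_i sum_j w_j N(y_i | mu_j, sigma_j^2)."""
    dens = w * np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return np.log(dens.sum(axis=1)).sum()

perm = np.array([1, 0])                      # swap the two component labels
print(log_lik(y, w, mu, sigma))
print(log_lik(y, w[perm], mu[perm], sigma[perm]))   # identical: the labels carry no information
```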
In particular, this issue is well known to cause the label switching problem when performing MCMC based inference; namely, the likelihood invariance prevents a proper identification of a grouping structure. We refer to Titterington, Smith and Makov (1985), Celeux, Hurn and Robert (2000) and Stephens (2000) for deeper discussions of the label switching problem. There are various proposals dealing with the label switching problem, some aimed at selecting a particular permutation, e.g. by means of a loss function (Frühwirth-Schnatter, 2001; Nobile and Fearnside, 2007; Papastamoulis and Iliopoulos, 2010; Rodriguez and Walker, 2014), and others aimed at imposing parameter constraints, so as to break the aforementioned symmetry and favor a single labeling (Diebolt and Robert, 1994; Richardson and Green, 1997; Chung, Loken and Schafer, 2004).

Having said that, there is a seemingly related but different identifiability issue under the above approach, namely the correct identification of the components that best describe y. In other words, besides the label switching problem, there is the problem that different parameter values (w, θ) could lead to the same value of f(·). To expand on this kind of unidentifiability, notice that we can recover a standard normal for f(·) with w_1 = 1 and K(· | θ_1) = N(· | 0, 1), but also with w_1 as anything, w_2 = 1 − w_1, K(· | θ_1) = N(· | 0, 1) and K(· | θ_2) ≈ N(· | 0, 1). Indeed, an observation in this direction can be traced back to Ferguson (1983), who states the following about the parameters of the model: “We are not interested in estimating the parameters of the model ... It would not make much sense to do so since the parameters are not identifiable.”

Solving these unidentifiability issues simultaneously, i.e. likelihood invariance under label switching and parameter unidentifiability, is still a challenging problem. Our attempt to solve them is based on a different starting point, namely by suitably constructing a prior distribution on the allocation variables d, say p(d). Specifically, the idea is to attach some meaning to such allocations so that, with high probability, the gaps are removed and at the same time the unidentifiability problems are resolved. We then connect this to the mixture model via the standard conditional distributions p(y, θ | d) and p(w | d). Due to the fact that we start without the weights, we need to assign a prior distribution on the allocations.

Some approaches in this direction, e.g. Casella, Robert and Wells (2004), Nobile (1994) and Nobile and Fearnside (2007), follow this idea and construct MCMC samplers to draw posterior inference on the clustering structure of the data using p(d | y). However, given such a large support for the sequence d, not all weight distributions will lead to a “good” p(d), and thus p(d | y). There are some proposals working directly with p(d | y); but our contention is that if the form for p(d) has not been chosen astutely then the problems of poor allocations will persist. Attempts to solve the problem include constraining the support of d at this stage: set d_1 = 1, then sample d_2 from p(d_2 | d_1, y) ∝ p(d_1, d_2 | y) with d_2 chosen from {1, 2}. In general, take d_i = j from the density proportional to p(d_1, ..., d_{i−1}, j | y), with j restricted to {1, ..., k + 1} and k = max{d_1, ..., d_{i−1}}. However, this constraint on the d leads to a non-exchangeable model and so depends on who is assigned as y_1, and so on.
The benefit is that sampling is direct and avoids MCMC, since we can calculate p(d | y) when the d are given or the choice of d values is limited. However, to avoid the order problem, one would need to repeat the process for each permutation of the first n integers. See Wang and Dunson (2012). In a similar direction, Quintana and Iglesias (2003) proposed the idea of using product partition models, where weights are not considered and only the model f(y | d) is constructed. In this case, it is only necessary to consider f(y | k, C_1, ..., C_k), where (k, C_1, ..., C_k) define a partition of the observations into k groups and a prior is placed on it. Lau and Green (2007) propose an optimal clustering procedure via the application of a specific loss function that penalises pairs of observations assigned to different clusters when they should be in the same one. In a related proposal, restricted to one dimension and without the need for MCMC methods, Dahl (2009) gives an algorithm to find the modal partition based on a reallocation of overlapping partitions. Although these approaches are very appealing when looking for the most appropriate clustering of observations, the aforementioned unidentifiability issues persist.

Our aim is to construct p(d) so that, with high probability, label 1 is assigned to the largest group, label 2 to the second largest group, and so on, without allowing for gaps. More accurately, we want the labels to be associated with a non-increasing sequence of group sizes, so we do not make an issue of equally sized groups. This will remove the label switching problem with high probability. The parameter identifiability problem is caused by more groups being generated than are actually needed, and so p(d) will specifically penalize large numbers of groups.

The layout of the paper is as follows. In Section 2, we discuss and motivate our particular choice of prior for the allocation variables. In this section, we also present a toy example to illustrate the properties of the prior and posterior distributions. In Section 3, we describe the estimation method, via MCMC, to obtain samples from p(d | y), and then the procedure for estimating the weights associated with the allocations. Section 4 contains an illustration using MCMC methods to sample from p(d, θ | y) and then calculate the weights. In particular, for the benchmark galaxy data set, we obtain a clear three component normal mixture representation of the data and provide the associated weights, means and variances of each component. In this section we also discuss what happens when we have groups of equal size. Section 5 concludes with a summary.

2 A prior distribution on allocation variables

In this section we describe a prior distribution for allocating n individuals to k ≤ n groups. This can be considered separately from the context of the mixture model, though obviously we will return to it later on. The aim, for reasons expanded on in the Introduction, is to put very high probability on the largest group having label 1, the second largest group label 2, and so on. The prior needs to be exchangeable for all n, that is:

(i) for all permutations ρ on {1, ..., n}, and for all n ≥ 1, p(d_1, ..., d_n) = p(d_{ρ(1)}, ..., d_{ρ(n)});

(ii) for all n ≥ 1,
\[
\sum_{j=1}^{\infty} p(d_1, \ldots, d_n, j) = p(d_1, \ldots, d_n).
\]

It is well known that d = (d_1, ..., d_n) defines k, the number of distinct d_i s in d, and (C_1, ..., C_k), the groups. That is, C_1 is a group in which the d_i s all share the same value. However, the labeling, or the value that these d_i s take, is irrelevant; so C_1 has no implication other than that it describes one of the groups. In particular, C_j ≠ {i : d_i = j}, since we have labeled the C_j s from 1 to k and yet it is not the case, in general, that the d_i s are labeled from 1 to k.
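The bookkeeping just described can be made concrete with a short sketch (ours, not from the paper): it recovers the partition (k, C_1, ..., C_k) induced by an allocation vector d and applies the relabelling, discussed in the paragraph that follows, which assigns label 1 to the largest group, label 2 to the second largest, and so on, with no gaps. Among all allocation vectors inducing the same partition, this relabelling minimizes D = Σ_i d_i.

```python
# A small sketch (ours, not from the paper) of the map from an allocation vector d
# to its induced partition, and of the gap-free, size-ordered relabelling which
# minimizes D = sum_i d_i over all d producing the same partition.
def partition_of(d):
    """Groups of indices sharing the same allocation value, by first appearance."""
    groups = {}
    for i, label in enumerate(d):
        groups.setdefault(label, []).append(i)
    return list(groups.values())

def canonical(d):
    """Relabel so that groups get labels 1, 2, ... in decreasing order of size."""
    groups = sorted(partition_of(d), key=len, reverse=True)
    out = [0] * len(d)
    for new_label, members in enumerate(groups, start=1):
        for i in members:
            out[i] = new_label
    return tuple(out)

d = (1, 2, 2, 4)                    # label 3 is skipped; the size-2 group is labelled 2
print(partition_of(d))              # [[0], [1, 2], [3]] -- k = 3 groups
print(canonical(d), sum(d), sum(canonical(d)))   # (2, 1, 1, 3), D drops from 9 to 7
```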
To think about how we plan to construct p(d), we ask how we should deal with (1, 2, 2, 4) and (1, 1, 2, 3). We do not need both of them. We do not want the gap associated with leaving out the 3 in (1, 2, 2, 4), and we want the group of size 2 to be labeled 1. So what is needed is a model for which p(1, 1, 2, 3) ≫ p(1, 2, 2, 4). That is, we need the labeling to be biased towards (1, 1, 2, 3) rather than (1, 2, 2, 4).

There are two things going on here. The first is that we do not want arbitrary numbers representing the groups, so we want no gaps; and the second is that we want the size of the group to determine the label, so that the larger groups occupy the smaller labels. This is not going to be 100% precise, but the ambition of this direction should be a motivation when proposing a p(d). It is clear from looking at (1, 1, 2, 3) that it minimizes the sum of the d_i s among those d with the same (k, C_1, ..., C_k), and this forms the basis of the prior.

Therefore, and the thinking is straightforward, our proposal for having no gaps and for the larger groups to be labeled with the smaller integers is to have p(d) ∝ g(D), where
\[
D = \sum_{i=1}^{n} d_i, \qquad D \ge n,
\]
for some decreasing function g(·). Clearly, this will meet the exchangeability requirement (i) provided we make sure that
\[
\sum_{d} g(D) < +\infty.
\]
We also need to demonstrate (ii), which puts some restrictions on the choice of g(·). The preferred choice is g(D) = q^D for some 0 < q < 1, which clearly meets condition (ii). To see this, write p(d) = c^n q^D, so that (ii) requires
\[
\sum_{j=1}^{\infty} c^{n+1} q^{D+j} = c^{n+1} q^D \frac{q}{1-q}
\]
to equal c^n q^D, and hence we need c = 1/q − 1. In the illustration we will assign a hyper-prior to q. The construction of a suitable prior for q can be based on the result that
\[
P\Big(\max_{1 \le i \le n} d_i \le k\Big) = \int (1 - q^k)^n\, \pi(dq),
\]
and hence one can find a π which matches some probability on the number of groups, since
\[
P\Big(\max_{1 \le i \le n} d_i \le k\Big) \approx P(\le k \text{ groups}).
\]
A more general class of g(·) is
\[
g(D) = \frac{\Gamma(D - n + \beta)}{\Gamma(D + \alpha + \beta)},
\]
for some α, β > 0, though we will not pursue this case in the present paper.

Now, returning to the mixture model, this will yield us a p(y | d), which we can combine with p(d) to obtain the posterior p(d | y). The p(y | d) which arises from a mixture model has the form
\[
p(y \mid d) = \prod_{j=1}^{k} \int \prod_{i \in C_j} K(y_i \mid \theta_j)\, \pi(d\theta_j). \tag{3}
\]
This does not involve any weight parameters since they have been superseded by the d. Hence, in some way, we are building up a mixture model which is fundamentally different to the usual starting point. For (3), if a conjugate prior has been chosen for the θ_j s, written as π(θ), we can integrate out the θ_j s to obtain p(y | d) = p(y | C_1, ..., C_k, k). For example, in the particular case when K(y | θ_j) = N(y | µ_j, 1/λ_j) and the conjugate prior N(µ | 0, (τλ)^{-1}) × Ga(λ | a, b) is chosen independently for each (µ_j, λ_j), we obtain
\[
p(y \mid k, C_1, \ldots, C_k) \propto \prod_{j=1}^{k} \frac{\Gamma(a + n_j/2)}{(n_j + \tau)^{1/2}\, \big[\, b + 0.5\,(s_j^2 - n_j^2 \bar y_j^2/(n_j + \tau)) \,\big]^{a + n_j/2}},
\]
where
\[
\bar y_j = n_j^{-1} \sum_{i \in C_j} y_i, \qquad n_j = |C_j|, \qquad s_j^2 = \sum_{i \in C_j} y_i^2.
\]
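The within group marginal just displayed is straightforward to compute; the following sketch (ours, not the authors' code) evaluates it on the log scale, keeping all constants so that partitions with a different number of groups can be compared, and multiplies across groups to give p(y | k, C_1, ..., C_k). The default hyperparameters are the toy example values a = 5, b = 1 and τ = 0.001 used below; the function names are ours.

```python
# A sketch (ours) of the within-group marginal likelihood obtained by integrating
# out (mu_j, lambda_j) under the kernel N(y | mu_j, 1/lambda_j) with the conjugate
# prior N(mu | 0, (tau*lambda)^{-1}) x Ga(lambda | a, b), as in the display above.
# Per-group constants are kept so that partitions with different numbers of groups
# can be compared directly.
import numpy as np
from scipy.special import gammaln

def log_marginal_group(y, a=5.0, b=1.0, tau=0.001):
    """log of: integral of prod_i N(y_i | mu, 1/lam) dN(mu | 0, 1/(tau*lam)) dGa(lam | a, b)."""
    y = np.asarray(y, dtype=float)
    m = y.size
    s2 = np.sum(y ** 2)                        # s_j^2 in the text
    ybar = y.mean() if m > 0 else 0.0
    r = s2 - m ** 2 * ybar ** 2 / (m + tau)    # s_j^2 - n_j^2 ybar_j^2 / (n_j + tau)
    return (-0.5 * m * np.log(2.0 * np.pi)
            + 0.5 * (np.log(tau) - np.log(m + tau))
            + a * np.log(b) - gammaln(a)
            + gammaln(a + 0.5 * m)
            - (a + 0.5 * m) * np.log(b + 0.5 * r))

def log_p_y_given_d(y, d, **hyper):
    """log p(y | k, C_1, ..., C_k): a product of within-group marginals."""
    y, d = np.asarray(y, dtype=float), np.asarray(d)
    return sum(log_marginal_group(y[d == label], **hyper) for label in np.unique(d))

# e.g. comparing two groupings of three points:
print(log_p_y_given_d([1.0, 1.1, 5.0], [1, 1, 2]),
      log_p_y_given_d([1.0, 1.1, 5.0], [1, 2, 2]))
```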
This form for p(y | C_1, ..., C_k, k) is the classic clustering probability once k, the number of groups, is known. In other words, it provides the likelihood for “who is with who”. Our model therefore will be of the type
\[
p(d, y) \propto g(D)\, p(y \mid C_1, \ldots, C_k, k),
\]
where g(D) is a decreasing function of D. Clearly, we have p(y | d) = p(y | C_1, ..., C_k, k), while p(d) ∝ g(D). Hence, the model splits neatly into two components: the g(D), which determines the labeling, and p(y | d), which decides “who is with who”.

In order to have a clearer idea of the posterior distribution on d, and to see how our idea works, here we consider a small data set for which we can compute p(d | y) exactly. Specifically, we consider n = 10 data points given by
−1.521, −1.292, −0.856, −0.104, 2.388, 3.079, 3.312, 3.415, 3.922, 4.194.
See Figure 1 for a histogram of the data. This data set has been considered before by Fuentes-García, Mena and Walker (2010) in a similar context. To compute the exact posterior values at each d we use the same kernel and prior, on the θ_j s, specified earlier in this section, with p(d) ∝ q^D and M = 3. In this case we fix M so we can do all the calculations exactly, though in practice, as we will see, M = ∞ can be dealt with using MCMC methods. That is, we have M^n possible values for d. Setting a = 5, b = 1, τ = 0.001 and q = 0.1, we obtain that the mode is placed at (d_1, d_2, ..., d_10) = (2, 2, 2, 2, 1, 1, 1, 1, 1, 1), which has posterior probability 0.933. Table 1 shows the top nine posterior values for d as well as their corresponding probabilities. It is worth noting that increasing the value of M, e.g. to 4, 5, ..., does not lead to any significant difference in the reported posterior values. We computed the corresponding posterior values for M = 4 and M = 5, but decided not to report them here as they do not provide any difference. From these results it is evident that the highest mass is placed on the allocation values where the largest group is identified by the label 1 and the second largest by 2.

Figure 1 here

Table 1 here
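For a data set this small, p(d | y) can indeed be computed by brute force. The sketch below (ours, not the authors' code) enumerates all M^n = 3^10 allocation vectors, scores each by log p(d) + log p(y | d) with p(d) ∝ q^D, and normalizes. It keeps the full marginal constants (the same form as in the previous sketch), so the resulting probabilities are illustrative and need not match Table 1 to the digit.

```python
# Brute-force enumeration (ours) of the exact posterior p(d | y) for the toy data,
# with M = 3, p(d) proportional to q^D, and the normal/gamma marginal above.
import itertools
import numpy as np
from scipy.special import gammaln

y = np.array([-1.521, -1.292, -0.856, -0.104, 2.388,
              3.079, 3.312, 3.415, 3.922, 4.194])
M, q = 3, 0.1
a, b, tau = 5.0, 1.0, 0.001

def log_group(x):
    # within-group marginal under the normal/gamma prior (same form as the sketch above)
    m, s2, tot = x.size, np.sum(x ** 2), np.sum(x)
    r = s2 - tot ** 2 / (m + tau)
    return (-0.5 * m * np.log(2 * np.pi) + 0.5 * (np.log(tau) - np.log(m + tau))
            + a * np.log(b) - gammaln(a) + gammaln(a + 0.5 * m)
            - (a + 0.5 * m) * np.log(b + 0.5 * r))

def log_score(d):
    # log p(d) + log p(y | d), up to an additive constant, with p(d) proportional to q^D
    arr = np.array(d)
    return arr.sum() * np.log(q) + sum(log_group(y[arr == j]) for j in set(d))

allocs = list(itertools.product(range(1, M + 1), repeat=len(y)))   # all 3^10 vectors
scores = np.array([log_score(d) for d in allocs])                  # takes a few seconds
probs = np.exp(scores - scores.max())
probs /= probs.sum()
for idx in np.argsort(-probs)[:5]:
    print(allocs[idx], round(float(probs[idx]), 3))
```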
In summary, for any fixed configuration S = (k, C_1, ..., C_k) there exists a d_S which allocates the largest group to label 1, the second largest to label 2, and so on. This d_S will minimize, over all other d*_S yielding the same configuration, the sum D = Σ_i d_i. Hence the choice of a prior of the type p(d) ∝ g(D).

It is important to point out what the consequences are if we have two groups of exactly the same size. To see what happens we can re-do the toy example giving both groups 5 observations each, e.g. by changing 2.388 to −2 in the above data set. In this case we find that the posterior probability of (1, 1, 1, 1, 1, 2, 2, 2, 2, 2) is exactly the same as that of (2, 2, 2, 2, 2, 1, 1, 1, 1, 1), i.e. equal to 0.466 each. This is expected and is what causes the label switching problem when running MCMC algorithms, namely the Markov chain will spend 50% of the time in one mode and 50% in the other. However, using our p(d), the chain we run will not switch between modes due to the extremely low probability of doing so. This point is expanded on in Section 4.2.

3 Estimation

Given the dimension of the support for the allocation variables, we need to obtain p(d | y) using MCMC methods. We actually do this with the locations included in the estimating model, so we sample from p(d, θ | y). Hence, the MCMC is based on the conditional distributions p(d_i | θ, y_i, d_{-i}) and p(θ_j | d, y), for i = 1, ..., n and j = 1, 2, ..., respectively. Indeed, both of these conditional distributions are easy to find. A specific case will be given in Section 4. In the finite M case it is direct, since each 1 ≤ d_i ≤ M. In the infinite case we need to be careful about how to sample the (d_i) due to their infinite choice, but this can be resolved using latent variables, as in Kalli, Griffin and Walker (2011). To expand on this point here, we have
\[
p(y, d \mid \theta) \propto g(D)\, p(y \mid k, C_1, \ldots, C_k, \theta)
\]
and we can introduce a latent variable u so that the joint density with (d, y) is given by
\[
p(y, d, u \mid \theta) \propto \mathbf{1}(u < g(D))\, p(y \mid k, C_1, \ldots, C_k, \theta).
\]
Then, when it comes to the sampling of d_i, we have
\[
p(d_i = j \mid y_i, \theta, d_{-i}) \propto g(D_{-i} + j)\, \mathrm{N}(y_i \mid \mu_j, 1/\lambda_j),
\]
where D_{-i} = Σ_{i' ≠ i} d_{i'}, and, conditional on u, d_i is constrained to lie in the finite set
\[
A_i = \{j : g(D_{-i} + j) > u\} \equiv \{j : j \le g^{-1}(u) - D_{-i}\}.
\]
At each iteration we would simulate a sufficient number of the θ_j values to be able to find the A_i, and hence we would need (θ_j)_{j=1}^{J} for J = max_i {g^{-1}(u) − D_{-i}}.

Once this is done, we can use the MCMC estimate of p(d | y) and draw inference about any functional of (d, y), say β(d, y). In particular, we could estimate
\[
\beta(y) = \sum_{d} \beta(d, y)\, p(d \mid y).
\]
In the basic mixture model (1), estimation problems of the type
\[
\beta^*(y) = \int \gamma(\theta, w)\, p(\theta, w \mid y)\, d\theta\, dw,
\]
for some function γ(θ, w), make no sense due to the fact that the (θ, w) are unidentifiable. However, if we condition on the allocation variables, which are from p(d | y), then estimation becomes identifiable (with very high probability), and hence we can estimate functionals such as
\[
\beta(d, y) = \int \gamma(\theta, w)\, p(\theta, w \mid d, y)\, d\theta\, dw
\]
via β(y) = Σ_d β(d, y) p(d | y). Here
\[
p(\theta, w \mid d, y) = p(w \mid d)\, p(\theta \mid d, y)
\]
will be some posterior for estimating the (w, θ).

In fact we can do this for the standard Dirichlet process mixture model. The key is to think of this as estimation of the functional β(d, y) with respect to the model p(d | y). In this case, we have w_1 = v_1 and, for j > 1,
\[
w_j = v_j \prod_{l < j} (1 - v_l),
\]
where each v_j is an independent Beta(1, c) variable, for some c > 0, and then it is easy to show that, independently for each j,
\[
p(v_j \mid d) = \mathrm{Beta}\big(v_j \mid 1 + \#\{d_i = j\},\; c + \#\{d_i > j\}\big).
\]
The form for p(θ | y, d) is also well known. So, for example, a particular γ(θ, w) might be given by
\[
\gamma(\theta, w) = \sum_{j=1}^{\infty} \mathbf{1}(w_j > \epsilon)
\]
for some small ε > 0, or γ(θ, w) = w; the former would tell us how many groups had weights larger than some ε.

Here we can see clearly the benefit of minimizing D with high probability. If the labeling is arbitrary then we will see d_i = j for larger values of j than are strictly necessary. This suggests that the number of w_j s we should be interested in, i.e. those larger than ε, is larger than is actually needed to model the data. Keeping the same structure, i.e. “who is with who”, while minimizing the labeling, means we are interested only in weights which are actually needed.
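The conditional mean of each w_j given d is available in closed form from the Beta update above. The following sketch (ours, not the authors' code) computes these conditional means for a single allocation vector and averages them over posterior draws of d, which is essentially how the weight estimates reported in Section 4 can be obtained; the function and variable names are ours.

```python
# A sketch (ours) of the weight estimates: given d, the v_j are independent
# Beta(1 + #{d_i = j}, c + #{d_i > j}) a posteriori, so the conditional mean of
# each stick-breaking weight w_j = v_j prod_{l<j}(1 - v_l) is available in closed
# form; averaging over posterior draws of d gives the w_bar_j.
import numpy as np

def expected_weights(d, J, c=1.0):
    """E[w_j | d] for j = 1, ..., J under the Beta(1, c) stick-breaking prior."""
    d = np.asarray(d)
    w = np.empty(J)
    log_remaining = 0.0                       # log of prod_{l<j} E[1 - v_l | d]
    for j in range(1, J + 1):
        n_j = np.sum(d == j)                  # #{d_i = j}
        m_j = np.sum(d > j)                   # #{d_i > j}
        e_v = (1.0 + n_j) / (1.0 + c + n_j + m_j)
        w[j - 1] = np.exp(log_remaining) * e_v
        log_remaining += np.log1p(-e_v)
    return w

# Average over MCMC draws of d (here: two hand-made draws for illustration only).
draws = [np.array([1, 1, 1, 2, 2, 3]), np.array([1, 1, 2, 2, 2, 3])]
w_bar = np.mean([expected_weights(d, J=5) for d in draws], axis=0)
print(np.round(w_bar, 3))
```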
4 Illustrations

In this section we look at two illustrations: one involving the benchmark galaxy data set and the other a simulated data set in which the observations fall into one of two distinct groups of equal size. It is well known that the label switching problem arises from the Markov chain switching between modes. If these modes are asymmetric then the chain will spend predominantly all, if not all, of the time in one mode of d, since one has much higher probability than the others. This mode will be the one with non-increasing group sizes coinciding with the increasing labels. However, it is worth looking at the case when two modes have the same probability, see the toy example in Section 2, which occurs when two groups have exactly the same size. In this case we will demonstrate that jumping between modes does not happen due to the extremely low probability of doing so.

4.1 Galaxy data

Here we re-analyse the galaxy data set. This data set consists of n = 82 observations and was originally presented by Roeder (1990). To expand on the description of the MCMC given earlier, we select M = 10 as an upper bound on the number of clusters, and hence we only have (µ_j, λ_j)_{j=1}^{10} and ten weights. As we have mentioned before, handling the infinite case involves the introduction of a latent variable. Thus, for j = 1, ..., 10,
\[
p(d_i = j \mid \theta, y, d_{-i}) \propto g\Big(j + \sum_{l \ne i} d_l\Big) \times \mathrm{N}\!\left(y_i \,\Big|\, \mu_j, \frac{1}{\lambda_j}\right).
\]
Here we take g(D) = q^D and assign a prior distribution to q, which we take to be the Beta(1/2, 1/2) distribution. Then, taking the prior N(· | 0, τ^{-1}) for the (µ_j), with τ = 0.001, and the prior Ga(· | a, b) for the (λ_j), with a = 0.5 and b = 0.1, we have
\[
p(\mu_j \mid \lambda_j, y, d) = \mathrm{N}\!\left(\mu_j \,\Big|\, \frac{n_j \lambda_j \bar y_j}{n_j \lambda_j + \tau},\; \frac{1}{n_j \lambda_j + \tau}\right),
\]
\[
p(\lambda_j \mid \mu_j, y, d) = \mathrm{Ga}\!\left(\lambda_j \,\Big|\, a + 0.5\, n_j,\; b + 0.5 \sum_{d_i = j} (y_i - \mu_j)^2\right).
\]
Finally, the conditional posterior for q is given by
\[
p(q \mid d) = \mathrm{Beta}(q \mid 0.5 + D - n,\; 0.5 + n).
\]
Hence, the sampler is very straightforward. At each iteration of the chain, we have a set of d_i s. We also have a set of θ_j s and so we know the locations of the clusters. To find the weights we estimate the functionals, for j = 1, ..., 10,
\[
w_j(d) = \int w_j\, p(w_j \mid d)\, dw_j
\]
with respect to the model p(d | y). Specifically, we take p(w_j | d) to be a stick-breaking model with the v_j s independent Beta(1, c) with c = 1. Thus we are estimating and computing
\[
\bar w_j = \sum_{d} w_j(d)\, p(d \mid y).
\]
In this analysis we are emphasising the fact that the model will, with high probability, return weights w̄_j which are decreasing in j. It will also be possible to identify the number of groups by the size of the w̄_j.

With the settings described in this section, we ran the Gibbs sampler for 50,000 iterations and so have 50,000 samples of d from p(d, θ | y). Hence, we have the estimates of the locations, i.e. the (µ̄_j, λ̄_j). From the output of the allocations we obtain the w̄_j; the results are as follows: w̄_1 = 0.867, w̄_2 = 0.088, w̄_3 = 0.036, with the remaining weights decreasing and with remaining mass of 0.009. The corresponding locations are given by
\[
(\bar\mu_1, \bar\lambda_1) = (21.39, 0.21), \quad (\bar\mu_2, \bar\lambda_2) = (9.72, 4.28), \quad (\bar\mu_3, \bar\lambda_3) = (32.55, 0.80).
\]
This clearly suggests 3 groups and provides the normal distributions which model the data. A figure of this mixture of 3 normals, with total mass of 0.991, is given in Figure 2, which for comparison includes a histogram representation of the data. We believe this is a good 3 normal mixture representation of the data. Moreover, it has been obtained via posterior parameter values rather than the usual predictive samples. Lau and Green (2007) also find that 3 components are appropriate, using a loss function approach to find a posterior estimate of the allocations, and do this by concentrating exclusively on p(d | y).
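As a concrete illustration of the sampler just described, here is a sketch (ours, not the authors' code) of the truncated Gibbs sampler with M components. The conditional updates follow the displays above; the µ_j update is the standard conjugate normal update implied by the N(0, τ^{-1}) prior, empty components are refreshed from their priors, and the data vector y is assumed to hold the observations (e.g. the 82 galaxy velocities). All function and variable names, and the file name in the usage line, are ours. The stored draws of d can then be passed to a routine such as expected_weights in the sketch of Section 3 to produce the w̄_j.

```python
# A sketch (ours, not the authors' code) of the truncated Gibbs sampler described
# above: M components, p(d) proportional to q^D with q ~ Beta(1/2, 1/2), kernel
# N(y | mu_j, 1/lambda_j), priors mu_j ~ N(0, 1/tau) and lambda_j ~ Ga(a, b).
import numpy as np

def gibbs_qD_mixture(y, M=10, a=0.5, b=0.1, tau=0.001, n_iter=5000, seed=1):
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    mu = rng.normal(np.mean(y), np.std(y), size=M)
    lam = np.ones(M)
    q = 0.5
    d = rng.integers(1, M + 1, size=n)             # labels 1, ..., M
    keep = []
    for _ in range(n_iter):
        # 1. Allocations: p(d_i = j | ...) proportional to q^j N(y_i | mu_j, 1/lam_j),
        #    the common factor q^{D_{-i}} cancelling across j.
        labels = np.arange(1, M + 1)
        logp = (labels * np.log(q)
                + 0.5 * np.log(lam)
                - 0.5 * lam * (y[:, None] - mu) ** 2)
        logp -= logp.max(axis=1, keepdims=True)
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)
        d = np.array([rng.choice(labels, p=p[i]) for i in range(n)])
        # 2. Component parameters (empty components are drawn from their priors).
        for j in labels:
            yj = y[d == j]
            nj = yj.size
            prec = nj * lam[j - 1] + tau
            mean = lam[j - 1] * yj.sum() / prec if nj > 0 else 0.0
            mu[j - 1] = rng.normal(mean, 1.0 / np.sqrt(prec))
            lam[j - 1] = rng.gamma(a + 0.5 * nj,
                                   1.0 / (b + 0.5 * np.sum((yj - mu[j - 1]) ** 2)))
        # 3. q | d ~ Beta(0.5 + D - n, 0.5 + n), with D = sum_i d_i.
        q = rng.beta(0.5 + d.sum() - n, 0.5 + n)
        keep.append(d.copy())
    return keep, mu, lam

# e.g. samples, mu, lam = gibbs_qD_mixture(np.loadtxt("galaxy.txt"), n_iter=50_000)
```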
Figure 2 here

4.2 Toy example 2: Equal group sizes

Here we simulate data so that there are clearly two groups of equal size. Hence, for i = 1, ..., 20 we generate observations from N(−20, 1) and for i = 21, ..., 40 we have observations from N(20, 1). Given this set up, we modify the prior for q, allowing us to ensure no more than 2 groups with high probability. We aim to show that the label switching problem is solved with high probability due to the fact that the chain does not move between the two modes, after a suitable burn-in to deal with the starting position, even though the modes have equal probability. Now
\[
P(d_1, \ldots, d_n \le k \mid q) = (1 - q^k)^n,
\]
and hence, if we want no more than 2 groups with high probability, we should make q about 0.01. In fact we choose the prior for q as Beta(1, 100). We could fix M = 2 but we prefer the prior route.

Running the chain for 20,000 iterations, and with a burn-in of 15,000, we collected 5,000 samples and estimated the weights accordingly: w̃_1 = 0.51 and w̃_2 = 0.49, with locations estimated as µ̃_1 = 19.92 and µ̃_2 = −19.72. In the last 5,000 iterations the chain did not switch between modes. This lack of ability to jump can be understood in the following way: at some point in the chain all of the 1's (20 of them) would need to switch to 2, which effectively adds 20 to D, making q^D so small that the move cannot happen.

We also ran such an example, now adding 5 further observations from N(0, 1) and putting the prior for q as uniform on (0, 1). With the same settings for the MCMC, we now obtain w̃_1 = 0.45, w̃_2 = 0.42 and w̃_3 = 0.11, with locations estimated as µ̃_1 = 19.73, µ̃_2 = −20.06 and µ̃_3 = 1.04. The last value might seem high, but the mean of the samples from the N(0, 1) distribution was 1.2.
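The calibration argument used above can be checked numerically; a quick sketch (ours) follows. Under p(d) ∝ q^D the d_i are independent and identically distributed, so P(max_i d_i ≤ k | q) = (1 − q^k)^n serves as a proxy for the probability of at most k groups, and the Beta(1, 100) choice, together with the q^20 penalty for relabelling a block of twenty observations, can be evaluated directly.

```python
# A quick numerical check (ours) of the prior calibration for the equal-group example.
import numpy as np

def p_max_leq_k(k, n, q):
    # P(max_i d_i <= k | q) = (1 - q^k)^n under the prior p(d) proportional to q^D
    return (1.0 - q ** k) ** n

n = 40
print(p_max_leq_k(2, n, 0.01))                  # ~0.996: q near 0.01 keeps at most 2 labels

rng = np.random.default_rng(0)
q_draws = rng.beta(1.0, 100.0, size=100_000)    # the Beta(1, 100) hyper-prior used above
print(np.mean(p_max_leq_k(2, n, q_draws)))      # prior probability of at most 2 labels

# The no-switching argument: relabelling the twenty 1's as 2's adds 20 to D,
# multiplying p(d) by q^20 -- about 1e-40 when q is near 0.01.
print(0.01 ** 20)
```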
5 Summary

The idea of the paper is to build the mixture model in a different way to the usual starting point of
\[
f(y) = \sum_{j=1}^{M} w_j\, \mathrm{N}(y \mid \mu_j, 1/\lambda_j).
\]
We start by constructing a prior model p(d), such that with high probability the larger groups are assigned the lower allocation labels, and there are no gaps. We then adopt the standard p(y | d) from the mixture model, and hence obtain p(d | y). With d from p(d | y) we have easy access to p(θ, w | y, d), i.e. the weights and locations of the groups.

In order to better grasp the idea behind our proposal, let us define, for each allocation S = (k, C_1, ..., C_k), the d_S which minimizes the sum D. Then, for an equivalent allocation vector d*_S, which yields an identical (k, C_1, ..., C_k) in terms of clustering, but with different labels, we have that the ratio p(d_S)/p(d*_S) increases as Σ_i d*_{iS} increases. It is precisely this property which emphasises the importance of our choice of p(d). Even more interesting is the fact that this property also carries over to the posterior, i.e.
\[
\frac{p(d_S \mid y)}{p(d^*_S \mid y)} = \frac{p(d_S)}{p(d^*_S)},
\]
since p(y | d*_S) = p(y | d_S), and this is the key to the idea, specifically to solving the label switching problem. So the posterior really only focuses on the d which, for any choice of (k, C_1, ..., C_k), minimize D. Indeed, this then yields an interpretable p(d | y) from which we can estimate the weights via the distribution of choice p(w | d) and then estimate functionals of choice. The point is that the d coming from p(d | y) are “good” allocation variables.

Acknowledgements

The paper was completed during a visit by the second author to UNAM, Mexico, with support from CONACyT project 131179. The first author gratefully acknowledges the support of PAPIIT-UNAM project IN100411. The authors would like to thank the Editor, an Associate Editor and two referees for their comments on an earlier version of the paper.

References

Casella, G., Robert, C.P. and Wells, M.T. (2004). Mixture models, latent variables and partitioned importance sampling. Statistical Methodology 1, 1–18.

Celeux, G., Hurn, M. and Robert, C. (2000). Computational and inferential difficulties with mixture posterior distributions. J. Amer. Statist. Assoc. 95, 957–970.

Chung, H., Loken, E. and Schafer, J.L. (2004). Difficulties in drawing inferences with finite-mixture models. Amer. Statist. 58, 152–158.

Dahl, D.B. (2009). Modal clustering in a class of product partition models. Bayes. Anal. 4, 243–264.

Diebolt, J. and Robert, C.P. (1994). Estimation of finite mixture distributions through Bayesian sampling. J. Roy. Statist. Soc. Ser. B 56, 363–375.

Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209–230.

Ferguson, T.S. (1983). Bayesian density estimation by mixtures of normal distributions. In Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday, eds: M.H. Rizvi, J. Rustagi and D. Siegmund, Academic Press: New York.

Fuentes-García, R., Mena, R.H. and Walker, S.G. (2009). A nonparametric dependent process for Bayesian regression. Statist. Probab. Lett. 79, 1112–1119.

Fuentes-García, R., Mena, R.H. and Walker, S.G. (2010). A new Bayesian nonparametric mixture model. Comm. Statist. Sim. Comput. 39, 669–682.

Fuentes-García, R., Mena, R.H. and Walker, S.G. (2010). A probability for classification based on the mixture of Dirichlet process model. Journal of Classification 27, 389–403.

Frühwirth-Schnatter, S. (2001). Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. J. Amer. Statist. Assoc. 96, 194–209.

Hjort, N.L., Holmes, C.C., Müller, P. and Walker, S.G. (Eds.) (2010). Bayesian Nonparametrics. Cambridge University Press, Cambridge.

Kalli, M., Griffin, J.E. and Walker, S.G. (2011). Slice sampling mixture models. Statistics and Computing 21, 93–105.

Lau, J.W. and Green, P.J. (2007). Bayesian model based clustering procedures. J. Comp. Graphical Statist. 16, 526–558.

Lijoi, A. and Prünster, I. (2010). Models beyond the Dirichlet process. In Bayesian Nonparametrics (Hjort, N.L., Holmes, C.C., Müller, P. and Walker, S.G., Eds.), Cambridge University Press.

Lo, A.Y. (1984). On a class of Bayesian nonparametric estimates I. Density estimates. Ann. Statist. 12, 351–357.

Marin, J., Mengersen, K. and Robert, C. (2005). Bayesian modelling and inference on mixtures of distributions. In Handbook of Statistics (D. Dey and C. Rao, eds.), vol. 25. Elsevier-Sciences.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley-Interscience, New York.

Nobile, A. (1994). Bayesian Analysis of Finite Mixture Distributions. Ph.D. dissertation, Department of Statistics, Carnegie Mellon University, Pittsburgh. Available at http://www.stats.gla.ac.uk/~agostino

Nobile, A. and Fearnside, A.T. (2007). Bayesian finite mixtures with an unknown number of components: the allocation sampler. Statistics and Computing 17, 147–162.
Papastamoulis, P. and Iliopoulos, G. (2010). An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions. J. Comp. Graphical Statist. 19, 313–331.

Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philos. Trans. Roy. Soc. A 185, 71–110.

Quintana, F.A. and Iglesias, P.L. (2003). Bayesian clustering and product partition models. J. Roy. Statist. Soc. Ser. B 65, 557–574.

Richardson, S. and Green, P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B 59, 731–792.

Roeder, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. J. Amer. Statist. Assoc. 85, 617–624.

Rodriguez, C.E. and Walker, S.G. (2014). Label switching in Bayesian mixture models: Deterministic relabeling strategies. J. Comp. Graphical Statist. 6, 145–178.

Titterington, D.M., Smith, A.F.M. and Makov, U.E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons Ltd.

Stephens, M. (2000). Dealing with label switching in mixture models. J. Roy. Statist. Soc. Ser. B 62, 795–809.

Wang, L. and Dunson, D.B. (2012). Fast Bayesian inference in Dirichlet process mixture models. J. Comp. Graphical Statist. 20, 198–216.

Figure 1: Histogram for the small data set.

Table 1: Highest exact posterior probabilities for d for the toy example data set, where d^(r) has the rth highest posterior probability. Note that the largest group is allocated label 1 and the second largest group is allocated label 2 for the configuration with the highest posterior probability.

            d(1)   d(2)   d(3)   d(4)   d(5)   d(6)   d(7)   d(8)   d(9)
d1           2      2      1      3      2      2      2      3      2
d2           2      2      1      2      3      2      2      3      2
d3           2      2      1      2      2      2      3      2      3
d4           2      3      1      2      2      2      2      2      3
d5           1      1      2      1      1      3      1      1      1
d6           1      1      2      1      1      1      1      1      1
d7           1      1      2      1      1      1      1      1      1
d8           1      1      2      1      1      1      1      1      1
d9           1      1      2      1      1      1      1      1      1
d10          1      1      2      1      1      1      1      1      1
p(d | y)   0.933  0.027  0.009  0.008  0.004  0.004  0.003  0.002  0.002

Figure 2: Galaxy data set and its estimation using a mixture of 3 normal distributions.