On the Bayesian mixture model and identifiability

Ramsés H. Mena
Universidad Nacional Autónoma de México, Mexico

Stephen G. Walker
University of Texas at Austin, USA
This paper is concerned with Bayesian mixture models and identifiability issues. There are two sources of unidentifiability: the well-known likelihood invariance under label switching and the perhaps less well-known parameter identifiability problem. When using latent allocation variables determined by the mixture model, these sources of unidentifiability create an arbitrary labeling which renders estimation of the model very difficult. We endeavour to tackle these problems by proposing a prior distribution on the allocations which provides an explicit interpretation for the labeling by removing gaps with high probability. An MCMC estimation method is proposed and supporting illustrations are presented.
Keywords: Allocation variable; Gibbs sampler; Identifiability; Label
switching; Mixture model.
1 Introduction
The random allocation of objects into clusters or groups is of fundamental interest in statistics and other areas that interact with it. A common way to assign
a distribution able to disentangle the different features of observations into homogeneous clusters, e.g. characterized by the same parametric family K(· | θ),
is through mixture models. Under such a framework, independent and identically distributed observations, y := (y_i)_{i=1}^{n}, are modeled by
(1)    f(y_i | w, θ) = ∑_{j=1}^{M} w_j K(y_i | θ_j),    M ≥ 1,

where the weights w := (w_j)_{j=1}^{M}, adding up to one, and locations θ := (θ_j)_{j=1}^{M} are instrumental and might be considered as random under complete uncertainty.
The literature backing up this approach is vast, ranging from its origins in Pearson (1894) to expositions such as the ones presented by Titterington, Smith and Makov (1985), McLachlan and Peel (2000) or Marin, Mengersen and Robert (2005). In particular, the case M = ∞ was first considered and studied by Lo (1984), where the weights and locations are derived from the Dirichlet process (Ferguson, 1973). Indeed, this constitutes the workhorse for Bayesian nonparametric inference, for density estimation and also for clustering problems. A recent review in this direction can be found in Hjort, Holmes, Müller and Walker (2010).
An alternative way to write such a model is via an allocation variable, which tells us which cluster each observation belongs to. That is,

(2)    f(y_i | d_i) = K(y_i | θ_{d_i})    with    P(d_i = j | w) = w_j,
independently for i = 1, . . . , n. The advantage of this latter representation is that there is an explicit characterization of the distribution of the group membership of each observation. Under full uncertainty about the clustering structure featuring y, the natural choice is to have a symmetric distribution such as the multinomial. However, care should be taken as, under such a scenario, the likelihood is invariant under the many possible permutations of the indices, namely

p(y_1, . . . , y_n | d_1, . . . , d_n) = p(y_1, . . . , y_n | ρ(d_1), . . . , ρ(d_n))

for any permutation ρ of the allocation variables d := (d_i)_{i=1}^{n}. Furthermore, under symmetric priors, across components, for θ and w, this invariance would be inherited in posterior inference. Clearly, having such an invariance implies model unidentifiability.
In particular, this issue is well known to cause the label switching problem when performing MCMC based inference; namely, the likelihood invariance prevents a proper identification of a grouping structure. We refer to Titterington, Smith and Makov (1985), Celeux, Hurn and Robert (2000) and Stephens (2000) for deeper discussions of the label switching problem. There are various proposals dealing with the label switching problem, some aimed at selecting a particular permutation, e.g. by means of a loss function (Frühwirth-Schnatter, 2001; Nobile and Fearnside, 2007; Papastamoulis and Iliopoulos, 2010; Rodriguez and Walker, 2014), and others aimed at imposing parameter constraints, so as to break the aforementioned symmetry and favor a single labeling (Diebolt and Robert, 1994; Richardson and Green, 1997; Chung, Loken and Schafer, 2004).
Having said that, there is a seemingly related but different identifiability issue under the above approach, namely the correct identification of the components that best describe y. In other words, besides the label switching problem, there is the problem that different parameter values (w, θ) could lead to the same value of f(·). To expand on this kind of unidentifiability, notice that we can recover a standard normal for f(·) with w_1 = 1 and K(· | θ_1) = N(· | 0, 1), but also with w_1 taking any value, w_2 = 1 − w_1, K(· | θ_1) = N(· | 0, 1) and K(· | θ_2) ≈ N(· | 0, 1). Indeed, an observation in this direction can be traced back to Ferguson (1983), who states the following about the parameters of the model: “We are not interested in estimating the parameters of the model ... It would not make much sense to do so since the parameters are not identifiable.”
Solving these unidentifiability issues simultaneously, i.e. likelihood invariance under label switching and parameter unidentifiability, is still a challenging problem. Our attempt to solve them is based on a different starting point, namely by suitably constructing a prior distribution on the allocation variables d, say p(d). Specifically, the idea is to attach some meaning to such allocations that removes the gaps and at the same time solves such unidentifiability problems with high probability. We then connect this to the mixture model via the standard conditional distributions p(y, θ | d) and p(w | d).
Due to the fact that we start without the weights, we need to assign a prior
distribution on the allocations. Some approaches in this direction, e.g. Casella,
Robert and Wells (2004), Nobile (1994) and Nobile and Fearnside (2007), follow
this idea and construct MCMC samplers to draw posterior inference on the clustering structure of the data using p(d | y). However, having such a large support
for the sequence d, not all weight distributions will lead to a “good” p(d), and thus a “good” p(d | y).
There are some proposals working directly with p(d | y); but our contention is that if the form for p(d) has not been chosen astutely then the problems of poor allocations will persist. Attempts to solve the problem include constraining the support of d at this stage: set d_1 = 1, then sample d_2 from p(d_2 | d_1, y) ∝ p(d_1, d_2 | y) with d_2 restricted to the choice of (1, 2). In general, take d_i = j from the density proportional to

p(d_1, . . . , d_{i−1}, j | y)

with j restricted to (1, . . . , k + 1), where k = max{d_1, . . . , d_{i−1}}. However, this constraint on the d leads to a non-exchangeable model, and so the result depends on which observation is assigned as y_1, and so on. The benefit is that sampling is direct and avoids MCMC, since we can calculate p(d | y) when the d are given or the choice of d values is limited. However, to avoid the order problem, one would need to repeat the process for each permutation of the first n integers. See Wang and Dunson (2012).
In a similar direction, Quintana and Iglesias (2003) proposed the idea of using product partition models, where weights are not considered and only the model f(y | d) is constructed. In this case, it is only necessary to consider f(y | k, C_1, . . . , C_k), where (k, C_1, . . . , C_k) define a partition of the observations into k groups and a prior is placed on it. Lau and Green (2007) propose an optimal clustering procedure via the application of a specific loss function that penalises pairs of observations assigned to different clusters when they should be in the same one. In a related proposal, in one dimension only and without the need of MCMC methods, Dahl (2009) proposes an algorithm to find the modal partition based on a reallocation of overlapping partitions. Although these approaches are very appealing when looking for the most appropriate clustering of observations, the aforementioned unidentifiability issues persist.
Our aim is to construct p(d) so that, with high probability, label 1 is assigned to the largest group, label 2 to the second largest group, and so on, without allowing for gaps. More accurately, we want the labels to be associated with a non-increasing sequence of group sizes, so we do not make an issue of equally sized groups. This will remove the label switching problem with high probability. The parameter identifiability problem is caused by more groups being generated than are actually needed, and so p(d) will specifically penalize large numbers of groups.
The layout of the paper is as follows. In Section 2, we discuss and motivate our particular choice of prior for the allocation variables. In this section, we also present a toy example to illustrate the properties of the prior and posterior distributions. In Section 3, we describe the estimation method, via MCMC, to obtain samples from p(d | y), and then the procedure for estimating the weights associated with the allocations. Section 4 contains an illustration using MCMC methods to sample p(d, θ | y) and then calculate the weights. In particular, for the benchmark galaxy data set, we obtain a clear three component normal mixture representation of the data and provide the associated weights, means and variances of each component. In this section we also discuss what happens when we have groups of equal size. Section 5 concludes with a summary.
2 A prior distribution on allocation variables
In this section we describe a prior distribution for allocating n individuals to k ≤ n
groups. This can be removed from the context of the mixture model, though
obviously we will return to it later on. The aim, for reasons expanded on in the
introduction, is to put very high probability on the largest group having label 1,
the second largest group label 2, and so on. The prior needs to be exchangeable
for all n, that is:
(i) For all permutations ρ on {1, . . . , n}, and for all n ≥ 1, it is that

p(d_1, . . . , d_n) = p(d_{ρ(1)}, . . . , d_{ρ(n)}).
(ii) It is that

∑_{j=1}^{∞} p(d_1, . . . , d_n, j) = p(d_1, . . . , d_n)

for all n ≥ 1.
It is well known that d = (d_1, . . . , d_n) defines k, the number of distinct d_i s in d, and (C_1, . . . , C_k), the groups. That is, C_1 is a group in which the d_i s all share the same value. However, the labeling, or the value that these d_i s take, is irrelevant; so C_1 has no implication other than that it describes one of the groups. In particular,

C_j ≠ {i : d_i = j},

since we have labeled the C_j s from 1 to k and yet it is not the case, in general, that the d_i s are labeled from 1 to k.
To think about how we plan to construct p(d), we ask how we should deal with (1, 2, 2, 4) and (1, 1, 2, 3). We do not need both of them. We do not want the gap associated with leaving out the 3 in (1, 2, 2, 4), and we actually want the group of size 2 to be labeled 1. So what is needed is a model for which

p(1, 1, 2, 3) ≫ p(1, 2, 2, 4).

That is, we need the labeling to be biased towards (1, 1, 2, 3) rather than (1, 2, 2, 4).
There are two things going on here. The first is that we do not want arbitrary numbers representing the groups, so we want no gaps; and the second is that we want the size of the group to determine the label, so the larger groups occupy the smaller labels. This is not going to be 100% precise, but the ambition of this direction should be a motivation when proposing a p(d). It is clear from looking at (1, 1, 2, 3) that it minimizes the sum of the d_i s over all labelings with the same (k, C_1, . . . , C_k), and this forms the basis of the prior.
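To make the target labeling concrete, the following sketch (our own illustration, with hypothetical names) maps any allocation vector to the gap-free labeling in which larger groups receive smaller labels, ties being broken by order of appearance; this is the labeling that minimizes D for a given configuration.

```python
from collections import Counter

def canonical_labels(d):
    """Relabel an allocation vector so that the largest group gets label 1,
    the second largest label 2, and so on, with no gaps."""
    sizes = Counter(d)
    # sort old labels by decreasing group size (ties broken by first appearance)
    order = sorted(sizes, key=lambda lab: (-sizes[lab], d.index(lab)))
    relabel = {old: new + 1 for new, old in enumerate(order)}
    return [relabel[di] for di in d]

print(canonical_labels([1, 2, 2, 4]))  # -> [2, 1, 1, 3]: the size-2 group gets label 1, no gaps
```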
The thinking is therefore straightforward: our proposal for having no gaps and for the larger groups to be labelled with the smaller integers is to have

p(d) ∝ g(D),    where    D = ∑_{i=1}^{n} d_i,    D ≥ n,

for some decreasing function g(·). Clearly, this will meet the exchangeability requirement (i) provided we make sure that

∑_d g(D) < +∞.

We also need to demonstrate (ii), which puts some restrictions on the choice of g(·). The preferred choice is

g(D) = q^D
for some 0 < q < 1, which clearly meets condition (ii). To see this, write

p(d) = c^n q^D,

so that (ii) requires

c^{n+1} q^D ∑_{j=1}^{∞} q^j = c^{n+1} q^D q/(1 − q) = c^n q^D,

and hence we need c = 1/q − 1. In the illustration we will assign a hyper-prior to q.
The construction of a suitable prior for q can be based on the result that

P( max_{1≤i≤n} d_i ≤ k ) = ∫ (1 − q^k)^n π(dq),

and hence one can find a π which matches some probability on the number of groups, since

P( max_{1≤i≤n} d_i ≤ k ) ≈ P(≤ k groups).
A more general class of g(·) is

g(D) = Γ(D − n + β) / Γ(D + α + β),

for some α, β > 0, though we will not pursue this case in the present paper.
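As a small check of the labeling bias, the sketch below (our own illustration, not code from the paper) evaluates the prior under g(D) = q^D, using the product form p(d | q) = ∏_i (1 − q) q^{d_i − 1} implied by c = 1/q − 1, and also the group-count probability (1 − q^k)^n used to match a prior belief on the number of groups.

```python
import numpy as np

def log_prior_d(d, q):
    """log p(d | q) for the allocation prior p(d) = c^n q^D with c = 1/q - 1,
    i.e. the d_i are i.i.d. geometric with success probability 1 - q."""
    d = np.asarray(d)
    return len(d) * np.log(1.0 - q) + (d.sum() - len(d)) * np.log(q)

q = 0.1
for d in ([1, 1, 2, 3], [1, 2, 2, 4]):
    print(d, np.exp(log_prior_d(d, q)))
# The gap-free labeling (1,1,2,3) has D = 7 and is 1/q^2 = 100 times more
# probable than (1,2,2,4), which has D = 9.

def prob_at_most_k_groups(q, k, n):
    """P(max_i d_i <= k | q) = (1 - q^k)^n, used to match a prior belief
    on the number of groups."""
    return (1.0 - q**k) ** n

print(prob_at_most_k_groups(q=0.1, k=3, n=10))
```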
Now, returning to the mixture model; this will yield us a p(y | d) which we can combine with p(d) to obtain the posterior p(d | y). The p(y | d) which arises from a mixture model has the form

(3)    p(y | d) = ∏_{j=1}^{k} ∫ ∏_{i∈C_j} K(y_i | θ_j) π(dθ_j).

This does not involve any weight parameters, since they have been superseded by the d. Hence, in some way, we are building up a mixture model which is fundamentally different to the usual starting point.

For (3), if a conjugate prior has been chosen for the θ_j s, written as π(θ), we can integrate out the θ_j s to obtain

p(y | d) = p(y | C_1, . . . , C_k, k).
For example, in the particular case when

K(y | θ_j) = N(y | μ_j, 1/λ_j)

and the conjugate prior N(μ | 0, (τλ)^{−1}) × Ga(λ | a, b) is chosen independently for each (μ_j, λ_j), we obtain

p(y | k, C_1, . . . , C_k) ∝ ∏_{j=1}^{k} Γ(a + n_j/2) / { (n_j + τ)^{1/2} [ b + 0.5 ( s_j^2 − n_j^2 ȳ_j^2 / (n_j + τ) ) ]^{a + n_j/2} },

where

ȳ_j = n_j^{−1} ∑_{i∈C_j} y_i,    n_j = |C_j|    and    s_j^2 = ∑_{i∈C_j} y_i^2.
This form for p(y | C_1, . . . , C_k, k) is the classic clustering probability once k, the number of groups, is known. In other words, it provides the likelihood for “who is with who”.
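As a worked illustration, the following sketch (our own, with hypothetical function and variable names) evaluates this clustering likelihood on the log scale for the normal kernel with its normal-gamma conjugate prior; it computes the displayed product, i.e. the likelihood of a given grouping up to factors that do not depend on how the observations are grouped for a fixed number of groups.

```python
import numpy as np
from scipy.special import gammaln

def log_clustering_score(y, clusters, a=1.0, b=1.0, tau=0.001):
    """log of prod_j Gamma(a + n_j/2) / [ (n_j + tau)^{1/2} *
       (b + 0.5 (s_j^2 - n_j^2 ybar_j^2 / (n_j + tau)))^{a + n_j/2} ],
    i.e. the displayed p(y | k, C_1, ..., C_k) up to a constant
    that does not depend on the grouping for fixed k."""
    y = np.asarray(y, dtype=float)
    total = 0.0
    for C in clusters:                 # each C is a list of observation indices
        yc = y[list(C)]
        nj = len(yc)
        ybar = yc.mean()
        s2 = np.sum(yc**2)
        B = b + 0.5 * (s2 - nj**2 * ybar**2 / (nj + tau))
        total += gammaln(a + nj / 2) - 0.5 * np.log(nj + tau) - (a + nj / 2) * np.log(B)
    return total

# e.g. compare two groupings of a small sample: the well-separated one scores higher
y = [-1.5, -1.3, -0.9, 3.1, 3.4, 3.9]
print(log_clustering_score(y, [[0, 1, 2], [3, 4, 5]]))
print(log_clustering_score(y, [[0, 1, 3], [2, 4, 5]]))
```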
Our model therefore will be of the type

p(d, y) ∝ g(D) p(y | C_1, . . . , C_k, k),

where g(D) is a decreasing function of D. Clearly, we have

p(y | d) = p(y | C_1, . . . , C_k, k)    while    p(d) ∝ g(D).
Hence, the model splits neatly into two components: the g(D) which determines
the labeling, and p(y | d) which decides “who is with who”.
In order to have a clearer idea of the posterior distribution on d, and to see how our idea works, here we consider a small dataset where we can compute p(d | y) exactly. Specifically, we consider n = 10 data points given by

−1.521, −1.292, −0.856, −0.104, 2.388, 3.079, 3.312, 3.415, 3.922, 4.194.

See Figure 1 for a histogram plot of the data. This data set has been considered before by Fuentes-García, Mena and Walker (2010) in a similar context. To compute the exact posterior values at each d we use the same kernel and prior, on the θ_j s, specified earlier in this section, with p(d) ∝ q^D and M = 3. In this case we fix M so we can do all the calculations exactly, though in practice, as we will see, M = ∞ can be dealt with using MCMC methods. That is, we have M^n possible values for d. Setting a = 5, b = 1, τ = 0.001 and q = 0.1, we obtain that the mode is placed at (d_1, d_2, . . . , d_{10}) = (2, 2, 2, 2, 1, 1, 1, 1, 1, 1), which has posterior probability 0.933. Table 1 shows the top nine posterior values for d as well as their corresponding probabilities. It is worth noting that increasing the value of M, e.g. to 4, 5, . . ., does not lead to any significant difference in the reported posterior values; we computed the corresponding posterior values for M = 4 and M = 5, but do not report them here as they show no appreciable difference. From these results it is evident that the highest mass is placed on allocation values where the largest group is identified by label 1 and the second largest by label 2.
Figure 1 here
Table 1 here
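The exact calculation for this toy example can be reproduced by brute force; the sketch below (our own reconstruction under the stated settings a = 5, b = 1, τ = 0.001, q = 0.1 and M = 3) enumerates all M^n = 59,049 allocation vectors. Unlike the proportional form of the clustering likelihood above, it keeps the full per-cluster normalizing constant of the normal-gamma marginal, since that constant enters once per occupied component and therefore matters when allocations with different numbers of occupied components are compared.

```python
import itertools
import numpy as np
from scipy.special import gammaln

y = np.array([-1.521, -1.292, -0.856, -0.104, 2.388,
              3.079, 3.312, 3.415, 3.922, 4.194])
a, b, tau, q, M = 5.0, 1.0, 0.001, 0.1, 3

def log_marginal_cluster(yc):
    """log of int prod_i N(y_i | mu, 1/lam) dN(mu | 0, (tau*lam)^{-1}) dGa(lam | a, b),
    including all constants."""
    n = len(yc)
    s2, ybar = np.sum(yc**2), np.mean(yc)
    B = b + 0.5 * (s2 - n**2 * ybar**2 / (n + tau))
    return (-0.5 * n * np.log(2 * np.pi) + 0.5 * (np.log(tau) - np.log(n + tau))
            + a * np.log(b) - gammaln(a) + gammaln(a + n / 2) - (a + n / 2) * np.log(B))

log_post = {}
for d in itertools.product(range(1, M + 1), repeat=len(y)):
    d = np.array(d)
    lp = d.sum() * np.log(q)                       # p(d) proportional to q^D
    for j in set(d):                               # only occupied components contribute
        lp += log_marginal_cluster(y[d == j])
    log_post[tuple(d)] = lp

lps = np.array(list(log_post.values()))
probs = np.exp(lps - lps.max())
probs /= probs.sum()
keys = list(log_post.keys())
for t in np.argsort(-probs)[:3]:
    print(keys[t], round(probs[t], 3))   # the mode should be (2,2,2,2,1,1,1,1,1,1)
```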
In summary, for any fixed configuration S = (k, C_1, . . . , C_k) there exists a d_S which allocates the largest group to label 1, the second largest to label 2, and so on. This d_S will minimize, over all other d^*_S yielding the same configuration, the sum D = ∑_i d_i. Hence the choice of prior of the type p(d) ∝ g(D).
It is important to point out here what the consequences are if we have two groups of exactly the same size. To see what happens, we can re-do the toy example giving both groups 5 observations each, e.g. by changing 2.388 to −2 in the above data set. In this case we find that the posterior probability of (1, 1, 1, 1, 1, 2, 2, 2, 2, 2) is exactly the same as that of (2, 2, 2, 2, 2, 1, 1, 1, 1, 1), i.e. equal to 0.466 each. This is expected, and it is what causes the label switching problem when running MCMC algorithms, namely the Markov chain will spend 50% of the time in one mode and 50% in the other. However, using our p(d), the chain we run will not in practice switch between modes, due to the extremely low probability of doing so. This point is expanded on in Section 4.2.
3 Estimation
Given the dimension of the support for the allocation variables, we need to obtain p(d | y) using MCMC methods. We actually do this with the locations included in the estimating model, so we sample from p(d, θ | y). Hence, the MCMC is based on the conditional distributions

p(d_i | θ, y_i, d_{−i})    and    p(θ_j | d, y),

for i = 1, . . . , n, and j = 1, 2, . . ., respectively. Indeed, both of these conditional distributions are easy to find. A specific case will be given in Section 4. In the finite M case it is direct, since each 1 ≤ d_i ≤ M. In the infinite case we need to be careful about how to sample the (d_i) due to their infinite choice, but this can be resolved using latent variables, as in Kalli, Griffin and Walker (2011). To expand on this point here, we have
p(y, d | θ) ∝ g(D) p(y | k, C_1, . . . , C_k, θ),

and we can introduce the latent variable u so that the joint density with (d, y) is given by

p(y, d, u) ∝ 1(u < g(D)) p(y | k, C_1, . . . , C_k, θ).

Then, when it comes to the sampling of d_i, we have

p(d_i = j | y_i, θ) ∝ g(D_{−i} + j) N(y_i | μ_j, 1/λ_j),

where

D_{−i} = ∑_{i′ ≠ i} d_{i′},

and so d_i is constrained to lie in the finite set

A_i = {j : g(D_{−i} + j) > u} ≡ {j : j ≤ g^{−1}(u) − D_{−i}}.
At each iteration we would simulate a sufficient number of the θ_j values to be able to find the A_i, and hence we would need (θ_j)_{j=1}^{J} for J = max_i {g^{−1}(u) − D_{−i}}.
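For the choice g(D) = q^D the truncation set has a simple closed form, since q^{D_{−i}+j} > u is equivalent to j ≤ log(u)/log(q) − D_{−i}. A minimal sketch (ours, with hypothetical names):

```python
import numpy as np

def slice_support(u, q, D_minus_i):
    """Largest label J such that g(D_{-i} + j) = q^(D_{-i} + j) > u for all j <= J;
    d_i is then sampled from the finite set {1, ..., J} as described in the text."""
    J = int(np.floor(np.log(u) / np.log(q) - D_minus_i))
    return max(J, 1)   # the current value of d_i always satisfies the slice constraint

# example: u drawn as Uniform(0, q^D) at the previous step of the sampler
q, D = 0.1, 25
u = np.random.uniform(0, q**D)
print(slice_support(u, q, D_minus_i=D - 3))
```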
Once this is done, we can use the MCMC estimate of p(d | y) and draw inference about any functional of (d, y), say β(d, y). In particular, we could estimate

β(y) = ∑_d β(d, y) p(d | y).
In the basic mixture model (1), estimation problems of the type

β*(y) = ∫ γ(θ, w) p(θ, w | y) dθ dw,

for some function γ(θ, w), make no sense due to the fact that the (θ, w) are unidentifiable. However, if we condition on the allocation variables, which are from p(d | y), then estimation becomes identifiable (with very high probability), and hence we can estimate functionals such as

β(d, y) = ∫ γ(θ, w) p(θ, w | d, y) dθ dw

via

β(y) = ∑_d β(d, y) p(d | y).
Here,

p(θ, w | d, y) = p(w | d) p(θ | d, y)

will be some posterior for estimating the (w, θ). In fact we can do this for the standard Dirichlet process mixture model. The key is to think of this as estimation of the functional β(d, y) with respect to the model p(d | y). In this case, we have w_1 = v_1 and, for j > 1,

w_j = v_j ∏_{l<j} (1 − v_l),

where each v_j is an independent Beta(1, c) variable, for some c > 0, and then it is easy to show that, independently for each j,

p(v_j | d) = Beta( v_j | 1 + #{d_i = j}, c + #{d_i > j} ).

The form for p(θ | y, d) is also well known.
So, for example, a particular γ(θ, w) might be given by

γ(θ, w) = ∑_{j=1}^{∞} 1(w_j > ε)

for some small ε > 0, which would tell us how many groups have weights larger than ε; or simply γ(θ, w) = w.
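Given a sampled allocation vector, these Beta conditionals make such functionals easy to evaluate. The sketch below (ours, with hypothetical names) computes E[w_j | d] under the stick-breaking construction with v_j ~ Beta(1, c) a priori, together with the number of weights exceeding a threshold ε.

```python
import numpy as np

def expected_weights(d, c=1.0, J=None):
    """E[w_j | d] under the stick-breaking model with v_j ~ Beta(1, c) a priori, so that
    p(v_j | d) = Beta(1 + n_j, c + m_j) with n_j = #{d_i = j} and m_j = #{d_i > j};
    E[w_j | d] = E[v_j | d] * prod_{l<j} E[1 - v_l | d] by independence of the v_j."""
    d = np.asarray(d)
    J = J if J is not None else d.max() + 3      # a few labels beyond the largest used
    w = np.empty(J)
    stick = 1.0                                  # running value of prod_{l<j} E[1 - v_l | d]
    for j in range(1, J + 1):
        nj = np.sum(d == j)
        mj = np.sum(d > j)
        Evj = (1 + nj) / (1 + nj + c + mj)
        w[j - 1] = stick * Evj
        stick *= 1 - Evj
    return w

d = [2, 2, 2, 2, 1, 1, 1, 1, 1, 1]               # e.g. the modal allocation of the toy example
w = expected_weights(d, c=1.0)
print(np.round(w, 3))
print("groups with weight > 0.05:", int(np.sum(w > 0.05)))
```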
Here we can see clearly the benefit of minimizing D with high probability. If the labeling is arbitrary, then we will see d_i = j for larger values of j than are strictly necessary. This suggests that the number of w_j s we should be interested in, i.e. those larger than ε, is more than is actually needed to model the data. Keeping the same structure, i.e. “who is with who”, while minimizing the labeling, means we are interested only in weights which are actually needed.
4 Illustrations
In this section we look at two illustrations; one involving the benchmark galaxy data set and the other a simulated data set in which the observations fall into one of two distinct groups of equal size. It is well known that the label switching problem arises from the Markov chain switching between modes. If these modes are asymmetric, then the chain will spend predominantly all, if not all, of the time in one mode of d, since one has much higher probability than the others. This mode will be the one with non-increasing group sizes coinciding with the increasing labels. However, it is worth looking at the case when two modes have the same probability, see the toy example in Section 2, which occurs when two groups have exactly the same size. In this case we will demonstrate that jumping between modes does not happen, due to the extremely low probability of doing so.
4.1 Galaxy data
Here we re-analyse the galaxy data set. This data set consists of n = 82 observations and was originally presented by Roeder (1990). To expand on the description of the MCMC given earlier, we select M = 10 as an upper bound on the number of clusters, and hence we only have (μ_j, λ_j)_{j=1}^{10} and ten weights. As we have mentioned before, handling the infinite case involves the introduction of a latent variable.
Thus, for j = 1, . . . , 10,

p(d_i = j | θ, y, d_{−i}) ∝ g( j + ∑_{l≠i} d_l ) × N(y_i | μ_j, 1/λ_j).

Here we take g(D) = q^D and take a prior distribution on q, which we have as the Beta(1/2, 1/2) distribution.
Then, taking the prior N(· | 0, τ^{−1}) for the (μ_j), with τ = 0.001, and the prior Ga(· | a, b) for the (λ_j), with a = 0.5 and b = 0.1, we have

p(μ_j | λ_j, y, d) = N( μ_j | n_j λ_j ȳ_j / (n_j λ_j + τ), 1/(n_j λ_j + τ) ),

p(λ_j | θ, y, d) = Ga( λ_j | a + 0.5 n_j, b + 0.5 ∑_{d_i = j} (y_i − μ_j)^2 ).

Finally, the conditional posterior for q is given by

p(q | d) = Beta(q | 0.5 + D − n, 0.5 + n).

Hence, the sampler is very straightforward.
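Putting these full conditionals together, a minimal sketch of one sweep of the Gibbs sampler (our own reconstruction with hypothetical names; finite case with M = 10 and g(D) = q^D, so that the common factor q^{D_{−i}} cancels and the allocation update only involves q^j) might look as follows; the galaxy observations are replaced here by simulated stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(y, d, mu, lam, q, tau=0.001, a=0.5, b=0.1):
    """One sweep of the sampler for the finite M case with g(D) = q^D."""
    n, M = len(y), len(mu)
    # 1. allocations: p(d_i = j | ...) proportional to q^j N(y_i | mu_j, 1/lam_j)
    for i in range(n):
        logp = (np.arange(1, M + 1) * np.log(q)
                + 0.5 * np.log(lam) - 0.5 * lam * (y[i] - mu) ** 2)
        p = np.exp(logp - logp.max())
        d[i] = rng.choice(M, p=p / p.sum()) + 1
    # 2. locations: conjugate normal and gamma updates
    for j in range(1, M + 1):
        yj = y[d == j]
        nj = len(yj)
        prec = nj * lam[j - 1] + tau
        mean = lam[j - 1] * yj.sum() / prec if nj > 0 else 0.0
        mu[j - 1] = rng.normal(mean, 1.0 / np.sqrt(prec))
        rate = b + 0.5 * np.sum((yj - mu[j - 1]) ** 2)
        lam[j - 1] = rng.gamma(a + 0.5 * nj, 1.0 / rate)   # numpy uses scale = 1/rate
    # 3. q: Beta(0.5 + D - n, 0.5 + n)
    q = rng.beta(0.5 + d.sum() - n, 0.5 + n)
    return d, mu, lam, q

# rough usage with 82 simulated observations standing in for the galaxy data
y = np.concatenate([rng.normal(10, 1, 30), rng.normal(21, 2, 40), rng.normal(33, 1, 12)])
M = 10
d = np.ones(len(y), dtype=int)
mu, lam, q = y.mean() + rng.normal(0, 1, M), np.ones(M), 0.5
for _ in range(2000):
    d, mu, lam, q = gibbs_sweep(y, d, mu, lam, q)
```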
At each iteration of the chain, we have a set of d_i s. We also have a set of θ_j s, and so we know the locations of the clusters. To find the weights we estimate the functionals, for j = 1, . . . , 10,

w_j(d) = ∫ w_j p(w_j | d) dw_j,

with respect to the model p(d | y). Specifically, we take p(w_j | d) to be a stick-breaking model with the v_j s independent Beta(1, c) with c = 1. Thus we are estimating and computing

w̄_j = ∑_d w_j(d) p(d | y).
In this analysis we are emphasising the fact that the model will, with high probability, return weights w̄_j which are decreasing in j. It will also be possible to identify the number of groups by the size of the w̄_j.
With the settings described in this section, we ran the Gibbs sampler for 50,000 iterations and so have 50,000 samples of d from p(d, θ | y). Hence, we have the estimates of the locations, i.e. the (μ̄_j, λ̄_j). From the output of the allocations we obtain the w̄_j; the results are as follows:

w̄_1 = 0.867,    w̄_2 = 0.088,    w̄_3 = 0.036,

with the remaining weights decreasing and with remaining mass of 0.009. The corresponding locations are given by

(μ̄_1, λ̄_1) = (21.39, 0.21),    (μ̄_2, λ̄_2) = (9.72, 4.28),    (μ̄_3, λ̄_3) = (32.55, 0.80).
This clearly suggests 3 groups and provides the normal distributions which model the data. A figure of this mixture of 3 normals, with total mass of 0.991, is given in Figure 2, which for comparison includes a histogram representation of the data. We believe this is a good three-component normal mixture representation of the data. Moreover, it has been obtained via posterior parameter values rather than the usual predictive samples.

Lau and Green (2007) also find that 3 components are appropriate, using a loss function approach to find a posterior estimate of the allocations, and do this by concentrating exclusively on p(d | y).
Figure 2 here
4.2 Toy example 2: Equal group sizes
Here we simulate data so that there are clearly two groups of equal size. Hence, for i = 1, . . . , 20 we generate observations from N(−20, 1) and for i = 21, . . . , 40 we have observations from N(20, 1). Given this set up, we modify the prior for q, allowing us to ensure no more than 2 groups with high probability. We aim to show that the label switching problem is solved with high probability due to the fact that the chain does not move between the two modes, after a suitable burn-in to deal with the starting position, even though the modes have equal probability.
Now,

P(d_1, . . . , d_n ≤ k | q) = (1 − q^k)^n,

and hence if we want no more than 2 groups with high probability we should make q about 0.01. In fact we choose the prior for q as Beta(1, 100). We could fix M = 2, but we prefer the prior route. Running the chain for 20,000 iterations, and with a burn-in of 15,000, we collected 5,000 samples and estimated the weights accordingly:

w̃_1 = 0.51    and    w̃_2 = 0.49,

with locations estimated as

μ̃_1 = 19.92    and    μ̃_2 = −19.72.
In the last 5,000 iterations the chain did not switch between modes. This lack of ability to jump can be understood in the following way: at some point in the chain all the 1's (20 of them) would need to switch to 2, effectively adding 20 to D, which makes the corresponding factor q^D so small that the move effectively cannot happen.
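A quick numerical check of both points, using the formulas above with q near the value favoured by the Beta(1, 100) prior (our own back-of-the-envelope calculation):

```python
q, n, k = 0.01, 40, 2

# prior probability of at most 2 groups under p(d | q): (1 - q^k)^n
print((1 - q**k) ** n)   # about 0.996, so "no more than 2 groups" is essentially certain

# relative prior probability of relabelling one whole group of size 20,
# which adds 20 to D: q^20
print(q ** 20)           # about 1e-40, so the switch is effectively impossible
```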
We also ran such an example, now adding 5 further observations from N(0, 1) and putting the prior for q as uniform on (0, 1). With the same settings for the MCMC, we now obtain

w̃_1 = 0.45,    w̃_2 = 0.42    and    w̃_3 = 0.11,

with locations estimated as

μ̃_1 = 19.73,    μ̃_2 = −20.06    and    μ̃_3 = 1.04.

The last value might seem high, but the mean of the samples from the N(0, 1) distribution was 1.2.
5 Summary
The idea of the paper is to build the mixture model in a different way to the usual starting point of

f(y) = ∑_{j=1}^{M} w_j N(y | μ_j, 1/λ_j).

We start by constructing a prior model p(d), such that with high probability the larger groups are assigned the lower allocation labels, and there are no gaps. We then adopt the standard p(y | d) from the mixture model, and hence obtain p(d | y). With d from p(d | y) we have easy access to p(θ, w | y, d), i.e. the weights and locations of the groups.
In order to better grasp the idea behind our proposal, let us define, for each configuration S = (k, C_1, . . . , C_k), the d_S which minimizes the sum D. Then, for an equivalent allocation vector d^*_S, which yields an identical (k, C_1, . . . , C_k) in terms of clustering, but with different labels, we have that the ratio

p(d_S) / p(d^*_S)

increases as ∑_i d^*_{iS} increases. It is precisely this property which emphasises the importance of our choice of p(d). Even more interesting is the fact that this property also carries over to the posterior, i.e.

p(d_S | y) / p(d^*_S | y) = p(d_S) / p(d^*_S),

since p(y | d^*_S) = p(y | d_S), and this is the key to the idea, specifically to solving the label switching problem. So the posterior really only focuses on the d which, for any choice of (k, C_1, . . . , C_k), minimize D. Indeed, this then yields an interpretable p(d | y) from which we can estimate the weights via the distribution of choice, p(w | d), and then estimate functionals of choice. The point is that the d coming from p(d | y) are “good” allocation variables.
Acknowledgements
The paper was completed during a visit by the second author to UNAM, Mexico,
with support from CONACyT project 131179. The first author gratefully acknowledges the support of PAPIIT-UNAM project IN100411. The authors would
like to thank the Editor, an Associate Editor and two referees for their comments
on an earlier version of the paper.
References
Casella, G., Robert, C.P. and Wells, M. T. (2004). Mixture models, latent
variables and partitioned importance sampling. Statistical Methodology 1, 1–18.
Celeux, G., Hurn, M. and Robert, C. (2000). Computational and inferential
difficulties with mixture posterior distributions. J. Amer. Statist. Assoc. 95,
957–970.
Chung, H., Loken, E. and Schafer, J.L. (2004). Difficulties in Drawing
Inferences With Finite–Mixture Models. Amer. Statist. 58, 152–158.
Dahl, D. B. (2009). Modal clustering in a class of product partition models.
Bayes. Anal. 4, 243–264
Diebolt J. and Robert C. P. (1994). Estimation of finite mixture distributions
through Bayesian sampling. J. Roy. Statist. Soc. Ser. B 56, 363–375.
Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209–230.

Ferguson, T.S. (1983). Bayesian density estimation by mixtures of normal distributions. In Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday (M. H. Rizvi, J. Rustagi and D. Siegmund, eds.), Academic Press, New York.
Fuentes-García, R., Mena, R.H. and Walker, S.G. (2009). A nonparametric dependent process for Bayesian regression. Statist. Probab. Lett. 79, 1112–1119.

Fuentes-García, R., Mena, R.H. and Walker, S.G. (2010). A new Bayesian nonparametric mixture model. Comm. Statist. Sim. Comput. 39, 669–682.

Fuentes-García, R., Mena, R.H. and Walker, S.G. (2010). A probability for classification based on the mixture of Dirichlet process model. Journal of Classification 27, 389–403.
Frühwirth-Schnatter, S. (2001). Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. J. Amer. Statist. Assoc. 96, 194–209.

Hjort, N.L., Holmes, C.C., Müller, P. and Walker, S.G. (Eds.) (2010). Bayesian Nonparametrics. Cambridge University Press, Cambridge.

Kalli, M., Griffin, J.E. and Walker, S.G. (2011). Slice sampling mixture models. Statistics and Computing 21, 93–105.
Lau, J.W. and Green, P.J. (2007). Bayesian model-based clustering procedures. J. Comp. Graphical Statist. 16, 526–558.

Lijoi, A. and Prünster, I. (2010). Models beyond the Dirichlet process. In Bayesian Nonparametrics (Hjort, N.L., Holmes, C.C., Müller, P. and Walker, S.G., Eds.), Cambridge University Press, Cambridge.
Lo, A.Y. (1984). On a class of Bayesian nonparametric estimates I. Density
estimates. Ann. Statist. 12, 351–357.
Marin, J., Mengersen, K. and Robert, C. (2005). Bayesian modelling and inference on mixtures of distributions. In Handbook of Statistics (D. Dey and C. Rao, eds.), vol. 25, Elsevier Sciences.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley-Interscience, New York.
Nobile, A. (1994). Bayesian Analysis of Finite Mixture Distributions. Ph.D.
dissertation, Department of Statistics, Carnegie Mellon University, Pittsburgh.
Available at http://www.stats.gla.ac.uk/ agostino
Nobile, A. and Fearnside, A. T. (2007). Bayesian finite mixtures with an
unknown number of components: the allocation sampler. Statistics and Computing. 17, 147–162.
Papastamoulis, P. and Iliopoulos, G. (2010). Retrospective MCMC for
Dirichlet process hierarchical models. J. Comp. Graphical Statist., 19, 313–331.
Pearson, K. (1894). Contributions to the mathematical theory of evolution. Phil. Trans. Roy. Soc. A 185, 71–110.

Quintana, F. A. and Iglesias, P. L. (2003). Bayesian clustering and product partition models. J. Roy. Statist. Soc. Ser. B 65, 557–574.
Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures
with an unknown number of components. J. Roy. Statist. Soc. Ser. B 59,
731–792.
Roeder, K. (1990). Density Estimation With Confidence Sets Exemplified by
Superclusters and Voids in the Galaxies. J. Amer. Statist. Assoc. 85, 617–624.
Rodriguez, C.E. and Walker, S.G. (2014). Label switching in Bayesian mixture models: Deterministic relabeling strategies. J. Comp. Graphical Statist. 6, 145–178.
Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985). Statistical
Analysis of Finite Mixture Distributions. John Wiley & Sons Ltd.
Stephens, M. (2000). Dealing with label switching in mixture models. J. Roy.
Statist. Soc. Ser. B 62, 795–809.
Wang, L. & Dunson, D.B. (2012). Fast Bayesian inference in Dirichlet process
mixture models. J. Comp. Graphical Statist. 20, 198–216.
Figure 1: Histogram for the small data set.
          d1  d2  d3  d4  d5  d6  d7  d8  d9  d10    p(d | y)
d(1)       2   2   2   2   1   1   1   1   1   1       0.933
d(2)       2   2   2   3   1   1   1   1   1   1       0.027
d(3)       1   1   1   1   2   2   2   2   2   2       0.009
d(4)       3   2   2   2   1   1   1   1   1   1       0.008
d(5)       2   3   2   2   1   1   1   1   1   1       0.004
d(6)       2   2   2   2   3   1   1   1   1   1       0.004
d(7)       2   2   3   2   1   1   1   1   1   1       0.003
d(8)       3   3   2   2   1   1   1   1   1   1       0.002
d(9)       2   2   3   3   1   1   1   1   1   1       0.002

Table 1: Highest exact posterior probabilities for d for the toy example dataset, where d(r) has the rth highest posterior probability. Note that the largest group is allocated label 1 and the second largest group is allocated label 2, for the configuration with the highest posterior probability.
Figure 2: Galaxy data set and estimation using a mixture of 3 normal distributions.