Lecture 14: Bayesian networks II

Pac-Man competition
1. (1743) Zhiming Shi
2. (1723) Akim Kumok
3. (1710) Wilbur Yang
4. (1698) Cody Murray
5. (1671) Tao Du
Review: Bayesian network
[Figure: Bayesian network with edges C → H, A → H, A → I]

P(C = c, A = a, H = h, I = i) = p(c) p(a) p(h | c, a) p(i | a)
Definition: Bayesian network
Let X = (X1 , . . . , Xn ) be random variables.
A Bayesian network is a directed acyclic graph (DAG) that specifies a joint distribution over X as a product of local conditional
distributions, one for each node:
P(X1 = x1, . . . , Xn = xn) = ∏_{i=1}^n p(xi | xParents(i))
• A Bayesian network allows us to define a joint probability distribution over many variables (e.g.,
P(C, A, H, I)) by specifying local conditional distributions (e.g., p(i | a)).
Review: probabilistic inference
Input
Bayesian network: P(X1 = x1 , . . . , Xn = xn )
Evidence: E = e, where E ⊂ X is a subset of variables
Query: Q ⊂ X is a subset of variables
Output
P(Q = q | E = e) for all values q
Example: if I'm coughing but don't have itchy eyes, do I have a cold?
P(C | H = 1, I = 0)
• Think of the Bayesian network as a guru who knows everything. Probabilistic inference allows you to ask
the guru anything: what is the probability of having a cold? What if I’m coughing? What if I don’t have
itchy eyes? In this lecture, we’re going to build such a guru.
Roadmap
Preparation
Forward-backward
Gibbs sampling
Particle filtering
Example: Markov model
[Figure: Markov chain X1 → X2 → X3 → X4]
Query: P(X3 = x3 | X2 = x2 )
Tedious way:

P(X3 = x3 | X2 = x2)
∝ Σ_{x1,x4} p(x1) p(x2 | x1) p(x3 | x2) p(x4 | x3)
= (Σ_{x1} p(x1) p(x2 | x1)) · p(x3 | x2) · (Σ_{x4} p(x4 | x3))
∝ p(x3 | x2)
Fast way:
[whiteboard]
• Let’s first compute the query the old-fashioned way by grinding through the algebra.
• One important note about conditional probabilities: P(X3 = u | X2 = v) = P(X3 = u, X2 = v) / P(X2 = v) ∝ P(X3 = u, X2 = v). Recall that the conditional probability given evidence is the joint distribution divided by the probability of the evidence, which is just a constant (called the normalization constant). That means we can just write "proportional to" the joint. It saves a lot of work to think in terms of proportionality, because then we can just drop constants (things that don't depend on x3). You can always normalize at the end.
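• To make this concrete, here is a minimal Python sketch (my own illustration, not from the lecture; the distributions p1 and trans are made-up numbers) that computes P(X3 | X2 = 1) the tedious way: sum the joint over x1 and x4, then normalize at the end:

import itertools

p1 = {0: 0.5, 1: 0.5}                       # p(x1) (assumed values)
trans = {(0, 0): 0.7, (0, 1): 0.3,          # p(x_next | x_prev) (assumed values)
         (1, 0): 0.4, (1, 1): 0.6}

def joint(x1, x2, x3, x4):
    return p1[x1] * trans[(x1, x2)] * trans[(x2, x3)] * trans[(x3, x4)]

x2 = 1
unnorm = {x3: sum(joint(x1, x2, x3, x4)
                  for x1, x4 in itertools.product([0, 1], repeat=2))
          for x3 in [0, 1]}
Z = sum(unnorm.values())                    # normalization constant
print({x3: w / Z for x3, w in unnorm.items()})  # ≈ {0: 0.4, 1: 0.6} = p(x3 | x2 = 1)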
General strategy
Query:
P(Q | E = e)
Algorithm: general probabilistic inference strategy
• Remove (marginalize) variables that are not ancestors of Q or E.
• Convert the Bayesian network to a factor graph.
• Condition (shade nodes / disconnect) on E = e.
• Remove (marginalize) nodes disconnected from Q.
• Run a probabilistic inference algorithm (manual, variable elimination, Gibbs sampling, particle filtering).
• Our goal is to compute the conditional distribution over the query variables Q ⊂ X given evidence E = e.
We can do this with our bare hands by chugging through all the algebra starting with the definition of
marginal and conditional probability, but there is an easier way to do this that exploits the structure of the
Bayesian network.
• Step 1: remove variables which are not ancestors of Q or E. Intuitively, these don’t have an influence on
Q and E, so they can be removed. Mathematically, we verified this property last lecture (consistency of
sub-Bayesian networks).
• Step 2: turn this Bayesian network into a factor graph by simply introducing one potential per node which
is connected to that node and its parents. It’s important to include all the parents and the child into one
factor, not separate factors. From here on, all we need to think about is factor graphs.
• Step 3: condition on the evidence variables. Recall that conditioning on nodes in a factor graph shades them in; as a graph operation, it rips those variables out of the graph.
• Step 4: remove nodes which are not connected to Q. These are independent of Q, so they have no impact
on the results.
• Step 5: Finally, run a standard probabilistic inference algorithm on the reduced factor graph. We’ll do
this manually for now (chugging through algebra on this hopefully much smaller graph). Later we’ll see
automatic methods for doing this.
Example: alarm
[Figure: v-structure B → A ← E]

b  p(b)          e  p(e)
1  ε             1  ε
0  1 − ε         0  1 − ε

b  e  a  p(a | b, e)
0  0  0  1
0  0  1  0
0  1  0  0
0  1  1  1
1  0  0  0
1  0  1  1
1  1  0  0
1  1  1  1

(i.e., the alarm a = b ∨ e deterministically)
[whiteboard]
Query: P(B)
• Marginalize out A, E
Query: P(B | A = 1)
• Condition on A = 1
• Here is another example: the simple v-structured alarm network from last time.
• P(B) is trivial to compute after removing A and E.
• For P(B | A = 1), we can’t remove everything so we have to marginalize out E manually.
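• As a sanity check, here is a short Python sketch (again my own illustration; the value ε = 0.05 is an arbitrary assumption) that computes P(B = 1 | A = 1) by enumerating over E:

eps = 0.05                                   # assumed value of epsilon
p_b = {1: eps, 0: 1 - eps}                   # p(b)
p_e = {1: eps, 0: 1 - eps}                   # p(e)
p_a = lambda a, b, e: float(a == (b or e))   # p(a | b, e): alarm iff burglary or earthquake

# P(B = b | A = 1) is proportional to sum_e p(b) p(e) p(a = 1 | b, e)
unnorm = {b: sum(p_b[b] * p_e[e] * p_a(1, b, e) for e in [0, 1])
          for b in [0, 1]}
Z = sum(unnorm.values())
print(unnorm[1] / Z)   # ≈ 0.513: slightly more than half, since either cause explains the alarm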
Example: A-H
[Figure: Bayesian network over variables A–H]
[whiteboard]
Query: P(C | B = b)
• Marginalize out everything else; note C ⊥⊥ B
Query: P(C, H | E = e)
• Marginalize out A, D, F, G; note C ⊥⊥ H | E
• In the first example, once we marginalize out all variables, we are left with C and B disconnected. We
condition on B, which just removes that node, and so we’re just left with P(C) = p(c).
• In the second example, note that the two query variables are independent, so we can compute them separately. The result is P(C = c, H = h | E = e) ∝ p(h | e) Σ_b p(e | b, c).
• But how do we compute these query distributions in general?
Roadmap
Preparation
Forward-backward
Gibbs sampling
Particle filtering
Hidden Markov model
[Figure: HMM with hidden chain H1 → H2 → H3 → H4 → H5 and emissions Hi → Ei]
P(H = h, E = e) = p(h1) · ∏_{i=2}^n p(hi | hi−1) · ∏_{i=1}^n p(ei | hi)

where p(h1) is the start distribution, p(hi | hi−1) are the transitions, and p(ei | hi) are the emissions.
Query (filtering):
P(H3 | E1 = e1 , E2 = e2 , E3 = e3 )
Query (smoothing):
P(H3 | E1 = e1 , E2 = e2 , E3 = e3 , E4 = e4 , E5 = e5 )
• The forward-backward algorithm will allow us to compute certain types of probabilistic queries exactly for
HMMs.
• Hidden Markov models (HMMs) are an important instance of Bayesian networks. In principle, you could
ask any type of query on an HMM, but there are two common ones: filtering and smoothing.
• Filtering asks for the distribution of some hidden variable Hi conditioned on only the evidence up until
that point. This is useful when you’re doing real-time object tracking, and you can’t see the future.
• Smoothing asks for the distribution of some hidden variable Hi conditioned on all the evidence, including
the future. This is useful when you have collected all the data and want to retroactively go and figure out
what Hi was.
Lattice representation
[Figure: lattice with a start node, an end node, and one node per assignment Hi = v (columns i = 1, . . . , 5; rows v = 1, 2, 3)]
• Edge start ⇒ H1 = h1 has weight p(h1) p(e1 | h1)
• Edge Hi−1 = hi−1 ⇒ Hi = hi has weight p(hi | hi−1) p(ei | hi)
• Each path from start to end is an assignment with weight equal to the product of node/edge weights
• Now let’s actually compute these queries. We will do smoothing first. Filtering is a special case: if we’re
asking for Hi given E1 , . . . , Ei , then we can marginalize out the future, reducing the problem to a smaller
HMM.
• A useful way to think about inference is returning to state-based models. Consider a graph with a start
node, an end node, and a node for each assignment of a value to a variable Hi = v. The nodes are
arranged in a lattice, where each column corresponds to one variable Hi and each row corresponds to a
particular value v. Each path from the start to the end corresponds exactly to a complete assignment to
the nodes.
• Note that in the reduction from a variable-based model to a state-based model, we have committed to an
ordering of the variables.
• Each edge has a weight (a single number) determined by the local conditional probabilities (more generally,
the potentials in a factor graph). For each edge into Hi = hi , we multiply by the transition probability
into hi and emission probability p(ei | hi ). Remember that ei is observed, so we just have a constant. This
defines a weight for each path (assignment) in the graph equal to the joint probability P (H = h, E = e).
• Note that the lattice contains O(Kn) nodes and O(K²n) edges, where n is the number of variables and K is the number of values in the domain of each variable.
Lattice representation
[Figure: the same lattice as above]
Marginals P(Hi = hi | E = e) ∝ µi (hi ):
sum of weights of paths from start to end through Hi = hi
Forward messages Fi (hi ):
sum of weights of paths from start to Hi = hi
Backward messages Bi (hi ):
sum of weights of paths from Hi = hi to end
• The point of bringing back the search-based view is that we can cast the probability queries we care about
in terms of sums over paths, and effectively use dynamic programming.
• First, let's define µi(v) to be the sum of the weights over all paths from the start node to the end node that pass through the intermediate node Hi = v. This is the quantity we want, up to normalization: P(Hi = v | E1 = e1, . . . , En = en) ∝ µi(v). There are exponentially many paths, but we can break the sum down.
• Define the forward message Fi(v) to be the sum of the weights over all paths from the start node to Hi = v. Analogously, define the backward message Bi(v) to be the sum of the weights over all paths from Hi = v to the end node.
• Given these two definitions, we have µi (v) = Fi (v)Bi (v).
• In summary, for each node Hi = v , we compute three numbers: Fi (v), Bi (v), µi (v). First, we sweep
forward to compute all the Fi ’s recursively. At the same time, we sweep backward to compute all the Bi ’s
recursively. Then we compute µi by pointwise multiplication.
• Implementation note: we technically can normalize µi to get P(Hi | E = e) at the very end but it’s useful
to normalize Fi and Bi at each step to avoid underflow. In addition, normalization of the forward messages
yields P(Hi = v | E1 = e1 , . . . , Ei = ei ) ∝ Fi (v).
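• Here is one way the forward-backward algorithm might look in Python (a sketch, not the course's reference implementation; forward_backward and its argument layout are my own naming), with messages normalized at each step to avoid underflow:

import numpy as np

def forward_backward(start, trans, emit):
    """start[v] = p(h1 = v); trans[u, v] = p(h_i = v | h_{i-1} = u);
    emit[i, v] = p(e_i | h_i = v). Returns smoothed marginals P(H_i | E = e)."""
    n, K = emit.shape
    F = np.zeros((n, K))                 # forward messages F_i
    B = np.zeros((n, K))                 # backward messages B_i
    F[0] = start * emit[0]
    F[0] /= F[0].sum()                   # normalize to avoid underflow
    for i in range(1, n):
        F[i] = (F[i - 1] @ trans) * emit[i]
        F[i] /= F[i].sum()
    B[n - 1] = 1.0
    for i in range(n - 2, -1, -1):
        B[i] = trans @ (emit[i + 1] * B[i + 1])
        B[i] /= B[i].sum()
    mu = F * B                           # mu_i(v) = F_i(v) B_i(v)
    return mu / mu.sum(axis=1, keepdims=True)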
Object tracking
[Figure: HMM with hidden chain H1 → H2 → H3 → H4 and emissions Hi → Ei]
Problem: object tracking
Hi ∈ {1, . . . , K}: location of object at time step i
Ei ∈ {1, . . . , K}: sensor reading at time step i
Start: p(h1 ): uniform over all locations
Transition p(hi | hi−1): uniform over adjacent locations
Emission p(ei | hi): uniform over adjacent locations
Observations: E = [1, 2, 3, 6]
[live solution]
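• Continuing the sketch above, we can run it on this tracking problem (assuming locations are relabeled 0–9 and "adjacent" means within distance 1, including staying put):

K, obs = 10, [1, 2, 3, 6]
adj = lambda u, v: float(abs(u - v) <= 1)
A = np.array([[adj(u, v) for v in range(K)] for u in range(K)])
trans = A / A.sum(axis=1, keepdims=True)    # p(h_i | h_{i-1}): uniform over adjacent locations
emit = np.array([[adj(e, v) for v in range(K)] for e in obs]) / A.sum(axis=1)  # p(e_i | h_i)
start = np.ones(K) / K                      # p(h1): uniform over all locations
print(forward_backward(start, trans, emit).round(2))   # smoothed marginals per time step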
Summary
• Lattice representation: paths are assignments (think state-based models)
• Dynamic programming: compute sums efficiently
• Forward-backward algorithm: share intermediate computations across different queries
Roadmap
Preparation
Forward-backward
Gibbs sampling
Particle filtering
Particle-based approximation
Key idea: particles
Use a set of assignments (particles) to represent a probability distribution.
Example:
Sample   x1  x2  x3
1        0   0   1
2        0   0   0
3        1   1   0
4        1   0   0
5        0   0   1
6        1   0   0
7        0   1   0
8        1   1   0
9        1   1   0
10       1   1   0

Estimated marginals: x1: 0.6, x2: 0.5, x3: 0.2
• The central idea behind both Gibbs sampling and particle filtering is the use of particles (just a fancy word for a complete assignment) to represent a probability distribution.
• Rather than storing the probability of every single assignment, we have a set of assignments, some of which
can occur multiple times (which implicitly represents a higher probability).
• From a set of particles, we can compute approximate marginals (or any query we want) by simply computing
the fraction of assignments that satisfy the desired condition.
• Once we have a set of particles, then we can compute all the queries we want with it. So now how do we
actually generate the particles?
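• In code, estimating a marginal from particles is just computing a fraction; this snippet reproduces the table above:

particles = [(0, 0, 1), (0, 0, 0), (1, 1, 0), (1, 0, 0), (0, 0, 1),
             (1, 0, 0), (0, 1, 0), (1, 1, 0), (1, 1, 0), (1, 1, 0)]
# fraction of particles with x_j = 1, for each variable j
marginals = [sum(x[j] for x in particles) / len(particles) for j in range(3)]
print(marginals)   # [0.6, 0.5, 0.2], matching the estimated marginals above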
Gibbs sampling
Algorithm: Gibbs sampling
Initialize x to a random complete assignment
Loop through i = 1, . . . , n until convergence:
Compute weight of x[Xi = v] for each v
Choose x[Xi = v] with probability prop. to weight
[demo]
• Recall that Gibbs sampling proceeds by going through each variable Xi, considering each possible value v ∈ Domaini, computing the weight of the resulting assignment x[Xi = v], and choosing one with probability proportional to its weight.
Gibbs sampling
Gibbs sampling (probabilistic interpretation)
Loop through i = 1, . . . , n until convergence:
• Set Xi = v with prob. P(Xi = v | X−i = x−i )
Notation: X−i = X − {Xi }
Important: computing the conditional of Xi only requires the potentials touching Xi
• A different way (and the more standard way) to state the same algorithm is in terms of conditional
probabilities. Consider the conditional distribution over Xi given all the other variables X−i .
• Note that conditioning removes all of the graph except the potentials that depend on Xi . Therefore, in
order to sample, we just have to compute the normalization constant for the simple factor graph over Xi ,
which is easy since it only involves adding a weight for each v ∈ Domaini .
• Advanced: Gibbs sampling is an instance of a Markov Chain Monte Carlo (MCMC) algorithm which
generates a sequence of particles X (1) , X (2) , X (3) , . . . . A Markov chain is irreducible if there is positive
probability of getting from any assignment to any other assignment (now the probabilities are over the
random choices of the sampler). When the Gibbs sampler is irreducible, then in the limit as t → ∞, the
distribution of X (t) converges to the true distribution P(X). MCMC is a very rich topic which we will not
talk about very much here.
Application: image reconstruction
Example: image reconstruction
Setup:
• Xi ∈ {0, 1} is pixel value in location i
• Small fraction of pixels are observed
• Neighboring pixels more likely to be same than different
Potentials:
• oi (xi ) = [xi = observed value at i]
• tij (xi , xj ) = [xi = xj ] + 1
Application: image reconstruction

Example: image reconstruction

If neighbors are 1, 1, 1, 0 and Xi is not observed:
P(Xi = 1 | X−i = x−i) = (2 · 2 · 2 · 1) / (2 · 2 · 2 · 1 + 1 · 1 · 1 · 2) = 0.8

If neighbors are 0, 1, 0, 1 and Xi is not observed:
P(Xi = 1 | X−i = x−i) = (1 · 2 · 1 · 2) / (1 · 2 · 1 · 2 + 2 · 1 · 2 · 1) = 0.5
• Factor graphs play a huge role in computer vision applications. Here we take a look at a very simple image
reconstruction application. Recall that a general factor graph defines a distribution P(X = x) ∝ Weight(x).
Our example here will not be a Bayesian network, but from the point of view of probabilistic inference, we don't care.
• We assume that we have observed some fraction of the pixels in an image, and we wish to recover the
pixels which have been removed. Our simple factor graph has two potentials: transitions say that adjacent
pixels are more likely to be similar than not; observation potentials exist only for observed pixels and say
that the value Xi of the pixel must equal the observed value (we are assuming no noise other than missing
pixels).
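• Here is a minimal Gibbs sampler for this model (a sketch under assumptions: an 8×8 grid and a hypothetical observed set; the potentials are exactly the oi and tij defined above):

import random

W, H = 8, 8
observed = {(0, 0): 1, (3, 4): 0, (7, 7): 1}     # hypothetical evidence
x = {(i, j): observed.get((i, j), random.randint(0, 1))
     for i in range(W) for j in range(H)}

def neighbors(i, j):
    return [(i + di, j + dj) for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]
            if 0 <= i + di < W and 0 <= j + dj < H]

for sweep in range(100):
    for pix in x:
        if pix in observed:
            continue                              # o_i pins observed pixels to their value
        # weight of x[pix] = v is the product of t_ij over neighboring pixels
        w = [1.0, 1.0]
        for v in (0, 1):
            for nb in neighbors(*pix):
                w[v] *= 1 + (v == x[nb])          # t_ij(x_i, x_j) = [x_i = x_j] + 1
        x[pix] = random.choices([0, 1], weights=w)[0]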
Gibbs sampling: demo
[see web version]
• Try playing with the demo by modifying the settings to get a feeling for what Gibbs sampling is doing.
Each iteration corresponds to resampling each pixel (variable).
• When you hit ctrl-enter for the first time, red and black correspond to 1 and 0, and white corresponds to
unobserved.
• showMarginals allows you to either view the particles produced or the marginals estimated from the
particles (this gives you a smoother probability estimate of what the pixel values are).
• If you increase missingFrac, the problem becomes harder.
• If you set coherenceFactor to 1, this is equivalent to turning off the edge potentials.
• If you set icm to true, we will use local search rather than Gibbs sampling, which produces very bad
solutions.
Roadmap
Preparation
Forward-backward
Gibbs sampling
Particle filtering
Hidden Markov models
[Figure: HMM with hidden chain X1 → · · · → X5 and emissions Ei, redrawn as a chain-structured factor graph with transition potentials ti(xi−1, xi) and observation potentials oi(xi)]
• Although particle filtering applies to general factor graphs, we will restrict ourselves to chain-structured
factor graphs. These factor graphs have transition potentials ti (xi−1 , xi ) and observation potentials oi (xi ).
For HMMs, we have ti (xi−1 , xi ) = p(xi | xi−1 ) and oi (xi ) = p(ei | xi ). (Here, we’ve switched notation
from H1 , . . . , Hn to X1 , . . . , Xn .)
Review: beam search
Idea: keep a candidate list C of at most K partial assignments
Algorithm: beam search
Initialize C ← [{}]
For each i = 1, . . . , n:
  Extend:
    C′ ← {x ∪ {Xi : v} : x ∈ C, v ∈ Domaini}
  Prune:
    C ← K elements of C′ with highest weights
[demo: beamSearch({K:3})]
Review: beam search

[Figure: beam search with beam size K = 4]
• Recall that beam search effectively does a pruned BFS of the search tree of partial assignments, where at
each level, we keep track of the K partial assignments with the highest weight.
• There are two phases. In the first phase, we extend all the existing candidates in C to all possible assignments to Xi; this results in K · |Domaini| candidates C′. These are sorted by weight and pruned by taking the top K.
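• A compact sketch of this procedure for a chain-structured factor graph (my own illustration; t(i, prev, v) and o(i, v) stand for assumed potential functions):

def beam_search(n, domain, t, o, K):
    C = [((), 1.0)]                        # candidates: (partial assignment, weight)
    for i in range(n):
        # Extend: each candidate times each value v of X_i
        C_ext = [(x + (v,), w * (t(i, x[-1], v) if x else 1.0) * o(i, v))
                 for x, w in C for v in domain]
        # Prune: keep the K partial assignments with highest weight
        C = sorted(C_ext, key=lambda c: c[1], reverse=True)[:K]
    return C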
Beam search
End result:
• Candidate list C is set of particles
• Use C to compute marginals
Problems:
• Extend: slow because it requires considering every possible value for Xi
• Prune: greedily taking best K doesn’t provide diversity
Solution (3 steps): propose, reweight, resample
• Beam search does generate a set of particles, but there are two problems.
• First, it can be slow if Domaini is large because we have to try every single value. Perhaps we can be
smarter about which values to try.
• Second, we are greedily taking the top K candidates, which can be too myopic. Can we somehow encourage
more diversity?
Step 1: propose
[Figure: chain-structured factor graph with transition potentials ti and observation potentials oi]
Definition: proposal distribution
The proposal distribution πi (xi | x1:i−1 ) is a heuristic guess of
the value of Xi given X1:i−1 = x1:i−1 .
Notation: x1:i = (x1 , . . . , xi )
How to choose a proposal?
• The first step is to extend the current partial assignment (particle) from x1:i−1 = (x1 , . . . , xi−1 ) to
x1:i = (x1 , . . . , xi ).
• Recall from factor graphs that upon assigning variable Xi, we can include all the dependent potentials D(x1:i−1, Xi), which for chain-structured factor graphs contain ti(xi−1, xi) and oi(xi).
• But in general, there are many possible values for Xi (for object tracking in a 100×100 grid, we have |Domaini| = 10⁴ possible values) and we don't want to enumerate over all of them.
• So we will use a proposal distribution πi , whose purpose is to provide an educated guess about what the
value of Xi should be. The idea is that we can just sample from this distribution, which might be easier
than enumerating all possible values v ∈ Domaini .
Step 1: propose
Example: chain
Uniform: choose without any information
π(xi | x1:i−1 ) ∝ 1
Transitions: choose xi based on previous location
π(xi | x1:i−1 ) ∝ t(xi−1 , xi )
Transitions + observations: choose xi based on all information
π(xi | x1:i−1 ) ∝ t(xi−1 , xi )o(xi )
• There is a lot of flexibility in choosing the proposal distribution, and generally, one is faced with an
accuracy/efficiency tradeoff.
• If we propose uniformly from all possible v ∈ Domaini , this is very cheap to compute, but is not as
accurate.
• If we propose based on the transition potentials (forming a distribution proportional to the potentials), we
get something that’s a bit more informed.
• Finally, if we take the observations into account as well, then we can get a much more informed proposal (assuming observations exist). In some sense, this is the best proposal we could hope for, since it takes into account all the available potentials that we're adding.
• For HMMs, we generally propose using πi (xi | x1:i−1 ) = p(xi | xi−1 ), which is the transition distribution.
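• As a sketch, sampling from the "transitions + observations" proposal just means normalizing the product of the two potentials over the domain (t and o are assumed potential functions, as above):

import random

def propose(prev, domain, t, o):
    # pi(x_i | x_{1:i-1}) proportional to t(x_{i-1}, x_i) o(x_i)
    weights = [t(prev, v) * o(v) for v in domain]
    return random.choices(domain, weights=weights)[0]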
Step 2: reweight proposed candidates
[Figure: chain-structured factor graph with transition potentials ti and observation potentials oi]
Multiply in the new potentials (incorporate real information): ti−1(xi−1, xi) oi(xi)

Divide by the proposal probability (remove guessed information): π(xi | x1:i−1)

Weight: w(x1:i) = ti−1(xi−1, xi) oi(xi) / π(xi | x1:i−1)
• Having generated a set of K candidates, we need to now score them. In beam search, we just used the
weight of the candidate, but particles, by virtue of being able to occur multiple times, have already taken
into account all the potentials involving variables X1 , . . . , Xi−1 through their multiplicity. Therefore, we
just need to take into account the weight of the new potentials ti−1 and oi .
• We need to divide the weight by the proposal probability because the proposal was just a guess, not actually
part of the model.
• For example, if we proposed an assignment with twice the probability, we need to assign it half the weight
because we’re going to get it twice as often on average.
Step 2: reweight proposed candidates
[Figure: HMM with hidden chain X1 → · · · → X5 and emissions Ei]
Example: HMM
Model:
ti−1 (xi−1 , xi ) = p(xi | xi−1 )
oi (xi ) = p(ei | xi )
Proposal:
π(xi | x1:i−1 ) = p(xi | xi−1 )
Weights:
w(x1:i ) = p(ei | xi )
• For the HMM, we multiply the transition and the emission probability, and divide by the proposal (which
happens to be the transition). This means we’re left with the emission probability p(ei | xi ) as the weight
that we assign each particle x1:i .
Step 3: resample
Question: given weighted particles, which to choose?
Tricky situation:
• Target distribution close to uniform
• Fewer particles than locations
• Having proposed extensions to the particles and computed a weight for each particle, we now come to the
question of which particles to keep.
• As a motivating example, consider an almost uniform distribution over a set of locations, and trying to
represent this distribution with fewer particles than locations. This is a tough situation to be in.
Step 3: resample
[Figure: two particle sets: the K locations with highest weight vs. K locations sampled from the distribution]
Intuition: top K assignments not representative.
Maybe random samples will be more representative...
• Beam search, which would choose the K locations with the highest weight, would clump all the particles
near the mode. This is risky, because we have no support out farther from the center, where there is
actually substantial probability.
• However, if we sample from the distribution which is proportional to the weights, then we can hedge our
bets and get a more representative set of particles which cover the space more evenly.
Step 3: resample
Key idea: resampling
Given a distribution P(A = a) with n possible values, draw a
sample K times.
Intuition: redistribute particles to more promising areas
Example: resampling

a    P(A = a)
a1   0.70
a2   0.20
a3   0.05
a4   0.05

sample 1: a1
sample 2: a2
sample 3: a1
sample 4: a1
• After proposing and reweighting, we end up with a set of samples x1:i , each with some weight w(x1:i ).
Intuitively, if w(x1:i ) is really small, then it might not be worth keeping that particle around.
• Resampling allows us to place (possibly multiple) particles on high-weight assignments. In the example above, we failed to sample a3 and a4 because they have low probability of being sampled.
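• In Python, resampling is a single library call; this reproduces the example above (the draws are random, so a3 and a4 will usually be missed):

import random

particles = ["a1", "a2", "a3", "a4"]
weights = [0.70, 0.20, 0.05, 0.05]
samples = random.choices(particles, weights=weights, k=4)  # K = 4 independent draws
print(samples)   # e.g. ['a1', 'a2', 'a1', 'a1']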
Particle filtering
Algorithm: particle filtering
Initialize C ← [{}]
For each i = 1, . . . , n:
  Propose (extend):
    C′ ← {(x1:i−1, xi) : x1:i−1 ∈ C, xi ∼ π(xi | x1:i−1)}
  Reweight:
    Compute weights w(x1:i) = ti−1(xi−1, xi) oi(xi) / π(xi | x1:i−1) for x1:i ∈ C′
  Resample (prune):
    C ← K elements drawn independently from ∝ w(x1:i)
[demo: particleFiltering({K:100})]
• The final algorithm here is very similar to beam search. We go through all the variables X1 , . . . , Xn .
• For each candidate xi−1 ∈ C, we sample xi according to the proposal π(xi | x1:i−1 ).
• We then reweight each extended particle using w(x1:i).
• Finally, we select K particles from ∝ w(x1:i) by sampling each independently.
• For example, if w([1, 0]) = 3 and w([0, 0]) = 2, then we sample [1, 0] with probability 3/5 and [0, 0] with probability 2/5.
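• Putting the three steps together, here is a sketch of particle filtering for an HMM (my own illustration; start, trans, and emit encode p(x1), p(xi | xi−1), and p(ei | xi)). Since we propose from the transition, the weight reduces to the emission probability:

import random

def particle_filter(n, domain, start, trans, emit, K):
    dom = list(domain)
    # Propose K initial particles from p(x1), reweight by the emission, resample
    C = [(v,) for v in random.choices(dom, weights=[start[v] for v in dom], k=K)]
    C = random.choices(C, weights=[emit[0][x[-1]] for x in C], k=K)
    for i in range(1, n):
        # Propose (extend): sample x_i from the transition p(x_i | x_{i-1})
        C = [x + (random.choices(dom, weights=[trans[(x[-1], v)] for v in dom])[0],)
             for x in C]
        # Reweight: w(x_{1:i}) = p(e_i | x_i); Resample: K independent draws
        C = random.choices(C, weights=[emit[i][x[-1]] for x in C], k=K)
    return C    # K particles approximating P(X | E = e)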
Particle filtering: implementation
• If we only care about the last Xi, collapse all particles with the same Xi (think elimination):

  001 ⇒ 1
  101 ⇒ 1
  010 ⇒ 0
  010 ⇒ 0
  110 ⇒ 0

• If many particles are the same, we can just store counts:

  1, 1, 0, 0, 0 ⇒ 1: 2, 0: 3
• In particle filtering as it is currently defined, each particle is an entire trajectory in the context of object
tracking (assignment to all the variables).
• Often in tracking applications, we only care about the last location Xi , and our chain-structured factor
graph is such that the future (Xi+1 , . . . , Xn ) is conditionally independent of X1 , . . . , Xi−1 given Xi .
Therefore, we often just store the value of Xi rather than its entire ancestry.
• When we only keep track of the Xi , we will have many particles that have the same value, so it can be
useful to store just the counts of each value rather than having duplicates.
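• For instance (an illustrative snippet using the five particles shown above), collections.Counter collapses particles to counts of their last value:

from collections import Counter

particles = [(0, 0, 1), (1, 0, 1), (0, 1, 0), (0, 1, 0), (1, 1, 0)]
counts = Counter(x[-1] for x in particles)   # keep only the last X_i of each particle
print(counts)   # Counter({0: 3, 1: 2})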
Application: tracking
[Figure: chain-structured factor graph with transition potentials ti and observation potentials oi]
Example: tracking
• Xi : position of object at i
• Transitions: ti (xi , xi+1 ) = [xi near xi+1 ]
• Observations: oi (xi ) = sensor reading...
Particle filtering demo
[see web version]
• Consider a tracking application where an object is moving around in a grid and we are trying to figure out
its location Xi ∈ {1, . . . , grid-width} × {1, . . . , grid-height}.
• The transition potentials say that from one time step to the next, the object is equally likely to have moved
north, south, east, west, or stayed put.
• Each observation is a location on the grid (a yellow dot). The observation potential is a user-defined
function which depends on the vertical and horizontal distance.
• Play around with the demo to get a sense of how particle filtering works, especially the different observation
potentials.
Application: localization
Setup [Thrun]:
• Xi : location of robot at time i
• Observations oi(xi): laser range finders provide information about location
• Transitions ti (xi , xi+1 ): non-zero if xi and xi+1 are nearby
• In robot localization, the robot is given a map of the environment, but it doesn’t know where it is. By
wandering around and observing the environment, the robot will pick up noisy observations (distance from
walls), and maintain a distribution over where the robot is (via a set of particles).
• As you watch the video, think about where the robot could be based on the local environment. If two
places look very similar on the map, then the robot will maintain particles in each of those places. As
the robot moves around and collects more data, then the space of possible positions is narrowed, until the
robot is finally localized.
Probabilistic inference
Model (Bayesian network or factor graph):

P(X = x) = ∏_{i=1}^n p(xi | xParents(i))
Probabilistic inference:
P(Q | E = e)
Algorithms:
• Preparation: leverage conditional independence
• Forward-backward: chain-structured (HMMs), exact
• Gibbs sampling, particle filtering: general, approximate
Next time: learning