Lecture 14: Bayesian networks II
CS221 / Autumn 2014 / Liang

Pac-Man competition
1. (1743) Zhiming Shi
2. (1723) Akim Kumok
3. (1710) Wilbur Yang
4. (1698) Cody Murray
5. (1671) Tao Du

Review: Bayesian network

[Figure: four-node network with edges C → H, A → H, A → I.]

P(C = c, A = a, H = h, I = i) = p(c) p(a) p(h | c, a) p(i | a)

Definition: Bayesian network
Let X = (X_1, ..., X_n) be random variables. A Bayesian network is a directed acyclic graph (DAG) that specifies a joint distribution over X as a product of local conditional distributions, one for each node:
$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} p(x_i \mid x_{\text{Parents}(i)})$$

• A Bayesian network allows us to define a joint probability distribution over many variables (e.g., P(C, A, H, I)) by specifying local conditional distributions (e.g., p(i | a)).

Review: probabilistic inference

Input
Bayesian network: P(X_1 = x_1, ..., X_n = x_n)
Evidence: E = e, where E ⊂ X is a subset of variables
Query: Q ⊂ X is a subset of variables

Output
P(Q = q | E = e) for all values q

Example: if coughing but no itchy eyes, do I have a cold?
P(C | H = 1, I = 0)

• Think of the Bayesian network as a guru who knows everything. Probabilistic inference allows you to ask the guru anything: what is the probability of having a cold? What if I'm coughing? What if I don't have itchy eyes? In this lecture, we're going to build such a guru.

Roadmap
Preparation
Forward-backward
Gibbs sampling
Particle filtering

Example: Markov model

X1 → X2 → X3 → X4

Query: P(X_3 = x_3 | X_2 = x_2)

Tedious way:
$$P(X_3 = x_3 \mid X_2 = x_2) \propto \sum_{x_1, x_4} p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)\, p(x_4 \mid x_3)$$
$$\propto \Big( \sum_{x_1} p(x_1)\, p(x_2 \mid x_1) \Big)\, p(x_3 \mid x_2) \Big( \sum_{x_4} p(x_4 \mid x_3) \Big)$$
$$\propto p(x_3 \mid x_2)$$

Fast way: [whiteboard]

• Let's first compute the query the old-fashioned way by grinding through the algebra.
• One important note about conditional probabilities: $P(X_3 = u \mid X_2 = v) = \frac{P(X_3 = u, X_2 = v)}{P(X_2 = v)} \propto P(X_3 = u, X_2 = v)$. Recall that the conditional probability given evidence is the joint distribution divided by the probability of the evidence, which is just a constant (called the normalization constant). That means we can just write "proportional to" the joint. It saves a lot of work to think in terms of proportionality, because then we can drop constants (things that don't depend on x_3). You can always normalize at the end.

General strategy

Query: P(Q | E = e)

Algorithm: general probabilistic inference strategy
• Remove (marginalize) variables that are not ancestors of Q or E.
• Convert the Bayesian network to a factor graph.
• Condition (shade nodes / disconnect) on E = e.
• Remove (marginalize) nodes disconnected from Q.
• Run a probabilistic inference algorithm (manual, variable elimination, Gibbs sampling, particle filtering).

• Our goal is to compute the conditional distribution over the query variables Q ⊂ X given evidence E = e. We could do this with our bare hands by chugging through all the algebra starting from the definitions of marginal and conditional probability, but there is an easier way that exploits the structure of the Bayesian network.
• Step 1: remove variables which are not ancestors of Q or E. Intuitively, these have no influence on Q and E, so they can be removed. Mathematically, we verified this property last lecture (consistency of sub-Bayesian networks).
• Step 2: turn the Bayesian network into a factor graph by simply introducing one potential per node, connected to that node and its parents. It's important to include all the parents and the child in one factor, not separate factors. From here on out, all we need to think about is factor graphs.
• Step 3: condition on the evidence variables. Recall that conditioning on nodes in a factor graph shades them in, but it is a graph operation that rips those variables out of the graph.
• Step 4: remove nodes which are not connected to Q. These are independent of Q, so they have no impact on the result.
• Step 5: finally, run a standard probabilistic inference algorithm on the reduced factor graph. We'll do this manually for now (chugging through the algebra on this hopefully much smaller graph). Later we'll see automatic methods.
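As a baseline for everything that follows, the smallest possible inference algorithm is brute-force enumeration: sum the joint over everything except Q, fix E = e, and normalize at the end. Here is a minimal sketch on the cold/allergies network above; only the structure (C → H ← A → I) comes from the slide, and all CPT values are invented for illustration.

```python
# Inference by enumeration on the C -> H <- A -> I network.
# All numeric probabilities are invented for illustration.
p_c = {1: 0.2, 0: 0.8}                       # p(c): cold
p_a = {1: 0.3, 0: 0.7}                       # p(a): allergies
p_h1 = {(0, 0): 0.1, (0, 1): 0.5,            # p(H=1 | c, a): cough
        (1, 0): 0.7, (1, 1): 0.9}
p_i1 = {0: 0.1, 1: 0.6}                      # p(I=1 | a): itchy eyes

def joint(c, a, h, i):
    """P(C=c, A=a, H=h, I=i) as a product of local conditionals."""
    ph = p_h1[(c, a)] if h == 1 else 1 - p_h1[(c, a)]
    pi = p_i1[a] if i == 1 else 1 - p_i1[a]
    return p_c[c] * p_a[a] * ph * pi

# Query: P(C | H = 1, I = 0).  Sum out A, then normalize at the end.
unnorm = {c: sum(joint(c, a, 1, 0) for a in (0, 1)) for c in (0, 1)}
Z = sum(unnorm.values())
print({c: w / Z for c, w in unnorm.items()})
```

Enumeration is exponential in the number of unobserved variables, which is exactly why the rest of the lecture develops smarter algorithms.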
Example: alarm

[Figure: v-structure B → A ← E.]

b   p(b)        e   p(e)
1   ε           1   ε
0   1 − ε       0   1 − ε

b   e   a   p(a | b, e)
0   0   0   1
0   0   1   0
0   1   0   0
0   1   1   1
1   0   0   0
1   0   1   1
1   1   0   0
1   1   1   1

[whiteboard]
Query: P(B)
• Marginalize out A, E
Query: P(B | A = 1)
• Condition on A = 1

• Here is another example: the simple v-structured alarm network from last time. The alarm A goes off exactly when there is a burglary B or an earthquake E, i.e., p(a | b, e) = [a = (b ∨ e)].
• P(B) is trivial to compute after removing A and E.
• For P(B | A = 1), we can't remove everything, so we have to marginalize out E manually.

Example: A-H

[Figure: eight-node network over A, B, C, D, E, F, G, H.]

[whiteboard]
Query: P(C | B = b)
• Marginalize out everything else; note C ⊥ B
Query: P(C, H | E = e)
• Marginalize out A, D, F, G; note C ⊥ H | E

• In the first example, once we marginalize out all the other variables, we are left with C and B disconnected. Conditioning on B just removes that node, so we're left with P(C) = p(c).
• In the second example, note that the two query variables are conditionally independent given E, so we can compute them separately. The result is $P(C = c, H = h \mid E = e) \propto p(h \mid e) \sum_b p(e \mid b, c)$.
• But how do we compute these query distributions in general?

Roadmap
Preparation
Forward-backward
Gibbs sampling
Particle filtering

Hidden Markov model

[Figure: chain H1 → H2 → H3 → H4 → H5, with an emission Ei hanging off each Hi.]

$$P(H = h, E = e) = \underbrace{p(h_1)}_{\text{start}} \prod_{i=2}^{n} \underbrace{p(h_i \mid h_{i-1})}_{\text{transition}} \prod_{i=1}^{n} \underbrace{p(e_i \mid h_i)}_{\text{emission}}$$

Query (filtering): P(H_3 | E_1 = e_1, E_2 = e_2, E_3 = e_3)
Query (smoothing): P(H_3 | E_1 = e_1, E_2 = e_2, E_3 = e_3, E_4 = e_4, E_5 = e_5)

• The forward-backward algorithm will allow us to compute certain types of probabilistic queries exactly for HMMs.
• Hidden Markov models (HMMs) are an important instance of Bayesian networks. In principle, you could ask any type of query on an HMM, but there are two common ones: filtering and smoothing.
• Filtering asks for the distribution of some hidden variable H_i conditioned only on the evidence up until that point. This is useful when you're doing real-time object tracking and can't see the future.
• Smoothing asks for the distribution of some hidden variable H_i conditioned on all the evidence, including the future. This is useful when you have collected all the data and want to retroactively figure out what H_i was.
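To make the HMM definition concrete before we compute anything with it, here is a toy numerical instance; all the numbers are invented for illustration. The joint probability is just the product of the start, transition, and emission terms.

```python
import numpy as np

# A tiny HMM with K = 3 hidden states and 3 possible observations.
# All numbers are made up for illustration.
K = 3
start = np.array([0.5, 0.3, 0.2])              # p(h1)
trans = np.array([[0.8, 0.1, 0.1],             # p(h_i | h_{i-1}), row = h_{i-1}
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])
emit = np.array([[0.7, 0.2, 0.1],              # p(e_i | h_i), row = h_i
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6]])

def hmm_joint(h, e):
    """P(H = h, E = e): start * transitions * emissions."""
    p = start[h[0]] * emit[h[0], e[0]]
    for i in range(1, len(h)):
        p *= trans[h[i-1], h[i]] * emit[h[i], e[i]]
    return p

print(hmm_joint([0, 0, 1], [0, 1, 1]))
```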
Lattice representation

[Figure: lattice with a start node, an end node, and a 3 × 5 grid of nodes H_i = v, one column per variable H_1, ..., H_5 and one row per value v ∈ {1, 2, 3}.]

• Edge start ⇒ H_1 = h_1 has weight p(h_1) p(e_1 | h_1)
• Edge H_{i−1} = h_{i−1} ⇒ H_i = h_i has weight p(h_i | h_{i−1}) p(e_i | h_i)
• Each path from start to end is an assignment with weight equal to the product of node/edge weights

• Now let's actually compute these queries. We will do smoothing first. Filtering is a special case: if we're asking for H_i given E_1, ..., E_i, then we can marginalize out the future, reducing the problem to a smaller HMM.
• A useful way to think about inference is to return to state-based models. Consider a graph with a start node, an end node, and a node for each assignment of a value to a variable, H_i = v. The nodes are arranged in a lattice, where each column corresponds to one variable H_i and each row corresponds to a particular value v. Each path from start to end then corresponds exactly to a complete assignment to the variables.
• Note that in this reduction from a variable-based model to a state-based model, we have committed to an ordering of the variables.
• Each edge has a weight (a single number) determined by the local conditional probabilities (more generally, the potentials in a factor graph). For each edge into H_i = h_i, we multiply in the transition probability into h_i and the emission probability p(e_i | h_i). Remember that e_i is observed, so this is just a constant. This defines a weight for each path (assignment) in the graph equal to the joint probability P(H = h, E = e).
• Note that the lattice contains O(Kn) nodes and O(K²n) edges, where n is the number of variables and K is the number of values in the domain of each variable.

Lattice representation

[Same lattice figure as above.]

Marginals
P(H_i = h_i | E = e) ∝ µ_i(h_i): sum of weights of paths from start to end through H_i = h_i
Forward messages
F_i(h_i): sum of weights of paths from start to H_i = h_i
Backward messages
B_i(h_i): sum of weights of paths from H_i = h_i to end

• The point of bringing back the search-based view is that we can cast the probability queries we care about as sums over paths, and effectively use dynamic programming.
• First, define µ_i(v) to be the sum of the weights over all paths from the start node to the end node that pass through the intermediate node H_i = v. This is the quantity we want, up to normalization: P(H_i = v | E = e) ∝ µ_i(v). There are an exponential number of paths, but we can break the sum down.
• Define the forward message F_i(v) to be the sum of the weights over all paths from the start node to H_i = v. Analogously, define the backward message B_i(v) to be the sum of the weights over all paths from H_i = v to the end node.
• Given these two definitions, we have µ_i(v) = F_i(v) B_i(v).
• In summary, for each node H_i = v, we compute three numbers: F_i(v), B_i(v), µ_i(v). First, we sweep forward to compute all the F_i's recursively. At the same time, we sweep backward to compute all the B_i's recursively. Then we compute µ_i by pointwise multiplication.
• Implementation note: technically we can normalize µ_i to get P(H_i | E = e) at the very end, but it's useful to normalize F_i and B_i at each step to avoid underflow. In addition, normalizing the forward messages yields the filtering distributions: P(H_i = v | E_1 = e_1, ..., E_i = e_i) ∝ F_i(v).
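Here is a minimal sketch of these recurrences in code, reusing the toy start/trans/emit arrays from the HMM snippet above (repeated here so the block is self-contained). Following the implementation note, each F_i and B_i is normalized at every step to avoid underflow.

```python
import numpy as np

# Toy HMM from the previous snippet (values invented for illustration).
K = 3
start = np.array([0.5, 0.3, 0.2])
trans = np.array([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]])
emit = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])

def forward_backward(obs):
    """Smoothed marginals P(H_i | E = e) via forward-backward."""
    n = len(obs)
    F = np.zeros((n, K))
    B = np.zeros((n, K))
    # Forward: F_i(v) = sum of path weights from start to H_i = v.
    F[0] = start * emit[:, obs[0]]
    F[0] /= F[0].sum()
    for i in range(1, n):
        F[i] = (F[i-1] @ trans) * emit[:, obs[i]]
        F[i] /= F[i].sum()
    # Backward: B_i(v) = sum of path weights from H_i = v to end.
    B[n-1] = np.ones(K)
    for i in range(n - 2, -1, -1):
        B[i] = trans @ (emit[:, obs[i+1]] * B[i+1])
        B[i] /= B[i].sum()
    # Pointwise multiply and normalize: mu_i(v) = F_i(v) B_i(v).
    mu = F * B
    return mu / mu.sum(axis=1, keepdims=True)

print(forward_backward([0, 1, 1, 2]))
```

Note the cost: each sweep is O(K²) per step, so O(K²n) total, matching the number of edges in the lattice.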
Object tracking

[Figure: HMM with hidden chain H1 → H2 → H3 → H4 and an emission Ei below each Hi.]

Problem: object tracking
H_i ∈ {1, ..., K}: location of object at time step i
E_i ∈ {1, ..., K}: sensor reading at time step i
Start p(h_1): uniform over all locations
Transition p(h_i | h_{i−1}): uniform over adjacent locations
Emission p(e_i | h_i): uniform over adjacent locations
Observations: E = [1, 2, 3, 6]

[live solution]

Summary
• Lattice representation: paths are assignments (think state-based models)
• Dynamic programming: compute sums efficiently
• Forward-backward algorithm: share intermediate computations across different queries

Roadmap
Preparation
Forward-backward
Gibbs sampling
Particle filtering

Particle-based approximation

Key idea: particles
Use a set of assignments (particles) to represent a probability distribution.

Example:
             x1   x2   x3
Sample 1      0    0    1
Sample 2      0    0    0
Sample 3      1    1    0
Sample 4      1    0    0
Sample 5      0    0    1
Sample 6      1    0    0
Sample 7      0    1    0
Sample 8      1    1    0
Sample 9      1    1    0
Sample 10     1    1    0
Estimated marginals: 0.6  0.5  0.2

• The central idea in both Gibbs sampling and particle filtering is the use of particles (just a fancy word for complete assignments) to represent a probability distribution.
• Rather than storing the probability of every single assignment, we keep a set of assignments, some of which can occur multiple times (which implicitly represents a higher probability).
• From a set of particles, we can compute approximate marginals (or answer any query we want) by simply computing the fraction of particles that satisfy the desired condition.
• Once we have a set of particles, we can compute all the queries we want with it. So how do we actually generate the particles?

Gibbs sampling

Algorithm: Gibbs sampling
Initialize x to a random complete assignment
Loop through i = 1, ..., n until convergence:
  Compute the weight of x[X_i = v] for each v
  Choose x[X_i = v] with probability proportional to its weight

[demo]

• Recall that Gibbs sampling proceeds by going through each variable X_i, considering all possible values v ∈ Domain_i, computing the weight of the resulting assignment x[X_i = v], and choosing one with probability proportional to its weight.

Gibbs sampling

Algorithm: Gibbs sampling (probabilistic interpretation)
Loop through i = 1, ..., n until convergence:
• Set X_i = v with probability P(X_i = v | X_{−i} = x_{−i})
Notation: X_{−i} = X − {X_i}

Important: computing the conditional of X_i only requires the potentials touching X_i.

• A different (and more standard) way to state the same algorithm is in terms of conditional probabilities: consider the conditional distribution of X_i given all the other variables X_{−i}.
• Note that conditioning removes all of the graph except the potentials that depend on X_i. Therefore, in order to sample, we just have to compute the normalization constant for the simple factor graph over X_i, which is easy since it only involves adding up one weight for each v ∈ Domain_i.
• Advanced: Gibbs sampling is an instance of a Markov chain Monte Carlo (MCMC) algorithm, which generates a sequence of particles X^(1), X^(2), X^(3), .... A Markov chain is irreducible if there is positive probability of getting from any assignment to any other assignment (here the probabilities are over the random choices of the sampler). When the Gibbs sampler is irreducible, then in the limit as t → ∞, the distribution of X^(t) converges to the true distribution P(X). MCMC is a very rich topic which we will not talk about very much here.
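Here is a minimal Gibbs sampler sketch for binary variables. The model-specific piece is `local_weight`, a hypothetical callback that multiplies only the potentials touching X_i, as the note emphasizes; the chain example below uses an agreement potential [x_{i−1} = x_i] + 1 of the same flavor as the image application that follows.

```python
import random

def gibbs(n, local_weight, iters=1000):
    """Generic Gibbs sampler over n binary variables.

    local_weight(x, i, v) returns the product of the potentials touching
    X_i when X_i is set to v (all other values taken from x).
    """
    x = [random.randint(0, 1) for _ in range(n)]
    counts = [0] * n                          # for estimating P(X_i = 1)
    for _ in range(iters):
        for i in range(n):
            w = [local_weight(x, i, v) for v in (0, 1)]
            x[i] = 0 if random.random() < w[0] / (w[0] + w[1]) else 1
        for i in range(n):
            counts[i] += x[i]
    return [c / iters for c in counts]

# Example: a 4-variable chain with potentials [x_{i-1} = x_i] + 1,
# i.e., weight 2 for agreeing neighbors and 1 otherwise.
def local_weight(x, i, v):
    w = 1.0
    if i > 0:
        w *= 2.0 if x[i-1] == v else 1.0
    if i < len(x) - 1:
        w *= 2.0 if x[i+1] == v else 1.0
    return w

print(gibbs(4, local_weight))
```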
Application: image reconstruction

Example: image reconstruction
Setup:
• X_i ∈ {0, 1} is the pixel value at location i
• A small fraction of the pixels are observed
• Neighboring pixels are more likely to be the same than different
Potentials:
• o_i(x_i) = [x_i = observed value at i]
• t_{ij}(x_i, x_j) = [x_i = x_j] + 1

Application: image reconstruction

Example: image reconstruction
If the neighbors are 1, 1, 1, 0 and X_i is not observed:
P(X_i = 1 | X_{−i} = x_{−i}) = (2·2·2·1) / (2·2·2·1 + 1·1·1·2) = 0.8
If the neighbors are 0, 1, 0, 1 and X_i is not observed:
P(X_i = 1 | X_{−i} = x_{−i}) = (1·2·1·2) / (1·2·1·2 + 2·1·2·1) = 0.5

• Factor graphs play a huge role in computer vision applications. Here we take a look at a very simple image reconstruction application. Recall that a general factor graph defines a distribution P(X = x) ∝ Weight(x). Our example here is not a Bayesian network, but from the point of view of probabilistic inference, we don't care.
• We assume that we have observed some fraction of the pixels in an image, and we wish to recover the pixels that have been removed. Our simple factor graph has two kinds of potentials: transition potentials say that adjacent pixels are more likely to be similar than not; observation potentials exist only for observed pixels and say that the pixel value X_i must equal the observed value (we are assuming no noise other than the missing pixels).

Gibbs sampling: demo
[see web version]

• Try playing with the demo by modifying the settings to get a feeling for what Gibbs sampling is doing. Each iteration corresponds to resampling each pixel (variable).
• When you hit ctrl-enter for the first time, red and black correspond to 1 and 0, and white corresponds to unobserved.
• showMarginals allows you to view either the particles produced or the marginals estimated from the particles (the latter gives you a smoother probability estimate of the pixel values).
• If you increase missingFrac, the problem becomes harder.
• If you set coherenceFactor to 1, this is equivalent to turning off the edge potentials.
• If you set icm to true, we use local search rather than Gibbs sampling, which produces very bad solutions.

Roadmap
Preparation
Forward-backward
Gibbs sampling
Particle filtering

Hidden Markov models

[Figure: HMM over X1, ..., X5 with emissions E1, ..., E5, redrawn as a chain-structured factor graph with transition factors t_i between consecutive X's and an observation factor o_i on each X_i.]

• Although particle filtering applies to general factor graphs, we will restrict ourselves to chain-structured factor graphs. These factor graphs have transition potentials t_i(x_{i−1}, x_i) and observation potentials o_i(x_i). For HMMs, we have t_i(x_{i−1}, x_i) = p(x_i | x_{i−1}) and o_i(x_i) = p(e_i | x_i). (Here, we've switched notation from H_1, ..., H_n to X_1, ..., X_n.)

Review: beam search

Idea: keep a candidate list C of at most K partial assignments

Algorithm: beam search
Initialize C ← [{}]
For each i = 1, ..., n:
  Extend: C′ ← {x ∪ {X_i : v} : x ∈ C, v ∈ Domain_i}
  Prune: C ← the K elements of C′ with the highest weights

[demo: beamSearch({K:3})]

Review: beam search
[Figure: search tree explored with beam size K = 4.]

• Recall that beam search effectively does a pruned BFS of the search tree of partial assignments, where at each level we keep the K partial assignments with the highest weight.
• There are two phases. In the first phase, we extend all the existing candidates in C with all possible assignments to X_i; this results in K · |Domain_i| candidates C′. These C′ are sorted by weight and pruned by taking the top K.
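As code, the extend/prune loop looks like the sketch below. Here `weight_ext` is a hypothetical callback that returns the multiplicative weight of extending a partial assignment, e.g. t(x_{i−1}, v) · o_i(v) for a chain; the toy usage at the bottom is invented for illustration.

```python
def beam_search(n, domain, weight_ext, K):
    """Beam search over partial assignments (a minimal sketch).

    weight_ext(i, x, v) returns the multiplicative weight of extending
    the partial assignment x (a tuple) with X_i = v.
    """
    C = [((), 1.0)]                            # (assignment, weight) pairs
    for i in range(n):
        # Extend: try every value v of X_i for every candidate.
        extended = [(x + (v,), w * weight_ext(i, x, v))
                    for x, w in C for v in domain]
        # Prune: greedily keep the K highest-weight candidates.
        C = sorted(extended, key=lambda p: p[1], reverse=True)[:K]
    return C

# Example: 3 binary variables, weight favoring agreement with the
# previous value (a toy chain potential).
def weight_ext(i, x, v):
    return 2.0 if (i > 0 and x[-1] == v) else 1.0

print(beam_search(3, [0, 1], weight_ext, K=4))
```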
Beam search

End result:
• Candidate list C is a set of particles
• Use C to compute marginals

Problems:
• Extend: slow, because it requires considering every possible value for X_i
• Prune: greedily taking the best K doesn't provide diversity

Solution (3 steps): propose, reweight, resample

• Beam search does generate a set of particles, but there are two problems.
• First, it can be slow if Domain_i is large, because we have to try every single value. Perhaps we can be smarter about which values to try.
• Second, we greedily take the top K candidates, which can be too myopic. Can we somehow encourage more diversity?

Step 1: propose

[Figure: chain-structured factor graph over X1, ..., X5 with transition factors t_i and observation factors o_i.]

Definition: proposal distribution
The proposal distribution π_i(x_i | x_{1:i−1}) is a heuristic guess of the value of X_i given X_{1:i−1} = x_{1:i−1}.
Notation: x_{1:i} = (x_1, ..., x_i)

How to choose a proposal?

• The first step is to extend the current partial assignment (particle) from x_{1:i−1} = (x_1, ..., x_{i−1}) to x_{1:i} = (x_1, ..., x_i).
• Recall from factor graphs that upon assigning the variable X_i, we can include all the dependent potentials D(x_{1:i−1}, X_i), which for chain-structured factor graphs are the transition potential into x_i and the observation potential o_i(x_i).
• But in general, there are many possible values for X_i (for object tracking in a 100x100 grid, we have |Domain_i| = 10^4 possible values), and we don't want to enumerate all of them.
• So we will use a proposal distribution π_i, whose purpose is to provide an educated guess about what the value of X_i should be. The idea is that we can just sample from this distribution, which might be easier than enumerating all possible values v ∈ Domain_i.

Step 1: propose

Example: chain
Uniform: choose without any information
  π(x_i | x_{1:i−1}) ∝ 1
Transitions: choose x_i based on the previous location
  π(x_i | x_{1:i−1}) ∝ t(x_{i−1}, x_i)
Transitions + observations: choose x_i based on all available information
  π(x_i | x_{1:i−1}) ∝ t(x_{i−1}, x_i) o(x_i)

• There is a lot of flexibility in choosing the proposal distribution, and generally one faces an accuracy/efficiency tradeoff.
• If we propose uniformly from all possible v ∈ Domain_i, the proposal is very cheap to compute, but not very accurate.
• If we propose based on the transition potentials (forming a distribution proportional to them), we get something a bit more informed.
• Finally, if we take the observations into account as well, we get a much more informed proposal (assuming observations exist). In some sense, this is the best proposal we could hope for, since it takes into account all the potentials that we're adding.
• For HMMs, we generally propose using π_i(x_i | x_{1:i−1}) = p(x_i | x_{i−1}), the transition distribution.

Step 2: reweight proposed candidates

[Figure: the same chain-structured factor graph, with the new potentials t_{i−1} and o_i highlighted.]

Multiply in the new potentials (incorporate real information): t_{i−1}(x_{i−1}, x_i) o_i(x_i)
Divide by the proposal probability (remove guessed information): π(x_i | x_{1:i−1})

Weight: $w(x_{1:i}) = \dfrac{t_{i-1}(x_{i-1}, x_i)\, o_i(x_i)}{\pi(x_i \mid x_{1:i-1})}$

• Having generated a set of K candidates, we need to score them. In beam search, we just used the weight of the candidate; but particles, by virtue of being able to occur multiple times, have already taken into account all the potentials involving variables X_1, ..., X_{i−1} through their multiplicity. Therefore, we just need to take into account the weight of the new potentials t_{i−1} and o_i.
• We divide by the proposal probability because the proposal was just a guess, not actually part of the model.
• For example, if we proposed an assignment with twice the probability, we need to assign it half the weight, because we're going to get it twice as often on average.
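Here is a sketch of one propose-and-reweight step when the proposal is the transition distribution, as suggested for HMMs above. The dict-of-dicts representation of `trans` and `emit` is a hypothetical choice for illustration; with this proposal, the weight simplifies to the emission probability, which the next slide confirms.

```python
import random

def propose_and_reweight(particles, trans, emit, e_i):
    """One propose + reweight step for an HMM (a sketch).

    particles: list of current states x_{i-1} (last value only).
    trans[u]:  dict mapping next state v -> p(x_i = v | x_{i-1} = u),
               also used as the proposal pi.
    emit[v]:   dict mapping observation e -> p(e_i = e | x_i = v).
    """
    out = []
    for x_prev in particles:
        dist = trans[x_prev]
        # Propose: sample x_i ~ pi(x_i | x_{1:i-1}) = p(x_i | x_{i-1}).
        x_i = random.choices(list(dist), weights=list(dist.values()))[0]
        # Reweight: w = t * o / pi = p(e_i | x_i), since pi = t.
        out.append((x_i, emit[x_i][e_i]))
    return out
```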
Step 2: reweight proposed candidates

[Figure: HMM over X1, ..., X5 with emissions E1, ..., E5.]

Example: HMM
Model: t_{i−1}(x_{i−1}, x_i) = p(x_i | x_{i−1}), o_i(x_i) = p(e_i | x_i)
Proposal: π(x_i | x_{1:i−1}) = p(x_i | x_{i−1})
Weights: w(x_{1:i}) = p(e_i | x_i)

• For the HMM, we multiply the transition and emission probabilities and divide by the proposal (which happens to be the transition). This means we're left with the emission probability p(e_i | x_i) as the weight assigned to each particle x_{1:i}.

Step 3: resample

Question: given weighted particles, which should we choose?

Tricky situation:
• Target distribution close to uniform
• Fewer particles than locations

• Having proposed extensions to the particles and computed a weight for each one, we now come to the question of which particles to keep.
• As a motivating example, consider an almost uniform distribution over a set of locations, and trying to represent it with fewer particles than locations. This is a tough situation to be in.

Step 3: resample

[Figure: two particle placements side by side, the K locations with the highest weight versus K locations sampled from the distribution.]

Intuition: the top K assignments are not representative. Maybe random samples will be more representative...

• Beam search, which would choose the K locations with the highest weight, would clump all the particles near the mode. This is risky, because we then have no support farther from the center, where there is actually substantial probability.
• However, if we sample from the distribution proportional to the weights, then we can hedge our bets and get a more representative set of particles that covers the space more evenly.

Step 3: resample

Key idea: resampling
Given a distribution P(A = a) with n possible values, draw a sample K times.
Intuition: redistribute particles to more promising areas.

Example: resampling
a    P(A = a)
a1   0.70
a2   0.20
a3   0.05
a4   0.05

sample 1: a1
sample 2: a2
sample 3: a1
sample 4: a1

• After proposing and reweighting, we end up with a set of particles x_{1:i}, each with some weight w(x_{1:i}). Intuitively, if w(x_{1:i}) is really small, it might not be worth keeping that particle around.
• Resampling allows us to put (possibly multiple) particles on high-weight assignments. In the example above, we failed to sample a3 and a4 because they have low probability of being sampled.

Particle filtering

Algorithm: particle filtering
Initialize C ← [{}]
For each i = 1, ..., n:
  Propose (extend): C′ ← {(x_{1:i−1}, x_i) : x_{1:i−1} ∈ C, x_i ∼ π(x_i | x_{1:i−1})}
  Reweight: compute weights $w(x_{1:i}) = \dfrac{t_{i-1}(x_{i-1}, x_i)\, o_i(x_i)}{\pi(x_i \mid x_{1:i-1})}$ for x_{1:i} ∈ C′
  Resample (prune): C ← K elements drawn independently from the distribution ∝ w(x_{1:i})

[demo: particleFiltering({K:100})]

• The final algorithm is very similar to beam search. We go through all the variables X_1, ..., X_n.
• For each candidate x_{1:i−1} ∈ C, we sample x_i according to the proposal π(x_i | x_{1:i−1}).
• We then compute the weight w(x_{1:i}) of the extended particle.
• Finally, we select K particles by sampling each independently with probability proportional to w(x_{1:i}). For example, if w([1, 0]) = 3 and w([0, 0]) = 2, then we sample [1, 0] with probability 3/5 and [0, 0] with probability 2/5.
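Putting the three steps together, here is a minimal particle filter for an HMM that proposes from the transitions. All numbers in the usage example are invented for illustration; note that each particle stores only its last state and the final distribution is read off as counts, which connects to the implementation notes below.

```python
import random
from collections import Counter

def particle_filter(obs, domain, start, trans, emit, K):
    """Particle filtering for an HMM, proposing from transitions (a sketch).

    start[v]   = p(h1 = v);  trans[u][v] = p(x_i = v | x_{i-1} = u)
    emit[v][e] = p(e_i = e | x_i = v)
    The weight reduces to the emission probability because the proposal
    cancels the transition.
    """
    # Propose initial states from the start distribution.
    particles = random.choices(domain, weights=[start[v] for v in domain], k=K)
    for i, e in enumerate(obs):
        if i > 0:
            # Propose: sample x_i ~ p(x_i | x_{i-1}) for each particle.
            particles = [random.choices(domain,
                                        weights=[trans[u][v] for v in domain])[0]
                         for u in particles]
        # Reweight: w(x_{1:i}) = t * o / pi = p(e_i | x_i).
        weights = [emit[v][e] for v in particles]
        # Resample: draw K particles independently, proportional to weight.
        particles = random.choices(particles, weights=weights, k=K)
    # Estimated filtering distribution over the last state.
    return {v: c / K for v, c in Counter(particles).items()}

# Toy 1-D tracking chain with 3 locations; numbers invented.
domain = [0, 1, 2]
start = {v: 1/3 for v in domain}
trans = {u: {v: (0.5 if u == v else 0.25) for v in domain} for u in domain}
emit = {v: {e: (0.6 if e == v else 0.2) for e in domain} for v in domain}
print(particle_filter([0, 1, 1, 2], domain, start, trans, emit, K=200))
```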
Particle filtering: implementation

• If we only care about the last X_i, collapse all particles with the same value of X_i (think elimination):
  001 ⇒ 1
  101 ⇒ 1
  010 ⇒ 0
  010 ⇒ 0
  110 ⇒ 0
• If many particles are the same, we can just store counts:
  1, 1 ⇒ 1: 2
  0, 0, 0 ⇒ 0: 3

• In particle filtering as defined so far, each particle is an entire trajectory in the context of object tracking (an assignment to all the variables).
• Often in tracking applications, we only care about the last location X_i, and our chain-structured factor graph is such that the future (X_{i+1}, ..., X_n) is conditionally independent of X_1, ..., X_{i−1} given X_i. Therefore, we often just store the value of X_i rather than its entire ancestry.
• When we only keep track of X_i, many particles will have the same value, so it can be useful to store just the count of each value rather than keeping duplicates.

Application: tracking

[Figure: chain-structured factor graph over X1, ..., X5 with transition factors t_i and observation factors o_i.]

Example: tracking
• X_i: position of the object at time i
• Transitions: t_i(x_i, x_{i+1}) = [x_i near x_{i+1}]
• Observations: o_i(x_i) = sensor reading...

Particle filtering demo
[see web version]

• Consider a tracking application where an object is moving around in a grid and we are trying to figure out its location X_i ∈ {1, ..., grid-width} × {1, ..., grid-height}.
• The transition potentials say that from one time step to the next, the object is equally likely to have moved north, south, east, or west, or stayed put.
• Each observation is a location on the grid (a yellow dot). The observation potential is a user-defined function which depends on the vertical and horizontal distance.
• Play around with the demo to get a sense of how particle filtering works, especially under the different observation potentials.

Application: localization

Setup [Thrun]:
• X_i: location of the robot at time i
• Observations o_i(x_i): laser range finders provide information about the location
• Transitions t_i(x_i, x_{i+1}): non-zero if x_i and x_{i+1} are nearby

• In robot localization, the robot is given a map of the environment, but it doesn't know where it is. By wandering around and observing the environment, the robot picks up noisy observations (distances from walls) and maintains a distribution over where it is (via a set of particles).
• As you watch the video, think about where the robot could be based on its local environment. If two places look very similar on the map, the robot will maintain particles in each of those places. As the robot moves around and collects more data, the space of possible positions narrows, until the robot is finally localized.

Probabilistic inference

Model (Bayesian network or factor graph):
$$P(X = x) = \prod_{i=1}^{n} p(x_i \mid x_{\text{Parents}(i)})$$

Probabilistic inference: P(Q | E = e)

Algorithms:
• Preparation: leverage conditional independence
• Forward-backward: chain-structured (HMMs), exact
• Gibbs sampling, particle filtering: general, approximate

Next time: learning