Natural Language Processing
Spring 2007
V. "Juggy" Jagannathan

Course Book: Foundations of Statistical Natural Language Processing
by Christopher Manning & Hinrich Schütze

Chapter 9 – Markov Models
March 5, 2007

Markov models
• Markov assumption
  – Suppose X = (X_1, ..., X_T) is a sequence of random variables taking values in some finite set S = {s_1, ..., s_N}. The Markov properties are:
• Limited Horizon
  – P(X_{t+1} = s_k \mid X_1, ..., X_t) = P(X_{t+1} = s_k \mid X_t)
  – i.e. the value at t+1 depends only on the value at t
• Time invariant (stationary)
• Stochastic transition matrix A:
  – a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i), where a_{ij} \ge 0 for all i, j and \sum_{j=1}^{N} a_{ij} = 1 for all i

Markov model example
P(X_1, ..., X_T) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1, X_2) ... P(X_T \mid X_1, ..., X_{T-1})
                 = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_2) ... P(X_T \mid X_{T-1})
                 = \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}

P(t, i, p) = P(X_1 = t) P(X_2 = i \mid X_1 = t) P(X_3 = p \mid X_2 = i)
           = 1.0 \times 0.3 \times 0.6 = 0.18

Hidden Markov Model example
Probability of {lem, ice_t} given the machine starts in CP:
0.3 \times 0.7 \times 0.1 + 0.3 \times 0.3 \times 0.7 = 0.021 + 0.063 = 0.084

Why use HMMs?
• Underlying (hidden) events generate the surface observable events
• E.g. predicting the weather from the dampness of seaweed:
  http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html
• Linear interpolation in n-gram models:
  P_{li}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-2}, w_{n-1})
• See the notes from David Meir Blei [UC Berkeley], slides 1–13 (observed states):
  http://www-nlp.stanford.edu/fsnlp/hmm-chap/blei-hmm-ch9.ppt

Forward procedure
\alpha_i(t) = P(o_1 ... o_{t-1}, X_t = i \mid \mu)
Initialization: \alpha_i(1) = \pi_i,  1 \le i \le N
Induction: \alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t) a_{ij} b_{ijo_t},  1 \le t \le T, 1 \le j \le N
Total computation: P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T+1)

Backward procedure
\beta_i(t) = P(o_t ... o_T \mid X_t = i, \mu)
Initialization: \beta_i(T+1) = 1,  1 \le i \le N
Induction: \beta_i(t) = \sum_{j=1}^{N} a_{ij} b_{ijo_t} \beta_j(t+1),  1 \le t \le T, 1 \le i \le N
Total computation: P(O \mid \mu) = \sum_{i=1}^{N} \pi_i \beta_i(1)

Combining both – forward and backward
P(O, X_t = i \mid \mu) = P(o_1 ... o_T, X_t = i \mid \mu)
                       = P(o_1 ... o_{t-1}, X_t = i \mid \mu) P(o_t ... o_T \mid X_t = i, \mu)
                       = \alpha_i(t) \beta_i(t)
P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t) \beta_i(t),  1 \le t \le T+1

Finding the best state sequence
To determine the state sequence that best explains the observations, let:
\gamma_i(t) = P(X_t = i \mid O, \mu) = P(X_t = i, O \mid \mu) / P(O \mid \mu) = \alpha_i(t) \beta_i(t) / \sum_{j=1}^{N} \alpha_j(t) \beta_j(t)
Individually, the most likely state is:
\hat{X}_t = \arg\max_{1 \le i \le N} \gamma_i(t),  1 \le t \le T+1
This approach, however, does not correctly estimate the most likely state sequence.

Finding the best state sequence – Viterbi algorithm
\arg\max_X P(X \mid O, \mu)
Store the most probable path that leads to a given node:
\delta_j(t) = \max_{X_1 ... X_{t-1}} P(X_1 ... X_{t-1}, o_1 ... o_{t-1}, X_t = j \mid \mu)
Initialization: \delta_j(1) = \pi_j,  1 \le j \le N
Induction: \delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t) a_{ij} b_{ijo_t},  1 \le j \le N
Store backtrace: \psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t) a_{ij} b_{ijo_t},  1 \le j \le N
Termination:
\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)
P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)

Parameter estimation
Probability of traversing an arc at time t, given the observation sequence O:
p_t(i, j) = P(X_t = i, X_{t+1} = j \mid O, \mu)
          = P(X_t = i, X_{t+1} = j, O \mid \mu) / P(O \mid \mu)
          = \alpha_i(t) a_{ij} b_{ijo_t} \beta_j(t+1) / \sum_{m=1}^{N} \alpha_m(t) \beta_m(t)

\sum_{t=1}^{T} \gamma_i(t) = expected number of transitions from state i in O
\sum_{t=1}^{T} p_t(i, j) = expected number of transitions from state i to state j in O

\hat{a}_{ij} = (expected number of transitions from state i to j) / (expected number of transitions from state i)
             = \sum_{t=1}^{T} p_t(i, j) / \sum_{t=1}^{T} \gamma_i(t)

\hat{b}_{ijk} = (expected number of transitions from state i to j with k observed) / (expected number of transitions from state i to j)
              = \sum_{t : o_t = k, 1 \le t \le T} p_t(i, j) / \sum_{t=1}^{T} p_t(i, j)
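The forward and backward procedures above translate directly into a few lines of NumPy. The sketch below assumes the chapter's arc-emission convention, with b_{ijk} (the probability of emitting symbol k while moving from state i to state j) stored as an (N, N, K) array; the function names and array layout are illustrative choices, not taken from the slides.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward procedure for an arc-emission HMM.

    pi  : (N,)      initial state probabilities pi_i
    A   : (N, N)    A[i, j] = P(X_{t+1}=j | X_t=i)
    B   : (N, N, K) B[i, j, k] = P(emit symbol k on the arc i -> j)
    obs : length-T list of symbol indices o_1 ... o_T

    Returns alpha of shape (T+1, N); in 0-based indexing,
    alpha[t, i] = P(o_1 ... o_t, X_{t+1} = i), so alpha[0] = pi.
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                  # alpha_i(1) = pi_i
    for t in range(T):
        # alpha_j(t+1) = sum_i alpha_i(t) * a_ij * b_{ij o_t}
        alpha[t + 1] = alpha[t] @ (A * B[:, :, obs[t]])
    return alpha

def backward(A, B, obs):
    """Backward procedure: beta[t, i] = P(o_{t+1} ... o_T | X_{t+1} = i)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0                                  # beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):
        # beta_i(t) = sum_j a_ij * b_{ij o_t} * beta_j(t+1)
        beta[t] = (A * B[:, :, obs[t]]) @ beta[t + 1]
    return beta
```

As a sanity check, `forward(pi, A, B, obs)[-1].sum()` and `(pi * backward(A, B, obs)[0]).sum()` both compute P(O | \mu); with transition and output tables filled in for the crazy soft drink machine (CP always the start state, outputs emitted from the departing state), this should reproduce the 0.084 figure from the worked example above.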
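A corresponding sketch of the Viterbi recurrence, keeping the same (pi, A, B) parameterization as the forward/backward sketch. The backtrace readout \hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1), implicit in the "store backtrace" step of the slides, is made explicit in the final loop.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding for an arc-emission HMM (illustrative sketch).

    Returns (path, best_prob): path lists the T+1 states X_1 ... X_{T+1}
    as 0-based indices, and best_prob = max_i delta_i(T+1).
    """
    T, N = len(obs), len(pi)
    delta = np.zeros((T + 1, N))            # delta[t, j]: best prob of a path ending in j
    psi = np.zeros((T + 1, N), dtype=int)   # psi[t, j]: best predecessor of j (backtrace)
    delta[0] = pi                           # delta_j(1) = pi_j
    for t in range(T):
        # scores[i, j] = delta_i(t) * a_ij * b_{ij o_t}
        scores = delta[t][:, None] * A * B[:, :, obs[t]]
        delta[t + 1] = scores.max(axis=0)
        psi[t + 1] = scores.argmax(axis=0)
    # Pick the best final state, then follow the backtrace pointers.
    path = [int(delta[T].argmax())]
    for t in range(T, 0, -1):
        path.append(int(psi[t][path[-1]]))
    path.reverse()
    return path, float(delta[T].max())
```

For long observation sequences one would normally work with log probabilities to avoid underflow; the sketch keeps the slides' probability form for readability.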
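Finally, a minimal sketch of one Baum-Welch (EM) re-estimation step, implementing p_t(i, j), \hat{a}_{ij}, and \hat{b}_{ijk} from the parameter-estimation slides. It reuses the alpha and beta tables produced by the forward/backward sketch above; the re-estimation of \pi is not covered on these slides and is omitted here.

```python
import numpy as np

def reestimate(A, B, obs, alpha, beta):
    """One Baum-Welch re-estimation step for an arc-emission HMM.

    A     : (N, N)    current a_ij
    B     : (N, N, K) current b_ijk
    obs   : length-T list of symbol indices o_1 ... o_T
    alpha : (T+1, N)  table from forward() above
    beta  : (T+1, N)  table from backward() above

    Returns re-estimated (A_hat, B_hat).
    """
    T = len(obs)
    N, _, K = B.shape
    P_O = alpha[T].sum()                    # P(O | mu)

    # p[t, i, j] = p_t(i, j) = alpha_i(t) a_ij b_{ij o_t} beta_j(t+1) / P(O | mu)
    p = np.zeros((T, N, N))
    for t in range(T):
        p[t] = alpha[t][:, None] * A * B[:, :, obs[t]] * beta[t + 1][None, :] / P_O

    gamma = p.sum(axis=2)                   # gamma_i(t) = sum_j p_t(i, j)

    # a_hat_ij = sum_t p_t(i, j) / sum_t gamma_i(t)
    A_hat = p.sum(axis=0) / gamma.sum(axis=0)[:, None]

    # b_hat_ijk = sum_{t : o_t = k} p_t(i, j) / sum_t p_t(i, j)
    B_hat = np.zeros((N, N, K))
    for t in range(T):
        B_hat[:, :, obs[t]] += p[t]
    B_hat /= p.sum(axis=0)[:, :, None]      # arcs with zero expected count yield NaN in this bare sketch

    return A_hat, B_hat
```

Iterating forward, backward, and reestimate until the likelihood P(O | \mu) stops improving is the usual training loop for this chapter's model.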