
Natural Language Processing
Spring 2007
V. “Juggy” Jagannathan
Course Book
Foundations of Statistical
Natural Language Processing
By Christopher Manning & Hinrich Schütze
Chapter 9
Markov Models
March 5, 2007
Markov models
• Markov assumption
– Suppose X = (X_1, …, X_T) is a sequence of random variables taking values in some finite set S = {s_1, …, s_N}. The Markov properties are:
• Limited Horizon
– $P(X_{t+1} = s_k \mid X_1, \ldots, X_t) = P(X_{t+1} = s_k \mid X_t)$
– i.e. the value at t+1 depends only on the value at t
• Time invariant (stationary)
• Stochastic transition matrix A:
– $a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$, where $a_{ij} \ge 0$ for all $i, j$ and $\sum_{j=1}^{N} a_{ij} = 1$ for all $i$
Markov model example
$P(X_1, \ldots, X_T) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_T \mid X_1, \ldots, X_{T-1})$
$\quad = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2) \cdots P(X_T \mid X_{T-1})$
$\quad = \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}$

$P(t, i, p) = P(X_1 = t)\, P(X_2 = i \mid X_1 = t)\, P(X_3 = p \mid X_2 = i) = 1.0 \times 0.3 \times 0.6 = 0.18$
Hidden Markov Model Example
Probability of {lem, ice-t} given that the machine starts in CP?
$0.3 \times 0.7 \times 0.1 + 0.3 \times 0.3 \times 0.7 = 0.021 + 0.063 = 0.084$
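The two paths behind this sum can be made explicit in a short sketch. The state and output names (CP, IP, lem, ice-t) follow Manning & Schütze's crazy soft drink machine example; only the probabilities that appear in the slide's arithmetic are filled in, so treat the tables as assumed values.

```python
# Two ways to start in CP, emit lem, then emit ice-t: stay in CP or move to the
# iced-tea-preferring state IP.  Only the probabilities from the slide's
# arithmetic are listed; the full machine has more entries (assumed here).
emit_CP  = {"lem": 0.3, "ice_t": 0.1}    # output probabilities in state CP
emit_IP  = {"ice_t": 0.7}                # output probability in state IP
trans_CP = {"CP": 0.7, "IP": 0.3}        # transition probabilities out of CP

p_stay = emit_CP["lem"] * trans_CP["CP"] * emit_CP["ice_t"]   # 0.3 * 0.7 * 0.1
p_move = emit_CP["lem"] * trans_CP["IP"] * emit_IP["ice_t"]   # 0.3 * 0.3 * 0.7
print(p_stay + p_move)                                        # 0.021 + 0.063 = 0.084
```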
Why use HMMs?
• Underlying events generating surface observable events
• E.g. predicting the weather from the dampness of seaweed
• http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html
• Linear Interpolation in n-gram models (see the sketch below):
$P_{li}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-2}, w_{n-1})$
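The interpolation formula is just a weighted mixture of unigram, bigram, and trigram estimates. The sketch below is illustrative: the component models and the lambda weights are placeholders (assumptions); in practice the weights are tuned on held-out data and must sum to 1.

```python
# Illustrative linear interpolation of n-gram estimates.  P1, P2, P3 stand for
# already-trained unigram/bigram/trigram models (assumed callables), and the
# lambda weights are placeholders that should sum to 1.
def interpolated_trigram(P1, P2, P3, lambdas, w_n, w_n1, w_n2):
    """P_li(w_n | w_{n-2}, w_{n-1}) = l1*P1(w_n) + l2*P2(w_n|w_{n-1}) + l3*P3(w_n|w_{n-2},w_{n-1})."""
    l1, l2, l3 = lambdas
    return l1 * P1(w_n) + l2 * P2(w_n, w_n1) + l3 * P3(w_n, w_n1, w_n2)
```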
Look at the notes from David Meir Blei [UC Berkeley]:
http://www-nlp.stanford.edu/fsnlp/hmm-chap/blei-hmm-ch9.ppt
Slides 1-13 (observed states)
Forward Procedure
$\alpha_i(t) = P(o_1 \cdots o_{t-1},\, X_t = i \mid \mu)$
Initialization:
$\alpha_i(1) = \pi_i, \quad 1 \le i \le N$
Induction:
$\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le t \le T,\ 1 \le j \le N$
Total computation:
$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T+1)$
Backward Procedure
$\beta_i(t) = P(o_t \cdots o_T \mid X_t = i, \mu)$
Initialization:
$\beta_i(T+1) = 1, \quad 1 \le i \le N$
Induction:
$\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{ij o_t}\, \beta_j(t+1), \quad 1 \le t \le T,\ 1 \le i \le N$
Total computation:
$P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$
Combining both – forward and backward
$P(O, X_t = i \mid \mu) = P(o_1 \cdots o_T,\, X_t = i \mid \mu)$
$\quad = P(o_1 \cdots o_{t-1},\, X_t = i \mid \mu)\; P(o_t \cdots o_T \mid X_t = i, \mu)$
$\quad = \alpha_i(t)\, \beta_i(t)$
$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t), \quad 1 \le t \le T+1$
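A quick numerical check of this identity, using the forward() and backward() sketches above on a small random arc-emission HMM (all model values here are made up for the test):

```python
import numpy as np

# Small random arc-emission HMM, just to check the identity numerically.
rng = np.random.default_rng(0)
N, M, T = 3, 4, 5
pi = np.full(N, 1.0 / N)
A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
B = rng.random((N, N, M)); B /= B.sum(axis=2, keepdims=True)
O = rng.integers(0, M, size=T)

alpha, p_fwd = forward(pi, A, B, O)      # forward() from the sketch above
beta = backward(A, B, O)                 # backward() from the sketch above
assert np.isclose(p_fwd, (pi * beta[0]).sum())
for t in range(T + 1):                   # sum_i alpha_i(t) beta_i(t) = P(O | mu) for every t
    assert np.isclose((alpha[t] * beta[t]).sum(), p_fwd)
```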
Finding the best state sequence
To determine the state sequence that best explains the observations, let:
$\gamma_i(t) = P(X_t = i \mid O, \mu) = \dfrac{P(X_t = i, O \mid \mu)}{P(O \mid \mu)} = \dfrac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}$
Individually, the most likely state is:
$\hat{X}_t = \arg\max_{1 \le i \le N} \gamma_i(t), \quad 1 \le t \le T+1$
This approach, however, does not correctly estimate the most
likely state sequence.
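In code, the per-time-step posteriors and their argmax look like this (again leaning on the forward() and backward() sketches above; the function name is my own):

```python
import numpy as np

def best_individual_states(pi, A, B, O):
    """gamma[t-1, i] = gamma_i(t) = P(X_t = i | O, mu); returns gamma and the
    individually most likely state at each time step (not the best sequence)."""
    alpha, p_obs = forward(pi, A, B, O)     # from the forward sketch above
    beta = backward(A, B, O)                # from the backward sketch above
    gamma = alpha * beta / p_obs            # alpha_i(t) beta_i(t) / P(O | mu)
    return gamma, gamma.argmax(axis=1)      # argmax_i gamma_i(t) for each t
```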
Finding the best state sequence
Viterbi algorithm
$\arg\max_{X} P(X \mid O)$
Store the most probable path that leads to a given node:
$\delta_j(t) = \max_{X_1 \cdots X_{t-1}} P(X_1 \cdots X_{t-1},\, o_1 \cdots o_{t-1},\, X_t = j \mid \mu)$
Initialization:
$\delta_j(1) = \pi_j, \quad 1 \le j \le N$
Induction:
$\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le j \le N$
Store backtrace:
$\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le j \le N$
Termination:
$\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$
$P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)$
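A NumPy sketch of the Viterbi recursion and backtrace under the same assumed arc-emission conventions as before:

```python
import numpy as np

def viterbi(pi, A, B, O):
    """Most probable state sequence X_1 .. X_{T+1} (0-indexed states) and its probability."""
    N, T = len(pi), len(O)
    delta = np.zeros((T + 1, N))
    psi = np.zeros((T + 1, N), dtype=int)
    delta[0] = pi                                        # delta_j(1) = pi_j
    for t in range(T):
        scores = delta[t][:, None] * A * B[:, :, O[t]]   # score of i -> j emitting o_t
        delta[t + 1] = scores.max(axis=0)                # best incoming score for each j
        psi[t + 1] = scores.argmax(axis=0)               # backtrace pointer for each j
    path = np.zeros(T + 1, dtype=int)
    path[T] = delta[T].argmax()                          # X_hat_{T+1}
    for t in range(T, 0, -1):                            # read the path back via psi
        path[t - 1] = psi[t, path[t]]
    return path, delta[T].max()                          # P(X_hat)
```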
Parameter Estimation
Probability of traversing an arc at time t, given the observation sequence O:
$p_t(i, j) = P(X_t = i,\, X_{t+1} = j \mid O, \mu) = \dfrac{P(X_t = i,\, X_{t+1} = j,\, O \mid \mu)}{P(O \mid \mu)} = \dfrac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$
$\sum_{t=1}^{T} \gamma_i(t)$ = expected number of transitions from state $i$ in $O$
$\sum_{t=1}^{T} p_t(i, j)$ = expected number of transitions from state $i$ to state $j$ in $O$
Parameter Estimation
$\hat{a}_{ij} = \dfrac{\text{expected number of transitions from state } i \text{ to } j}{\text{expected number of transitions from state } i} = \dfrac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}$
$\hat{b}_{ijk} = \dfrac{\text{expected number of transitions from state } i \text{ to } j \text{ with } k \text{ observed}}{\text{expected number of transitions from state } i \text{ to } j} = \dfrac{\sum_{\{t:\, o_t = k,\ 1 \le t \le T\}} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}$
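Putting the re-estimation formulas together, a minimal Baum-Welch step might look like the sketch below. It reuses the forward() and backward() sketches above; the variable names and the absence of smoothing are my own simplifications.

```python
import numpy as np

def baum_welch_step(pi, A, B, O):
    """One EM re-estimation step for the arc-emission HMM of the slides.

    Returns (A_hat, B_hat); pi can be re-estimated as gamma at t = 1.
    No smoothing is applied, so zero expected counts give divide-by-zero.
    """
    N, M, T = B.shape[0], B.shape[2], len(O)
    alpha, p_obs = forward(pi, A, B, O)             # from the sketches above
    beta = backward(A, B, O)

    # p_t(i, j) = alpha_i(t) a_ij b_{ij o_t} beta_j(t+1) / P(O | mu)
    p = np.zeros((T, N, N))
    for t in range(T):
        p[t] = alpha[t][:, None] * A * B[:, :, O[t]] * beta[t + 1][None, :]
    p /= p_obs

    gamma = p.sum(axis=2)                           # gamma_i(t) = sum_j p_t(i, j), t = 1..T
    A_hat = p.sum(axis=0) / gamma.sum(axis=0)[:, None]

    B_hat = np.zeros_like(B)
    obs = np.asarray(O)
    for k in range(M):                              # sum p_t(i, j) over steps where o_t = k
        B_hat[:, :, k] = p[obs == k].sum(axis=0)
    B_hat /= p.sum(axis=0)[:, :, None]
    return A_hat, B_hat
```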