Natural Language Processing
Spring 2007
V. “Juggy” Jagannathan
Course Book
Foundations of Statistical
Natural Language Processing
By
Christopher Manning & Hinrich
Schutze
Chapter 9
Markov Models
March 5, 2007
Markov models
• Markov assumption
– Suppose X = (X1, …, XT) is a sequence of
random variables taking values in some finite set
S = {s1,…,sN}, Markov properties are:
• Limited Horizon
– P(Xt+1 = sk|X1,…,Xt) = P(Xt+1 = sk|Xt)
– i.e. the t+1 value only depends on t value
• Time invariant (stationary)
• Stochastic Transition matrix A:
– aij = P(Xt+1 = sj|Xt=si) where
aij 0, i, j & j 1 aij 1, i
N
Markov model
example
P( X 1 ,..., X T ) P( X 1 ) P( X 2 | X 1 ) P( X 3 | X 1 , X 2 )...P( X T | X 1 ,..., X T 1 )
P( X 1 ) P( X 2 | X 1 ) P( X 3 | X 2 )...P( X T | X T 1 )
T 1
X1 a X t X t1
t 1
P(t , i, p) P( X 1 t ) P( X 2 i | X 1 t ) P( X 3 p | X 2 i)
1.0 0.3 0.6
0.18
Hidden Markov Model Example
Probability: {lem,ice-t} given
the machine starts in CP?
0.3x0.7x0.1+0.3x0.3x0.7
=0.021+0.063 = 0.084
Why use HMMs?
• Underlying events generating surface observable
events
• Eg. Predicting weather based on dampness of seaweeds
• http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/
main.html
• Linear Interpolation in n-gram models:
Pli ( wn | wn2 , wn1 )
1P1 (wn ) 2 P2 (wn | wn1 ) 3 P3 (wn | wn2 , wn1 )
Look at Notes from David Meir Blei [UC
Berkley]
http://www-nlp.stanford.edu/fsnlp/hmm-chap/blei-hmm-ch9.ppt
Slides 1-13
(Observed states)
Forward Procedure
i (t ) P(o1...ot 1, X t i | )
Initialization:
i (1) i ,1 i N
Induction:
N
j (t 1) i (t )aijbijo ,1 t T ,1 j N
i 1
Total computation:
t
N
P(O | ) i (T 1)
i 1
Forward Procedure
i (t ) P(ot ...oT , X t i | )
Initialization:
i (T 1) 1,1 i N
Induction:
N
i (t ) aijbijo j (t 1),1 t T ,1 i N
j 1
t
Total computation:
N
P(O | ) i i (1)
i 1
Backward Procedure
P(O, X t i | ) P(o1...oT , X t i | )
P(o1...ot 1 , X t i | ) P(ot ...oT | X t i | )
i (t ) i (t )
N
P(O | ) i (t ) i (t ),1 t T 1
i 1
Combining both – forward and backward
Finding the best state sequence
To determine the state sequence that best explains observations
Let:
i (t )
P ( X t i, O | )
P (O | )
i (t ) j (t )
N
j 1 j (t ) j (t )
Individually the most likely state is:
X t arg max i (t ),1 t T 1
1i N
This approach, however, does not correctly estimate the most
likely state sequence.
Finding the best state sequence
Viterbi algorithm
arg max P ( X | O )
X
Store the most probable path that leads to a given node
j (t ) max P( X 1...X t 1 , o1...ot 1 , X t j | )
x1 ...xt 1
Initialization
j (1) j ,1 j N
Induction
j (t 1) max i (t )aijbijo ,1 j N
1i N
t
Store Backtrace
j (t 1) arg max i (t )aijbijo ,1 j N
t
1i N
X T 1 arg max i (T 1)
1i N
P ( X ) max i (T 1)
1 i N
Parameter Estimation
Parameter Estimation
Probability of traversing an arc at time t given observation sequence O:
pt (i, j ) P( X t i, X t 1 j | O, )
P( X t i, X t 1 j , O | )
P(O | )
i (t )aijbijot j (t 1)
N
m1 m (t ) m (t )
T
(t ) expected _ num ber_ of _ transitions _ from _ state _ i _ in _ O
t 1
i
T
p (i, j) expected _ num ber_ of _ transitions _ from _ state _ i _ to _ j _ in _ O
t 1
t
Parameter Estimation
expected _ num ber_ of _ transitions _ from _ state _ i _ to _ j
expected _ num ber_ of _ transitions _ from _ state _ i
aij
T
t 1
T
pt (i, j )
t 1
bijk
i (t )
expected _ num ber_ of _ transitions _ from _ state _ i _ to _ j _ with _ k _ observed
expected _ num ber_ of _ transitions _ from _ state _ i _ to _ j
{t:ot k ,1t T }
T
t 1 t
pt (i, j )
p (i, j )
© Copyright 2026 ExpyDoc