
Natural Language Processing
Exercise 4: Language Modeling
1 Introduction
In this exercise, we are going to build and evaluate n-gram models (for various values of n) for the task of
probabilistic language modeling, or predicting the next word in a text. We will start by evaluating models based
on maximum likelihood, by estimating cross-entropy on both seen and unseen data. We will then consider some
simple smoothing techniques.
2 Data and preprocessing
We will use the same corpus as in the previous exercise, The Adventures of Sherlock Holmes, but this time split into two halves: holmes1.txt and holmes2.txt. This will be useful later on, because we can train models on one half and evaluate them on the other.
Note on tokenization: To avoid problems caused by tokenization differences between the lexicon and the
corpus files, you should make sure to use tokenizer5.py for tokenization or simply use the pre-tokenized
files holmes1-tok.txt and holmes2-tok.txt.
3 Evaluate your MLE model
In the previous exercise, you estimated a conditional bigram model using MLE on the entire Holmes corpus.
Your first task is to evaluate this model by estimating its cross-entropy on the two subcorpora holmes1-tok.txt
and holmes2-tok.txt. For this you can use the program entropy.py, which takes as arguments the n-gram order, a model file containing conditional probabilities in the format used in the previous exercise, and a test file tokenized in the usual way. Thus, you should run something like the following command:
python entropy.py 2 modelfile holmes1-tok.txt
You should replace modelfile by whatever you called your own model file, and then rerun the command
with holmes2-tok.txt to get the result for the second half of the corpus. Are you happy with the results?
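The internals of entropy.py are not shown here, but the quantity it reports is the per-word cross-entropy of the model on the test text. Below is a minimal sketch of that calculation for a conditional bigram model, assuming (purely for illustration) that the model has been loaded into a Python dict mapping (w1, w2) pairs to estimates of P(w2 | w1) and that the test corpus is a flat list of tokens; the actual program may handle file formats and sentence boundaries differently.

import math

def bigram_cross_entropy(model, tokens):
    # model:  dict mapping (w1, w2) to the estimate of P(w2 | w1)
    # tokens: list of tokens from the test corpus
    log_prob_sum = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p = model.get((w1, w2), 0.0)
        if p == 0.0:
            # Under MLE, an unseen bigram has probability zero, which makes
            # the cross-entropy infinite -- this is exactly the problem that
            # smoothing (section 6) is meant to address.
            return float('inf')
        log_prob_sum += math.log2(p)
    return -log_prob_sum / (len(tokens) - 1)

Averaging the negative log probabilities over the number of predicted tokens is what makes the result comparable across test texts of different lengths.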
4 Evaluate your MLE model on unseen data
A problem with the evaluation so far is that both test files were part of the training data. Does this matter?
To find out, retrain your MLE model only on holmes1-tok.txt and evaluate it on holmes2-tok.txt
and vice versa. Are you still happy with the results?
Note on programming: If the program you wrote to estimate the MLE model cannot easily be applied to a
different test set, you can use the program introduced in the next section.
5 Explore shorter and longer n-grams
Is it possible that the results just observed are peculiar to bigrams? To find out, replicate the experiment
with unigrams and trigrams. For this you can use the program ngram-mle.py, which can estimate a
conditional MLE model for n-grams of arbitrary length. For example, to estimate a trigram model from
holmes1-tok.txt, you need to run:
python ngram-mle.py 3 < holmes1-tok.txt > trigram.txt
If you change 3 to 1, you get a unigram model instead. And so on. All in all, you should (at least) run the
following experiments:
1. Train a unigram model on holmes1-tok.txt, test it on holmes1-tok.txt and holmes2-tok.txt
2. Train a bigram model on holmes1-tok.txt, test it on holmes1-tok.txt and holmes2-tok.txt
3. Train a trigram model on holmes1-tok.txt, test it on holmes1-tok.txt and holmes2-tok.txt
What can you learn from the results? Discuss with your group and with your teacher.
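For reference, the kind of estimation that ngram-mle.py performs can be sketched in a few lines. The sketch below is an illustration only, not the actual program: it assumes whitespace-separated tokens on standard input and no special treatment of sentence boundaries, and the output format is made up for the example.

import sys
from collections import Counter

def estimate_mle(tokens, n):
    # Conditional MLE: P(w_n | w_1 ... w_{n-1}) = count(n-gram) / count(context)
    ngram_freq = Counter()
    context_freq = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        ngram_freq[ngram] += 1
        context_freq[ngram[:-1]] += 1
    return {ng: freq / context_freq[ng[:-1]] for ng, freq in ngram_freq.items()}

if __name__ == '__main__':
    n = int(sys.argv[1])
    tokens = sys.stdin.read().split()
    for ngram, p in sorted(estimate_mle(tokens, n).items()):
        print(' '.join(ngram), p)

Note that for n = 1 the context is empty, so the context count is simply the total number of tokens and the estimates reduce to relative frequencies.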
6 Add smoothing
The problem with MLE is that it puts all the probability mass on seen events, which results in zero probabilities for all unseen events. To eliminate the zero probabilities, we need to apply smoothing or regularization, shifting some of the probability mass from seen to unseen events. For this exercise, we are going to use additive smoothing, which essentially means that we add some constant k to all our n-gram counts and renormalize by adding k|W| to the corresponding context counts. For example, this gives us the following formula for conditional bigram probabilities (where |W| is the size of the vocabulary):

P(w2 | w1) = (f(w1, w2) + k) / (f(w1) + k|W|)
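As a concrete illustration, here is a minimal sketch of this estimate computed directly from raw counts (the dictionaries bigram_freq and unigram_freq and their layout are assumptions made for the example, not part of any of the provided programs):

def add_k_prob(w1, w2, bigram_freq, unigram_freq, vocab_size, k):
    # Add-k estimate of P(w2 | w1): add k to the bigram count and
    # k * |W| to the context count.
    numerator = bigram_freq.get((w1, w2), 0) + k
    denominator = unigram_freq.get(w1, 0) + k * vocab_size
    return numerator / denominator

With k > 0, an unseen bigram now receives the small but non-zero probability k / (f(w1) + k|W|), while every seen bigram gives up a correspondingly small amount of probability mass.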
In principle, we could implement this model by generating a table of all possible n-grams with the smoothed
probability estimates, but such a table would be huge for higher n-grams. For this exercise, we are going to bypass this problem by using a different version of the entropy calculation program, entropy-freq.py, which
takes as input the raw n-gram frequencies instead of probability estimates, and which computes the probability
estimates on the fly. Here is how you run the program to compute cross-entropy for holmes2-tok.txt with
bigram frequencies stored in the file bigram-freqs.txt:
python entropy-freq.py 2 8938 bigram-freqs.txt holmes2-tok.txt
The first argument (2) is the n-gram order (bigrams in this case), and the second argument (8938) is the vocabulary size. The
only part of entropy-freq.py that you need to look at is the function that computes probabilities from
frequencies:
def prob(c, w):
    wf = 0
    cf = 0
    if c in cfreq:
        cf = cf + cfreq[c]
    if (c, w) in nfreq:
        wf = wf + nfreq[(c, w)]
    if cf > 0:
        return float(wf)/float(cf)
    else:
        return 0.0
Currently this computes the usual MLE probabilities. It first sets both the n-gram frequency (wf) and the
context frequency (cf) to zero. It then tries to look up the context frequency in cfreq and the n-gram
frequency in nfreq and, if successful, adds the frequencies found to cf and wf, respectively. Finally, it
divides the n-gram frequency by the context frequency and returns the result. Your task is now to modify this
definition so that it computes an add-k estimate instead of the MLE. It may be useful to know that the n-gram
order and the vocabulary size are stored in the global variables n and voc, respectively. You may also want
to use the program ngram-freq.py to generate the frequency files. (It works just like ngram-mle.py.)
When you have implemented the smoothing method, you should rerun the experiments from §5 and reflect on
the results. How well does the smoothing work?
7 Submitting the second assignment
The second assignment of the course consists of your work on Exercises 3 and 4. Make sure that you submit
all of the following to [email protected] at the end of the first week:
• Maximum likelihood estimates of the joint, marginal and conditional distributions of bigrams based on holmes.txt (Exercise 3).
• Cross-entropy results for the unigram, bigram and trigram MLE models, trained on holmes1-tok.txt
and tested on holmes1-tok.txt and holmes2-tok.txt, together with your own reflections on
the results.
• Your implementation of additive smoothing in the form of a modified version of entropy-freq.py.
• Cross-entropy results for the unigram, bigram and trigram models with additive smoothing, trained on holmes1-tok.txt and tested on holmes1-tok.txt and holmes2-tok.txt, together with your own reflections on the results.