Natural Language Processing Exercise 4: Language Modeling 1 Introduction In this exercise, we are going to build and evaluate n-gram models (for various values of n) for the task of probabilistic language modeling, or predicting the next word in a text. We will start by evaluating models based on maximum likelihood, by estimating cross-entropy on both seen and unseen data. We will then consider some simple smoothing techniques. 2 Data and preprocessing We will use the same corpus as in the previous exercise, The Adventures of Sherlock Holmes, but mostly split into two halves: holmes1.txt and holmes2.txt. This will be useful later on because we can train models on one half and evaluate them on the other half. Note on tokenization: To avoid problems caused by tokenization differences between the lexicon and the corpus files, you should make sure to use tokenizer5.py for tokenization or simply use the pre-tokenized files holmes1-tok.txt and holmes2-tok.txt. 3 Evaluate your MLE model In the previous exercise, you estimated a conditional bigram model using MLE on the entire Holmes corpus. Your first task is to evaluate this model by estimating its cross-entropy on the two subcorpora holmes1-tok.txt and holmes2-tok.txt. For this you can use the program entropy.py, which takes as arguments a model file, containing conditional probabilities in the format used in the previous exercise, and a test file, tokenized in the usual way. Thus, you should run something like the following command: python entropy.py 2 modelfile holmes1-tok.txt You should replace modelfile by whatever you called your own model file, and then rerun the command with holmes2-tok.txt to get the result for the second half of the corpus. Are you happy with the results? 4 Evaluate your MLE model on unseen data A problem with the evaluation so far is that both the test files were part of the training file. Does this matter? To find out, retrain your MLE model only on holmes1-tok.txt and evaluate it on holmes2-tok.txt and vice versa. Are you still happy with the results? Note on programming: If the program you wrote to estimate the MLE model cannot easily be applied to a different test set, you can use the program introduced in the next section. 5 Explore shorter and longer n-grams Is it possible that the results just observed are peculiar to bigrams? To find out, replicate the experiment with unigrams and trigrams. For this you can use the program ngram-mle.py, which can estimate a conditional MLE model for n-grams of arbitrary length. For example, to estimate a trigram model from holmes1-tok.txt, you need to run: python ngram-mle.py 3 < holmes1-tok.txt > trigram.txt If you change 3 to 1, you get a unigram model instead. And so on. All in all, you should (at least) run the following experiments: 1. Train a unigram model on holmes1-tok.txt, test it on holmes1.txt and holmes2-tok.txt 2. Train a bigram model on holmes1-tok.txt, test it on holmes1.txt and holmes2-tok.txt 3. Train a trigram model on holmes1-tok.txt, test it on holmes1.txt and holmes2-tok.txt What can you learn from the results? Discuss with your group and with your teacher. 1 6 Add smoothing The problem with MLE is that it puts all the probability mass on seen events, which results in zero probabilities for all unseen events. To eliminate the zero probabilities, we need to apply smoothing or regularization by shifting some of the probability mass from unseen to seen events. For this exercise, we are going to use additive smoothing, which essentially means that we add some constant k to all our n-gram counts and then renormalize increasing the shorter n-gram counts proportionally. For example, this gives us the following formula for conditional bigram probabilities (where |W | is the size of the vocabulary): f (w1 , w2 ) + k f (w1 ) + k|W | In principle, we could implement this model by generating a table of all possible n-grams with the smoothed probability estimates, but such a table would be huge for higher n-grams. For this exercise, we are going to bypass this problem by using a different version of the entropy calculation program, entropy-freq.py, which takes as input the raw n-gram frequencies instead of probability estimates, and which computes the probability estimates on the fly. Here is how you run the program to compute cross-entropy for holmes2-tok.txt with bigram frequencies stored in the file bigrams-freqs.txt: python entropy-freq.py 2 8938 bigram-freqs.txt holmes2-tok.txt The first argument (2) is the bigram order, and the second argument (8938) is the vocabulary size. The only part of entropy-freq.py that you need to look at is the function that computes probabilities from frequencies: def prob(c, w): wf = 0 cf = 0 if c in cfreq: cf = cf + cfreq[c] if (c, w) in nfreq: wf = wf + nfreq[(c, w)] if cf > 0: return float(wf)/float(cf) else: return 0.0 Currently this computes the usual MLE probabilities. It first sets both the n-gram frequency (wf) and the context frequency (cf) to zero. It then tries to look up the context frequency in cfreq and the n-gram frequency in nfreq and, if successful, adds the frequencies found to cf and wf, respectively. Finally, it divides the n-gram frequency by the context frequency and returns the result. Your task is now to modify this definition so that it computes an add-k estimate instead of the MLE. It may be useful to know that the n-gram order and the vocabulary size are stored in the global variables n and voc, respectively. You may also want to use the program ngram-freq.py to generate the frequency files. (It works just like ngram-mle.py.) When you have implemented the smoothing method, you should rerun the experiments from §5 and reflect on the results. How does the smoothing work? 7 Submitting the second assignment The second assignment of the course consists of your work with Exercise 3 and 4. Make sure that you submit all of the following to [email protected] at the end of the first week: • Maximum likelihood estimates of the joint, marginal and conditional distribution of bigrams based on holmes.txt (Exercise 3) • Cross-entropy results for the unigram, bigram and trigram MLE models, trained on holmes1-tok.txt and tested on holmes1-tok.txt and holmes2-tok.txt, together with your own reflections on the results. • Your implementation of additive smoothing in the form of a modified version of entropy-freq.py. • Cross-entropy results for the unigram, bigram and trigram models with additive smoothing, trained on holmes1.txt and tested on holmes1-tok.txt and holmes2-tok.txt, together with your own reflections on the results. 2
© Copyright 2025 ExpyDoc