Natural Language Toolkit

Examples taken from: nltk.sourceforge.net/tutorial/introduction/index.html
Overview
• NLTK is a set of Python modules for carrying
out many common natural language processing tasks.
• Access it at nltk.sourceforge.net
• There are versions for Windows, OS X, Unix, and
Linux. Detailed instructions are on the Installation tab.
• In addition to the toolkit you will need two other
modules: tkinter and Numeric. We haven’t been
able to get Numeric to install smoothly with
Python 2.4 under Windows, only with 2.3.
• You will also want the contrib and data packages.
• Pay attention to what INSTALL.TXT in the data
package says about the NLTK_CORPORA path.
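A quick way to confirm the setting before starting a session is sketched below. It assumes NLTK_CORPORA is read as an ordinary environment variable holding a directory path; INSTALL.TXT remains the authority on the exact setup. The helper name corpora_path_status is hypothetical, and the code targets a current Python 3 interpreter rather than the Python 2.x used elsewhere in these slides.

```python
import os

def corpora_path_status(environ=None):
    """Return (path, ok): the NLTK_CORPORA value and whether it is an existing directory.

    Assumption: NLTK_CORPORA is a plain environment variable naming a directory.
    """
    env = os.environ if environ is None else environ
    path = env.get("NLTK_CORPORA")
    ok = path is not None and os.path.isdir(path)
    return path, ok

path, ok = corpora_path_status()
if ok:
    print("NLTK_CORPORA points at", path)
else:
    print("NLTK_CORPORA is unset or not a directory:", path)
```

If the check fails, set the variable as INSTALL.TXT describes and restart the interpreter so the new value is picked up.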
Accessing NLTK
Standard Python import command
>>> from nltk.corpus import gutenberg
>>> gutenberg.items()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
'bible-kjv.txt', 'blake-poems.txt', 'blake-songs.txt',
'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']
Or
>>> import nltk.corpus
>>> nltk.corpus.gutenberg.items()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
'bible-kjv.txt', 'blake-poems.txt', 'blake-songs.txt',
'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']
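Both import forms bind the same module object; only the name you type afterwards differs. The sketch below uses the standard library's os.path as a stand-in for nltk.corpus, so it runs without the toolkit installed. The variable same_object is introduced here for illustration only.

```python
# Form 1: bring one name into the current namespace.
from os import path

# Form 2: import the package and use the fully qualified name.
import os.path

# Both names refer to the very same module object.
same_object = path is os.path
print(same_object)  # True
```

The first form is shorter to type in an interactive session; the second keeps it obvious where each name comes from.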
Modules
• The NLTK modules include:
– token: classes for representing and processing individual
elements of text, such as words and sentences
– probability: classes for representing and processing probabilistic
information.
– tree: classes for representing and processing hierarchical
information over text.
– cfg: classes for representing and processing context-free
grammars.
– fsa: finite state automata
– tagger: tagging each word with a part-of-speech, a sense, etc.
– parser: building trees over text (includes chart, chunk and
probabilistic parsers)
– classifier: classify text into categories (includes feature,
featureSelection, maxent, naivebayes)
– draw: visualize NLP structures and processes
– corpus: access (tagged) corpus data
• We will cover some of these explicitly as we reach the relevant topics.
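To make one entry in the list concrete: the kind of information the probability module handles can be approximated with a plain word-frequency count. This is a sketch built on Python's collections.Counter, not NLTK's own classes, and relative_frequencies is a hypothetical helper name. It assumes Python 3, where `/` is true division.

```python
from collections import Counter

def relative_frequencies(words):
    """Map each distinct word to its share of the total word count."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

words = "the cat sat on the mat".split()
freqs = relative_frequencies(words)
print(freqs["the"])  # "the" accounts for 2 of the 6 words
```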
One Simple Example
IDLE 1.0.3
>>> from nltk.tokenizer import *
>>> text_token = Token(TEXT='Hello world. This is a test file.')
>>> print text_token
<Hello world. This is a test file.>
>>> WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(text_token)
>>> print text_token
<[<Hello>, <world.>, <This>, <is>, <a>, <test>, <file.>]>
>>> print text_token['TEXT']
Hello world. This is a test file.
>>> print text_token['WORDS']
[<Hello>, <world.>, <This>, <is>, <a>, <test>, <file.>]
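The same split can be reproduced without the toolkit, which helps show what whitespace tokenization does and does not do. The sketch below uses only built-in string methods under Python 3. It is not how WhitespaceTokenizer is implemented, just an illustration of the behaviour seen in the session above.

```python
import string

text = "Hello world. This is a test file."

# str.split() with no arguments splits on runs of whitespace,
# so punctuation stays attached to the neighbouring word.
words = text.split()
print(words)
# ['Hello', 'world.', 'This', 'is', 'a', 'test', 'file.']

# Stripping leading and trailing punctuation is a separate step.
cleaned = [w.strip(string.punctuation) for w in words]
print(cleaned)
# ['Hello', 'world', 'This', 'is', 'a', 'test', 'file']
```

Note that 'world.' and 'file.' keep their full stops after the first step, exactly as in the NLTK output.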
LAB
• Detailed documentation and tutorials are
under the Documentation tab at the
SourceForge site.
• Work through the “gentle introduction”
and “elementary language
processing” tutorials for NLTK:
nltk.sourceforge.net/tutorial/introduction/index.html