Natural Language Toolkit

Examples taken from: nltk.sourceforge.net/tutorial/introduction/index.html

Overview
• The NLTK is a set of Python modules for carrying out many common natural language tasks.
• Access it at nltk.sourceforge.net
• There are versions for Windows, OS X, Unix, and Linux. Detailed instructions are on the Installation tab.
• In addition to the toolkit you will need two other modules: tkinter and Numeric. We haven’t been able to get Numeric to install smoothly with Python 2.4 under Windows, only with 2.3.
• You will also want the contrib and data packages.
• Pay attention to what INSTALL.TXT in the data package says about the NLTK_CORPORA path.

Accessing NLTK
• Standard Python import command
>>> from nltk.corpus import gutenberg
>>> gutenberg.items()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
 'blake-poems.txt', 'blake-songs.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
 'chesterton-thursday.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
Or
>>> import nltk.corpus
>>> nltk.corpus.gutenberg.items()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
 'blake-poems.txt', 'blake-songs.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
 'chesterton-thursday.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Modules
• The NLTK modules include:
– token: classes for representing and processing individual elements of text, such as words and sentences.
– probability: classes for representing and processing probabilistic information.
– tree: classes for representing and processing hierarchical information over text.
– cfg: classes for representing and processing context free grammars.
– fsa: finite state automata
– tagger: tagging each word with a part-of-speech, a sense, etc.
– parser: building trees over text (includes chart, chunk and probabilistic parsers)
– classifier: classify text into categories (includes feature, featureSelection, maxent, naivebayes)
– draw: visualize NLP structures and processes
– corpus: access (tagged) corpus data
• We will cover some of these explicitly as we reach topics.

One Simple Example
IDLE 1.0.3
>>> from nltk.tokenizer import *
>>> text_token = Token(TEXT='Hello world. This is a test file.')
>>> print text_token
<Hello world. This is a test file.>
>>> WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(text_token)
>>> print text_token
<[<Hello>, <world.>, <This>, <is>, <a>, <test>, <file.>]>
>>> print text_token['TEXT']
Hello world. This is a test file.
>>> print text_token['WORDS']
[<Hello>, <world.>, <This>, <is>, <a>, <test>, <file.>]

LAB
• Detailed documentation and tutorials are under the Documentation tab at the SourceForge site.
• Work through the “gentle introduction” and “elementary language processing” tutorials on the NLTK site: nltk.sourceforge.net/tutorial/introduction/index.html
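
Before starting the lab, it may help to see what the whitespace tokenization in One Simple Example amounts to. The sketch below reproduces the idea in plain Python without NLTK. The function name whitespace_tokenize and the dictionary standing in for the Token object are illustrative assumptions, not part of the toolkit, and the code uses the Python 3 print function rather than the print statement shown on the slides.

```python
# Plain-Python sketch of whitespace tokenization; no NLTK required.
# whitespace_tokenize and the dict layout are hypothetical stand-ins
# for NLTK's WhitespaceTokenizer and Token, not the toolkit's own API.

def whitespace_tokenize(text):
    """Split text on runs of whitespace; punctuation stays attached to words."""
    return text.split()

# A dict with TEXT and WORDS keys mimics the properties seen on the slide.
text_token = {"TEXT": "Hello world. This is a test file."}
text_token["WORDS"] = whitespace_tokenize(text_token["TEXT"])

print(text_token["TEXT"])
print(text_token["WORDS"])
```

As in the slide's output, "world." and "file." keep their trailing periods, because splitting on whitespace alone does not separate punctuation from words.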