01_NLP_Introduction

Information Communication Theory (情報伝達学)
Kentaro Inui (乾 健太郎)
Naoaki Okazaki (岡崎 直観)
2011-10-04
Information Communication Theory (情報伝達学)
1
Course Plan
• Part I (Okazaki)
• 10/04: Introduction
• 10/11: Classification
• 10/18: Part-of-speech tagging
• 10/25: Syntactic parsing
• 11/01: Statistical parsing
• Part II (Inui)
• 11/08: Features and unification
• 11/15: Representation of meaning
• 11/22: Computational semantics
• 11/29: Computational lexical semantics
• 12/06: (no class)
• Part III (Inui, Okazaki, TAs)
• 12/13, 12/20, 2013/01/10, 01/17, 01/24
• Programming exercises and project from Natural
Language Processing with Python (by Steven Bird)
• Lectures are given in the Large Computer Exercise Room (計算機大演習室), 1st floor of the
New Student Laboratory Building for Information Engineering (情報新棟1階)
Course Format
• Text (optional)
• Jurafsky, Daniel and Martin, James H. Speech and
Language Processing. Prentice-Hall, 2009 (2nd Edition)
• About ¥6,000 at amazon.co.jp
• Bird, Steven et al. Natural Language Processing with
Python. O'Reilly Media, 2009
• Japanese translation by 萩原 正人, 中山 敬広, and 水野 貴明: 『入門 自然言語処理』,
O'Reilly Japan, 2010
• Grading
• Exercises (given in lectures): 40%
• Final report (programming project)
Handouts
• If necessary, please print the handout yourself and
bring it to class
• Alternatively, browse it on your laptop
• Handouts will be available at (before dawn):
• http://www.cl.ecei.tohoku.ac.jp/index.php?InformationCommunicationTheory
• Username: nlp2012
• Password: chukougishitsu
Contact Information
• Office hours:
• Tue, 1:00-2:30pm or by appointment
• Office:
• Room 305 (108 after Nov), Electrical Engineering and
Applied Physics Research Building No.3 (電気系3号館)
• Contact:
• [email protected] @inuikentaro
• [email protected] @chokkanorg
Introduction
Naoaki Okazaki
[email protected]
http://www.chokkan.org/
http://twitter.com/#!/chokkanorg
#nlptohoku
http://www.chokkan.org/lectures/2012nlp/p/01.pdf
Natural Language Processing (NLP)
• Giving computers the ability to process human language
• As old as the idea of computers themselves!
• Implementations and implications of the exciting idea
• The long-awaited dream (that has not come true yet)
[Figures: Doraemon; C-3PO (Star Wars); Atom (Astro Boy)]
What needs to be done to
understand language
as humans do?
Part I: Knowledge (disciplines)
Lexical semantics (語彙意味論)
How much Chinese silk was exported to Western
Europe by the end of the 18th century?
Meaning of words
[Figure: compass rose (N, W, E, S)]
Compositional semantics (合成意味論)
How much Chinese silk was exported to Western
Europe by the end of the 18th century?
Meaning of constituents
[Figure: timeline from 1700 to 1800 labeled "The 18th Century", with "the end of" marking its final part]
Compositional??? (with adjectives)
• white towel
• former girlfriend
• white wine
• black hole
Morphology (形態論)
How much Chinese silk was exported to Western
Europe by the end of the 18th century?
Study of word formation
(breaking words down into morphemes)
• Inflection (屈折)
• is – was – being – been
• export – exports – exporting – exported – exported
• Derivation (派生)
• China – Chinese
• West – Western
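The difference between regular and irregular inflection can be sketched in a few lines of Python. This is a toy illustration only: the helper name `inflect_verb` and both lookup tables are hypothetical, and the regular rule ignores spelling changes (for example, it would produce "makeing" from "make").

```python
# Toy illustration: real morphological analysis needs spelling rules
# and a much larger lexicon than these hypothetical tables.
IRREGULAR_VERBS = {
    # base form -> inflected forms listed on the slide
    "be": ["is", "was", "being", "been"],
}

# Derivation changes the word class (here: noun -> adjective).
DERIVATIONS = {"China": "Chinese", "West": "Western"}


def inflect_verb(base):
    """Return inflected forms of an English verb (naive sketch)."""
    if base in IRREGULAR_VERBS:
        return IRREGULAR_VERBS[base]
    # Regular pattern: 3rd person singular, -ing form, past, past participle
    return [base + "s", base + "ing", base + "ed", base + "ed"]


print(inflect_verb("export"))  # ['exports', 'exporting', 'exported', 'exported']
print(inflect_verb("be"))      # ['is', 'was', 'being', 'been']
print(DERIVATIONS["West"])     # Western
```

Irregular forms have to be listed, while regular forms can be generated by rule; this is one reason morphological analyzers combine rules with a lexicon.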
Syntax (統語論,文法)
Principles and rules for constructing
phrases and sentences
• Part-of-speech (POS): Lecture #3
• Categorization of words, e.g., nouns, verbs, adjectives, adverbs
• Constituency: Lectures #4 and #5
• Grouping words that may behave as a single unit or phrase
• e.g., noun phrase, verb phrase, prepositional phrase
• Grammatical relations: Lecture #5
• Relationship between words/constituents
Syntactic tagging and parsing
• Assign a structure to an input sentence (example from Nivre and Kübler, 2006)
• POS tagging: Economic/JJ news/NN had/VBD little/JJ effect/NN on/IN financial/JJ markets/NNS .
• Constituent parsing: [Figure: phrase-structure tree over the sentence with nodes S, NP, VP, PP, PU]
• Dependency parsing: [Figure: dependency arcs over the sentence labeled sbj, obj, nmod, pc, p]
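A dependency parse can be stored as a list of labeled head-dependent arcs. The sketch below encodes one plausible reading of the example sentence; the arc list is an assumption reconstructed from the labels shown on the slide (the figure itself is not reproduced here), and the `root` label and the helper names are hypothetical.

```python
WORDS = ["Economic", "news", "had", "little", "effect",
         "on", "financial", "markets", "."]

# (dependent, head, label); words are numbered from 1, head 0 is an
# artificial root. ASSUMPTION: arcs reconstructed from the slide's labels.
ARCS = [
    (1, 2, "nmod"),  # Economic <- news
    (2, 3, "sbj"),   # news <- had
    (3, 0, "root"),  # had is the root ("root" is not a label on the slide)
    (4, 5, "nmod"),  # little <- effect
    (5, 3, "obj"),   # effect <- had
    (6, 5, "nmod"),  # on <- effect
    (7, 8, "nmod"),  # financial <- markets
    (8, 6, "pc"),    # markets <- on
    (9, 3, "p"),     # . <- had
]


def is_tree(arcs, n_words):
    """Check that every word has exactly one head and reaches the root."""
    heads = {}
    for dep, head, _label in arcs:
        if dep in heads or not 0 <= head <= n_words:
            return False
        heads[dep] = head
    if set(heads) != set(range(1, n_words + 1)):
        return False
    for node in heads:
        seen = set()
        while node != 0:
            if node in seen:  # cycle: this word never reaches the root
                return False
            seen.add(node)
            node = heads[node]
    return True


def dependents_of(head_index):
    """List (word, label) pairs attached directly to the given head."""
    return [(WORDS[d - 1], label) for d, h, label in ARCS if h == head_index]


print(is_tree(ARCS, len(WORDS)))
print(dependents_of(3))
```

Under these assumptions, the verb "had" governs its subject, its object, and the final punctuation, while the modifiers attach lower in the tree.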
Semantic role (意味役割)
How much Chinese silk was exported to Western
Europe by the end of the 18th century?
"by the end of the 18th century": TEMPORAL
How much Chinese silk was exported to Western
Europe by southern merchants?
"by southern merchants": AGENT
Coreference (共参照)
U: Where is The Green Hornet playing in Mountain View?
S: The Green Hornet is playing at the Century 16 theatre.
U: When is it playing there?
S: It’s playing at 2pm, 5pm, and 8pm.
U: I’d like 1 adult and 2 children for the first show.
How much would that cost?
What does “it” refer to?
What does “the first show” refer to?
What does “that” refer to?
We can guess these easily!
How words like that or pronouns like it refer
to previous parts of the discourse
Pragmatics (語用論)
Actions that speakers intend
by their use of text
• Bob: Are you coming to the party?
• Jane: I’m afraid I can’t.
• Bob: Are you coming to the party?
• Jane: You know, I’m really busy.
• Bob: Could you pass me the sugar?
• Jane: Yes. Here you are.
Discourse (談話)
Coherent structured
groups of text
http://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf
Various kinds of knowledge about language
• Morphology (形態論): meaningful components within words
• Syntax (文法): structural relationships between words
• Semantics (意味論): meanings of words, phrases, sentences
• Discourse (談話): relationships across/beyond different
sentences or statements; contextual processing
• Pragmatics (語用論): relationship of meaning to the goals and
intentions of speakers; how we use language to
communicate
• World knowledge (世界知識): facts of the world; common
sense
What needs to be done to
understand language
as humans do?
Part II: Ambiguity
Ambiguity
• We may build multiple, alternative linguistic
structures and interpretations for a single input
• I made her duck (see more examples later)
• Disambiguation (or resolution): to decide which
linguistic/semantic structure/interpretation is the
most appropriate (in the context)
Part-of-speech tagging and ambiguity
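Time  flies  like  an  arrow  .
NN    VBZ    IN    DT  NN     .   (光陰矢のごとし: time passes as swiftly as an arrow)
VB    NNS    IN    DT  NN     .   (ハエの速度を矢のように測定せよ: measure the speed of flies as you would an arrow's)
NN    NNS    VBP   DT  NN     .   (時蠅は矢を好む: "time flies" are fond of an arrow)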
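To see where the alternative taggings come from, one can enumerate every tag sequence that a lexicon allows. The sketch below uses a hypothetical mini-lexicon covering only this sentence; a real tagger would use a full lexicon and a statistical model to rank the candidates.

```python
from itertools import product

# Hypothetical mini-lexicon: each word maps to the tags it could take.
LEXICON = {
    "Time": ["NN", "VB"],
    "flies": ["VBZ", "NNS"],
    "like": ["IN", "VBP"],
    "an": ["DT"],
    "arrow": ["NN"],
    ".": ["."],
}


def candidate_taggings(words):
    """Enumerate every tag sequence the lexicon permits for the words."""
    return [list(tags) for tags in product(*(LEXICON[w] for w in words))]


SENTENCE = ["Time", "flies", "like", "an", "arrow", "."]
TAGGINGS = candidate_taggings(SENTENCE)

print(len(TAGGINGS))  # 2 * 2 * 2 * 1 * 1 * 1 = 8 candidates
for tags in TAGGINGS:
    print(" ".join(tags))
```

Only some of the eight candidates correspond to grammatical readings; choosing among them is the disambiguation problem that POS tagging has to solve.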
Attachment ambiguity
• I saw the girl on the hill with a telescope.
[Figures: alternative parse trees that attach "on the hill" and "with a telescope" to different heads]
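The number of structures for this sentence can be counted with a small CKY-style chart. The grammar and lexicon below are hypothetical and chosen only to allow a prepositional phrase to attach to either a verb phrase or a noun phrase; they are not the grammar used in the lecture.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (parent, left, right).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),  # PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),  # PP attaches to a noun phrase
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: one category per word.
LEXICON = {
    "I": "NP", "saw": "V", "the": "Det", "a": "Det",
    "girl": "N", "hill": "N", "telescope": "N",
    "on": "P", "with": "P",
}


def count_parses(words):
    """Count parse trees rooted in S using a CKY chart of counts."""
    n = len(words)
    # chart[(i, j, symbol)] = number of trees for words[i:j] with that root
    chart = defaultdict(int)
    for i, word in enumerate(words):
        chart[(i, i + 1, LEXICON[word])] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j, parent)] += (
                        chart[(i, k, left)] * chart[(k, j, right)]
                    )
    return chart[(0, n, "S")]


print(count_parses("I saw the girl".split()))  # 1
print(count_parses("I saw the girl on the hill with a telescope".split()))  # 5
```

With this grammar, adding two prepositional phrases raises the count from one parse to five, which illustrates how quickly attachment ambiguity multiplies the candidate structures.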
Coordination ambiguity
• Put [[the insects in the box] and [the bowl on the table]]
• Put the insects in [[the box] and [the bowl on the table]]
Semantic ambiguity
• Syntactic structure is insufficient to represent the meaning
• Distinction between syntax and semantics
• Colorless green ideas sleep furiously (Chomsky, 1957)
• Conversely, different syntactic structures can describe the same event
• John bought a book from Mary vs
Mary sold a book to John
• Lexical ambiguity
• I went to the bank… (of the river) or (to get some money)
• Quantifier scope
• Every man loves a woman
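The two readings of "Every man loves a woman" can be written in first-order logic. The predicate names are illustrative; reading (a) lets each man love a possibly different woman, while reading (b) says there is one particular woman whom every man loves.

```latex
\begin{align*}
\text{(a)}\quad & \forall x\,\bigl(\mathrm{man}(x) \rightarrow
    \exists y\,(\mathrm{woman}(y) \land \mathrm{loves}(x, y))\bigr) \\
\text{(b)}\quad & \exists y\,\bigl(\mathrm{woman}(y) \land
    \forall x\,(\mathrm{man}(x) \rightarrow \mathrm{loves}(x, y))\bigr)
\end{align*}
```

Both formulas correspond to the same surface sentence, so the ambiguity lies in the relative scope of the quantifiers rather than in the words or the syntactic structure.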
The state of the art in
Natural Language Processing
Commercial world
• A lot of exciting stuff going on…
Machine translation (Google)
Watson (IBM)
• Question answering system built on IBM’s DeepQA technology
• On 14-16 February 2011, Watson beat two human competitors: the
biggest all-time money winner on Jeopardy! and the record
holder for the longest championship streak
• Hardware
• 2880 processor cores (3.5 GHz POWER7 eight-core processors)
• 16 TB RAM in total
• Software
• Written in Java and C++
• Using Apache Hadoop framework for distributed computing
• Data
• 200M pages (about 1M books) of structured and unstructured content
• Consuming 4 TB of disk storage
• Encyclopedias, dictionaries, thesauri, newswire articles, literary works
http://en.wikipedia.org/wiki/Watson_(computer)
Jeopardy!
• American quiz show featuring
• history, literature, the arts, pop culture, science, sports,
geography, wordplay, etc.
• Six categories are announced, each with five
trivia clues
• A correct response adds the dollar value
• An incorrect response or a failure to respond
within a five-second time limit deducts the dollar
value
http://en.wikipedia.org/wiki/Jeopardy!
Final Jeopardy! and the Future of Watson
• Watch the video (08:58):
• http://www.youtube.com/watch?v=Wq0XnBYC3nQ
Science behind an answer
• Watch the very nice video (06:42):
• http://www.youtube.com/watch?v=DywO4zksfXw
Science behind an answer
• Step 1: Question analysis
• What type of question is being asked?
• What is the question asking for?
• Step 2: Hypothesis generation
• Search millions of documents for possible answers
• Step 3: Hypothesis and evidence scoring
• Collect positive and negative evidence for each answer
• Score the evidence on everything from source material
reliability to whether times and locations appear correct
• Parallelized evidence scoring for each possible answer
• Step 4: Final merging and ranking
• Learn the importance of each kind of evidence by playing practice games
• Yield the final ranking of possible answers
• Decide whether Watson answers the question or not based on the
confidence
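Steps 3 and 4 can be illustrated with a toy scorer that combines evidence scores with weights and answers only when the confidence is high enough. This is a minimal sketch and not Watson's actual implementation: the evidence names, weights, scores, and threshold are all made-up values, and a real system would learn the weights from data.

```python
# ASSUMPTION: hypothetical weights for three kinds of evidence (sum to 1).
WEIGHTS = {"source_reliability": 0.5, "time_match": 0.25, "location_match": 0.25}

# ASSUMPTION: hypothetical candidate answers with evidence scores in [0, 1].
CANDIDATES = {
    "answer A": {"source_reliability": 0.8, "time_match": 1.0, "location_match": 1.0},
    "answer B": {"source_reliability": 0.6, "time_match": 0.0, "location_match": 1.0},
}


def rank_answers(candidates, weights):
    """Return (answer, confidence) pairs sorted by weighted evidence score."""
    scored = [
        (answer, sum(weights[name] * score for name, score in evidence.items()))
        for answer, evidence in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def decide(candidates, weights, threshold):
    """Answer with the top candidate only if its confidence is high enough."""
    best_answer, confidence = rank_answers(candidates, weights)[0]
    return best_answer if confidence >= threshold else None


print(rank_answers(CANDIDATES, WEIGHTS))
print(decide(CANDIDATES, WEIGHTS, threshold=0.7))   # answer A
print(decide(CANDIDATES, WEIGHTS, threshold=0.95))  # None
```

Raising the threshold makes the system stay silent more often, which matters in a game where a wrong response costs money.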
A shame (of NLP)
• Japanese translation of the book, “Einstein: His
Life and Universe,” published on 23 June 2011
• Chapter 13 was translated by computers, not by humans!
• How this happened:
http://www.amazon.co.jp/review/R29GQAF5DUOAEW/ref=cm_cr_rdp_perm
• It is very rare for a machine-translated book to be published
• A revised version was published on 17 Aug 2011
Imagine the original sentence
• ボルンの妻のヘートヴィヒに最大限にしてください。(その
ヘートヴィヒは,彼の家族に関する彼の処理,今や説教さ
れた頃,彼が「自分がそのかなり不幸な回答に駆り立て
られるのを許容していないべきでない」と自由に彼に叱っ
た)。以上は,彼が目立つべきであり,彼女が言ったのを
「科学の人里離れている寺」に尊敬します。
• Max Born's wife, Hedwig, who had freely scolded Einstein
about his treatment of his family, now lectured, “[You
should] not have allowed yourself to be goaded into that
rather unfortunate reply.” He should show more respect,
she said, for “the secluded temple of science.” (P286)
Passing exams for University of Tokyo
Writing short science fiction
Goal of this course
• Give an overview of the issues and technologies for natural
language understanding
• What is possible/easy? What is impossible/difficult?
• Why is this achieved or not achieved by the current
technology?
• Provide fundamental theories and techniques for
natural language processing
• Some techniques are useful for other research fields
• Practice programming on real NLP tasks
• You will be an experienced engineer!
Course plan
1. 4 Oct: Introduction
2. 11 Oct: Classification
• Spam filtering, linear classifier, feature extraction,
perceptron, logistic regression, evaluation (precision,
recall, F1)
3. 18 Oct: Part-of-speech tagging
4. 25 Oct: Syntactic parsing
5. 1 Nov: Statistical parsing
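As a preview of the evaluation measures listed for the classification lecture, precision, recall, and F1 can be computed from gold and predicted labels as follows. The function name and the example labels are hypothetical; the formulas are the standard definitions.

```python
def precision_recall_f1(gold, predicted, positive="spam"):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels for a spam filter: 2 true positives, 1 false positive,
# 1 false negative.
GOLD = ["spam", "spam", "ham", "ham", "spam"]
PREDICTED = ["spam", "ham", "spam", "ham", "spam"]

print(precision_recall_f1(GOLD, PREDICTED))
```

In this example precision, recall, and F1 all equal 2/3, because the classifier makes one false-positive and one false-negative error against two correct spam predictions.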