Information Communication Theory (情報伝達学)
Kentaro Inui (乾 健太郎)
Naoaki Okazaki (岡崎 直観)
2011-10-04 Information Communication Theory (情報伝達学)

Course Plan
• Part I (Okazaki)
  • 10/04: Introduction
  • 10/11: Classification
  • 10/18: Part-of-speech tagging
  • 10/25: Syntactic parsing
  • 11/01: Statistical parsing
• Part II (Inui)
  • 11/08: Features and unification
  • 11/15: Representation of meaning
  • 11/22: Computational semantics
  • 11/29: Computational lexical semantics
  • 12/06: (no class)
• Part III (Inui, Okazaki, TAs)
  • 12/13, 12/20, 2013/01/10, 01/17, 01/24
  • Programming exercises and project from Natural Language Processing with Python (by Steven Bird)
• Lectures given at 計算機大演習室 (New Student Laboratory Building for Information Engineering, 情報新棟1階)

Course Format
• Text (optional)
  • Jurafsky, Daniel and Martin, James H. Speech and Language Processing. Prentice-Hall, 2009 (2nd Edition)
    • About ¥6,000, available at amazon.co.jp
  • Bird, Steven et al. Natural Language Processing with Python.
    O'Reilly & Associates Inc., 2009
    • Japanese translation by 萩原 正人, 中山 敬広, 水野 貴明: 『入門 自然言語処理』, O'Reilly Japan, 2010
• Grading
  • Exercises (given in lectures): 40%
  • Final report (programming project)

Handouts
• If necessary, please print out the handout and bring it to class yourself
• Alternatively, browse it on your laptop
• Handouts will be available (before dawn) at:
  • http://www.cl.ecei.tohoku.ac.jp/index.php?InformationCommunicationTheory
  • Username: nlp2012
  • Password: chukougishitsu

Contact Information
• Office hours:
  • Tue, 1:00-2:30pm or by appointment
• Office:
  • Room 305 (108 after Nov), Electrical Engineering and Applied Physics Research Building No.3 (電気系3号館)
• Contact:
  • [email protected] @inuikentaro
  • [email protected] @chokkanorg

Introduction
Naoaki Okazaki
[email protected]
http://www.chokkan.org/
http://twitter.com/#!/chokkanorg
#nlptohoku
http://www.chokkan.org/lectures/2012nlp/p/01.pdf

Natural Language Processing (NLP)
• Giving computers the ability to process human language
• As old as the idea of computers themselves!
• Implementations and implications of the exciting idea
• The long-awaited dream (that has not come true yet)
[Figure: Doraemon, C-3PO (Star Wars), Atom (Astro Boy)]

What needs to be done to understand language as humans do?
Part I: Knowledge (disciplines)

Lexical semantics (語彙意味論)
How much Chinese silk was exported to Western Europe by the end of the 18th century?
• Meaning of words
[Figure: compass labeled N, W, E, S]

Compositional semantics (合成意味論)
How much Chinese silk was exported to Western Europe by the end of the 18th century?
• Meaning of constituents
[Figure: timeline from 1700 to 1800 with the labels "The 18th Century" and "the end of"]

Compositional??? (with adjectives)
• white towel
• white wine
• former girlfriend
• black hole

Morphology (形態論)
How much Chinese silk was exported to Western Europe by the end of the 18th century?
Study of word formation (breaking words down into morphemes)
• Inflection (屈折)
  • is – was – being – been
  • export – exports – exporting – exported – exported
• Derivation (派生)
  • China – Chinese
  • West – Western

Syntax (統語論,文法)
Principles and rules for constructing phrases and sentences
• Part-of-speech (POS): Lecture #3
  • Categorization of words, e.g., nouns, verbs, adjectives, adverbs
• Constituency: Lectures #4 and #5
  • Grouping words that may behave as a single unit or phrase
  • e.g., noun phrase, verb phrase, prepositional phrase
• Grammatical relations: Lecture #5
  • Relationship between words/constituents

Syntactic tagging and parsing
• Assign a structure to an input sentence (example from Nivre and Kübler, 2006)
• POS tagging: Economic/JJ news/NN had/VBD little/JJ effect/NN on/IN financial/JJ markets/NNS ./PU
• Constituent parsing: [Figure: phrase-structure tree with node labels S, VP, NP, PP, PU]
• Dependency parsing: [Figure: dependency arcs labeled sbj, obj, nmod, pc, p]

Semantic role (意味役割)
• How much Chinese silk was exported to Western Europe by the end of the 18th century?
  • "by the end of the 18th century": TEMPORAL
  [Figure: timeline from 1700 to 1800, "The 18th Century"]
• How much Chinese silk was exported to Western Europe by southern merchants?
  • "by southern merchants": AGENT

Coreference (共参照)
U: Where is The Green Hornet playing in Mountain View?
S: The Green Hornet is playing at the Century 16 theatre.
U: When is it playing there?
S: It’s playing at 2pm, 5pm, and 8pm.
U: I’d like 1 adult and 2 children for the first show. How much would that cost?
• What does “it” refer to?
• What does “the first show” refer to?
• What does “that” refer to?
• We can guess these easily!

Coreference (共参照)
• How words like “that” or pronouns like “it” refer to previous parts of the discourse

Pragmatics (語用論)
Actions that speakers intend by their use of text
• Bob: Are you coming to the party?
• Jane: I’m afraid I can’t.
• Bob: Are you coming to the party?
• Jane: You know, I’m really busy.
• Bob: Could you pass me the sugar?
• Jane: Yes. Here you are.

Discourse (談話)
Coherent structured groups of text
http://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf

Various knowledge about languages
• Morphology (形態論): meaningful components within words
• Syntax (文法): structural relationships between words
• Semantics (意味論): meanings of words, phrases, sentences
• Discourse (談話): relationships across/beyond different sentences or statements; contextual processing
• Pragmatics (語用論): relationship of meaning to the goals and intentions of speakers; how we use languages to communicate
• World knowledge (世界知識): facts of the world; common sense

What needs to be done to understand language as humans do?
Part II: Ambiguity

Ambiguity
• We may build multiple, alternative linguistic structures and interpretations for a single input
  • I made her duck (see more examples later)
• Disambiguation (or resolution): deciding which linguistic/semantic structure/interpretation is the most appropriate (in the context)

Part-of-speech tagging and ambiguity
Time flies like an arrow .
• NN VBZ IN DT NN . (光陰矢のごとし: "Time passes as swiftly as an arrow")
• VB NNS IN DT NN . (ハエの速度を矢のように測定せよ: "Measure the speed of flies as you would an arrow's")
• NN NNS VBP DT NN . (時蠅は矢を好む: "Time-flies are fond of an arrow")

Attachment ambiguity (1/3)
• I saw the girl on the hill with a telescope.
• I saw the girl on the hill with a telescope.

Attachment ambiguity (2/3)
• I saw the girl on the hill with a telescope.
• I saw the girl on the hill with a telescope.

Attachment ambiguity (3/3)
• I saw the girl on the hill with a telescope.
• I saw the girl on the hill with a telescope.
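
To make the attachment ambiguity concrete, the sketch below counts how many distinct parse trees a toy grammar assigns to the example sentence. The grammar, the lexicon, and the function name `count_parses` are hypothetical and chosen only for illustration; the counting uses a CKY-style chart, which the slides do not name. Under these assumptions, each extra prepositional phrase multiplies the number of readings.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (illustration only).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to a noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"], "a": ["Det"],
    "girl": ["N"], "hill": ["N"], "telescope": ["N"],
    "on": ["P"], "with": ["P"],
}


def count_parses(tokens, root="S"):
    """Count the distinct trees rooted in `root` that cover all tokens."""
    n = len(tokens)
    # chart[(i, j)][label] = number of trees with that label over tokens[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        for label in LEXICON.get(token, []):
            chart[(i, i + 1)][label] += 1
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    combos = chart[(i, k)][left] * chart[(k, j)][right]
                    if combos:
                        chart[(i, j)][parent] += combos
    return chart[(0, n)][root]


print(count_parses("I saw the girl".split()))                               # 1
print(count_parses("I saw the girl on the hill".split()))                   # 2
print(count_parses("I saw the girl on the hill with a telescope".split()))  # 5
```

With this toy grammar the full sentence has five readings: "on the hill" can modify the seeing or the girl, and "with a telescope" can modify the seeing, the girl, or the hill, subject to the attachments not crossing.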

Coordination ambiguity
• Put [[the insects in the box] and [the bowl on the table]]
• Put the insects in [[the box] and [the bowl on the table]]

Semantic ambiguity
• Syntactic structure is insufficient to represent the meaning
• Distinction between syntax and semantics
  • Colorless green ideas sleep furiously (Chomsky, 1957)
• Opposite
  • John bought a book from Mary vs Mary sold a book to John
• Lexical ambiguity
  • I went to the bank… (of the river) or (to get some money)
• Quantifier
  • Every man loves a woman

The state-of-the-art of Natural Language Processing

Commercial world
• A lot of exciting stuff going on…

Machine translation (Google)

Watson (IBM)
• Question answering system built on IBM’s DeepQA technology
• On 14-16 February 2011, Watson beat two human competitors on Jeopardy!: the biggest all-time money winner and the record holder for the longest championship streak
• Hardware
  • 2880 processor cores (3.5 GHz POWER7 eight-core processors)
  • 16 TB RAM in total
• Software
  • Written in Java and C++
  • Using the Apache Hadoop framework for distributed computing
• Data
  • 200M pages (about 1M books) of structured and unstructured content
  • Consuming 4 TB of disk storage
  • Encyclopedias, dictionaries, thesauri, newswire articles, literary works
http://en.wikipedia.org/wiki/Watson_(computer)

Jeopardy!
• American quiz show featuring
  • history, literature, the arts, pop culture, science, sports, geography, wordplay, etc.
• Six categories are announced, each with five trivia clues
• A correct response adds the dollar value
• An incorrect response or a failure to respond within a five-second time limit deducts the dollar value
http://en.wikipedia.org/wiki/Jeopardy!

Final Jeopardy! and the Future of Watson
• Watch the video (08:58):
  • http://www.youtube.com/watch?v=Wq0XnBYC3nQ

Science behind an answer
• Watch the very nice video (06:42):
  • http://www.youtube.com/watch?v=DywO4zksfXw

Science behind an answer
• Step 1: Question analysis
  • What type of question is being asked?
  • What is the question asking for?
• Step 2: Hypothesis generation
  • Search millions of documents for possible answers
• Step 3: Hypothesis and evidence scoring
  • Collect positive and negative evidence for each answer
  • Score the evidence based on everything from source material reliability to whether times and locations appear correct
  • Evidence scoring is parallelized across the possible answers
• Step 4: Final merging and ranking
  • Learn the importance of each piece of evidence by playing practice games
  • Yield the final ranking of possible answers
  • Decide whether Watson answers the question or not based on its confidence

A shame (of NLP)
• Japanese translation of the book “Einstein: His Life and Universe,” published on 23 June 2011
• Chapter 13 was translated by computers, not by humans!
• How this happened: http://www.amazon.co.jp/review/R29GQAF5DUOAEW/ref=cm_cr_rdp_perm
• A very rare incident: a machine-translated (MT) book was published
• A revised version was published on 17 Aug 2011

Imagine the original sentence
• ボルンの妻のヘートヴィヒに最大限にしてください。(そのヘートヴィヒは,彼の家族に関する彼の処理,今や説教された頃,彼が「自分がそのかなり不幸な回答に駆り立てられるのを許容していないべきでない」と自由に彼に叱った)。以上は,彼が目立つべきであり,彼女が言ったのを「科学の人里離れている寺」に尊敬します。
• Max Born's wife, Hedwig, who had freely scolded Einstein about his treatment of his family, now lectured, “[You should] not have allowed yourself to be goaded into that rather unfortunate reply.” He should show more respect, she said, for “the secluded temple of science.” (P286)

Passing exams for the University of Tokyo

Writing short science fiction

Goal of this course
• Overview the issues and technologies for natural language understanding
  • What is possible/easy? What is impossible/difficult?
  • Why is this achieved or not achieved by the current technology?
• Provide fundamental theories and techniques for natural language processing
  • Some techniques are useful for other research fields
• Exercise programming with real NLP tasks
  • You will be an experienced engineer!

Course plan
1. 4 Oct: Introduction
2. 11 Oct: Classification
  • Spam filtering, linear classifier, feature extraction, perceptron, logistic regression, evaluation (precision, recall, F1)
3. 18 Oct: Part-of-speech tagging
4. 25 Oct: Syntactic parsing
5. 1 Nov: Statistical parsing