Automatic conversion of a Japanese text corpus into f-structure
OYA, Masanori (大矢 政徳)
National Center for Language Technology, School of Computing, Dublin City University
NCLT Seminar series, 29/11/06

1. Overview
– Japanese grammar
– Kyoto Text Corpus (KTC)
– Converting KTC into dependency trees
– Converting KTC into f-structure
– Problems
– Evaluation
– Summary

2. Japanese grammar
Syntax
– Writing system
– SOV as the basic word order
– Use of particles for grammatical functions
– Tense, aspect and mood are specified by verbal or adjectival morphology
– “bunsetsu” (sentential units)
– Ellipses of core arguments
– Topicalization
– Two types of relative clauses
– Case particles derived from verbs
– Adverbial nouns
– Coordination

2. Japanese grammar
Writing system: three different types of scripts
– Chinese characters (1,945 and more)
    Nouns (may also be written in Hiragana or Katakana)
    Stems of verbs and adjectives
– Hiragana (104)
    Inflections of verbs and adjectives
    Particles
    Words that have no Chinese counterparts
– Katakana (124)
    Nouns borrowed from foreign languages
    Technical and scientific names
    Onomatopoeia
– No spaces are given between words

2. 
Japanese grammar
The chart of Hiragana / The chart of Katakana
[Charts: each Hiragana and Katakana character paired with its romanization, e.g. あ a, か ka, きゃ kya; ア a, カ ka, キャ kya.]

2. 
Japanese grammar
SOV as the basic word order; scrambling is prevalent
Use of particles for grammatical functions
Example:
    太郎はダブリンの大学に行った。
    Taro-wa dabulin-no daigaku-ni it-ta
    Taro-TOP Dublin-in college-to go-PST
    “Taro went to a college in Dublin.”
– “-wa”, “-ga”, “-wo” and “-ni” are used for core grammatical functions.
– Other particles are used for adjuncts (postpositional phrases or complementizers) (Tsujimura 2006).
– The particle “-ni” is ambiguous: it can be used as the OBL case marker or as a postposition for temporal or locative adverbials (a semantic distinction is possible).

2. Japanese grammar
Tense, aspect and mood are specified by verbal or adjectival morphology
Example:
    太郎はダブリンの大学に行っている。
    Taro-wa dabulin-no daigaku-ni it-teiru
    Taro-TOP Dublin-in college-to go-PROG.PRES
    “Taro is going to a college in Dublin.”

    太郎はダブリンの大学に行ったのだろうか。
    Taro-wa dabulin-no daigaku-ni it-ta-nodarou-ka
    Taro-TOP Dublin-in college-to go-PST-AUX-INT
    “(I wonder) whether Taro went to a college in Dublin.”
    etc.

2. Japanese grammar
“Bunsetsu”, or syntactic units
– One bunsetsu = a content word + a particle or inflection ≈ Chinese characters + hiragana or katakana
Example:
    太郎はダブリンの大学に行っている。
    Taro-wa    dabulin-no    daigaku-ni    it-teiru
    Unit 0     Unit 1        Unit 2        Unit 3
• Spaces represent bunsetsu boundaries.
• Hyphens represent morphological boundaries within a bunsetsu.

2. 
Japanese grammar
Ellipses of core arguments
– Pro-drop: contextually evident units are absent from the sentence
– Gender, person and number of the subject are not specified by verbal or adjectival morphology
Example:
    ダブリンの大学に行った。
    dabulin-no daigaku-ni it-ta
    Dublin-in college-OBL go-PST
    “I/We/You/He/She/It/They/(Someone in the context) went to a college in Dublin.”
– Personal pronouns are also available, but they are not equivalent to personal pronouns in English (e.g., variations of 1st singular personal pronouns: ‘ore’, ‘atashi’, ‘boku’, ‘watashi’, ‘watakushi’, etc.; variations of 2nd singular personal pronouns: ‘kimi’, ‘anata’, ‘anta’, ‘omae’, etc.)

2. Japanese grammar
Topicalization
– Topicalized units have the particle “wa”
– Non-topicalized units are the focus of the sentence
Example:
    ダブリンの大学には太郎が行った。
    dabulin-no daigaku-ni-wa Taro-ga it-ta
    Dublin-in college-OBL-TOP Taro-NOM go-PST
    “To a college in Dublin, Taro went.” or “It is Taro who went to a college in Dublin.”

2. Japanese grammar
Relative clauses
– If a clause ends with a verb in a sentence-ending form and comes before a noun, then the clause is a relative clause: Japanese has no relative pronouns.
Example:
    私が行った大学
    watashi-ga itta daigaku
    1sg-NOM go-PST college
    “the college I went to”

2. 
Japanese grammar
Two types of “relative clauses”: “inner” relative clauses (true relative clauses) and “outer” relative clauses (appositions) (Teramura 1991)
Example:
– 私が答えを見つけた証拠
    watashi-ga kotae-wo mitsuketa shoko
    1sg-NOM answer-ACC find-PST evidence
    “The evidence that I found out the answer” (“outer”)
– 私が見つけた証拠
    watashi-ga mitsuketa shoko
    1sg-NOM find-PST evidence
    “The evidence that I found out ∅” (“inner”: ∅ = evidence)
    “The evidence that I found out PRO” (“outer”: PRO ≠ evidence; something evident in the context)
– If one of the core arguments is in ellipsis, then it is difficult to distinguish a true relative clause from an apposition.

2. Japanese grammar
Particles derived from verbs:
– Some case particles are derived from verbs; case particles of this type have a fixed meaning (Masuoka and Takubo 1992)
Example: ついて tsuite “about” (same form as the adverbial form of the verb つく “attach”)
    私は計算言語学について話した。
    Watashi-wa keisangengogaku-ni-tsuite hanashi-ta
    I-TOP computational linguistics-OBL-about talk-PST
    “I talked about computational linguistics.”

2. Japanese grammar
Adverbial nouns
– They function as the head of an adverbial phrase with a complement (Masuoka and Takubo 1992)
Example:
    ダブリンの大学に通っている時、津波が日本を襲った。
    Dabulin-no daigaku-ni kayotteiru toki, tsunami-ga nihon-wo osot-ta.
    Dublin-in college-OBL go-PROG time, tsunami-NOM Japan-ACC strike-PST
    “When I was studying at a college in Dublin, a tsunami struck Japan.”
– It is also difficult to distinguish the complements in these cases from relative clauses; neither syntactic nor morphological clues are available.

2. Japanese grammar
Coordination
– The first coordinated bunsetsu has the particle “to” (but not necessarily), and it is dependent on the next coordinated bunsetsu.
Example:
    ダブリンの大学に通っている時、地震と津波が日本を襲った。
    Dabulin-no daigaku-ni kayotteiru toki, jishin-to tsunami-ga nihon-wo osot-ta.
    Dublin-in college-OBL go-PROG time, earthquake-AND tsunami-NOM Japan-ACC strike-PST
    “When I was studying at a college in Dublin, an earthquake and a tsunami struck Japan.”
– Only the last coordinated bunsetsu has the particle which specifies its grammatical function.

3. Kyoto Text Corpus (KTC)
– An automatically parsed text corpus of a newspaper (Mainichi Shimbun)
– All articles from the 1st to the 17th of January, 1995 (19,687 sentences, 518,687 tokens) and the editorials from January to December, 1995 (18,708 sentences, 453,995 tokens)
– Developed by Sadao Kurohashi and Makoto Nagao at the University of Kyoto, using JUMAN and KNP

3. Kyoto Text Corpus (KTC)
– All the texts are automatically annotated with morphological tags by JUMAN (Kurohashi and Nagao 1994) (“juman” means 100,000)
– The output of JUMAN is parsed by KNP (Kurohashi and Nagao 1994) based on the dependency among “bunsetsu”, and corrected manually
– No syntactic CFG category tags are annotated
– Valency of verbal predicates is not annotated

3. Kyoto Text Corpus (KTC)
JUMAN: morphological analyzer for Japanese based on bigram information
– Least-cost path method (Kurohashi and Kawahara 1992)
    Costs are assigned to each morpheme and to each pair of adjacent morphemes in a sentence.
    The lower the morpheme frequency, or the lower the frequency of a pair of morphemes, the higher the cost.
    If a sentence has several possible analyses, JUMAN sums up the costs and takes the least-cost analysis as the most plausible analysis for the sentence.
– Accuracy: around 99.0% (comparison of automatic analysis and manually corrected analysis of 10,000 sentences)

3. 
Kyoto Text Corpus (KTC)
The example of the output of JUMAN:
    太郎は大学に行った。 “Taro went to a college.”
    Taro wa daigaku ni itta.
    Taro TOP college OBL went

    #S-ID: 950101001-001
    太郎 tarou * Noun Name * *
    は wa * Particle AdverbialParticle * *
    大学 daigaku * Noun NormalNoun * *
    に ni * Particle CaseParticle * *
    行った itta iku Verb * ConsonantVerb Past
    。 * mark period * *
    EOS

3. Kyoto Text Corpus (KTC)
KNP: dependency structure analyzer based on “bunsetsu”
– KNP converts the output of JUMAN into bunsetsu strings.
– Accuracy: 90% (comparison of automatic analysis and manually corrected analysis of 10,000 sentences) (Kurohashi and Nagao 1998)

3. Kyoto Text Corpus (KTC)
The output of KNP for the same sentence. Each line beginning with “*” opens a bunsetsu (unit) and gives its index and the index of the unit it depends on (-1 for the root):
    太郎は大学に行った。 “Taro went to a college.”
    Taro-wa daigaku-ni it-ta.
    Taro TOP college OBL went

    #S-ID: 950101001-001
    * 0 2D                                      (Unit 0)
    太郎 tarou * Noun Name * *
    は wa * Particle AdverbialParticle * *
    * 1 2D                                      (Unit 1)
    大学 daigaku * Noun NormalNoun * *
    に ni * Particle CaseParticle * *
    * 2 -1D                                     (Unit 2)
    行った itta iku Verb * ConsonantVerb Past
    。 * mark period * *
    EOS

3. Kyoto Text Corpus (KTC)
The same sentence with scrambled word order:
    大学に太郎は行った。 “Taro went to a college.”
    daigaku ni Taro wa itta.
    college OBL Taro TOP went

    #S-ID: 950101001-001
    * 0 2D                                      (Unit 0)
    大学 daigaku * Noun NormalNoun * *
    に ni * Particle CaseParticle * *
    * 1 2D                                      (Unit 1)
    太郎 tarou * Noun Name * *
    は wa * Particle AdverbialParticle * *
    * 2 -1D                                     (Unit 2)
    行った itta iku Verb * ConsonantVerb Past
    。 * mark period * *
    EOS

4. Converting KTC into dependency trees
Motivation:
– LFG-based automatic grammar induction for Japanese; GramLab: Treebank based Acquisition of Multilingual Resources (Cahill et al. 2002, etc.)
Related work:
– Japanese XLE at Fuji Xerox (Masuichi et al. 2006, etc.)
– PCFG-based automatic grammar induction from Japanese corpus (Tokunaga et al. 2005, etc.)
– Case frame induction from Japanese corpus (Kurohashi et al. 2006, etc.)

4. Converting KTC into dependency trees
Procedure:
    Text corpus → Dependency trees → F-structures
– At least one syntactic category is annotated on each “bunsetsu” in a sentence.
– All “bunsetsu” in a sentence are integrated into a dependency tree of the sentence.

4. Converting KTC into dependency trees
Syntactic categories annotated on the units of the KNP output for 太郎は大学に行った。 (“Taro went to college.”):
    Unit 0 (太郎 は)    TopP    Topic
    Unit 1 (大学 に)    NP      OBL
    Unit 2 (行った 。)  V

4. 
Converting KTC into dependency trees
[Figure: dependency tree of the example sentence, with the nodes Taro, wa, daigaku, ni, itta and 。]

5. Converting KTC into f-structures
Motivation:
– Are syntactic categories necessary for Japanese?
    Word order is (relatively) free.
    The type (or absence) of particles in each unit specifies its grammatical function (e.g., if a noun has the particle “wo”, then it is an object).
    Verbal morphology specifies the grammatical function of each clause (but not always unambiguously).

5. Converting KTC into f-structures
Generating f-structure equations directly from the corpus
    Text corpus → Dependency trees → F-structures
    Text corpus → F-structures
– F-structure equations are directly generated from each unit.
– All the units are unified into the f-structure of the sentence according to the dependency.

5. Converting KTC into f-structures
    太郎は大学に行った。 “Taro went to college.”
    Taro wa daigaku ni itta.
    Taro TOP college OBL went

    #S-ID: 950101001-001
    * 0 2D                                      (Unit 0 = f0, Topic)
    太郎 tarou * Noun Name * *
    は wa * Particle AdverbialParticle * *
    * 1 2D                                      (Unit 1 = f1, OBL)
    大学 daigaku * Noun NormalNoun * *
    に ni * Particle CaseParticle * *
    * 2 -1D                                     (Unit 2 = f2)
    行った itta iku Verb * ConsonantVerb Past
    。 * mark period * *
    EOS

Functional equations from the corpus:
    F2:pred = '行く',
    F2:tns = 'pst',
    F2:stmt = 'decl',
    F2:style = 'plain',
    F0:pred = '太郎',
    F0:prtav = 'は',
    F0 elm F2:topic,
    F1:pred = '大学',
    F1:case = 'に',
    F2:obl = F1.

F-structure from the functional equations:
    pred  : '行く'
    tns   : pst
    stmt  : decl
    style : plain
    topic : 1 : pred  : '太郎'
                prtav : 'は'
    obl   : pred : '大学'
            case : 'に'

5. Problems
This “Generating f-structure equations directly from the corpus” method does not always work well.
– Core argument ellipses
– Two types of relative clauses
– Particles derived from verbs
– Adverbial nouns
– Coordination
The context among units must be taken into consideration to make the generation more accurate.

5. Problems
Ellipses of core arguments
– Contextually evident units are absent from the sentence
– Gender, person and number of the subject are not specified by verbal or adjectival morphology
Example:
    ダブリンの大学に行った。
    dabulin-no daigaku-ni it-ta
    Dublin-in college-OBL go-PST
    “He/She/They went to a college in Dublin.”

5. 
Problems
Core argument ellipses
– KTC does not annotate missing elements.
– No equations for missing elements can be generated from KTC.
– For an f-structure with ellipses, “PRO” must be added to make the f-structure complete.

5. Problems
Core argument ellipses
– If a predicate has no subject in the clause, then an equation for the subject is added.
– If a transitive verb has no object, then an equation for the object must be added …
– However, KTC does not annotate the valency of verbal predicates, hence it is impossible to tell which verbs are transitive on the basis of the annotated information alone.
– Case frames are required to detect missing objects of transitive verbs.

5. Problems
Two types of “relative clauses”: “inner” relative clauses (true relative clauses) and “outer” relative clauses (appositions) (Teramura 1991)
– If one of the core arguments is in ellipsis, then it is difficult to distinguish an “outer” relative clause from an “inner” relative clause.
– Features in one bunsetsu are not enough to distinguish them.
– A probabilistic model for analysing them (Abekawa and Okumura 2005) employs the co-occurrence probability of head nouns and verbal predicates in “outer” relative clauses.
– This model is expected to be applicable to the present method (in future).

5. Problems
Case particles derived from verbs
– Case particles of this type are analyzed by KNP as verbs, not as case particles.
– Bunsetsus with them are analyzed as sentential adjuncts, not as postpositional adjuncts or as complements (in the case of “という”).
– The equations must be revised properly.

5. Problems
Adverbial nouns
– It is difficult to distinguish the complements of adverbial nouns from relative clauses; neither syntactic nor morphological clues are available.
– Features in one bunsetsu are not enough to distinguish between them.
– If a clause is dependent on an adverbial noun and it is analyzed as a relative clause, then the equation of the clause must be replaced by that of a complement.

5. Problems
Coordination
– Only the last coordinated bunsetsu has the particle which specifies its grammatical function; the other coordinated bunsetsus cannot be analyzed as having appropriate grammatical functions.
– The last coordinated bunsetsu does not have any feature within it that marks it as a coordinate; the bunsetsu context must be taken into consideration in order to convert it properly into f-structure equations.
– Dependency among coordinated bunsetsus must also be reanalyzed.

5. Problems
Coordination
Dependency among coordinated bunsetsus must be reanalyzed.
Example:
    jishin-to  tsunami-ga    is reanalyzed as    [jishin-to tsunami]-ga
– The coordinated bunsetsus are the elements of a new unit, which constitutes a new bunsetsu with the case particle (“ga” in this example).

5. Problems
Among these problems, the following still remain in the method:
– Object ellipses
– Distinguishing two types of relative clauses
– Particles derived from verbs

6. Evaluation of the method
– 200 sentences were randomly selected from KTC.
– F-structures of these sentences were automatically generated by the method.
– These f-structures were manually corrected, and used as the gold standard.
– The automatically generated f-structures of these 200 sentences were compared with the gold standard.

6. 
Evaluation of the method
Pred-only
    GFs        PRECISION(%)   RECALL(%)   F-SCORE(%)
    adj            80.60         96.43        87.80
    cj            100.00         96.80        98.37
    comp           74.19         58.97        65.71
    obj            98.73         82.54        89.91
    obl            85.62         91.91        88.65
    padj           98.45         91.81        95.01
    rel            70.86         96.40        81.68
    sadj           82.26         65.38        72.86
    subj           93.17         92.29        92.73
    topic          98.68         95.51        97.07
    (average)      88.26         86.80        86.98

7. Future work …
– The method of generating f-structure equations directly from the dependency-based corpus of Japanese needs more improvement.
– The result can be applied to improve the parsing result of KNP.
– Using Japanese f-structures in MT

References
Abekawa, T and M. Okumura. 2005. Corpus-Based Analysis of Japanese Relative Clause Constructions. IJCNLP 2005, pp. 46-57.
Cahill, A, M. McCarthy, J. van Genabith and A. Way. 2002. Automatic Annotation of the Penn-Treebank with LFG F-Structure Information. LREC 2002 workshop on Linguistic Knowledge Acquisition and Representation, pp. 8-15.
Kurohashi, S and D. Kawahara. 1992. JUMAN: user's manual. ms.
Kurohashi, S and M. Nagao. 1994. A syntactic analysis method of long Japanese sentences based on the detection of conjunctive structures. Computational Linguistics, 20(4), pp. 507-534.
Kurohashi, S and M. Nagao. 1998. Building a Japanese Parsed Corpus while Improving the Parsing System. Proceedings of the 1st International Conference on Language Resources and Evaluation, pp. 719-724.
Kurohashi, S, D. Kawahara and T. Shibata. 2005. Morphological and syntactic analyses using JUMAN/KNP. ms.
Masuoka, T and Y. Takubo. 1992. Kiso nihongo bunpo. Kuroshio Publication.
Noguchi, M, H. Ichikawa, T. Hashimoto and T. Takenobu. 2006. A new approach to syntactic annotation. Proceedings of 5th International Conference on Language Resources and Evaluation (LREC2006). pp.6
Noro, T, C. Koike, T. Hashimoto, T. Tokunaga and H. Tanaka. 2005. Evaluation of a Japanese CFG Derived from a Syntactically Annotated Corpus with respect to Dependency Measures. The 5th Workshop on Asian Language Resources. pp.9
Ohkuma, T, H. Masuichi and T. Yoshioka. 2006. Disambiguation of Japanese Focus Particles by using Lexical Functional Grammar. Journal of Natural Language Processing, 13(1), pp. 27-52.
Shibatani, M. 1990. The Languages of Japan. Cambridge University Press.
Teramura, H. 1991. Nihongo no shintakusu to imi. Kuroshio Publication.
Tsujimura, N. 2006. An Introduction to Japanese Linguistics (2nd ed.). Blackwell Publications.

Thank you very much!
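
The direct conversion described under “Converting KTC into f-structures” (equations generated from each unit, then linked according to the dependency) can be illustrated with a small sketch. This is not the implementation used in the talk; it is a minimal Python illustration under stated assumptions. The names parse_knp, equations and CASE_TO_GF are hypothetical, and the particle-to-function table only encodes the cases mentioned on the slides ("wo" as object, "ni" as OBL) plus "ga" as subject, which is an assumption. The sample input copies the romanized fields of the slide, so the verb's pred comes out as 'iku' where the slide shows '行く', and the stmt and style features are omitted because the slides do not say how they are derived.

```python
from dataclasses import dataclass, field

# KNP-style output copied from the slides (header spacing normalized).
KNP_SAMPLE = """\
#S-ID: 950101001-001
* 0 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 1 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。 * mark period * *
EOS
"""

# Assumed mapping from case particles to grammatical functions (hypothetical).
CASE_TO_GF = {"を": "obj", "に": "obl", "が": "subj"}


@dataclass
class Bunsetsu:
    index: int
    head: int  # index of the unit this one depends on; -1 for the root
    morphemes: list = field(default_factory=list)


def parse_knp(text):
    """Group morpheme lines under the bunsetsu header that precedes them."""
    units = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#") or parts[0] == "EOS":
            continue
        if parts[0] == "*" and len(parts) == 3:
            # Header such as "* 0 2D": unit index, then head index plus a type letter.
            units.append(Bunsetsu(int(parts[1]), int(parts[2][:-1])))
            continue
        if len(parts) != 7 or not units:
            # The period line on the slide has only six fields, so it is skipped.
            continue
        surface, _reading, base, pos, subpos, _ctype, cform = parts
        units[-1].morphemes.append((surface, base, pos, subpos, cform))
    return units


def equations(units):
    """Emit functional equations unit by unit, in the notation of the slides."""
    eqs = []
    for u in units:
        if not u.morphemes:
            continue
        fid = f"F{u.index}"
        surface, base, pos, _subpos, cform = u.morphemes[0]
        pred = base if base != "*" else surface
        eqs.append(f"{fid}:pred = '{pred}'")
        if pos == "Verb" and cform == "Past":
            eqs.append(f"{fid}:tns = 'pst'")
        for p_surface, _b, p_pos, p_sub, _c in u.morphemes[1:]:
            if p_pos != "Particle":
                continue
            if p_sub == "AdverbialParticle":
                eqs.append(f"{fid}:prtav = '{p_surface}'")
                if p_surface == "は" and u.head >= 0:
                    eqs.append(f"{fid} elm F{u.head}:topic")
            elif p_sub == "CaseParticle":
                eqs.append(f"{fid}:case = '{p_surface}'")
                gf = CASE_TO_GF.get(p_surface)
                if gf and u.head >= 0:
                    eqs.append(f"F{u.head}:{gf} = {fid}")
    return eqs


if __name__ == "__main__":
    for eq in equations(parse_knp(KNP_SAMPLE)):
        print(eq)
```

Because each equation is derived from a single unit and its head index, the sketch also shows why the cases listed under “Problems” fail: an elided argument has no unit to generate an equation from, and coordination or adverbial nouns need information from neighbouring units.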