Analyzing a Japanese Reading Text as a Vocabulary Learning Resource by Lexical Profiling and Indices

Tatsuhiko Matsushita (松下達彦)
PhD candidate, Victoria University of Wellington
[email protected]
The First Extensive Reading World Congress, 4 September, Kyoto Sangyo University

Motive
• How can we control the vocabulary of a reading text to maximize the vocabulary learning effect?
– Too easy: few words to learn
– Too many unknown words: no learning or inference

Goals
• To show methods for assessing a (Japanese) reading text as a vocabulary learning resource by exploiting lexical profiling and indices

Conclusion = Main Points
The simplest way to rewrite a reading text (of 2,000 words or fewer) into a better resource for vocabulary learning:
I. Delete the one-timers (or the words occurring fewer times than the set level) at the lowest frequency level in the text, or
II. Make them occur more often in the text by adding occurrences or by replacing other words with the one-timer.
The index (LEPIX) figure will then improve.
• These methods make it possible to predict and compare the efficiency of second language vocabulary learning with a reading text.

Similar Previous Ideas and Attempts
• Nation & Deweerdt (2001)
• Ghadirian (2002)
• Cobb (2007)
*No integrated index is proposed in these previous studies.

Lexical Profiling
• Basically the same idea as Lexical Frequency Profiling (LFP) (Laufer, 1994): "the percentage of words ... at different vocabulary frequency levels" (p. 23)

The Baseword Lists for Lexical Profiling
• VDRJ: Vocabulary Database for Reading Japanese (Matsushita, 2010; 2011) http://www.geocities.jp/tatsum2003/
– All words are ranked by Usage Coefficient (Juilland & Chang-Rodrigues, 1964): U = Frequency × Dispersion
– Three types of word rankings:
  • For General Learners
  • For International Students (the ranking used for this study)
  • For General Written Japanese
• Japanese Character Frequency List (Matsushita, unpublished)
– Created from the same corpus (BCCWJ) as the VDRJ
*When analyzing Japanese texts, it is necessary to set a certain level of known characters (kanji) as well as of known vocabulary.

Assumptions
I. Required Level of Text Coverage: The words assumed known to the reader must cover the text up to a certain level (e.g. Hu & Nation, 2000).
II. Minimum Occurrences of Target Words: Among the words assumed unknown, those which occur at least a certain number of times can be the learning target words (e.g. Waring & Takaki, 2003).
III. More Types of Target Words: A text in which more types of target words occur is a better vocabulary learning resource.
IV. Density of Target Words (%): A text in which the target words occur at a higher ratio is a better vocabulary learning resource.

Methods
Main software: AntWordProfiler Ver. 1.200W (Anthony, 2009)
I. Identify the lexical level of the text by lexical profiling, setting the threshold level of (assumed) known words. In this study the levels are:
A) 98% for an extensive reading text: Lexical Level of the Text (LLT98)
B) 95% for an instructional material: Lexical Level of the Text (LLT95) (Hu & Nation, 2000)
II. Identify the target words by setting their minimum number of occurrences. (6-10 occurrences are required to learn a word incidentally through reading (e.g. Waring & Takaki, 2003); however, a word is not learned from reading a single short text.)
A) Twice or more for an extensive reading text (the set number of occurrences will depend on the text length)
B) Twice for a short instructional material
III. Count T, the number of types of the target words.
IV. Calculate (W*100)/N, where W is the number of tokens of the target words and N is the total number of tokens in the text.
The Lexical Learning Possibility Index for a Reading Text (LEPIX) is then obtained by simply multiplying the factors of III and IV:
LEPIX = T*(W*100)/N = (T*W*100)/N
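A minimal code sketch of steps III-IV, assuming the target-word counts have already been obtained by lexical profiling. The function names are illustrative assumptions, not the author's tool; the example call uses the figures reported later for the modified sample text (N = 339, W95 = 19, T95 = 8).

```python
# Minimal sketch (illustrative, not the original implementation): compute the
# density of target words and LEPIX from counts produced by lexical profiling.

def target_word_density(target_tokens: int, text_tokens: int) -> float:
    """Density of target words (%) = (W * 100) / N."""
    return target_tokens * 100 / text_tokens

def lepix(target_types: int, target_tokens: int, text_tokens: int) -> float:
    """LEPIX = (T * W * 100) / N, i.e. the number of target-word types (T)
    multiplied by the density of target words (W * 100 / N)."""
    return target_types * target_tokens * 100 / text_tokens

# Figures reported below for the modified sample text: N = 339, W95 = 19, T95 = 8
print(round(target_word_density(19, 339), 1))  # 5.6 (%)
print(round(lepix(8, 19, 339), 1))             # 44.8
```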
Sample Text (original)
人知のシミュレーションが人工知能だとすれば、コンピュータのなかに「知をあつかうメカニズム」を作り込まなければならない。
ところでコンピュータとは、要するに〈記号処理マシン〉である。だからこの場合の〈知〉とは、「記号で表された知」ということになる。記号といっても色々あるが、人工知能が得意なのは、いわゆる言語記号である。たとえば、「今は五月だ」「五月は春だ」「楓の葉は、春と夏には緑色、秋には赤色である」などというのがその守備範囲ということになる。
ところでこういった例は、少しばかり興ざめではなかろうか? というのは、〈知〉とは、単なる知識の断片ではなく、それらを包括し、横断しながら世界に光を当てていく精神のダイナミズムのように思えるからである。〈知〉はイマジネーションの能力を持たなければならない。さらに〈知〉は、スポーツのような身体の所作にうめこまれている、明言化されない暗黙知の領域をもカバーしなければならない。それこそが、知の知たるゆえんではないだろうか?
残念ながら、現在の人工知能技術は、この期待に応えるすべを知らない。それはいまだに、図像さえ自由自在には扱えないのである。英語や日本語などの〈自然言語〉を操作するだけでも四苦八苦なのである。
(出典: 西垣 通『秘術としてのAI思考』)

Sample Text (modified)
人間の頭脳を模倣して作ったものが人工知能だとすれば、コンピュータの中に「知をあつかうメカニズム」をていねいに作っていかなければならない。しかしそこへの道はまだ程遠い。
コンピュータとは、要するに〈記号処理のメカニズム〉である。だからこの場合の知とは、「記号で表された知」ということになる。記号といってもいろいろあるが、人工知能が得意なのは、いわゆる言語記号である。例えば、「今は五月だ」「五月は春だ」「カエデの葉は、春と夏には緑、秋には赤である」などという人工言語的表現は処理しやすいのである。
しかし、こういった例は、少しばかりつまらないのではないだろうか? というのは、知とは、一つ一つの知識がバラバラに存在するのではなく、それらを一つにまとめたり、横断したりしながら、世界に光を当てていく精神の力強い働きのように思えるからである。知は想像力を持たなければならない。さらに知は、スポーツのような身体の動きの中にある、はっきりとした言葉にならない知の領域もカバーしなければならない。カエデといえば私たちが紅葉を見て感じる気持ちまで横断的にカバーしなければならないのだ。それこそが、知を知として成り立たせているものではないだろうか。
残念ながら、現在の人工知能技術は、この期待に応えるすべを知らない。人間の頭脳の模倣にはまだ程遠いレベルだ。英語や日本語などの〈自然言語〉を操作するだけでも非常に苦労しているのである。

Treatment for Low Frequency Words
(The original slide gives a word-by-word table of the low-frequency words in the two texts, listed by VDRJ level from IS_05K to IS_21K+: 記号, マシン, メカニズム, 横断, 緑色, 断片, 自在, 知, 紅葉, 頭脳, 包括, 暗黙, 楓, 模倣, 知能, 程遠い, 守備, シミュレーション, 埋め込む, 明言, 赤色, 所作, 図像, 八苦, 四苦, ダイナミズム, イマジネーション, 人知, 作り込む, 由縁, 興醒め. For each word it shows the frequency and cumulative text coverage in the original and in the modified text, and the treatment applied. In the modified text most of the one-timers at the higher levels were deleted, while a small number of words were made to occur twice or more.)

Explanation of Treatment
A: Changed from assumed known words to target words due to the change in the Lexical Level of the Text
B: Changed from non-target words to target words by adding occurrences to a one-timer
C: Target words newly added by replacing original sentences with new expressions
*Check the level of the characters (kanji) as well, and avoid low-frequency ones.
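A rough sketch of the profiling step behind this treatment, under the assumption that every running word has already been mapped to a VDRJ frequency-level label such as "IS_05K" (the part that AntWordProfiler and the baseword lists provide). The sketch computes cumulative text coverage by level, reads off LLT95/LLT98, and lists the one-timers beyond the assumed-known band, i.e. the candidates for deletion or repetition. All names and the boundary handling are illustrative.

```python
# Illustrative sketch, not the original tool: cumulative coverage, LLT and one-timers.
from collections import Counter

def profile(tokens):
    """tokens: list of (word, level) pairs for every running word in the text,
    where the level is a VDRJ band label such as "IS_05K"."""
    n = len(tokens)
    freq = Counter(word for word, _ in tokens)
    level_of = dict(tokens)                      # each word type belongs to one level
    per_level = Counter(level for _, level in tokens)

    coverage, cum = {}, 0
    for level in sorted(per_level):              # "IS_01K" < "IS_02K" < ... (string sort)
        cum += per_level[level]
        coverage[level] = 100 * cum / n

    def llt(threshold):
        # Lexical Level of the Text: the lowest level at which cumulative
        # coverage reaches the threshold (95 or 98 in this study)
        return next(lvl for lvl in sorted(coverage) if coverage[lvl] >= threshold)

    llt95, llt98 = llt(95), llt(98)
    one_timers = [w for w, f in freq.items()
                  if f == 1 and level_of[w] > llt95]  # one-timers beyond the known band
    return coverage, llt95, llt98, one_timers
```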
Comparison between the Original and the Modified Texts

Item                                                           Original Text   Modified Text
Text Length (= Total Number of Tokens) (N)                           275             339
Total Number of Types                                                118             130
Number of Tokens over 95% Text Coverage                               14              19
Number of Types over 95% Text Coverage                                14               8
95% Text Coverage Level = Lexical Level of the Text (LLT95)          10K             05K
Minimum Occurrences of Target Words over 95% Text Coverage             2               2
Number of Target Tokens over 95% Text Coverage (W95)                   0              19
Number of Target Types over 95% Text Coverage (T95)                    0               8
Density of Target Words (%) (W95*100/N)                              0.0             5.6
Average Occurrences of Target Words (W95/T95)                        0.0             2.4
LEPIX95 ((T95*W95*100)/N)                                            0.0            44.8
Number of Tokens over 98% Text Coverage                                6               7
Number of Types over 98% Text Coverage                                 6               3
98% Text Coverage Level = Lexical Level of the Text (LLT98)          20K             08K
Minimum Occurrences of Target Words over 98% Text Coverage             2               2
Number of Target Tokens over 98% Text Coverage (W98)                   0               7
Number of Target Types over 98% Text Coverage (T98)                    0               3
Density of Target Words (%) (W98*100/N)                              0.0             2.1
Average Occurrences of Target Words (W98/T98)                        0.0            2.33
LEPIX98 ((T98*W98*100)/N)                                            0.0             6.2

For Learning Domain-Specific Words
I. The target domain is set first.
II. The domain-specific words included in the text are identified by checking them against a list of domain-specific words.
III. The levels of the identified domain-specific words are checked by lexical profiling to see how many unknown domain-specific words the text contains.
IV. The indices are calculated.

Indices, Text Coverage and Numbers of Tokens and Types for the Whole Text and the Target Words (Technical Words)

Item                                                                   #6-1        #6-2
Text Length (= Total Number of Tokens) (N)                             1193        2823
Total Number of Types                                                   250         690
Target Domain                                                     Economics   Economics
Number of Tokens over 95% Text Coverage                                  60         142
Number of Types over 95% Text Coverage                                   24          87
95% Text Coverage Level = Lexical Level of the Text (LLT95)             04K         08K
Number of Technical Word Tokens over 95% Text Coverage                   25          35
Number of Technical Word Types over 95% Text Coverage                    10          15
Number of Technical Target Word Tokens over 95% Text Coverage (W95t)     22          27
Number of Technical Target Word Types over 95% Text Coverage (T95t)       7           7
Density of Technical Target Words (%) (W95t*100/N)                     1.84        0.96
Average Occurrences of Technical Target Words (W95t/T95t)              3.14        3.86
LEPIX95t ((T95t*W95t*100)/N)                                           12.9         6.7
Number of Tokens over 98% Text Coverage                                  12          52
Number of Types over 98% Text Coverage                                    8          37
98% Text Coverage Level = Lexical Level of the Text (LLT98)             09K         12K
Number of Technical Word Tokens over 98% Text Coverage                    7           9
Number of Technical Word Types over 98% Text Coverage                     4           6
Number of Technical Target Word Tokens over 98% Text Coverage (W98t)      5           5
Number of Technical Target Word Types over 98% Text Coverage (T98t)       2           2
Density of Technical Target Words (%) (W98t*100/N)                     0.42        0.18
Average Occurrences of Technical Target Words (W98t/T98t)              2.50        2.50
LEPIX98t ((T98t*W98t*100)/N)                                           0.84        0.35
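A sketch of the domain-specific variant of the calculation. The helper names and the domain word list are assumptions for illustration, not taken from the study: target words are kept only if they also appear in a list of domain-specific (technical) words, and W95t, T95t, density and LEPIX95t are then computed from those words alone.

```python
# Illustrative sketch: LEPIX restricted to domain-specific (technical) target words.
from collections import Counter

def technical_indices(target_tokens, domain_words, n, min_occurrences=2):
    """target_tokens: target-word tokens beyond the 95% coverage threshold.
    domain_words: a set of technical words for the chosen domain (e.g. economics).
    n: total number of tokens in the text (N)."""
    counts = Counter(t for t in target_tokens if t in domain_words)
    counts = {w: f for w, f in counts.items() if f >= min_occurrences}
    w_t = sum(counts.values())            # W95t: technical target-word tokens
    t_t = len(counts)                     # T95t: technical target-word types
    return {
        "W95t": w_t,
        "T95t": t_t,
        "density_%": w_t * 100 / n,       # density of technical target words
        "LEPIX95t": t_t * w_t * 100 / n,
    }

# With the figures reported for text #6-1 (N = 1193, W95t = 22, T95t = 7),
# LEPIX95t = 7 * 22 * 100 / 1193, about 12.9, as in the table above.
```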
More Examples of Analysis
Indices, Text Coverage and Numbers of Tokens and Types for the Whole Text and the Target Words
* Data from passages in a textbook which are mostly authentic but slightly modified for advanced learners of Japanese
** Minimum Occurrences of Target Words over 95%/98% Text Coverage = 2

Over 95% text coverage (N = text length in tokens; W95/T95 = target tokens/types over 95% coverage; Dens. = density of target words (%); Avg. = average occurrences of target words; LEPIX95 = (T95*W95*100)/N):

Text   N       Types   Tok>95  Typ>95  LLT95  W95   T95   Dens.  Avg.  LEPIX95
#5-1   504     226     26      24      07K    4     2     0.8    2.0   1.6
#4-3   616     246     31      19      08K    14    2     2.3    7.0   4.5
#3-1   959     358     50      37      08K    20    7     2.1    2.9   14.6
#4-1   1055    296     53      43      04K    17    7     1.6    2.4   11.3
#8-2   1092    282     64      39      05K    36    11    3.3    3.3   36.3
#6-1   1193    250     60      24      04K    47    11    3.9    4.3   43.3
#1-3   1210    335     61      37      06K    33    9     2.7    3.7   24.5
#2-2   1317    409     68      48      09K    33    13    2.5    2.5   32.6
#9-2   1416    383     71      53      06K    27    9     1.9    3.0   17.2
#1-2   1418    406     71      51      07K    25    5     1.8    5.0   8.8
#8-1   1455    400     80      32      10K    58    10    4.0    5.8   39.9
#1-1   1592    540     83      73      06K    15    5     0.9    3.0   4.7
#3-3   1717    530     86      82      07K    7     3     0.4    2.3   1.2
#2-1   1785    560     91      62      08K    33    4     1.8    8.3   7.4
#2-3   1959    528     99      67      07K    45    13    2.3    3.5   29.9
#3-2   2035    621     102     84      07K    27    9     1.3    3.0   11.9
#9-1   2241    533     113     77      06K    55    19    2.5    2.9   46.6
#4-2   2342    535     118     82      05K    50    14    2.1    3.6   29.9
#7-1   2361    555     120     68      06K    68    16    2.9    4.3   46.1
#6-2   2823    690     142     87      08K    81    26    2.9    3.1   74.6
#7-2   2964    628     149     71      08K    99    21    3.3    4.7   70.1
#5-2   3754    849     227     138     06K    115   26    3.1    4.4   79.6
#9-3   4344    923     297     138     07K    184   25    4.2    7.4   105.9
M      1832.7  481.9   98.3    62.4    -      47.5  11.6  2.4    4.0   32.3
SD     928.0   180.8   60.1    31.0    -      40.1  7.3   1.0    1.6   27.6

Over 98% text coverage (W98/T98 = target tokens/types over 98% coverage; LEPIX98 = (T98*W98*100)/N):

Text   Tok>98  Typ>98  LLT98  W98    T98   Dens.  Avg.  LEPIX98
#5-1   11      10      13K    2      1     0.4    2.0   0.4
#4-3   13      13      10K    0      0     0.0    0.0   0.0
#3-1   21      16      18K    8      3     0.8    2.7   2.5
#4-1   22      15      11K    11     4     1.0    2.8   4.2
#8-2   24      17      11K    12     5     1.1    2.4   5.5
#6-1   24      9       09K    19     4     1.6    4.8   6.4
#1-3   25      12      18K    15     2     1.2    7.5   2.5
#2-2   27      21      18K    10     4     0.8    2.5   3.0
#9-2   30      22      11K    11     3     0.8    3.7   2.3
#1-2   30      21      12K    11     2     0.8    5.5   1.6
#8-1   30      16      15K    18     4     1.2    4.5   4.9
#1-1   32      29      12K    4      1     0.3    4.0   0.3
#3-3   36      34      13K    3      1     0.2    3.0   0.2
#2-1   36      35      13K    2      1     0.1    2.0   0.1
#2-3   40      36      11K    8      4     0.4    2.0   1.6
#3-2   41      39      11K    4      2     0.2    2.0   0.4
#9-1   45      34      11K    20     9     0.9    2.2   8.0
#4-2   48      40      09K    14     6     0.6    2.3   3.6
#7-1   50      25      16K    30     5     1.3    6.0   6.4
#6-2   57      38      12K    31     12    1.1    2.6   13.2
#7-2   61      33      13K    39     11    1.3    3.5   14.5
#5-2   76      66      11K    18     8     0.5    2.3   3.8
#9-3   87      72      12K    23     8     0.5    2.9   4.2
M      37.7    28.4    -      13.61  4.35  0.7    3.2   3.9
SD     18.5    15.9    -      9.92   3.23  0.4    1.6   3.8
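As a check on how the index values in the table are obtained: for text #6-2, LEPIX95 = (T95*W95*100)/N = (26*81*100)/2823 ≈ 74.6, and LEPIX98 = (12*31*100)/2823 ≈ 13.2, which match the figures listed above.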
How does the text length affect LEPIX?

[Figure: Total Number of Tokens/Types and LEPIX (95% and 98%) from Texts with 500-4000 Running Words]
[Figure: Total Number of Tokens/Types and LEPIX (95% and 98%) from Texts with 1000-2000 Running Words]

Correlation Coefficients (Pearson) between Total Number of Tokens/Types and LEPIX, from Texts with 500-4000 Running Words

               Total Token   Total Type   LEPIX95     LEPIX98
Total Token     1             .956 ***     .837 ***    .438 *
Total Type      .956 ***      1            .685 ***    .271 n.s.
LEPIX95         .837 ***      .685 ***     1           .691 ***
LEPIX98         .438 *        .271 n.s.    .691 ***    1
***: p < .001, *: p < .05, n.s.: not significant

Correlation Coefficients (Pearson) between Total Number of Tokens/Types and LEPIX, from Texts with 1000-2000 Running Words

               Total Token   Total Type   LEPIX95     LEPIX98
Total Token     1             .858 ***     .222 n.s.   .061 n.s.
Total Type      .858 ***      1           -.195 n.s.  -.372 n.s.
LEPIX95         .222 n.s.    -.195 n.s.    1           .877 ***
LEPIX98         .061 n.s.    -.372 n.s.    .877 ***    1
***: p < .001, n.s.: not significant

→ LEPIX cannot be compared across texts when one text is more than double the length of the other.
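A minimal sketch of how such Pearson correlations could be recomputed from the per-text figures. The lists below contain only the first four texts' values from the table (an illustrative truncation), so the printed value will not equal the reported coefficient.

```python
# Illustrative sketch: Pearson correlation between text length and LEPIX95.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

tokens  = [504, 616, 959, 1055]    # Text Length (N) for texts #5-1, #4-3, #3-1, #4-1
lepix95 = [1.6, 4.5, 14.6, 11.3]   # LEPIX95 for the same texts
print(round(pearson(tokens, lepix95), 3))
```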
Remaining Issues
• If a repeatedly used essential key word in the text is at the lowest frequency level, the index does not work well. There are solutions for this, but they make the procedure and calculation more complicated.
• The minimum occurrence level of target words will differ according to the text length. Twice is enough for a short instructional text, but the appropriate level for a longer extensive reading text is not yet clear.
• The indices still need to be validated through empirical study.

Conclusion = Main Points
The simplest way to rewrite a reading text (of 2,000 words or fewer) into a better resource for vocabulary learning:
I. Delete the one-timers (or the words occurring fewer times than the set level) at the lowest frequency level in the text, or
II. Make them occur more often in the text by adding occurrences or by replacing other words with the one-timer.
The index (LEPIX) figure will then improve.
• These methods make it possible to predict and compare the efficiency of second language vocabulary learning with a reading text.

References
Anthony, L. (2009). AntWordProfiler 1.200w [Computer software]. Retrieved from http://www.antlab.sci.waseda.ac.jp/software.html
Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language Learning and Technology, 11(3), 38-63.
Ghadirian, S. (2002). Providing controlled exposure to target vocabulary through the screening and arranging of texts. Language Learning and Technology, 6(1), 147-164.
Hu, M., & Nation, I. S. P. (2000). Unknown vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1), 403-430.
Juilland, A., & Chang-Rodrigues, E. (1964). Frequency dictionary of Spanish words. London: Mouton & Co.
Laufer, B. (1994). The lexical profile of second language writing: Does it change over time? RELC Journal, 25(2), 21-33.
Matsushita, T. (松下達彦). (2010). 日本語を読むために必要な語彙とは？―書籍とインターネットの大規模コーパスに基づく語彙リストの作成― [What words are essential to read Japanese? Making word lists from a large corpus of books and internet forum sites]. 2010年度日本語教育学会春季大会予稿集 [Proceedings of the Conference of the Society for Teaching Japanese as a Foreign Language, Spring 2010], 335-336.
Matsushita, T. (松下達彦). (2011). 日本語を読むための語彙データベース (The Vocabulary Database for Reading Japanese) Ver. 1.01. Retrieved from http://www.geocities.jp/tatsum2003/
Nation, I. S. P., & Deweerdt, J. (2001). A defence of simplification. Prospect, 16(3), 55-67.
Waring, R., & Takaki, M. (2003). At what rate do learners learn and retain new vocabulary from reading a graded reader? Reading in a Foreign Language, 15(2), 130-163.