Is the vocabulary learning burden of Japanese really heavier than that of English?
日本語の語彙学習負担は本当に英語よりも大きいか?

Tatsuhiko Matsushita
PhD candidate, Victoria University of Wellington
17th Biennial Conference of the Japanese Studies Association of Australia (JSAA)

Contents 本発表の内容
1. Motives for the study 研究動機
2. Goals and research questions 目的・研究課題
3. Method 方法
4. Results 結果
5. Discussion 考察
6. Conclusion まとめ
• References 引用文献

1. Motives for the study 研究動機 (1)
• Is the burden of learning Japanese vocabulary heavy? (Tamamura, 1984)
• Text coverage studies テキストカバー率の研究
  Text coverage = coverage of word tokens (延べ語数)
• The top (= most frequent) 1000 words cover 60% of the text in Japanese magazines (NINJAL: The National Institute for Japanese Language, 1962; 2006)
• The top 1000 words cover over 70% of the text in English (e.g., Carroll, Davies & Richman, 1971)
• To reach 95%/98% text coverage, 9500/20000 words (lexemes 語彙素) are required in Japanese, while only 5000/9000 word families are required in English (Matsushita, 2011; Nation, 2006)
Note: Word family (English) ≒ Lexeme (Japanese)?

1. Motives for the study 研究動機 (2)
Word family (English) ≒ Lexeme 語彙素 (Japanese)?
• “Word family” as adopted by Nation (2006)
  • Level 6 of Bauer & Nation (1993), which includes derived words with frequent affixes and ‘regular but infrequent affixes’
    e.g. Members of abbreviate: abbreviate, abbreviates, abbreviated, abbreviating, abbreviation, abbreviations
• Lexeme as defined by UniDic (Den et al., 2009)
  Members of the short unit 短単位 of a lexeme
  e.g. 読む-読み, やはり-やっぱり, 足-脚, 受け入れる cf. 短縮/する
• Why is the text coverage in Japanese and English so different?
• Possible explanation: many groups of words with different word origins 語種 but similar meanings (e.g., Akimoto, 2002)
  e.g., 旅館, 宿屋, ホテル (all roughly 'inn' or 'hotel')

1. Motives for the study 研究動機 (3)
• Questions about the explanation
  • Method: magazine texts? Coverage not including function words?
  • English also has synonyms with different word origins, e.g.
liberty-freedom, spirit-soul
• Nature of Japanese: many transparent compounds composed of Kanji
  e.g. 春季 /shunki/ 'spring season': low frequency word (ranked 28587 in Matsushita (2011))
       春 /haru/ 'spring': high frequency word (1019, ibid.)
       季節 /kisetsu/ 'season': high frequency word (1955, ibid.)
  It is not difficult to infer the meaning of 春季 if the meanings of 春 and 季節 are already known (春季 is transparent).
• For such words, learners normally only need to understand the meanings of the components and the word formation rules, either implicitly or explicitly. cf. Harlan (2011)

1. Motives for the study 研究動機 (4)
cf. Harlan (2011), by the comedian ‘Pakkun’ (パックン), translated from the Japanese:
"Once you have learned a certain number of Kanji, building up your vocabulary actually becomes very easy. Once you have learned the basic set, you can often apply what you know to the rest; if you learn 100, learning the next 100 gets even faster. If you learn 500, the next 500 or 1000 go two or three times faster."
"Once you know Kanji, you can work out the meaning of a newly heard word by analysing it in terms of its Kanji. In 冷蔵庫 (refrigerator), 冷 is 冷やす (to chill), 蔵 is くら (storehouse), and 庫 is the 庫 of 車庫 (garage), which gives an image of some kind of storage place. Put the three characters together and you can more or less see the meaning."

2. Goals and Research Questions 目的・研究課題
Goals:
• To estimate the true learning burden of Japanese vocabulary
• To consider a more efficient order for learning Japanese vocabulary
Research Questions:
1. How many ‘characters’ do learners need to learn to attain a certain level of text coverage of ‘words’?
   Note: this is not simply the text coverage by characters (cf. Chikamatsu et al., 2000). Knowing the meaning of the single character 節 is NOT enough to understand the meaning of 季節.
2. Do the characters which provide that level of text coverage (in Q.1) cover all the high frequency words? If not, what Kanji are further required to cover those words? (Is there any discrepancy between the word frequencies and the character frequencies?)

3. Method 方法 (1) - 1
1) Calculate character frequencies in BCCWJ (the Balanced Corpus of Contemporary Written Japanese 現代日本語書き言葉均衡コーパス, 2009 monitor version: NINJAL, 2009)
2) Give a learning order ranking to each character
   I. Rank the types of character as Alphabet, Hiragana, Katakana and Kanji/signs
   II. Rank Kanji by frequency
3) List all words in orthographic forms (書字形) in BCCWJ
4) Separate each word into characters
5) Give the learning order ranking to each character
6) Calculate the text coverage by filtering the words by the learning order ranking of their characters

3. Method 方法 (1) - 2
BCCWJ 2009 monitor version (NINJAL, 2009)
• Book corpus (approx. 28 million running words) and
• Internet forum site corpus (approx. 5 million running words)
• Unit of counting a ‘word’ used for this study:
  • the short unit (短単位) defined by UniDic (Den et al., 2009)
  • the orthographic form (書字形)
    i.e. 書く / 書か / かく or 足 / 脚 are counted as different orthographic forms but as one lexeme (語彙素)

3. Method 方法 (2)
For RQ. 2,
• Identify the relationship between Kanji frequency levels and the Kanji levels of the former JLPT (Japanese Language Proficiency Test 旧日本語能力検定試験) to check whether the JLPT Kanji are ranked properly
• Identify the words which are not covered by the high frequency Kanji and check what Kanji are used in those words

4. Results 結果 (1) - 1
RQ. 1: How many ‘characters’ do learners need to learn to attain a certain level of text coverage of ‘words’?
• 64% of the words (half of them are function words): covered by the phonographic characters alone (Hiragana, Katakana and alphabet)
• 82%: by phonographic characters + top 300 Kanji
• Learning each 100 Kanji within the top 1000 Kanji means potential understanding of 6000-7000 types 異なり語 of orthographic forms (3000-4000 lexemes)

4. Results 結果 (1) - 2
• 95-96%: by phonographic characters 表音文字 & top 1000-1100 Kanji
  → threshold level for reading comprehension? (Hu & Nation, 2000; Komori et al., 2004)
• 98%: by phonographic characters & top 1500 Kanji

4.
Results
Number/Ratio of Words (orthographic forms) and Text Coverage by Character Types (+Level of Kanji) in Japanese
日本語の文字タイプ(+漢字レベル)別の語の数/割合とテキストカバー率
(*) A: Alphabet, H: Hiragana, K: Katakana. "Increase" columns give the text coverage gained by each row, i.e. by learning 100 more Kanji in the Kanji rows.

| Type of character (+level of Kanji) | Number of words (orthographic forms) | Cumulative number of words (orthographic forms) | Increase in text coverage by the words | Cumulative text coverage by the words | Increase in text coverage by the characters | Cumulative text coverage by the characters |
|---|---|---|---|---|---|---|
| Only Alphabet | 17712 | 17712 | 0.7% | 0.7% | 1.1% | 1.1% |
| Only Hiragana (*) | 20272 | 37984 | 59.7% | 60.4% | 51.9% | 52.9% |
| Mixture of A & H | 1 | 37985 | 0.0% | 60.4% | 0.0% | 52.9% |
| Only Katakana (*) | 49349 | 87334 | 3.3% | 63.6% | 7.3% | 60.2% |
| Mixture of A/H/K | 625 | 87959 | 0.0% | 63.6% | 0.0% | 60.2% |
| Ranking 1-100 Kanji + A, H & K | 7187 | 95146 | 10.1% | 73.8% | 9.7% | 70.0% |
| Ranking 101-200 Kanji + A, H & K | 7360 | 102506 | 5.2% | 79.0% | 5.8% | 75.8% |
| Ranking 201-300 Kanji + A, H & K | 7318 | 109894 | 3.6% | 82.6% | 4.1% | 79.9% |
| Ranking 301-400 Kanji + A, H & K | 6636 | 116530 | 2.8% | 85.4% | 3.3% | 83.1% |
| Ranking 401-500 Kanji + A, H & K | 6830 | 123360 | 2.6% | 88.0% | 2.9% | 86.0% |
| Ranking 501-600 Kanji + A, H & K | 6820 | 130180 | 2.0% | 90.0% | 2.4% | 88.4% |
| Ranking 601-700 Kanji + A, H & K | 6585 | 136765 | 1.6% | 91.6% | 1.8% | 90.2% |
| Ranking 701-800 Kanji + A, H & K | 6393 | 143158 | 1.4% | 93.0% | 1.6% | 91.8% |
| Ranking 801-900 Kanji + A, H & K | 6186 | 149344 | 1.1% | 94.1% | 1.4% | 93.2% |
| Ranking 901-1000 Kanji + A, H & K | 5427 | 154771 | 1.0% | 95.1% | 1.2% | 94.4% |
| Ranking 1001-1100 Kanji + A, H & K | 4703 | 159474 | 0.8% | 96.0% | 1.0% | 95.3% |
| Ranking 1101-1200 Kanji + A, H & K | 4262 | 163736 | 0.7% | 96.6% | 0.8% | 96.1% |
| Ranking 1201-1300 Kanji + A, H & K | 4222 | 167958 | 0.6% | 97.2% | 0.7% | 96.8% |
| Ranking 1301-1400 Kanji + A, H & K | 3691 | 171649 | 0.5% | 97.7% | 0.5% | 97.4% |
| Ranking 1401-1500 Kanji + A, H & K | 3541 | 175190 | 0.4% | 98.1% | 0.4% | 97.8% |
| Ranking 1501-1600 Kanji + A, H & K | 2909 | 178099 | 0.3% | 98.4% | 0.4% | 98.2% |
| Ranking 1601-1700 Kanji + A, H & K | 2793 | 180892 | 0.3% | 98.6% | 0.3% | 98.5% |
| Ranking 1701-1800 Kanji + A, H & K | 2554 | 183446 | 0.2% | 98.9% | 0.3% | 98.7% |
| Ranking 1801-1900 Kanji + A, H & K | 2164 | 185610 | 0.2% | 99.0% | 0.2% | 98.9% |
| Ranking 1901-2000 Kanji + A, H & K | 1993 | 187603 | 0.2% | 99.2% | 0.2% | 99.1% |
| Ranking 2001-2100 Kanji + A, H & K | 1933 | 189536 | 0.1% | 99.3% | 0.1% | 99.3% |
| Ranking 2101-2200 Kanji + A, H & K | 1495 | 191031 | 0.1% | 99.4% | 0.1% | 99.4% |
| Ranking 2201-2300 Kanji + A, H & K | 1427 | 192458 | 0.1% | 99.5% | 0.1% | 99.5% |
| Ranking 2301-6323 Kanji + A, H & K (all) | 15373 | 207831 | 0.5% | 100.0% | 0.5% | 100.0% |

4. Results 結果
[Figure: Text coverage of Japanese words by Kanji level, per level and cumulative 日本語の単語のテキストカバー率(漢字レベル別/累積). Horizontal axis: Kanji ranking bands from 1-100 to 2201-2300, each + A, H & K. Series: cumulative text coverage by the words 語の累積テキストカバー率 (axis 50%-100%) and increase in text coverage by the words from learning 100 more Kanji 漢字100字増加による語のテキストカバー率増加分 (axis 0%-10%).]

4. Results 結果 (2) - 1
RQ.
2: Do the characters which provide the text coverage in Q.1 cover all the high frequency words? If not, what Kanji are further required to cover those words? (Is there any discrepancy between the word frequencies and the character frequencies?)
i.e. Can low frequency Kanji be a barrier to learning high frequency words?

Number of Kanji at Different Frequency Levels and the Former JLPT Levels

| Kanji frequency level | Former JLPT Level 4 | Level 3 | Level 2 | Level 1 | Subtotal | Others | Total |
|---|---|---|---|---|---|---|---|
| 1-100 | 39 | 41 | 20 | 0 | 100 | 0 | 100 |
| 101-300 | 27 | 53 | 110 | 9 | 199 | 1 | 200 |
| 301-1000 | 14 | 63 | 437 | 178 | 692 | 8 | 700 |
| 1001-2000 | 0 | 8 | 187 | 580 | 775 | 225 | 1000 |
| Others | 0 | 0 | 1 | 159 | 160 | 4163 | 4323 |
| Total | 80 | 165 | 755 | 926 | 1926 | 4397 | 6323 |

4. Results 結果 (2) - 2
• There is only a narrow gap between the Kanji frequency levels and the former JLPT Kanji levels
• Among the top 1000 Kanji, more than 800 belong to the former JLPT levels 4, 3 and 2
• More than 96% of the word tokens (延べ語数) in general texts will be covered by 1200 Kanji, consisting of:
  • All Kanji at the former JLPT levels 4, 3 and 2 (Total: 1000)
  • + the top 200 Kanji at the former JLPT level 1

4. Results 結果 (2) - 3
Top 196 Kanji at the former JLPT level 1 & others (級外, outside the JLPT levels)
• Within the top 300: 保義公価基条応態郎 & 々
• Within the top 1000: 張士氏視素護離証企異評提姿 井統振吉策影紀為宮江派僕従系衛皇展案松隊施 我整及織環響修遺宗昭撃株節源養項興故裁沢端 障志激弁益嫌佐司眼密載己債訳症標健納請授挙 恵貴徳推描崎抗属盛監傷創徴街善援衆康模敵津 拠継隠称尾聖鮮厳攻妙融丈筋帝秘敷驚射壊刑壁 染功訴討幕扱脱範契弾診詳房避酸倉充典繰儀至 削博瞬仮縁憲択就聴握詩秀柄浜滅拡惑踏華闘微 雄維隣如審誘賀郷霊釈黙魔携掲遣艦剣致 & 誰頃 藤俺之岡伊阪

4. Results 結果 (2) - 4
• 95% text coverage requires
  • Top 9600 lexemes / top 20749 orthographic forms (types 異なり語数)
  • Top 1000 Kanji + Hiragana, Katakana + alphabet
• Within the top 9600 lexemes, 1700 lexemes are estimated to require Kanji beyond the top 1000
  e.g. 比較、記憶、批判、距離、指摘、希望、分析、 韓国、基礎、誕生、監督、雰囲気、卒業、洗濯
• Many of them are often written in Hiragana/Katakana
  e.g. 即ち、駄目、奴、凄い、頑張る、挨拶、嘘、 煙草、匂い、只、是非、無駄、喧嘩、噂、伺う

5.
Discussion 考察 1)
• For general texts, learners can attain more than 70% comprehension with 95-96% coverage (for English, see Hu & Nation, 2000; for Japanese, see Komori, Mikuni & Kondo, 2004)
• Learning Kanji in order of frequency is a much more efficient way to gain higher text coverage (Zipf’s Law: Zipf, 1949)
• The top 300-500 Kanji seem especially essential
• The top 1000-1500 Kanji might be enough for general purposes (with occasional use of a dictionary)
• It may also mean that learning Kanji without reaching the threshold level is of little use…

5. Discussion 考察 2)
• To attain 95% coverage, 1000 Kanji are required; however, some important words are not covered by the top 1000 Kanji
• In other words, some low frequency Kanji are used in high frequency words
• Many of those Kanji have low productivity, that is, they are rarely used in other words
  e.g. 雰囲気、卒業、洗濯
• To cover the top 9600 words (lexemes), a further 200-500 Kanji are estimated to be required

5. Discussion 考察 3)
• Certainly, the burden of learning characters is heavier in Japanese than in most other languages
• However, the burden of learning Japanese vocabulary may be rather lighter once the learner knows:
  • the 1000-1500 characters
  • word formation/compounding rules of Kanji
  • metaphors of Kanji compounds
    e.g. 入門: entering a gate → first step, to start training
• This holds despite the fact that the text coverage is lower than in English at all word frequency levels

5. Discussion 考察 4)
In other words, it is possible that
• the number of ‘units of learning’ for Japanese vocabulary is not as large as generally perceived
• It will also be important for students/teachers to learn/teach
  • the association between the different readings (typically On-reading and Kun-reading) of each Kanji, to reduce the burden of learning Japanese vocabulary
    e.g. 入門 /nyuRmoN/ = 入る /hairu/ + 門 /moN/
    入る (freq. ranking: 117) is more likely to be learned earlier than 入門 (freq.
ranking: 6369) (Matsushita, 2011)
• Without this kind of association, learners have to learn more words separately

6. Conclusion まとめ
• About 64% of BCCWJ texts are covered without Kanji (but half of those tokens are function words)
• To attain 95% coverage, 1000 Kanji are required; however, some important words are not covered by the top 1000 Kanji
• To cover those words, a few hundred more Kanji will be required
• Text coverage in Japanese is generally lower than in English, i.e. Japanese requires more words to be learned
• However, many Japanese words are composed of a limited number of Kanji; therefore, the burden of learning Japanese vocabulary may not be as heavy as expected from the text coverage studies, once the learner knows:
  • the 1000-1500 characters
  • form, meaning and compounding rules of Kanji
  • metaphors of Kanji compounds
  • association of different readings (e.g. On-reading and Kun-reading) of each Kanji

These slides will be uploaded to the site shown below.
「松下言語学習ラボ」 (Matsushita Language Learning Lab) http://www.wa.commufa.jp/~tatsum/
You can find the site on Google with the keywords 松下 (Matsushita) and 言語 (language).

References 引用文献 1)
Akimoto, M. (秋元美晴). (2002). よくわかる語彙 [Understanding Vocabulary]. Tokyo: Alc (アルク).
Bauer, L. & Nation, P. (1993). Word families. International Journal of Lexicography, 6(4), 253-279.
Carroll, J. B., Davies, P., & Richman, B. (1971). Word Frequency Book. New York: Houghton Mifflin, Boston American Heritage.
Chikamatsu, N., Yokoyama, S., Nozaki, H., Long, E., & Fukuda, S. (2000). A Japanese logographic character frequency list for cognitive science research. Behavior Research Methods, Instruments, & Computers, 32(3), 482-500.
Den, Y. (伝康晴), Yamada, A. (山田篤), Ogura, H. (小椋秀樹), Koiso, H. (小磯花絵), & Ogiso, T. (小木曽智信). (2009). UniDic, Version 1.3.11. Downloaded from http://www.tokuteicorpus.jp/dist/

References 引用文献 2)
Harlan, P. (パトリック・ハーラン). (2011). ゼロからの日本語学習と僕の好きな日本のカルチャー (Learning Japanese from zero, and the Japanese culture I like).
Cited from http://www.wochikochi.jp/topstory/2011/04/packun.php
Hu, M. H. & Nation, P. (2000). Vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1), 403-430.
Komori, K. (小森和子), Mikuni, J. (三國純子), & Kondo, A. (近藤安月子). (2004). 文章理解を促進する語彙知識の量的側面 ―既知語率の閾値探索の試み― (What percentage of known words in a text facilitates reading comprehension: a case study for exploration of the threshold of known words coverage). 日本語教育 [Teaching Japanese as a Foreign Language], 125, 83-92.

References 引用文献 3)
Matsushita, T. (松下達彦). (2011). 日本語を読むための語彙データベース (The Database for Reading Japanese). Downloaded from http://www.geocities.jp/tatsum2003/
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? Canadian Modern Language Review, 63(1), 59-82.
NINJAL: The National Institute for Japanese Language (国立国語研究所). (1962). 現代雑誌90種の用字・用語 第一分冊 総記および語彙表 (Vocabulary and Chinese characters in ninety magazines of today: (Volume I) General description & vocabulary frequency tables). Tokyo: Shuuei Shuppan (秀英出版).

References 引用文献 4)
NINJAL: The National Institute for Japanese Language (国立国語研究所). (2006). 現代雑誌200万字言語調査語彙表 (The vocabulary lists from the language survey of contemporary magazines with two million running characters). Downloaded from http://www.kokken.go.jp/katsudo/seika/goityosa/index.html
Tamamura, F. (玉村文郎). (1984). 語彙の研究と教育(上) [Vocabulary Research and Education (Part 1)]. Tokyo: The National Institute for Japanese Language (国立国語研究所).
Zipf, G. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. New York: Hafner.
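
Supplementary sketch: the coverage calculation outlined in Method (1) - 1 (steps 1 to 6) can be illustrated in Python roughly as follows. This is a minimal sketch under stated assumptions, not the procedure actually run on BCCWJ. The toy word-frequency table, the function names (is_kanji, rank_kanji, word_coverage), the use of the CJK Unified Ideographs block to detect Kanji, and the treatment of every non-Kanji character as already known are all hypothetical choices made for the example.

```python
from collections import Counter
from typing import Dict


def is_kanji(ch: str) -> bool:
    """Treat the CJK Unified Ideographs block as Kanji (a simplifying assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def rank_kanji(word_freq: Dict[str, int]) -> Dict[str, int]:
    """Rank Kanji by token-weighted frequency; rank 1 is the most frequent."""
    counts: Counter = Counter()
    for word, freq in word_freq.items():
        for ch in word:
            if is_kanji(ch):
                counts[ch] += freq
    ordered = counts.most_common()
    return {ch: rank for rank, (ch, _) in enumerate(ordered, start=1)}


def word_coverage(word_freq: Dict[str, int],
                  kanji_rank: Dict[str, int],
                  top_n: int) -> float:
    """Share of word tokens whose every Kanji is within the top_n ranks.

    Non-Kanji characters (kana, alphabet) are assumed to be known already.
    """
    total = sum(word_freq.values())
    covered = sum(
        freq
        for word, freq in word_freq.items()
        if all(not is_kanji(ch) or kanji_rank[ch] <= top_n for ch in word)
    )
    return covered / total


# Hypothetical toy data: orthographic form -> token frequency.
word_freq = {"の": 50, "する": 20, "春": 6, "季節": 3, "春季": 1}
ranks = rank_kanji(word_freq)

for n in range(len(ranks) + 1):
    print(n, "Kanji known:", word_coverage(word_freq, ranks, n))
```

In this toy data a word counts as covered only when all of its Kanji are known, which mirrors the note under Research Question 1: knowing 節 alone does not cover 季節, and 春季 stays uncovered until both 春 and 季 are within the learned set.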