Presentation slides

Is the vocabulary learning
burden of Japanese really
heavier than that of English?
日本語の語彙学習負担は
本当に英語よりも大きいか?
Tatsuhiko Matsushita
PhD candidate,
Victoria University of Wellington
17th Biennial Conference of the Japanese
Studies Association of Australia (JSAA)
Contents 本発表の内容
1. Motives for the study 研究動機
2. Goals and research questions 目的・研究課題
3. Method 方法
4. Results 結果
5. Discussion 考察
6. Conclusion まとめ
• References 引用文献
1. Motives for the study 研究動機 (1)
• Heavy burden in learning Japanese vocabulary?
(Tamamura, 1984)
• Text coverage study テキストカバー率の研究
Text coverage = Coverage of word tokens (延べ語数)
• Top (=most frequent) 1000 words cover 60% in Japanese magazines
(NINJAL: The National Institute for Japanese Language, 1962; 2006)
• Top 1000 words cover over 70% in English
(e.g., Carroll, Davies & Richman, 1971).
• To reach 95%/98% text coverage, 9500/20000 words
(lexeme 語彙素) are required in Japanese, while only
5000/9000 word families are required in English.
(Matsushita, 2011; Nation, 2006)
Note: Word family (English) ≒ Lexeme (Japanese)?
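As a rough illustration of the coverage figures above, the text coverage of the top N words can be computed from a frequency list as the share of running tokens (延べ語数) that those words account for. The Python sketch below assumes a simple word-to-count mapping; the function name `coverage_of_top_n` and the toy counts are hypothetical and are not data from the cited studies.

```python
from collections import Counter

def coverage_of_top_n(token_counts, n):
    """Share of running tokens accounted for by the n most frequent words."""
    total = sum(token_counts.values())
    if total == 0:
        return 0.0
    top = Counter(token_counts).most_common(n)
    return sum(count for _, count in top) / total

# Toy frequency list (hypothetical counts, for illustration only).
toy_counts = {"の": 50, "は": 30, "本": 10, "読む": 6, "春季": 4}

top2 = coverage_of_top_n(toy_counts, 2)   # coverage of the two most frequent words
top5 = coverage_of_top_n(toy_counts, 5)   # coverage of the whole toy list
```

With a real frequency list, the same function applied at N = 1000 would give the kind of figure quoted on this slide; the unit of counting (word family, lexeme or orthographic form) has to be fixed beforehand.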
1. Motives for the study 研究動機 (2)
Word family (English) ≒ Lexeme 語彙素 (Japanese)?
• “Word family” adopted by Nation (2006)
• Level 6 of Bauer & Nation (1993) -- including derived words with
frequent affixes and ‘regular but infrequent affixes’
e.g. Members of abbreviate : abbreviate, abbreviates, abbreviated,
abbreviating, abbreviation, abbreviations
• Lexeme defined by UniDic (Den et al., 2009)
Members of the short unit 短単位 of a lexeme
e.g. 読む-読み, やはり-やっぱり, 足-脚, 受け入れる
cf. 短縮/する
• Why is the text coverage in Japanese and English so different?
• Possible explanation: many groups of words with different
word-origins 語種 but similar meanings (e.g., Akimoto, 2002)
e.g., 旅館, 宿屋, ホテル
1. Motives for the study 研究動機 (3)
• Questions about the explanation
• Method: magazine texts? Coverage not including function words?
• English synonyms with different word-origins e.g. liberty-freedom, spirit-soul
• Nature of Japanese: many transparent compounds composed of Kanji
e.g. 春季 /shunki/: low frequency word
(Ranked at 28587 in Matsushita (2011))
春 /haru/: high frequency word (1019, ibid.)
季節 /kisetsu/: high frequency word (1955, ibid.)
→ not difficult to infer the meaning of 春季 if the meanings of 春
and 季節 are already known (春季 is transparent)
• For those words, learners normally only need to understand the
meanings of components and word formation rules, either
implicitly or explicitly. cf. Harlan (2011)
1. Motives for the study 研究動機 (4)
cf. Harlan (2011) = a comedian ‘Pakkun’ (パックン)
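"Once you have learned a certain number of Kanji, building up your
vocabulary actually becomes very easy. Once you have learned the basic
set, you can often apply what you know: after learning 100, the next 100
come even faster. After learning 500, the next 500, and then 1000, go
two or three times as fast."
"Once you know Kanji, you can work out the meaning of a newly heard word
by analysing it in Kanji. In 冷蔵庫 (refrigerator), 冷 means 'to chill',
蔵 is 'kura' (storehouse), and 庫 is the 庫 of 車庫 (garage), which gives an
image of some kind of storage space. Put those three characters together
and you can more or less see the meaning."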
2. Goals and Research Questions
目的・研究課題
Goals:
• To estimate the true learning burden of Japanese vocabulary
• To think about a more efficient order for learning Japanese vocabulary
Research Questions:
1. How many ‘characters’ do learners need to learn to attain a certain
level of text coverage of ‘words’?
Note: the aim is not simply to measure text coverage by character.
cf. Chikamatsu et al. (2000)
Knowing the meaning of the single character 節 is NOT enough to understand the meaning
of 季節.
2. Do the characters which provide that level of text coverage
(in Q.1) cover all the high frequency words? If not, what Kanji are
further required to cover the words? (Is there any discrepancy
between the word frequencies and character frequencies?)
3. Method 方法 (1) - 1
1) Calculate character frequencies in BCCWJ (the Balanced
Corpus of Contemporary Written Japanese 現代日本語書き言葉均衡コーパス,
2009 monitor version: NINJAL, 2009)
2) Give a learning order ranking to each character
I. Rank the types of character as Alphabet, Hiragana, Katakana and
Kanji/signs
II. Rank Kanji by frequency
3) List all words in orthographic forms (書字形) in BCCWJ
4) Separate each word into characters
5) Give the learning order ranking to each character
6) Calculate the text coverage by filtering the character of the
words by learning order ranking
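The six steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually run on BCCWJ: the character classes are approximated by Unicode ranges, everything that is not ASCII alphabet or kana is lumped into "Kanji/signs", and the word counts are invented toy values. The names `char_type`, `kanji_ranks` and `word_coverage` are hypothetical.

```python
from collections import Counter

def char_type(ch):
    """Classify a character; anything not alphabet or kana is 'kanji_or_sign'."""
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    return "kanji_or_sign"

def kanji_ranks(word_counts):
    """Steps 1-2: token-weighted character frequencies, then rank (1 = most frequent)."""
    freq = Counter()
    for word, count in word_counts.items():
        for ch in word:
            if char_type(ch) == "kanji_or_sign":
                freq[ch] += count
    return {ch: rank for rank, (ch, _) in enumerate(freq.most_common(), start=1)}

def word_coverage(word_counts, ranks, known_kanji):
    """Steps 3-6: a word is covered only if every Kanji in it is within the top `known_kanji`."""
    total = sum(word_counts.values())
    covered = sum(
        count
        for word, count in word_counts.items()
        # Characters without a rank (alphabet, kana) are treated as already known.
        if all(ranks.get(ch, 0) <= known_kanji for ch in word)
    )
    return covered / total

# Toy orthographic forms with invented token counts (not BCCWJ data).
toy_words = {"の": 40, "ホテル": 10, "春": 20, "季節": 15, "春季": 5, "旅館": 10}
toy_ranks = kanji_ranks(toy_words)
coverage_by_level = {n: word_coverage(toy_words, toy_ranks, n) for n in (0, 1, 2, 3, 5)}
```

Requiring every character of a word to be known is what distinguishes this from simple coverage by character: in the toy data, 季節 stays uncovered until both 季 and 節 are within the learned set.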
3. Method 方法 (1) -2
BCCWJ 2009 monitor version (NINJAL, 2009)
• Book corpus (approx. 28 million running words) and
• Internet forum site corpus (approx. 5 million running
words)
• Unit of counting a ‘word’ used for this study:
• the short form (短単位) defined by UniDic (Den et al.,
2009)
• the orthographic form (書字形)
i.e. 書く / 書か/ かく or 足 / 脚 are counted as different
orthographic forms but as one lexeme (語彙素)
3. Method 方法 (2)
For RQ. 2,
• Identify the relationship between Kanji frequency
levels & the former JLPT 旧日本語能力検定試験 Kanji
levels to check if the JLPT Kanji are ranked properly
• Identify the words which are not covered by the high
frequency Kanji and check what Kanji are used in
those words
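The second step above (finding words that the high frequency Kanji do not cover, and the Kanji they still need) could look roughly like the Python sketch below. The function name `uncovered_words` is hypothetical, and the ranks in `toy_ranks` are invented placeholders used only to make the example run; they are not the ranks from the study.

```python
def uncovered_words(words, ranks, top_n):
    """Map each word needing Kanji beyond rank `top_n` to the list of those Kanji.

    `ranks` is assumed to hold a frequency rank for every Kanji; characters
    missing from it (kana, alphabet) are treated as already known.
    """
    result = {}
    for word in words:
        extra = [ch for ch in word if ranks.get(ch, 0) > top_n]
        if extra:
            result[word] = extra
    return result

# Invented placeholder ranks (NOT the actual frequency ranks).
toy_ranks = {"比": 450, "較": 1300, "記": 200, "憶": 1250, "春": 350}
needs_more = uncovered_words(["比較", "記憶", "春", "やはり"], toy_ranks, top_n=1000)
```

Run over a real high frequency word list, the values of the returned mapping give the set of Kanji to check against the JLPT levels.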
4. Results 結果 (1) - 1
RQ. 1: How many ‘characters’ do learners need to learn
to attain a certain level of text coverage of ‘words’?
• 64% of the words (half of them are function words):
covered only by the phonographic characters
(Hiragana, Katakana and alphabet)
• 82% : by phonographic characters + top 300 Kanji
• Learning 100 more Kanji within the top 1000 Kanji means
potential understanding of a further 6000-7000 types 異なり語
of orthographic forms (3000-4000 lexemes)
4. Results 結果 (1) - 2
• 95-96%: by phonographic characters 表音文字
& top 1000-1100 Kanji
→ threshold level for reading comprehension?
(Hu & Nation, 2000; Komori et al., 2004)
• 98%: by phonographic characters & top 1500 Kanji
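The thresholds above can be read off the cumulative word coverage figures reported in the results table of these slides. The Python sketch below transcribes that column and looks up the smallest tabulated number of Kanji that reaches a target coverage; the names `CUMULATIVE_WORD_COVERAGE` and `kanji_needed` are hypothetical, and the lookup only works in steps of 100 Kanji because that is the granularity of the table.

```python
# Cumulative text coverage by words (%), keyed by the number of top-ranked Kanji
# known in addition to alphabet, Hiragana and Katakana (0 = phonographic only).
# Values transcribed from the results table in these slides.
CUMULATIVE_WORD_COVERAGE = [
    (0, 63.6), (100, 73.8), (200, 79.0), (300, 82.6), (400, 85.4),
    (500, 88.0), (600, 90.0), (700, 91.6), (800, 93.0), (900, 94.1),
    (1000, 95.1), (1100, 96.0), (1200, 96.6), (1300, 97.2), (1400, 97.7),
    (1500, 98.1), (1600, 98.4), (1700, 98.6), (1800, 98.9), (1900, 99.0),
    (2000, 99.2), (2100, 99.3), (2200, 99.4), (2300, 99.5),
]

def kanji_needed(target_percent):
    """Smallest tabulated Kanji count whose cumulative coverage reaches the target."""
    for kanji_count, coverage in CUMULATIVE_WORD_COVERAGE:
        if coverage >= target_percent:
            return kanji_count
    return None  # target lies beyond the top 2300 Kanji

print(kanji_needed(95.0), kanji_needed(98.0))  # prints: 1000 1500
```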
4. Results Number/Ratio of Words (orthographic forms) and Text Coverage by Character Types (+Level of Kanji) in Japanese
日本語の文字タイプ(+漢字レベル)別の語の数/割合とテキストカバー率
A: Alphabets, H: Hiragana, K: Katakana

| Type of Character (+Level of Kanji) 文字タイプ(+漢字レベル)(*) | Number of Words (orthographic forms) by Character Types 文字タイプ(+漢字レベル別)の語数(書字形) | Cumulative Number of Words (orthographic forms) by Character Types 文字タイプ(+漢字レベル別)の累積語数(書字形) | Increased Text Coverage by the Words Increased by Learning 100 More Kanji 漢字100字増加による語のテキストカバー率増加分 | Cumulative Text Coverage by the Words 語の累積テキストカバー率 | Increased Text Coverage by the Characters by Learning 100 More Kanji 漢字100字増加による文字のテキストカバー率 | Cumulative Text Coverage by the Characters 文字の累積テキストカバー率 |
|---|---|---|---|---|---|---|
| Only Alphabets アルファベットのみ | 17712 | 17712 | 0.7% | 0.7% | 1.1% | 1.1% |
| Only Hiragana (*) ひらがなのみ | 20272 | 37984 | 59.7% | 60.4% | 51.9% | 52.9% |
| Mixture of A & H アルファベット・ひらがな混合 | 1 | 37985 | 0.0% | 60.4% | 0.0% | 52.9% |
| Only Katakana (*) カタカナのみ | 49349 | 87334 | 3.3% | 63.6% | 7.3% | 60.2% |
| Mixture of A/H/K アルファベット・ひらがな・カタカナ混合 | 625 | 87959 | 0.0% | 63.6% | 0.0% | 60.2% |
| Ranking 1-100 Kanji +A,H & K +漢字100字 | 7187 | 95146 | 10.1% | 73.8% | 9.7% | 70.0% |
| Ranking 101-200 Kanji +A,H & K +漢字200字 | 7360 | 102506 | 5.2% | 79.0% | 5.8% | 75.8% |
| Ranking 201-300 Kanji +A,H & K +漢字300字 | 7318 | 109894 | 3.6% | 82.6% | 4.1% | 79.9% |
| Ranking 301-400 Kanji +A,H & K +漢字400字 | 6636 | 116530 | 2.8% | 85.4% | 3.3% | 83.1% |
| Ranking 401-500 Kanji +A,H & K +漢字500字 | 6830 | 123360 | 2.6% | 88.0% | 2.9% | 86.0% |
| Ranking 501-600 Kanji +A,H & K +漢字600字 | 6820 | 130180 | 2.0% | 90.0% | 2.4% | 88.4% |
| Ranking 601-700 Kanji +A,H & K +漢字700字 | 6585 | 136765 | 1.6% | 91.6% | 1.8% | 90.2% |
| Ranking 701-800 Kanji +A,H & K +漢字800字 | 6393 | 143158 | 1.4% | 93.0% | 1.6% | 91.8% |
| Ranking 801-900 Kanji +A,H & K +漢字900字 | 6186 | 149344 | 1.1% | 94.1% | 1.4% | 93.2% |
| Ranking 901-1000 Kanji +A,H & K +漢字1000字 | 5427 | 154771 | 1.0% | 95.1% | 1.2% | 94.4% |
| Ranking 1001-1100 Kanji +A,H & K +漢字1100字 | 4703 | 159474 | 0.8% | 96.0% | 1.0% | 95.3% |
| Ranking 1101-1200 Kanji +A,H & K +漢字1200字 | 4262 | 163736 | 0.7% | 96.6% | 0.8% | 96.1% |
| Ranking 1201-1300 Kanji +A,H & K +漢字1300字 | 4222 | 167958 | 0.6% | 97.2% | 0.7% | 96.8% |
| Ranking 1301-1400 Kanji +A,H & K +漢字1400字 | 3691 | 171649 | 0.5% | 97.7% | 0.5% | 97.4% |
| Ranking 1401-1500 Kanji +A,H & K +漢字1500字 | 3541 | 175190 | 0.4% | 98.1% | 0.4% | 97.8% |
| Ranking 1501-1600 Kanji +A,H & K +漢字1600字 | 2909 | 178099 | 0.3% | 98.4% | 0.4% | 98.2% |
| Ranking 1601-1700 Kanji +A,H & K +漢字1700字 | 2793 | 180892 | 0.3% | 98.6% | 0.3% | 98.5% |
| Ranking 1701-1800 Kanji +A,H & K +漢字1800字 | 2554 | 183446 | 0.2% | 98.9% | 0.3% | 98.7% |
| Ranking 1801-1900 Kanji +A,H & K +漢字1900字 | 2164 | 185610 | 0.2% | 99.0% | 0.2% | 98.9% |
| Ranking 1901-2000 Kanji +A,H & K +漢字2000字 | 1993 | 187603 | 0.2% | 99.2% | 0.2% | 99.1% |
| Ranking 2001-2100 Kanji +A,H & K +漢字2100字 | 1933 | 189536 | 0.1% | 99.3% | 0.1% | 99.3% |
| Ranking 2101-2200 Kanji +A,H & K +漢字2200字 | 1495 | 191031 | 0.1% | 99.4% | 0.1% | 99.4% |
| Ranking 2201-2300 Kanji +A,H & K +漢字2300字 | 1427 | 192458 | 0.1% | 99.5% | 0.1% | 99.5% |
| Ranking 2301-6323 Kanji +A,H & K +全部 (all) | 15373 | 207831 | 0.5% | 100.0% | 0.5% | 100.0% |
4. Results 結果
[Figure: Text coverage of Japanese words by Kanji level (per 100 Kanji / cumulative) 日本語の単語のテキストカバー率(漢字レベル別/累積). Series: Cumulative Text Coverage by the Words 語の累積テキストカバー率 (axis 50.0% to 100.0%); Increased Text Coverage by the Words Increased by Learning 100 More Kanji 漢字100字増加による語のテキストカバー率増加分 (axis 0.0% to 10.0%). Horizontal axis: Kanji ranking bands +A, H & K, from Ranking 1-100 to Ranking 2201-2300.]
4. Results 結果 (2) - 1
RQ. 2: Do the characters which provide the text
coverage in Q.1 cover all the high frequency words?
If not, what Kanji are further required to cover the
words?
(Is there any discrepancy between the word
frequencies and character frequencies?)
i.e. Can low frequency Kanji be a barrier to learning
high frequency words?
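Number of Kanji at Different Frequency Levels
and the Former JLPT Levels

| Kanji Frequency Level | Former JLPT Level 4 | Former JLPT Level 3 | Former JLPT Level 2 | Former JLPT Level 1 | Subtotal | Others | Total |
|---|---|---|---|---|---|---|---|
| 1-100 | 39 | 41 | 20 | 0 | 100 | 0 | 100 |
| 101-300 | 27 | 53 | 110 | 9 | 199 | 1 | 200 |
| 301-1000 | 14 | 63 | 437 | 178 | 692 | 8 | 700 |
| 1001-2000 | 0 | 8 | 187 | 580 | 775 | 225 | 1000 |
| Others | 0 | 0 | 1 | 159 | 160 | 4163 | 4323 |
| Total | 80 | 165 | 755 | 926 | 1926 | 4397 | 6323 |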
4. Results 結果 (2) - 2
• A narrow gap between Kanji frequency level and the
former JLPT Kanji Level
• Among the top 1000 Kanji, more than 800 Kanji are
covered by the Kanji at the former JLPT level 4, 3
and 2
• More than 96% of the word tokens (延べ語数) in
general texts will be covered by 1200 Kanji of:
• All Kanji at the former JLPT level 4, 3, and 2 (Total: 1000)
• + Top 200 Kanji at the former JLPT level 1
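The counts behind these statements can be checked with a few sums over the table "Number of Kanji at Different Frequency Levels and the Former JLPT Levels". The Python sketch below transcribes the cells of that table (the dictionary layout and variable names are hypothetical) and adds up the top 1000 Kanji by former JLPT level.

```python
# Kanji counts by frequency band and former JLPT level, transcribed from the
# table in these slides. "others" = Kanji outside the former JLPT levels.
table = {
    "1-100":     {"L4": 39, "L3": 41, "L2": 20,  "L1": 0,   "others": 0},
    "101-300":   {"L4": 27, "L3": 53, "L2": 110, "L1": 9,   "others": 1},
    "301-1000":  {"L4": 14, "L3": 63, "L2": 437, "L1": 178, "others": 8},
    "1001-2000": {"L4": 0,  "L3": 8,  "L2": 187, "L1": 580, "others": 225},
    "Others":    {"L4": 0,  "L3": 0,  "L2": 1,   "L1": 159, "others": 4163},
}

top_1000_bands = ["1-100", "101-300", "301-1000"]

# Top 1000 Kanji that belong to former JLPT levels 4, 3 and 2.
jlpt_4_to_2_in_top_1000 = sum(
    table[band][level] for band in top_1000_bands for level in ("L4", "L3", "L2")
)

# Top 1000 Kanji that are former level 1 or outside the JLPT levels.
level1_and_others_in_top_1000 = sum(
    table[band][level] for band in top_1000_bands for level in ("L1", "others")
)

print(jlpt_4_to_2_in_top_1000, level1_and_others_in_top_1000)  # prints: 804 196
```

The first sum corresponds to "more than 800 Kanji" among the top 1000; the second is the number of remaining top 1000 Kanji that are level 1 or outside the levels.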
4. Results 結果 (2) - 3
Top 196 Kanji at the former JLPT level 1 and others 級外 (outside the JLPT levels)
• Within the top 300: 保義公価基条応態郎 & 々
• Within the top 1000: 張士氏視素護離証企異評提姿
井統振吉策影紀為宮江派僕従系衛皇展案松隊施
我整及織環響修遺宗昭撃株節源養項興故裁沢端
障志激弁益嫌佐司眼密載己債訳症標健納請授挙
恵貴徳推描崎抗属盛監傷創徴街善援衆康模敵津
拠継隠称尾聖鮮厳攻妙融丈筋帝秘敷驚射壊刑壁
染功訴討幕扱脱範契弾診詳房避酸倉充典繰儀至
削博瞬仮縁憲択就聴握詩秀柄浜滅拡惑踏華闘微
雄維隣如審誘賀郷霊釈黙魔携掲遣艦剣致 & 誰頃
藤俺之岡伊阪
4. Results 結果 (2) - 4
• 95% text coverage requires
• Top 9600 lexemes / Top 20749 orthographic forms (types 異なり語数)
• Top 1000 Kanji +Hiragana, Katakana + alphabet
• Within top 9600 lexemes, 1700 lexemes are
estimated to require Kanji beyond the top 1000
e.g. 比較、記憶、批判、距離、指摘、希望、分析、
韓国、基礎、誕生、監督、雰囲気、卒業、洗濯
• Many of them are often written in Hiragana/Katakana
e.g.即ち、駄目、奴、凄い、頑張る、挨拶、嘘、
煙草、匂い、只、是非、無駄、喧嘩、噂、伺う
5. Discussion 考察 1)
• For general texts,
learners can attain more than 70% comprehension with the
95-96% coverage
(For English, see Hu & Nation, 2000;
for Japanese, see Komori, Mikuni & Kondo, 2004)
• Learning Kanji in order of frequency is a much more efficient way
to gain higher text coverage (Zipf’s Law: Zipf, 1949)
• The top 300-500 Kanji seem especially essential
• The top 1000-1500 Kanji might be enough for general purposes
(with occasional use of a dictionary)
• It may also mean that
learning Kanji without reaching the threshold level is of little
use…
5. Discussion 考察 2)
• Also, to attain 95% coverage, 1000 Kanji are required;
however, there are some important words not covered by the
top 1000 Kanji
• In other words, some low frequency Kanji are used for high
frequency words
• Many of those Kanji have low productivity, that is, they are
rarely used in other words, e.g. 雰囲気、卒業、洗濯
• To cover top 9600 words (lexemes), further 200 – 500 Kanji
are estimated to be required
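One simple way to make "productivity" concrete is to count, for each character, how many distinct words in a word list contain it. The Python sketch below does this for a tiny invented word list; the function name `kanji_productivity` and the list are hypothetical, and the counts say nothing about the actual BCCWJ figures.

```python
from collections import defaultdict

def kanji_productivity(words):
    """For each character, count the distinct words in `words` that contain it."""
    containing = defaultdict(set)
    for word in words:
        for ch in set(word):
            containing[ch].add(word)
    return {ch: len(ws) for ch, ws in containing.items()}

# Tiny invented word list, for illustration only.
toy_words = ["卒業", "工業", "産業", "洗濯", "洗う"]
productivity = kanji_productivity(toy_words)
# In this toy list 業 appears in three words, while 卒 and 濯 appear in one each.
```

A Kanji with a count of one in a large word list pays off for only a single word, which is the situation described above for some high frequency words.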
5. Discussion 考察 3)
• Certainly, the burden of learning Japanese
characters is heavier than in most other languages
• However, the burden of learning Japanese
vocabulary may be rather lighter once the learner
knows:
• the 1000-1500 characters
• word formation/compounding rules of Kanji
• metaphors of Kanji compounds
e.g. 入門: entering a gate → first step, to start training
• despite the fact that the text coverage is lower than
in English at all word frequency levels
5. Discussion 考察 4)
In other words, it is possible that
• the number of ‘units of learning Japanese vocabulary’ is not
as large as generally perceived
• It will also be important for students/teachers to learn/teach
• the association of different readings (typically On-reading and
Kun-reading) of each Kanji
→ to reduce the burden of learning Japanese vocabulary
e.g. associating 入門 /nyuRmoN/ with 入る /hairu/ + 門 /moN/
入る (freq. ranking: 117) is more likely to be learned earlier
than 入門 (freq. ranking: 6369) (Matsushita, 2011)
• Without this kind of association, learners have to learn more
words separately
6. Conclusion まとめ
• About 64% of BCCWJ texts are covered without Kanji (but half of them
are function words)
• To attain 95% coverage, 1000 Kanji are required; however, some
important words are not covered by the top 1000 Kanji
• To cover those words, several hundred further Kanji will be required
• Text coverage in Japanese is generally lower than in English,
i.e. Japanese requires more words to be learned
• However, many Japanese words are composed of a limited number
of Kanji; therefore, the burden of learning Japanese vocabulary
may not be as heavy as expected from the text coverage studies,
once the learner knows:
• the 1000-1500 characters
• form, meaning and compounding rules of Kanji
• metaphors of Kanji compounds
• association of different readings (e.g. On-reading and Kun-reading) of each Kanji
These slides will be uploaded to the site shown below.
「松下言語学習ラボ」
http://www.wa.commufa.jp/~tatsum/
You can find the site on Google with the keywords
松下 (Matsushita) and 言語 (language).
References 引用文献 1)
• Akimoto, M. (秋元美晴). (2002). よくわかる語彙 [Understanding Vocabulary]. Tokyo: Alc (アルク).
• Bauer, L. & Nation, P. (1993). Word families. International Journal of Lexicography, 6(4), 253-279.
• Carroll, J. B., Davies, P., & Richman, B. (1971). Word Frequency Book. New York: Houghton Mifflin, Boston American Heritage.
• Chikamatsu, N., Yokoyama, S., Nozaki, H., Long, E., & Fukuda, S. (2000). A Japanese logographic character frequency list for cognitive science research. Behavior Research Methods, Instruments, & Computers, 32(3), 482-500.
• Den, Y. (伝康晴), Yamada, A. (山田篤), Ogura, H. (小椋秀樹), Koiso, H. (小磯花絵), & Ogiso, T. (小木曽智信). (2009). UniDic. Version 1.3.11. Downloaded from http://www.tokuteicorpus.jp/dist/
References 引用文献 2)
• Harlan, P. (パトリック・ハーラン). (2011). ゼロからの日本語学習と僕の好きな日本のカルチャー (Learning Japanese from zero, and the Japanese culture I like). Cited from http://www.wochikochi.jp/topstory/2011/04/packun.php
• Hu, M. H. & Nation, P. (2000). Vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1), 403-430.
• Komori, K. (小森和子), Mikuni, J. (三國純子), & Kondo, A. (近藤安月子). (2004). 文章理解を促進する語彙知識の量的側面 ―既知語率の閾値探索の試み― (What percentage of known words in a text facilitates reading comprehension: a case study for exploration of the threshold of known words coverage). 日本語教育 [Teaching Japanese as a Foreign Language], 125, 83-92.
References 引用文献 3)
• Matsushita, T. (松下達彦). (2011). 日本語を読むための語彙データベース (The Database for Reading Japanese). Downloaded from http://www.geocities.jp/tatsum2003/
• Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? Canadian Modern Language Review, 63(1), 59-82.
• NINJAL: The National Institute for Japanese Language (国立国語研究所). (1962). 現代雑誌90種の用字・用語 第一分冊 総記および語彙表 (Vocabulary and Chinese characters in ninety magazines of today: (Volume I) General description & vocabulary frequency tables). Tokyo: Shuuei Shuppan (秀英出版).
References 引用文献 4)
• NINJAL: The National Institute for Japanese Language (国立国語研究所). (2006). 現代雑誌200万字言語調査語彙表 (The vocabulary lists from the language survey of contemporary magazines with two million running characters). Downloaded from http://www.kokken.go.jp/katsudo/seika/goityosa/index.html
• Tamamura, F. (玉村文郎). (1984). 語彙の研究と教育(上) [Research and Teaching of Vocabulary (Vol. 1)]. Tokyo: The National Institute for Japanese Language (国立国語研究所).
• Zipf, G. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. New York: Hafner.