Gender Classification
of Japanese Authors
David Edwards & Cybelle Smith
Gendered Speech in Japanese
Gender of speaker may be overtly marked:
Gender-specific first-person pronouns
僕, boku, male; 俺, ore, male; 私, watashi, female or neutral
Question: Does gender have less overt effects on Japanese texts as well?
Can word choice, morphology, and writing style indicate gender, even in noisy environments like fiction writing?
Corpora
“Peace” Corpus
• 29 personal essays by middle school students
• Topic: “Peace”
• 29 authors:
– 22 female
– 7 male
“Bookstudio” Corpus
• 485 installments of online novels
• Genre: Fantasy
• 40 authors
– 20 female
– 20 male
• Also collected ~181 installments from authors of unknown gender (for future research)
Our Baseline - The “Boku” Test
Corpus      Male Accuracy  Female Accuracy  Overall Accuracy
Peace       .71            1.0              .93
Bookstudio  .91            .43              .67
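The slides do not spell out the exact rule behind the “Boku” test. A minimal sketch, assuming the rule is "guess male if the text contains a male-marked first-person pronoun, otherwise guess female", might look like the following. The function names and the sample texts are hypothetical and only illustrate how the per-gender and overall accuracies in the table could be computed.

```python
# Male-marked first-person pronouns named on the earlier slide.
MALE_PRONOUNS = ("僕", "俺")

def boku_test(text: str) -> str:
    """Assumed rule: 'male' if any male-marked pronoun occurs, else 'female'."""
    return "male" if any(p in text for p in MALE_PRONOUNS) else "female"

def per_class_accuracy(samples):
    """samples: iterable of (text, true_gender). Returns accuracy per gender and overall."""
    correct = {"male": 0, "female": 0}
    total = {"male": 0, "female": 0}
    for text, gender in samples:
        total[gender] += 1
        correct[gender] += boku_test(text) == gender
    acc = {g: correct[g] / total[g] for g in total if total[g]}
    acc["overall"] = sum(correct.values()) / sum(total.values())
    return acc

# Made-up toy samples, not taken from either corpus.
samples = [
    ("僕は学校へ行く", "male"),
    ("私は学校へ行く", "male"),
    ("私は平和を願う", "female"),
]
accuracy = per_class_accuracy(samples)
```

Plain substring matching is a simplification: it would also fire on compounds that merely contain one of these characters, and in fiction it fires on dialogue spoken by characters of either gender.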
Classifiers Used
Naïve Bayes:
Build conditional probabilities of features given gender
Calculate probability of test data given a particular gender
Select highest-probability gender
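The three Naïve Bayes steps above can be sketched as follows. This is an illustration rather than the authors' implementation: the add-one smoothing, the class prior, and the choice to skip features unseen in training are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (feature_list, gender). Collects the counts behind P(feature | gender)."""
    feat_counts = {}        # gender -> Counter of feature occurrences
    doc_counts = Counter()  # gender -> number of training documents
    vocab = set()
    for feats, gender in docs:
        feat_counts.setdefault(gender, Counter()).update(feats)
        doc_counts[gender] += 1
        vocab.update(feats)
    return feat_counts, doc_counts, vocab

def predict_nb(model, feats):
    """Return the gender with the highest log-probability for the given features."""
    feat_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best, best_lp = None, -math.inf
    for gender, counts in feat_counts.items():
        total = sum(counts.values())
        lp = math.log(doc_counts[gender] / n_docs)  # class prior (assumption)
        for f in feats:
            if f in vocab:  # skip features never seen in training (assumption)
                # Add-one smoothing (assumption) avoids zero probabilities.
                lp += math.log((counts[f] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = gender, lp
    return best

# Made-up toy training data.
model = train_nb([(["a", "a", "b"], "male"), (["c", "c", "b"], "female")])
```

Working in log space keeps the product of many small probabilities from underflowing on long documents.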
SVM:
Used LIBSVM, a free SVM library
Find a separating hyperplane in a space with one dimension per feature
- Requires problem-specific parameters, chosen via cross-validation
Apply the hyperplane to test data
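Choosing SVM parameters by cross-validation can be sketched without tying the example to a particular SVM library. In the sketch below, `fit_and_score` is a hypothetical callback that trains a classifier on the training fold with the given parameters and returns its accuracy on the held-out fold. The fold layout and the parameter name `C` are assumptions, not details from the slides.

```python
from itertools import product

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k interleaved folds over n items."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

def grid_search(data, labels, fit_and_score, grid, k=5):
    """Return (best_params, best_mean_score) over every combination in grid."""
    best_params, best_score = None, -1.0
    for values in product(*grid.values()):
        params = dict(zip(grid, values))
        scores = []
        for train, test in k_fold_indices(len(data), k):
            scores.append(fit_and_score(
                [data[i] for i in train], [labels[i] for i in train],
                [data[i] for i in test], [labels[i] for i in test],
                **params))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

def dummy_fit_and_score(x_train, y_train, x_test, y_test, C):
    """Stand-in scorer for demonstration only: it peaks at C == 1."""
    return 1.0 / (1.0 + abs(C - 1))

best_params, best_score = grid_search(
    list(range(10)), [0, 1] * 5, dummy_fit_and_score, {"C": [0.1, 1, 10]})
```

In practice the callback would wrap the real SVM training and prediction calls, and the grid would cover both the error penalty and the kernel parameters.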
Also attempted Logistic Regression
ChaSen: Segmenter and POS-tagger
Stem     Pronunciation  Lemma  Part of Speech
(space)                        記号-空白
光       ヒカリ         光     名詞-一般
が       ガ             が     助詞-格助詞-一般
彷徨     ホウコウ       彷徨   名詞-サ変接続
う       ウ             うい   形容詞-自立 形容詞・アウオ段 ガル接続
よう     ヨウ           よう   名詞-接尾-一般
な       ナ             だ     助動詞 特殊・ダ 体言接続
暗き     クラキ         暗い   形容詞-自立
闇       ヤミ           闇     名詞-一般
Features
Stem     Pron    Lemma  POS
暗き     クラキ  暗い   形容詞-自立
KURAki   kuraki  KURAi  adjective - independent
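Each of the four analyzer fields can serve as its own feature type. The sketch below assumes tab-separated analyzer lines in the column order shown above (stem, pronunciation, lemma, POS). The `Token` type and the helper names are hypothetical, and the real tagger output may carry extra columns.

```python
from collections import Counter
from typing import NamedTuple

class Token(NamedTuple):
    stem: str   # surface form as written
    pron: str   # katakana reading
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag

def parse_line(line: str) -> Token:
    """Parse one tab-separated line; the column order is an assumption."""
    stem, pron, lemma, pos = line.rstrip("\n").split("\t")[:4]
    return Token(stem, pron, lemma, pos)

def feature_counts(tokens, field: str) -> Counter:
    """Count values of one field, prefixed with the field name to keep feature types apart."""
    return Counter(f"{field}={getattr(t, field)}" for t in tokens)

# Two tokens from the example sentence on the slide.
lines = ["暗き\tクラキ\t暗い\t形容詞-自立", "闇\tヤミ\t闇\t名詞-一般"]
tokens = [parse_line(line) for line in lines]
lemma_features = feature_counts(tokens, "lemma")
pos_features = feature_counts(tokens, "pos")
```

Prefixing each value with its field name lets several feature types share one count table without colliding, which matters when the stem and lemma are the same string (as with 闇).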
Features
私      Kanji (Chinese character)
わたし  Hiragana (phonetic)
ワタシ  Katakana (phonetic, like italics)
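The slides do not define exactly how the wordshape feature was computed. One plausible sketch, assuming it records which script each character of a token belongs to, uses the Unicode blocks for Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF), and CJK Unified Ideographs (U+4E00 to U+9FFF). The single-letter codes and the run-collapsing step are illustrative choices, not the authors' encoding.

```python
def script_of(ch: str) -> str:
    """Map one character to a script code by Unicode block."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "H"  # hiragana
    if 0x30A0 <= cp <= 0x30FF:
        return "K"  # katakana
    if 0x4E00 <= cp <= 0x9FFF:
        return "C"  # kanji (CJK Unified Ideographs)
    return "O"      # anything else: Latin letters, digits, punctuation

def wordshape(token: str) -> str:
    """Collapse runs of the same script into one code per run."""
    shape = []
    for ch in token:
        code = script_of(ch)
        if not shape or shape[-1] != code:
            shape.append(code)
    return "".join(shape)

# The three spellings of "watashi" from the slide, plus a mixed-script token.
shapes = {w: wordshape(w) for w in ("私", "わたし", "ワタシ", "暗き")}
```

Under this encoding the three spellings of the same word yield three different feature values, which is the kind of contrast the wordshape feature is meant to capture.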
Single-feature performance on Naïve Bayes:
Feature           Indic  Stem  Lem  Pron  POS  Quot  WS   SPD WS1  SPD WS2
Male Accuracy     .29    .67   .68  .70   .80  .23   .66  .49      .87
Female Accuracy   .51    .77   .78  .74   .45  .33   .85  .81      .68
Overall Accuracy  .40    .72   .73  .72   .63  .28   .76  .66      .77
Multi-feature performance on Naïve Bayes:
Trial  Male Acc.  Female Acc.  Overall Acc.
1      .63        .73          .68
2      .81        .73          .77
3      .70        .76          .73
4      .68        .76          .72
5      .68        .78          .73
6      .70        .70          .70
7      .70        .73          .71
Features combined in each trial were marked with X under: Stem, Lem, Pron, POS, Quot, WS, SPD WS1, SPD WS2 (the individual X placements are not recoverable here).
SVM Performance
• Optimizations:
– Scaling counts to avoid swamping low-frequency features
– Selecting optimal error rate and kernel parameters
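The scaling step can be illustrated with simple min-max scaling. This is a sketch under assumptions: the slides do not say which range or tool was used, so the [0, 1] target range and the function names here are hypothetical. The ranges are learned from the training data only and then reused for the test data.

```python
def fit_scaler(rows):
    """Return a (min, max) pair per feature column, learned from training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def scale(rows, ranges):
    """Linearly map each feature to [0, 1] using the training ranges."""
    scaled = []
    for row in rows:
        scaled.append([
            0.0 if hi == lo else (value - lo) / (hi - lo)  # constant column -> 0
            for value, (lo, hi) in zip(row, ranges)
        ])
    return scaled

# Made-up counts: a rare feature next to a very frequent one.
train = [[0, 100], [10, 300]]
ranges = fit_scaler(train)
scaled_test = scale([[5, 200]], ranges)
```

After scaling, a feature that occurs a few times and one that occurs hundreds of times both span the same interval, so the frequent one no longer dominates distance computations. Test values outside the training range map outside [0, 1], which is expected with this approach.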
Accuracy:
Features                          No Scaling  Scaling  Cross Validation (Training Set)  Cross Validation (Test Set)
All features (except quotations)  50.6%       48.5%    79.7%                            50.0%
Part of Speech                    50.9%       53.0%    68.0%                            47.3%
Wordshape                         63.3%
Pronunciation                     50.6%
Remaining Wordshape and Pronunciation values (column positions unclear): 75.2%, 50.6%, 64%, 77.8%, 51.8%, 50.6%
Conclusion
• Without considering gendered pronouns, we achieved performance similar to the pronoun-based “Boku” baseline
• Most-indicative feature: wordshape (use of kanji vs. hiragana vs. katakana, etc.), especially where multiple options exist
• Point of interest: male and female Japanese authors differ not just in the words they use, but in how they choose to write those words