Gender Classification of Japanese Authors
David Edwards & Cybelle Smith

Gendered Speech in Japanese
Gender of speaker may be overtly marked:
• Gender-specific first-person pronouns
  – 僕, boku, male
  – 俺, ore, male
  – 私, watashi, female or neutral
Question: Does gender have less-overt effects on Japanese texts as well? Can word choice, morphology, and writing style indicate gender, even in noisy environments like fiction writing?

Corpora
“Peace” Corpus
• 29 personal essays by middle school students
• Topic: “Peace”
• 29 authors:
  – 22 female
  – 7 male
“Bookstudio” Corpus
• 485 installments of online novels
• Genre: Fantasy
• 40 authors:
  – 20 female
  – 20 male
• Also collected ~181 installments from authors of unknown gender (for future research)

Our Baseline - The “Boku” Test
Corpus        Male Accuracy   Female Accuracy   Overall Accuracy
Peace         .71             1.0               .93
Bookstudio    .91             .43               .67

Classifiers Used
Naïve Bayes:
• Build conditional probabilities of features given gender
• Calculate probability of test data given a particular gender
• Select highest-probability gender
SVM:
• Used the LIBSVM free classifying tool
• Find dividing hyperplane in num-feature dimensional space
  – Requires problem-specific parameters chosen via cross-validation
• Apply hyperplane to test data
Also attempted Logistic Regression

Chasen: Segmenter and POS-tagger
Stem          Pronunciation   Lemma   Part of Speech
(whitespace)                          記号-空白
光            ヒカリ          光      名詞-一般
が            ガ              が      助詞-格助詞-一般
彷徨          ホウコウ        彷徨    名詞-サ変接続
う            ウ              うい    形容詞-自立 形容詞・アウオ段 ガル接続
よう          ヨウ            よう    名詞-接尾-一般
な            ナ              だ      助動詞 特殊・ダ体言接続
暗き          クラキ          暗い    形容詞-自立
闇            ヤミ            闇      名詞-一般

Features
Feature   Example       Gloss
Stem      暗き          KURAki
Pron      クラキ        kuraki
Lemma     暗い          KURAi
POS       形容詞-自立   adjective - independent

Features
• 私: Kanji (Chinese character)
• わたし: Hiragana (phonetic)
• ワタシ: Katakana (phonetic, like italics)

Single-feature performance on Naive-Bayes:
Feature            Indic   Stem   Lem   Pron   POS   Quot   WS    SPDWS1   SPDWS2
Male Accuracy      .29     .67    .68   .70    .80   .23    .66   .49      .87
Female Accuracy    .51     .77    .78   .74    .45   .33    .85   .81      .68
Overall Accuracy   .40     .72    .73   .72    .63   .28    .76   .66      .77

Multi-feature performance on Naive-Bayes:
Feature columns: Stem, Lem, Pron, POS, Quot, WS, SPDWS1, SPDWS2. An X marked each feature included in a trial; the placement of the X marks is not recoverable from this copy.
Trial   Male Acc.   Female Acc.   Overall Acc.
1       .63         .73           .68
2       .81         .73           .77
3       .70         .76           .73
4       .68         .76           .72
5       .68         .78           .73
6       .70         .70           .70
7       .70         .73           .71

SVM Performance
• Optimizations:
  – Scaling counts to avoid swamping low-frequency features
  – Selecting optimal error rate and kernel parameters
Accuracy
Features                            No Scaling   Scaling   Cross Validation   Cross Validation
                                                           (Training Set)     (Test Set)
All features (except quotations)    50.6%        48.5%     79.7%              50.0%
Part of Speech                      50.9%        53.0%     68.0%              47.3%
Wordshape and Pronunciation rows (cell assignment unclear in this copy; values in source order): 63.3%, 75.2%, 50.6%, 64%, 77.8%, 51.8%, 50.6%, 50.6%

Conclusion
• Without considering gendered pronouns, we achieved similar performance
• Most-indicative feature: wordshape (use of kanji vs. hiragana vs. katakana etc.), especially where multiple options exist
• Point of interest: male and female Japanese authors differ not just in the words they use, but in how they choose to write those words
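The “Boku” test is named in the slides but not spelled out. The sketch below assumes the simplest reading: predict male if the text contains 僕 or 俺, otherwise female. The function names and the substring-matching rule are illustrative assumptions, not the authors' implementation. The second helper shows how the overall accuracy in the baseline table follows from the per-gender accuracies and the class sizes.

```python
# Assumed rule: any occurrence of a male first-person pronoun means "male".
# A real system would match segmented tokens, since a bare substring test
# also fires on compounds that merely contain 僕.
MALE_PRONOUNS = ("僕", "俺")


def boku_test(text: str) -> str:
    """Hypothetical baseline: classify by presence of male pronouns."""
    return "male" if any(p in text for p in MALE_PRONOUNS) else "female"


def overall_accuracy(male_acc: float, n_male: int,
                     female_acc: float, n_female: int) -> float:
    """Size-weighted average of the two per-gender accuracies."""
    return (male_acc * n_male + female_acc * n_female) / (n_male + n_female)


# Peace corpus: 7 male and 22 female authors, per the Corpora slide.
peace_overall = overall_accuracy(0.71, 7, 1.0, 22)
```

With the Peace figures (.71 on 7 male authors, 1.0 on 22 female authors) the weighted average rounds to the .93 reported in the table.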
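The wordshape feature (kanji vs. hiragana vs. katakana) can be approximated from Unicode code-point ranges. This is a minimal sketch: the class letters K/H/T/O and the run-collapsing encoding are assumptions made for illustration, since the slides do not say how wordshape was encoded.

```python
def char_class(ch: str) -> str:
    """Map one character to a coarse script class by Unicode range."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "K"  # CJK Unified Ideographs (kanji)
    if 0x3041 <= cp <= 0x3096:
        return "H"  # hiragana letters
    if 0x30A1 <= cp <= 0x30FA or cp == 0x30FC:
        return "T"  # katakana letters plus the long-vowel mark
    return "O"      # everything else (Latin, digits, punctuation, ...)


def wordshape(token: str) -> str:
    """Collapse consecutive characters of the same class into one letter."""
    shape = []
    for ch in token:
        c = char_class(ch)
        if not shape or shape[-1] != c:
            shape.append(c)
    return "".join(shape)
```

Under this encoding the three spellings of watashi from the Features slide map to three different shapes (私 gives "K", わたし gives "H", ワタシ gives "T"), while 暗き gives "KH".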
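The three Naïve Bayes steps listed under Classifiers Used (build conditional probabilities, score the test data under each gender, pick the highest) can be sketched as follows. Add-one smoothing, log-probabilities, and the toy training data are assumptions for illustration; the slides do not state which smoothing or feature weighting was used.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (feature_list, label) pairs. Returns count tables."""
    feature_counts = {}
    doc_counts = Counter()
    vocab = set()
    for features, label in docs:
        doc_counts[label] += 1
        feature_counts.setdefault(label, Counter()).update(features)
        vocab.update(features)
    return feature_counts, doc_counts, vocab


def classify(features, model, alpha=1.0):
    """Return the label with the highest log posterior (assumed smoothing alpha)."""
    feature_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        counts = feature_counts[label]
        denom = sum(counts.values()) + alpha * len(vocab)
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for f in features:
            if f in vocab:  # features unseen in training are skipped
                score += math.log((counts[f] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: wordshape-like symbols, not figures from the corpora.
toy_model = train([(["K", "H", "K"], "male"), (["H", "H", "T"], "female")])
```

Any of the feature types in the slides (stem, lemma, pronunciation, POS, wordshape) could be supplied as the feature list; combining several amounts to concatenating their symbols.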
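The “scaling counts” optimization listed under SVM Performance is commonly done by rescaling each feature column to a fixed range before training. The sketch below uses min-max scaling to [0, 1] in plain Python; the target range and the function names are assumptions, and the slides do not say which scaling the authors applied.

```python
def fit_scaler(rows):
    """Record (min, max) for each feature column of the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale(rows, ranges):
    """Rescale each column to [0, 1] using ranges learned from training data."""
    scaled = []
    for row in rows:
        scaled.append([
            0.0 if hi == lo else (v - lo) / (hi - lo)
            for v, (lo, hi) in zip(row, ranges)
        ])
    return scaled


# Hypothetical counts: column 0 is a frequent feature, column 1 a rare one.
train_rows = [[100, 1], [300, 3], [200, 2]]
ranges = fit_scaler(train_rows)
scaled_rows = scale(train_rows, ranges)
```

After scaling, both columns span the same range, so the high-count feature no longer dominates distance computations. The ranges are learned on the training set and reused on the test set so that both are transformed identically.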