A Text Analysis Method Using a Nonparametric Bayesian Model:
An Attempt to Obtain Similar Japanese Historical Materials Using the Variable-Order N-gram
Taizo YAMADA
National Institutes for the Humanities, JAPAN
PNC2012
Contribution
• We prototyped a search system which has
 – a method of collecting similar materials for Japanese historical materials using the Vector Space Model (VSM), and
 – a method of displaying the similar materials on a timeline.
• In order to use the VSM, we introduced a word segmentation method using a nonparametric Bayesian model.
 – The word segmentation technique is unsupervised learning:
  • no dictionary is required;
  • it tries to find hidden structure in unlabeled text.
Outline
• Background, purpose
• Methodology
– Show search results sorted by similarity using the VSM
– Word segmentation using a nonparametric Bayesian model
• Conclusion, future work
Background
• Recently, the amount of encoded text of Japanese historical materials has been growing.
 – Not only catalogue encoding, but also full-text encoding.
• Japanese historical full-text databases in Japan
– SHIPS
• Historiographical Institute, The University of Tokyo
– AzumaKagami Database
• National Institute of Japanese Literature
– Kaneaki-kyo-ki Database
• National Museum of Japanese History
–…
Example : SHIPS @ Historiographical
Institute
URL: http://wwwap.hi.u-tokyo.ac.jp/ships/shipscontroller
27 databases
Target: premodern Japanese historical materials
Period: from the Nara period to the Edo period
Example : SHIPS @ Historiographical
Institute
The Dai Nihon Shiryo Unified Database
Full-text Databases
The Full-text Database of the Old Japanese Diaries
The Komonjo full-text database
The Nara Period Komonjo full-text database
The Heian Ibun full-text database
The Kamakura Ibun full-text database
The Komonjo full-text database(1)
keywords
The Komonjo full-text database (2)
(Screenshot of the result list, annotated with: the number of results and the search condition; the date; the text, i.e. the substring where the query hits; bibliographic information; and a link to the image.)
The Komonjo full-text database (2)
• The system can sort the results according to date or bibliographic information (the screenshot shows the results sorted by date).
• But… sorting according to similarity with the query is not supported.
• Furthermore…
 – Collecting similar texts is not supported.
 – The system has no visual analysis method (e.g., display on a timeline).
Purpose
• For efficient analysis of a huge amount of texts, a method of obtaining and visualizing similar texts is important.
 – Which material is similar to a query?
 – Which material is similar to a material selected by a user?
• Our approach is as follows:
 – Show search results sorted by similarity to the query.
 – Show materials similar to the selected search result.
 – Display the similar materials on a timeline.
Sorting according to similarity (1)
• Let us search with the following conditions:
 – Query: “足利直義” (ASHIKAGA, Tadayoshi)
 – Target:
  • the full-text databases in SHIPS;
  • Temporal: “南北朝時代” (the Nanboku-chō period, Japan, spanning 1333 to 1392);
  • 7,007 materials (catalogues);
  • Text: 4,067 distinct characters, 1,204,594 characters in total.
Sorting according to similarity (2)
• The search results are sorted according to their similarity to the query.
• The similarity (score) is calculated using the Vector Space Model (VSM).
Let us select result No. 2.
Show similar materials (1)
• When a result is selected, the system presents the details of that result (title, text, temporal information, …) and the materials which are similar to it.
Moreover, let us select No. 2 of the similar materials.
Show similar materials (2)
• The selected similar material is displayed in the same way.
Displaying on the timeline (1)
• From the same view, clicking switches to the timeline display.
Displaying on the timeline (2)
• This timeline feature was developed using Simile Timeline (version 2.3.0, with the Ajax lib pre-2.3.0).
• The selected material and its similar materials are placed on the timeline.
• The color of an icon shows the similarity rank of the material:
 – the selected material itself (font color: green)
 – top 1-5
 – top 6-9
 – top 11-
• Each material has normalized temporal information; according to it, the materials are placed on the timeline.
What should be solved?
• In order to sort results according to similarity, a method of similarity calculation is needed.
• We use the Vector Space Model (VSM).
 – In the VSM, a material is represented as a vector.
 – Each element of the vector (the weight of a term in the material) is calculated using the tf.idf measure.
 – In order to calculate the similarity between a query and a text, a vector is also created for the query.
Similarity Calculation using VSM
• The similarity between x and y is calculated by the following formula:
 – score(x, y) = Σ_i weight_x(i) · weight_y(i) / √( Σ_i weight_x(i)² · Σ_i weight_y(i)² )
 – weight_x(i) = tf_x(i) · ( log( N / df(i) ) + 1 )
 – tf_x(i): the frequency of term i in document (material) x
 – df(i): the number of documents (materials) that contain term i
 – N: the number of documents (materials)
• Currently, the number of terms in the system is 119,282.
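As an illustration, here is a minimal Python sketch of the two formulas above (not the actual system code): tfidf_vector builds a weight vector from the segmented terms of one material, and score computes the cosine similarity between two such vectors. The function names and the dict-based data layout are assumptions made for this example.

import math
from collections import Counter

def tfidf_vector(terms, df, N):
    # terms: the list of words obtained by segmenting one material (or the query)
    # df[t]: the number of materials containing term t;  N: the number of materials
    tf = Counter(terms)
    return {t: f * (math.log(N / df[t]) + 1) for t, f in tf.items()}

def score(x, y):
    # cosine similarity between two weight vectors (dicts: term -> weight)
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

For instance, score(tfidf_vector(query_terms, df, N), tfidf_vector(material_terms, df, N)) would give the ranking score of one material with respect to the query (query_terms and material_terms are hypothetical term lists).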
Similarity Calculation using VSM
• By the way… the tf.idf measure needs terms extracted from the text.
 – Is it possible to extract terms from Japanese historical text?
Word segmentation (1)
• An example of the text of a Japanese historical material:
足利直義御教書案(切紙) 島津家文書_1
御教書案師直師泰誅伐事早馳参御方可致軍忠之状如件観応元年
十一月三日御判島津左京進入道殿
• Term extraction is very hard…
 – There is no morphological analyzer for the texts of Japanese historical materials.
 – Moreover, we do not have any dictionaries (for persons, locations, …) that cover the Nanboku-chō period.
• Here, we use the word segmentation results of the text as terms.
 – In Japanese historical text, there are
  • many nouns,
  • some auxiliary verbs, some verbs, and few others.
Word segmentation (3)
• We introduce a word segmentation technique using the NPYLM (Nested Pitman-Yor Language Model) [1].
 – The NPYLM generates probability distributions with a nonparametric Bayesian model, and is a word n-gram model that needs no dictionary (an unsupervised learning technique).
  • In the model, a word is represented by a variable-order character n-gram.
[1] Daichi Mochihashi, Takeshi Yamada, Naonori Ueda: "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling". ACL-IJCNLP 2009, pp. 100-108, 2009.
• The "nonparametric" world: there are "many parameters", not "no parameters".
• An example: the result of word segmentation using the NPYLM is as follows:
text:
御教書案師直師泰誅伐事早馳参御方可致軍忠之状如件観応元年十
一月三日御判島津左京進入道殿
result:
御教書 | 案 | 師直師 | 泰誅伐事 | 早馳 | 参御方 | 可致 | 軍忠之 | 状如件 | 観応
元年十一月三 | 日 | 御判 | 島津 | 左京進 | 入道殿
correct?:
御教書 | 案 | 師直 | 師泰 | 誅伐事 | 早馳参 | 御方 | 可致 | 軍忠之状 | 如件 | 観
応 | 元年 | 十一月 | 三日 | 御判 | 島津 | 左京進 | 入道 | 殿
It is not so bad!? A number of the segments are correct.
The technique of word segmentation (1)
• Forward filtering–backward sampling
 – This is often used in Bayesian learning of HMMs (Hidden Markov Models).
 – It has two phases: forward filtering and backward sampling.
• Forward filtering phase: computing the forward variables α_t[k]
 – α_t[k] is the probability of the string s (= c_1, …, c_t) with its final k characters being a word.
 – α_t[k] = Σ_{j=1..t−k} p(c_{t−k+1}^t | c_{t−k−j+1}^{t−k}) · α_{t−k}[j]
 – Aggregating (marginalizing) over the length j of the previous word!
(Figure: the string s = 御教書案 | 師直師泰 | 誅伐事 | 早馳参…, with the final k characters c_{t−k+1}^t as the candidate word and the preceding j characters c_{t−k−j+1}^{t−k} as the previous word.)
• Here, p(c_{t−k+1}^t | c_{t−k−j+1}^{t−k}) is the probability of a word bigram.
• How is this probability computed? → using the NPYLM.
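As an illustration, a minimal Python sketch of the forward filtering pass described above (not the actual implementation): bigram_prob(word, prev) is a hypothetical stand-in for the NPYLM word bigram probability p(word | prev), and "<s>" marks the sentence start.

def forward_filter(s, bigram_prob):
    # alpha[t][k]: probability of the prefix c_1..c_t of s with its final k characters being a word
    T = len(s)
    alpha = [[0.0] * (T + 1) for _ in range(T + 1)]
    alpha[0][0] = 1.0                               # empty prefix, base case
    for t in range(1, T + 1):
        for k in range(1, t + 1):
            word = s[t - k:t]                       # candidate word c_{t-k+1}..c_t
            if t - k == 0:                          # the word starts the sentence
                alpha[t][k] = bigram_prob(word, "<s>")
                continue
            total = 0.0
            for j in range(1, t - k + 1):           # j: length of the previous word
                prev = s[t - k - j:t - k]           # previous word c_{t-k-j+1}..c_{t-k}
                total += bigram_prob(word, prev) * alpha[t - k][j]
            alpha[t][k] = total
    return alpha

In practice the alpha values would be kept in log space (or rescaled) to avoid underflow on long strings.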
The technique of word segmentation (2)
• Backward sampling phase: determining the value of k in α_t[k]
 – The value of k is drawn in proportion to the forward variable α_t[k]:
 – k ∝ p(w_i | c_{t−k+1}^t, Θ) · α_t[k]
• Start state: p($ | c_{N−k+1}^N)
 – N: the length of the string s (thus, s = c_1, …, c_N)
 – $: a sentence boundary; $ is at the end of the string.
• Termination condition: t = 0
• k: the length of the word (substring)
 – Once k is determined, the word is determined too.
(Figure: the string s = 御教書案 | 師直師泰 | 誅伐事 | 早馳参…, where the sampled k fixes the word w_i = c_{t−k+1}^t.)
• Here, p(w_i | c_{t−k+1}^t) is the probability of a word bigram.
• How is this probability calculated? → using the NPYLM.
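A matching Python sketch of the backward sampling pass, reusing the alpha table and the hypothetical bigram_prob from the forward sketch; w starts as the sentence boundary "$", and k is drawn in proportion to p(w | c_{t−k+1}^t) · α_t[k].

import random

def backward_sample(s, alpha, bigram_prob):
    t, w = len(s), "$"                  # start from the end, with the boundary symbol
    words = []
    while t > 0:
        ks = list(range(1, t + 1))
        weights = [bigram_prob(w, s[t - k:t]) * alpha[t][k] for k in ks]
        k = random.choices(ks, weights=weights)[0]  # draw k in proportion to the weights
        w = s[t - k:t]                  # the newly determined word c_{t-k+1}..c_t
        words.append(w)
        t -= k
    return list(reversed(words))        # words in left-to-right order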
The technique of word segmentation (3)
• The NPYLM is an extension of the HPYLM (Hierarchical Pitman-Yor Language Model) [1].
 – [1] Y. W. Teh: A Hierarchical Bayesian Language Model based on Pitman-Yor Processes, ACL 2006.
 – In the HPYLM, for a context h and a word w, the probability p(w|h) that w follows h is computed by
  • P(w|h) = ( c(w|h) − d·t_hw ) / ( θ + c(h) ) + ( θ + d·t_h· ) / ( θ + c(h) ) · p(w|h′)
 – h′: the suffix of h consisting of all but the first word
 – t_h· = Σ_w t_hw
 – c(h) = Σ_w c(w|h)
 – t_hw, d, θ: parameters of the HPYLM … many parameters, hence "nonparametric"
 – The n-gram frequency is discounted and interpolated by the (n−1)-gram frequency.
• The NPYLM has a character n-gram model and a word n-gram model.
 – The character n-gram is embedded in the base measure of the word n-gram.
  • Thus, the 0-gram of the word n-gram is the character n-gram.
 – Both the character n-gram and the word n-gram are modeled as HPYLMs.
• The details are given in Daichi Mochihashi, Takeshi Yamada, Naonori Ueda: "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling". ACL-IJCNLP 2009, pp. 100-108, 2009.
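A minimal Python sketch of the HPYLM predictive probability above, under an assumed data layout: counts[h][w] = c(w|h) and tables[h][w] = t_hw for the restaurant of context h, and shorter_prob stands for the recursive call p(w|h′) (or the character-level base measure once h is empty). In a real HPYLM the discount d and strength θ also depend on the context length; they are passed in here as plain numbers for simplicity.

def hpy_prob(w, h, counts, tables, d, theta, shorter_prob):
    # h: a tuple of context words; h[1:] is the suffix h' (all but the first word)
    c_hw = counts.get(h, {}).get(w, 0)
    c_h  = sum(counts.get(h, {}).values())
    t_hw = tables.get(h, {}).get(w, 0)
    t_h  = sum(tables.get(h, {}).values())
    parent = shorter_prob(w, h[1:])     # p(w|h'), or the base measure when h is empty
    return ((c_hw - d * t_hw) / (theta + c_h)
            + (theta + d * t_h) / (theta + c_h) * parent)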
The technique of word segmentation (4)
• Finally, the segmented string w = w_i, w_{i−1}, …, w_1 can be obtained.
text:
御教書案師直師泰誅伐事早馳参御方可致軍忠之状如件観応元年十
一月三日御判島津左京進入道殿
Segment:
御教書 | 案 | 師直師 | 泰誅伐事 | 早馳 | 参御方 | 可致 | 軍忠之 | 状如件 | 観応
元年十一月三 | 日 | 御判 | 島津 | 左京進 | 入道殿
• Note: the procedure is carried out by an MCMC (Markov chain Monte Carlo) method.
 – Using all the training data, the various parameters are sampled. We use Gibbs sampling:
  • all parameters are fixed except for the current sampling target;
  • that parameter is sampled given the training data;
  • the next sampling target is chosen;
  • the above is repeated until all the parameters have been sampled.
 – The procedure is repeated many times,
  • until "you are satisfied" (e.g., the change of the parameters is quite small), or the parameters have converged.
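As a rough sketch of how the sampling loop described above could be organized (blocked, sentence-wise Gibbs sampling), reusing forward_filter and backward_sample from the earlier sketches: the model object with add(), remove() and bigram_prob is a hypothetical stand-in, not an actual API.

import random

def gibbs_train(sentences, model, n_iters=100):
    segmentations = {}
    for s in sentences:
        segmentations[s] = [s]                 # start with each sentence as one "word"
        model.add(segmentations[s])
    for it in range(n_iters):
        order = list(sentences)
        random.shuffle(order)
        for s in order:
            model.remove(segmentations[s])     # take the old words out of the counts
            alpha = forward_filter(s, model.bigram_prob)
            words = backward_sample(s, alpha, model.bigram_prob)
            model.add(words)                   # add the resampled words back
            segmentations[s] = words
        # the hyperparameters (d, theta, ...) would also be resampled here
    return segmentations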
Examples: Word segmentation
13750880550 赤堀文書 大日本史料6_46
text:
譲與所領事右美作国塩湯郷壱分地頭職并公文職内半分者兄義季観応二年二月廿一日
戴御判今半分事康季同日戴御判者也彼御判同御施行遵行軍忠之御判等相副之甥為帯
刀左衛門尉季治於猶子譲與者也聊一族親類中不可有違乱妨儀也仍為後日譲状如件永
和元年八月十日藤原康季
譲與 | 所領 | 事右 | 美作国 | 塩湯郷壱分 | 地頭 | 職并 | 公文 | 職内 | 半分者 | 兄義季 | 観
応二 | 年二月 | 廿一日 | 戴御判 | 今半分事 | 康季同日 | 戴御 | 判者也 | 彼 | 御判 | 同 | 御
Segment: 施行 | 遵行 | 軍忠之 | 御判 | 等相副之 | 甥為 | 帯刀 | 左衛門 | 尉季 | 治於 | 猶子 | 譲與 |
者也 | 聊一 | 族親 | 類中 | 不可有 | 違乱 | 妨儀也 | 仍為後日 | 譲状 | 如件 | 永和元 | 年八
月 | 十日 | 藤原 | 康季
13900080170 大内義弘挙状案 吉川家文書_2
text:
吉河讃岐入道玄龍申安芸国山県郡内志路原庄事依軍忠預置之候御下文事申御沙汰候
者可然候恐惶謹言明徳元年八月十七日左京権大夫義弘在判進上御奉行所
Segment:
吉河 | 讃岐 | 入道 | 玄龍申 | 安芸国 | 山県郡内 | 志路原庄事 | 依 | 軍忠 | 預置 | 之候 | 御
下文 | 事申 | 御沙汰 | 候者 | 可然候 | 恐惶 | 謹言 | 明徳元 | 年八月 | 十七日 | 左京 | 権大
夫 | 義弘 | 在判 | 進上御 | 奉行所
The segmentation might become finer if some dictionaries could be introduced.
→ We should try semi-supervised learning!?
Conclusion, Future Work
• In order to support searching and browsing a huge amount of texts of Japanese historical materials, we introduced a search system using the VSM and a timeline.
• In order to use the VSM, we introduced word segmentation using the NPYLM.
• All the methods we introduced are still at the "sketch" stage.
 – The search system is a prototype.
 – We have introduced and prototyped the methods, but have not evaluated them.
  • The evaluation is very hard!?
• We will consider introducing semi-supervised learning
 – for finer word segmentation…
• And we will investigate text analysis methods for Japanese historical materials.
 – For example:
  • topic extraction,
  • with temporal data: topic detection / topic tracking, …
 – for discovering "semantically" similar materials…
Thank you for listening to my presentation.
– E-mail: [email protected]