Text Analysis Method Using Latent Topics
for Field Notes in Area Studies
Taizo Yamada
Historiographical Institute,
The University of Tokyo
2013/12/13
PNC2013
Contribution
Text analysis for Area Studies
– Applying a topic model to field notes in Area Studies
• We use LDA (Latent Dirichlet Allocation) as the topic model.
• Similar fragments or scenes in a field note can be retrieved.
– Visualization of the relationships between place names
• The place information has no latitude or longitude.
• We have no place-name dictionary.
Outline
Background, purpose
Methodology of text analysis
– Text structuring
– Term extraction
– Characterization of terms
– Method of obtaining similar text fragments
– Visualization and System
Conclusion
Background
➢ Recently, Area Studies has made remarkable progress.
– Researchers in Area Studies can search and analyze large volumes of data easily and quickly,
– using information technology such as web technology, data analysis, data engineering, …
– To promote such analysis, researchers have published databases of
• catalogues, images, statistical data, spatial data and temporal data.
➢ For further progress in the field,
– we believe that text analysis is an essential element.
– A text such as a field note describes sights, scenes and customs,
– but its latent topics or subjects can be the key elements characterizing the area.
Purpose
➢ A text analysis method for field notes in Area Studies.
– We prepare a field note database in which the data unit is the description of a sight or a scene.
– To detect latent topics, we use latent Dirichlet allocation (LDA).
• LDA is a topic model.
• In LDA, each text is viewed as a mixture of latent topics, and each topic as a mixture of words.
– To detect the track of the investigator in a field note,
• we visualize the track, which shows the relations between place names.
Text(1)
Target: Koichi Takaya, “The Field note collection 2 Sumatra” (in Japanese)
– 1984. 10. 19 to 1985. 1. 18
– Covers the whole of Sumatra Island
Text structuring (1)
Text structuring (2)
Term extraction(1)
Text (a scene)
マングローブ。前面の海にはバガン( 魚取り用の櫓) いくつもある。
(English: “Mangroves. In the sea in front there are bagan (platforms for catching fish), several of them.”)
Result of morphological analysis (surface form, then features)
マングローブ	名詞,一般,*,*,*,*,マングローブ,マングローブ,マングローブ
。	記号,句点,*,*,*,*,。,。,。
前面	名詞,一般,*,*,*,*,前面,ゼンメン,ゼンメン
の	助詞,連体化,*,*,*,*,の,ノ,ノ
海	名詞,一般,*,*,*,*,海,ウミ,ウミ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
バガン	名詞,一般,*,*,*,*,*
。	記号,句点,*,*,*,*,。,。,。
魚	名詞,一般,*,*,*,*,魚,サカナ,サカナ
取り	名詞,接尾,一般,*,*,*,取り,トリ,トリ
用	名詞,接尾,一般,*,*,*,用,ヨウ,ヨー
の	助詞,連体化,*,*,*,*,の,ノ,ノ
櫓	名詞,一般,*,*,*,*,櫓,ロ,ロ
。	記号,句点,*,*,*,*,。,。,。
いくつ	名詞,代名詞,一般,*,*,*,いくつ,イクツ,イクツ
も	助詞,係助詞,*,*,*,*,も,モ,モ
ある	動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
。	記号,句点,*,*,*,*,。,。,。
EOS
“名詞”: noun, “助詞”: postpositional particle, “記号”: symbol, “動詞”: verb
➢ Morphological analysis
– mecab+ipadic (morphological analyzer; dictionary)
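The analyzer output shown on this slide can be turned into (surface, features) pairs with a few lines of code. The following is a minimal sketch, assuming MeCab's default output layout (surface form, a tab, comma-separated features, and a final EOS line). The function name parse_mecab_output is hypothetical, and the sketch parses text that has already been produced rather than calling MeCab itself.

```python
def parse_mecab_output(output):
    """Parse MeCab default-format output into (surface, features) pairs.

    Assumes each token line is "surface<TAB>feature1,feature2,..." and
    that "EOS" marks the end of a sentence (hypothetical helper).
    """
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, feature_str = line.partition("\t")
        tokens.append((surface, feature_str.split(",")))
    return tokens


# First four tokens of the scene on the slide.
sample = (
    "マングローブ\t名詞,一般,*,*,*,*,マングローブ,マングローブ,マングローブ\n"
    "。\t記号,句点,*,*,*,*,。,。,。\n"
    "前面\t名詞,一般,*,*,*,*,前面,ゼンメン,ゼンメン\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "EOS\n"
)

tokens = parse_mecab_output(sample)
# The first feature is the part of speech; "名詞" means noun.
nouns = [surface for surface, features in tokens if features[0] == "名詞"]
print(nouns)
```

Keeping the full feature list, rather than only the part of speech, leaves the noun subtype (second feature) available for the filtering step that follows.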
Term extraction(2)
Result of morphological analysis: the same scene and analyzer output as in Term extraction(1).
➢ Extraction target: nouns only
➢ But the following noun types are not extracted:
– pronouns, numbers
➢ The number of distinct terms is 5,666.
Bag-of-Words (term:frequency)
Bakauhumi:1
マングローブ:1
前面:1
海:1
バガン:1
魚取り用:1
櫓:1
ココヤシ:1
下:1
家:1
チョウジ:1
斜面:1
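The extraction rule can be sketched as a small filter over the analyzer output. This is an illustration under stated assumptions, not the author's implementation: the function name bag_of_words is hypothetical, tokens are given as (surface, part of speech, subtype) triples, and the rule that a suffix noun ("接尾") is glued onto the preceding noun is inferred only from the term "魚取り用" in the bag-of-words list.

```python
from collections import Counter

# ipadic subtype labels for pronoun and number, the excluded noun types.
EXCLUDED_SUBTYPES = {"代名詞", "数"}


def bag_of_words(tokens):
    """Count noun terms in one scene.

    tokens: list of (surface, pos, subtype) triples.
    Assumption: a suffix noun ("接尾") is appended to the noun right
    before it, which is how "魚取り用" would become a single term.
    """
    terms = []
    prev_was_noun = False
    for surface, pos, subtype in tokens:
        if pos != "名詞" or subtype in EXCLUDED_SUBTYPES:
            prev_was_noun = False
            continue
        if subtype == "接尾" and prev_was_noun:
            terms[-1] += surface
        else:
            terms.append(surface)
        prev_was_noun = True
    return Counter(terms)


scene = [
    ("マングローブ", "名詞", "一般"), ("。", "記号", "句点"),
    ("前面", "名詞", "一般"), ("の", "助詞", "連体化"),
    ("海", "名詞", "一般"), ("に", "助詞", "格助詞"), ("は", "助詞", "係助詞"),
    ("バガン", "名詞", "一般"), ("。", "記号", "句点"),
    ("魚", "名詞", "一般"), ("取り", "名詞", "接尾"), ("用", "名詞", "接尾"),
    ("の", "助詞", "連体化"), ("櫓", "名詞", "一般"), ("。", "記号", "句点"),
    ("いくつ", "名詞", "代名詞"), ("も", "助詞", "係助詞"),
    ("ある", "動詞", "自立"), ("。", "記号", "句点"),
]

bow = bag_of_words(scene)
print(dict(bow))
```

For this scene the sketch yields the six terms from マングローブ to 櫓 in the list above, each with frequency 1; the pronoun いくつ is dropped.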
Term extraction(3)
Field note excerpt (in Japanese):
720km: Jakarta 出発
830km: Bakauhumi (*1)
① マングローブ。前面の海にはバガン( 魚取り用の櫓) いくつもある。
② ココヤシ多い。この下に少し家ある。
③ チョウジの多い斜面。
853km: 稲。今若実り。
54km: このあたりよりチョウジ多くなる。その下を時に耕している。トウモロコシを植えるらしい。
70km: 水田をよく見る。東に海見える。
77-79km: ココヤシが多い。時に水田あり、それ実っている。
85km: ココヤシ園広い。時にチョウジがある。
90km: 西海岸に来る。マングローブあるが、その背後にはココヤシ多い。
97km: チョウジが多い。この辺りは殆どがジャワ人だという。
01km: Sidomulyo。周り、シラス台地。
11km: 5 ~ 10 年生のココヤシ多い。他に、チョウジ、バナナ、ランブータン、ドリアン。
18km: 左の海にはバガンが100 基ほど見える。
22km: 海岸は広くココヤシ。これ60 年生。高みはチョウジ多い。
➢ Mark up the extracted terms
– The terms may characterize the scene in the text.
– The extracted terms differ from scene to scene.
➢ What features do the terms have?
– We need a method for detecting those features,
– but we have no thesaurus or dictionary.
➢ Therefore, to detect the features, we introduce a topic model.
– Using a topic model, we can detect latent topics as the features.
Using topic model(1)
➢ We use LDA (Latent Dirichlet Allocation) as the topic model.
– Topic model
• Models the co-occurrence of terms.
• The result gives a classification of terms.
– Kinds of topic model
• LSI (Latent Semantic Indexing): introduces latent topics into the VSM (Vector Space Model).
• PLSI (Probabilistic Latent Semantic Indexing): a reformulation of LSI as a probabilistic model.
• LDA: an extension of PLSI based on Bayesian learning.
Using topic model(2)
➢ LDA: D.M. Blei, et al. “Latent Dirichlet Allocation”, 2003.
– A document generation model in which the generating probability of latent topics follows a Dirichlet distribution.
– Latent topics can be determined once the parameters of LDA have been estimated.
• $p(d \mid \alpha, \beta) = \int \mathrm{Dir}(\theta \mid \alpha) \left( \prod_{n=1}^{|d|} \sum_{k=1}^{C} p(w_n \mid z_k, \beta)\, p(z_k \mid \theta) \right) d\theta$
– $\alpha, \beta$: parameters of LDA
– $z = (z_1, z_2, \ldots, z_C)$: latent topics
– $\theta = (\theta_1, \theta_2, \ldots, \theta_C)$: generating probabilities of the latent topics
– $d = (w_1, w_2, \ldots, w_{|d|})$: document; $w_n$: term; $|d|$: the total number of terms in $d$
– Dir: Dirichlet distribution
Using topic model(2)
➢ LDA: D.M. Blei, et al. “Latent Dirichlet Allocation”, 2003.
➢ Reading $p(d \mid \alpha, \beta)$ as a generative process:
– $\mathrm{Dir}(\theta \mid \alpha)$: θ is generated from α.
– $p(z_k \mid \theta)$: a topic $z_k$ is generated according to θ.
– $p(w_n \mid z_k, \beta)$: a term $w_n$ is generated according to topic $z_k$ and β.
– $\prod_{n=1}^{|d|}$: the document is generated from its terms.
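The generative reading of LDA can be made concrete with a short simulation. This is a sketch under stated assumptions: it only samples a document from given parameters and does not estimate α or β from data. The function names, the four-term vocabulary and the two-topic table beta are made up for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(alpha, beta, vocab, length, rng):
    """Generate one document following the LDA generative story.

    alpha: Dirichlet parameter, one positive value per topic (length C).
    beta:  beta[k][v] = p(term v | topic z_k); each row sums to 1.
    Returns theta and a list of (term, topic index) pairs.
    """
    theta = sample_dirichlet(alpha, rng)             # theta is generated by alpha
    topic_ids = list(range(len(alpha)))
    document = []
    for _ in range(length):
        k = rng.choices(topic_ids, weights=theta)[0]  # topic z_k according to theta
        w = rng.choices(vocab, weights=beta[k])[0]    # term w_n according to z_k and beta
        document.append((w, k))
    return theta, document


# Toy parameters (hypothetical): topic 0 is "coast", topic 1 is "plantation".
vocab = ["マングローブ", "海", "ココヤシ", "チョウジ"]
beta = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
alpha = [0.5, 0.5]

rng = random.Random(0)
theta, doc = generate_document(alpha, beta, vocab, 8, rng)
print(theta)
print(doc)
```

Fitting LDA runs this story in reverse: given only the terms of each scene, it estimates the topic mixture θ of each scene and the term distribution of each topic.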
Detection of latent topic
➢ Features of LDA
– Text
• A set of terms
• Has multiple topics
– Term
• Can belong to multiple topics,
• not only to one specific topic
➢ Spatial change (scene change)
– Visualizing the detection results lets us see how the topics change.
– The latent topics change as the location changes.
➢ Which scenes, then, are similar to each other?
Similarity between texts (1)
➢ We introduce the VSM (Vector Space Model).
– The VSM requires a feature vector for each text.
– Each element of the vector is the total number of terms belonging to one topic.
– Similarity between vectors is calculated by cosine similarity.
• $\mathrm{sim}(x, y) = \dfrac{\sum_{k} \mathrm{weight}(z_k, x) \cdot \mathrm{weight}(z_k, y)}{\sqrt{\sum_{k} \mathrm{weight}(z_k, x)^2} \cdot \sqrt{\sum_{k} \mathrm{weight}(z_k, y)^2}}$
– $x, y$: texts (scenes)
– $\mathrm{weight}(z_k, x)$: the weight of topic $z_k$ in text $x$
– $\mathrm{weight}(z_k, x) = \mathrm{tf}(x_k) \cdot \log \dfrac{N}{df(k) + 1}$: tf.idf weighting
– $x_k$: the frequency of $z_k$ in text $x$
– $df(k)$: the number of texts that contain topic $z_k$
– $N$: the number of texts
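The similarity computation on this slide can be sketched directly from the definitions of weight and sim. The function names and all numbers below (three topics, ten texts, the per-topic counts) are made up for illustration. Returning 0.0 for an all-zero vector is a convention chosen here, not something stated in the source.

```python
import math


def topic_weights(counts, doc_freq, n_texts):
    """weight(z_k, x) = tf(x_k) * log(N / (df(k) + 1)) for every topic k."""
    return [tf * math.log(n_texts / (df + 1)) for tf, df in zip(counts, doc_freq)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def similarity(x_counts, y_counts, doc_freq, n_texts):
    """sim(x, y): cosine similarity of the tf.idf-weighted topic vectors."""
    wx = topic_weights(x_counts, doc_freq, n_texts)
    wy = topic_weights(y_counts, doc_freq, n_texts)
    return cosine_similarity(wx, wy)


# Toy data (hypothetical): number of terms per topic in three scenes.
n_texts = 10            # N
doc_freq = [3, 4, 2]    # df(k): number of texts containing topic z_k
scene_a = [3, 0, 1]
scene_b = [6, 0, 2]     # same topic proportions as scene_a
scene_c = [0, 2, 0]     # shares no topic with scene_a

print(similarity(scene_a, scene_b, doc_freq, n_texts))
print(similarity(scene_a, scene_c, doc_freq, n_texts))
```

Note that the idf factor log(N / (df(k) + 1)) becomes zero or negative for a topic that occurs in nearly every text, so such topics contribute little to, or even count against, the similarity.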
Similarity between texts (2)
Track of investigation (1)
➢ Beginning of the text
– Date: Oct. 19, ’84
– “Jakarta よりKotabumi へ行く。” (“Go from Jakarta to Kotabumi.”)
– The text describes a movement from “Jakarta” to “Kotabumi”.
➢ Tracking the movement
– Extract place names.
– Rule:
• from: ○○[から|より|出発|…] (から, より: “from”; 出発: “depart”)
• to: ○○[へ|まで|に|泊|…] (へ, に: “to”; まで: “as far as”; 泊: “stay overnight”)
– Unfortunately, we have no dictionaries or gazetteers.
– For the time being, we simply connect the extracted place names.
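The from/to rule can be sketched with regular expressions. This is an illustration, not the system's implementation. It assumes that place names are written in Latin letters starting with a capital, as in the sample sentence, and it uses only the markers listed in the rule (the elided "…" alternatives are not filled in). The function names are hypothetical.

```python
import re

# Assumption: a place name is a capitalized Latin-letter word.
PLACE = r"([A-Z][A-Za-z]+)"
FROM_RE = re.compile(PLACE + r"\s*(?:から|より|出発)")
TO_RE = re.compile(PLACE + r"\s*(?:へ|まで|に|泊)")


def extract_places(text):
    """Return (position, role, name) for every rule match, in text order."""
    found = [(m.start(), "from", m.group(1)) for m in FROM_RE.finditer(text)]
    found += [(m.start(), "to", m.group(1)) for m in TO_RE.finditer(text)]
    return sorted(found)


def connect(places):
    """Link each extracted place name to the next different one."""
    names = [name for _, _, name in places]
    return [(a, b) for a, b in zip(names, names[1:]) if a != b]


sentence = "Jakarta よりKotabumi へ行く。"
places = extract_places(sentence)
moves = connect(places)
print(places)
print(moves)
```

Because no gazetteer is available, any capitalized word followed by one of the markers is treated as a place name, so false matches are possible. Connecting consecutive names in text order is one simple reading of "connect the extracted place names".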
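Track of investigation (2)
[Figure: force-directed graph of the extracted place names, drawn using D3.js (http://d3js.org/). Node labels visible on the slide: Jakarta, Pekanbaru, Tembilahan, Solok, Singapore. Period labels: Oct. ’84, Nov. ’84, Dec. ’84, Jan. ’85.]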
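To draw such a graph, the connected place names have to be handed to the browser in some structured form. The following sketch builds a nodes/links dictionary of the kind commonly fed to D3 force layouts and serializes it as JSON. It is an assumption about the data shape, not the format used by the author's system: whether links refer to nodes by id or by index depends on the D3 version and on how the page configures the layout. The function name to_force_graph and the place "PlaceA" are made up.

```python
import json


def to_force_graph(edges):
    """Turn (source, target) place-name pairs into a nodes/links dict."""
    names = []
    for source, target in edges:
        for name in (source, target):
            if name not in names:
                names.append(name)  # keep first-seen order, no duplicates
    return {
        "nodes": [{"id": name} for name in names],
        "links": [{"source": s, "target": t} for s, t in edges],
    }


# The first edge comes from the sample sentence; "PlaceA" is a placeholder.
edges = [("Jakarta", "Kotabumi"), ("Kotabumi", "PlaceA")]
graph = to_force_graph(edges)
print(json.dumps(graph, ensure_ascii=False, indent=2))
```

A force-directed layout needs no coordinates, which suits this data: the place names have no latitude or longitude, so only the connections between them determine the drawing.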
Conclusion, Future works
➢ We introduced a text analysis method for field notes in Area Studies.
– Using the topic model LDA
– Tracking the investigator
➢ Future work
– Improvement of text analysis for Area Studies.
• What kind of system do researchers in Area Studies want?
• We will consider the answer and develop a system accordingly.
Thank you for listening to my presentation.
– E-mail: [email protected]