Text Analysis Method Using Latent Topics for Field Notes in Area Studies
Taizo Yamada
Historiographical Institute, The University of Tokyo
2013/12/13 PNC2013

Contribution
Text analysis for Area Studies
– Applying a topic model to a field note for Area Studies
  • We use LDA (Latent Dirichlet Allocation) as the topic model.
  • Similar fragments or scenes in the field note can be retrieved.
– Visualization of the relationships between place names
  • The place information has no latitude or longitude.
  • We do not have any dictionary of place names.

Outline
Background, purpose
Methodology of text analysis
– Text structuring
– Term extraction
– Characterization of terms
– Method of obtaining similar text fragments
– Visualization and system
Conclusion

Background
Recently, Area Studies has made remarkable progress.
– Using information technology such as web technology, data analysis and data engineering, researchers in Area Studies can search and analyze large volumes of data easily and quickly.
– To promote such analysis, researchers have published databases.
  • Catalogues, images, statistical data, spatial data and temporal data.
For further progress in the field,
– we believe that text analysis is one of the essential elements.
– A text such as a field note describes sights, scenes and customs,
– but its latent topics or subjects can also be key elements characterizing the area.

Purpose
A text analysis method for a field note in Area Studies.
– We prepare a field note database in which the data unit is a description of a sight or a scene.
– To detect latent topics, we use latent Dirichlet allocation (LDA).
  • LDA is a topic model.
  • In LDA, each text is viewed as a mixture of latent topics, and each topic as a mixture of words.
– To trace the route (gait) of the investigator through the field note:
  • Visualizing the route presents the relationships between place names.

Text (1)
Target: Koichi Takaya, "The Field note collection2 Sumatra" (in Japanese)
– 1984.10.19 to 1985.1.18
– Covers the whole of Sumatra Island

Text structuring (1)

Text structuring (2)

Term extraction (1)
Morphological analysis
– mecab + ipadic (morphological analyzer + dictionary)
Text (a scene):
マングローブ。前面の海にはバガン(魚取り用の櫓)いくつもある。
(Mangroves. In the sea in front there are many bagan, platforms for catching fish.)
Result of morphological analysis (surface form, then features):
マングローブ  名詞,一般,*,*,*,*,マングローブ,マングローブ,マングローブ
。  記号,句点,*,*,*,*,。,。,。
前面  名詞,一般,*,*,*,*,前面,ゼンメン,ゼンメン
の  助詞,連体化,*,*,*,*,の,ノ,ノ
海  名詞,一般,*,*,*,*,海,ウミ,ウミ
に  助詞,格助詞,一般,*,*,*,に,ニ,ニ
は  助詞,係助詞,*,*,*,*,は,ハ,ワ
バガン  名詞,一般,*,*,*,*,*
。  記号,句点,*,*,*,*,。,。,。
魚  名詞,一般,*,*,*,*,魚,サカナ,サカナ
取り  名詞,接尾,一般,*,*,*,取り,トリ,トリ
用  名詞,接尾,一般,*,*,*,用,ヨウ,ヨー
の  助詞,連体化,*,*,*,*,の,ノ,ノ
櫓  名詞,一般,*,*,*,*,櫓,ロ,ロ
。  記号,句点,*,*,*,*,。,。,。
いくつ  名詞,代名詞,一般,*,*,*,いくつ,イクツ,イクツ
も  助詞,係助詞,*,*,*,*,も,モ,モ
ある  動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
。  記号,句点,*,*,*,*,。,。,。
EOS
"名詞": noun, "助詞": postpositional particle, "記号": symbol, "動詞": verb

Term extraction (2)
Result of morphological analysis: the same output as in Term extraction (1).
Bag-of-Words (term:count):
Bakauhumi:1 マングローブ:1 前面:1 海:1 バガン:1 魚取り用:1 櫓:1 ココヤシ:1 下:1 家:1 チョウジ:1 斜面:1
The number of distinct terms is 5,666.
Extraction target: nouns only. However, the following types are not extracted:
– pronouns and numbers.

Term extraction (3)
720km: Jakarta 出発 (Depart from Jakarta.)
830km: Bakauhumi (*1)
① マングローブ。前面の海にはバガン(魚取り用の櫓)いくつもある。 (Mangroves. In the sea in front there are many bagan, platforms for catching fish.)
② ココヤシ多い。この下に少し家ある。 (Many coconut palms. A few houses beneath them.)
③ チョウジの多い斜面。 (A slope with many clove trees.)
853km: 稲。今若実り。 (Rice, now bearing young grain.)
54km: このあたりよりチョウジ多くなる。その下を時に耕している。トウモロコシを植えるらしい。 (From around here clove trees become more numerous. The ground beneath them is sometimes cultivated. Maize is apparently planted.)
70km: 水田をよく見る。東に海見える。 (Paddy fields are seen often. The sea is visible to the east.)
77-79km: ココヤシが多い。時に水田あり、それ実っている。 (Many coconut palms. Occasional paddy fields, which are bearing grain.)
85km: ココヤシ園広い。時にチョウジがある。 (Extensive coconut plantations. Occasional clove trees.)
90km: 西海岸に来る。マングローブあるが、その背後にはココヤシ多い。 (Reach the west coast. There are mangroves, but behind them are many coconut palms.)
97km: チョウジが多い。この辺りは殆どがジャワ人だという。 (Many clove trees. Most people around here are said to be Javanese.)
01km: Sidomulyo。周り、シラス台地。 (Sidomulyo. The surroundings are a shirasu plateau.)
11km: 5~10年生のココヤシ多い。他に、チョウジ、バナナ、ランブータン、ドリアン。 (Many coconut palms 5 to 10 years old. Also cloves, bananas, rambutan and durian.)
18km; 左の海にはバガンが100基ほど見える。 (About 100 bagan are visible in the sea to the left.)
22km: 海岸は広くココヤシ。これ60年生。高みはチョウジ多い。 (The coast is broadly covered with coconut palms, 60 years old. The higher ground has many clove trees.)
Mark up the extracted terms
– The terms may characterize the scene in the text.
– The extracted terms differ from scene to scene.
What features, then, do the terms have?
– We need a method for detecting these features.
– But we do not have any thesaurus or dictionary.
Therefore, to detect them, we introduce a topic model.
– Using a topic model, we can detect latent topics as the features.

Using topic model (1)
We use LDA (Latent Dirichlet Allocation) as the topic model.
– Topic model
  • Models the co-occurrence of terms.
  • The result is a classification of terms.
– Kinds of topic model
  • LSI (Latent Semantic Indexing): introduces latent topics into the VSM (Vector Space Model).
  • PLSI (Probabilistic Latent Semantic Indexing): a redefinition of LSI as a probabilistic model.
  • LDA: an improvement of PLSI based on Bayesian learning.

Using topic model (2)
LDA: D. M. Blei et al., "Latent Dirichlet Allocation", 2003.
– A document generation model in which the generating probability of latent topics follows a Dirichlet distribution.
– Latent topics can be determined once the parameters of LDA are tuned.
  • $$p(d \mid \alpha, \beta) = \int \mathrm{Dir}(\theta \mid \alpha) \prod_{n=1}^{|d|} \sum_{k=1}^{C} p(w_n \mid z_k, \beta)\, p(z_k \mid \theta)\, d\theta$$
– $\alpha, \beta$: parameters of LDA
– $z = (z_1, z_2, \ldots, z_C)$: latent topics
– $\theta = (\theta_1, \theta_2, \ldots, \theta_C)$: generating probabilities of the topics
– $d = (w_1, w_2, \ldots, w_{|d|})$: document; $w_n$: term; $|d|$: the total number of terms in $d$
– Dir: Dirichlet distribution
Reading the equation term by term:
– $\theta$ is generated from $\alpha$: $\mathrm{Dir}(\theta \mid \alpha)$.
– A topic is generated according to $\theta$: $p(z_k \mid \theta)$.
– A term is generated according to topic $z_k$ and $\beta$: $p(w_n \mid z_k, \beta)$.
– A document is generated from its terms: the product over $n = 1, \ldots, |d|$.

Detection of latent topic
Features of LDA
– Text
  • A set of terms
  • Has multiple topics
– Term
  • Can belong to multiple topics
  • Is not tied to one specific topic
Spatial change (scene change)
– By visualizing the detection results, we can follow the change.
– The latent topics change as the location changes.
Which scenes, then, are similar?

Similarity between texts (1)
We introduce the VSM (Vector Space Model).
– The VSM requires feature vectors.
– Each element of the vector is the total number of terms per topic.
– Similarity between vectors is calculated by cosine similarity.
  • $$\mathrm{sim}(x, y) = \frac{\sum_{k} \mathrm{weight}(z_k, x) \cdot \mathrm{weight}(z_k, y)}{\sqrt{\sum_{k} \mathrm{weight}(z_k, x)^2} \cdot \sqrt{\sum_{k} \mathrm{weight}(z_k, y)^2}}$$
– $x, y$: texts (scenes)
– $\mathrm{weight}(z_k, x)$: the weight of topic $z_k$ in text $x$.
– $\mathrm{weight}(z_k, x) = \mathrm{tf}(x_k) \cdot \left(\log\frac{N}{df(k)} + 1\right)$: tf.idf weighting
– $x_k$: the frequency of $z_k$ in text $x$.
– $df(k)$: the number of texts that contain topic $z_k$.
– $N$: the number of texts

Similarity between texts (2)

Track of investigation (1)
Beginning of the text
– Date: Oct. 19, '84
– "Jakarta より Kotabumi へ行く。" ("Go from Jakarta to Kotabumi.")
– The text describes the movement from "Jakarta" to "Kotabumi".
Tracking the movement
– Extracting place names.
– Rule:
  • from: ○○[から|より|出発|…] (から, より: "from"; 出発: "departure")
  • to: ○○[へ|まで|に|泊|…] (へ, まで, に: "to"; 泊: "overnight stay")
– Unfortunately, we do not have any dictionary or gazetteer.
– For the time being, I simply connect the extracted place names.

Track of investigation (2)
Force-directed graph using D3.js (http://d3js.org/)
[Figure: graph of extracted place names. Node labels include Jakarta, Pekanbaru, Tembilahan, Solok and Singapore; date labels are Oct. '84, Nov. '84, Dec. '84 and Jan. '85.]

Conclusion, future work
We introduced a text analysis method for field notes in Area Studies.
– Using the topic model LDA
– Tracking the route of the investigator
Future work
– Improvement of text analysis for Area Studies.
  • What kind of system do researchers in Area Studies want?
  • We will consider the answer and develop the system accordingly.

Thank you for listening to my presentation.
– E-mail: [email protected]
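
Supplementary sketch: place-name extraction
The from/to rule on the "Track of investigation (1)" slide could be approximated as below. This is only a minimal sketch, not the system presented in the talk. It assumes that place names are written in Latin letters with an initial capital (as "Jakarta" and "Kotabumi" are in the field note) and uses only the suffixes listed on the slide; the alternatives elided by "…" are not filled in. The names PLACE, FROM_RE, TO_RE, extract_moves and build_graph are hypothetical. The result is shaped as nodes and links, a structure commonly passed to a D3.js force layout.

```python
import re

# Assumption: place names are capitalized Latin-script words, e.g. "Jakarta".
PLACE = r"([A-Z][A-Za-z]+(?:[ -][A-Z][A-Za-z]+)*)"
# Suffixes copied from the slide; the "…" alternatives are left out.
FROM_RE = re.compile(PLACE + r"\s*(?:から|より|出発)")
TO_RE = re.compile(PLACE + r"\s*(?:へ|まで|に|泊)")


def extract_moves(text):
    """Return (role, place) pairs in the order they appear in the text."""
    found = [(m.start(), "from", m.group(1)) for m in FROM_RE.finditer(text)]
    found += [(m.start(), "to", m.group(1)) for m in TO_RE.finditer(text)]
    return [(role, place) for _, role, place in sorted(found)]


def build_graph(texts):
    """Connect consecutive extracted place names into a nodes/links dict."""
    nodes, links, previous = [], [], None
    for text in texts:
        for _, place in extract_moves(text):
            if place not in nodes:
                nodes.append(place)
            # Link each place to the one extracted just before it.
            if previous is not None and previous != place:
                links.append({"source": previous, "target": place})
            previous = place
    return {"nodes": [{"id": name} for name in nodes], "links": links}


graph = build_graph(["Jakarta よりKotabumi へ行く。"])
print(graph["links"])
```

Without a gazetteer, a rule of this kind will also pick up capitalized words that are not place names, so the extracted list would still need manual checking.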
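
Supplementary sketch: similarity between scenes
The similarity measure on the "Similarity between texts (1)" slide can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than the implementation used in the talk: the per-scene topic counts are assumed to come from an LDA run that is not shown, the "+1" in the tf.idf weight is read as being added to the logarithm, and the function names and the toy counts are hypothetical.

```python
import math


def document_frequency(topic_counts):
    """df[k]: number of texts in which topic k occurs at least once."""
    n_topics = len(topic_counts[0])
    return [sum(1 for row in topic_counts if row[k] > 0) for k in range(n_topics)]


def weights(row, df, n_texts):
    """tf.idf-style weight per topic: tf * (log(N / df) + 1) (assumed reading)."""
    return [
        tf * (math.log(n_texts / df[k]) + 1.0) if df[k] > 0 else 0.0
        for k, tf in enumerate(row)
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity(topic_counts, i, j):
    """Similarity between texts i and j from their per-topic term counts."""
    df = document_frequency(topic_counts)
    n = len(topic_counts)
    return cosine(weights(topic_counts[i], df, n), weights(topic_counts[j], df, n))


# Hypothetical counts: rows are scenes, columns are topics.
counts = [[3, 0, 1], [2, 0, 1], [0, 4, 0]]
print(similarity(counts, 0, 1), similarity(counts, 0, 2))
```

In this toy data the first two scenes share the same topics and score close to 1, while the third scene shares no topic with the first and scores 0.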