Document Classification Method with Small Training Data Yasunari MAEDA (Kitami Institute of Technology) Hideki YOSHIDA (Kitami Institute of Technology) Toshiyasu Matsushima (Waseda University) 1 Topics Overview of document classification Document classification using distance Document classification using probabilistic model Preparations The first our previous research The second our previous research Experiments of previous methods Proposed method Experiments Conclusion 2 Overview of document classification Key words in documents are used. ex. classification of newspaper articles art1 art2 art3 art4 art5 art6 class economy stocks, company, science articles sport article key word amino acid, computer, class article economy art1, art3 science art5, art6 sport art2, art4 result of classification baseball, swimming, A new characteristic of amino acid was found. key word KEY stocks , amino acid , baseball , . class of articles C economy, science , sport , . In many cases, each key word belongs to more than one classes. 3 Document classification using distance A new article is classified into a class whose distance is minimum. class B class A art A1 art A3 art A5 art A4 art A2 dB new article dA The distance between class A and the new article is minimum. The new article is classified into the class A. art B2 art B1 dC art B3 art B4 class C art C1 art C2 art C3 art C4 art C5 It is very easy to use in real case. There is no theoretical guarantee on accuracy. Vector space model is well known. 4 Document classification using probabilistic model Key words occur depending on probabilistic distributions. A new article is classified into a class whose error rate is minimum. article A new characteristic of amino acid was found. occurrence of article class “science” occurs parameters which dominate probability distributions p science p amino acid science, key word “amino acid” occurs under the condition that class “science” occurs. article classification = estimation for class p our previous research p aminoacid , minimizing the error rate with respect to the Bayes criterion accuracy is low with small training data We want to improve accuracy with small training data. 5 Preparations(1/3) ci : class of documents C : set of classes , C c1, c2 ,, c C . key i : key word KEY : set of key words, KEY key1, key2 ,, key KEY . pci : a probability of an event that class ci occurs : a parameter which dominates pci , . * : a true parameter which is unknown, * . : a probability of an event that key word keyj occurs in a pkey j ci , document in class ci . : a parameter which dominates p key j ci , , . * * : a true parameter which is unknown, . doc : a new document, doc x, yn , x : a class of new document doc , x C . ( x is unknown) yn: a string of key words in new document doc , yn y1 y2 yn . n ( y is known) n : the number of key words in new document doc 6 Preparations(2/3) docL : training data x , y x , y x , y , doc x, y L n L 1 n1 2 n2 L nL L : the number of documents in the training data xi : the class of the i th document in the training data, xi C. n i : the number of key words in the i th document in the training data y ni : a string of key words in the i th document in the training data y ni yi,1 yi,2 yi, ni . yi , j : the j th key word in the i th document in the training data doc occurs n n p doc , p x p y x, p x p yi x, .(1) a probability of an event that the new document i 1 L doc occurs ni L L p doc , pxi p yi , j xi , i 1 j 1 . a probability of an event that the training data (2) 7 Preparations(3/3) document classification problem estimating the unknown class x of the new document under the condition that the string of key words yn in and the training data docL are given. doc doc 8 The first our previous research(1/2) The class of the new document is estimated by n d B y , doc arg max p L xˆ C x L pxˆ d p x, y , y n n L i 1 i 1 p y xˆ, d , i (3) F xˆ x L xˆ x L p xˆ d (4) p F x x L x , xC xˆ, y x, y n L F y yi 1 y xˆ F i i i n L i 1 ˆ p x, y , y p yi x , d n L ˆ F x , y x , y F y yi 1 y xˆ y KEY , xˆ , yi xˆ : parameters of Dirichlet distribution for p xˆ , p yi xˆ , . where, (5) (prior distributions for the unknown parameters) F xˆ x L : the number of documents in the class xˆ in the training data : the number of key word y i in the documents in the class xˆ F xˆ, yi x, y n F yi yi 1 L in the training data : the number of key word y i in the string y i 1 9 The first our previous research(2/2) the first our previous method optimal method which minimizes the error rate with respect to Bayes criterion But, the accuracy is low with small training data. 0.5 is used as parameter of prior distributions in order to represent no information. the second our previous research improve accuracy with small training data Accuracy depends on prior distributions with small training data. 10 The second our previous research(1/2) We estimate prior distribution using estimating data. estimating data for prior distributions The new documents and the training data occur from the same source. The estimating data occurs from another source. v, w v , w v , w x m G 1 m1 2 m2 mG , w , G (6) G : the number of documents in the estimating data vi : the class of the i th document m i: the number of key words in the i th document w mi : the string of key words in the i th document wmi wi,1wi,2 wi, mi . wi , j : the j th key word in the i th document 11 The second our previous research(2/2) Parameters in eq(4) and eq(5) are estimated using estimating data. F xˆ v G xˆ xˆ F x v G x , (7) xC m G ˆ F x , yi v, w yi xˆ yi xˆ m G ˆ ˆ y x F x , y v, w yKEY , (8) where, xˆ , yi xˆ : parameters of Dirichlet distribution for p xˆ , p yi xˆ , . 12 Experiments of previous methods(1/2) comparison between the first our previous method and the second first our previous method(prev1) 0.5 is used as each parameter for prior distributions. second our previous method(prev2) Prior distributions are estimated using estimating data. new documents : 10000 (Japanese Mainichi News Paper 2007) training data : Japanese Mainichi News Paper 2007 estimating data : 50000 (Japanese Mainichi News Paper 1994) 13 Experiments of previous methods(2/2) 0.8 0.7 accuracy rate 0.6 0.5 0.4 0.3 0.2 prev1 0.1 prev2 0 10 20 30 40 50 100 500 1000 5000 10000 50000 number of training data prev2 is higher than prev1 with small training data. But prev2 is lower than prev1 with large training data. 14 Proposed method Parameters in eq(4) and eq(5) are estimated as follows: F xˆ v G xˆ xˆ G F x v x xC log A1 L 2 xˆ A2 m G ˆ ˆ F x , yi v , w y i x yi xˆ F xˆ, y v, wm G y xˆ y KEY A3 L (9) , log A1 L 2 yi xˆ A2 A3 L , (10) where, 1 A1 , 1 A2 , 0 A3 1. 15 Experiments (1/2) comparison between the first our previous method and the new proposed method first our previous method(prev1) 0.5 is used as each parameter for prior distributions. new our proposed method(pro) A1 10 , A2 1000000, A3 0.995. new documents : 10000 (Japanese Mainichi News Paper 2007) training data : Japanese Mainichi News Paper 2007 estimating data : 50000 (Japanese Mainichi News Paper 1994) 16 Experiments (2/2) 0.8 0.7 accuracy rate 0.6 0.5 0.4 0.3 0.2 prev1 0.1 pro 0 10 20 30 40 50 100 500 1000 5000 10000 50000 number of training data pro is higher than prev1 in all points. 17 Conclusion Accuracy of our new proposed method is higher than the first our previous method with small training data. And the accuracy is equal when the size of training data is big. with small training data use estimating data mainly with large training data use training data mainly further works We want to study a method to choose the parameters A1, A2 and A3. 18
© Copyright 2024 ExpyDoc