Classification of Documents Belonging to Multiple Classes Based on Statistical Decision Theory

Document Classification Method with
Small Training Data
Yasunari MAEDA (Kitami Institute of Technology)
Hideki YOSHIDA (Kitami Institute of Technology)
Toshiyasu MATSUSHIMA (Waseda University)
Topics
Overview of document classification
Document classification using distance
Document classification using a probabilistic model
Preparations
Our first previous research
Our second previous research
Experiments with the previous methods
Proposed method
Experiments
Conclusion
Overview of document classification
Key words in documents are used.
ex. classification of newspaper articles
[Figure: classification of the newspaper articles art1–art6]

class     key words                   result of classification
economy   stocks, company, ...        art1, art3
science   amino acid, computer, ...   art5, art6
sport     baseball, swimming, ...     art2, art4

ex. the article "A new characteristic of amino acid was found." contains the key word "amino acid".

key words: $KEY = \{\text{stocks}, \text{amino acid}, \text{baseball}, \ldots\}$
classes of articles: $C = \{\text{economy}, \text{science}, \text{sport}, \ldots\}$
In many cases, each key word belongs to more than one class.
Document classification using distance
A new article is classified into the class whose distance from it is minimum.
[Figure: a new article among three classes A, B, and C (articles art A1–A5, art B1–B4, art C1–C5), with distances dA, dB, dC from the new article to each class.]

The distance between class A and the new article is minimum,
so the new article is classified into class A.
It is very easy to use in real cases.
There is no theoretical guarantee on accuracy.
The vector space model is a well-known example.
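As a concrete illustration (the slides name the vector space model but fix no particular distance), here is a minimal Python sketch of distance-based classification: each class is represented by the centroid of its articles' bag-of-words vectors, and a new article goes to the class at minimum cosine distance. The toy articles and classes are hypothetical.

```python
# Minimal sketch of distance-based classification (vector space model).
# Cosine distance to each class centroid is used here as one common choice.
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector as a word -> count dictionary."""
    return Counter(text.lower().split())

def cosine_distance(u, v):
    words = set(u) | set(v)
    dot = sum(u.get(w, 0) * v.get(w, 0) for w in words)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0

def centroid(vectors):
    """Average the bag-of-words vectors of one class."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {w: c / len(vectors) for w, c in total.items()}

training = {
    "economy": ["stocks rose as the company reported profits"],
    "science": ["a new characteristic of amino acid was found"],
    "sport":   ["the baseball and swimming teams won again"],
}
centroids = {c: centroid([bow(d) for d in docs]) for c, docs in training.items()}

new_article = "amino acid structure discovered"
# Classify into the class whose distance is minimum.
print(min(centroids, key=lambda c: cosine_distance(bow(new_article), centroids[c])))
```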
Document classification using a probabilistic model
Key words occur according to probability distributions.
A new article is classified so that the error rate is minimum.

ex. article: "A new characteristic of amino acid was found."
The class "science" occurs with probability $p(\text{science} \mid \theta)$; the key word "amino acid" occurs, under the condition that class "science" occurs, with probability $p(\text{amino acid} \mid \text{science}, \eta)$.
$\theta$, $\eta$: the parameters which dominate the probability distributions.

article classification = estimation of the class

our previous research: minimizing the error rate with respect to the Bayes criterion
But the accuracy is low with small training data.
We want to improve the accuracy with small training data.
Preparations (1/3)
$c_i$: a class of documents
$C$: the set of classes, $C = \{c_1, c_2, \ldots, c_{|C|}\}$.
$key_i$: a key word
$KEY$: the set of key words, $KEY = \{key_1, key_2, \ldots, key_{|KEY|}\}$.
$p(c_i \mid \theta)$: the probability of the event that class $c_i$ occurs
$\theta$: a parameter which dominates $p(c_i \mid \theta)$, $\theta \in \Theta$.
$\theta^*$: the true parameter, which is unknown, $\theta^* \in \Theta$.
$p(key_j \mid c_i, \eta)$: the probability of the event that key word $key_j$ occurs in a document in class $c_i$
$\eta$: a parameter which dominates $p(key_j \mid c_i, \eta)$, $\eta \in H$.
$\eta^*$: the true parameter, which is unknown, $\eta^* \in H$.
$doc$: a new document, $doc = (x, y^n)$,
$x$: the class of the new document $doc$, $x \in C$. ($x$ is unknown)
$y^n$: the string of key words in the new document $doc$, $y^n = y_1 y_2 \cdots y_n$. ($y^n$ is known)
$n$: the number of key words in the new document $doc$
Preparations (2/3)
$doc^L$: the training data,
$doc^L = (x_1, y^{n_1}), (x_2, y^{n_2}), \ldots, (x_L, y^{n_L})$,
$L$: the number of documents in the training data
$x_i$: the class of the $i$th document in the training data, $x_i \in C$.
$n_i$: the number of key words in the $i$th document in the training data
$y^{n_i}$: the string of key words in the $i$th document in the training data, $y^{n_i} = y_{i,1} y_{i,2} \cdots y_{i,n_i}$.
$y_{i,j}$: the $j$th key word in the $i$th document in the training data

The probability of the event that the new document $doc$ occurs:
$p(doc \mid \theta, \eta) = p(x \mid \theta)\, p(y^n \mid x, \eta) = p(x \mid \theta) \prod_{i=1}^{n} p(y_i \mid x, \eta)$.  (1)

The probability of the event that the training data $doc^L$ occurs:
$p(doc^L \mid \theta, \eta) = \prod_{i=1}^{L} \left( p(x_i \mid \theta) \prod_{j=1}^{n_i} p(y_{i,j} \mid x_i, \eta) \right)$.  (2)
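To make eqs (1) and (2) concrete, the following minimal sketch evaluates both likelihoods for fixed parameter tables; the classes, key words, and all probability values are hypothetical stand-ins for $\theta$ and $\eta$ (which are unknown in the actual problem).

```python
# Minimal sketch of eq (1) and eq (2): likelihood of a new document and of
# the training data under fixed, hypothetical parameters theta and eta.
import math

theta = {"economy": 0.4, "science": 0.3, "sport": 0.3}   # p(c | theta)
eta = {                                                   # p(key | c, eta)
    "economy": {"stocks": 0.7, "company": 0.3},
    "science": {"amino acid": 0.6, "computer": 0.4},
    "sport":   {"baseball": 0.5, "swimming": 0.5},
}

def p_doc(x, y_n):
    """Eq (1): p(doc | theta, eta) = p(x | theta) * prod_i p(y_i | x, eta)."""
    prob = theta[x]
    for key in y_n:
        prob *= eta[x].get(key, 0.0)
    return prob

def p_training(doc_L):
    """Eq (2): the product of eq (1) over the L training documents."""
    return math.prod(p_doc(x, y_n) for x, y_n in doc_L)

doc_L = [("science", ["amino acid", "computer"]), ("economy", ["stocks"])]
print(p_doc("science", ["amino acid"]))   # eq (1): 0.3 * 0.6 = 0.18
print(p_training(doc_L))                  # eq (2): (0.3*0.6*0.4) * (0.4*0.7)
```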
Preparations (3/3)
The document classification problem: estimating the unknown class $x$ of the new document $doc$, under the condition that the string of key words $y^n$ in $doc$ and the training data $doc^L$ are given.
Our first previous research (1/2)
The class of the new document is estimated by

$d_B(y^n, doc^L) = \arg\max_{\hat{x} \in C} \; p(\hat{x} \mid doc^L) \prod_{i=1}^{n} p(y_i \mid \hat{x}, y^{i-1}, doc^L)$,  (3)

$p(\hat{x} \mid doc^L) = \dfrac{F(\hat{x} \mid x^L) + \alpha(\hat{x})}{\sum_{x \in C}\left(F(x \mid x^L) + \alpha(x)\right)}$,  (4)

$p(y_i \mid \hat{x}, y^{i-1}, doc^L) = \dfrac{F(\hat{x}, y_i \mid x^L, y^{n^L}) + F(y_i \mid y^{i-1}) + \beta(y_i \mid \hat{x})}{\sum_{y \in KEY}\left(F(\hat{x}, y \mid x^L, y^{n^L}) + F(y \mid y^{i-1}) + \beta(y \mid \hat{x})\right)}$,  (5)

where
$\alpha(\hat{x})$, $\beta(y_i \mid \hat{x})$: parameters of the Dirichlet distributions for $p(\hat{x} \mid \theta)$ and $p(y_i \mid \hat{x}, \eta)$
(prior distributions for the unknown parameters),
$F(\hat{x} \mid x^L)$: the number of documents in class $\hat{x}$ in the training data,
$F(\hat{x}, y_i \mid x^L, y^{n^L})$: the number of occurrences of key word $y_i$ in the documents in class $\hat{x}$ in the training data,
$F(y_i \mid y^{i-1})$: the number of occurrences of key word $y_i$ in the string $y^{i-1}$.
Our first previous research (2/2)
our first previous method
the optimal method, which minimizes the error rate with respect to the Bayes criterion
But the accuracy is low with small training data.
0.5 is used as each parameter of the prior distributions in order to represent no prior information.
our second previous research
improves the accuracy with small training data
The accuracy depends on the prior distributions with small training data.
Our second previous research (1/2)
We estimate the prior distributions using estimating data.
estimating data for the prior distributions:
The new documents and the training data occur from the same source.
The estimating data occurs from another source.
$(v^G, w^{m^G}) = (v_1, w^{m_1}), (v_2, w^{m_2}), \ldots, (v_G, w^{m_G})$,  (6)

$G$: the number of documents in the estimating data
$v_i$: the class of the $i$th document
$m_i$: the number of key words in the $i$th document
$w^{m_i}$: the string of key words in the $i$th document, $w^{m_i} = w_{i,1} w_{i,2} \cdots w_{i,m_i}$.
$w_{i,j}$: the $j$th key word in the $i$th document
Our second previous research (2/2)
The parameters in eq (4) and eq (5) are estimated using the estimating data:
$\alpha(\hat{x}) = \dfrac{F(\hat{x} \mid v^G) + \alpha'(\hat{x})}{\sum_{x \in C}\left(F(x \mid v^G) + \alpha'(x)\right)}$,  (7)

$\beta(y_i \mid \hat{x}) = \dfrac{F(\hat{x}, y_i \mid v^G, w^{m^G}) + \beta'(y_i \mid \hat{x})}{\sum_{y \in KEY}\left(F(\hat{x}, y \mid v^G, w^{m^G}) + \beta'(y \mid \hat{x})\right)}$,  (8)

where
$\alpha'(\hat{x})$, $\beta'(y_i \mid \hat{x})$: parameters of the Dirichlet distributions for $p(\hat{x} \mid \theta)$ and $p(y_i \mid \hat{x}, \eta)$.
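Eqs (7) and (8) replace the fixed 0.5 parameters by frequencies from the estimating data. A minimal sketch, again with hypothetical data and with the hyper-parameters $\alpha'$ and $\beta'$ set to 0.5 for illustration:

```python
# Minimal sketch of eqs (7) and (8): the Dirichlet parameters alpha and beta
# are themselves estimated from the estimating data (v^G, w^{m^G}).
from collections import Counter

KEY = ["stocks", "amino acid", "baseball"]
C = ["economy", "science", "sport"]
estimating = [("science", ["amino acid"]), ("sport", ["baseball", "baseball"])]

alpha_prime = {c: 0.5 for c in C}                  # hypothetical alpha'
beta_prime = {c: {k: 0.5 for k in KEY} for c in C} # hypothetical beta'

g_class = Counter(v for v, _ in estimating)    # F(x_hat | v^G)
g_key = {c: Counter() for c in C}              # F(x_hat, y_i | v^G, w^{m^G})
for v, w in estimating:
    g_key[v].update(w)

# Eq (7): alpha(x_hat) from class frequencies in the estimating data.
den = sum(g_class[x] + alpha_prime[x] for x in C)
alpha = {c: (g_class[c] + alpha_prime[c]) / den for c in C}

# Eq (8): beta(y_i | x_hat) from key-word frequencies in the estimating data.
beta = {}
for c in C:
    den_c = sum(g_key[c][y] + beta_prime[c][y] for y in KEY)
    beta[c] = {k: (g_key[c][k] + beta_prime[c][k]) / den_c for k in KEY}

print(alpha)
print(beta["science"])
```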
Experiments with the previous methods (1/2)
comparison between our first previous method and our second previous method
our first previous method (prev1): 0.5 is used as each parameter of the prior distributions.
our second previous method (prev2): the prior distributions are estimated using the estimating data.
new documents: 10,000 (Japanese Mainichi Newspaper, 2007)
training data: Japanese Mainichi Newspaper, 2007
estimating data: 50,000 (Japanese Mainichi Newspaper, 1994)
Experiments with the previous methods (2/2)
[Graph: accuracy rate (0 to 0.8) versus the number of training data (10 to 50,000) for prev1 and prev2.]
The accuracy rate of prev2 is higher than that of prev1 with small training data,
but lower than that of prev1 with large training data.
Proposed method
The parameters in eq (4) and eq (5) are estimated as follows:

$\alpha(\hat{x}) = \dfrac{F(\hat{x} \mid v^G) + \alpha'(\hat{x})}{\sum_{x \in C}\left(F(x \mid v^G) + \alpha'(x)\right)} A_2 A_3^L + \alpha'(\hat{x}) \log_{A_1}(L + 2)$,  (9)

$\beta(y_i \mid \hat{x}) = \dfrac{F(\hat{x}, y_i \mid v^G, w^{m^G}) + \beta'(y_i \mid \hat{x})}{\sum_{y \in KEY}\left(F(\hat{x}, y \mid v^G, w^{m^G}) + \beta'(y \mid \hat{x})\right)} A_2 A_3^L + \beta'(y_i \mid \hat{x}) \log_{A_1}(L + 2)$,  (10)

where $1 < A_1 < \infty$, $1 < A_2 < \infty$, $0 < A_3 < 1$.
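A minimal sketch of how eqs (9) and (10), as written above, trade off the two data sources: the estimating-data ratio (the eq (7)-style fraction) is weighted by $A_2 A_3^L$, which decays with the training-data size $L$, while the hyper-parameter term grows only as $\log_{A_1}(L+2)$. The ratio value and hyper-parameter below are hypothetical.

```python
# Minimal sketch of eq (9): the estimating-data ratio is weighted by
# A2 * A3**L, and the hyper-parameter alpha' by log_{A1}(L + 2).
# Eq (10) has the same form with beta'. The values below are hypothetical.
import math

def alpha_proposed(ratio, alpha_prime, L, A1=10, A2=1_000_000, A3=0.995):
    """Eq (9): large estimating-data weight for small L, vanishing for large L."""
    return ratio * A2 * A3 ** L + alpha_prime * math.log(L + 2, A1)

# With few training documents the estimating data dominates the prior; with
# many, the prior becomes negligible next to the training counts F(x | x^L).
for L in (10, 100, 1000, 50000):
    print(L, alpha_proposed(ratio=0.3, alpha_prime=0.5, L=L))
```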
Experiments (1/2)
comparison between our first previous method and the proposed method
our first previous method (prev1): 0.5 is used as each parameter of the prior distributions.
proposed method (pro): $A_1 = 10$, $A_2 = 1000000$, $A_3 = 0.995$.
new documents: 10,000 (Japanese Mainichi Newspaper, 2007)
training data: Japanese Mainichi Newspaper, 2007
estimating data: 50,000 (Japanese Mainichi Newspaper, 1994)
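A quick check (rounded) of the weight $A_2 A_3^L$ on the estimating-data term of eq (9) with these values: $A_2 A_3^{10} \approx 9.5 \times 10^{5}$, $A_2 A_3^{1000} \approx 6.7 \times 10^{3}$, and $A_2 A_3^{50000} \approx 1.6 \times 10^{-103}$, so the estimating data dominates the prior for small training data and becomes negligible for large training data.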
Experiments (2/2)
[Graph: accuracy rate (0 to 0.8) versus the number of training data (10 to 50,000) for prev1 and pro.]
The accuracy rate of pro is higher than that of prev1 at every size of training data.
Conclusion
The accuracy of the proposed method is higher than that of our first previous method with small training data, and the two are equal when the training data is large.
with small training data: the estimating data is used mainly
with large training data: the training data is used mainly
further work:
We want to study a method for choosing the parameters $A_1$, $A_2$ and $A_3$.