Adaptive Learning Utilizing Parameters of an Existing Evaluation Function
Chikayama & Taura lab
37-096511
M1 Yuki Yano
Agenda
• Introduction
• Related studies
• Proposed method
• Evaluation setting
• Experiments
• Conclusion
Important Factors of Computer Game Players
• To create strong computer game players, we need to improve:
– search efficiency
– evaluation accuracy ← we focus on this factor
What is “evaluation”?
• Computer game players evaluate the advantage of each game state with an evaluation function.
[Figure: game-tree example. When the evaluation is accurate, the player chooses the correct move; when it is not, the player chooses an incorrect move.]
Evaluation Function
• An evaluation function is constructed from weights (parameters) and features (sketch below):

eval(p) = w^T x(p)

where w is the weight vector and x(p) is the feature vector of game state p.
• There are two ways to increase evaluation accuracy:
– tuning parameters
– feature engineering
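A minimal sketch of this linear form, assuming precomputed feature vectors (the names here are illustrative, not Gekisashi's actual code):

    import numpy as np

    def evaluate(weights: np.ndarray, features: np.ndarray) -> float:
        """Linear evaluation: eval(p) = w^T x(p)."""
        return float(weights @ features)

    # Hypothetical usage: features = extract_features(state) would map a
    # game state p to x(p); the returned score estimates the advantage.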
Feature Engineering and Tuning Parameters
• When new features are added to an evaluation function, the parameters are usually tuned from scratch.
– This means discarding the knowledge that the existing evaluation function has, which is wasteful.

[Figure: the previous evaluation function's parameters are discarded; after new features are added, the new evaluation function's parameters are tuned from scratch.]
Our Approach
• We utilize the existing parameters in the new learning by adopting the idea of domain adaptation.
– We evaluated our method on Shogi.
• With this approach, we achieved the following:
– created a new framework that utilizes the existing parameters of an evaluation function
– improved learning efficiency
Tuning Parameters
• Today, machine learning is used to tune the parameters of evaluation functions.
– Evaluation functions are getting complex, and they are hard to tune by hand.
• Two approaches:
– supervised learning ← we use this approach
– unsupervised learning
Supervised Learning in Games
• Game records are used as the teacher.
• How to use these “teachers”?
– Basic way: use them as exact values (sketch below):

min_w Σ_{n∈records} ( r_n − eval(p_n) )²

where r_n is the label of record n.
– To use this approach, an exact value must be assigned as the label of each record.
• e.g., the final score is used as the label in Othello.
– Such labeling is hard for some games, like Shogi.
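A minimal sketch of the label approach above, assuming a linear evaluation function (illustrative names; X stacks one feature vector x(p_n) per row):

    import numpy as np

    def label_loss(w: np.ndarray, X: np.ndarray, r: np.ndarray) -> float:
        """Sum over records of (r_n - w^T x(p_n))^2."""
        residual = r - X @ w
        return float(residual @ residual)

    def label_loss_grad(w, X, r):
        # Gradient of the squared error above with respect to w.
        return -2.0 * X.T @ (r - X @ w)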
Comparison Training [G. Tesauro, 1989]
• A method that focuses on the comparative relationship between game states.
– It tunes parameters to “imitate” the moves in game records.
• Known as the “Bonanza method” in Shogi.

[Figure: in the label approach, the evaluation function calculates a value for each state p and compares it with the label; in the comparison approach, the evaluation function compares two states, the state p_1 reached by the record's move against the alternatives p_2, p_3.]
Bonanza Method [Hoki, 2006]

[Figure: from the root position, the record's move leads to p_0 and the other moves lead to p_1, p_2, ...; a shallow search from each p_k reaches a state p'_k, whose evaluation gives eval(p_k).]

Tune parameters to satisfy the following condition:

∀k (k ≠ 0): eval(p_0) > eval(p_k)
Bonanza Method [Hoki, 2006]
• Tune parameters to minimize the following loss function so as to satisfy the previous condition (sketch below):

min_w Σ_{n∈records} Σ_{k≠0} T( eval(p_{0,n}) − eval(p_{k,n}) )

• T(z) converts the difference of values into the degree of agreement with the record's move.
– sigmoid function [Hoki, 2006]
– logistic regression [Kaneko et al., 2008]
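A minimal sketch of this comparison-training loss with a sigmoid T, assuming eval(p) = w^T x(p) and precomputed feature vectors (illustrative names):

    import numpy as np

    def comparison_loss(w, x_record, x_others):
        """Sum over alternative moves k of T(eval(p_0) - eval(p_k)),
        with T(z) = sigmoid(-z), so the loss is small when the
        record's move evaluates highest."""
        e0 = x_record @ w            # eval of the record's move p_0
        ek = x_others @ w            # evals of the other moves p_k
        return float(np.sum(1.0 / (1.0 + np.exp(e0 - ek))))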
Domain Adaptation
• “Domain adaptation” is a method that transfers the knowledge of related domains.
– e.g., from “general text” (source domain) to “medical text” (target domain).

[Figure: knowledge from the source domain flows into the learning system for the target domain.]
Domain Adaptation
• In domain adaptation, our goals are:
– tuning parameters to adapt to the target domain
– keeping the properties of the source domain's knowledge
• To achieve the second goal, we have to quantify how far the current parameters are from the source domain's ones.
→ Bayesian divergence prior [X. Li et al., 2007]

Bayesian Divergence Prior [X. Li et al., 2007]
• An index that uses the KL divergence between the likelihoods under the current and previous parameters.
– It keeps the classification results of the previous parameters.

min_w L(w, S_target) + C · D( p(y | x, w_source) ‖ p(y | x, w) )

where S_target is the training data, y is the class label (−1 or +1), L is the loss function, and the divergence term quantifies the difference of knowledge:

D( p(x) ‖ q(x) ) = ∫ p(x) ln( p(x) / q(x) ) dx

which is large if the two distributions are different.
Bayesian Divergence Prior [X. Li et al., 2007]
• Assume the likelihood is the logistic sigmoid function, which is generally used as a classification model:

p(y | x, w) = 1 / ( 1 + exp( −y w^T x ) )

• Under this assumption, the Bayesian divergence prior becomes very simple (sketch below):

min_w L(w, S_target) + C ‖ w − w_source ‖
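A minimal sketch of this simplified objective (illustrative; the callable `loss` stands in for L(w, S_target)):

    import numpy as np

    def penalized_objective(w, w_source, C, loss):
        """Data loss plus a penalty keeping w near the source-domain
        parameters: L(w, S_target) + C * ||w - w_source||."""
        return loss(w) + C * np.linalg.norm(w - w_source)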
Motivation
• After adding new features, evaluation functions are usually tuned from scratch.
– The existing parameters are discarded.
• We utilize these parameters in the new learning by using the Bayesian divergence prior.

[Figure: the previous method discards the existing evaluation function and tunes the new one from game records alone; the proposed method adds the features and then utilizes both the records and the existing evaluation function for tuning.]
Proposed Method
• An adaptive learning scheme that uses the existing parameters in a restricting term (sketch below):

min_w L(w, S) + (C/2) ‖ w − (w_pre, 0)^T ‖²

where C is the restricting parameter and L(w, S) is the loss function, which tunes the parameters to fit the training data. The restricting term keeps the existing property: w is pulled toward the existing parameters w_pre, and the entries for the new features are filled with zeros.
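A minimal sketch of this restricting term, assuming the existing parameters occupy the first entries of the extended weight vector (illustrative names):

    import numpy as np

    def restricting_term(w, w_pre, C):
        """(C/2) * ||w - (w_pre, 0)||^2: penalize deviation from the
        existing parameters; the new features' targets are zero."""
        target = np.concatenate([w_pre, np.zeros(len(w) - len(w_pre))])
        diff = w - target
        return 0.5 * C * float(diff @ diff)

    def restricting_grad(w, w_pre, C):
        # Gradient C * (w - (w_pre, 0)); add this to the loss gradient.
        target = np.concatenate([w_pre, np.zeros(len(w) - len(w_pre))])
        return C * (w - target)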
Advantages of Utilizing Existing Parameters
• The existing parameters were tuned with some knowledge (game records, human heuristics).
– We obtain more information by utilizing them.
→ Stronger computer game players
• The existing parameters already have high evaluation accuracy.
→ Quicker convergence
Evaluation Environment
• We use “Gekisashi” to evaluate the proposed method.
– “Gekisashi” is a Shogi program developed in the Chikayama & Taura lab.
– It won the World Computer Shogi Championship in 2008.
• We use “InTrigger” for computing.
– “InTrigger” is a distributed platform constructed as a cluster of clusters.
– Learning took 20 hours on 80 parallel Intel Xeon E5410 (2.33 GHz) processors.
Learning Method
• We tuned the parameters by repeating the following operations for 10 steps (a sketch of the loop follows this slide):
1. Create training data by searching all available moves in 5,000 game records.
2. Update the parameters with the training data created in the previous step, 50 times per step.
• We use logistic regression for the loss function.
• We use stochastic meta-descent for the updates.
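A minimal sketch of this training loop, with the data creation and update passed in as callables (`create_train_data` and `smd_update` are illustrative stand-ins for the record search and the stochastic meta-descent update):

    def train(w, records, create_train_data, smd_update,
              n_steps=10, n_updates=50):
        """Per step: rebuild the training data, then update 50 times."""
        for _ in range(n_steps):
            train_data = create_train_data(records)
            for _ in range(n_updates):
                w = smd_update(w, train_data)
        return w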
Loss Function with Logistic Regression

L(w, S) = Σ_{(x_diff, y)∈S} log( 1 + exp( −y w^T x_diff ) )

where y is the label (−1 or +1), x_diff = x(p_record) − x(p_other), and S is the training data (sketch below).

[Figure: the loss decreases as (record's eval) − (other's eval) grows; the parameters w are tuned in that direction.]
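A minimal sketch of this loss and its gradient (illustrative names; X_diff stacks one x(p_record) − x(p_other) per row):

    import numpy as np

    def logistic_loss(w, X_diff, y):
        """L(w, S) = sum over S of log(1 + exp(-y * w^T x_diff))."""
        margins = y * (X_diff @ w)
        return float(np.sum(np.logaddexp(0.0, -margins)))

    def logistic_loss_grad(w, X_diff, y):
        # d/dw log(1 + exp(-m)) = -y * sigmoid(-m) * x_diff, m = y w^T x_diff
        s = 1.0 / (1.0 + np.exp(y * (X_diff @ w)))
        return -X_diff.T @ (y * s)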
Stochastic Meta Descent [N. N. Schraudolph, 2002]
• Set an independent step size for each element, and update the step sizes using past update information (sketch below).
– It converges faster than the plain gradient method.

Update the parameters:
w_{t+1} = w_t − diag(η_t) ∇_t
Update the step sizes:
η_{t+1} = diag(η_t) · Max_{0.5}( 1 − μ diag(∇_t) v_t )
v_{t+1} = λ v_t − diag(η_t) ( ∇_t + λ H_t v_t )

∇: gradient, H: Hessian, μ: meta step size, λ: decay factor, Max_{0.5}: replace elements lower than 0.5 with 0.5
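A minimal sketch of these updates (illustrative; the signs follow the reconstruction above, and the callable `hess_vec` stands in for the Hessian-vector product H_t v_t):

    import numpy as np

    def smd_step(w, eta, v, grad, hess_vec, mu=0.05, lam=0.99):
        """One stochastic meta-descent update with per-element step sizes."""
        g = grad(w)
        w_new = w - eta * g                                  # parameter update
        eta_new = eta * np.maximum(0.5, 1.0 - mu * g * v)    # step-size update
        v_new = lam * v - eta * (g + lam * hess_vec(w, v))   # update trace
        return w_new, eta_new, v_new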
Learning Setting
• We add new features to the evaluation function of “Gekisashi”:
– original features: 410
– relative positions of the king and other pieces: 9,248
– absolute positions of the king and other pieces: 209,952
– relative positions of two pieces: 147,968
– absolute positions of two adjacent pieces: 331,776
– total: 699,354
※ Features related to the king are changed according to the degree of game progress.
Assessment Procedure
• We used the following three assessment criteria:
– agreement rate: the ratio of moves conforming to the record's move, measured on 250 game records not used for learning.
– incorrect degree: the degree of false evaluation, measured on 250 game records not used for learning.
– winning rate: the winning rate against the original parameters; the first 30 moves are fixed, and each move is searched for 10 seconds.
※ The “original parameters” fill the new features' weights with zero, so we did not compare against the true Gekisashi.
※ The incorrect degree is calculated with a sigmoid function (see our handout).
Experiments
1. (Preliminary experiment) Tuning the restricting parameter.
2. Comparing with various methods.
Tuning the Restricting Parameter
• Objective
– Tune the balance between the loss function and the restricting term.
• How
– Tune the parameters while changing the restricting parameter C from 10^−5 to 10^5.
• Use Gekisashi's original parameters as the initial values.
• Tune the parameters of stochastic meta-descent with 1-step learning.
Agreement Rate of Each Restricting Parameter

[Plot: agreement rate (%) vs. step; the curves for C = 10^−5 to 10^2 form one group and those for C = 10^3 to 10^5 another.]
Incorrect Degree of Each Restricting Parameter

[Plot: incorrect degree vs. step; the curves for C = 10^−5 to 10^2 form one group and those for C = 10^3 to 10^5 another.]
Winning Rate of Each Restricting Parameter (100 plays)

[Bar chart: winning rate (%) against the original parameters for each restricting parameter, with values ranging from 44 to 63; the best rate, 63%, is marked at C = 100 and exceeds the 5% significance level (61%); even is 50%.]
Comparing Various Methods
• Objective
– Evaluate the usefulness of
• using the existing parameters as initial values.
• utilizing the existing parameters in the restricting term.
• How
– Compare the following five methods:

name      | initial values      | restricting term
proposed  | existing parameters | C = 100
no bind   | existing parameters | none
const     | existing parameters | existing parameters fixed
koma      | piece values        | none
bind koma | piece values        | C = 1
Agreement Rate of Each Method

[Plot: agreement rate (%) vs. step for each method.]
Incorrect Degree of Each Method

[Plot: incorrect degree vs. step for each method.]
Winning Rate of Each Method (100 plays)

[Bar chart: winning rate (%) against the original parameters for each method, with values 61, 56, 55, 52, and 48; the best method reaches the 5% significance level (61%); even is 50%.]
Results of Mutual Battles (200 plays)

Winning rate (%) of the row method against the column method:

          | proposed | no bind | const | koma | bind koma
proposed  |    —     |  53.5   | 60.0  | 61.5 |   64.5
no bind   |  46.5    |   —     | 61.5  | 55.0 |   60.5
const     |  40.0    |  38.5   |  —    | 59.0 |   54.5
koma      |  38.5    |  45.0   | 41.0  |  —   |   53.5
bind koma |  35.5    |  39.5   | 45.5  | 46.5 |    —

proposed, no bind, and const use the existing parameters; koma and bind koma do not.
(In the original chart: blue marks rates over 50%; red marks wins beyond the 5% significance level, more than 115 wins; green marks losses beyond the 5% significance level, fewer than 85 wins.)
Learning Speed

[Plots: agreement rate and incorrect degree vs. step, comparing the proposed method at 6 steps with koma at 10 steps.]

• proposed (6 steps) vs. koma (10 steps): 52.5% (200 plays)
→ The proposed method acquired nearly equal strength in fewer steps.
Conclusion and Future Tasks
• Adaptive learning utilizing existing parameters:
– We obtained useful evaluation functions with improved learning speed.
– We tuned the weight of the restricting term (the restricting parameter C).
• The added features are very simple.
– More sophisticated features may show the merits more clearly.

Thank you for your attention.