Adaptive Learning Utilizing Parameters of Existing Evaluation Function
Chikayama & Taura lab  37-096511  M1 Yuki Yano

Agenda
• Introduction
• Related studies
• Proposed method
• Evaluation setting
• Experiments
• Conclusion

Introduction

Important Factors of Computer Game Players
• To create strong computer game players, we need to improve
– search efficiency
– evaluation accuracy ← we focus on this factor

What is "Evaluation"?
• Computer game players evaluate the advantage of each game state with an evaluation function.
[Figure: two tic-tac-toe search trees. When the evaluation is accurate, the player chooses the correct move; when it is inaccurate, the player chooses an incorrect move.]

Evaluation Function
• An evaluation function is constructed from weights (parameters) and features:
  \mathrm{eval}(p) = w^T x(p)
  where w is the weight vector and x(p) is the feature vector of game state p.
• There are two ways to increase evaluation accuracy:
– tuning the parameters
– feature engineering

Feature Engineering and Tuning Parameters
• When new features are added to an evaluation function, the parameters are usually tuned from scratch. This discards the knowledge that the existing evaluation function holds.
[Figure: the previous evaluation function's parameters are discarded, new features are added, and the new evaluation function is tuned from scratch. This is wasteful!]

Our Approach
• We utilize the existing parameters in the new learning by adapting the idea of domain adaptation.
– We evaluated our method on Shogi.
• With this approach we achieved the following:
– a new framework that utilizes the existing parameters of an evaluation function
– improved learning efficiency

Related Studies

Tuning Parameters
• Today, machine learning is used to tune the parameters of evaluation functions.
– Evaluation functions are getting complex, and they are hard to tune by hand.
• Two approaches:
– supervised learning ← we use this approach
– unsupervised learning

Supervised Learning in Games
• Game records are used as the teacher.
• How do we use these "teachers"?
– Basic way: use them as exact values:
  \min_w \sum_{n \in \mathrm{records}} \left( r_n - \mathrm{eval}(p_n) \right)^2
  where r_n is the label of record n.
– This approach requires labeling every record with an exact value.
  • e.g. in Othello, the final score is used as the label.
– Such labeling is hard for games like Shogi.

Comparison Training [Tesauro, 1989]
• A method that focuses on the comparative relationship between game states.
– Parameters are tuned to "imitate" the moves in game records.
• Known as the "Bonanza method" in Shogi.
[Figure: the label approach computes a value for one state and compares it with the record's label; the comparison approach compares the value of the state reached by the record's move against the states reached by the other moves.]

Bonanza Method [Hoki, 2006]
• From the root position, the record's move leads to state p_0 and the other moves lead to p_1, p_2, .... Each state p_k is searched, and the search result gives the value eval(p_k).
• Parameters are tuned to satisfy the condition
  \forall k \ (k \neq 0): \ \mathrm{eval}(p_0) > \mathrm{eval}(p_k)
• To satisfy this condition, parameters are tuned to minimize the loss function
  \min_w \sum_{n \in \mathrm{records}} \sum_{k \neq 0} T\left( \mathrm{eval}(p_{0,n}) - \mathrm{eval}(p_{k,n}) \right)
• T(z) converts the difference of values into the degree of agreement with the record's move (a minimal sketch of this loss follows below).
– sigmoid function [Hoki, 2006]
– logistic regression [Kaneko et al., 2008]
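To make the comparison-training loss concrete, here is a minimal Python sketch. It assumes a linear evaluation eval(p) = w^T x(p) and the logistic form T(z) = log(1 + exp(-z)); the function names and the toy data are our own illustration, not Gekisashi's implementation.

```python
import numpy as np

def eval_state(w, x):
    """Linear evaluation: eval(p) = w^T x(p)."""
    return w @ x

def comparison_loss(w, positions):
    """Comparison-training loss for one recorded position.

    positions[0] is the feature vector of the state reached by the
    record's move (p_0); positions[1:] are the states reached by the
    other legal moves (p_k).  With T(z) = log(1 + exp(-z)), the loss
    is small when eval(p_0) exceeds every eval(p_k).
    """
    e0 = eval_state(w, positions[0])
    loss = 0.0
    for xk in positions[1:]:
        z = e0 - eval_state(w, xk)      # eval(p_0) - eval(p_k)
        loss += np.log1p(np.exp(-z))    # T(z), decreasing in z
    return loss

# Toy usage: one position with 3 candidate moves and 4 features.
rng = np.random.default_rng(0)
w = rng.normal(size=4)
positions = [rng.normal(size=4) for _ in range(3)]
print(comparison_loss(w, positions))
```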
Domain Adaptation
• "Domain adaptation" is a family of methods that transfer the knowledge of a related domain.
– e.g. the knowledge of a learning system trained on "general text" (the source domain) is transferred to "medical text" (the target domain).
• In domain adaptation, our goals are
– tuning the parameters to adapt to the target domain, and
– keeping the properties of the source domain's knowledge.
• To achieve the second goal, we have to quantify how far the current parameters have moved from the source domain's parameters.

Bayesian Divergence Prior [X. Li et al., 2007]
• An index that uses the KL-divergence between the likelihoods under the current and the previous parameters.
– It keeps the classification results of the previous parameters:
  \min_w L(w, S_{\mathrm{target}}) + C \, D\left( p(y \mid x, w_{\mathrm{source}}) \,\|\, p(y \mid x, w) \right)
  where S_target is the training data, y is the class label (-1 or +1), L is the loss function, and the second term measures the difference of knowledge using
  D(p \,\|\, q) = \int p(x) \ln \frac{p(x)}{q(x)} \, dx,
  which is large if the two distributions are different.
• Assume the likelihood is the logistic sigmoid function, which is commonly used as a classification model:
  p(y \mid x, w) = \frac{1}{1 + \exp(-y w^T x)}
• Under this assumption, the Bayesian divergence prior becomes very simple:
  \min_w L(w, S_{\mathrm{target}}) + C \, \| w_{\mathrm{source}} - w \|

Proposed Method

Motivation
• After adding new features, evaluation functions are usually tuned from scratch.
– The existing parameters are discarded.
• We utilize these parameters in the new learning by using the Bayesian divergence prior.
[Figure: the previous method discards the existing evaluation function and tunes the new one from game records alone; the proposed method adds the features to the existing evaluation function and utilizes both its parameters and the records.]

Proposed Method
• An adaptive learning scheme that uses the existing parameters in a restricting term:
  \min_w L(w, S) + \frac{C}{2} \| w - w_0 \|^2
– The loss function L(w, S) tunes the parameters to fit the training data S.
– The restricting term keeps the existing properties; w_0 is the existing parameter vector w_pre with the weights of the new features filled with zero, and C is the restricting parameter.
– A minimal sketch of this objective follows after the next slide.

Advantages of Utilizing Existing Parameters
• The existing parameters were tuned with some knowledge (game records, human heuristics).
– Utilizing them brings more information into the learning → stronger computer game players.
• The existing parameters already have high evaluation accuracy.
– Learning starts close to a good solution → quicker convergence.
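As a sketch of the proposed objective, the following Python fragment combines the comparison loss with the restricting term (C/2)||w - w_0||^2 and minimizes it with plain gradient descent. It is an illustration under our own assumptions (linear evaluation, logistic loss, hypothetical toy data), not the actual learning code; the real system uses stochastic meta-descent, described later.

```python
import numpy as np

def objective(w, w0, xs_diff, C):
    """Proposed objective: comparison loss + restriction toward w0.

    xs_diff: rows are difference vectors x(p_record) - x(p_other);
    each margin w^T x_diff should be positive.
    w0: existing parameters, new-feature weights filled with zero.
    """
    margins = xs_diff @ w
    data_loss = np.sum(np.log1p(np.exp(-margins)))  # logistic loss
    restrict = 0.5 * C * np.sum((w - w0) ** 2)      # (C/2) ||w - w0||^2
    return data_loss + restrict

def gradient(w, w0, xs_diff, C):
    """Gradient of the objective with respect to w."""
    margins = xs_diff @ w
    coeff = -1.0 / (1.0 + np.exp(margins))   # d/dm log(1 + exp(-m))
    return xs_diff.T @ coeff + C * (w - w0)

# Toy usage: start from the "existing" parameters and descend.
rng = np.random.default_rng(1)
w0 = rng.normal(size=5)
xs_diff = rng.normal(size=(20, 5))
w = w0.copy()
for _ in range(100):
    w -= 0.1 * gradient(w, w0, xs_diff, C=1.0)
print(objective(w, w0, xs_diff, C=1.0))
```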
Evaluation Setting

Evaluation Environment
• We use "Gekisashi" to evaluate the proposed method.
– "Gekisashi" is a Shogi program developed in the Chikayama & Taura lab.
– It won the World Computer Shogi Championship in 2008.
• We use "InTrigger" for computing.
– "InTrigger" is a distributed platform constructed as a cluster of clusters.
– Learning took 20 hours with 80-way parallelism on Intel Xeon E5410 (2.33 GHz) processors.

Learning Method
• Parameters are tuned by repeating the following operations for 10 steps:
1. Create training data by searching all available moves in 5,000 game records.
2. Update the parameters 50 times with the training data created in step 1.
• We use logistic regression as the loss function.
• We use stochastic meta-descent for the updates.

Loss Function with Logistic Regression
  L(w, S) = \sum_{(x, y) \in S} \log\left( 1 + \exp(-y \, w^T x_{\mathrm{diff}}) \right)
  where y is the label (-1 or +1), x_diff = x(p_record) - x(p_other), and S is the training data.
[Figure: the loss decreases as (record's eval) - (other's eval) grows, so the parameters w are tuned in the direction that raises the record move's evaluation.]

Stochastic Meta-Descent [N. N. Schraudolph, 2002]
• Sets an independent step size for each element and updates the step sizes using past update information.
– Converges faster than the plain gradient method.
• Update the parameters:
  w_{t+1} = w_t - \mathrm{diag}(\eta_t) \, \delta_t
• Update the step sizes:
  \eta_{t+1} = \mathrm{diag}(\eta_t) \, \max{}_{0.5}\left( 1 - \mu \, \mathrm{diag}(\delta_t) \, v_t \right)
  v_{t+1} = \lambda v_t - \mathrm{diag}(\eta_t) \left( \delta_t + \lambda H_t v_t \right)
  where δ is the gradient, H is the Hessian, μ is the meta step size, λ is the decay factor, and max_{0.5} replaces elements lower than 0.5 with 0.5 (a minimal sketch follows below).
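Below is a minimal Python sketch of these update rules. The signs were lost in the slide extraction, so they are restored here following Schraudolph's published formulation; the Hessian-vector product is supplied as a callback, and the quadratic toy problem is purely illustrative, not the Shogi learning setup.

```python
import numpy as np

def smd(grad, hess_vec, w, mu=0.05, lam=0.99, eta0=0.1, steps=200):
    """Stochastic meta-descent (sketch).

    grad(w): gradient delta_t of the objective at w.
    hess_vec(w, v): Hessian-vector product H_t v_t.
    Each parameter keeps its own step size eta, adapted from the
    auxiliary vector v; the gain factor is floored at 0.5 (max_0.5).
    """
    eta = np.full_like(w, eta0)
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)                                     # delta_t
        hv = hess_vec(w, v)                             # H_t v_t
        eta = eta * np.maximum(0.5, 1.0 - mu * g * v)   # step-size update
        w = w - eta * g                                 # parameter update
        v = lam * v - eta * (g + lam * hv)              # auxiliary vector
    return w

# Toy usage: minimize 0.5 w^T A w - b^T w, whose Hessian is A.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
w_opt = smd(lambda w: A @ w - b, lambda w, v: A @ v, np.zeros(2))
print(w_opt)   # approaches the solution [1.0, 0.1]
```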
Learning Setting
• We add new features to the evaluation function of "Gekisashi":

  feature name                                      | number of features
  original features                                 |     410
  relative positions of the king and another piece  |   9,248
  absolute positions of the king and another piece  | 209,952
  relative positions of two pieces                  | 147,968
  absolute positions of two adjacent pieces         | 331,776
  total                                             | 699,354

※ The features related to the king are varied according to the degree of progress of the game.

Assessment Procedure
• We use the following three assessment criteria:

  criterion        | detail
  agreement rate   | The ratio of conforming to the record's move. Measured on 250 game records not used for learning.
  incorrect degree | The degree of false evaluation. Measured on 250 game records not used for learning.
  winning rate     | The winning rate against the original parameters. The first 30 moves are fixed, and each move is searched for 10 seconds.

※ The "original parameters" are the original ones with the new features' weights filled with zero; this means we did not compare with the "true Gekisashi".
※ The incorrect degree is calculated with a sigmoid function (see our handout).

Experiments

1. (Preliminary experiment) Tuning the restricting parameter.
2. Comparing with various methods.

Tuning the Restricting Parameter
• Objective
– Tune the balance between the loss function and the restricting term.
• How
– Tune the parameters while changing the restricting parameter C from 10^-5 to 10^5.
  • Gekisashi's original parameters are used as initial values.
  • The hyperparameters of stochastic meta-descent are tuned by 1-step learning.

[Figure: agreement rate per step for each restricting parameter; the curves form two groups, C = 10^-5 to 10^2 and C = 10^3 to 10^5.]
[Figure: incorrect degree per step for each restricting parameter; the same two groups appear.]
[Figure: winning rate for each restricting parameter (100 plays). C = 10^2 scores best at 63%, above the 5% significance level (61%); the other settings score between 44% and 60%, around the even line (50%).]

Comparing Various Methods
• Objective
– Evaluate the usefulness of
  • using the existing parameters as initial values, and
  • utilizing the existing parameters in the restricting term.
• How
– Compare the following five methods:

  name      | initial value       | restricting term
  proposed  | existing parameters | C = 100
  no bind   | existing parameters | none
  const     | existing parameters | fixed to the existing parameters
  koma      | piece values        | none
  bind koma | piece values        | C = 1

[Figure: agreement rate per step for each method.]
[Figure: incorrect degree per step for each method.]
[Figure: winning rate of each method against the original parameters (100 plays): 61, 56, 55, 52, and 48%; only the best method reaches the 5% significance level (61%); even is 50%.]

The Results of Mutual Battles (200 plays)
• Win rate of the row method against the column method ("proposed", "no bind", and "const" use the existing parameters; "koma" and "bind koma" do not):

            | proposed | no bind | const | koma | bind koma
  proposed  |    .     |  53.5   | 60    | 61.5 | 64.5
  no bind   |  46.5    |    .    | 61.5  | 55   | 60.5
  const     |  40      |  38.5   |   .   | 59   | 54.5
  koma      |  38.5    |  45     | 41    |   .  | 53.5
  bind koma |  35.5    |  39.5   | 45.5  | 46.5 |   .

※ In the original chart, blue marks win rates over 50%, red marks wins beyond the 5% significance level (more than 115 wins), and green marks losses beyond the 5% significance level (fewer than 85 wins).

Learning Speed
[Figure: agreement rate and incorrect degree of "proposed" after 6 steps compared with "koma" after 10 steps.]
• proposed (6 steps) vs. koma (10 steps): 52.5% (200 plays).
• The proposed method acquired nearly equal strength in fewer steps.

Conclusion

Conclusion and Future Tasks
• Adaptive learning utilizing existing parameters
– Obtained useful evaluation functions with improved learning speed.
• Future tasks
– Tuning the weight of the restricting term.
– The added features are very simple; more sophisticated features may show the merits more clearly.

Thank you for your attention.