2004 Open Lecture at ISM
Recent Topics in Machine Learning: Boosting
November 24 (Wed) - 26 (Fri), 2004. Open Lecture, Essentials of Statistical Mathematics, "Recent Topics in Machine Learning": Boost Learning.
Shinto Eguchi (The Institute of Statistical Mathematics; The Graduate University for Advanced Studies, Statistical Science) (slide 1)

Course description. Boost learning: we give an overview of AdaBoost, a method of statistical pattern recognition, and discuss its strengths and weaknesses. Applications to gene expression data, remote-sensing data and other examples are introduced. (slide 2)

Boost Learning (I), 10:00-12:30, November 25 (Thu):
Boost learning algorithm: AdaBoost
AsymAdaBoost: asymmetric learning
EtaBoost: robust learning
GroupBoost: group learning (slide 3)

Boost Learning (II), 13:30-16:00, November 26 (Fri):
BridgeBoost: meta learning
LocalBoost: local learning
Statistical discussion: probabilistic framework, Bayes rule, Fisher's LDF, logistic regression, the optimal classifier by AdaBoost (slide 4)

Acknowledgments. Much of the material presented in this course contains results of joint work with the following collaborators, to whom I am grateful:
Noboru Murata (Waseda University, Science and Engineering)
Ryuei Nishii (Kyushu University, Mathematics)
Takafumi Kanamori (Tokyo Institute of Technology, Mathematical and Computing Sciences)
Takashi Takenouchi (The Institute of Statistical Mathematics)
Masanori Kawakita (The Graduate University for Advanced Studies, Statistical Science)
John B. Copas (Dept of Statistics, University of Warwick) (slide 5)

The strength of the weak learnability, Schapire, R. (1990).
Strong learnability: given access to a source of examples of the unknown concept, the learner with high probability is able to output a hypothesis that is correct on all but an arbitrarily small fraction of the instances.
Weak learnability: the concept class is weakly learnable if the learner can produce a hypothesis that performs only slightly better than random guessing. (slide 6)

Web pages on boosting:
Boosting research site: http://www.boosting.org/
Robert Schapire's home page: http://www.cs.princeton.edu/~schapire/
Yoav Freund's home page: http://www1.cs.columbia.edu/~freund/
John Lafferty: http://www-2.cs.cmu.edu/~lafferty/ (slide 7)

Statistical pattern recognition:
Recognition of characters, images, speakers, signals, faces, languages, ...
Prediction of weather, earthquakes, disasters, finance, interest rates, company bankruptcy, credit, default, infection, disease, adverse effects
Classification of species, parentage, genomic type, gene expression, protein expression, system failure, machine trouble (slide 8)

Multi-class classification: feature vector, class label, discriminant function, classification rule. (slide 9)

Binary classification: class label and its normalization; classification rule; learn from a training dataset and make a classification. (slide 10)

Statistical learning theory.
Boost learning: boosting by filtering (Schapire, 1990); bagging, arcing (bootstrap) (Breiman; Friedman; Hastie); AdaBoost (Schapire, Freund, Bartlett, Lee).
Support vector machines: maximize the margin; kernel space (Vapnik, Schölkopf). (slide 11)

Class of weak machines: stump class, linear class, ANN class, SVM class, kNN class. Point: colorful character rather than universal character. (slide 12)

AdaBoost:
1. Initial settings: $w_1(i) = 1/n$ ($i = 1, \ldots, n$), $F_0(x) = 0$.
2. For $t = 1, \ldots, T$:
 (a) $\epsilon_t(f) = \dfrac{\sum_{i=1}^{n} w_t(i)\, I(y_i \neq f(x_i))}{\sum_{i'} w_t(i')}$, and $\epsilon_t(f_{(t)}) = \min_{f \in \mathcal F} \epsilon_t(f)$
 (b) $\alpha_t = \dfrac12 \log \dfrac{1 - \epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}$
 (c) $w_{t+1}(i) = w_t(i)\, \exp\bigl(-\alpha_t\, f_{(t)}(x_i)\, y_i\bigr)$
3. Output the final classifier $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T} \alpha_t f_{(t)}(x)$. (slide 13)
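To make the AdaBoost steps on slides 12-13 concrete, here is a minimal Python sketch (not from the lecture). It assumes the fixed dictionary of weak machines of slide 15, passed in as a list of callables returning ±1 predictions; the function and variable names (adaboost, weak_learners, ...) are illustrative only.

```python
import numpy as np

def adaboost(X, y, weak_learners, T):
    """Minimal AdaBoost sketch: y in {-1,+1}; weak_learners is a fixed list of
    callables f(X) -> array of +/-1 predictions (the dictionary of weak machines)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # step 1: w_1(i) = 1/n
    selected, alphas = [], []
    for t in range(T):
        # (a) weighted error of every candidate machine, pick the best one
        errs = [np.sum(w * (f(X) != y)) / np.sum(w) for f in weak_learners]
        j = int(np.argmin(errs))
        eps = errs[j]
        # (b) coefficient alpha_t = 0.5 * log((1 - eps) / eps)
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))
        # (c) reweight: multiply by exp(-alpha * y_i * f(x_i))
        w = w * np.exp(-alpha * y * weak_learners[j](X))
        selected.append(weak_learners[j])
        alphas.append(alpha)
    def F(Xnew):                                  # step 3: F_T(x) = sum_t alpha_t f_(t)(x)
        return sum(a * f(Xnew) for a, f in zip(alphas, selected))
    return lambda Xnew: np.sign(F(Xnew)), F
```

The returned pair gives the final classifier sign(F_T(x)) together with the unnormalized score F_T, whose contour is what the "final stage" figure on slide 19 plots.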
Learning algorithm (schematic): the data produce weights $w_1(1),\ldots,w_1(n)$ and the first machine $f_{(1)}(x)$ with coefficient $\alpha_1$; then $w_2(1),\ldots,w_2(n)$ and $f_{(2)}(x)$ with $\alpha_2$; and so on up to $w_T(1),\ldots,w_T(n)$ and $f_{(T)}(x)$ with $\alpha_T$. Final machine: $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T}\alpha_t f_{(t)}(x)$. (slide 14)

Simulation (complete separation): feature space $[-1,1]\times[-1,1]$ with a nonlinear decision boundary (figure). (slide 15)

Set of weak machines: linear classification machines, generated at random (figure). (slide 16)

Learning process (I) (figures; panel captions): Iter = 1, train err = 0.21; Iter = 13, train err = 0.18; Iter = 17, train err = 0.10; Iter = 23, train err = 0.10; Iter = 31, train err = 0.095; Iter = 47, train err = 0.08. (slide 17)

Learning process (II) (figures; panel captions): Iter = 55, train err = 0.061; Iter = 99, train err = 0.032; Iter = 155, train err = 0.016. (slide 18)

Final stage (figures): contour of $F(x)$ and $\mathrm{sign}(F(x))$. (slide 19)

Learning curve (figure): training error against iteration, Iter = 1, ..., 277. (slide 20)

Characteristics. Update $w_t(i) \to w_{t+1}(i)$: if $f_{(t)}(x_i) \neq y_i$ the weight is multiplied by the factor $e^{\alpha_t}$; if $f_{(t)}(x_i) = y_i$ it is multiplied by $e^{-\alpha_t}$. Weighted error rates: $\epsilon_t(f_{(t)}) \to \epsilon_{t+1}(f_{(t)})$ with $\epsilon_{t+1}(f_{(t)}) = \frac12$ (least favorable). (slide 21)

Indeed, with $w_{t+1}(i) = \exp\{-\alpha_t y_i f_{(t)}(x_i)\}\, w_t(i)$,
$$\epsilon_{t+1}(f_{(t)}) = \frac{\sum_{i=1}^n I(f_{(t)}(x_i)\neq y_i)\, w_{t+1}(i)}{\sum_{i'} w_{t+1}(i')}
= \frac{e^{\alpha_t}\sum_i I(f_{(t)}(x_i)\neq y_i)\, w_t(i)}{e^{\alpha_t}\sum_i I(f_{(t)}(x_i)\neq y_i)\, w_t(i) + e^{-\alpha_t}\sum_i I(f_{(t)}(x_i)= y_i)\, w_t(i)}
= \frac{\sqrt{\epsilon_t(f_{(t)})\{1-\epsilon_t(f_{(t)})\}}}{\sqrt{\epsilon_t(f_{(t)})\{1-\epsilon_t(f_{(t)})\}} + \sqrt{\epsilon_t(f_{(t)})\{1-\epsilon_t(f_{(t)})\}}} = \frac12.$$ (slide 22)

Exponential loss: $L_{\exp}(F) = \frac1n \sum_{i=1}^n \exp\{-y_i F(x_i)\}$. Update $F(x) \to F(x) + \alpha f(x)$:
$$L_{\exp}(F + \alpha f) = \frac1n\sum_{i=1}^n \exp\{-y_i F(x_i)\}\exp\{-\alpha y_i f(x_i)\}
= \frac1n\sum_{i=1}^n \exp\{-y_i F(x_i)\}\bigl[e^{\alpha}\, I(f(x_i)\neq y_i) + e^{-\alpha}\, I(f(x_i)= y_i)\bigr]
= L_{\exp}(F)\bigl\{e^{\alpha}\,\epsilon(f) + e^{-\alpha}\,(1-\epsilon(f))\bigr\},$$
where $\epsilon(f) = \sum_{i=1}^n I(f(x_i)\neq y_i)\exp\{-y_i F(x_i)\}\big/\sum_{i=1}^n\exp\{-y_i F(x_i)\}$ is the weighted error. (slide 23)

Sequential minimization: $L_{\exp}(F + \alpha f) = L_{\exp}(F)\{\epsilon(f)\,e^{\alpha} + (1-\epsilon(f))\,e^{-\alpha}\}$, and
$$\epsilon(f)\,e^{\alpha} + (1-\epsilon(f))\,e^{-\alpha} \;\ge\; 2\sqrt{\epsilon(f)\{1-\epsilon(f)\}},$$
with equality if and only if $\alpha = \alpha_{\rm opt} = \frac12\log\frac{1-\epsilon(f)}{\epsilon(f)}$. (slide 24)

AdaBoost = minimum exp-loss:
$$\min_{\alpha\in\mathbb R} L_{\exp}(F_{t-1} + \alpha f_{(t)}) = 2\,L_{\exp}(F_{t-1})\sqrt{\epsilon(f_{(t)})\{1-\epsilon(f_{(t)})\}},$$
attained by (a) $f_{(t)} = \arg\min_{f\in\mathcal F}\epsilon_t(f)$, (b) $\alpha_t = \frac12\log\frac{1-\epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}$, (c) $w_{t+1}(i) = w_t(i)\exp\{-\alpha_t y_i f_{(t)}(x_i)\}$. (slide 25)

Simulation (complete random): feature space $[-1,1]\times[-1,1]$ with labels assigned completely at random (figure). (slide 26)

Overlearning of AdaBoost (figures; panel captions): Iter = 51, train err = 0.21; Iter = 151, train err = 0.06; Iter = 301, train err = 0.0. (slide 27)

Drawbacks of AdaBoost and remedies:
1. Unbalanced learning: AsymAdaBoost, balancing the false negatives/positives.
2. Over-learning even for a noisy dataset: EtaBoost, robustifying against mislabelled examples.
GroupBoost: relaxing the p >> n problem.
LocalBoost: extracting spatial information.
BridgeBoost: combining different datasets. (slide 28)

AsymBoost: a small modification of AdaBoost, replacing the exponential loss by an asymmetric loss $L_{\rm asym}(\alpha f_t + F_{t-1})$ and step (b) by $\alpha_t^{*} = \arg\min_{\alpha\in\mathbb R} L_{\rm asym}(\alpha f_t + F_{t-1})$. The selection of $k$: the default choice of $k$ is the ratio of the number of examples with one label to the number with the other (the class-size ratio). (slide 29)

Weighted errors by k (figures). (slides 30-33)

Result of AsymBoost (figure). (slide 34)
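Before moving on to EtaBoost, the sequential-minimization result of slides 23-25 is easy to check numerically. The snippet below is a small illustrative sketch (not part of the lecture material): it verifies that $\alpha_{\rm opt} = \frac12\log\frac{1-\epsilon}{\epsilon}$ minimizes the per-round factor $\epsilon e^{\alpha} + (1-\epsilon)e^{-\alpha}$ and that the minimum equals $2\sqrt{\epsilon(1-\epsilon)}$.

```python
import numpy as np

def g(alpha, eps):
    """Per-round factor of the exponential loss: eps*exp(alpha) + (1-eps)*exp(-alpha)."""
    return eps * np.exp(alpha) + (1.0 - eps) * np.exp(-alpha)

for eps in (0.1, 0.25, 0.4):
    alpha_opt = 0.5 * np.log((1.0 - eps) / eps)
    grid = np.linspace(-3, 3, 2001)
    # the grid minimizer agrees with alpha_opt, and the minimum is 2*sqrt(eps*(1-eps))
    assert abs(grid[np.argmin(g(grid, eps))] - alpha_opt) < 1e-2
    assert abs(g(alpha_opt, eps) - 2.0 * np.sqrt(eps * (1.0 - eps))) < 1e-12
    print(eps, alpha_opt, g(alpha_opt, eps))
```

Since the factor is strictly less than 1 whenever $\epsilon < \frac12$, each round multiplies the exponential loss by $2\sqrt{\epsilon(1-\epsilon)} < 1$, which is the mechanism behind the decreasing learning curve on slide 20.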
Eta-loss function: a regularized loss function (figure). (slide 35)

EtaBoost:
1. Initial settings: $w_1(i) = 1/n$ ($i=1,\ldots,n$), $F_0(x)=0$.
2. For $m = 1, \ldots, T$:
 (a) $\epsilon_m(f) = \sum_{i=1}^n I(y_i\neq f(x_i))\, w_m(i)$, and $\epsilon_m(f_{(m)}) = \min_{f}\epsilon_m(f)$
 (b) the coefficient $\alpha_m^{*}$ is determined from the eta-loss instead of the exponential loss
 (c) $w_{m+1}(i) = w_m(i)\exp\bigl(-\alpha_m^{*} f_{(m)}(x_i)\, y_i\bigr)$
3. Output $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^T \alpha_t^{*} f_{(t)}(x)$. (slide 36)

A toy example (figure). (slide 37)

AdaBoost vs EtaBoost (figure). (slide 38)

Overlearning of AdaBoost, simulation with completely random labels (figures; panel captions): Iter = 51, train err = 0.21; Iter = 301, train err = 0.0. (slide 39)

EtaBoost on the same data (figures; panel captions): Iter = 51, train err = 0.25; Iter = 151, train err = 0.15; Iter = 351, train err = 0.18. (slide 40)

Mis-labeled examples (figure). (slide 41)

Comparison of AdaBoost and EtaBoost (figure). (slide 42)

GroupBoost: relax the over-learning of AdaBoost by group learning. Idea: in step 2(a) of AdaBoost the best machine is singly selected and other good machines are cast off. Is there a wise way of grouping the G best machines? (slide 43)

Grouping machines: let $f_t^{\,j}(\cdot\,; b_j) = \arg\min_{f\in\mathcal F_j}\epsilon_t(f)$, where $\mathcal F_j = \{\,a\,\mathrm{sgn}(x_j - b) : b\in\mathbb R,\ a\in\{-1,+1\}\,\}$. Select a set $f_{t(1)}(\cdot\,;b_{(1)}),\ldots,f_{t(G)}(\cdot\,;b_{(G)})$ such that $\epsilon_t\bigl(f_{t(1)}(\cdot\,;b_{(1)})\bigr) \le \cdots \le \epsilon_t\bigl(f_{t(G)}(\cdot\,;b_{(G)})\bigr)$, and set $\bar f_t(x) = \frac1G\sum_{g=1}^G \alpha_{t,(g)}\, f_{t(g)}(x_{(g)};b_{(g)})$. (slide 44)

GroupBoost:
 (a) $\epsilon_t(f) = \sum_{i=1}^N I(f(x_i)\neq y_i)\, w_t(i)$; select $f_t(x_{(1)};b_{(1)}),\ldots,f_t(x_{(G)};b_{(G)})$
 (b) $\alpha_{t,(g)} = \frac12\log\frac{1-\epsilon_t(f_t(\cdot\,;b_{(g)}))}{\epsilon_t(f_t(\cdot\,;b_{(g)}))}$ ($g=1,\ldots,G$), and $\bar f_t(x) = \frac1G\sum_{g=1}^G\alpha_{t,(g)}\, f_t(x_{(g)};b_{(g)})$
 (c) $w_{t+1}(i) = \dfrac{w_t(i)\exp\bigl(-\bar f_t(x_i)\, y_i\bigr)}{Z_t}$ (slide 45)

Grouping jumps for the next step: schematic of machines $f_1, f_2, f_3, f_4$ with coefficients $\alpha_1,\ldots,\alpha_4$ combined from $F_0$ (figure). (slide 46)

Learning architecture: at each stage $t$ the data produce weights $w_t(1),\ldots,w_t(n)$ and a group of machines $f_{(t,1)}(x),\ldots,f_{(t,G)}(x)$ with coefficients $\alpha_{(t,1)},\ldots,\alpha_{(t,G)}$, combined as $\bar f_{(t)} = \frac1G\bigl(\alpha_{(t,1)}f_{(t,1)} + \cdots + \alpha_{(t,G)}f_{(t,G)}\bigr)$; the final machine is $\sum_{t=1}^T \bar f_{(t)}(x)$. (slide 47)

AdaBoost and GroupBoost. Update of the weights: AdaBoost multiplies by $\exp(\pm\alpha_t)$, GroupAdaBoost by the grouped factor built from $\exp(\alpha_{t,g})$, $g=1,\ldots,G$. Both reduce the exponential loss at every step: for AdaBoost $L_{\exp}(F_t + \alpha_{t+1} f_{t+1}) \le L_{\exp}(F_t)$, and for GroupAdaBoost $L_{\exp}(F_t + \bar f_{t+1}) \le L_{\exp}(F_t)$. (slide 48)

From microarray: contest program from bioinformatics (BIP2003), http://contest.genome.ad.jp/. Microarray data: number of genes p = 1,000 to 100,000; number of individuals n = 10 to 100. (slide 49)

Output: http://genome-www.stanford.edu/cellcycle/ (figure). (slide 50)

Goal: disease and gene expressions. $y$ is a label for clinical information; $x = (x_1, \ldots, x_p)$ is the feature vector; the gene expression for the i-th individual is observed on the log scale. (BIP2003) Problem 2: n = 27 + 11 = 38 << p = 7109. (slide 51)

Microarray data: cDNA microarray (figure). (slide 52)

Prediction from gene expressions. Feature vector: dimension = number of genes p; components = quantities of gene expression, $x = (x_1,\ldots,x_p)$. Class label: disease, adverse effect, $y\in\{-1,+1\}$. A classification machine $f: x \mapsto y$ is built on the training dataset $\{(x_i, y_i): 1\le i\le n\}$. (slide 53)

Leukemic diseases, Golub et al.: http://www.broad.mit.edu/cgi-bin/cancer/publications/ (slide 54)

AdaBoost (restated):
1. Initial settings: $w_1(i) = 1/n$ ($i=1,\ldots,n$), $F_0(x) = 0$.
2. For $t=1,\ldots,T$: (a) $\epsilon_t(f) = \sum_{i=1}^n I(y_i\neq f(x_i))\, w_t(i)$, $\epsilon_t(f_{(t)}) = \min_f \epsilon_t(f)$; (b) $\alpha_t = \frac12\log\frac{1-\epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}$; (c) $w_{t+1}(i) = w_t(i)\exp\bigl(-\alpha_t f_{(t)}(x_i) y_i\bigr)$.
3. $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^T\alpha_t f_{(t)}(x)$. (slide 55)

One-gene classifier:
$$f_j(x) = \begin{cases} +1 & \text{if } x_j > b_j \\ -1 & \text{if } x_j \le b_j\end{cases}$$
Let $x_{j1},\ldots,x_{jn}$ be the expressions of the j-th gene. The cut-off is chosen as $b_j = \arg\min_b \sum_i I\bigl(y_i \neq \mathrm{sgn}(x_{ji} - b)\bigr)$ (figure: error counts 6, 5, 6, 5, 6, 5, 4, 5, 6, 5 along candidate cut-offs, minimized at 4 errors). (slide 56)
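A minimal sketch of the one-gene classifier of slides 55-56, with the optional sign flip $a\in\{-1,+1\}$ from slide 44 included; passing non-uniform weights gives the weighted re-training of slide 57. The function name and the mid-point choice of candidate cut-offs are illustrative choices of this sketch, not from the lecture.

```python
import numpy as np

def best_stump_for_gene(xj, y, w=None):
    """One-gene classifier f_j(x) = a * sign(x_j - b): choose the cut-off b and sign a
    minimizing the (weighted) error. xj: expressions of gene j for the n individuals,
    y: labels in {-1,+1}, w: example weights (uniform if None)."""
    n = len(y)
    w = np.full(n, 1.0 / n) if w is None else w / np.sum(w)
    order = np.argsort(xj)
    # candidate cut-offs: midpoints between consecutive sorted expression values
    cuts = (xj[order][:-1] + xj[order][1:]) / 2.0
    best = (None, None, np.inf)                  # (b, a, weighted error)
    for b in cuts:
        for a in (+1, -1):
            pred = a * np.where(xj > b, 1, -1)
            err = np.sum(w * (pred != y))
            if err < best[2]:
                best = (b, a, err)
    return best
```

Scanning all p genes with this routine and keeping the best (or, for GroupBoost, the G best) stumps is exactly the weak-learner search used in the microarray experiments that follow.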
The second training. With 16 correct answers and 4 false answers, $\alpha = 0.5\log\frac{16}{4} = \log 2$. Update the weights: misclassified examples are weighted up by the factor $e^{\log 2} = 2$, correctly classified examples are weighted down by $e^{-\log 2} = 0.5$. The cut-off is then re-chosen with the new weights, $b_j = \arg\min_b \sum_i w(i)\, I\bigl(y_i\neq\mathrm{sgn}(x_{ji}-b)\bigr)$ (figure: weighted error counts along candidate cut-offs). (slide 57)

Web microarray data (p >> n):

Dataset    p      n    y = +1   y = -1
ALLAML     7129   72   37       35
Colon      2000   62   40       22
Estrogen   7129   49   25       24

http://microarray.princeton.edu/oncology/
http://mgm.duke.edu/genome/dna micro/work/ (slide 58)

10-fold validation: the dataset $D = \{(x_i, y_i)\}_{i=1}^N$ is split into 10 blocks; in turn, one block is used for validation and the remaining nine for training, giving errors $\epsilon(1), \epsilon(2), \ldots, \epsilon(10)$, which are averaged as $\bar\epsilon = \frac1{10}\sum_{k=1}^{10}\epsilon(k)$. (slide 59)

ALLAML: training error and 10-fold CV error (figure). (slide 60)

Colon: training error and 10-fold CV error (figure). (slide 61)

Estrogen: training error and 10-fold CV error (figure). (slide 62)

Gene score. With $F_T(x) = \sum_{t=1}^T \bar f_t(x) = \sum_{t=1}^T\frac1G\sum_{g=1}^G \alpha_{t,(g)}\, f_t(x_{(g)};b_{(g)})$, the confidence factor for the j-th gene is
$$S(j) = \frac1T\sum_{t=1}^T\sum_{g=1}^G I\bigl((g) = j\bigr)\,\alpha_{t,(g)}.$$
S(j) suggests the degree of association with the class label. (slide 63)

Genes associated with disease: compare our result with the set of genes suggested by Golub et al. and West et al. AdaBoost can detect only 15 genes in the case of ALLAML, while GroupAdaBoost detects 30 genes compatible with their results. (slide 64)

Boost Learning (II), 13:30-16:00, November 26 (Fri): BridgeBoost (the p >> n problem); LocalBoost (contextual information); statistical discussion: probabilistic framework, Bayes rule, Fisher's LDF, logistic regression, the optimal classifier by AdaBoost. (slide 65)

Problem: p >> n. A fundamental issue in bioinformatics: p is the dimension of the biomarker (SNPs, proteome, microarray, ...); n is the number of individuals (informed consent, institutional protocol, ... bioethics). (slide 66)

An approach by combining. Let B be a biomarker space and let $I_1,\ldots,I_K$ be K experimental facilities, with datasets $D_k = \{(z_i): i=1,\ldots,n_k\}$ and $p \gg n_k$ ($k=1,\ldots,K$). With the rapid expansion of genomic data, K becomes larger and the combined sample size $\sum_{k=1}^K n_k$ grows relative to p. (slide 67)

BridgeBoost. CAMDA (Critical Assessment of Microarray Data Analysis), DDBJ (DNA Data Bank of Japan, NIG). Separate learning applies $f(D_k \mid D_k)$ to each dataset $D_1, D_2, \ldots, D_K$; the combined analysis bridges them into a single result. (slide 68)

CAMDA 2003: four datasets for lung cancer. http://www.camda.duke.edu/camda03/datasets/

Harvard     PNAS, 2001        Affymetrix
Michigan    Nature Med, 2002  Affymetrix
Stanford    PNAS, 2001        cDNA
Ontario     Cancer Res, 2001  cDNA
(slide 69)

Some problems: 1. Heterogeneity in feature space (cDNA, Affymetrix). 2. Heterogeneous class-labeling (differences in covariates, medical diagnosis). 3. Heterogeneous generalization powers (uncertainty of microarray experiments). 4. Publication bias (a vast number of unpublished studies). (slide 70)
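As a concrete reading of the 10-fold validation scheme of slide 59, here is a short sketch. The name train_fn stands for any training routine that returns a classifier (for instance the AdaBoost sketch above); the name and the random shuffling are assumptions of this illustration, not specified on the slide.

```python
import numpy as np

def ten_fold_cv_error(X, y, train_fn, n_folds=10, seed=0):
    """Slide 59: split the data into 10 blocks, use each block once for validation
    and the other 9 for training, then average the 10 validation errors."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        classify = train_fn(X[tr], y[tr])            # e.g. the AdaBoost sketch above
        errors.append(np.mean(classify(X[val]) != y[val]))
    return float(np.mean(errors))                    # (1/10) * sum_k eps(k)
```

The averaged error is the quantity plotted as the "10-fold CV error" curve in the ALLAML, Colon and Estrogen figures.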
Machine learning. Learnability: boosting weak learners? AdaBoost (Freund & Schapire, 1997) combines weak classifiers $\{f_1(x), \ldots, f_p(x)\}$ into a strong classifier $f(x)$, built stagewise: $\alpha_1 f_{(1)}(x) \to \alpha_1 f_{(1)}(x) + \cdots + \alpha_t f_{(t)}(x)$. (slide 71)

Learning algorithm (schematic, as before): the data D produce $w_1(1),\ldots,w_1(n)$ and $f_{(1)}(x)$ with $\alpha_1$, then $w_2(1),\ldots,w_2(n)$ and $f_{(2)}(x)$ with $\alpha_2$, up to $w_T(1),\ldots,w_T(n)$ and $f_{(T)}(x)$ with $\alpha_T$; the final machine is $\mathrm{sign}(F_T(x))$ with $F_T(x) = \sum_{t=1}^T\alpha_t f_{(t)}(x)$. (slide 72)

Different datasets: $D = \bigcup_{k=1}^K D_k$, where $D_k = \{(x_i^{(k)}, y_i^{(k)}): 1\le i\le n_k\}$, $x_i^{(k)}\in\mathbb R^p$ is the expression vector of the same genes and $y_i^{(k)}$ the label of the same clinical item. Normalization $\mathbb R^p\ni x_i^{(k)} \mapsto \tilde x_i^{(k)}\in[0,1]^p$:
$$\tilde x_{ij}^{(k)} = \frac{x_{ij}^{(k)} - \min_{i'} x_{i'j}^{(k)}}{\max_{i'} x_{i'j}^{(k)} - \min_{i'} x_{i'j}^{(k)}} \qquad (j = 1,\ldots,p).$$ (slide 73)

Weighted errors. The k-th weighted error is
$$\epsilon_t^{(k)}(f) = \frac{\sum_{i=1}^{n_k} w_t^{(k)}(i)\, I\bigl(y_i^{(k)}\neq f(x_i^{(k)})\bigr)}{\sum_{i=1}^{n_k} w_t^{(k)}(i)},$$
and the combined weighted error is $\epsilon_t(f) = \sum_{k=1}^K \pi^{(k)}\,\epsilon_t^{(k)}(f)$, where $\pi^{(k)} = \sum_{i=1}^{n_k} w_t^{(k)}(i)\big/\sum_{h=1}^K\sum_{i=1}^{n_h} w_t^{(h)}(i)$. (slide 74)

BridgeBoost. Given the weights $\{w_t^{(k)}(i): 1\le i\le n_k\}_{1\le k\le K}$:
 (a) $f_t^{(k)} = \arg\min_f \epsilon_t^{(k)}(f)$
 (b) $\alpha_t^{(k)} = \frac12\log\frac{1-\epsilon_t(f_t^{(k)})}{\epsilon_t(f_t^{(k)})}$
 (c) $\bar f_t(x) = \frac1K\sum_{k=1}^K \alpha_t^{(k)} f_t^{(k)}(x)$
 (d) $w_{t+1}^{(k)}(i) = w_t^{(k)}(i)\,\exp\bigl(-\bar f_t(x_i^{(k)})\, y_i^{(k)}\bigr)$ (slide 75)

Learning flow. At stage t, each dataset $D_k$ with weights $\{w_t^{(k)}(i)\}$ yields $f_t^{(k)}(x)$ and $\alpha_t^{(k)}$; these are combined as $\bar f_{(t)}(x) = \frac1K\bigl(\alpha_t^{(1)} f_t^{(1)}(x) + \cdots + \alpha_t^{(K)} f_t^{(K)}(x)\bigr)$, and the updated weights $\{w_{t+1}^{(k)}(i)\}$ are passed to stage t+1. (slide 76)

Mean exponential loss. The exponential loss on the k-th dataset is $L_k(F) = \frac1{n_k}\sum_{i=1}^{n_k}\exp\{-y_i^{(k)}F(x_i^{(k)})\}$, and the mean exponential loss is $\bar L(F) = \sum_{k=1}^K L_k(F)$. With $\alpha_t^{(k)} = \frac12\log\frac{1-\epsilon_t(f_t^{(k)})}{\epsilon_t(f_t^{(k)})}$, that is, $\alpha_t^{(k)} = \arg\min_\alpha \bar L\bigl(\alpha f_t^{(k)} + F_{t-1}\bigr)$,
$$\bar L(F_{t-1} + \bar f_t) \;\le\; \frac1K\sum_{k=1}^K \bar L\bigl(F_{t-1} + \alpha_t^{(k)} f_t^{(k)}\bigr) \;\le\; \bar L(F_{t-1}).$$
Note: convexity of the exponential loss. (slide 77)

Meta-learning. The sum $\sum_{h\neq k} L_h(\alpha f_t^{(k)} + F_{t-1})$ is cross-validatory: $L_h(\cdot)$ is evaluated on $D_h$ while $f_t^{(k)}$ is learned on $D_k$. Thus
$$\bar L\bigl(\alpha f_t^{(k)} + F_{t-1}\bigr) = \underbrace{L_k\bigl(\alpha f_t^{(k)} + F_{t-1}\bigr)}_{\text{separate learning}} + \underbrace{\sum_{h\neq k} L_h\bigl(\alpha f_t^{(k)} + F_{t-1}\bigr)}_{\text{meta-learning}}.$$ (slide 78)

Simulation: 3 datasets $\{D_1, D_2, D_3\}$, p = 100, $n_1 = n_2 = n_3 = 50$. Data 1 and data 2 ($D_1, D_2$): ideal test error 0. Data 3 ($D_3$): pure noise, ideal test error 0.5. Training and test errors are also examined for the collapsed (pooled) dataset (figures). (slide 79)

Comparison: training and test errors of separate AdaBoost and of BridgeBoost (figures). (slide 80)

Test errors (figure): collapsed AdaBoost, min = 15%; separate AdaBoost, min = 43%, min = 3%, min = 4%; BridgeBoost, min = 4%. (slide 81)

Conclusion: separate learning analyzes each $D_k$ by $f(D_k\mid D_k)$ alone; meta-learning bridges the datasets $D_1,\ldots,D_K$ into a single result. (slide 82)

Unsolved problems in BridgeBoost: 1. Which datasets should be joined or deleted in BridgeBoost? 2. How should the class label be predicted for a given new x? 3. What should be done with the information in the unmatched genes when combining datasets? 4. Heterogeneity is acceptable, but what about publication bias? (slide 83)
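The following sketch assembles one BridgeBoost round from my reading of slides 73-76. Because the slides are partly garbled, the exact places where the per-dataset and the combined weighted errors enter, and the precise form of the weight update, should be taken as assumptions of this illustration rather than as the authors' implementation.

```python
import numpy as np

def bridgeboost_round(datasets, weights, weak_learners):
    """One round t of a BridgeBoost-style update (sketch of slides 73-76).
    datasets: list of (X_k, y_k); weights: list of weight vectors w_t^(k);
    weak_learners: common dictionary of machines f(X) -> {-1,+1}^n."""
    K = len(datasets)
    total_w = sum(np.sum(w) for w in weights)
    def combined_error(f):
        # eps_t(f) = sum_k pi^(k) * eps_t^(k)(f),  pi^(k) = sum_i w^(k)(i) / total
        eps = 0.0
        for (X, y), w in zip(datasets, weights):
            eps_k = np.sum(w * (f(X) != y)) / np.sum(w)
            eps += (np.sum(w) / total_w) * eps_k
        return eps
    picked, alphas = [], []
    for (X, y), w in zip(datasets, weights):
        # (a) best machine on dataset k by its own weighted error
        errs = [np.sum(w * (f(X) != y)) / np.sum(w) for f in weak_learners]
        f_k = weak_learners[int(np.argmin(errs))]
        # (b) coefficient taken here from the combined weighted error
        eps = combined_error(f_k)
        alphas.append(0.5 * np.log((1 - eps) / max(eps, 1e-12)))
        picked.append(f_k)
    # (c) combined machine fbar_t(x) = (1/K) sum_k alpha_k f_k(x)
    fbar = lambda Xq: sum(a * f(Xq) for a, f in zip(alphas, picked)) / K
    # (d) reweight every dataset with the combined machine
    new_weights = [w * np.exp(-y * fbar(X)) for (X, y), w in zip(datasets, weights)]
    return fbar, new_weights
```

Running this round T times and summing the returned combined machines gives the bridged discriminant function whose test errors are compared with separate and collapsed AdaBoost on slide 81.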
Markov random field (MRF); the ICM algorithm (Besag, 1986). (slide 84)

Neighbor sites (figure). (slide 85)

Space AdaBoost: 1. Estimate the posterior p(k | x) using only non-contextual information. 2. Extract the contextual information based on the estimated posterior p(k | x). 3. Make a hierarchical set of weak machines $\mathcal F_r = \{\, p_r(k\mid i)\,\}$ and start the algorithm from the non-contextual result. (slide 86)

Analysis comparison: true image with 2 labels and classified images for several noise levels (figures). (slide 87)

LocalBoost. Let S be the set of possible sites, with a training dataset observed over S. The local exponential loss is
$$L_{\rm local}(F, h) = \frac1n\sum_{i=1}^n K_h\bigl(s, s(i)\bigr)\exp\bigl\{-y_{s(i)}\, F(x_{s(i)})\bigr\}.$$ (slide 88)

Algorithm: $s_t$ is uniformly sampled from $\{s(i): i=1,\ldots,n\}$ and $\epsilon_t(f, s_t) = \sum_{i=1}^n K_h(s_t, s(i))\, I(y_i\neq f(x_i))\, w_t(i)$; then
 (a) $f_{(t)\mid s_t} = \arg\min_{f\in\mathcal F}\epsilon_t(f, s_t)$
 (b) $\alpha_t = \frac12\log\frac{1-\epsilon_t(f_{(t)\mid s_t},\, s_t)}{\epsilon_t(f_{(t)\mid s_t},\, s_t)}$
 (c) $w_{t+1}(i) = w_t(i)\exp\bigl(-\alpha_t f_{(t)\mid s_t}(x_i)\, y_i\bigr)$
Note: the selected machine $f_{(t)\mid s_t}$ carries local information around $s_t$. (slide 89)

Locally weighted combination. Prediction rule: $h(x, s) = \mathrm{sgn}(F(x, s))$ with $F(x, s) = \sum_{t=1}^T K_h(s, s_t)\,\alpha_t\, f_{(t)\mid s_t}(x)$. The second weighting is used so that only the classification machines $f_{(t)\mid s_t}$ with $s_t$ near $s$ are strongly combined. (slide 90)

LocalBoost vs. AdaBoost (figure). (slide 91)
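A sketch of LocalBoost following slides 88-90. The kernel $K_h$ is written here as a Gaussian in the site coordinates and the locally weighted error is normalized; both are assumptions of this illustration, since the slides only write $K_h(s, s')$.

```python
import numpy as np

def localboost(X, y, sites, weak_learners, T, h=1.0, rng=None):
    """LocalBoost sketch (slides 88-90). sites: spatial coordinates s(i) of the
    examples; weak_learners: callables f(X) -> {-1,+1}^n."""
    rng = np.random.default_rng(rng)
    n = len(y)
    w = np.full(n, 1.0 / n)
    rounds = []                                   # (s_t, alpha_t, f_(t)|s_t)
    # assumed kernel: Gaussian in the site coordinates
    K = lambda s, sp: np.exp(-np.sum((s - sp) ** 2, axis=-1) / (2 * h ** 2))
    for t in range(T):
        s_t = sites[rng.integers(n)]              # s_t sampled uniformly from {s(i)}
        kw = K(sites, s_t) * w                    # locally weighted example weights
        errs = [np.sum(kw * (f(X) != y)) / np.sum(kw) for f in weak_learners]
        j = int(np.argmin(errs)); eps = errs[j]   # (a) locally best machine
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))          # (b)
        w = w * np.exp(-alpha * y * weak_learners[j](X))           # (c)
        rounds.append((s_t, alpha, weak_learners[j]))
    def F(Xnew, s):
        # second, spatial weighting: combine only machines whose s_t is near s
        return sum(K(s_t, s) * a * f(Xnew) for s_t, a, f in rounds)
    return lambda Xnew, s: np.sign(F(Xnew, s)), F
```

The returned classifier depends on both the feature vector and the query site, which is what distinguishes the LocalBoost decision map from the plain AdaBoost map compared on slide 91.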
Statistical discussion: Bayes rule, ROC curve, Neyman-Pearson lemma, model misspecification. (slide 92)

Error rate. Two types of errors (misclassification probabilities): with cut-off point u, the classification is summarized by true positives, false positives, false negatives and true negatives. (slide 93)

Neyman-Pearson Lemma (1): null hypothesis, alternative, log-likelihood ratio, the Neyman-Pearson lemma. (slide 94)

Neyman-Pearson Lemma (2): the Bayes rule. (slide 95)

Loss functions by the NP lemma. (slide 96)

A class of loss functions: exponential loss, log loss. (slide 97)

Bayes rule equivalence: Theorem 1, with the condition under which equality holds. (slide 98)

Error rate: the error rate is a member of this class of loss functions, with H(s) the Heaviside function. (slide 99)

Important examples: credit scoring, medical screening. (slide 100)

Minimum exponential loss: empirical exponential loss, expected exponential loss, and a theorem relating them. (slide 101)

Variational discussion. (slide 102)

ROC curve: A = area under the curve (AUC); Gini index. (slide 103)

ROC analysis: true positive rate against false positive rate. (slide 104)

Logistic-type discriminant: for a given training dataset and a given function, the empirical loss suggests the decision rule. (slide 105)

Log loss: the conditional expected likelihood. (slide 106)

Estimation equation: expected and empirical versions; the logistic model; IRLS (iteratively reweighted least squares); in R, glm(formula, family = binomial, data, weights = W, ...). (slide 107)

Fisher consistency: Theorem 3. If the distribution of (x, y) follows the assumed model, the estimator is asymptotically consistent. (slide 108)

Asymptotic efficiency. Asymptotic variance: $\mathrm{var}_A(\hat\beta) = \frac1n J^{-1}(\beta)\, V(\beta)\, J^{-1}(\beta)$. A Cramér-Rao type inequality gives $\mathrm{var}_A(\hat\beta) \ge \frac1n \bigl[E\{p(1-p)\, f(x) f(x)^T\}\bigr]^{-1}$, with equality if and only if the loss is that of logistic regression. (slide 109)

Expected loss under parametric models. Expected D-loss: $\mathrm{Risk}(\hat\beta, U)$, the expectation of $L_U(\hat\beta)$. Under the parametric assumption, $\mathrm{Risk}(\hat\beta_{U^*}, U) = \mathrm{Risk}(\hat\beta_{\log}, U) + o(1/n)$ for all $(U, U^*)$. (slide 110)

Expected loss under a misspecified model. Under a near-parametric setting, $\pi(x) = \beta^T f(x) + O(n^{-1/2})$, we have
$$\mathrm{Risk}(\hat\beta_{U^*}, U) = D(\beta_{U^*}) + \tfrac12\,\mathrm{tr}\bigl\{\mathrm{var}_A(\hat\beta_{U^*})\,\mathrm{Hesse}(L_U)\bigr\} + o(n^{-1}),$$
where $\beta_{U^*} = \arg\min_{\beta'} L_{U^*}(\beta')$. (slide 111)

Sampling scheme: conditional sampling (cohort study, prospective study); mixture sampling; separate sampling (case-control study, retrospective study). (slide 112)

References (slides 113-117):

Adams, N. M. and Hand, D. J. (1999). Comparing classifiers when the misclassification costs are uncertain. Pattern Recognition 32, 1139-1147.
Adams, N. M. and Hand, D. J. (2000). Improving the practice of classifier performance assessment. Neural Computation 12, 305-311.
Begg, C. B., Satagopan, J. M. and Berwick, M. (1998). A new strategy for evaluating the impact of epidemiologic risk factors for cancer with applications to melanoma. J. Amer. Statist. Assoc. 93, 415-426.
Berwick, M., Begg, C. B., Fine, J. A., Roush, G. C. and Barnhill, R. L. (1996). Screening for cutaneous melanoma by self skin examination. J. National Cancer Inst. 88, 17-23.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
Domingo, C. and Watanabe, O. (2000). MadaBoost: a modification of AdaBoost. In Proc. of the 13th Conference on Computational Learning Theory.
Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. J. Amer. Statist. Assoc. 70, 892-898.
Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Ann. Statist. 28, 337-407.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179-188.
Hand, D. J. and Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: a review. J. Roy. Statist. Soc. A 160, 523-541.
Nishii, R. and Eguchi, S. (2004). Supervised image classification by contextual AdaBoost based on posteriors in neighborhoods. To be submitted.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Springer, New York.
Lebanon, G. and Lafferty, J. (2001). Boosting and maximum likelihood for exponential models. NIPS 14.
McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.
Pepe, M. S. and Thompson, M. L. (2000). Combining diagnostic test results to increase accuracy. Biostatistics 1, 123-140.
Rätsch, G., Onoda, T. and Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning 42(3), 287-320.
Schapire, R. (1990). The strength of the weak learnability. Machine Learning 5, 197-227.
Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26, 1651-1686.
Vapnik, V. N. (1999). The Nature of Statistical Learning Theory. Springer, New York.
Eguchi, S. and Copas, J. (2001). Recent developments in discriminant analysis from an information geometric point of view. J. Korean Statist. Soc. 30, 247-264. (Special issue for the 30th anniversary of the Korean Statistical Society.)
Eguchi, S. and Copas, J. (2002). A class of logistic-type discriminant functions. Biometrika 89, 1-22.
Eguchi, S. (2002). U-boosting method for classification and information geometry. Invited talk, International Statistical Workshop, Statistical Research Center for Complex Systems, Seoul National University.
Kanamori, T., Takenouchi, T., Eguchi, S. and Murata, N. (2004). The most robust loss function for boosting. Lecture Notes in Computer Science 3316, 496-501. Springer.
Murata, N., Takenouchi, T., Kanamori, T. and Eguchi, S. (2004). Information geometry of U-Boost and Bregman divergence. Neural Computation 16, 1437-1481.
Takenouchi, T. and Eguchi, S. (2004). Robustifying AdaBoost by adding the naive error rate. Neural Computation 16, 767-787.
Murata, N., Takenouchi, T., Kanamori, T. and Eguchi, S. Geometry of U-Boost algorithms. Hayashibara Forum "Chance and Necessity: Mathematics, Information and Economics," session "Physics of Information."
Eguchi, S. (2004). Information geometry and statistical pattern recognition (in Japanese). 数学論説.
Eguchi, S. (2004). Information geometry of statistical pattern recognition: the U-Boost learning algorithm (in Japanese). 数理科学 No. 489, 53-59.
Eguchi, S. (2002). On methods of statistical discrimination: from logistic discrimination to AdaBoost (in Japanese). Special lecture, 24th Symposium of the Japanese Society of Applied Statistics, "New Developments in Multivariate Analysis."

Future paradigm. The concept of learning will continue to offer highly productive ideas for computational algorithms. Is it an imitation of the biological brain? Meta learning? Human life? And statistics? (slide 118)