June 19, 2015 — The Institute of Statistical Mathematics Open House
Data Centering of Distance and Similarity Measures in Nearest-Neighbor Methods: Reducing Hubness
Ikumi Suzuki, Project Researcher, Research Center for Statistical Machine Learning

Research Highlights
❧ Similarity measures based on inner products are popular in NLP and other machine-learning tasks.
❧ kNN classification does not work well for high-dimensional data.
❧ [Radovanovic et al. 2010] pointed out that hubs emerge in high-dimensional spaces. A hub is a sample that is similar to many other samples in a dataset; the presence of hubs can deteriorate the accuracy of kNN-based classification.
❧ [Radovanovic et al. 2010] also showed that samples close (similar) to the data centroid tend to become hubs.
❧ We show that a simple data-centering technique can reduce hubs and improve kNN-based classification performance.

What is a hub?
Classification based on kNN: the label of a test sample is predicted from the labels of the k training samples that are most similar to it.
[Figure: k-nearest-neighbor lists (k = 5) for test queries A–J. Sample X appears in many queries' lists.]
X is a hub: a sample that appears in the kNN lists of many other samples (queries).

Emergence of Hubs
A sample that is similar to the data centroid tends to become a hub.

Data Centering
Each sample is shifted by the data centroid: x_cent = x − x̄.
Synthetic dataset: 500 samples with 10 and 50 dimensions. Cosine similarity is used to measure similarity between samples. We evaluate the N10 value for each sample, where Nk is the number of times a sample appears in other samples' kNN lists; the Nk value is large for hub samples.
[Figure: histogram of N10 values for the 500 samples (d = 10), and a scatter plot of similarity to the centroid versus N10.]
Similarity between the i-th and j-th samples is measured by the inner product of their feature vectors.
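The N10 evaluation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the poster's actual experiment: the poster does not specify the generating distribution of the synthetic data, so a Gaussian with a non-zero mean is assumed here (a non-zero mean pushes the centroid away from the origin, which is the regime where cosine-similarity hubs appear).

```python
import numpy as np

def nk_counts(X, k=10):
    """N_k for each sample: how many times it appears in other samples'
    k-nearest-neighbor lists under cosine similarity (self excluded)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                         # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)          # a sample is never its own neighbor
    knn = np.argsort(-S, axis=1)[:, :k]   # indices of each sample's k-NN
    return np.bincount(knn.ravel(), minlength=len(X))

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, size=(500, 50))   # 500 samples, d = 50 (assumed distribution)
n10 = nk_counts(X, k=10)
# Every sample contributes 10 neighbor slots, so N_10 averages exactly 10;
# the interesting quantity is the skew: hubs have N_10 far above the median.
print(n10.max(), np.median(n10))
```

Centering is then just `nk_counts(X - X.mean(axis=0), k=10)`, which typically shrinks the largest N10 values toward the mean of 10.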
Before centering: ⟨x⁽ⁱ⁾, x⁽ʲ⁾⟩
After centering: ⟨x_cent⁽ⁱ⁾, x_cent⁽ʲ⁾⟩ = ⟨x⁽ⁱ⁾ − x̄, x⁽ʲ⁾ − x̄⟩

[Figure: histograms of N10 values for the 500 synthetic samples (d = 10 and d = 50). Many samples have a small N10 value, while a few samples (hubs) have a large N10 value. Scatter plots of similarity to the centroid versus N10 show that samples with a large N10 value (hubs) are more similar to the data centroid.]

N10: the number of times a sample appears in other samples' 10NN lists.

Experiments
Multi-cluster analysis with the Reuters Transcribed data.

Why Does Data Centering Reduce Hubs?
[Figure: distributions of similarity values (low ↔ high) before and after centering.]

① Select two points h and ℓ such that ⟨h, μ⟩ > ⟨ℓ, μ⟩. Then compare the similarity (inner product) of the selected samples h and ℓ with the other samples x.

② Before centering: how do the means of the distributions of ⟨h, x⟩ and ⟨ℓ, x⟩ differ?

  E[⟨h, x⟩] − E[⟨ℓ, x⟩] = ⟨h, E[x]⟩ − ⟨ℓ, E[x]⟩ = ⟨h, μ⟩ − ⟨ℓ, μ⟩ > 0

∴ The means of the two distributions differ: E[⟨h, x⟩] > E[⟨ℓ, x⟩], so h tends to appear in more kNN lists than ℓ.

③ After centering: what becomes of the difference between the means of ⟨h − x̄, x − x̄⟩ and ⟨ℓ − x̄, x − x̄⟩? Expanding the inner products:

  ⟨h_cent, x_cent⟩ = ⟨h − x̄, x − x̄⟩ = ⟨h, x⟩ − ⟨h, x̄⟩ − ⟨x̄, x⟩ + ‖x̄‖²
  ⟨ℓ_cent, x_cent⟩ = ⟨ℓ − x̄, x − x̄⟩ = ⟨ℓ, x⟩ − ⟨ℓ, x̄⟩ − ⟨x̄, x⟩ + ‖x̄‖²

Subtracting the expectations, the ⟨x̄, x⟩ and ‖x̄‖² terms cancel, and E[⟨h, x⟩] = ⟨h, x̄⟩ since E[x] = x̄ (and likewise for ℓ):

  E[⟨h − x̄, x − x̄⟩] − E[⟨ℓ − x̄, x − x̄⟩] = E[⟨h, x⟩] − ⟨h, x̄⟩ − E[⟨ℓ, x⟩] + ⟨ℓ, x̄⟩ = 0

∴ The means of the two distributions are equal: E[⟨h − x̄, x − x̄⟩] = E[⟨ℓ − x̄, x − x̄⟩].

Centroid vector: x̄ = (1/N) Σᵢ₌₁ᴺ x⁽ⁱ⁾
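The derivation in steps ② and ③ can be checked numerically: over a finite dataset, the empirical mean of ⟨h, x⟩ equals ⟨h, x̄⟩, so the before-centering gap equals ⟨h, μ⟩ − ⟨ℓ, μ⟩, and the after-centering gap vanishes. A small NumPy sketch (the dataset and the choice of h and ℓ are arbitrary illustrations, not the poster's data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=0.5, size=(500, 50))   # samples with a non-zero mean
mu = X.mean(axis=0)                        # empirical centroid x̄
h, l = X[0], X[1]                          # two arbitrary points h and ℓ

# Before centering: E[<h,x>] - E[<l,x>] = <h,mu> - <l,mu>, generally nonzero
before = (X @ h).mean() - (X @ l).mean()
assert np.isclose(before, h @ mu - l @ mu)

# After centering: both means collapse to zero, so the gap vanishes
Xc = X - mu
after = (Xc @ (h - mu)).mean() - (Xc @ (l - mu)).mean()
assert abs(after) < 1e-8
```

This equalization of mean similarity is exactly why no point retains a systematic advantage in entering other samples' kNN lists after centering.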