Data Centering of Distance and Similarity Measures in Nearest-Neighbor Methods: Reducing Hubness

June 19, 2015, The Institute of Statistical Mathematics Open House
Ikumi Suzuki (Project Researcher, Statistical Machine Learning Center) · Research Highlight
❧  Similarity measures based on inner products are popular in NLP and other machine learning tasks.
❧  kNN does not work well for high-dimensional data.
❧  [Radovanovic et al. 2010] pointed out that hubs emerge in high-dimensional spaces. A hub is a sample that is similar to many other samples in a dataset. The presence of hubs can deteriorate the accuracy of kNN-based classification.
❧  [Radovanovic et al. 2010] showed that samples close (similar) to the data centroid tend to become hubs.
❧  We show that a simple data centering technique can reduce hubs and improve kNN-based classification performance.
What is a hub?

Classification based on kNN: the label of a test sample is predicted from the labels of the k training samples that are most similar to the test sample.

[Figure: kNN example (k = 5). For each test sample (query), the top-ranked nearest neighbors are listed; sample X is a hub, appearing in the kNN lists of many other samples (queries).]

Emergence of hubs: a sample that is similar to the data centroid tends to become a hub.
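The kNN classification rule described above can be sketched as follows; this is a minimal illustration assuming inner-product similarity and a majority vote, with illustrative function and variable names:

```python
import numpy as np

def knn_predict(x_test, X_train, y_train, k=5):
    """Predict the label of x_test from its k most similar training samples.

    Similarity is the inner product of feature vectors, as on the poster;
    the majority label among the k nearest neighbors is returned.
    """
    sims = X_train @ x_test               # inner-product similarity to each training sample
    top_k = np.argsort(sims)[::-1][:k]    # indices of the k most similar samples
    labels, counts = np.unique(y_train[top_k], return_counts=True)
    return labels[np.argmax(counts)]      # majority vote

# Toy training set: two classes along the two axes (illustrative data)
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([1.0, 0.05]), X_train, y_train, k=3))  # → 0
```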
Data Centering
Synthetic dataset: 500 samples with 10 and 50 dimensions.
Cosine similarity is used to measure the similarity between samples.
We evaluate the N10 value of each sample in the dataset. Nk is the number of times a sample appears in other samples' kNN lists; the Nk value is large for hub samples.

Centering: x_i^cent = x_i − x̄
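The centering operation itself is a one-line transformation; a tiny sketch with illustrative toy vectors:

```python
import numpy as np

# Toy feature vectors (rows = samples); centering subtracts the centroid x̄
X = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
xbar = X.mean(axis=0)          # centroid x̄ = [1.0, 1.0]
X_cent = X - xbar              # x_i^cent = x_i − x̄
print(X_cent.mean(axis=0))     # → [0. 0.]: centered data has zero mean
```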
[Figure: histograms of N10 values for the 500 samples at d = 10 and d = 50. Many samples show a small N10 value, while a few samples (hubs) show a large N10 value.]

Similarity between the i-th and j-th samples is measured by the inner product of their feature vectors.
Before centering: ⟨x_i, x_j⟩
After centering: ⟨x_i^cent, x_j^cent⟩ = ⟨x_i − x̄, x_j − x̄⟩

[Figure: scatter plots of similarity to the centroid against N10 at d = 10 and d = 50. Samples with a large N10 value (hubs) are more similar to the data centroid.]

N10: the number of times a sample appears in other samples' 10NN lists.
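The N_k computation can be reproduced in a few lines. The sketch below follows the poster's setup (cosine similarity, k = 10, 500 samples, 50 dimensions), but the generating distribution is an assumption of ours: an off-origin Gaussian, so that the uncentered data has a dominant centroid direction and hubs can emerge.

```python
import numpy as np

rng = np.random.default_rng(0)

def n_k(X, k=10):
    """N_k value of each sample: how often it appears in the other
    samples' k-NN lists under cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize rows -> cosine
    S = Xn @ Xn.T                                      # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)                       # exclude self-matches
    knn = np.argsort(-S, axis=1)[:, :k]                # k nearest neighbors of each sample
    return np.bincount(knn.ravel(), minlength=len(X))  # occurrence counts

# 500 samples, 50 dimensions, mean shifted away from the origin (assumed toy data)
X = rng.normal(loc=1.0, scale=1.0, size=(500, 50))
before = n_k(X)
after = n_k(X - X.mean(axis=0))                        # data centering
print(before.max(), after.max())                       # the largest N10 typically shrinks
```

The total of all N_k values is always N·k (here 5000), so centering cannot change the total; it redistributes neighbor occurrences more evenly across samples.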
Experiments
Multi-cluster analysis with the Reuters Transcribed dataset.
Why does data centering reduce hubs?

[Figure: distributions of similarity values (low ↔ high) before and after centering.]

① Select two points h and ℓ such that ⟨h, µ⟩ > ⟨ℓ, µ⟩. Then compare the similarity (inner product) of the selected samples (h and ℓ) with other samples x.

② Before centering: how different are the means of the distributions of ⟨h, x⟩ and ⟨ℓ, x⟩?

E[⟨h, x⟩] − E[⟨ℓ, x⟩] = ⟨h, E[x]⟩ − ⟨ℓ, E[x]⟩ = ⟨h, µ⟩ − ⟨ℓ, µ⟩ > 0

∴ The means of the two distributions are different: E[⟨h, x⟩] > E[⟨ℓ, x⟩].

③ After centering: what becomes of the difference between the means of the distributions of ⟨h − x̄, x − x̄⟩ and ⟨ℓ − x̄, x − x̄⟩?

⟨h^cent, x^cent⟩ = ⟨h − x̄, x − x̄⟩ = ⟨h, x⟩ − ⟨h, x̄⟩ − ⟨x̄, x⟩ + ‖x̄‖²
⟨ℓ^cent, x^cent⟩ = ⟨ℓ − x̄, x − x̄⟩ = ⟨ℓ, x⟩ − ⟨ℓ, x̄⟩ − ⟨x̄, x⟩ + ‖x̄‖²

Since the −⟨x̄, x⟩ and ‖x̄‖² terms cancel in the difference, and E[⟨h, x⟩] = ⟨h, E[x]⟩ = ⟨h, x̄⟩ (likewise for ℓ):

E[⟨h − x̄, x − x̄⟩] − E[⟨ℓ − x̄, x − x̄⟩] = E[⟨h, x⟩] − ⟨h, x̄⟩ − E[⟨ℓ, x⟩] + ⟨ℓ, x̄⟩ = 0

∴ The means of the two distributions are not different: E[⟨h − x̄, x − x̄⟩] = E[⟨ℓ − x̄, x − x̄⟩]

Centroid vector: x̄ = (1/N) Σ_{i=1}^{N} x_i
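The mean equality derived in step ③ can be checked numerically. Everything below (the data distribution and the construction of h and ℓ) is an illustrative assumption; h is built to be more similar to the centroid than ℓ, as the derivation requires.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=0.5, size=(2000, 20))   # samples x with a nonzero mean (toy data)
xbar = X.mean(axis=0)                      # empirical centroid x̄ ≈ µ

h = xbar + 0.1 * rng.normal(size=20)       # point more similar to the centroid
l = -xbar + 0.1 * rng.normal(size=20)      # point less similar to the centroid
assert h @ xbar > l @ xbar                 # ⟨h, µ⟩ > ⟨ℓ, µ⟩, as required in step ①

# Step ②: before centering, the means of ⟨h, x⟩ and ⟨ℓ, x⟩ differ
before = (X @ h).mean() - (X @ l).mean()   # E[⟨h, x⟩] − E[⟨ℓ, x⟩] > 0

# Step ③: after centering, the difference of the means vanishes
Xc = X - xbar
after = (Xc @ (h - xbar)).mean() - (Xc @ (l - xbar)).mean()

print(before > 0, abs(after) < 1e-8)       # → True True
```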