Stable Feature Selection via Dense Feature Groups
Lei Yu, Chris Ding, Steven Loscalzo
Knowledge Discovery and Data Mining (KDD) 2008
Coffeetalk, 20 Mar 2014
Stable feature selection

[Figure: heat map of the samples × features data matrix (~100 samples, up to 500 features)]

• Often: MANY features, but correlated
• Needs: interpretation. Which GROUPS of features are important?
Representation of features

• Each feature is an object
• Cluster the points (= objects = features)

[Figure: scatter plot of features as points in sample space; axes: sample 1, sample 2; one point labeled "feature 17"]
Mean Shift

• Start at a point: y_1 = x_17
• Weigh your neighbors with a (Gaussian) kernel with width h
• Update your position until convergence:

      y_{j+1} = Σ_{i=1}^{n} x_i K((y_j − x_i)/h) / Σ_{i=1}^{n} K((y_j − x_i)/h),   j = 1, 2, ...

• Store the cluster center
• Repeat for all objects

[Paper excerpt: Algorithm 1, DGF (Dense Group Finder), was introduced in [26] as a mean shift based algorithm to find dense feature groups from data. DGF works by kernel density estimation [24] and the iterative mean shift procedure on each of the features. The main part of DGF is the iterative mean shift procedure for all features, which locates a density peak by starting the mean at a given feature x_i and using other features in the local neighborhood (determined by a kernel with bandwidth h) to shift the mean to a denser location. The update rule above determines the sequence of successive locations of the kernel K. When the mean shift process converges, nearby features are gathered into feature groups. The algorithm has a time complexity of O(λn²m), where λ is the number of iterations for the mean shift procedure to converge; details of the algorithm and the choice of kernel function K and bandwidth are given in the paper.]
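To make the update rule concrete, here is a minimal NumPy sketch of the mean shift procedure described on this slide. The function name and the Gaussian kernel are my choices for illustration; the paper discusses kernel and bandwidth selection separately.

```python
import numpy as np

def mean_shift(X, h, tol=1e-6, max_iter=100):
    """Mean shift over the rows of X (each row = one feature,
    represented as a point in sample space)."""
    def shift(y):
        # Gaussian kernel weights K((y - x_i)/h) for every object x_i
        w = np.exp(-0.5 * np.sum(((y - X) / h) ** 2, axis=1))
        return w @ X / w.sum()          # kernel-weighted mean = next position

    modes = []
    for y in X:                          # start a trajectory at every object
        for _ in range(max_iter):
            y_next = shift(y)
            if np.linalg.norm(y_next - y) < tol:
                break                    # converged to a density mode
            y = y_next
        modes.append(y_next)             # store the cluster center
    return np.array(modes)
```

Features whose trajectories end at (nearly) the same mode then form one dense feature group.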
Dense Group Finder

• Consider the features as objects x_i
• Choose a characteristic distance h
• Find the modes of the density (mean shift procedure): y_j
• Group the features belonging to one mode: G_k
• Optionally: remove all features with ‖x_i − y_j‖ > h

[Figure: samples × features data matrix]
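The grouping step after mean shift can be sketched as follows. This is a rough reconstruction, not the paper's exact Algorithm 1: the merge radius of h/2 for deciding that two trajectories reached the same peak is my assumption, and the final filter implements the optional ‖x_i − y_j‖ > h removal from the slide.

```python
import numpy as np

def dense_group_finder(modes, X, h):
    """Group features whose mean-shift trajectories converged to
    (approximately) the same density mode.
    modes[i] = converged position of feature X[i]."""
    groups = []                                      # list of (mode, member indices)
    for i, y in enumerate(modes):
        for mode, members in groups:
            if np.linalg.norm(y - mode) <= h / 2:    # same peak (merge radius assumed)
                members.append(i)
                break
        else:
            groups.append((y, [i]))                  # start a new group at this mode
    # Optionally: drop features farther than h from their group's mode
    return [(mode, [i for i in members if np.linalg.norm(X[i] - mode) <= h])
            for mode, members in groups]
```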
Dense group finder

• After mean shift: found feature groups G_i
• Do group feature selection: Dense Relevant Attribute Group Selector (DRAGS) (only find the groups, based on stability; no feature combiner)
• Do classification: only use ONE feature from each group

[Figure: data matrix with feature groups G1, G2, G3 marked]
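The slide does not say which single feature is kept per group. One plausible choice, sketched below under my own assumption (not necessarily the criterion DRAGS uses), is the member closest to its group's mode; each group is given as a (mode, member-index-list) pair.

```python
import numpy as np

def group_representatives(X, groups):
    """Keep ONE feature per group: here, the member nearest the mode.
    groups = [(mode, [feature indices]), ...] -- format assumed for illustration."""
    reps = []
    for mode, members in groups:
        dists = [np.linalg.norm(X[i] - mode) for i in members]
        reps.append(members[int(np.argmin(dists))])  # index of the closest member
    return reps
```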
Results

Table 2: Average Accuracies (%, with Standard Deviation ±) Produced by DRAGS, Kmeans, and RFE

| Data  | Clf | Method | k=4       | k=6       | k=8       | k=10      | k=20      | k=30      | k=40      | k=50      |
|-------|-----|--------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| Colon | SVM | DRAGS  | 80.5±9.6  | 82.4±9.6  | 83.5±9.5  | 83.8±9.2  | 84.9±6.9  | 86.3±6.8  | 87.3±4.9  | 87.1±5.6  |
| Colon | SVM | RFE    | 64.8±3.3  | 65.5±4.1  | 68.9±5.1  | 68.3±5.6  | 76.1±5.3  | 78.6±5.1  | 80.2±6.7  | 81.9±7.6  |
| Colon | SVM | Kmeans | 77.6±10.7 | 79.3±8.7  | 81.7±7.7  | 82.7±7.0  | 84.2±6.3  | 84.2±5.8  | 81.9±8.2  | 82.2±7.8  |
| Colon | 1NN | DRAGS  | 74.1±10.7 | 77±8.9    | 75.1±8.1  | 74.4±9.3  | 74.6±7.4  | 75.2±7.8  | 77.3±7.5  | 77.9±7.3  |
| Colon | 1NN | RFE    | 59.9±6.2  | 64±5.8    | 67.2±3.4  | 66.9±5.9  | 73±4.6    | 74.8±5.8  | 75.9±5.1  | 80.3±4.9  |
| Colon | 1NN | Kmeans | 76.1±9.0  | 76.1±7.1  | 75.4±7.9  | 76.3±6.6  | 79.6±9.0  | 78.2±9.1  | 78.1±11.7 | 75.5±8.7  |
| Leuk. | SVM | DRAGS  | 92.8±3.8  | 92.2±3.6  | 92.9±4    | 93.2±4.2  | 95±3.5    | 96.2±3    | 96.7±3.4  | 97.1±3.5  |
| Leuk. | SVM | RFE    | 78.2±5    | 85.5±3.7  | 87.3±5.5  | 89.9±4.9  | 95.3±2.9  | 95.5±2.7  | 96.9±0.9  | 97.5±0.9  |
| Leuk. | SVM | Kmeans | 89.4±5.8  | 91.8±4.3  | 92.9±4.1  | 93.6±4.3  | 95.6±4.0  | 95.4±3.5  | 95.5±3.6  | 97.3±3.3  |
| Leuk. | 1NN | DRAGS  | 93.5±4    | 92.1±4.1  | 90.6±4.8  | 90.8±4.3  | 92.4±3.6  | 92.4±5    | 94.3±3.5  | 95.6±4.5  |
| Leuk. | 1NN | RFE    | 77.7±4.5  | 83.1±4.5  | 86.2±5    | 86.6±5.1  | 91.6±3    | 93.2±3.5  | 94.9±1.9  | 94.9±2.9  |
| Leuk. | 1NN | Kmeans | 90.2±5.4  | 90.6±6.2  | 91.3±5.7  | 93.3±4.9  | 93.8±4.3  | 94.1±4.1  | 93.7±3.7  | 95.1±4.2  |
| Lung  | SVM | DRAGS  | 98.5±1.9  | 99±1.3    | 98.9±1.3  | 98.7±1.8  | 98.8±1.6  | 98.9±1.4  | 99±1.3    | 99.1±1.1  |
| Lung  | SVM | RFE    | 90.3±1.6  | 93.2±1.6  | 95.7±1.3  | 96.4±1.7  | 98.8±0.7  | 99.4±0.6  | 99.5±0.3  | 99.4±0.5  |
| Lung  | SVM | Kmeans | 94.9±3.2  | 96.0±2.8  | 97.1±2.5  | 97.4±2.5  | 97.6±2.6  | 97.8±2.1  | 98.3±1.8  | 98.4±2.0  |
| Lung  | 1NN | DRAGS  | 98.4±1.4  | 99±1.3    | 98.7±1.2  | 98.9±1.3  | 98.6±1.7  | 98.7±1.7  | 98.6±1.9  | 98.7±2    |
| Lung  | 1NN | RFE    | 91.9±3.1  | 93.6±2.2  | 95±1.7    | 95.2±2    | 97.7±1.1  | 97.6±0.9  | 98.4±0.5  | 98.4±0.8  |
| Lung  | 1NN | Kmeans | 95.4±3.9  | 95.6±3.0  | 96.3±2.6  | 96.8±2.5  | 97.1±2.2  | 97.6±2.1  | 97.6±2.0  | 97.9±1.9  |
| Pro.  | SVM | DRAGS  | 86.6±4.6  | 88.6±4.9  | 90.5±5.5  | 89.8±5.9  | 89.1±6.4  | 89.9±5.4  | 91.3±4.6  | 91.7±3.9  |
| Pro.  | SVM | RFE    | 71.1±3.6  | 74.7±5    | 78.7±3.8  | 79.7±3.1  | 85.4±4.3  | 87.4±2.7  | 89.4±2.5  | 90.2±2.4  |
| Pro.  | SVM | Kmeans | 83.1±6.8  | 85.1±7.3  | 85.3±7.2  | 85.7±7.0  | 88.7±6.0  | 88.0±5.9  | 87.4±5.4  | 88.1±5.2  |
| Pro.  | 1NN | DRAGS  | 84.2±6.6  | 86.2±4.7  | 87.2±5.6  | 86.5±6.4  | 85.1±6.1  | 82.5±6.6  | 82.7±6.7  | 83.8±5.6  |
| Pro.  | 1NN | RFE    | 64.3±4.8  | 70.7±4.6  | 74.7±3.2  | 77.4±4.2  | 79.8±4.3  | 83.2±5.1  | 83.9±3.6  | 84.6±2.8  |
| Pro.  | 1NN | Kmeans | 76.8±6.9  | 76.8±7.6  | 78.3±7.0  | 80±6.8    | 83.1±6.0  | 81.4±6.2  | 83.6±5.3  | 82.9±6.5  |
| Lym.  | SVM | DRAGS  | 82.4±8.5  | 86±11.3   | 94.3±10.7 | 94.8±10.3 | 99.2±1.8  | 98.3±3.9  | 97.5±4.8  | 98.6±4.5  |
| Lym.  | SVM | RFE    | 87.5±3.7  | 95.8±4    | 97.3±3.1  | 98±3.1    | 99±2.4    | 99.3±1.7  | 99.5±1.5  | 99.5±1.5  |
| Lym.  | SVM | Kmeans | 87.9±9.5  | 91.9±9.4  | 94.4±7.6  | 94.4±7.7  | 97.3±3.6  | 97.6±3.7  | 98.2±2.9  | 98.4±2.8  |
| Lym.  | 1NN | DRAGS  | 91.3±6    | 96.3±4.5  | 99±1.9    | 98.9±2.4  | 99.4±1.6  | 99±2.9    | 97.9±4.1  | 98.6±3.8  |
| Lym.  | 1NN | RFE    | 95.2±3.8  | 97.7±2.9  | 97.3±3.7  | 98±3.1    | 98.8±2.5  | 98.8±2.2  | 99.2±1.9  | 99±2      |
| Lym.  | 1NN | Kmeans | 93.3±7.2  | 95.0±5.0  | 95.0±6.6  | 95.5±4.6  | 97.3±3.6  | 97.3±3.8  | 98.1±3.2  | 98.5±2.5  |
| SRB.  | SVM | DRAGS  | 86.3±10.2 | 94.1±5.3  | 96.5±3.9  | 97±3.6    | 99.5±1.9  | 99.2±1.8  | 99.2±1.8  | 99.4±1.6  |
| SRB.  | SVM | RFE    | 56.8±10.5 | 70.8±9.9  | 81.3±8.8  | 89.5±6.5  | 95.1±3.6  | 97±3.4    | 97.3±3.7  | 97.9±3.2  |
| SRB.  | SVM | Kmeans | 61.4±15.3 | 72.7±16.1 | 79.0±15.6 | 85.4±11.3 | 90.7±6.3  | 93.0±5.5  | 93.9±5.9  | 93.1±6.8  |
Conclusions

• Very simple idea (but is it new??)
• Focus on the stability of the clustering
• Cool thing: the width parameter h is set to the average k-nearest neighbor distance!
• Strange thing: why only take ONE feature from each group? (no averaging/PCA/...?)
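The bandwidth heuristic noted above (h = average k-nearest-neighbor distance) can be sketched as follows; the exact averaging the paper uses may differ in detail.

```python
import numpy as np

def knn_bandwidth(X, k):
    """Bandwidth h = average distance from each point to its
    k-th nearest neighbor (self excluded)."""
    # Full pairwise distance matrix between the rows of X
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)             # per row: column 0 is the self-distance (0)
    return D[:, k].mean()      # k-th neighbor distance, averaged over points
```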