Time Series Clustering: Analysis of Dynamic Systems

SAS Talks
January 23, 2014
SPEAKERS
• David J. Corliss, PhD
  Manager, Emerging Technologies
  Magnify Analytic Solutions
• Stacy Hobson
  Director, Customer Loyalty
  SAS Institute
Overview and History
A Basic Example
Plotting the Results
Standardization
Non-Periodic Events
Summary
A Typical Cluster Analysis in SAS
Cluster Analysis
• First developed in the 1930s by several researchers, including
  H. Driver and A. Kroeber, J. Zubin, and R. Tryon
• Uses distance measures to group observations into subsets
  with similar properties
• Groupings can be based on distance from a centroid or on the
  distribution or density of the observations
• An example of unsupervised learning; often nonparametric
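An Example of PROC FASTCLUS
title2 'Preliminary Analysis by FASTCLUS';
/* up to 10 preliminary clusters; CONVERGE=0 iterates until the seeds stop moving or MAXITER is reached */
proc fastclus data=iris summary maxc=10 maxiter=99 converge=0
              mean=mean out=prelim cluster=preclus;
   /* if iris holds PetalLength, PetalWidth, etc., use the prefix lists petal: sepal: instead */
   var petal sepal;
run;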
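For readers without a local iris table, the step above can be tried on a stock data set. The following is a minimal sketch, assuming the SASHELP.IRIS sample table that ships with SAS (variables Species, SepalLength, SepalWidth, PetalLength, PetalWidth). The output name work.iris_clus and the choice of three clusters are illustrative assumptions, not settings from the talk.

```sas
/* Sketch only: cluster the SASHELP.IRIS sample table (assumed available) */
proc fastclus data=sashelp.iris maxclusters=3 maxiter=99 converge=0
              out=work.iris_clus;      /* OUT= adds the CLUSTER and DISTANCE variables */
   var petal: sepal:;                  /* prefix lists: every Petal* and Sepal* variable */
run;

/* Cross-tabulate the unsupervised clusters against the known species labels */
proc freq data=work.iris_clus;
   tables species*cluster / norow nocol nopercent;
run;
```

The cross-tabulation is only a sanity check: FASTCLUS never sees the species labels, so a strong diagonal pattern suggests the measurements alone separate the groups.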
Cluster Analysis Applied to Time Series Data
**** NOAA Precipitation Data ****;
data work.noaa;
   /* tab-delimited file; FIRSTOBS=2 skips the header row */
   infile "/home/sas/NESUG/noaa_mi_1950_2009_tab.txt"
          dsd dlm='09'x lrecl=1500 truncover firstobs=2;
   input state_code :3.
         division   :3.
         year_month :$6.
         pcp        :6.2;
   /* split YYYYMM into numeric year and month: SUBSTR extracts, INPUT converts */
   year  = input(substr(year_month, 1, 4), 4.);
   month = input(substr(year_month, 5, 2), 2.);
run;
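The slides plot a table named work.cluster3 but do not show the clustering step that creates it. The sketch below illustrates one way such a step could look. It runs on simulated monthly precipitation rather than the NOAA file, and every name in it (work.noaa_sim, month_z, pcp_z, work.cluster_sim), the seasonal shape of the simulated data, and the choice of four clusters are assumptions made for illustration only.

```sas
/* Simulated stand-in for work.noaa: 60 years x 12 months (hypothetical values) */
data work.noaa_sim;
   call streaminit(2014);                        /* reproducible random numbers */
   do year = 1950 to 2009;
      do month = 1 to 12;
         /* assumed seasonal shape plus noise, floored at zero */
         pcp = max(0, 2.5 + sin((month - 4)*constant('pi')/6) + 0.5*rand('normal'));
         month_z = month;                        /* copies to standardize, so the   */
         pcp_z   = pcp;                          /* originals stay usable for plots */
         output;
      end;
   end;
run;

/* Put both clustering variables on a common scale */
proc standard data=work.noaa_sim mean=0 std=1 out=work.noaa_stan;
   var month_z pcp_z;
run;

/* Four clusters is an assumption, matching the four SYMBOL statements in the plot code */
proc fastclus data=work.noaa_stan maxclusters=4 maxiter=20 out=work.cluster_sim;
   var month_z pcp_z;
run;
```

Keeping unstandardized copies of year and month means the output table can be plotted on the original calendar axes while the clustering itself uses the rescaled values.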
Plotting the Results
goptions device=png;
/* one SYMBOL statement per cluster: same marker, different color */
symbol1 font=marker value=u height=0.6 c=blue;
symbol2 font=marker value=u height=0.6 c=red;
symbol3 font=marker value=u height=0.6 c=yellow;
symbol4 font=marker value=u height=0.6 c=green;
legend1 frame cframe=ligr label=none cborder=black
        position=center value=(justify=center);
axis1 label=(angle=90 rotate=0) minor=none;   /* vertical axis: year */
axis2 minor=none;                             /* horizontal axis: month */
proc gplot data=work.cluster3;
   /* year against month, one color per cluster assignment */
   plot year * month = cluster / frame cframe=ligr
        legend=legend1 vaxis=axis1 haxis=axis2;
run;
quit;
Cluster Analysis Applied to Complex Periodic Data
(Figure: plot with axes labeled mV and Beat Number)
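Rate of Change Variables
**** interval percent change ****;
proc sort data=work.doe;
   by year week;
run;
data work.doe;
   set work.doe;
   by year week;
   retain pw_price_index;
   /* write the row first, so it carries the prior week's index */
   output;
   /* then store this week's index for the next row */
   if first.week then pw_price_index = annualized_price_index;
run;
data work.doe;
   set work.doe;
   by year week;
   /* difference from the prior week, scaled by 100 (missing on the first row) */
   weekly_pct_change = (annualized_price_index
                        - pw_price_index) * 100;
run;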
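The same prior-week value can also be obtained with the LAG function, which some readers may find easier to follow than the RETAIN and OUTPUT ordering above. This is a minimal sketch on a few made-up rows: the table work.doe_demo, its values, and the extra variable weekly_rel_change are hypothetical and not part of the talk.

```sas
/* Hypothetical weekly index values, one row per year and week */
data work.doe_demo;
   input year week annualized_price_index;
   datalines;
2013 1 1.00
2013 2 1.02
2013 3 0.99
2013 4 1.05
;
run;

proc sort data=work.doe_demo;
   by year week;
run;

data work.doe_demo;
   set work.doe_demo;
   /* LAG returns the value from the previous row (missing on the first row) */
   pw_price_index = lag(annualized_price_index);
   /* same scaled difference as in the slide code */
   weekly_pct_change = (annualized_price_index - pw_price_index) * 100;
   /* relative change, as a percentage of the prior week's value */
   if pw_price_index > 0 then
      weekly_rel_change = (annualized_price_index / pw_price_index - 1) * 100;
run;
```

LAG is called unconditionally here on purpose: calling it inside an IF block would skip rows in its internal queue and return the wrong prior value. Whether the scaled difference or the relative change is the better clustering input depends on how the index was constructed.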
Standardization of Variables
/* rescale each variable to mean 0 and standard deviation 1 */
proc standard data=work.doe mean=0 std=1
              out=work.doe_stan;
   var week annualized_price_index weekly_pct_change
       supply supply_pct_change;
run;
/* cluster on the standardized values so no variable dominates the distance */
proc fastclus data=work.doe_stan maxc=6 maxiter=20
              out=work.cluster1;
   var week annualized_price_index weekly_pct_change
       supply supply_pct_change;
run;
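PROC STANDARD with MEAN=0 STD=1 replaces each value x with (x - mean) / std, so variables measured in different units contribute comparably to the distance used by FASTCLUS. A quick way to confirm the rescaling is sketched below. It uses the SASHELP.CLASS sample table as a stand-in, assuming that table is available; work.class_stan is a hypothetical output name.

```sas
/* Stand-in data: age, height, and weight are on very different scales */
proc standard data=sashelp.class mean=0 std=1 out=work.class_stan;
   var age height weight;
run;

/* Check: each standardized variable should show mean 0 and standard deviation 1 */
proc means data=work.class_stan mean std min max maxdec=3;
   var age height weight;
run;
```

Without this step, a variable with a large numeric range, such as weight here, would account for most of the Euclidean distance between observations.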
Standardization of Volatile Data
Identification of Seasons
(Figure: clusters along an axis with ticks from 2 to 52, apparently week of year. Legend:)
Blue: Post-holiday lull
Purple: Winter Season
Red: Spring Run-up
Amber: Summer Driving Season
Yellow: Holiday Spikes
Green: Fall Season
One-Time Events of Variable Duration
**** Time series of high velocity absorption events ****;
data work.first_date;
   set work.hva;
   by event_id jd_minus_24e5;
   if first.event_id;              /* keep the earliest observation of each event */
   first_date = jd_minus_24e5;
   keep event_id first_date;
run;
/* (and similarly for the last date of each event,
   which is merged with the first date) */
data work.hva;
   merge work.hva work.first_last;
   by event_id;
   /* day_of_event and duration are assumed to be available at this point */
   percent_duration = (day_of_event / duration) * 100;
run;
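Because the slides elide the last-date step and the derivation of day_of_event and duration, the sketch below shows one self-contained way to compute the rescaled time. It is not the author's code: it swaps the DATA step approach for a single PROC SQL query, and it assumes that duration is the last date minus the first date and that day_of_event is the observation date minus the first date. The table work.hva_demo and its values are made up.

```sas
/* Hypothetical observations: two events of different length, dates as JD - 2400000 */
data work.hva_demo;
   input event_id jd_minus_24e5;
   datalines;
1 50000
1 50010
1 50040
2 51000
2 51100
2 51200
2 51400
;
run;

/* PROC SQL remerges the per-event MIN and MAX back onto every row */
proc sql;
   create table work.hva_pct as
   select event_id,
          jd_minus_24e5,
          jd_minus_24e5 - min(jd_minus_24e5)      as day_of_event,
          max(jd_minus_24e5) - min(jd_minus_24e5) as duration,
          case
             when max(jd_minus_24e5) > min(jd_minus_24e5)
             then (jd_minus_24e5 - min(jd_minus_24e5))
                  / (max(jd_minus_24e5) - min(jd_minus_24e5)) * 100
             else .                                /* single-observation event */
          end                                     as percent_duration
   from work.hva_demo
   group by event_id
   order by event_id, jd_minus_24e5;
quit;
```

After rescaling, every event runs from 0 to 100 regardless of its length in days, so stages of short and long events can be compared directly in a cluster analysis.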
Absolute versus Relative Measures
Time Series Clustering of Non-Periodic Events
Summary
• Time series cluster analysis can be applied to events that
  change over time, to identify distinct, successive stages in
  dynamically evolving systems
• Both cyclical and non-repeating events may be analyzed
• Both point-in-time and rate-of-change variables should
  be considered
• Standardizing the model variables before cluster analysis
  may be necessary to obtain the best distinction
  between clusters
• Events of variable duration may be analyzed by rescaling
  time to a percentage of total duration
References
Fisher, R. A., 1936, Annals of Eugenics, 7(2), 179
Tryon, R. C., 1939, Cluster Analysis, Ann Arbor: Edwards Brothers
SAS Institute, Cary, N.C., www.support.sas.com
Thank You
David J. Corliss