Use of R in the UK Office for National Statistics
Duncan Elliott
Time Series Analysis Branch
Survey Methodology and Statistical Computing Division
Outline
• Brief history of R in ONS
• Examples of analysis
dlm package for modelling unemployment
spatstat package for crime statistics
• Use in producing National Statistics
Current: MortalitySmooth package
Future: ?
Brief history of R in ONS
Year  Version         ONS Users
2004  2.0.1           1
2006  2.3.1           5
2011  2.12.1          30
2014  3.0.2 (32-bit)  60
• Used as a research tool for
Spatial analysis, small area estimation, time series analysis, sample design & estimation
Brief history of R in ONS
• R used to call X-12-ARIMA to analyse multiple
seasonal time series efficiently (2006)
• Development of informal training (2007)
• Smoothing of mortality rates (2010)
• Establishment of R Testing Group (2011)
Methodologists & IT specialists
Pilots for disclosure tool, admin data processing, production
standard graphics, experience of other NSIs
Conclusion: useful research tool not yet for general production
systems
• Establishment of R Development Group (2012)
Aim: testing for production environments
Development of formal training (2013)
How R is currently used at ONS
• R for Windows on individual workstations
• Restricted use of R
Delay in updating versions
No direct access to packages on CRAN or other repositories
Cannot be used for regular production of a National Statistic
(one exception)
• Most users access R via the R GUI (some use of RStudio)
• Used for analysis and research in a growing number
of areas at ONS
Analysis with the dlm package
• Giovanni Petris (2010). An R Package for Dynamic
Linear Models. Journal of Statistical Software, 36(12),
1-16. URL http://www.jstatsoft.org/v36/i12/
• Unemployment statistics currently published as
rolling quarterly data
• Aim: state space model for
Modelling potential discontinuities
Account for Survey Error Autocorrelation (SEA) and Rotation
Group Bias (RGB)
Extracting monthly signal
Removal of some unobserved components
LFS Survey Design & Estimation
• Quarterly survey with rotating panel design
• 40,000 households per quarter
• Respondents interviewed for 5 successive waves
at three-monthly intervals
• Typically wave 1 CAPI, waves 2-5 CATI
• Rolling quarterly estimates use calibration
weighting
• Imputation for non-response = roll forward for one
period else zero weight
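The imputation rule in the last bullet can be illustrated with a minimal sketch. It assumes one reading of the rule: a non-respondent keeps the value reported at the immediately preceding wave, and a second consecutive non-response is dropped from weighting. The function name, the labels and the example history are hypothetical and are not taken from the LFS systems.

```python
from typing import List, Optional, Tuple


def roll_forward(responses: List[Optional[str]]) -> List[Tuple[Optional[str], bool]]:
    """Return (value, in_weighting) for each wave of one respondent.

    None marks a non-response. A reported value is rolled forward for
    one period only; after that the record gets zero weight.
    """
    result = []
    previous_observed = None  # value actually reported at the previous wave
    for value in responses:
        if value is not None:
            result.append((value, True))
            previous_observed = value
        elif previous_observed is not None:
            # Impute from the previous wave, then forget it so that a
            # second consecutive non-response is not imputed again.
            result.append((previous_observed, True))
            previous_observed = None
        else:
            result.append((None, False))  # zero weight
    return result


# Made-up history over five waves: response, two misses, response, miss.
history = ["employed", None, None, "unemployed", None]
imputed = roll_forward(history)
print(imputed)
```

Only reported values are rolled forward, so an imputed value is never itself used as the source of a further imputation.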
Cohort and Wave structure
Period        Cohort
              1   2   3   4   5   6   7   8   9   10  11  12
Jan-Mar 2012  W5  W4  W3  W2  W1
Apr-Jun 2012      W5  W4  W3  W2  W1
Jul-Sep 2012          W5  W4  W3  W2  W1
Oct-Dec 2012              W5  W4  W3  W2  W1
Jan-Mar 2013                  W5  W4  W3  W2  W1
Apr-Jun 2013                      W5  W4  W3  W2  W1
Jul-Sep 2013                          W5  W4  W3  W2  W1
Oct-Dec 2013                              W5  W4  W3  W2  W1
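The rotation pattern in the table above can be generated programmatically. The sketch below assumes that one cohort enters at wave 1 and one leaves after wave 5 every quarter, so that cohort 1 is in wave 5 in the first period shown. The function name and layout are illustrative only.

```python
def wave_table(n_periods: int = 8, n_waves: int = 5):
    """Return one dict per period mapping cohort number -> wave label."""
    table = []
    for p in range(n_periods):
        row = {}
        for wave in range(1, n_waves + 1):
            # In period p (0-based) the cohort in its final wave is p + 1;
            # each later cohort is one wave behind.
            cohort = p + (n_waves - wave) + 1
            row[cohort] = f"W{wave}"
        table.append(row)
    return table


periods = ["Jan-Mar 2012", "Apr-Jun 2012", "Jul-Sep 2012", "Oct-Dec 2012",
           "Jan-Mar 2013", "Apr-Jun 2013", "Jul-Sep 2013", "Oct-Dec 2013"]
table = wave_table()
for label, row in zip(periods, table):
    cells = [row.get(cohort, "").ljust(3) for cohort in range(1, 13)]
    print(label.ljust(13), " ".join(cells).rstrip())
```

Reading down a column follows a single cohort through its five interviews; reading across a row gives the five cohorts that make up one quarterly sample.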
Multivariate model
• Observations are monthly wave specific
estimates for waves j = 1, 2, ..., 5

$$y_t^j = Y_t + a^j + e_t^j$$
$$Y_t = L_t + S_t + I_t$$
$$L_t = L_{t-1} + R_{t-1} + w_t^L$$
$$R_t = R_{t-1} + w_t^R$$
$$S_t = -\sum_{i=1}^{10} S_{t-i} + w_t^S$$
$$I_t = w_t^I$$
$$w_t^L \sim N(0, \sigma_L^2), \quad w_t^R \sim N(0, \sigma_R^2), \quad w_t^S \sim N(0, \sigma_S^2), \quad w_t^I \sim N(0, \sigma_I^2)$$
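The level, slope and irregular equations above can be stacked into a single matrix recursion. The following is a minimal sketch of that idea for a reduced state (L, R, I) only; the seasonal component, the wave effects and the survey errors are left out to keep it short. The names TRANSITION, OBSERVATION, transition, observe and simulate are hypothetical, and the starting values are made up. It is not the dlm package's interface.

```python
import random

# Reduced state vector (L_t, R_t, I_t): level, slope and irregular.
TRANSITION = [
    [1.0, 1.0, 0.0],  # L_t = L_{t-1} + R_{t-1} + w^L_t
    [0.0, 1.0, 0.0],  # R_t = R_{t-1} + w^R_t
    [0.0, 0.0, 0.0],  # I_t = w^I_t (nothing carried over from t-1)
]
OBSERVATION = [1.0, 0.0, 1.0]  # picks out L_t + I_t


def transition(state_prev, w):
    """One step of the state recursion: TRANSITION @ state_prev + w."""
    return [sum(g * s for g, s in zip(row, state_prev)) + w_k
            for row, w_k in zip(TRANSITION, w)]


def observe(state):
    """Signal implied by the state (no seasonal or survey error here)."""
    return sum(f * s for f, s in zip(OBSERVATION, state))


def simulate(state0, sigmas, n_steps, seed=0):
    """Simulate n_steps observations with independent Gaussian disturbances."""
    rng = random.Random(seed)
    state, ys = list(state0), []
    for _ in range(n_steps):
        w = [rng.gauss(0.0, s) for s in sigmas]
        state = transition(state, w)
        ys.append(observe(state))
    return ys


# With all disturbance standard deviations set to zero the level simply
# grows by the slope each month.
path = simulate([10.0, 0.5, 0.0], sigmas=[0.0, 0.0, 0.0], n_steps=3)
print(path)
```

Setting non-zero values in sigmas turns the deterministic line into a stochastic trend plus irregular, which is the behaviour the disturbance terms in the equations allow for.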
Multivariate model
• Model for wave specific errors

$$e_t^j = \rho^j e_{t-3}^{j,t} + w_t^e$$

• For example

$$e_t^3 = \rho^3 e_{t-3}^{3,t} + w_t^e = \rho^3 e_{t-3}^{2} + w_t^e$$
[Diagram: wave specific errors e^1, ..., e^5 at times t-3, t-2, t-1 and t, with disturbances w1, ..., w5]
Multivariate model
• State Space Model

$$y_t^j = Y_t + a^j + e_t^j = L_t + S_t + I_t + a^j + e_t^j$$
$$y_t = F \alpha_t$$
$$\alpha_t = G \alpha_{t-1} + w_t$$

• State vector

$$\alpha_t = (\alpha_t^Y, \alpha_t^e)$$
$$\alpha_t^Y = (L_t, R_t, S_t, S_{t-1}, \ldots, S_{t-10}, I_t, a^2, a^3, \ldots, a^5)$$
$$\alpha_t^e = (e_t^1, e_t^2, \ldots, e_t^5, e_{t-1}^1, e_{t-1}^2, \ldots, e_{t-1}^5, e_{t-2}^1, e_{t-2}^2, \ldots, e_{t-2}^5)$$
Pseudo Survey Error Autocorrelation
• Estimates based on Pfeffermann et al (1998)
Monthly unemployment UK aged 16+
spatstat: visualising crime data
• Adrian Baddeley, Rolf Turner (2005). spatstat: An R
Package for Analyzing Spatial Point Patterns. Journal
of Statistical Software 12(6), 1-42. URL
http://www.jstatsoft.org/v12/i06/
• Crime data is currently released by police.uk at
postcode level
• Rich data source, but current presentation could be
improved to enable better understanding of trends
and complex patterns
• Kernel smoothing done in R, using the spatstat
package
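To make the idea of kernel smoothing concrete, here is a minimal sketch of a Gaussian kernel intensity estimate on a regular grid. It is an illustration only: the coordinates are made up, the function name kernel_intensity is hypothetical, there is no edge correction, and it is not the spatstat implementation (in spatstat, smoothing of a point pattern is provided through its density method).

```python
import math


def kernel_intensity(points, grid_x, grid_y, bandwidth):
    """Gaussian kernel estimate of events per unit area at each grid node.

    points    : list of (x, y) event locations
    grid_x/y  : coordinates of the grid nodes
    bandwidth : standard deviation of the isotropic Gaussian kernel
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    surface = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            total = 0.0
            for px, py in points:
                d2 = (gx - px) ** 2 + (gy - py) ** 2
                total += math.exp(-d2 / (2.0 * bandwidth ** 2))
            row.append(norm * total)
        surface.append(row)
    return surface


# Made-up events: a small cluster near the origin and one isolated event.
events = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.0, 2.0)]
grid = [0.0, 1.0, 2.0]
surface = kernel_intensity(events, grid, grid, bandwidth=0.5)
for row in surface:
    print(["%.3f" % value for value in row])
```

The bandwidth controls how far each event spreads its influence: a small value keeps individual hotspots sharp, while a larger value gives a smoother surface that shows broader patterns.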
Vehicle crime in Greater London
[Map of vehicle crime in Greater London; labelled locations: Shepherd's Bush, Ilford]
New methods for small area estimation
• Early research ongoing into new methods of
estimating income at small areas (MSOA level)
• Method proposed by Molina and Rao (2010) has
been implemented in R as part of this research
Difficult to implement in our standard software
Preferred tool for academics involved
• Potentially allows production of quantiles of income,
such as the median, at MSOA level which were not
previously available
R in production of a National Statistic
• 2010 review of mortality rates estimation
Recommended use of 2-dimensional p-spline
Method not available in standard software
Carlo G. Camarda (2012). MortalitySmooth: An R
Package for Smoothing Poisson Counts with P-Splines.
Journal of Statistical Software, 50(1), 1-24.
URL http://www.jstatsoft.org/v50/i01/
Unsmoothed mortality improvement rates for females in the UK
[Figure: improvement rate plotted by age (0 to 100) and year (1960 to 2010)]
Smoothed mortality improvement rates for females in the UK
Testing R for production
• Evaluation of survey (Lumley, 2004 and 2010),
and ReGenesees (ISTAT, 2014) e.g. zero hours
contracts, business surveys
• Statistical functions in CORD
Benchmarking function (Dagum & Cholette, 2006)
Forecasting?
Splining?
Seasonal adjustment?
• … other areas of the generic statistical
business process model
What would we like to learn?
• What barriers to using R have there been in
your organisations and how have you
overcome them?
• Experience of organisations where R has
been used for systems development (eg
hosting/calling from servers/integrating with
other software)
• Functions relevant for National Accounts
• R and big data
References
• Adrian Baddeley, Rolf Turner (2005). spatstat: An R Package for Analyzing Spatial
Point Patterns. Journal of Statistical Software, 12(6), 1-42. URL
http://www.jstatsoft.org/v12/i06/
• Carlo G. Camarda (2012). MortalitySmooth: An R Package for Smoothing Poisson
Counts with P-Splines. Journal of Statistical Software, 50(1), 1-24. URL
http://www.jstatsoft.org/v50/i01/
• Estela Bee Dagum & Pierre A. Cholette (2006). Benchmarking, Temporal Distribution,
and Reconciliation Methods for Time Series. Lecture Notes in Statistics 186, Springer.
• ISTAT (2014). http://www.istat.it/it/strumenti/metodi-e-software/software/regenesees
• T. Lumley (2012). "survey: analysis of complex survey samples". R package version
3.28-2.
• T. Lumley (2004). Analysis of complex survey samples. Journal of Statistical Software,
9(1), 1-19.
• Molina, I. and Rao, J.N.K. (2010). 'Small area estimation of poverty indicators'. Canadian
Journal of Statistics, Vol. 38 No. 3, pp. 369-385.
• Giovanni Petris (2010). An R Package for Dynamic Linear Models. Journal of Statistical
Software, 36(12), 1-16. URL http://www.jstatsoft.org/v36/i12/
• Pfeffermann, D., Feder, M. and Signorelli, D. (1998). 'Estimation of autocorrelations of
survey errors with application to trend estimation in small samples'. Journal of Business
and Economic Statistics, Vol. 16, pp. 339-348.
Thank you
Contact:
[email protected]
+44 16 33 45 56 20
Acknowledgements:
Kieran Martin, Ria Sanderson, Daniel Ayoubkhani & Gary Brown