GPU-Accelerated Bayesian Learning and Forecasting in
Simultaneous Graphical Dynamic Linear Models
Lutz F. Gruber∗
Department of Mathematics, Chair of Mathematical Statistics
Technical University of Munich, Munich, Germany
[email protected]

Mike West†
Department of Statistical Science
Duke University, Durham, North Carolina, USA
[email protected]
July 2, 2014
Abstract
We discuss modeling and GPU-based computation in a new class of multivariate dynamic
models customized to learning and prediction with increasingly high-dimensional time series. This defines an approach to decoupling analysis into a parallel set of univariate time
series dynamic models, while flexibly modeling cross-series relationships in a novel, induced
class of time-varying graphical models of multivariate volatility matrices. A novel decoupling/recoupling computational strategy allows for exact analysis and simulation of univariate time series models that are then coherently linked to represent the full multivariate
model. The computational innovations use importance sampling and variational Bayes ideas,
and the overall strategy is ideally suited to GPU implementation. The analysis is enormously
scalable with time series dimension, as we demonstrate in an analysis of a 400-dimensional
financial time series.
Keywords: Decoupling models; High-dimensional time series; Importance sampling; Parallel
computing; Recoupling models; Variational Bayes.
∗ Research developed as a Fulbright Visiting Scholar in the Department of Statistical Science at Duke University, partly supported under the Fulbright Program for Foreign Students sponsored by U.S. Department of State.
† Research partially supported by the US National Science Foundation [grant DMS-1106516].
Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the Fulbright Foundation or the NSF.
1. Introduction
Challenges of scaling on-line Bayesian analyses of multivariate dynamic models to increasingly high dimensions define a need for new modeling and computational strategies. Sequential
analysis that is both computationally accessible and sensitive in characterizing the dynamic,
multivariate structure of increasingly many series is of interest in many application areas, such
as macroeconomics (Koop 2012; Nakajima and West 2013a), financial portfolio studies (Aguilar
and West 2000; Zhou et al. 2014), commercial demand, sales and consumer interest forecasting (Agarwal et al. 2010), large-scale networks in energy demand forecasting (Hosking et al.
2013), and biomedical areas including the neurosciences (Trejo et al. 2007; Prado 2010).
We address these challenges with a novel class of Simultaneous Graphical Dynamic Linear Models
(SGDLMs) and GPU-based distributed implementations. This approach:
• Uses fast, efficient sequential learning in a set of decoupled univariate dynamic linear models that capture cross-series contemporaneous dependencies via a sparse and dynamic
simultaneous equations formulation;
• Recouples these parallel univariate analyses using importance reweighting of sets of direct
simulations from the univariate models, to define coherent inference and forecasting of
the multivariate series; and
• Moves ahead in time using decoupled models based on a variational Bayes strategy.
On modeling: The SGDLM allows for flexible time-varying parameter, state-space models of each
individual series. If analyzed separately, each of these is amenable to standard, closed-form
sequential filtering and forecasting (Prado and West 2010; West and Harrison 1997), so such analysis
is highly efficient. SGDLMs further integrate a novel specification of multivariate dependencies
across series at each time point, based on the simultaneous equations representation. This contributes a new model class to the growing field of sparse modeling of volatility matrices. Unlike
Cholesky-style and factor models (e.g. Aguilar and West 2000; Nakajima and West 2013b),
SGDLMs do not require an ordering of the univariate series, which can be an obstacle to model
specification and is a major question when dealing with many time series. Compared to dynamic graphical models of precision matrices (e.g. Quintana and West 1987; Carvalho and West
2007; Wang and West 2009), SGDLMs are immensely more scalable computationally, and substantially more flexible in allowing for patterns of change in multivariate volatility matrices that
are specified implicitly via state-space models for sets of simultaneous regression parameters.
On computation: The recoupling analysis uses exact, within-series simulation of states and parameters for each time step, combined with importance sampling (IS) reweighting for inference
and prediction of the multivariate series. We then exploit an efficient and accurate variational
Bayes (VB) strategy (e.g. Jaakkola and Jordan 2000, and West and Harrison 1997, sect. 12.3.4)
to decouple and proceed to the next time point. Importantly, we monitor and adjust for the VB
approximation using information from the IS step.
A main interest is in scaling-up the number of time series, and here we exploit (C++/CUDA)
GPU computation (Suchard et al. 2010; Lee et al. 2010). GPU computation is ideally
suited to the analysis. Univariate time series model updates and simulation are performed in
parallel across the series at each time point, and then recoupled for coherent inference and forecasting before decoupling again to move to the next time point and re-parallelize. The overall modeling and computational strategy is enormously scalable with time series dimension as
a result. We discuss computational demands in general, illustrate the GPU implementation, and
demonstrate that this allows for an involved, real-time Bayesian analysis of a 400-dimensional
daily stock return time series.
2. SGDLMs

2.1 Model Structure
The m-dimensional time series $y_t = (y_{1t}, \ldots, y_{mt})'$, $t = 1, 2, \ldots$, is modeled as follows. At each time t, the real-valued, scalar components $y_{jt}$ follow univariate DLMs (West and Harrison 1997; Prado and West 2010)
$$ y_{jt} = x_{jt}'\phi_{jt} + y_{sp(j),t}'\gamma_{jt} + \nu_{jt} = F_{jt}'\theta_{jt} + \nu_{jt}, \qquad j = 1:m, \qquad (1) $$
where:
• xjt is a known column vector of predictors or constants, with corresponding dynamic regression coefficients in the column state vector φjt , each of dimension pjφ .
• The index set sp(j) ⊆ {1 : m}\{j} identifies the pjγ = |sp(j)| simultaneous parents of series
j in the column vector ysp(j),t , with corresponding state vector γ jt .
• The state and regression vectors are catenated as
$$ \theta_{jt} = \begin{pmatrix} \phi_{jt} \\ \gamma_{jt} \end{pmatrix} \quad\text{and}\quad F_{jt} = \begin{pmatrix} x_{jt} \\ y_{sp(j),t} \end{pmatrix}, $$
each of dimension $p_j = p_{j\phi} + p_{j\gamma}$; at each t, we denote them collectively by $\Theta_t = \theta_{1:m,t}$.
• The observation noise terms $\nu_{jt} \sim N(0, \lambda_{jt}^{-1})$ have precisions $\lambda_{jt}$ and are independent across series and over time.
Each of the state vectors θjt will follow some specified state evolution models, such as vector
random walks, TV-VAR models, etc., depending on context.
The model defines a set of coupled, dynamic simultaneous equations that cohere based on
the independence of the νjt . The parental sets sp(j) allow for sparse modeling of contemporaneous structure; with larger m, we will select small parental sets that inherently define sparse
graphical model structures. This draws on the use of simultaneous autoregressive (SAR) models
in spatial statistics (e.g. Mukherjee et al. 2014). The result is seen in the immediate deduction
that
$$ y_t \sim N(A_t\mu_t,\, \Sigma_t) \qquad (2) $$
where
$$ A_t = (I - \Gamma_t)^{-1} \quad\text{and}\quad \Sigma_t^{-1} \equiv \Omega_t = (I - \Gamma_t)'\,\Lambda_t\,(I - \Gamma_t) \qquad (3) $$
and: (i) $\mu_t = (\mu_{1t}, \ldots, \mu_{mt})'$ where $\mu_{jt} = x_{jt}'\phi_{jt}$ for each j = 1 : m; (ii) $\Lambda_t = \mathrm{diag}(\lambda_{1t}, \ldots, \lambda_{mt})$; and (iii) the dynamic simultaneous coefficient matrix is defined as
$$ \Gamma_t = \begin{pmatrix}
0 & \gamma_{1,2,t} & \gamma_{1,3,t} & \cdots & \gamma_{1,m,t} \\
\gamma_{2,1,t} & 0 & \gamma_{2,3,t} & \cdots & \gamma_{2,m,t} \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
\gamma_{m-1,1,t} & \cdots & \gamma_{m-1,m-2,t} & 0 & \gamma_{m-1,m,t} \\
\gamma_{m,1,t} & \gamma_{m,2,t} & \cdots & \gamma_{m,m-1,t} & 0
\end{pmatrix} $$
with entries $\gamma_{j,h,t} = 0$ for each $h \notin sp(j)$, and implicitly indexing the elements of each $\gamma_{jt}$ against
their partner parents in sp(j) to define the time-varying, non-zero entries. A sparse model has a
sparse simultaneous coefficient matrix Γt , with a chosen and fixed set of zeros and with dynamic
elements in other entries.
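To spell out the deduction of eqns. (2, 3): stacking the m equations of eqn. (1) across series gives
$$ y_t = \mu_t + \Gamma_t y_t + \nu_t, \qquad \nu_t \sim N(0, \Lambda_t^{-1}), $$
so that $(I - \Gamma_t)y_t = \mu_t + \nu_t$ and hence $y_t = A_t(\mu_t + \nu_t)$ with $A_t = (I - \Gamma_t)^{-1}$. The covariance matrix is then $\Sigma_t = A_t\Lambda_t^{-1}A_t'$, whose inverse is the precision matrix $\Omega_t = (I - \Gamma_t)'\Lambda_t(I - \Gamma_t)$ of eqn. (3).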
The cross-series dependence characteristics of the covariance matrix Σt derive from the precision matrix Ωt , that being implied by the simultaneous parental structure in Γt . When parental
sets are very sparse relative to m, the zeros in Γt generally result in a sparse implied precision
matrix Ωt , with non-zero and time-varying entries elsewhere, i.e., a sparse, dynamic graphical
model– hence we refer to the multivariate coupled DLMs as an SGDLM. Practical relevance will
typically mean that |sp(j)| ∼ k for some (relatively small) fixed k, so models with moderate and
large m will be very sparse indeed; the effective number of non-zero parameters underlying the
m(m + 1)/2 elements of the volatility matrix Ωt is then just m(k + 1).
2.2
Challenges to Efficient Sequential Learning
The univariate DLM for yjt is a linear, normal state-space model for θ jt that can be coupled
with the traditional variance discount model for stochastic volatilities λjt . This yields the standard forward filtering and forecasting analysis in closed form (West and Harrison 1997). At
the heart of this are sequentially updated normal/gamma priors p(θ jt , λjt |Dt−1 ), and posteriors
p(θ jt , λjt |Dt ), where Dt = {yt , It , Dt−1 } represents all information available at time t; from
time t − 1 to t, this is updated with the most recent observation yt and any additional subjective
information It that may be incorporated (e.g. West and Harrison 1989, 1997, chapt. 11).
The full multivariate model raises computational challenges evident in the non-linearities in
state vectors in eqns. (2,3). However, there is some simplification and structure, again paralleling forms arising in spatial SAR models. It is easily shown that the time t likelihood function for
Θt , Λt , directly derived from the m-variate normal density of eqn. (2), can be reduced to
$$ p(y_t \mid \Theta_t, \Lambda_t) \propto |I - \Gamma_t| \prod_{j=1:m} p(y_{jt} \mid \theta_{jt}, \lambda_{jt}) \qquad (4) $$
where the product is of normal density terms from individual models $y_{jt} \sim N(F_{jt}'\theta_{jt},\, 1/\lambda_{jt})$.
The determinant factor here induces the computational challenges. Apart from the special
cases of compositional models (Zhao and West 2014) in which Γt is triangular with a diagonal
of zeros and so |I − Γt | = 1, this term contributes to the likelihood for the γ jt vectors. In very
sparse models, it can tend to be quite diffuse compared to the product of individual likelihood
terms, but generally it matters; it arises theoretically and ensures coherence of the multivariate
analysis– reflecting the parametric constraints on Γt that are inherently required to ensure positive definiteness and symmetry of Ωt . Below in our financial time series example we demonstrate
some of the negative practical consequences of simply ignoring this.
That said, the form of eqn. (4) suggests opportunity to exploit separate, parallel analyses of
each series in order to define an effective sequential computational strategy for the full multivariate model. This observation underlies our decoupling/recoupling computational approach.
The naive approximation of ignoring the determinant correcting factor is akin to running a
univariate DLM on each series individually. Our strategy improves on this while retaining the
analytical tractability of the time evolution and update steps.
3. Model Decoupling/Recoupling for Bayesian Computation

3.1 Model Emulation: Decoupling for Forward Filtering
The above, intuition-building discussion suggests that a bank of m decoupled, univariate
models updated in parallel at time t will define a useful emulator of the full multivariate model
at that time point. This underlies our decoupling/recoupling strategy.
A. Time t − 1 Posterior: Suppose the time t − 1 posterior for the full set of m states and volatilities $\Theta_{t-1}, \Lambda_{t-1}$ is approximated– or, better, emulated– by a product of m multivariate normal-gamma distributions. That is,
$$ p(\Theta_{t-1}, \Lambda_{t-1} \mid D_{t-1}) = \prod_{j=1:m} p_{j,t-1}(\theta_{j,t-1}, \lambda_{j,t-1} \mid D_{t-1}) \qquad (5) $$
with components
$$ (\theta_{j,t-1}, \lambda_{j,t-1} \mid D_{t-1}) \sim NG(m_{j,t-1},\, C_{j,t-1}/s_{j,t-1},\, n_{j,t-1}/2,\, n_{j,t-1}s_{j,t-1}/2) \qquad (6) $$
where the notation (Prado and West 2010) represents the normal-gamma distribution
$$ (\theta_{j,t-1} \mid \lambda_{j,t-1}, D_{t-1}) \sim N(m_{j,t-1},\, C_{j,t-1}/(s_{j,t-1}\lambda_{j,t-1})), \qquad (\lambda_{j,t-1} \mid D_{t-1}) \sim G(n_{j,t-1}/2,\, n_{j,t-1}s_{j,t-1}/2) $$
in which G(n, ns) denotes a gamma distribution with shape n and rate ns, hence mean n/(ns) = 1/s. The implied margin for the state vector is $p_j$-variate T with $n_{j,t-1}$ degrees of freedom and scale matrix $C_{j,t-1}$.
B. State-Space Evolution to Time t: The individual states (θ jt , λjt ) are modeled as evolving according to standard linear, normal state-space models with stochastic volatility discounting (West and Harrison 1997; Prado and West 2010), independently across j = 1 : m. With It
representing the information defining the evolution model, this defines the conditional distributions
(θ jt , λjt |θ j,t−1 , λj,t−1 , It , Dt−1 )
with standard details summarized in Appendix A.1.
For each j = 1 : m, we have known state evolution matrices Gjt , state innovations variance
matrices Wjt , and volatility discount factors βj , that define these evolution models. One key
result to note here is that the resulting time t prior is the “evolved” version of eqn. (5) given by
$$ p(\Theta_t, \Lambda_t \mid I_t, D_{t-1}) = \prod_{j=1:m} p(\theta_{jt}, \lambda_{jt} \mid I_t, D_{t-1}) \qquad (7) $$
with components
$$ (\theta_{jt}, \lambda_{jt} \mid I_t, D_{t-1}) \sim NG(a_{jt},\, R_{jt}/s_{j,t-1},\, r_{jt}/2,\, r_{jt}s_{j,t-1}/2) \qquad (8) $$
where $a_{jt} = G_{jt}m_{j,t-1}$, $R_{jt} = G_{jt}C_{j,t-1}G_{jt}' + W_{jt}$ and $r_{jt} = \beta_j n_{j,t-1}$.
C. Update to Time t Posterior: The time t − 1 → t filtering is completed by updating the prior
in eqn. (7) via the likelihood function in eqn. (4). We now see that, apart from the determinant
term, the posterior is dominated by the product of independent posteriors across the series, with the latter being defined by standard updates (West and Harrison 1997). That is, the posterior is
$$ p(\Theta_t, \Lambda_t \mid D_t) \propto |I - \Gamma_t| \prod_{j=1:m} \tilde{p}_{jt}(\theta_{jt}, \lambda_{jt}) \qquad (9) $$
where each $\tilde{p}_{jt}(\cdot)$ is the density of a conjugate normal-gamma
$$ (\theta_{jt}, \lambda_{jt}) \sim NG(\tilde{m}_{jt},\, \tilde{C}_{jt}/\tilde{s}_{jt},\, \tilde{n}_{jt}/2,\, \tilde{n}_{jt}\tilde{s}_{jt}/2), \qquad (10) $$
with $\tilde{m}_{jt}, \tilde{C}_{jt}, \tilde{n}_{jt}, \tilde{s}_{jt}$ given by the standard updating equations; see full details in Appendix A.2.
3.2 Model Emulation: Direct and Importance Sampling
Expecting the determinant term of eqn. (9) to be generally dominated by the product of
density functions of eqn. (10) over their range, it is clear that this product will be a close approximation of the exact posterior. This suggests using the set of m decoupled conjugate posteriors
of eqn. (10) to emulate the exact posterior, and we utilize it as an importance sampling proposal
distribution. We can trivially sample directly from the independent, conjugate normal-gammas–
in parallel and independently– and then combine the samples across series; this yields importance samples with importance weights, from eqn. (9), proportional to |I − Γt |. Explicitly:
1. For each series j = 1 : m in parallel and independently, draw random samples of size N, $(\theta_{jt}^{(i)}, \lambda_{jt}^{(i)})$, $i = 1:N$, from $\tilde{p}_{jt}(\cdot)$.
2. Combine these marginal samples to define multivariate importance samples $(\Theta_t^{(i)}, \Lambda_t^{(i)})$.
3. Compute the importance weights $\alpha_{it} \propto |I - \Gamma_t^{(i)}|$ and normalize so that $\sum_{i=1:N} \alpha_{it} = 1$.
4. Assess the Monte Carlo accuracy of the importance sample using standard methods, including the effective sample size $S_N = 1/\sum_{i=1:N} \alpha_{it}^2$; an illustrative computation follows this list.
5. Use the importance sample for Monte Carlo inference on current states, volatilities, and for forecasting ahead from time t. We comment further on forecasting below.
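As a concrete illustration of steps 3 and 4, the following host-side C++ sketch converts log absolute determinants $\log|I - \Gamma_t^{(i)}|$ into normalized weights, the effective sample size $S_N$, and the relative entropy $H_N$ used in Section 3.3.2. It is illustrative only (the paper's implementation computes these quantities on the GPU); the max-subtraction is a standard numerical safeguard.

```cpp
// Illustrative host-side sketch: normalized importance weights, effective sample
// size S_N and relative entropy H_N from log |det(I - Gamma_t^{(i)})| values.
#include <vector>
#include <cmath>
#include <algorithm>

struct ISWeightSummary { std::vector<double> alpha; double ess; double entropy; };

ISWeightSummary importance_weights(const std::vector<double>& logdet) {
  const int N = static_cast<int>(logdet.size());
  ISWeightSummary out;
  out.alpha.resize(N);
  // Subtract the maximum before exponentiating to avoid overflow/underflow.
  double mx = *std::max_element(logdet.begin(), logdet.end());
  double sum = 0.0;
  for (int i = 0; i < N; ++i) { out.alpha[i] = std::exp(logdet[i] - mx); sum += out.alpha[i]; }
  double sumsq = 0.0, ent = 0.0;
  for (int i = 0; i < N; ++i) {
    out.alpha[i] /= sum;                           // normalize: sum_i alpha_i = 1
    sumsq += out.alpha[i] * out.alpha[i];
    if (out.alpha[i] > 0.0) ent += out.alpha[i] * std::log(N * out.alpha[i]);
  }
  out.ess = 1.0 / sumsq;                           // S_N = 1 / sum_i alpha_i^2
  out.entropy = ent;                               // H_N = sum_i alpha_i log(N alpha_i)
  return out;
}
```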
Note that we can massively parallelize when m is large. Also, the importance weights do not
depend on the φjt , λjt , only on the simultaneous coefficients γ jt . This reflects the view that the
parallel conjugate models will effectively emulate the full multivariate model in inferring series-specific states and volatilities, while some “corrections” will be needed for inferences on the
parental states that explicitly define cross-series structure. Importance sampling is the natural
and elegant approach to making these corrections. The series/models are decoupled for updates
and direct simulation, and then recoupled for importance sampling targeting the exact posterior.
3.3 Variational Bayes Posterior Decoupling

3.3.1. Mean Field Posterior
Completing the evolution-update steps from time t − 1 to time t, we now see that moving this
analysis along to time t + 1 and the future will require morphing the importance sample-based
posterior for (Θt , Λt |Dt ) to a set of m, decoupled conjugate normal-gammas as in eqns. (5,6)
but with t − 1 updated to t. This will then complete the cycle and define an overall sequential
analysis strategy. We do this via a variational Bayes (VB) step, in fact a mean-field approximation (Jaakkola and Jordan 2000). On an historical note, both the conceptual basis and
technical aspects of this kind of mapping to sets of conjugate forms have a long history in the
Bayesian dynamic modeling and forecasting literature (e.g. Harrison and Stevens 1971; Alspach
and Sorenson 1972; Harrison and Stevens 1976; Smith and West 1983; West and Harrison 1997,
sect. 12.3.4).
The VB decoupling replaces the importance sample representation of the posterior p(Θt , Λt |Dt )
with the Kullback-Leibler (KL) “nearest” product of normal-gammas across the m series. Again
the intuition is that the product of the m independent terms will generally dominate the posterior; now, for moving ahead to time t + 1 and in view of the interests in computational efficacy
enabled by analytic tractability, we cut back to the latter form, using KL divergence-based optimization of the product form. That is, we now choose the m sets of parameters {mjt , Cjt , njt , sjt }
to define the time t product form posterior as in eqns. (5,6) with time index t − 1 updated to t.
It is easily shown that the KL divergence of such a product of conjugate forms from the exact
posterior p(Θt , Λt |Dt ) is minimized via:
• $m_{jt} = E[\lambda_{jt}\theta_{jt}]/E[\lambda_{jt}]$;
• $V_{jt} = E[\lambda_{jt}(\theta_{jt} - m_{jt})(\theta_{jt} - m_{jt})']$;
• $d_{jt} = E[\lambda_{jt}(\theta_{jt} - m_{jt})'V_{jt}^{-1}(\theta_{jt} - m_{jt})]$;
• $n_{jt}$ such that $\log(n_{jt} + p_j - d_{jt}) - \psi(n_{jt}/2) - (p_j - d_{jt})/n_{jt} - \log(2E[\lambda_{jt}]) + E[\log\lambda_{jt}] = 0$;
• $s_{jt} = (n_{jt} + p_j - d_{jt})/(n_{jt}E[\lambda_{jt}])$;
• $C_{jt} = s_{jt}V_{jt}$.
Here the expectations E[·] are with respect to the importance sampling-based posterior, so that
these parameters can be easily computed. The noteworthy cases are the njt , whose values are
trivially computed via an iterative numerical approach; see Appendix B.2.
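To make the moment computations concrete, the following host-side C++ sketch evaluates $m_{jt}$, $V_{jt}$, $d_{jt}$ and the required expectations of $\lambda_{jt}$ for one series from the weighted importance sample; $n_{jt}$, $s_{jt}$ and $C_{jt}$ then follow as above and in Appendix B.2. Eigen is an assumption used purely for illustration; the paper's implementation uses custom CUDA kernels.

```cpp
// Illustrative sketch of the VB moment computations for one series j, given
// importance samples theta[i], lambda[i] with normalized weights alpha[i].
#include <Eigen/Dense>
#include <vector>
#include <cmath>

struct VBMoments {
  Eigen::VectorXd m;     // m_jt = E[lambda*theta]/E[lambda]
  Eigen::MatrixXd V;     // V_jt = E[lambda (theta-m)(theta-m)']
  double d;              // d_jt = E[lambda (theta-m)' V^{-1} (theta-m)]
  double Elambda;        // E[lambda_jt]
  double Eloglambda;     // E[log lambda_jt]
};

VBMoments vb_moments(const std::vector<Eigen::VectorXd>& theta,
                     const std::vector<double>& lambda,
                     const std::vector<double>& alpha) {
  const int N = static_cast<int>(theta.size());
  const int p = static_cast<int>(theta[0].size());
  VBMoments out;
  out.Elambda = 0.0; out.Eloglambda = 0.0;
  Eigen::VectorXd Elamtheta = Eigen::VectorXd::Zero(p);
  for (int i = 0; i < N; ++i) {
    out.Elambda    += alpha[i] * lambda[i];
    out.Eloglambda += alpha[i] * std::log(lambda[i]);
    Elamtheta      += alpha[i] * lambda[i] * theta[i];
  }
  out.m = Elamtheta / out.Elambda;
  out.V = Eigen::MatrixXd::Zero(p, p);
  for (int i = 0; i < N; ++i) {
    Eigen::VectorXd c = theta[i] - out.m;
    out.V += alpha[i] * lambda[i] * c * c.transpose();
  }
  Eigen::MatrixXd Vinv = out.V.inverse();
  out.d = 0.0;
  for (int i = 0; i < N; ++i) {
    Eigen::VectorXd c = theta[i] - out.m;
    out.d += alpha[i] * lambda[i] * c.dot(Vinv * c);
  }
  // n_jt then solves the implicit equation above (Newton-Raphson, Appendix B.2),
  // after which s_jt = (n_jt + p - d_jt)/(n_jt * Elambda) and C_jt = s_jt * V_jt.
  return out;
}
```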
3.3.2. Comments on KL Divergence and Simulation-VB Strategy
Recall that, for any random quantity x, the KL divergence of a distribution with density g(x) from one with density p(x) is $\mathrm{KL}_{g|p} = -E_p[\log\{g(x)/p(x)\}]$. Here the densities are continuous, discrete or mixed and have common support, and $E_p[\cdot]$ refers to expectation under p(·). We make
the following observations connected to the above use of the KL minimizing VB strategy.
1. Denote by $\tilde{p}(\Theta_t, \Lambda_t \mid D_t)$ the product of the initial set of m independent normal-gamma posteriors in eqn. (10). The importance sampling weights $\alpha_{1:N,t}$ provide a direct assessment of the divergence $\mathrm{KL}_{\tilde{p}|p}$ of $\tilde{p}(\cdot)$ from the exact posterior p(·) of eqn. (9), as follows. Write $H_N = \sum_{i=1:N}\alpha_{it}\log(N\alpha_{it})$, the entropy of the importance sampling weights $\alpha_{it}$ relative to a set of N uniform weights– one measure of efficacy of importance samplers (e.g. West 1993). It can be easily shown that, as $N \to \infty$, $H_N \to \mathrm{KL}_{\tilde{p}|p}$, so the relative entropy gives a direct estimate of the KL divergence.
It can also be shown that $H_N \le N\sum_{i=1:N}\alpha_{it}^2 - 1 = N/S_N - 1$ where $S_N$ is the effective sample size; for large N, the limiting value of $S_N/N$ is bounded from above by $1/(1 + \mathrm{KL}_{\tilde{p}|p})$.
2. The VB-based posterior defined in Section 3.3.1 represents an improvement over the initial
$\tilde{p}(\cdot)$ as it minimizes $\mathrm{KL}_{g|p}$ over all $g(\Theta_t, \Lambda_t)$ that are products of m normal-gamma forms
for the (θ jt , λjt ), j = 1 : m. Hence HN gives an estimate of the upper bound of the minimized divergence; if HN is already small– based on calibrating to effective sample size as
above– then we are assured of closeness of the revised VB-based posterior approximation.
3. The KL divergence of any subset of parameters (Θt , Λt ) cannot exceed the divergence on
the full set. Hence the latter is an operational upper bound on divergences of the approximating marginal posteriors in any of the m individual models. If the overall approximation
is good, we do not have to monitor the margins.
4. In evolution from time t to t + 1, the KL-optimized product of normal-gamma posteriors for
(Θt , Λt |Dt ) evolves to a similar analytic form for the time t + 1 prior p(Θt+1 , Λt+1 |It+1 , Dt )
of eqn. (7) with t updated to t + 1. We know that KL divergence decreases through convolutions. Hence this normal-gamma product for (Θt+1 , Λt+1 ) is a better approximation of the exact time t + 1 prior than the time t product was of the exact time t posterior. If
we have a high-quality posterior approximation at time t, then the situation only improves
following evolution. This is important to bear in mind in forecasting, as now follows.
3.4 Forecast Computations
The simulation analysis underlies computation of aspects of forecast distributions to predict
ahead from time t. We summarize this here for 1−step ahead forecasts, which underlie our
example below. We sample the predictives p(yt+1 |It+1 , Dt ), and repeat the computations as time
evolves. The simulations, which can be partially parallelized, are as follows:
1. Directly simulate the time t + 1 states from the analytic product of normal-gamma distributions that follows the variational Bayes (VB) decoupling step of Section 3.3. The VB step approximates the posterior p(Θt , Λt |Dt ) by a set of m decoupled conjugate normal-gammas as in eqns. (5,6) but with t − 1 updated to t. As a result, evolution to time t + 1 implies the product of normal-gamma distributions for (Θt+1 , Λt+1 |It+1 , Dt ) as in eqn. (7) but with t updated to t + 1. Simulate directly from these, in parallel, to generate m Monte Carlo samples of the same size M that, when recoupled, define the full simulation sample $(\Theta_{t+1}^{(i)}, \Lambda_{t+1}^{(i)})$, i = 1 : M.
2. Compute the implied terms $A_{t+1}^{(i)}, \mu_{t+1}^{(i)}, \Sigma_{t+1}^{(i)}$ from eqn. (3) with time t updated to t + 1.
3. For each i = 1 : M, sample $y_{t+1}^{(i)} \sim N(A_{t+1}^{(i)}\mu_{t+1}^{(i)},\, \Sigma_{t+1}^{(i)})$, as sketched below.
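As a minimal illustration of step 3, here is a single forecast draw given one simulated $(A_{t+1}, \mu_{t+1}, \Sigma_{t+1})$, with Eigen and the standard C++ random library standing in for the paper's GPU-based CURAND/CUBLAS computations.

```cpp
// Illustrative sketch: one forecast draw y ~ N(A*mu, Sigma) for a simulated
// (A_{t+1}, mu_{t+1}, Sigma_{t+1}); Eigen and std::mt19937 are stand-ins for
// the paper's GPU code.
#include <Eigen/Dense>
#include <random>

Eigen::VectorXd sample_forecast(const Eigen::MatrixXd& A, const Eigen::VectorXd& mu,
                                const Eigen::MatrixXd& Sigma, std::mt19937& rng) {
  std::normal_distribution<double> std_normal(0.0, 1.0);
  Eigen::VectorXd z(mu.size());
  for (int i = 0; i < z.size(); ++i) z(i) = std_normal(rng);   // iid N(0,1) draws
  Eigen::MatrixXd L = Sigma.llt().matrixL();                    // Cholesky: Sigma = L L'
  return A * mu + L * z;                                        // mean A*mu plus correlated noise
}
```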
In forecasting as earlier in filtering, we stress the use of decoupling to distribute simulations in
parallel across series, followed by recoupling to evaluate and sample the full multivariate model.
4. Implementation in C++/CUDA
We provide an overview of our GPU implementation of the complete filtering/forecasting
analysis, noting a number of technical aspects and requirements as well as giving some flavor of
the structure of C++/CUDA programming1 . This Section, and the code, contribute to the growing
body of literature linked to Bayesian (and other) statistical computations that are inherently
enabled via GPU implementations (e.g. Suchard et al. 2010; Lee et al. 2010).
4.1 Development of GPU-Accelerated Software
CUDA is an extension of the C programming language to operate on Nvidia CUDA-enabled
GPUs. Modern GPUs can feature hundreds of cores. These can be used to execute single instruction, multiple data (SIMD) operations in a massively parallel mode. In comparison, multi-core
desktop processors tend to have not more than eight cores. Whenever the same set of operations has to be performed many times on different data, GPU-accelerated implementations vastly
outperform their CPU equivalents, given their significantly higher number of cores.
Developing GPU-accelerated software is much like developing CPU software in C or C++. One
has to allocate and free memory, then implement program logic. The latter can use a combination of self-written functions (called kernels when executed on the GPU) and functions
provided by third-party libraries. The CUDA development toolkit comes with several libraries,
two of which we used heavily: CUBLAS and CURAND, which provide basic linear algebra routines
(BLAS) and random number generators, respectively, for CUDA-enabled GPUs.
1 Our implementation and code is based on a computer with an Nvidia CUDA-enabled graphics processing unit
(GPU) with a compute capability of at least 2.0. Interested users may run the code, so long as CUDA runtime,
CUBLAS and CURAND libraries of Version 5.5 or newer are installed. We also have a user-friendly Matlab interface
compatible with Matlab versions as early as R2010a; though the C++/CUDA is free-standing, some researchers may
be interested in accessing GPU facilities via Matlab.
On top of massively parallel execution of SIMD operations in kernels, CUDA streams can
be used to organize sequences of operations. The operations within each stream are executed
sequentially, but different streams are executed concurrently on the GPU. Furthermore, different
streams can be assigned to different GPU devices, allowing multiple GPU computing. Concurrent
execution and data transfers allow the most efficient usage of GPU resources.
4.2 Key Features of CUDA Implementation
Wherever possible, our implementation uses batched functions from the CUBLAS library to
perform the linear algebra operations for forward filtering and forecasting. Batched functions
are a recent innovation introduced into the CUBLAS library. They accelerate operations on a
large number of small matrices by bundling them into a single function call. This reduces the
overhead of initiating many small operations on the GPU individually. In particular, we use:
• cublasDgemmBatched(...) for batched matrix-matrix multiplications,
• cublasDgetrfBatched(...) for batched LU factorization,
• cublasDgetriBatched(...) for batched matrix inversion.
We developed customized kernels to achieve maximum performance for operations that fall
outside the scope of current batched CUBLAS functions; see Appendix B for details.
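To give a flavor of these batched calls, the following hedged sketch inverts a batch of m × m matrices (such as the simulated $I - \Gamma_{t+1}^{(i)}$ in forecasting) with one LU factorization call and one inversion call. The buffer layout and wrapper function are illustrative assumptions, not the paper's code; error checking is omitted.

```cpp
// Hedged sketch: invert `batch` column-major m x m matrices stored contiguously in d_A,
// writing the inverses to d_Ainv, via batched CUBLAS LU factorization + inversion.
// Note that getrfBatched overwrites d_A with the LU factors.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

void batched_inverse(cublasHandle_t handle, int m, int batch, double* d_A, double* d_Ainv) {
  // CUBLAS batched routines take device arrays of device pointers, one per matrix.
  std::vector<double*> hA(batch), hC(batch);
  for (int i = 0; i < batch; ++i) {
    hA[i] = d_A    + static_cast<size_t>(i) * m * m;
    hC[i] = d_Ainv + static_cast<size_t>(i) * m * m;
  }
  double **dA_ptrs, **dC_ptrs; int *d_pivots, *d_info;
  cudaMalloc((void**)&dA_ptrs, batch * sizeof(double*));
  cudaMalloc((void**)&dC_ptrs, batch * sizeof(double*));
  cudaMalloc((void**)&d_pivots, static_cast<size_t>(batch) * m * sizeof(int));
  cudaMalloc((void**)&d_info, batch * sizeof(int));
  cudaMemcpy(dA_ptrs, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
  cudaMemcpy(dC_ptrs, hC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

  // One call LU-factorizes all matrices in place; a second call inverts them from the factors.
  cublasDgetrfBatched(handle, m, dA_ptrs, m, d_pivots, d_info, batch);
  cublasDgetriBatched(handle, m, dA_ptrs, m, d_pivots, dC_ptrs, m, d_info, batch);

  cudaFree(dA_ptrs); cudaFree(dC_ptrs); cudaFree(d_pivots); cudaFree(d_info);
}
```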
4.3 Multiple Device Parallelization
This section presents the key points regarding GPU parallelization using multiple GPU devices.
A call to the function cudaSetDevice(k) activates the k-th GPU device. This means all
future GPU calls will be executed on this device until the current device is changed by another
call to cudaSetDevice(...). Our implementation sets up one CUDA stream per GPU. In our
importance sampling steps to generate a posterior sample of size N and then forecast simulation
of M samples, each of K GPUs will compute N/K importance samples and then M/K direct
samples; this is parallelization by particles.
The key functions from the CUDA API that organize asynchronous GPU execution are:
• cudaStreamCreate(...) to create a stream,
• cudaMemcpyAsync(...) to asynchronously move data between computer and GPU,
• cudaStreamSynchronize(...) to wait until a stream’s execution is completed,
• cudaStreamDestroy(...) to destroy a stream.
Calls to CUDA kernels, most API function calls, and asynchronous memory copy operations return control to the host immediately. This allows us to send commands to different GPUs in a for
loop and still have them executed in parallel. However, it is necessary to wait until the operations
are completed before returning results; this is where the function cudaStreamSynchronize(...)
comes in. This operates after the GPU operations that are to be executed in parallel are sent to
their respective devices.
Below is an outline of a multiple device parallelized program.
Pseudo Code
1:  for GPU device k = 1 : K do
2:      Call cudaSetDevice(k) to activate the k-th GPU.
3:      Call cudaStreamCreate(stream k) to create a stream on the k-th GPU.
4:      Call cudaMemcpyAsync(..., stream k) to asynchronously copy input arguments onto device memory.
5:  end for
6:  for GPU device k = 1 : K do
7:      Call cudaSetDevice(k) to activate the k-th GPU.
8:      Call the Variational Bayes posterior estimation subroutine to generate N/K importance samples, and then the forecasting simulation subroutine to generate M forecast samples.
9:  end for
10: for GPU device k = 1 : K do
11:     Call cudaSetDevice(k) to activate the k-th GPU.
12:     Call cudaMemcpyAsync(..., stream k) to asynchronously copy results to host memory.
13: end for
14: for GPU device k = 1 : K do
15:     Call cudaSetDevice(k) to activate the k-th GPU.
16:     Call cudaStreamSynchronize(stream k) to wait until the computations and memory transfers are completed.
17:     Call cudaStreamDestroy(stream k) to destroy the stream on the k-th GPU.
18: end for
19: Combine and post-process the results on the CPU, if applicable.
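For concreteness, the loops above condense into the following hedged C++/CUDA sketch. The two subroutine names are placeholders, not the paper's actual API, and asynchronous host-device copies only overlap when the host buffers are page-locked.

```cpp
// Condensed C++/CUDA sketch of the multi-device pattern outlined above.
#include <cuda_runtime.h>
#include <vector>

void run_step_on_K_gpus(int K, double** h_in, double** d_in, size_t in_bytes,
                        double** h_out, double** d_out, size_t out_bytes) {
  std::vector<cudaStream_t> streams(K);
  for (int k = 0; k < K; ++k) {
    cudaSetDevice(k);                       // route subsequent calls to GPU k
    cudaStreamCreate(&streams[k]);
    cudaMemcpyAsync(d_in[k], h_in[k], in_bytes, cudaMemcpyHostToDevice, streams[k]);
  }
  for (int k = 0; k < K; ++k) {
    cudaSetDevice(k);
    // Placeholder calls: N/K importance samples, then forecast simulation, on stream k.
    // vb_update_subroutine(streams[k], ...);
    // forecast_subroutine(streams[k], ...);
    cudaMemcpyAsync(h_out[k], d_out[k], out_bytes, cudaMemcpyDeviceToHost, streams[k]);
  }
  for (int k = 0; k < K; ++k) {
    cudaSetDevice(k);
    cudaStreamSynchronize(streams[k]);      // block until GPU k has finished its work
    cudaStreamDestroy(streams[k]);
  }
  // Combine and post-process the K partial results on the CPU.
}
```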
4.4 Computational Load
In the importance sampling, forecasting and variational Bayes posterior decoupling, referred to in Line 8 of the above pseudo code, the Monte Carlo sample sizes N (for posterior updates) and M (for forecasting) will typically be very large for accurate posterior approximations. Hence scaling in N and M is critical.
The lead complexity of sampling the parameters Θt and Λt is $O(Nmp_{\max}^2)$ where $p_{\max} = \max_{j=1:m} p_j$. Computing the determinants $|I - \Gamma_t|$ is the most expensive operation, with a cost of $O(Nm^3)$. The remaining operations that scale with N are of order $O(Nmp_{\max}^2)$ or lower, so the overall load is $O(N(m^3 + mp_{\max}^2))$ as $N \to \infty$.
In the 1-step ahead forecasting, sampling the parameters Θt+1 and Λt+1 is of the same computational complexity as sampling from the posterior above, and to this we have to add the cost of generating the needed state and volatility innovations, $O(Mmp_{\max}^3)$ for a sample of size M. Inverting the matrix $I - \Gamma_{t+1}$ is most expensive at $O(Mm^3)$, so the overall computational cost of forecasting is $O(M(m^3 + mp_{\max}^2))$ as $M \to \infty$. Note that this is the same as for posterior updates when we choose M = N.
5. Evaluation in Analysis and Forecasting of Stock Returns

5.1 Data and Study Set-Up
We analyze daily log returns of m = 400 S&P stocks. In a simple class of SGDLMs, we evaluate
1−step ahead forecasts for both selected individual series and across all series. This includes a
study of the effect of ignoring the coupling of the set of simultaneous model equations, to bear
out the utility of the decoupling/recoupling strategy.
Our data cover m = 400 current members of the S&P 500 index, restricting to those 400 that
were continuously listed from October 2000 to October 2013, our study period. The data are
daily log returns (differences in daily log prices). We use the first 845 daily observations, T1 = 1 :
845 up to December 2003, for an initial exploratory analysis to define the simultaneous parental
sets for each of the m = 400 series. We then use the following 522 daily observations, T2 = 846 :
1367 from January 2004 through December 2005, as training data to select suitable discount
factors and to provide starting priors for the test data. The remaining 2044 observations, Ttest =
1368 : 3411 from January 2006 through October 2013, serve as the hold-out or test data to
evaluate sequential filtering and forecasting. In training and test data analyses, the Monte Carlo
sample size N for posterior updates as well as M for forecasting were set at N = M = 10000.
5.2 Model Structure and Specification via Training Data Analysis
SGDLM form: For each of the 400 univariate series, we use a local-level (random walk) DLM for a time-varying trend (West and Harrison 1997, chapt. 2), common in studies of financial returns (e.g. Aguilar and West 2000; Nakajima and West 2013b). This is coupled with a specified set of 10 simultaneous parents, whose coefficients also evolve via random walks. So, in eqn. (1) we have pjφ = 1, xjt = 1 and pjγ = 10. In the state evolution models of Appendix A.1, and eqn. (7), we then have Gjt = I, the 11-dimensional identity matrix, and a discount method specifies the evolution innovations variance matrices Wjt (West and Harrison 1997, chapt. 6).
At each t, $W_{jt} = G_{jt}\left(\Delta^{-1/2}\,C_{j,t-1}\,\Delta^{-1/2} - C_{j,t-1}\right)G_{jt}'$ where $\Delta = \mathrm{diag}(\delta_\phi,\, \delta_\gamma \mathbf{1}_{10}')$ in which: (i) $\mathbf{1}_{10}$ is a 10-vector of ones; (ii) $\delta_\phi, \delta_\gamma$ are two discount factors in (0, 1) that define the levels of discounting of historical information on the local level and simultaneous parental coefficients, respectively.
Simultaneous parental sets: We used the initial 3 years of data, T1 = 1 : 845 spanning October 2000 to December 2003, for
exploratory analysis to select the sp(j) for each j = 1 : 400. On this initial training data set, we
simply ran separate, univariate DLMs with a local level and with all the remaining 399 series as
simultaneous parents. We then chose the pjγ = 10 series to define each sp(j) by selecting those with
the largest estimated effect sizes over the later part of the training data.
The set of simultaneous parents obviously plays a key role in model fit and forecasting. Here,
however, we emphasize that more formal model selection focused on this is not a theme in the
current paper. In practice, we operate with a chosen set of parental sets over given periods
of time, refreshing/modifying the parental sets periodically via off-line analysis, and/or using
multiple such sets– a restricted number– and engaging in model averaging. A future paper will
address this. The current paper takes the parental sets as given, the focus and contributions
involving the SGDLM modeling innovation and the decoupling/recoupling computational strategy. We show in this study that the enhancements under this strategy are not merely theoretical
considerations, but significantly improve forecasting in this m = 400−dimensional setting.
Discount factor specification: We ran the SGDLM analysis on the second training set of data,
T2 = 846 : 1367. This initialized at t = 846 with priors based on: rj,846 = 5 and sj,845 = 0.001,
aj,846 = 0 and Rj,846 = diag(0.0001, 0.01, · · · , 0.01). This analysis was used to explore the impact
of varying the state discount factors δφ , δγ and the volatility discount βj = β, all assumed the
same across series j. This settled on chosen values βj = 0.95, δφ = 0.95, δγ = 0.99.
We then reran the analysis but without the simulation-variational Bayes step of the computations. That is, using the parallel, independently updated set of models. From this we settled
on βj = 0.95 and δφ = 0.95 again, but found improved 1−step ahead forecasting with an appropriately higher discount factor δγ = 0.999 on the parental predictors. Below we evaluate
forecasting results on test data using this as well as the full decoupled/recoupled analysis; this
choice of discount factors makes this an honest comparison as we begin the analysis of test data
with optimized parameter specification for each strategy.
5.3 Forecasts of Stock Returns
We review forecasting results by analyzing coverage rates of 1−step ahead forecast intervals,
and by evaluating the quantiles of these forecast distributions at which realized returns fall.
5.3.1. Analysis of the Variational Bayes Approximation
Figure 1a shows the effective sample size SN of the variational Bayes step at each time point t. Based on the N = 10000 samples, SN exceeds 7000 more than 94% of the time. There is a short-lived drop to about 5000 during a period of extreme market stress following the collapse of Bear Stearns in September 2008. In the range from 7000 to 9000, the typical effective sample size indicates excellent performance of the importance sampler.
[Figure 1: SN and KLp̃|p during the test data time period Ttest = 1368 : 3411. Panel (a): effective sample size SN; panel (b): KL divergence KLp̃|p.]
Figure 1b shows the importance sample-based estimate HN of the KL divergence KLp̃|p of p̃(·) from the exact posterior p(·) of eqn. (9). Since HN bounds the divergence of the variational Bayes posterior from the exact posterior from above, its uniformly small values show that the variational Bayes posterior will be very close to the exact posterior.
5.3.2. Aggregate Analysis
Table 1 shows the average coverage rates of forecasts across all series and the entire test data
time period Ttest = 1368 : 3411. Across all coverage levels, forecast intervals are too wide. However, coverage rates of the analysis using the importance sampling/variational Bayes strategy
(indicated simply as “with VB”) align much more closely with the outcome data than do those
from analysis that does not use the strategy (indicated as “without VB”).
Forecast interval    99.0%    95.0%    90.0%    80.0%    50.0%    20.0%    10.0%
w/ VB                98.6%    95.9%    92.8%    85.8%    59.8%    27.2%    14.3%
w/o VB               99.2%    97.9%    96.3%    92.9%    76.6%    41.6%    23.2%

Table 1: Coverage of centered forecast intervals across all series j = 1 : m and the entire test data time period Ttest = 1368 : 3411.
Figure 2 shows histograms representing the forecast distribution quantiles of the realized
stock returns. If the forecast distributions were perfectly calibrated, these realized quantiles
would be uniformly distributed. The over-representation of quantiles around the median suggests that the volatilities are generally over-estimated, consistent with the above comments on
the overly conservative dispersion of forecast intervals. The symmetry around the median indicates that our model/analyses are not biased in under or over-forecasting actual gains or losses.
[Figure 2: Empirical distribution of realized forecast quantiles across all series j = 1 : m and the entire test data time period t = 846 : 3410. Panel (a): with variational Bayes; panel (b): no variational Bayes.]
More elaborate DLMs might improve the overall extent of forecast calibration, but that is not a
key issue here. Rather, the comparison of the analysis with/without the decoupling/recoupling
computational strategy is the focus, and the figure clearly shows the benefits of the strategy.
5.3.3. Close-Up Analysis of Individual Stocks
We now zoom-in on forecasting performance for stock returns of six well-known companies:
Apple, Bank of America, General Electric, McDonald’s, Pfizer and Starbucks.
Coverage of forecast intervals: Coverage rates for these companies are reported in Table 2.
The results confirm at these individual levels the findings from the aggregate analysis. While
forecast intervals up to 95% tend to be over-estimated by the specific model chosen, the decoupling/recoupling analysis generally improves forecasting performance, while also yielding more
precise forecast intervals.
Realized forecast quantiles: Figure 3 shows the distribution of the realized quantiles. As in
the aggregate summary across all series, the histograms are far closer to uniformity with VB
than without. In particular, the latter analysis more substantially over-estimates volatilities as
is evident in the higher peaks in the centers of these histograms. Again, the general symmetry
around the median in all cases indicates that the model analyses are effectively unbiased and
similarly accurate in terms of forecasting the signs of returns.
Trend: Figure 4a shows the 45-day tracking moving average of the daily log returns, comparing empirical trends with those forecast under the full SGDLM and under the analysis without VB. There is no noticeable advantage of using the posterior decoupling/recoupling. This is an expected observation, given that the posterior weighting in importance sampling uses the determinants |I − Γt | as weights. The values of the trend parameters do not affect these weights, so their marginal posteriors should be expected to be adequately estimated without the importance sampling/VB steps.
Forecast interval               99.0%   95.0%   90.0%   80.0%   50.0%   20.0%   10.0%
Apple Inc             w/ VB     98.4%   95.8%   92.3%   86.3%   60.1%   25.9%   13.4%
                      w/o VB    99.3%   98.0%   96.2%   92.8%   74.8%   39.6%   19.7%
Bank of America Corp  w/ VB     98.4%   95.8%   92.8%   86.0%   61.4%   27.3%   14.0%
                      w/o VB    98.4%   96.7%   95.1%   92.1%   80.2%   50.8%   30.4%
General Electric Co   w/ VB     98.4%   94.5%   91.3%   84.4%   58.7%   25.9%   12.9%
                      w/o VB    99.2%   97.9%   95.8%   92.7%   77.9%   44.6%   25.3%
McDonald’s Corp       w/ VB     98.6%   96.1%   93.0%   86.7%   59.0%   26.1%   13.2%
                      w/o VB    99.3%   98.6%   96.7%   92.9%   73.7%   38.5%   19.8%
Pfizer Inc            w/ VB     98.9%   95.5%   92.5%   85.5%   59.5%   26.7%   14.3%
                      w/o VB    99.6%   97.7%   96.6%   93.2%   76.7%   39.9%   20.8%
Starbucks Corp        w/ VB     98.2%   95.6%   92.5%   86.5%   59.8%   27.0%   13.9%
                      w/o VB    98.8%   97.2%   95.4%   92.5%   75.1%   39.3%   21.6%

Table 2: Coverage of centered forecast intervals of individual stock returns averaged over the test data time period Ttest = 1368 : 3411.
Volatility: We represent volatility via standard deviations of returns. Figure 4b overlays the observed 45-day tracking moving average of volatility with the volatilities estimated from our analyses. The volatility forecasts from SGDLM analysis with and without VB track each other closely until the onset of the financial crisis in the fall of 2008. Without VB– i.e., ignoring the cross-series constraints– the volatility forecasts skyrocket and never return to historically normal levels. In stark contrast, the full SGDLM analysis generates forecast volatilities in extremely good agreement with the empirically evaluated levels, before, throughout and after the financial crisis and recessionary years.
5.4 Analysis of Computation Time
We discussed theoretical computational loads in Section 4.4. To assess these empirically, we reran multiple analyses using varying numbers of series m and sizes of the parental
sets. Table 3 summarizes the empirical run times across several values, in each case generating
N = M = 10000 posterior and 1−step forecast samples at each time point. The scaling of the run
times is roughly in line with our theoretical estimates, considering some fixed time for memory
transfers and communication overheads for coordinating four GPUs. The times were measured
on a computer with four Nvidia Tesla C2050 GPUs with 448 CUDA cores each.
[Figure 3: Empirical distribution of realized forecast quantiles averaged over the test data time period Ttest = 1368 : 3411, for Apple Inc, Bank of America Corp, General Electric Co, McDonald’s Corp, Pfizer Inc and Starbucks Corp. Panel (a): with variational Bayes; panel (b): no variational Bayes.]
[Figure 4: Comparison of the realized 45 day moving average/volatility and forecast returns/volatilities for the six selected stocks over 2006-2014. The observed trend is in blue, the results from the full SGDLM analysis are in green, and those from analysis without VB are in red. Panel (a): trend; panel (b): volatility.]
            pj = 5                 pj = 10                pj = 20
            Posterior  Forecast    Posterior  Forecast    Posterior  Forecast
m = 50         0.07       0.07        0.11       0.08        0.20       0.10
m = 100        0.16       0.25        0.21       0.26        0.43       0.32
m = 200        0.50       1.17        0.61       1.20        1.04       1.31
m = 400        2.13       6.75        2.32       6.78        3.22       7.01

Table 3: Realized run times (seconds) of SGDLM analysis and forecasting with N = M = 10000.
6. Additional Comments
The multivariate dynamic model formulation via SGDLMs is persuasive in terms of the theoretical ability to flexibly represent– in state-space forms– the dynamics of individual series
coupled with contemporaneous, cross-series multivariate dependencies. The framework conceptually allows for scalability to higher dimensional series and completely frees the modeler
from conceptual and technical constraints encountered in existing models. With that outlook, while also recognizing the inherent opportunities for model decoupling as part of an
overall analysis, and with the insight that individual, univariate model analyses will generally come close to representing core aspects of the full multivariate model, we defined the
decoupling/recoupling strategy that builds on importance sampling and variational Bayes to
define computationally efficient and practically effective analyses. The computational efficacy
arises from aggressive distributed computation, for which GPU hardware is ideally suited; we
have presented the ideas and key details of GPU-enhanced computations, and provide freely
available software for interested researchers to follow-up. The practical effectiveness is demonstrated in the 400−dimensional time series study, and underpinned by the discussion of theoretical questions of importance sampling and variational Bayes/divergence based approximation
of unknowable exact posteriors.
Relative to running sequential analyses of completely decoupled models, our example shows
substantial improvements in forecasting performance at a reasonable computational cost: the
run-time is only increased by about a third. On a rather standard 2012 desktop (with four
Nvidia Tesla C2050 GPUs having 448 CUDA cores each) posterior update and forecasting with
N = M = 10000 samples for our 400−dimensional time series complete in less than 10 seconds
per step. This easily allows for real-time analysis in intervals less than one minute and will scale
to several hundreds and low thousands of series.
Current and future work includes the refinement of our existing Matlab interface, and development of an R interface to our GPU-accelerated implementation of Bayesian learning and
forecasting of SGDLMs. Current applied and methodological questions under study concern parental set selection– which again must involve a focus on practicalities and move away
from a purely theoretical but practically unworkable “global model averaging” perspective. Here
again forecasting performance coupled with innovations in computational strategies will be key
to practical progress.
A. Appendix: State Evolution Models and Updating
Details of the standard DLMs for individual series used in Section 3.1 are summarized here.
A.1 Evolutions
In series j, the time t − 1 normal-gamma posterior is p(θ j,t−1 , λj,t−1 |Dt−1 ) of eqn. (6). With
known state evolution matrices Gjt , state innovations variance matrices Wjt , and volatility discount factors βj , the evolution model has the following elements (West and Harrison 1997;
Prado and West 2010).
• First, the volatilities evolve via
$$ \lambda_{jt} = \lambda_{j,t-1}\eta_{jt}/\beta_j, \qquad \eta_{jt} \sim Be(\beta_j n_{j,t-1}/2,\, (1-\beta_j)n_{j,t-1}/2) \qquad (11) $$
with independent, idiosyncratic beta distributed innovations $\eta_{jt}$. This yields the time t prior
$$ (\lambda_{jt} \mid I_t, D_{t-1}) \sim G(r_{jt}/2,\, r_{jt}s_{j,t-1}/2) $$
with $r_{jt} = \beta_j n_{j,t-1}$ and where It represents the information that the volatility has changed.
• Second, the time t − 1 conditional for the state vector in eqn. (6) updates to
(θ j,t−1 |λjt , It , Dt−1 ) ∼ N (mj,t−1 , Cj,t−1 /(sj,t−1 λjt )),
now conditional on the evolved precision λjt .
• Third, the state vector evolves as
$$ \theta_{jt} = G_{jt}\theta_{j,t-1} + \omega_{jt}, \qquad \omega_{jt} \sim N(0,\, W_{jt}/(s_{j,t-1}\lambda_{jt})) \qquad (12) $$
with independent, idiosyncratic innovations $\omega_{jt}$.
One implication is that the resulting time t prior p(θjt , λjt |It , Dt−1 ) is the multivariate normal-gamma of eqn. (8).
A.2 Updating Equations
In the set of m DLMs above and in Section 3.1, the time t update of the normal-gamma prior
p(Θt , Λt |It , Dt−1 ) leads to the normal-gamma posterior p(Θt , Λt |Dt ) of eqns. (9,10) where the
parameters are given by the standard updating equations.
First, compute the following:

1-step ahead forecast error: $e_{jt} = y_{jt} - F_{jt}'a_{jt}$
1-step ahead forecast variance factor: $q_{jt} = s_{j,t-1} + F_{jt}'R_{jt}F_{jt}$
Adaptive coefficient vector: $A_{jt} = R_{jt}F_{jt}/q_{jt}$
Volatility update factor: $c_{jt} = (r_{jt} + e_{jt}^2/q_{jt})/(r_{jt} + 1)$

Then, compute the posterior parameters:

Posterior mean vector: $\tilde{m}_{jt} = a_{jt} + A_{jt}e_{jt}$
Posterior covariance matrix factor: $\tilde{C}_{jt} = (R_{jt} - A_{jt}A_{jt}'q_{jt})c_{jt}$
Posterior degrees of freedom: $\tilde{n}_{jt} = r_{jt} + 1$
Posterior residual variance estimate: $\tilde{s}_{jt} = c_{jt}s_{j,t-1}$
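For concreteness, a compact single-series C++ translation of these updating equations follows. Eigen is used here purely for illustration; the paper's GPU implementation expresses the same algebra through CUBLAS and custom kernels.

```cpp
// Illustrative one-series, one-step DLM update implementing the equations above.
#include <Eigen/Dense>

struct NormalGamma {            // parameters (m, C, n, s) of the posterior, eqn. (10)
  Eigen::VectorXd m;
  Eigen::MatrixXd C;
  double n, s;
};

NormalGamma dlm_update(double y, const Eigen::VectorXd& F,                 // y_jt and F_jt
                       const Eigen::VectorXd& a, const Eigen::MatrixXd& R, // prior a_jt, R_jt
                       double r, double s_prev) {                          // r_jt and s_{j,t-1}
  double e = y - F.dot(a);                    // 1-step forecast error
  double q = s_prev + F.dot(R * F);           // 1-step forecast variance factor
  Eigen::VectorXd A = (R * F) / q;            // adaptive coefficient vector
  double c = (r + e * e / q) / (r + 1.0);     // volatility update factor
  NormalGamma post;
  post.m = a + A * e;                         // posterior mean vector
  post.C = (R - A * A.transpose() * q) * c;   // posterior covariance matrix factor
  post.n = r + 1.0;                           // posterior degrees of freedom
  post.s = c * s_prev;                        // posterior residual variance estimate
  return post;
}
```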
B. Appendix: C++/CUDA Implementation Details
We provide some technical details about the C++/CUDA implementation of forward filtering
and forecasting of the SGDLM.
B.1 Generation of Gamma Variates
While the CUDA toolkit provides basic random number generators for the uniform and
normal distributions, it is up to the developer to implement other distributions. Our implementation of a gamma random number generator uses the standard rejection sampling algorithm (Marsaglia and Tsang 2000), requiring one uniform and one normal random number at
each step. With expected acceptance probabilities over 90%, running the scheme 2n times to
generate n Gamma random numbers will almost certainly suffice.
We generate 2n uniforms and normals with just one call to curandGenerateUniformDouble
and curandGenerateNormalDouble, respectively. Bundled calls to these CUDA toolkit functions
are the most efficient way to generate random numbers on the GPU. We choose to generate
enough random numbers for 2n proposals, because this allows highly efficient processing on the
GPU without unnecessary synchronization with the host device.
Specifically, a block of CUDA threads is tasked with generating a batch of k < n gamma random
numbers. The batch size is typically chosen as the maximum number of parallel threads the GPU
can evaluate, that is, k = 512 or k = 1024. That batch size is large enough so that there is going
to be at least one rejected proposal. This means that at least one thread will have to try a second
attempt for acceptance. On the other hand, k is large enough that 2k proposals will yield at least
k acceptances.
The architecture of a GPU is such that no thread of a block is available when at least one
thread is busy. The second attempt of at least one thread blocks the entire block. Parallel
processing of the threads means that it doesn’t cost additional time to let all threads of the same
block generate a second proposal. Furthermore, synchronization within a block is very fast. We
will count the number of acceptances and rejections of the first attempt and fill up the k-size
vector of gamma random variables in the second attempt.
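A minimal sketch of the Marsaglia and Tsang (2000) acceptance step as a CUDA device function, assuming shape parameters of at least one and pre-generated normal and uniform draws as described above; the function name and rejection convention are illustrative, not the paper's code.

```cpp
// Marsaglia-Tsang acceptance step for a Gamma(shape a >= 1, rate 1) proposal, written as
// a CUDA device function consuming one pre-generated N(0,1) draw x and one U(0,1) draw u.
// Returns the accepted variate, or -1.0 to signal rejection (the caller then retries with
// fresh (x, u), as described in the text).
__device__ double gamma_try(double a, double x, double u) {
  double d = a - 1.0 / 3.0;
  double c = 1.0 / sqrt(9.0 * d);
  double t = 1.0 + c * x;
  if (t <= 0.0) return -1.0;                 // proposal outside the support: reject
  double v = t * t * t;
  if (log(u) < 0.5 * x * x + d - d * v + d * log(v))
    return d * v;                            // accepted Gamma(a, 1) variate
  return -1.0;                               // rejected: caller retries
}
```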
B.2 Computing njt in Variational Bayes
In the VB optimization of Section 3.3.1, the degrees-of-freedom parameter $n_{jt}$ is defined implicitly by
$$ \log(n_{jt} + p_j - d_{jt}) - \psi(n_{jt}/2) - (p_j - d_{jt})/n_{jt} - \log(2E[\lambda_{jt}]) + E[\log\lambda_{jt}] = 0. \qquad (13) $$
Our C++/CUDA program implements a CUDA kernel for the Newton-Raphson method (Atkinson 1989) to solve for $n_{jt}$, initialized at $\tilde{n}_{jt}$. This uses standard numerical approximations for the digamma and trigamma functions (Abramowitz and Stegun 1972), implemented for this problem as the CUDA toolkit does not provide implementations of these functions. These computations are parallelized over series j = 1 : m.
B.3 Memory Management
C++ and CUDA programming requires proper memory management to avoid segmentation
faults as well as memory leaks. Memory management on GPUs is very similar to memory management on the host device/CPUs.
Memory is allocated using the malloc or cudaMalloc commands and freed using free or
cudaFree. This is a conceptually trivial task. However, keeping track of all allocated memory
poses a real challenge when more involved software projects branch into different paths. To
address this, we developed a universal and lightweight helper class, memory manager, that automates memory management. This keeps pointers to allocated memory in a C++/boost list. It
provides methods to allocate, and keep track of, memory as well as a clear method to free all
memory it tracks.
Each function starts its own instance of our memory manager to manage memory that is not
to live beyond the lifetime of that function. At every branch that ends a function, we simply
invoke the clear method of the respective function’s memory manager to free that memory.
Furthermore, there is a global instance of memory manager to manage the large memory
blocks needed for simulation of forecasting and variational Bayes approximation. Allocating
and freeing that memory at every time step t would be prohibitively time-consuming.
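A minimal illustrative sketch of such a helper class follows; the paper's memory manager additionally tracks host allocations and stores pointers in a C++/boost list, with further methods beyond those shown.

```cpp
// Minimal illustrative sketch of a memory manager that tracks cudaMalloc'd device
// pointers and frees them all in clear().
#include <cuda_runtime.h>
#include <vector>

class memory_manager {
 public:
  // Allocate `bytes` of device memory, remember the pointer, and return it.
  void* device_alloc(size_t bytes) {
    void* ptr = nullptr;
    cudaMalloc(&ptr, bytes);
    device_ptrs_.push_back(ptr);
    return ptr;
  }
  // Free every tracked allocation; called at each exit branch of a function.
  void clear() {
    for (void* p : device_ptrs_) cudaFree(p);
    device_ptrs_.clear();
  }
  ~memory_manager() { clear(); }             // safety net if clear() was not called
 private:
  std::vector<void*> device_ptrs_;
};
```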
B.4 Matlab Interface
MathWorks provides a C library, mex.h, to allow the development of a Matlab interface to C,
C++ and CUDA software. This library provides functionality to read Matlab input, write Matlab
output and write Matlab messages.
Recent versions of the Matlab Parallel Computing toolbox have another library to read Matlab GPU input and write Matlab GPU output, mxGPUArray.h, that also allows the allocation
and freeing of GPU memory. However, in contrast to our memory manager, Matlab’s memory
tracking only avoids memory leaks when Matlab ends, not while the function is active.
We decided against the use of the Matlab GPU library for three reasons. First, up to Matlab
R2013b, Matlab ships with an outdated version of the CUDA toolkit library that does not yet
provide the latest batch parallelism functionality that we use. Second, by not relying on the
Matlab GPU library, our functions can be compiled and run by users that do not license the
parallel computing toolbox. Third, we add backwards compatibility to earlier versions of Matlab
that do not come with GPU capabilities by programming the GPU interface ourselves. This has
been tested with Matlab versions as early as R2010a without any problems. We will of course
revisit the questions as new versions of Matlab come along.
References
Abramowitz, M. and Stegun, I. A. (1972). Handbook of Mathematical Functions with Formulas,
Graphs, and Mathematical Tables. Dover Publications, New York.
Agarwal, D., Chen, D., Lin, L., Shanmugasundaram, J., and Vee, E. (2010). Forecasting high-dimensional data. In Elmagarmid, A. K. and Agrawal, D., editors, Proceedings of the ACM
SIGMOD International Conference on Management of Data, Indianapolis, USA, pages 1003–
1012. ACM.
Aguilar, O. and West, M. (2000). Bayesian dynamic factor models and portfolio allocation.
Journal of Business & Economic Statistics, 18(3):338–357.
Alspach, D. L. and Sorenson, H. W. (1972). Nonlinear Bayesian estimation using Gaussian sum
approximations. IEEE Transactions on Automatic Control, 17:439–448.
Atkinson, K. (1989). An Introduction to Numerical Analysis. Wiley, New York.
Carvalho, C. M. and West, M. (2007). Dynamic matrix-variate graphical models. Bayesian
Analysis, 2(1):69–98.
Harrison, P. J. and Stevens, C. F. (1971). A Bayesian approach to short-term forecasting. Operational Research Quarterly, 22:341–362.
Harrison, P. J. and Stevens, C. F. (1976). Bayesian forecasting (with discussion). Journal of the
Royal Statistical Society (Series B, Methodological), 38:205–247.
Hosking, J., Natarajan, R., Ghosh, S., Subramanian, S., and Zhang, X. (2013). Short-term
forecasting of the daily load curve for residential electricity usage in the Smart Grid. Applied
Stochastic Models in Business and Industry, 29(6):604–620.
Jaakkola, T. and Jordan, M. I. (2000). Bayesian parameter estimation via variational methods.
Statistics and Computing, 10:25–27.
Koop, G. (2012). Using VARs and TVP-VARs with many macroeconomic variables. Central
European Journal of Economic Modelling and Econometrics, 4:143–167.
Lee, A., Yau, C., Giles, M. B., Doucet, A., and Holmes, C. C. (2010). On the utility of graphics
cards to perform massively parallel simulation with advanced Monte Carlo methods. Journal
of Computational & Graphical Statistics, 19(4):769–789.
Marsaglia, G. and Tsang, W. W. (2000). A simple method for generating gamma variables. ACM
Transactions on Mathematical Software, 26(3):363–372.
Mukherjee, C., Kasibhatla, P. S., and West, M. (2014). Spatially-varying SAR models and
Bayesian inference for high-resolution lattice data. Annals of the Institute of Statistical Mathematics, 66:473–494. First published online: October 8, 2013.
Nakajima, J. and West, M. (2013a). Bayesian analysis of latent threshold dynamic models.
Journal of Business & Economic Statistics, 31:151–164.
Nakajima, J. and West, M. (2013b). Bayesian dynamic factor models: Latent threshold approach. Journal of Financial Econometrics, 11:116–153.
Prado, R. (2010). Multi-state models for mental fatigue. In O’Hagan, A. and West, M., editors,
The Handbook of Applied Bayesian Analysis, pages 845–874. Oxford University Press.
Prado, R. and West, M. (2010). Time Series: Modeling, Computation & Inference. Chapman &
Hall/CRC Press.
Quintana, J. M. and West, M. (1987). An analysis of international exchange rates using multivariate DLMs. The Statistician, 36:275–281.
Smith, A. F. M. and West, M. (1983). Monitoring renal transplants: An application of the multiprocess Kalman filter. Biometrics, 39:867–878.
Suchard, M. A., Wang, Q., Chan, C., Frelinger, J., Cron, A. J., and West, M. (2010). Understanding GPU programming for statistical computation: Studies in massively parallel massive
mixtures. Journal of Computational and Graphical Statistics, 19(2):419–438.
Trejo, L. J., Knuth, K., Prado, R., Rosipal, R., Kubitz, K., Kochavi, R., Matthews, B., and Zhang,
Y. (2007). EEG-based estimation of mental fatigue: Convergent evidence for a three-state
model. In Schmorrow, D. D. and Reeves, L. M., editors, Augmented Cognition, HCII 2007, LNAI
4565, pages 201–211. Springer-Verlag, Berlin Heidelberg.
Wang, H. and West, M. (2009).
Bayesian analysis of matrix normal graphical models.
Biometrika, 96:821–834.
West, M. (1993). Approximating posterior distributions by mixtures. Journal of the Royal Statistical Society (Ser. B), 54:553–568.
West, M. and Harrison, P. J. (1989). Subjective intervention in formal models. Journal of
Forecasting, 8:33–53.
West, M. and Harrison, P. J. (1997). Bayesian Forecasting & Dynamic Models. Springer Verlag,
2nd edition.
Zhao, Z. Y. and West, M. (2014). Dynamic compositional regression modelling: Application
in financial time series forecasting and portfolio decisions. Technical report, Department of
Statistical Science, Duke University.
Zhou, X., Nakajima, J., and West, M. (2014). Bayesian forecasting and portfolio decisions using
dynamic dependent factor models. International Journal of Forecasting. In press.