GPU-Accelerated Bayesian Learning and Forecasting in Simultaneous Graphical Dynamic Linear Models Lutz F. Gruber∗ Mike West† Department of Mathematics Chair of Mathematical Statistics Technical University of Munich Munich, Germany [email protected] Department of Statistical Science Duke University Durham, North Carolina, USA [email protected] July 2, 2014 Abstract We discuss modeling and GPU-based computation in a new class of multivariate dynamic models customized to learning and prediction with increasingly high-dimensional time series. This defines an approach to decoupling analysis into a parallel set of univariate time series dynamic models, while flexibly modeling cross-series relationships in a novel, induced class of time-varying graphical models of multivariate volatility matrices. A novel decoupling/recoupling computational strategy allows for exact analysis and simulation of univariate time series models that are then coherently linked to represent the full multivariate model. The computational innovations use importance sampling and variational Bayes ideas, and the overall strategy is ideally suited to GPU implementation. The analysis is enormously scalable with time series dimension, as we demonstrate in an analysis of a 400-dimensional financial time series. Keywords: Decoupling models; High-dimensional time series; Importance sampling; Parallel computing; Recoupling models; Variational Bayes. ∗ Research developed as a Fulbright Visiting Scholar in the Department of Statistical Science at Duke University, partly supported under the Fulbright Program for Foreign Students sponsored by U.S. Department of State. † Research partially supported by the US National Science Foundation [grant DMS-1106516]. Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the Fulbright Foundation or the NSF. 1. Introduction Challenges of scaling on-line Bayesian analyses of multivariate dynamic models to increasingly high-dimensions define a need for new modeling and computational strategies. Sequential analysis that is both computationally accessible and sensitive in characterizing the dynamic, multivariate structure of increasingly many series is of interest in many application areas, such as macroeconomics (Koop 2012; Nakajima and West 2013a), financial portfolio studies (Aguilar and West 2000; Zhou et al. 2014), commercial demand, sales and consumer interest forecasting (Agarwal et al. 2010), large scale-networks in energy demand forecasting (Hosking et al. 2013), and biomedical areas including the neurosciences (Trejo et al. 2007; Prado 2010). We address these challenges in a novel class of Simultaneous Graphical Dynamic Linear Models (SGDLMs) and utilizing GPU-based distributed implementations. This: • Uses fast, efficient sequential learning in a set of decoupled univariate dynamic linear models that capture cross-series contemporaneous dependencies via a sparse and dynamic simultaneous equations formulation; • Recouples these parallel univariate analyses using importance reweighting of sets of direct simulations from the univariate models, to define coherent inference and forecasting of the multivariate series; and • Moves ahead in time using decoupled models based on a variational Bayes strategy. On modeling: SGDLM allows for flexible time-varying parameter, state-space models of each individual series. If analyzed separately, each of these is amenable to standard, closed form sequential filtering and forecasting (Prado and West 2010; West and Harrison 1997), so analysis is most efficient. SGDLMs further integrate a novel specification of multivariate dependencies across series at each time point, based on the simultaneous equations representation. This contributes a new model class to the growing field of sparse modeling of volatility matrices. Unlike Cholesky-style and factor models (e.g. Aguilar and West 2000; Nakajima and West 2013b), SGDLMs do not require an ordering of the univariate series, which can be an obstacle to model specification and is a major question when dealing with many time series. Compared to dynamic graphical models of precision matrices (e.g. Quintana and West 1987; Carvalho and West 2007; Wang and West 2009), SGDLMs are immensely more scalable computationally, and substantially more flexible in allowing for patterns of change in multivariate volatility matrices that are specified implicitly via state-space models for sets of simultaneous regression parameters. On computation: The recoupling analysis uses exact, within-series simulation of states and parameters for each time step, combined with importance sampling (IS) reweighting for inference and prediction of the multivariate series. We then exploit an efficient and accurate variational 1 Bayes (VB) strategy (e.g. Jaakkola and Jordan 2000, and West and Harrison 1997, sect. 12.3.4) to decouple and proceed to the next time point. Importantly, we monitor and adjust for the VB approximation using information from the IS step. A main interest is in scaling-up the number of time series, and here we exploit (C++/CUDA) GPU computation (Suchard et al. 2010; Lee et al. 2010). GPU computation is simply ideally suited to the analysis. Univariate time series model updates and simulation are performed in parallel across the series at each time point, and then recoupled for coherent inference and forecasting before decoupling again to move to the next time point and re-parallelization. The overall modeling and computational strategy is enormously scalable with time series dimension as a result. We discuss computational demands in general, illustrate the GPU implementation, and demonstrate that this allows for an involved, real-time Bayesian analysis of a 400-dimensional daily stock return time series. 2. 2.1 SGDLMs Model Structure The m−dimensional time series yt = (y1t , . . . , ymt )0 , t = 1, 2, . . . , is modeled as follows. At each time t, the real-valued, scalar components yjt follow univariate DLMs (West and Harrison 1997; Prado and West 2010) 0 yjt = x0jt φjt + ysp(j),t γ jt + νjt = F0jt θ jt + νjt , j = 1 : m, (1) where: • xjt is a known column vector of predictors or constants, with corresponding dynamic regression coefficients in the column state vector φjt , each of dimension pjφ . • The index set sp(j) ⊆ {1 : m}\{j} identifies the pjγ = |sp(j)| simultaneous parents of series j in the column vector ysp(j),t , with corresponding state vector γ jt . • The state and regression vectors are catenated as θ jt = φjt γ jt ! and Fjt = xjt ysp(j),t ! , each of dimension pj = pjφ + pjγ ; at each t, we denote them collectively by Θt = θ 1:m,t . • The observation noise terms νjt ∼ N (0, λ−1 jt ) have precisions λjt and are independent across series and over time. 2 Each of the state vectors θ jt will follow some specified state evolutions models, such as vector random walks, TV-VAR models, etc., depending on context. The model defines a set of coupled, dynamic simultaneous equations that cohere based on the independence of the νjt . The parental sets sp(j) allow for sparse modeling of contemporaneous structure; with larger m, we will select small parental sets that inherently define sparse graphical model structures. This draws on the use of simultaneous autoregressive (SAR) models in spatial statistics (e.g. Mukherjee et al. 2014). The result is seen in the immediate deduction that yt ∼ N (At µt , Σt ) (2) where At = (I − Γt )−1 and 0 Σ−1 t ≡ Ωt = (I − Γt )Λt (I − Γt ) (3) and: (i) µt = (µ1t , . . . , µmt )0 where µjt = x0jt φjt for each j = 1 : m; (ii) Λt = diag(λ1t , . . . , λmt ); and (ii) the dynamic simultaneous coefficient matrix is defined as 0 γ1,2,t γ1,3,t ··· γ1,m,t γ2,1,t 0 γ · · · γ 2,3,t 2,m,t .. .. . . .. .. .. Γt = . . . γ . . . γ 0 γ m−1,1,t m−1,m−2,t m−1,m,t γm,1,t γm,2,t ... γm,m−1,t 0 with entries γj,h,t = 0 for each h 6∈ sp(j), and implicitly indexing the elements of each γ jt against their partner parents in sp(j) to define the time-varying, non-zero entries. A sparse model has a sparse simultaneous coefficient matrix Γt , with a chosen and fixed set of zeros and with dynamic elements in other entries. The cross-series dependence characteristics of the covariance matrix Σt derive from the precision matrix Ωt , that being implied by the simultaneous parental structure in Γt . When parental sets are very sparse relative to m, the zeros in Γt generally result in a sparse implied precision matrix Ωt , with non-zero and time-varying entries elsewhere, i.e., a sparse, dynamic graphical model– hence we refer to the multivariate coupled DLMs as an SGDLM. Practical relevance will typically mean that sp(j) ∼ k for some (relatively small) fixed k, so models with moderate and large m will be very sparse indeed; the effective number of non-zero parameters underlying the m(m + 1)/2 elements of the volatility matrix Ωt is then just m(k + 1). 2.2 Challenges to Efficient Sequential Learning The univariate DLM for yjt is a linear, normal state-space model for θ jt that can be coupled with the traditional variance discount model for stochastic volatilities λjt . This yields the stan3 dard forward filtering and forecasting analysis in closed form (West and Harrison 1997). At the heart of this are sequentially updated normal/gamma priors p(θ jt , λjt |Dt−1 ), and posteriors p(θ jt , λjt |Dt ), where Dt = {yt , It , Dt−1 } represents all information available at time t; from time t − 1 to t, this is updated with the most recent observation yt and any additional subjective information It that may be incorporated (e.g. West and Harrison 1989, 1997, chapt. 11). The full multivariate model raises computational challenges evident in the non-linearities in state vectors in eqns. (2,3). However, there is some simplification and structure, again paralleling forms arising in spatial SAR models. It is easily shown that the time t likelihood function for Θt , Λt , directly derived from the m−variate normal density of eqn. (2), can be reduced to p(yt |Θt , Λt ) ∝ |I − Γt | Y p(yjt |θ jt , λjt ) (4) j=1:m where the product is of normal density terms from individual models yjt ∼ N (F0jt θ jt , 1/λjt ). The determinant factor here induces the computational challenges. Apart from the special cases of compositional models (Zhao and West 2014) in which Γt is triangular with a diagonal of zeros and so |I − Γt | = 1, this term contributes to the likelihood for the γ jt vectors. In very sparse models, it can tend to be quite diffuse compared to the product of individual likelihood terms, but generally it matters; it arises theoretically and ensures coherence of the multivariate analysis– reflecting the parametric constraints on Γt that are inherently required to ensure positive definiteness and symmetry of Ωt . Below in our financial time series example we demonstrate some of the negative practical consequences of simply ignoring this. That said, the form of eqn. (4) suggests opportunity to exploit separate, parallel analyses of each series in order to define an effective sequential computational strategy for the full multivariate model. This observation underlies our decoupling/recoupling computational approach. The naive approximation of ignoring the determinant correcting factor is akin to running a univariate DLM on each series individually. Our strategy improves on this while retaining the analytical tractability of the time evolution and update steps. 3. 3.1 Model Decoupling/Recoupling for Bayesian Computation Model Emulation: Decoupling for Forward Filtering The above, intuition-building discussion suggests that a bank of m decoupled, univariate models updated in parallel at time t will define a useful emulator of the full multivariate model at that time point. This underlies our decoupling/recoupling strategy. A. Time t − 1 Posterior: Suppose the time t − 1 posterior for the full set of m states and volatilities Θt−1 , Λt−1 is approximated– or, better, emulated– by a product of m multivariate 4 normal-gamma distributions. That is, p(Θt−1 , Λt−1 |Dt−1 ) = Y pj,t−1 (θ j,t−1 , λj,t−1 |Dt−1 ) (5) j=1:m with components (θ j,t−1 , λj,t−1 |Dt−1 ) ∼ N G(mj,t−1 , Cj,t−1 /sj,t−1 , nj,t−1 /2, nj,t−1 sj,t−1 /2) (6) where the notation (Prado and West 2010) represents the normal-gamma distribution (θ j,t−1 |λj,t−1 , Dt−1 ) ∼ N (mj,t−1 , Cj,t−1 /(sj,t−1 λj,t−1 )), (λj,t−1 |Dt−1 ) ∼ G(nj,t−1 /2, nj,t−1 sj,t−1 /2) in which G(n, ns) is gamma with shape n, scale ns, and mean n/(ns) = 1/s. The implied margin for the state vector is p−variate T with nj,t−1 degrees of freedom and scale matrix Cj,t−1 . B. State-Space Evolution to Time t: The individual states (θ jt , λjt ) are modeled as evolving according to standard linear, normal state-space models with stochastic volatility discounting (West and Harrison 1997; Prado and West 2010), independently across j = 1 : m. With It representing the information defining the evolution model, this defines the conditional distributions (θ jt , λjt |θ j,t−1 , λj,t−1 , It , Dt−1 ) with standard details summarized in Appendix A.1. For each j = 1 : m, we have known state evolution matrices Gjt , state innovations variance matrices Wjt , and volatility discount factors βj , that define these evolution models. One key result to note here is that the resulting time t prior is the “evolved” version of eqn. (5) given by p(Θt , Λt |It , Dt−1 ) = Y p(θ jt , λjt |It , Dt−1 ) (7) j=1:m with components (θ jt , λjt |It , Dt−1 ) ∼ N G(ajt , Rjt /sj,t−1 , rjt /2, rjt sj,t−1 /2) (8) where ajt = Gjt mj,t−1 , Rjt = Gjt Cj,t−1 Gjt + Wjt and rjt = βj nj,t−1 . C. Update to Time t Posterior: The time t − 1 → t filtering is completed by updating the prior in eqn. (7) via the likelihood function in eqn. (4). We now see that, apart from the determinant term, the posterior is dominated by the product of independent posteriors across the series, with 5 the latter being defined by standard updates (West and Harrison 1997). That is, the posterior is p(Θt , Λt |Dt ) ∝ |I − Γt | Y (9) pejt (θ jt , λjt ) j=1:m where each pejt (·) is the density of a conjugate normal-gamma e jt /e e jt , C (θ jt , λjt ) ∼ N G(m sjt , n ejt /2, n ejt sejt /2), (10) e jt , n e jt , C with m ejt , sejt given by the standard updating equations; see full details in Appendix A.2. 3.2 Model Emulation: Direct and Importance Sampling Expecting the determinant term of eqn. (9) to be generally dominated by the product of density functions of eqn. (10) over their range, it is clear that this product will be a close approximation of the exact posterior. This suggests using the set of m decoupled conjugate posteriors of eqn. (10) to emulate the exact posterior, and we utilize it as an importance sampling proposal distribution. We can trivially sample directly from the independent, conjugate normal-gammas– in parallel and independently– and then combine the samples across series; this yields importance samples with importance weights, from eqn. (9), proportional to |I − Γt |. Explicitly: 1. For each series j = 1 : m in parallel and independently, draw random samples of size (i) (i) (θ jt , λjt ), i = 1 : N, from pejt (·). (i) (i) 2. Combine these marginal samples to define multivariate importance samples (Θt , Λt ). P (i) 3. Compute the importance weight αit ∝ |I − Γt | and normalize so that i=1:N αit = 1. 4. Assess the Monte Carlo accuracy of the importance sample using standard methods, inP cluding effective sample size SN = N/ i=1:N αit2 . 5. Use the importance sample for Monte Carlo inference on current states, volatilities, and for forecasting ahead from time t. We comment further on forecasting below. Note that we can massively parallelize when m is large. Also, the importance weights do not depend on the φjt , λjt , only on the simultaneous coefficients γ jt . This reflects the view that the parallel conjugate models will effectively emulate the full multivariate model in inferring seriesspecific states and volatilities, while some “corrections” will be needed for inferences on the parental states that explicitly define cross-series structure. Importance sampling is the natural and elegant approach to making these corrections. The series/models are decoupled for updates and direct simulation, and then recoupled for importance sampling targeting the exact posterior. 6 3.3 Variational Bayes Posterior Decoupling 3.3.1. Mean Field Posterior Completing the evolution-update steps from time t − 1 to time t, we now see that moving this analysis along to time t + 1 and the future will require morphing the importance sample-based posterior for (Θt , Λt |Dt ) to a set of m, decoupled conjugate normal-gammas as in eqns. (5,6) but with t − 1 updated to t. This will then complete the cycle and define an overall sequential analysis strategy. We do this via a variational Bayes (VB) step, in fact a mean-field approximation (Jaakkola and Jordan 2000). On an historical note, both the conceptual basis and technical aspects of this kind of mapping to sets of conjugate forms has a long history in the Bayesian dynamic modeling and forecasting literature (e.g. Harrison and Stevens 1971; Alspach and Sorenson 1972; Harrison and Stevens 1976; Smith and West 1983; West and Harrison 1997, sect. 12.3.4). The VB decoupling replaces the importance sample representation of the posterior p(Θt , Λt |Dt ) with the Kullback-Leibler (KL) “nearest” product of normal-gammas across the m series. Again the intuition is that the product of the m independent terms will generally dominate the posterior; now, for moving ahead to time t + 1 and in view of the interests in computational efficacy enabled by analytic tractability, we cut back to the latter form, using KL divergence-based optimization of the product form. That is, we now choose the m sets of parameters {mjt , Cjt , njt , sjt } to define the time t product form posterior as in eqns. (5,6) with time index t − 1 updated to t. It is easily shown that the KL divergence of such a product of conjugate forms from the exact posterior p(Θt , Λt |Dt ) is minimized via: • mjt = E[λjt θ jt ]/E[λjt ]; • Vjt = E[λjt (θ jt − mjt )(θ jt − mjt )0 ]; −1 • djt = E[λjt (θ jt − mjt )0 Vjt (θ jt − mjt )]; • njt such that log(njt + pj − djt ) − ψ(njt /2) − (pj − djt )/njt − log(2E[λjt ]) + E[log λjt ] = 0; • sjt = (njt + pj − djt )/(njt E[λjt ]); • Cjt = sjt Vjt . Here the expectations E[·] are with respect to the importance sampling-based posterior, so that these parameters can be easily computed. The noteworthy cases are the njt , whose values are trivially computed via an iterative numerical approach; see Appendix B.2. 3.3.2. Comments on KL Divergence and Simulation-VB Strategy Recall that, for any random quantity z, the KL divergence of a distribution with density g(x) from one with density p(x) is KLg|p = −Ep log{p(x)/g(x)} . Here the densities are continuous, 7 discrete or mixed and have common support, and Ep [·] refers to expectation under p(·). We make the following observations connected to the above use of the KL minimizing VB strategy. 1. Denote by pe(Θt , Λt |Dt ) the product of the initial set of m independent normal-gamma posteriors in eqn. (10). The importance sampling weights α1:N,t provide a direct assessment of the divergence KLpe|p of pe(·) from the exact posterior p(·) of eqn. (9), as follows. Write P HN = i=1:N αit log(N αit ), the entropy of the importance sampling weights αit relative to a set of N uniform weights– one measure of efficacy of importance samplers (e.g. West 1993). It can be easily shown that, as N → ∞, HN → KLpe|p , so the relative entropy gives a direct estimate of the KL divergence. P It can also be shown that HN ≤ N i=1:N αit2 − 1 = N/SN − 1 where SN is effective sample size; for large N , the limiting value of SN /N is bounded from above by 1/(1 + KLpe|p ). 2. The VB-based posterior defined in Section 3.3.1 represents an improvement over the initial pe(·) as it minimizes KLg|p over all g(Θt , Λt ) that are products of m normal-gamma forms for the (θ jt , λjt ), j = 1 : m. Hence HN gives an estimate of the upper bound of the minimized divergence; if HN is already small– based on calibrating to effective sample size as above– then we are assured of closeness of the revised VB-based posterior approximation. 3. The KL divergence of any subset of parameters (Θt , Λt ) cannot exceed the divergence on the full set. Hence the latter is an operational upper bound on divergences of the approximating marginal posteriors in any of the m individual models. If the overall approximation is good, we do not have to monitor the margins. 4. In evolution from time t to t + 1, the KL-optimized product of normal-gamma posteriors for (Θt , Λt |Dt ) evolves to a similar analytic form for the time t + 1 prior p(Θt+1 , Λt+1 |It+1 , Dt ) of eqn. (7) with t updated to t + 1. We know that KL divergence decreases through convolutions. Hence the divergence of this normal-gamma product for (Θt+1 , Λt+1 ) is a better approximation of the exact time t + 1 prior than was the case for the time t posterior. If we have a high-quality posterior approximation at time t, then the situation only improves following evolution. This is important to bear in mind in forecasting, as now follows. 3.4 Forecast Computations The simulation analysis underlies computation of aspects of forecast distributions to predict ahead from time t. We summarize this here for 1−step ahead forecasts, which underlie our example below. We sample the predictives p(yt+1 |It+1 , Dt ), and repeat the computations as time evolves. The simulations, that can be partially parallelized, are as follows: 1. Directly simulate the time t + 1 states from the analytic product of normal-gamma distributions that follow after our variational Bayes (VB) decoupling step of the next section. 8 The VB step approximates the posterior p(Θt , Λt |Dt ) by a set of m, decoupled conjugate normal-gammas as in eqns. (5,6) but with t − 1 updated to t. As a result, evolution to time t + 1 implies the product of normal-gamma distributions for (Θt+1 , Λt+1 |It+1 , Dt ) as in eqn. (7) but with t updated to t + 1. Directly from these, in parallel, to generate m Monte Carlo samples of the same size M, (i) (i) that when recoupled define the full simulation sample (Θt+1 , Λt+1 ), i = 1 : M. (i) (i) (i) 2. Compute the implied terms At+1 , µt+1 , Σt+1 from eqn. (3) with time t updated to t + 1. (i) (i) (i) (i) 3. For each i = 1 : M, sample yt+1 ∼ N (At+1 µt+1 , Σt+1 ). In forecasting as earlier in filtering, we stress the use of decoupling to distribute simulations in parallel across series, followed by recoupling to evaluate and sample the full multivariate model. 4. Implementation in C++/CUDA We provide an overview of our GPU implementation of the complete filtering/forecasting analysis, noting a number of technical aspects and requirements as well as giving some flavor of the structure of C++/CUDA programming1 . This Section, and the code, contribute to the growing body of literature linked to Bayesian (and other) statistical computations that are inherently enabled via GPU implementations (e.g. Suchard et al. 2010; Lee et al. 2010). 4.1 Development of GPU-Accelerated Software CUDA is an extension of the C programming language to operate on Nvidia CUDA-enabled GPUs. Modern GPUs can feature hundreds of cores. These can be used to execute single instruction, multiple data (SIMD) operations in a massively parallel mode. In comparison, multi-core desktop processors tend to have not more than eight cores. Whenever the same set of operations has to be performed many times on different data, GPU-accelerated implementations vastly outperform their CPU equivalents, given their significantly higher number of cores. Developing GPU-accelerated software is much like developing CPU software in C or C++. One has to allocate and free memory, then implement program logic. The latter can use a combination of self-made functions, called kernels, if they are executed on the GPU, and functions provided by third-party libraries. The CUDA development toolkit comes with several libraries, two of which we used heavily: CUBLAS and CURAND, that provide basic linear algebra routines (BLAS) and random number generators, respectively, for CUDA-enabled GPUs. 1 Our implementation and code is based on a computer with an Nvidia CUDA-enabled graphics processing unit (GPU) with a compute capability of at least 2.0. Interested users may run the code, so long as CUDA runtime, CUBLAS and CURAND libraries of Version 5.5 or newer are installed. We also have a user-friendly Matlab interface compatible with Matlab versions as early as R2010a; though the C++/CUDA is free-standing, some researchers may be interested in accessing GPU facilities via Matlab. 9 On top of massively parallel execution of SIMD operations in kernels, CUDA streams can be used to organize sequences of operations. The operations within each stream are executed sequentially, but different streams are executed concurrently on the GPU. Furthermore, different streams can be assigned to different GPU devices, allowing multiple GPU computing. Concurrent execution and data transfers allow the most efficient usage of GPU resources. 4.2 Key Features of CUDA Implementation Wherever possible, our implementation uses batched functions from the CUBLAS library to perform the linear algebra operations for forward filtering and forecasting. Batched functions are a recent innovation introduced into the CUBLAS library. They accelerate operations on a large number of small matrices by bundling them into a single function call. This reduces the overhead of initiating many small operations on the GPU individually. In particular, we use: • cublasDgemmBatched(...) for batched matrix-matrix multiplications, • cublasDgetrfBatched(...) for batched LU factorization, • cublasDgetriBatched(...) for batched matrix inversion. We developed customized kernels to achieve maximum performance for operations that fall outside the scope of current batched CUBLAS functions; see Appendix B for details. 4.3 Multiple Device Parallelization This presents the key points regarding GPU parallelization using multiple GPU devices. A call to the function cudaSetDevice(k) activates the k-th GPU device. This means all future GPU calls will be executed on this device until the current device is changed by another call to cudaSetDevice(...). Our implementation sets up one CUDA stream per GPU. In our importance sampling steps to generate a posterior sample of size N and then forecast simulation of M samples, each of K GPUs will compute N/K importance samples and then N/M direct samples; this is parallelization by particles. The key functions from the CUDA API that organize asynchronous GPU execution are: • • • • cudaStreamCreate(...) to create a stream, cudaMemcpyAsync(...) to asynchronously move data between computer and GPU, cudaStreamSynchronize(...) to wait until a stream’s execution is completed, cudaStreamDestroy(...) to destroy a stream. Calls to CUDA kernels, most API function calls, and asynchronous memory copy operations return control to the host immediately. This allows us to send commands to different GPUs in a for loop and still have them executed in parallel. However, it is necessary to wait until the operations are completed before returning results; this is where the function cudaStreamSynchronize(...) 10 comes in. This operates after the GPU operations that are to be executed in parallel are sent to their respective devices. Below is an outline of a multiple device parallelized program. Pseudo Code 1: for GPU device k = 1 : K do 2: Call cudaSetDevice(k) to activate the k-th GPU. 3: Call cudaStreamCreate(stream k) to create a stream on the k-th GPU. 4: Call cudaMemcpyAsync(..., stream k) to asynchronously copy input arguments onto device memory. 5: end for 6: for GPU device k = 1 : K do 7: Call cudaSetDevice(k) to activate the k-th GPU. 8: Call the Variational Bayes posterior estimation subroutine to generate N/K importance samples, and then the forecasting simulation subroutine to generate M forecast samples. 9: end for 10: for GPU device k = 1 : K do 11: Call cudaSetDevice(k) to activate the k-th GPU. 12: Call cudaMemcpyAsync(..., stream k) to asynchronously copy results to host memory. 13: end for 14: for GPU device k = 1 : K do 15: Call cudaSetDevice(k) to activate the k-th GPU. 16: Call cudaStreamSynchronize(stream k) to wait until the computations and memory transfers are completed. Call cudaStreamDestroy(stream k) to destroy the stream on the k-th GPU. 17: 18: end for 19: Combine and post-process the results on the CPU, if applicable. 4.4 Computational Load In the importance sampling, forecasting and variational Bayes posterior decoupling, referred to in Line 8 of above pseudo code, the importance sample sample sizes N (for posterior updates) and M (for forecasting) will typically be very large for accurate posterior approximations. Hence scaling in N and M is critical. The lead complexity of sampling the parameters Θt and Λt is O N mp2max where pmax = maxj=1:m pj . Computing the determinants |I − Γt | is the most expensive operation, with a cost of O N m3 . The remaining operations that scale with N are of order O N mp2max or lower, so the overall load is O N (m3 + mp2max ) as N → ∞. In the 1−step ahead forecasting, sampling the parameters Θt+1 and Λt+1 is of the same 11 computational complexity as sampling from the posterior above, and to this we have to add the cost of generating the needed state and volatility innovations of O M mp3max for a sample of size M. Inverting the matrix I − Γt+1 is most expensive at O M m3 , so the overall computational cost of forecasting is O M (m3 + mp2max ) as N → ∞. Note that this is the same as for posterior updates when we choose M = N . 5. 5.1 Evaluation in Analysis and Forecasting of Stock Returns Data and Study Set-Up We analyze daily log returns of m = 400 S&P stocks. In a simple class of SGDLMs, we evaluate 1−step ahead forecasts for both selected individual series and across all series. This includes a study of the effect of ignoring the coupling of the set of simultaneous model equations, to bear out the utility of the decoupling/recoupling strategy. Our data cover m = 400 current members of the S&P 500 index, restricting to those 400 that were continuously listed from October 2000 to October 2013, our study period. The data are daily log returns (differences in daily log prices). We use the first 845 daily observations, T1 = 1 : 845 up to December 2003, for an initial exploratory analysis to define the simultaneous parental sets for each of the m = 400 series. We then use the following 522 daily observations, T2 = 846 : 1367 from January 2004 through December 2005, as training data to select suitable discount factors and to provide starting priors for the test data. The remaining 2044 observations, Ttest = 1368 : 3411 from January 2006 through October 2013, serve as the hold-out or test data to evaluate sequential filtering and forecasting. In training and test data analyses, the Monte Carlo sample size N for posterior updates as well as M for forecasting were set at N = M = 10000. 5.2 Model Structure and Specification via Training Data Analysis SGDLM form: For each of the 400 univariate series, we use the local-level (random walk) for a time-varying trend DLM (West and Harrison 1997, chapt. 2) common in studies of financial returns (e.g. Aguilar and West 2000; Nakajima and West 2013b). This is coupled with a specified set of 10 simultaneous parents, whose coefficients also evolve via random walks. So, in eqn. (2) we have pjφ = 1, xjt = 1 and pjγ = 10. In the state evolution models of Appendix A.1, and eqn. (7), we then have Gjt = I, the 11-dimensional identity matrix, and a discount method specifies the evolution innovations variance matrices Wjt (West and Harrison 1997, chapt. 6). −1/2 −1/2 At each t, Wjt = Gjt ∆Cj,t−1 ∆ − Cj,t−1 G0jt where ∆ = diag([δφ , δγ 10 ]) in which: (i) 1 is a 10−vector of ones; (ii) δφ , δγ are two discount factors in (0, 1) that define the levels of discounting of historical information on the local level and simultaneous parental coefficients, respectively. 12 Simultaneous parental sets: We used the initial 3 years of data, T1 = 1 : 845 over 2003, for exploratory analysis to select the sp(j) for each j = 1 : 400. On this initial training data set, we simply ran separate, univariate DLMs with a local level and with all the remaining 399 series as simultaneous parents. We then chose pj = 10 series to define each sp(j) by selecting those with the largest estimated effect sizes over the later part of the training data. The set of simultaneous parents obviously plays a key role in model fit and forecasting. Here, however, we emphasize that more formal model selection focused on this is not a theme in the current paper. In practice, we operate with a chosen set of parental sets over given periods of time, refreshing/modifying the parental sets periodically via off-line analysis, and/or using multiple such sets– a restricted number– and engaging in model averaging. A future paper will address this. The current paper takes the parental sets as given, the focus and contributions involving the SGDLM modeling innovation and the decoupling/recoupling computational strategy. We show in this study that the enhancements under this strategy are not merely theoretical considerations, but significantly improve forecasting in this m = 400−dimensional setting. Discount factor specification: We ran the SGDLM analysis on the second training set of data, T2 = 846 : 1367. This initialized at t = 846 with priors based on: rj,846 = 5 and sj,845 = 0.001, aj,846 = 0 and Rj,846 = diag(0.0001, 0.01, · · · , 0.01). This analysis was used to explore the impact of varying the state discount factors δφ , δγ and the volatility discount βj = β, all assumed the same across series j. This settled on chosen values βj = 0.95, δφ = 0.95, δγ = 0.99. We then reran the analysis but without the simulation-variational Bayes step of the computations. That is, using the parallel, independently updated set of models. From this we settled on βj = 0.95 and δφ = 0.95 again, but found improved 1−step ahead forecasting with an appropriately higher discount factor δγ = 0.999 on the parental predictors. Below we evaluate forecasting results on test data using this as well as the full decoupled/recoupled analysis; this choice of discount factors makes this an honest comparison as we begin the analysis of test data with optimized parameter specification for each strategy. 5.3 Forecasts of Stock Returns We review forecasting results by analyzing coverage rates of 1−step ahead forecast intervals, and by evaluating the quantiles of these forecast distributions at which realized returns fall. 5.3.1. Analysis of the Variational Bayes Approximation Figure 1a shows the effective sample size SN of the variational Bayes step at each time point t. Based on the N = 10000 samples, Sn exceeds 7000 more than 94% of the time. There is a short-lived drop to about 5000 during a period of extreme market stress following the collapse of Bear Sterns in September 2008. In the range from 7000 to 9000, the typical effective sample 13 size indicates excellent performance of the importance sampler. 10000 1 9000 0.9 8000 0.8 7000 0.7 6000 0.6 5000 0.5 4000 0.4 3000 0.3 2000 0.2 1000 0.1 0 2006 2007 2008 2009 2010 2011 2012 2013 0 2006 2014 2007 2008 2009 2010 2011 2012 2013 2014 (b) KL divergence KLpe|p (a) Effective sample size SN Figure 1: SN and KLpe|p during the test data time period Ttest = 1368 : 3411. Figure 1b shows the importance sample-based estimate HN of the KL divergence KLpe|p of pe(·) from the exact posterior p(·) of eqn. (9). Bounding the KL of the variational Bayes posterior and the exact posterior from above, this shows that the variational Bayes posterior will be very close to the exact posterior. 5.3.2. Aggregate Analysis Table 1 shows the average coverage rates of forecasts across all series and the entire test data time period Ttest = 1368 : 3411. Across all coverage levels, forecast intervals are too wide. However, coverage rates of the analysis using the importance sampling/variational Bayes strategy (indicated simply as “with VB”) align much more closely with the outcome data than do those from analysis that does not use the strategy (indicated as “without VB”). Forecast interval w VB w/o VB 99.0% 98.6% 99.2% 95.0% 95.9% 97.9% 90.0% 92.8% 96.3% 80.0% 85.8% 92.9% 50.0% 59.8% 76.6% 20.0% 27.2% 41.6% 10.0% 14.3% 23.2% Table 1: Coverage of centered forecast intervals across all series j = 1 : m and the entire test data time period Ttest = 1368 : 3411. Figure 2 shows histograms representing the forecast distribution quantiles of the realized stock returns. If the forecast distributions were perfectly calibrated, these realized quantiles would be uniformly distributed. The over-representation of quantiles around the median suggests that the volatilities are generally over-estimated, consistent with the above comments on the overly conservative dispersion of forecast intervals. The symmetry around the median indicates that our model/analyses are not biased in under or over-forecasting actual gains or losses. 14 4 10 4 x 10 10 8 8 6 6 4 4 2 2 0 0 0.2 0.4 0.6 0.8 0 1 (a) With Variational Bayes x 10 0 0.2 0.4 0.6 0.8 1 (b) No Variational Bayes Figure 2: Empirical distribution of realized forecast quantiles across all series j = 1 : m and the entire test data time period t = 846 : 3410. More elaborate DLMs might improve the overall extent of forecast calibration, but that is not a key issue here. Rather, the comparison of the analysis with/without the decoupling/recoupling computational strategy is the focus, and the figure clearly shows the benefits of the strategy. 5.3.3. Close-Up Analysis of Individual Stocks We now zoom-in on forecasting performance for stock returns of six well-known companies: Apple, Bank of America, General Electric, McDonald’s, Pfizer and Starbucks. Coverage of forecast intervals: Coverage rates for these companies are reported in Table 2. The results confirm at these individual levels the findings from the aggregate analysis. While forecast intervals up to 95% tend to be over-estimated by the specific model chosen, the decoupling/recoupling analysis generally improves forecasting performance, while also yielding more precise forecast intervals. Realized forecast quantiles: Figure 3 shows the distribution of the realized quantiles. As in the aggregate summary across all series, the histograms are far closer to uniformity with VB than without. In particular, the latter analysis more substantially over-estimates volatilities as is evident in the higher peaks in the centers of these histograms. Again, the general symmetry around the median in all cases indicates that the model analyses are effectively unbiased and similarly accurate in terms of forecasting the signs of returns. Trend: Figure 4a shows the 45-day tracking moving average of the daily log returns, comparing empirical trends with those forecast under the SGDLM and that without VB. There is no noticeable advantage of using the posterior decoupling/recoupling. This is an expected observation, given that the posterior weighting in importance sampling uses the determinants |I − Γt | as weights. The value of the trend parameters do affect these weights, so the marginal posteriors should be expected to be adequately estimated without the importance sampling/VB steps. 15 Forecast interval 99.0% 95.0% w/ VB w/o VB 98.4% 99.3% 95.8% 98.0% w/ VB w/o VB 98.4% 98.4% 95.8% 96.7% w/ VB w/o VB 98.4% 99.2% 94.5% 97.9% w/ VB w/o VB 98.6% 99.3% 96.1% 98.6% w/ VB w/o VB 98.9% 99.6% 95.5% 97.7% w/ VB w/o VB 98.2% 98.8% 95.6% 97.2% 90.0% 80.0% 50.0% Apple Inc 92.3% 86.3% 60.1% 96.2% 92.8% 74.8% Bank of America Corp 92.8% 86.0% 61.4% 95.1% 92.1% 80.2% General Electric Co 91.3% 84.4% 58.7% 95.8% 92.7% 77.9% McDonald’s Corp 93.0% 86.7% 59.0% 96.7% 92.9% 73.7% Pfizer Inc 92.5% 85.5% 59.5% 96.6% 93.2% 76.7% Starbucks Corp 92.5% 86.5% 59.8% 95.4% 92.5% 75.1% 20.0% 10.0% 25.9% 39.6% 13.4% 19.7% 27.3% 50.8% 14.0% 30.4% 25.9% 44.6% 12.9% 25.3% 26.1% 38.5% 13.2% 19.8% 26.7% 39.9% 14.3% 20.8% 27.0% 39.3% 13.9% 21.6% Table 2: Coverage of centered forecast intervals of individual stock returns averaged over the test data time period Ttest = 1368 : 3411. Volatility: Represent volatility via standard deviations of returns. Figure 4b overlays the observed 45-day tracking moving average of volatility with volatilities as estimated from our analyses. The volatility forecasts from SGDLM analysis with and without VB track each other closely until the onset of the financial crisis in the fall of 2008. Without VB– i.e., ignoring the cross-series constraints– the volatility forecasts skyrocket and never return to historically normal levels. In stark contrast to the positive, the full SGDLM analysis generates forecast volatilities in extremely good agreement with the empirically evaluated levels, before, throughout and after the financial crisis and recessionary years. 5.4 Analysis of Computation Time We gave theoretical considerations about computational loads in Section 4.4. To empirically assess this we reran multiple analyses using varying numbers of series m and sizes of the parental sets. Table 3 summarizes the empirical run times across several values, in each case generating N = M = 10000 posterior and 1−step forecast samples at each time point. The scaling of the run times is roughly in line with our theoretical estimates, considering some fixed time for memory transfers and communication overheads for coordinating four GPUs. The times were measured on a computer with four Nvidia Tesla C2050 GPUs with 448 CUDA cores each. 16 Apple Inc Apple Inc 400 400 300 300 200 200 100 100 0 0 0.2 0.4 0.6 0.8 0 1 0 0.2 Bank of America Corp 400 300 300 200 200 100 100 0 0.2 0.4 0.6 0.8 0 1 0 0.2 General Electric Co 400 300 300 200 200 100 100 0 0.2 0.4 0.6 0.8 0 1 0 0.2 McDonald’s Corp 400 300 300 200 200 100 100 0 0.2 0.4 0.6 0.8 0 1 0.4 300 300 200 200 100 100 0.2 0.4 0 0.2 0.6 0.8 0 1 0 0.2 Starbucks Corp 400 300 300 200 200 100 100 0 0.2 0.4 0.8 1 0.6 0.8 1 0.6 0.8 1 0.8 1 0.6 0.4 0.6 0.4 Starbucks Corp 400 0 1 Pfizer Inc 400 0 0.8 0.6 0.4 Pfizer Inc 400 0 1 McDonald’s Corp 400 0 0.8 General Electric Co 400 0 0.6 Bank of America Corp 400 0 0.4 0.8 0 1 (a) With Variational Bayes 0 0.2 0.4 0.6 (b) No Variational Bayes Figure 3: Empirical distribution of realized forecast quantiles averaged over the test data time period Ttest = 1368 : 3411. 17 Apple Inc Apple Inc 0.02 0.2 0.01 0.15 0 0.1 −0.01 0.05 −0.02 2006 2007 2008 2009 2010 2011 2012 2013 0 2006 2014 2007 2008 Bank of America Corp 0.2 0.02 0.15 0 0.1 −0.02 0.05 2007 2008 2009 2010 2011 2012 2013 0 2006 2014 2007 2008 General Electric Co 0.2 0.01 0.15 0 0.1 −0.01 0.05 2007 2008 2009 2010 2011 2012 2013 0 2006 2014 2007 2008 McDonald’s Corp 0.2 0.01 0.15 0 0.1 −0.01 0.05 2007 2008 2009 2010 2011 2012 2013 0 2006 2014 2009 2009 2007 2008 2009 Pfizer Inc 0.2 0.01 0.15 0 0.1 −0.01 0.05 2007 2008 2009 2010 2011 2012 2013 0 2006 2014 2007 2008 Starbucks Corp 0.2 0.01 0.15 0 0.1 −0.01 0.05 2007 2008 2009 2010 2014 2010 2012 2013 2014 2012 2013 2014 2011 2012 2013 2014 2011 2012 2013 2014 2012 2013 2014 2011 2010 2011 2010 2011 2009 2010 Starbucks Corp 0.02 −0.02 2006 2013 Pfizer Inc 0.02 −0.02 2006 2012 McDonald’s Corp 0.02 −0.02 2006 2011 General Electric Co 0.02 −0.02 2006 2010 Bank of America Corp 0.04 −0.04 2006 2009 2012 2013 0 2006 2014 (a) Trend 2007 2008 2009 2010 2011 (b) Volatility Figure 4: Comparison of the realized 45 day moving average/volatility and forecast returns/volatilities. The observed trend is in blue, the results from the full SGDLM analysis are in green, and those from analysis without VB are in red. 18 m = 50 m = 100 m = 200 m = 400 pj = 5 Posterior Forecast 0.07 0.07 0.16 0.25 0.50 1.17 2.13 6.75 pj = 10 Posterior Forecast 0.11 0.08 0.21 0.26 0.61 1.20 2.32 6.78 pj = 20 Posterior Forecast 0.20 0.10 0.43 0.32 1.04 1.31 3.22 7.01 Table 3: Realized run times (seconds) of SGDLM analysis and forecasting with N = M = 10000. 6. Additional Comments The multivariate dynamic model formulation via SGDLMs is persuasive in terms of the theoretical ability to flexibly represent– in state-space forms– the dynamics of individual series coupled with contemporaneous, cross-series multivariate dependencies. The framework conceptually allows for scalability to higher dimensional series and completely frees the modeler from conceptual and technical constraints encountered in existing models. With that outlook, while also recognizing the inherent opportunities for model decoupling as part of an overall analysis, and with the insight that individual, univariate model analyses will generally come close to representing core aspects of the full multivariate model, we defined the decoupling/recoupling strategy that builds on importance sampling and variational Bayes to define computationally efficient and practically effective analyses. The computational efficacy arise from aggressive distributed computation, for which GPU hardware is ideally suited; we have presented the ideas and key details of GPU-enhanced computations, and provide freely available software for interested researchers to follow-up. The practical effectiveness is demonstrated in the 400−dimensional time series study, and underpinned by the discussion of theoretical questions of importance sampling and variational Bayes/divergence based approximation of unknowable exact posteriors. Relative to running sequential analyses of completely decoupled models, our example shows substantial improvements in forecasting performance at a reasonable computational cost: the run-time is only increased by about a third. On a rather standard 2012 desktop (with four Nvidia Tesla C2050 GPUs having 448 CUDA cores each) posterior update and forecasting with N = M = 10000 samples for our 400−dimensional time series complete in less than 10 seconds per step. This easily allows for real-time analysis in intervals less than one minute and will scale to several hundreds and low thousands of series. Current and future work includes the refinement of our existing Matlab interface, and development of an R interface to our GPU-accelerated implementation of Bayesian learning and forecasting of SGDLMs. Current applied and methodological questions under study are questions of parental set selection– which again must involve a focus on practicalities and move away 19 from a purely theoretical but practically unworkable “global model averaging” perspective. Here again forecasting performance coupled with innovations in computational strategies will be key to practical progress. A. Appendix: State Evolution Models and Updating Details of the standard DLMs for individual series used in Section 3.1 are summarized here. A.1 Evolutions In series j, the time t − 1 normal-gamma posterior is p(θ j,t−1 , λj,t−1 |Dt−1 ) of eqn. (6). With known state evolution matrices Gjt , state innovations variance matrices Wjt , and volatility discount factors βj , the evolution model has the following elements (West and Harrison 1997; Prado and West 2010). • First, the volatilities evolve via λjt = λj,t−1 ηjt /βj , ηjt ∼ Be(βj nj,t−1 /2, (1 − βj )nj,t−1 /2) (11) with independent, idiosyncratic beta distributed innovations ηjt . This yields the time t prior (λjt |It , Dt−1 ) ∼ G(rjt /2, rjt sj,t−1 /2) with rjt = βj nj,t−1 and where It represents the information that the volatility has changed. • Second, the time t − 1 conditional for the state vector in eqn. (6) updates to (θ j,t−1 |λjt , It , Dt−1 ) ∼ N (mj,t−1 , Cj,t−1 /(sj,t−1 λjt )), now conditional on the evolved precision λjt . • Third, the state vector evolves as θ jt = Gjt θ j,t−1 + ω jt , ω jt ∼ N (0, Wjt /(sj,t−1 λjt )) (12) with independent, idiosyncratic innovations ω jt . One implication is that the resulting time t prior p(θ jt , λjt |It , Dt−1 ) is the multivariate normalgamma of eqn. (8). A.2 Updating Equations In the set of m DLMs above and in Section 3.1, the time t update of the normal-gamma prior p(Θt , Λt |It , Dt−1 ) leads to the normal-gamma posterior p(Θt , Λt |Dt ) of eqns. (9,10) where the 20 parameters are given by the standard updating equations. First, compute the following: 1−step ahead forecast error: ejt = yjt − F0jt ajt 1−step ahead forecast variance factor: qjt = sj,t−1 + F0jt Rjt Fjt Adaptive coefficient vector: Ajt = Rjt Fjt /qjt Volatility update factor: cjt = (rjt + e2jt /qjt )/(rjt + 1) Then, compute the posterior parameters: Posterior mean vector: Posterior covariance matrix factor: e jt = ajt + Ajt ejt m e jt = (Rjt − Ajt A0 qjt )cjt C jt Posterior degrees of freedom: n ejt = rjt + 1 Posterior residual variance estimate: sejt = cjt sj,t−1 B. Appendix: C++/CUDA Implementation Details We provide some technical details about the C++/CUDA implementation of forward filtering and forecasting of the SGDLM. B.1 Generation of Gamma Variates While the CUDA toolkit provides basic random number generators for the uniform and normal distributions, it is up to the developer to implement other distributions. Our implementation of a gamma random number generator uses the standard rejection sampling algorithm (Marsaglia and Tsang 2000), requiring one uniform and one normal random number at each step. With expected acceptance probabilities over 90%, running the scheme 2n times to generate n Gamma random numbers will almost certainly suffice. We generate 2n uniforms and normals with just one call to curandGenerateUniformDouble and curandGenerateNormalDouble, respectively. Bundled calls to these CUDA toolkit functions are the most efficient way to generate random numbers on the GPU. We choose to generate enough random numbers for 2n proposals, because this allows highly efficient processing on the GPU without unnecessary synchronization with the host device. Specifically, a block of CUDA threads is tasked with generating a batch k < n gamma random numbers. The batch size is typically chosen as the maximum number of parallel threads the GPU can evaluate, that is, k = 512 or k = 1024. That batch size is large enough so that there is going to be at least one rejected proposal. This means that at least one thread will have to try a second attempt for acceptance. On the other hand, k is large enough that 2k proposals will yield at least k acceptances. The architecture of a GPU is such that no thread of a block is available when at least one 21 thread is busy. The second attempt of at least one thread blocks the entire block. Parallel processing of the threads means that it doesn’t cost additional time to let all threads of the same block generate a second proposal. Furthermore, synchronization within a block is very fast. We will count the number of acceptances and rejections of the first attempt and fill up the k-size vector of gamma random variables in the second attempt. B.2 Computing njt in Variational Bayes In the VB optimization of Section 3.3.1, the degrees-of-freedom parameter njt is defined implicitly by log(njt + pj − djt ) − ψ(njt /2) − (pj − djt )/njt − log(2E[λjt ]) + E[log λjt ] = 0. (13) Our C++/CUDA program implements a CUDA kernel for the Newton-Raphson method (Atkinson 1989) to solve for njt , initialized at n ejt . This uses standard numerical approximations for the digamma and trigamma functions (Abramowitz and Stegun 1972), implemented for this problem as the CUDA toolkit does not provide implementations of these functions. These computations are parallelized over series j = 1 : m. B.3 Memory Management C++and CUDA programming requires proper memory management to avoid segmentation faults as well as memory leaks. Memory management on GPUs is very similar to memory management on the host device/CPUs. Memory is allocated using the malloc or cudaMalloc commands and freed using free or cudaFree. This is a conceptually trivial task. However, keeping track of all allocated memory poses a real challenge when more involved software projects branch into different paths. To address this, we developed a universal and lightweight helper class, memory manager, that automates memory management. This keeps pointers to allocated memory in a C++/boost list. It provides methods to allocate, and keep track of, memory as well as a clear method to free all memory it tracks. Each function starts its own instance of our memory manager to manage memory that is not to live beyond the lifetime of that function. At every branch that ends a function, we simply invoke the clear method of the respective function’s memory manager to free that memory. Furthermore, there is a global instance of memory manager to manage the large memory blocks needed for simulation of forecasting and variational Bayes approximation. Allocating and freeing that memory at every time step t would be prohibitively time-consuming. 22 B.4 Matlab Interface MathWorks provides a C library, mex.h, to allow the development of a Matlab interface to C, C++ and CUDA software. This library provides functionality to read Matlab input, write Matlab output and write Matlab messages. Recent versions of the Matlab Parallel Computing toolbox have another library to read Matlab GPU input and write Matlab GPU output, mxGPUArray.h, that also allows the allocation and freeing of GPU memory. However, in contrast to our memory manager, Matlab’s memory tracking only avoids memory leaks when Matlab ends, not while the function is active. We decided against the use of the Matlab GPU library for three reasons. First, up to Matlab R2013b, Matlab ships with an outdated version of the CUDA toolkit library that does not yet provide the latest batch parallelism functionality that we use. Second, by not relying on the Matlab GPU library, our functions can be compiled and run by users that do not license the parallel computing toolbox. Third, we add backwards compatibility to earlier versions of Matlab that do not come with GPU capabilities by programming the GPU interface ourselves. This has been tested with Matlab versions as early as R2010a without any problems. We will of course revisit the questions as new versions of Matlab come along. References Abramowitz, M. and Stegun, I. A. (1972). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover Publications, New York. Agarwal, D., Chen, D., Lin, L., Shanmugasundaram, J., and Vee, E. (2010). Forecasting highdimensional data. In Elmagarmid, A. K. and Agrawal, D., editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, Indianapolis, USA, pages 1003– 1012. ACM. Aguilar, O. and West, M. (2000). Bayesian dynamic factor models and portfolio allocation. Journal of Business & Economic Statistics, 18(3):338–357. Alspach, D. L. and Sorenson, H. W. (1972). Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE Transactions on Automatic Control, 17:439–448. Atkinson, K. (1989). An Introduction to Numerical Analysis. Wiley, New York. Carvalho, C. M. and West, M. (2007). Dynamic matrix-variate graphical models. Bayesian Analysis, 2(1):69–98. Harrison, P. J. and Stevens, C. F. (1971). A Bayesian approach to short-term forecasting. Operations Research Quarterly, 22:341–362. 23 Harrison, P. J. and Stevens, C. F. (1976). Bayesian forecasting (with discussion). Journal of the Royal Statistical Society (Series B, Methodological), 38:205–247. Hosking, J., Natarajan, R., Ghosh, S., Subramanian, S., and Zhang, X. (2013). Short-term forecasting of the daily load curve for residential electricity usage in the Smart Grid. Applied Stochastic Models in Business and Industry, 29(6):604–620. Jaakkola, T. and Jordan, M. I. (2000). Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–27. Koop, G. (2012). Using VARs and TVP-VARs with many macroeconomic variables. Central European Journal of Economic Modelling and Econometrics, 4:143–167. Lee, A., Yau, C., Giles, M. B., Doucet, A., and Holmes, C. C. (2010). On the utility of graphics cards to perform massively parallel simulation with advanced Monte Carlo methods. Journal of Computational & Graphical Statistics, 19(4):769–789. Marsaglia, G. and Tsang, W. W. (2000). A simple method for generating gamma variables. ACM Transactions on Mathematical Software, 26(3):363–372. Mukherjee, C., Kasibhatla, P. S., and West, M. (2014). Spatially-varying SAR models and Bayesian inference for high-resolution lattice data. Annals of the Institute of Statistical Mathematics, 66:473–494. First published online: October 8, 2013. Nakajima, J. and West, M. (2013a). Bayesian analysis of latent threshold dynamic models. Journal of Business & Economic Statistics, 31:151–164. Nakajima, J. and West, M. (2013b). Bayesian dynamic factor models: Latent threshold approach. Journal of Financial Econometrics, 11:116–153. Prado, R. (2010). Multi-state models for mental fatigue. In O’Hagan, A. and West, M., editors, The Handbook of Applied Bayesian Analysis, pages 845–874. Oxford University Press. Prado, R. and West, M. (2010). Time Series: Modeling, Computation & Inference. Chapman & Hall/CRC Press. Quintana, J. M. and West, M. (1987). An analysis of international exchange rates using multivariate DLMs. The Statistician, 36:275–281. Smith, A. F. M. and West, M. (1983). Monitoring renal transplants: An application of the multiprocess Kalman filter. Biometrics, 39:867–878. 24 Suchard, M. A., Wang, Q., Chan, C., Frelinger, J., Cron, A. J., and West, M. (2010). Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures. Journal of Computational and Graphical Statistics, 19(2):419–438. Trejo, L. J., Knuth, K., Prado, R., Rosipal, R., Kubitz, K., Kochavi, R., Matthews, B., and Zhang., Y. (2007). EEG-based estimation of mental fatigue: Convergent evidence for a three-state model. In Schmorrow, D. D. and Reeves, L. M., editors, Augmented Cognition, HCII 2007, LNAI 4565, pages 201–211. Springer-Verlag, Berlin Heidelberg. Wang, H. and West, M. (2009). Bayesian analysis of matrix normal graphical models. Biometrika, 96:821–834. West, M. (1993). Approximating posterior distributions by mixtures. Journal of the Royal Statistical Society (Ser. B), 54:553–568. West, M. and Harrison, P. J. (1989). Subjective intervention in formal models. Journal of Forecasting, 8:33–53. West, M. and Harrison, P. J. (1997). Bayesian Forecasting & Dynamic Models. Springer Verlag, 2nd edition. Zhao, Z. Y. and West, M. (2014). Dynamic compositional regression modelling: Application in financial time series forecasting and portfolio decisions. Technical report, Department of Statistical Science, Duke University. Zhou, X., Nakajima, J., and West, M. (2014). Bayesian forecasting and portfolio decisions using dynamic dependent factor models. International Journal of Forecasting. in press. 25
© Copyright 2025 ExpyDoc