An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue* and K. Murakami* *Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan **Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan E-mail: [email protected], [email protected] 1 Agenda Introduction Large-Scale Reconfigurable Data-Path (LSRDP) General Architecture and Specifications Design Procedure and Tool Chain Preliminary Results Conclusions and Future Work Kyushu University WAHA 2009 2 Introduction Parallel computer clusters with General-Purpose Processors (GPP) are often used for HPC Various accelerators are used with GPPs for further performance improvement PowerXcell, GPGPU, GRAPE-DR, ClearSpeed, etc. Small size and low power consumption comparing to processors with similar performance Roadrunner with PowerXcell TSUBAME NVIDIA Tesla S1070 http://it.nikkei.co.jp/ http://www.elsa-jp.co.jp/products/hpc/tesla/s1070/index.html Kyushu University WAHA 2009 http://www.top500.org/system/9485 3 Single Flux Quantum Large Scale Reconfigurable Data-Path (SFQ-LSRDP) A large memory bandwidth is demanded in conventional accelerators for high-performance computation On chip memories are often used to hide memory access latency Large-Scale Reconfigurable Data-Path (LSRDP): • is introduced as an alternative accelerator • reduces the no. of memory accesses • is implemented by Single-Flux Quantum (SFQ) circuits instead of CMOS circuits • is suitable for high performance scientific computations Kyushu University WAHA 2009 4 Outline of Large-Scale Reconfigurable Data-Path (LSRDP) processor LSRDP Reconfigurable data-path includes: A large number of floating point Functional Units (FUs) Reconfigurable Operand Routing Network : ORN Dynamic reconfiguration facilities FU GPP FU FU ... FU ORN : Operand Routing Network : : : FU FU FU : ... FU ORN FU FU Implementation FU ... : : ... by SFQ circuits FU SB : Streaming Buffers (SB) for I/O ports : Features: Data Flow Graphs (DFGs) extracted from critical calculation parts are directly mapped Pipeline execution Burst transfer is used for input /output rearranged data from/to memory SMAC Main Memory Kyushu University Scratchpad Memory WAHA 2009 5 Single-Flux Quantum (SFQ) against CMOS CMOS issues: high electric power consumption high heat radiation and difficulties in high-density packing memory wall problem which limits the processing speed SFQ Features: High-speed switching and signal transmission Low power consumption Compact implementation of a system (small area) No cost for latch Suitable for pipeline processing of data stream Serial bit-level processing Kyushu University WAHA 2009 6 CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum circuits Yokohama National Univ. SFQ-FPU chip, cell library Kyushu Univ. Prof. N. Yoshikawa et al. Nagoya Univ. SFQ-RDP chip, cell library, and wiring Prof. A. Fujimaki et al. SFQ-LSRDP Architecture, Compiler and Applications Prof. K. Murakami Dr. K. Inoue Dr. H. Honda Dr. F. Mehdipour H. Kataoka Nagoya Univ. CAD for logic design and arithmetic circuits Superconducting Research Lab. (SRL) Prof. N. Takagi (Leader) et al. SFQ process Dr. S. Nagasawa et al. Kyushu University WAHA 2009 7 Goals of the Project Discovering appropriate applications Developing compiler tools Developing performance analyzing tools Designing and Implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuits Kyushu University WAHA 2009 8 LSRDP General Architecture and Specifications 9 Parameters Should Be Decided Within the LSRDP Design Procedure Streaming Buffer (SB) • Core structure a matrix of PEs ... Operand Routing Network (ORN) PE1 PE2 PE3 Width ... • PE: combination of a Functional Unit (FU) and a data Transfer Unit (TU) Width and Height ? ... Height Maximum Connection Length (MCL) between consecutive rows? ORN ... ORN . . . . . . . . . ... ORN ... Streaming Buffer (SB) Kyushu University PEm Layout: FU types (ADD/SUB and MUL)? • Reconfiguration mechanism? (PE, ORN, Immediate data) • On-chip memory configuration? WAHA 2009 10 LSRDP Architecture Processing Elements FU implements basic 64-bit double-precision floating point operations including: ADD, SUB and MUL TU (transfer unit) as a routing resource for transferring data from a row to an inconsecutive row FU FU TU FU TU TU TU FU TU PE including Two components Four functionalities Kyushu University WAHA 2009 11 Layout Types- Type I W A M T A M T A M T … A M T A M T Each PE implements ADD/SUB and MUL ORN A M T A M T A M H M : MUL … A M T A M T ORN A M A T T A M T A M T … A M T ADD/SUM A T M MUL TU . : ADD/SUB . T : Transfer Unit . ORN A M T Kyushu University Flexible A M T A M T … A M T A M T but consume a lot of resources WAHA 2009 12 Layout Types- Type II (Checkered) W A T M T Each PE implements ADD/SUB or MUL A T … A T M T Each PE implements ADD/SUB or MUL ORN A T M T A T … A T M T … A T MUL M T … A T M T ORN ADD/SUMA T TU M T A T TU . . . H ORN A T Kyushu University M T A T WAHA 2009 13 Layout Types- Type III (Striped) Each PE implements ADD/SUB or MUL W M T M T M T … M T M T Each PE implements ADD/SUB or MUL ORN ADD/SUMA T TU A T A T … A T A T … M T MUL M T … A T A T ORN M T M T H M T TU . . . ORN A T Type Kyushu University A T A T II or III, which WAHA one is more efficient? 2009 14 Maximum Connection Length (MCL) Connection Length= 0 (i, 0) ... Connection Length= 2 (i,j) (i, j+1) (i, j+2) Longest Connection Length= L ... ... ... ... ORN (i+1, 0) ... (i+1, j) (i+1, j+1) (i+1, j+2) (i+1, j+L) MCL: maximum horizontal distance between two PEs located in two consecutive rows Kyushu University WAHA 2009 15 An ORN Structure ORN T FPU FPU T ½CB ½CB ½CB T2 CB CB T2 CB CB CB FPU ½CB CB CB CB T FPU T FPU T FPU ½CB ½CB ½CB ½CB ½CB ½CB CB CB CB CB CB CB T2 CB T FPU T FPU T2 CB CB CB CB CB CB CB CB CB T2 T CB CB CB T2 CB CB T2 CB T2 CB T FPU 2bit shift register ORN is FPU consisted FPU of 2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches T T2 CB CB Kyushu University A. Fujimaki, et al., Demonstration ofWAHA an SFQ-Based 2009Accelerator Prototype for a High-Performance Computer,” ASC08,162008. Dynamic Reconfiguration Mechanism Input-B Input-A Input-C PE Immediate Register (64b) Immediate Register Transfer Unit MUX FU (A op B) Conf. Reg. [bit] ・・・ ORN ・・・ log(2x (2MCL+1)) x 3 [b] Three bit-stream lines for dynamic reconfiguration of: • Immediate registers (64bit) in each PE • Selector bits for muxes selecting the input data of FUs • Cross-bar switches in ORNs Kyushu University WAHA 2009 17 Design Procedure and Tool Chain 18 Compiler and Design Flow Application Code Mapping Placement Bit-Stream Generation Hardware-Software Partitioning (manual or automatic) Routing Configurations (for LSRDP) LSRDP Architecture Critical Parts (h/w part) Port positioning Analyzing DFG mapping results DFG Genration Design Phase DFGs DFGs Non-Critical Parts (s/w part) s/w Part Modification Binary Code (for GPP) • DFGs are manually generated from critical parts of applications • DFG mapping results are used for • Analyzing LSRDP architecture statistics • Generating LSRDP configuration bit-streams Kyushu University WAHA 2009 19 LSRDP Design Procedure DFGs & LSRDP HW constraints Choosing a design parameter For each parameter Mapping DFGs onto the LSRDP Obtaining required statistics Analyzing the mapping results Choosing the appropriate value Kyushu University Appropriate value for each parameter WAHA 2009 20 Benchmark Applications for Design Procedures Finite differential method calculation of 2nd order partial differential equations 1dim-Heat equation 1dim-Vibration equation 2dim-Poisson equation (Heat) (Vibration) (Poisson) Quantum chemistry application Recursive parts of Electron Repulsion Integral calculation (ERI-Rec) Only ADD/SUB and MUL operations are used in the critical calculations of all above applications Kyushu University WAHA 2009 21 DFG Extraction- Heat Equation 1-dim. heat equation for T(x,t) T ( x, t ) 2T ( x, t ) A t x 2 (A is const.) T(i-1,j) T(i,j) T(i+1,j) + Calculation by Finite Difference Method (FDM) T ( xi , t j 1 ) D * B * D * T ( xi , t j ) B * T ( xi 1 , t j ) T ( xi 1 , t j ) + T(i,j+1) Basic DFG can be extended to horizontal and vertical directions to make a larger DFG Kyushu University WAHA 2009 Basic DFG corresponding to Minimum FDM calculation 22 Example of extracted DFGs- Heat Inputs: Outputs: Operations: Immediates: 32 16 721 364 A huge sample DFG (Heat) Kyushu University WAHA 2009 23 DFG Classification # of # of Inputs Outputs Class # of FUs RDP-S 128 19 12 RDP-M 512 19 12 RDP-L 1024 38 24 RDPXL > 1024 64 52 # of DFGs Heat (3) Poi (1) Vib (2) Eri (4) Heat (1) Poi (1) Vib (1) Eri (4) Heat (2) Poi (1) Vib (2) Eri (5) Heat (1) Poi (1) Vib (2) Eri (5) Totally, 24 DFGs are prepared for benchmark DFG Due to broad range of DFG sizes DFGs are classified as S, M, L, XL with respect to their size the number of Input/Output nodes Kyushuand University WAHA 2009 24 Mapping DFGs onto LSRDP LSRDP Architecture Description DFG Placing DFG nodes on LSRDP Routing Connections Placing IO nodes Routing Inp/Out Connections Configuration File Kyushu University Longest connections WAHA 2009 25 Preliminary Results 26 LSRDP Specifications: Width & Height # of Input ports # of Output ports Width Height LSRDP-S 19 12 16 16 LSRDP-M 19 12 32 16 LSRDP-L 38 24 64 32 LSRDP Dimensions and the number of Input/Output Ports Kyushu University WAHA 2009 27 LSRDP Specifications: MCL (i, j) MCL = L (i, j+1) LSRDP MCL (avg/max) (i+1, j+1) (i+2, (i+L, ・・・ j+1) j+1) ORN SizeNo of Inps (avg/max), Outs LSRDP-S 4/8 18/34, 3 LSRDP-M 5/9 22/38, 3 LSRDP-L 5/9 22/34, 3 Needs further MCL optimization Kyushu University WAHA 2009 28 Analyzing Various LSRDP Layouts Heat Viration Poisson ERI1 ERI2 Layout I II III I II III I II III I II III I II III Size 8x3 8x3 8x4 10x8 10x8 10x11 10x10 10x12 15x18 6x2 9x3 6x2 10x10 10x10 15x8 Layout I Layout II (Except ERI1 DFG which gives better size for Layout III) Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost. Kyushu University WAHA 2009 29 LSRDP at One Glance (1/2) Functional units ADD/SUB, MUL Layout Type II (checker pattern) Operations 64-bit floating point Processing structure Pipelined PE structure FU, T, FU+T, T+T LSRDP Size Small Medium Large No. of inp/out ports 19/12 19/12 38/24 Width/Height 16/16 32/16 64/32 16*16*64 32*16*64 64*32*64 16*BSS(ORN) 32* BSS(ORN) 64*BSS(ORN) 16*16* 2 32*16*2 64*32* 2 22 , 3 26 , 3 26 , 3 Conf. bit-stream size Imm. Regs ORNs PEs ORN inputs, outputs Structure Cross-bar switch Conn. Type One-directional Kyushu University WAHA 2009 30 LSRDP at One Glance (2/2) Internal memory External memory Reconf. mechanism Kyushu University Type Immediate registers Size and count 64-bit registers, One reg. for each PE Communication mechanism Serial No. of memory modules 16 Date trans. rate 1800Mbps/pin Overall data trans. rate 24 GB/s Mem. to LSRDP bus width 64 bit Channels per module Two Bit serial configuration through a serial chain WAHA 2009 31 Preliminary Performance Evaluation Base processor configuration Processor type Out-of-order GPP operating frequency 3.2GHz Inst. issue width 4 instruction/cc Inst. decode width 4 instruction/cc Cache configuration L1 data 64KB(128B Entry, 2way, 2cc) L1 instruction 64KB(64B Entry, 1way, 1cc) L2 unified 4MB(128B Entry, 4way, 16cc) Latency of main memory 300cc L2 to main memory Bus width 64 Bytes Freq 800 MHz GPP+LSRDP configuration LSRDP operating frequency 80 GHz Reconfiguration Latency 1cc Latency SPM LSRDP latency 1cc Latency Main Memory SPM 7500cc Bandwidth SPMLSRDP Max. 64 * 8 Bytes/cc Bandwidth Main Memory SPM 102.4GB/sec GPP: Exec. time measurement by means of a processor simulator Kyushu University WAHA 2009 LSRDP: Estimation by performance modeling 32 Preliminary Performance Evaluation (Heat) Normalized by GPP Exec. Time 0.4 0.35 Re c o n f. 0.3 Comm. Re ar r an ge 0.25 St all 0.2 LSRDP GPP 0.15 Basic: SB only Reuse: SB + SPM 0.1 0.05 0 Basic Reuse Heat (M) Basic Reuse Heat (L) Data reusing is employed to avoid the need for data rearrangement as well as frequently data retrieval from the scratchpad memory. Kyushu University WAHA 2009 33 Preliminary Performance Evaluation (Poisson) Normalized by GPP Exec. Time 0.4 0.35 Re c o n f. 0.3 Comm. Re ar r an ge 0.25 St all LSRDP 0.2 GPP 0.15 0.1 0.05 0 poisson(S) poisson(M) poisson(L) A small fraction is related to processing time on LSRDP and the main fraction concerns to various overhead times as well the execution time on GPP Kyushu University WAHAas 2009 34 Conclusions & Future Work A high-performance computer comprising an accelerator (LSRDP) implemented by superconducting circuits was introduced. 24 benchmark Data Flow Graphs (DFGs) were manually generated. LSRDP micro-architecture is designed based on characteristics of scientific applications via a quantitative approach. LSRDP is promising for resolving issues originated from CMOS technology as well as achieving considerable performances. Future Work: •To achieve higher performance it is required to reduce various overhead costs mainly related to data management part. •To reduce the implementation cost of LSRDP, we will focus on reducing maximum connection length and ORN size. Kyushu University WAHA 2009 35 Acknowledgement This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST). Kyushu University WAHA 2009 36 Thanks! Any Questions? 37 Backup Slides Backup Slides Kyushu University WAHA 2009 38 SFQ (Single Flux Quantum) Circuit High speed, Low power consumption, and Operating by a different principle from the CMOS 磁束量子 Single Flux Quantum 超伝導ループ Superconductivity loop ジョセフソン接合 Josephson junction Ib L Ic Tunneling effect Φ0 2mV Kyushu University 2ps WAHA 2009 39 Mapping Results # of FUs # of Inputs # of Outputs Width Height RDP-S 128 19 RDP-M 512 RDP-L RDP-XL Total # of Extra TUs PEs (max/avg) (max/avg) (max/avg) (max/avg) 12 26/14.9 10/6.7 98/51.7 56/23.5 19 12 26/17.1 16/9.3 170/77.8 92/37.1 1024 38 24 58/40 24/14.4 730/260.1 428/141.3 > 1024 64 52 122/45.3 25/12.4 1217/350.4 1065/240 For each class, a lot of extra TUs are needed to map all DFGs Kyushu University WAHA 2009 PE types FU FU T T T T 40 Connection Length Minimization- Results Final optimized Maximum Connection Length (MCL) results MCL (ave/max) RDP-S RDP-M 4/9 (i, j) MCL = L 5/9 (i, j+1) RDP-L (i+1, j+1) (i+2, (i+L, j+1) ・・・ j+1) 9.3/19 ORNs should provide the connection length of 9 in LSRDP-S/M (MCL= 9). Possible to decrease? For LSRDP-L, MCL = 19 !!! ⇒ Serious Implementation Cost Kyushu University WAHA 2009 41 Distributions of Connection Lengths Average Fraction of Connection Lengths in the RDP/S Maps Connection length 4% 3% 0% 2%1%1% 0 0% 1 10% 2 3 4 5 6 7 8 9 79% 93% of connection lengths are 0 ~ 2 Only small fractions of connections results in larger ORNs Kyushu University WAHA 2009 42 Analyzing Various LSRDP Layouts Heat Viration Poisson ERI1 ERI2 Layout I II III I II III I II III I II III I II III Size 8x3 8x3 8x4 10x8 10x8 10x11 10x10 10x12 15x18 6x2 9x3 6x2 10x10 10x10 15x8 • Almost a similar small size values are achieved for Layout I and II for the majority of DFGs (except ERI1 DFG which gives better size for Layout III) Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power Kyushu University WAHA 2009 consumption and implementation cost as well 43 Why only ERI1 DFG is suitable to Layout III ? ADD/ SUB MUL ADD/ SUB MUL ... MUL ... MUL ... ... ORN ADD/ SUB MUL ADD/ SUB ORN ADD/ SUB MUL ADD/ SUB ORN ADD/ SUB MUL ADD/ SUB MUL . . . . . . . . . . . . Heat RDP-Layout Type II Layout II MUL MUL MUL MUL ... ADD/ SUB ... MUL ... ... ORN ADD/ SUB ADD/ SUB ADD/ SUB ORN MUL MUL MUL ORN ERI 1 Kyushu University ADD/ SUB ADD/ SUB ADD/ SUB ADD/ SUB . . . . . . . . . . . . RDP-Layout Type III Layout III WAHA 2009 44 FU Layout for DIV, SQRT, EXP operations 16Bits Floating point DIV, SQRT, and EXP Functional unit have been already developed by SFQ current technology. FU FU FU ... FU ORN : Operand Routing Network FU DIV FU FU ... FU ... FU ORN Where ? Three times larger latency FU FU FU : : : FU FU FU : ... FU ... FU ORN FU FU FU Pipeline execution based on ADD and MUL latency Where should we place different latency FU ? Kyushu University WAHA 2009 Heterogeneous configuration of FU array ? 45 Estimated performance improvement of 2-dim Poisson equation by LSRDP calc. 8 Normalized exec. time by GPP(3GHz) calc. 7 6 Treconf 5 Ttra 4 Trearr 3 Tst 2 Tcal 1 Tgpp 0 12.8 102.4 poisson(S) Kyushu University 12.8 102.4 poisson(M) WAHA 2009 12.8 102.4 poisson(L) Main Mem. bandwidth [GByte/sec] 46 Estimated performance improvement of ERI calculation by LSRDP 1 0.8 Treconf 0.6 Ttra Trearr 0.4 Tst 0.2 Tcal Tgpp 0 GPPonly (3GHz) Kyushu University LSRDP ERI WAHA 2009 47 Recursive Parts of Electron Repulsion Integral Formula (ERI-Rec) No. of Operations No. of Inputs No. of Output (ps,ss) 9 8 3 (ps,ps) 51 16 9 (pp,ss) 66 14 9 (pp,ps) 252 22 27 (pp,pp) 1004 28 81 DFG sizes have already determined from original recursive formula Kyushu University WAHA 2009 48 What types of software/algorithms are suitable for LSRDP ? When same calculations have to be calculated repeatedly. LSRDP is used for high throughput accelerator. Input/Output data size is small compared with the amount of the operations. memory access X small size of input Kyushu University WAHA 2009 Large amount of calculations small size of output 49 LSRDP Exploration of suitable applications for LSRDP Application matrix elements calculation Molecular integral calculations in molecular orbital method Monte Carlo type simulation etc… Numerical calculation library special function (promising?) differential equation numerical integration matrix operation (difficult ??) Triangular matrix simultaneous equation etc… Investigating applicability against various applications WAHA 2009 Kyushu University 50 Recursive Parts of Electron Repulsion Integral Formula in Molecular Orbital Calc. ~Up to (pp,pp) Recursive Calculation~ (ss,ss)(m) and all coefficients are given as input ( pi s, ss)(0) PA i ( ss, ss)(0) WPi ( ss, ss)(1) ( pi s, pk s )(0) QCk ( pi s, ss )(0) WQk ( pi s, ss )(1) ( pi p j , ss )(0) PBi ( pi s, ss )(0) WPj ( pi s, ss )(1) ik Zab 2 ij Za 2 ( ss, ss)(1) (ss, ss) (0) Za( ss, ss )(1) ( pi p j , pk s )(0) QCk ( pi p j , ss )( 0) WQk ( pi p j , ss )(1) Zab ik ( sp j , ss )(1) jk ( pi s, ss )(1) 2 ( pi p j , pk pl )(0) QDk ( pi p j , pk s )(0) WQl ( pi p j , pk s)(1) ij Zb 2 ( p p , ss) i (0) j Zb( pi p j , ss )(1) Zab il ( sp j , pk s)(1) jl ( pi s, pk s)(1) 2 (i,j,k,l = x,y,z): p function has 3 components (as 1dim array) DFG sizes areFUs. determined by Each DFG has only ADD (SUB) and MUL Kyushu University algorithm WAHA 2009 # of Inputs: # of Outputs: Max. 28 1 ~ 81 original calculation 51 DFG Distribution for each application # of FUs 1000 ERI-Rec (8 DFGs) Vibration (7) 800 Heat (6) 600 400 Poisson (3) 200 0 0 10 20 30 40 50 60 70 # of Inputs DFGs have different qualities in terms of the # of FUs, # of Inputs and Kyushu University WAHA 2009 Outputs 52 Example of MCL (Heat) MCL Heat original DFG (I/O: 8/4, FUs: 32) Kyushu University Mapping result WAHA 2009 53 Example of extracted DFGs (ERI-Rec) Maximum DFG of ERI-Rec: (pipj,pkpl) Inputs: 28 Outputs: 81 FUs: 1004 Immediates: 0 Vertical Partitioning Kyushu University WAHA 2009 Inputs: 24 Outputs: 1 FUs: 108 Immediates: 0 54 Poisson Equation 2u( x, y) 2u( x, y) f ( x, y) 2 2 x y 2D – Poisson Eq. Successive Over Relaxation method u ( n 1) ( xi , y j ) ω is const. (1 ) * u ( n ) ( xi , y j ) u 4 (n) ( xi 1 , y j ) u ( n ) ( xi 1 , y j ) u ( n ) ( xi , y j 1 ) u ( n ) ( xi , y j 1 ) h 2 f ( xi , y j ) Red/Black Gauss Seidel In order to obtain u(n+1) (xi,yj) in the next iteration, current values of five variables i.e. u(n) (xi,yj), u(n) (xi±1,yj), u(n) 55 55 (xi,yjUniversity Kyushu WAHA 2009 ±1) are needed Example of extracted DFGs (Poisson) Maximum Poisson DFG Inputs: 32 Outputs: 1 FUs: 721 Immediates: 364 Kyushu University WAHA 2009 56 Performance Evaluation: Simulation Environment Variable parameters: GPP LSRDP Main Memory • Freq. of GPP and LSRDP • Bandwidth between main memory and LSRDP • Latency of reconfiguration time • # of FPUs in LSRDP • Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported) Use streaming buffer in the LSRDP chip I/O data is sorted in the main memory. GPP: Exec. time measurement by processor simulator LSRDP: Estimation by performance modeling Kyushu University WAHA 2009 57 57 Estimated performance improvement of 1-dim heat equation by LSRDP calc. 1.2 Trec 1 Ttra 0.8 Trearr 0.6 Tst 0.4 Tcal 0.2 ETgpp 0 Only GPP 12.8 51.2 25.6 LSRDP Kyushu University WAHA 2009 102.4 Main Mem. bandwidth [GByte/sec] 58 Estimated performance improvement of 1-dim heat equation by LSRDP calc. Normalized exec. time by GPP(3GHz) calc. 0.35 0.3 0.25 Trec 0.2 Ttra Trearr 0.15 Tst 0.1 Tcal 0.05 ETgpp 0 12.8 25.6 51.2 Heat Kyushu University WAHA 2009 102.4 Main Mem. bandwidth [GByte/sec] 59 Poisson Red/Black 法における DFGの拡大による繰り返し回数の増加 4+1ノード の入力 9+4ノード の入力 SOR式 1回の計算 SOR式 2回の繰り返し 中心1ノードの出力 中心1ノードの出力 •DFGの拡大により1度に計算可能な繰り返し回数が増加 これに伴い必要な入力数も増加 Kyushu University WAHA 2009 60 60 Implementation of Heat calculation to LSRDP Original GPP code Loop j Loop i T(xi,tj) End Loop End Loop Kyushu University LSRDP code LSRDP Reconfiguration Loop j’ Input Data Rearrangement Loop N LSRDP pipeline exec. (FDM DFG calc.) End Loop Output Data Rearrangement End Loop WAHA 2009 61 61 Implementation of Poisson calculation to LSRDP Original GPP code LSRDP code Loop Iter Loop i loop j u(xi,yj) End Loop End Loop End Loop LSRDP Reconfiguration Loop Iter’ Input Data rearrangement Loop N LSRDP pipeline exec. (FDM DFG calc.) End Loop Output Data rearrangement End Loop Kyushu University WAHA 2009 62 62 Implementation of ERI-Rec calculation to LSRDP original GPP code Loop I,J,K,L Loop contraction Initial Integral Calc. Recursive Calc. End Loop Partial Fock Calc. End Loop Initial Integral Calc.: 1/Sqrt, Exp, Fm(T) are utilized => GPP calculation Recursive Calc.: only ADD/SUB, MUL => LSRDP calculation Kyushu University LSRDP code Loop I,J,K,L LSRDP Reconfiguration Loop contraction Initial Integral Calc. End Loop Input Data rearrangement Loop N LSRDP pipeline calc. (Recursive DFG calc.) End Loop Output Data rearrangement Partial Fock Calc. End Loop WAHA 2009 63 63 Vertical vs. Horizontal DFG Decomposition Original Loop N Reconfiguration Loop M LSRDP pipeline calc. End Loop End Loop Vertical Decomp. Loop n ( > N) Reconfiguration Loop M LSRDP pipeline calc. End Loop End Loop Kyushu University Horizontal Decomp. Loop N Reconfiguration Loop M 1st LSRDP pipeline calc. End Loop End Loop Loop N Reconfiguration Loop M 2nd LSRDP pipeline calc. End Loop End Loop WAHA 2009 64 64 Example of extracted DFGs Maximum DFG of ERI-Rec: (pipj,pkpl) Inputs: 28 Outputs: 81 FUs: 1004 Immediates: 0 Vertical Partitioning Kyushu University WAHA 2009 Inputs: 24 Outputs: 1 FUs: 108 Immediates: 0 65 Example of extracted DFGs- Heat Inputs: Outputs: Operations: Immediates: 32 16 721 364 A huge sample DFG (Heat) Kyushu University WAHA 2009 66 Performance Evaluation: Simulation Environment Variable parameters: GPP LSRDP Main Memory • Freq. of GPP and LSRDP • Bandwidth between main memory and LSRDP • Latency of reconfiguration time • # of FPUs in LSRDP • Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported) Use streaming buffer in the LSRDP chip I/O data is sorted in the main memory. GPP: Exec. time measurement by processor simulator LSRDP: Estimation by performance modeling Kyushu University WAHA 2009 67 67 Performance Evaluation: Execution Time Modeling Execution time ETTOTAL ETGPP ETLSRDP TOverhead ETLSRDP TCal STLSRDP Sort data + Reconfig. + Send signal for comm. Calculation time Tcal Total pipeline depth # of rows of LSRDP + in the given program (latency of LSRDP) Stall time of LSRDP <->Mem STLSRDP Latency For first Input and last output Kyushu University WAHA 2009 + Stall from Bandwidthreq > Bandwidthmem 68 68 Layout Types- Type I W A M T A M T A M T … A M T A M T Each PE implements ADD/SUB and MUL ORN A M T A M T A M H M A : ADD/SUB T : Transfer Unit … A M T A M T ORN : MUL A T M T A M T A M T … A M T ADD/SUM A T M MUL TU . . . ORN Total No. of PEs= WA * H M T A M T A M T … A M T A M T Total Area= W*H* [Area(MUL)+Area(ADD/SUB)+ Area(TU)]+ Area(ORNs) Kyushu University WAHA 2009 69 Layout Types- Type II W A T M T Each PE implements ADD/SUB or MUL A T … A T M T Each PE implements ADD/SUB or MUL ORN A T M T A T … A T M T … A T MUL M T … A T M T ORN ADD/SUMA T TU M T A T TU . . . H ORN Total No. of PEs= W * H A T M T A T Total Area= ½* W*H*[Area(MUL)+Area(TU)]+ ½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs) Kyushu University WAHA 2009 70 Layout Types- Type III Each PE implements ADD/SUB or MUL W M T M T M T … M T M T Each PE implements ADD/SUB or MUL ORN ADD/SUMA T TU A T A T … A T A T … M T MUL M T … A T M T ORN M T M T H M T TU . . . ORN Total No. of PEs= W * H A T M T A T Total Area= ½* W*H*[Area(MUL)+Area(TU)]+ ½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs) Kyushu University WAHA 2009 71 CB Various Functionalities CB two inputs/outputs four cases are possible reconfigurable 1/2CB one input/ two outputs four cases are possible reconfigurable CB 00 01 ½CB 10 Kyushu University 11 00 WAHA 2009 01 10 11 72 An ORN Structure CB T CB ½CB T large number of Josephson junctions M number of ½ CB and (2×M+1)×MCL number of CB The number of FPUs is M, the number of Transfer Units (T) is also M; MCL is a maximum connection length if we consider FPUs only => ½ CB – 2×M T2 – (M+4×MCL+2) CB – (2×MCL+1) ×(4×M-1) CB: 351 JJs ½ CB: 216 JJs Reduction of the number of Josephson junctions is essential! * T2 is a 2-bit shift register CB T2 CB T2 T2 FPU FPU ½CB CB CB CB CB T T CB CB ½CB T2 “–”: CB CB ½CB CB FPU FPU CB CB T T CB CB ½CB T2 CB CB ½CB CB FPU FPU CB CB T CB CB ½CB T CB CB ½CB T2 FPU FPU CB CB T CB CB CB ½CB scalable pipelined easily re-designed for any number of N and M CB CB “+”: FPU FPU ½CB T2 T2 T T2 73 Kyushu University 2009 A. Fujimaki, et al., Demonstration of anWAHA SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008. Reconfiguration Mechanism Input-B Input-A Input-C PE Immediate Register (64b) Immediate Register Transfer Unit MUX Initial State FU (A op B) Conf. Reg. [bit] idle End of Execution ・・・ Starting of Reconfiguration Reconfiguration ・・・ ORN ORN Execution log(2x (2MCL+1)) x 3 [b] Immediate PE & ORN reconfiguration structure PE Starting of Execution wait M U X Imm. Reg. Imm. Reg. FU ・・・ M U X TU MUX Imm. Reg. TU ・・・ FU ・・・ ORN MUX TU Imm. Reg. FU ・・・ MUX End of Reconfiguration ORN MUX State Diagram TU FU ORN ・・・ ORN Reconfiguration in LSRDP Kyushu University WAHA 2009 74 LSRDP Design Procedure Start Choosing a design parameter based on the priority of parameters Mapping DFGs onto the LSRDP (placement, routing. I/O positioning) Obtaining required statistics from the mapping results Analyzing the mapping results Choosing the appropriate value of the respective design parameter Kyushu University WAHA 2009 75 10TFLOPS SFQ-RDP computer 4.2 K 2TB memory module (FB-DIMM [DDR3@1333MHz, 128GB] ×16 modules) CMOS CPU (1chip) SFQ 0.5um process ORN ... FPU ORN : : : : SFQ RDP (32FPU×32chips) (4GFLOPS/FPU) ... ORN SFQ Streaming Buffer (64Kb×2chips) ... ORN SB : : : SMAC SMAC ... : 1024FPU@MCM (34chips)×4MCM SMAC Kyushu University Memory band width per MCM:256GB/s (=16GB/s ×16 channels) WAHA 2009 76 Power Consumption Comparison for 10TFLOPS computers MPU Mem. HD CMOS RISC (90 nm) 125 kW 12.5 kW 5 KW SFQ RDP (0.5 um) 3W +3.3 W 250 W 100 W Kyushu University WAHA 2009 Freezer 1 kW Air Conditioning Total 43 kW 186 kW 0.1 kW 1.5 kW 77 Power Consumption / Performance Performance (GFlops) Power Consumption (W) Power Cons./ Perf.(W/GFlops) SFQ-RDP ~10K ~1.5K ~0.15 0.5 um, Whole system SFQ-RDP ~10K ~6.3 (MPU) ~0.63*10^(-3) 0.5 um, MPU GRAPE-DR 512 65 0.12 Chip CSX600 25 10 0.40 ClearSpeed Chip Cell 192 (single) 32 0.17 Inside Chip, SPE core Cell (eDP) 1.33PF(#12960)? 12960*110W? GeForce8800GTX 518 (single) 150 SX-9 CMOS RISC ~10K (90nm) Kyushu University ~186K WAHA 2009 Roadrunner 0.29 Chip 18.3 512 nodes whole system: 15.4MW ~18.6 78
© Copyright 2024 ExpyDoc