Optimizing the Architecture of SFQ-RDP (Single Flux QuantumReconfigurable Datapath) F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue* and K. Murakami* *Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan **Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan E-mail: [email protected] CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum circuits Yokohama National Univ. SFQ-FPU chip, cell library Kyushu Univ. Prof. N. Yoshikawa et al. Nagoya Univ. SFQ-RDP chip, cell library, and wiring Prof. A. Fujimaki et al. SFQ-LSRDP Architecture, Compiler and Applications Prof. K. Murakami Dr. K. Inoue Dr. H. Honda Dr. F. Mehdipour H. Kataoka Nagoya Univ. CAD for logic design and arithmetic circuits Superconducting Research Lab. (SRL) Prof. N. Takagi (Leader) et al. SFQ process Dr. S. Nagasawa et al. Kyushu University SSV 2009 2 Agenda Introduction Large-Scale Reconfigurable Data-Path (LSRDP) General Architecture and Specifications Design Procedure and Tool Chain Preliminary Results Conclusions and Future Work Kyushu University SSV 2009 3 Introduction For performance improvement various accelerators are used with GPPs PowerXcell, GPU, GRAPE-DR, ClearSpeed, etc. Small size and low power consumption comparing to processors with similar performance NVIDIA Tesla S1070 http://www.nvidia.com Kyushu University SSV 2009 4 Acceleration Through a Data-Path Processor Mechanism Acceleration by using a data-path accelerator Augmenting the accelerator to the base processor Executes hot portions of applications on the accelerator Kyushu University SSV 2009 5 How a Reconfigurable Processor Works Non-critical code critical code GPP LSRDP Non-critical code critical code Non-critical code . . . Application code Kyushu University SSV 2009 Main Memory 6 Coupling an Accelerator to a Processor Tight Coupling Processor RFU Tight Coupling Memory Coprocessor Bridge Attached Processor Loose Coupling Kyushu University SSV 2009 7 Motivation Conventional accelerators: A large memory bandwidth is demanded in conventional accelerators for high-performance computation On chip memories are often used to hide memory access latency Large-Scale Reconfigurable Data-Path (LSRDP): • is introduced as an alternative accelerator • reduces the no. of memory accesses by utilizing data-path Kyushu University SSV 2009 8 Outline of Large-Scale Reconfigurable Data-Path (LSRDP) processor LSRDP Reconfigurable data-path includes: A large number of floating point Functional Units (FUs) Arranged as arrays Reconfigurable Operand Routing Network : (ORN) Dynamic reconfiguration facilities FU GPP FU FU ... FU ORN : Operand Routing Network : : : FU FU FU : ... FU ORN FU FU FU ... FU Features: SB : : : ... Data Flow Graphs (DFGs) extracted from critical calculation parts are directly mapped Pipeline execution Burst transfer is used for input /output rearranged data from/to memory : SMAC Main Memory Kyushu University Scratchpad Memory Streaming Buffer (SB) for I/O ports SSV 2009 9 Single-Flux Quantum (SFQ) against CMOS CMOS issues: (if LSRDP has 32x32 FUs) high electric power consumption high heat radiation and difficulties in high-density packing 磁束量子 超伝導ループ Superconductivity loop SFQ Features: Single Flux Quantum ジョセフソン接合 Josephson junction High-speed switching and signal transmission Low power consumption Compact implementation of a system (small area) No cost for latch Suitable for pipeline processing of data stream Serial bit-level processing Kyushu University SSV 2009 10 Goals of the Project Discovering appropriate scientific applications Developing compiler tools Developing performance analyzing tools Designing and Implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuits Kyushu University SSV 2009 11 LSRDP General Architecture and Specifications Parameters Should Be Decided Within the LSRDP Design Procedure Streaming Buffer (SB) • Core structure: a rectangular matrix of PEs ... Operand Routing Network (ORN) PE1 PE2 PE3 Width ... • PE: combination of a Functional Unit (FU) and a data Transfer Unit (TU) Width and Height ? ... Height Maximum Connection Length (MCL) between consecutive rows? (impossible to implement full cross bar) ORN ... ORN . . . . . . . . . ... ORN PEm Layout: FU types (ADD/SUB and MUL)? Reconfiguration mechanism? (PE, ORN, Immediate data) ... Streaming Buffer (SB) Kyushu University • On-chip memory configuration? SSV 2009 13 LSRDP Architecture Processing Elements FU implements basic 64-bit double-precision floating point operations including: ADD, SUB and MUL TU (transfer unit) as a routing resource for transferring data from a row to an inconsecutive row FU FU TU FU TU TU TU FU TU PE including Two components Four functionalities Kyushu University SSV 2009 14 Layout Types- Type I W A M T A M T A M T … A M T A M T Each PE implements ADD/SUB and MUL ORN A M T A M T A M H M : MUL … A M T A M T ORN A M A T T A M T A M T … A M T ADD/SUB A T M MUL TU . : ADD/SUB . T : Transfer Unit . ORN A M T Kyushu University Flexible A M T A M T … A M T A M T but consume a lot of resources SSV 2009 15 Layout Types- Type II (Checkered) W A T M T Each PE implements ADD/SUB or MUL A T … A T M T Each PE implements ADD/SUB or MUL ORN A T M T A T … A T M T … A T MUL M T … A T M T ORN ADD/SUBA T TU M T A T TU . . . H ORN A T Kyushu University M T A T SSV 2009 16 Layout Types- Type III (Striped) Each PE implements ADD/SUB or MUL W M T M T M T … M T M T Each PE implements ADD/SUB or MUL ORN ADD/SUBA T TU A T A T … A T A T … M T MUL M T … A T A T ORN M T M T H M T TU . . . ORN A T Type Kyushu University A T A T II or III, which one is more efficient? SSV 2009 17 Maximum Connection Length (MCL) Connection Length= 0 (i, 0) ... Connection Length= 2 (i,j) (i, j+1) (i, j+2) Longest Connection Length= L ... ... ... ... ORN (i+1, 0) ... (i+1, j) (i+1, j+1) (i+1, j+2) (i+1, j+L) MCL: maximum horizontal distance between two PEs located in two subsequent rows Kyushu University SSV 2009 18 An ORN Structure ORN T FPU FPU T ½CB ½CB ½CB T2 CB CB T2 CB CB CB FPU ½CB CB CB CB T FPU T FPU T FPU ½CB ½CB ½CB ½CB ½CB ½CB CB CB CB CB CB CB T2 CB T FPU T FPU T2 CB CB CB CB CB CB CB CB CB T2 T CB CB CB T2 CB CB T2 CB T2 CB T FPU 2bit shift register ORN is FPU consisted FPU of 2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches T T2 CB CB Kyushu University A. Fujimaki, et al., Demonstration of an SFQ-Based SSV 2009 Accelerator Prototype for a High-Performance Computer,” ASC08,192008. Dynamic Reconfiguration Mechanism Initial State idle End of Execution Starting of Reconfiguration Reconfiguration ORN Execution Immediate PE Starting of Execution wait Kyushu University SSV 2009 End of Reconfiguration 20 Dynamic Reconfiguration Architecture Input-B Input-A Input-C PE Immediate Register (64b) Immediate Register Transfer Unit MUX FU (A op B) Conf. Reg. [bit] ・・・ ORN ・・・ log(2x (2MCL+1)) x 3 [b] Three bit-stream lines for dynamic reconfiguration of: • Immediate registers (64bit) in each PE • Selector bits for muxes selecting the input data of FUs • Cross-bar switches in ORNs Kyushu University SSV 2009 21 Design Procedure and Tool Chain Compiler and Design Flow Application Code Mapping Placement Bit-Stream Generation Hardware-Software Partitioning (manual or automatic) Routing Configurations (for LSRDP) LSRDP Architecture Critical Parts (h/w part) Port positioning Analyzing DFG mapping results DFG Genration Design Phase DFGs DFGs Non-Critical Parts (s/w part) s/w Part Modification Binary Code (for GPP) • DFGs are manually generated from critical parts of applications • DFG mapping results are used for • Analyzing LSRDP architecture statistics • Generating LSRDP configuration bit-streams Kyushu University SSV 2009 23 Benchmark Applications for Design Procedures Finite differential method calculation of 2nd order partial differential equations 1dim-Heat equation 1dim-Vibration equation 2dim-Poisson equation (Heat) (Vibration) (Poisson) Quantum chemistry application Recursive parts of Electron Repulsion Integral calculation (ERI-Rec) Only ADD/SUB and MUL operations are used in the critical calculations of all above applications Kyushu University SSV 2009 24 DFG Extraction- Heat Equation 1-dim. heat equation for T(x,t) T ( x, t ) 2T ( x, t ) A t x 2 (A is const.) T(i-1,j) T(i,j) T(i+1,j) + Calculation by Finite Difference Method (FDM) T ( xi , t j 1 ) D * B * D * T ( xi , t j ) B * T ( xi 1 , t j ) T ( xi 1 , t j ) + T(i,j+1) Basic DFG can be extended to horizontal and vertical directions to make a larger DFG Kyushu University SSV 2009 Basic DFG corresponding to minimum FDM calculation 25 Example of extracted DFGs- Heat Inputs: Outputs: Operations: Immediates: 32 16 721 364 A huge sample DFG (Heat) Kyushu University SSV 2009 26 DFG Distribution for each application # of FUs 1000 ERI-Rec (8 DFGs) Vibration (7) 800 24DFGs Heat (6) 600 400 Poisson (3) 200 0 0 10 20 30 40 50 60 70 # of Inputs DFGs have different qualities in terms of the # of FUs, # of Inputs Kyushu University SSVand 2009 Outputs 27 DFG Classification Class # of FUs RDP-S 128 # of # of Inputs Outputs 19 12 RDP-M 512 19 12 RDP-L 1024 38 24 RDPXL > 1024 64 52 # of DFGs Heat (3) Poi (1) Vib (2) Eri (4) Heat (1) Poi (1) Vib (1) Eri (4) Heat (2) Poi (1) Vib (2) Eri (5) Heat (1) Poi (1) Vib (2) Eri (5) Totally, 24 DFGs are prepared for benchmark Apps. Due to broad range of DFG sizes DFGs are classified as S, M, L, XL with respect to their size and the number of Input/Output nodes => LSRDP designing processes for S, M, L, XL, respectively Kyushu University SSV 2009 28 Mapping DFGs onto LSRDP LSRDP Architecture Description DFG Placing DFG nodes on LSRDP Routing Connections Placing IO nodes Routing Inp/Out Connections Configuration File Kyushu University Longest connections SSV 2009 29 LSRDP Design Procedure DFGs & LSRDP HW constraints Choosing a design parameter For each parameter Mapping DFGs onto the LSRDP Obtaining required statistics Analyzing the mapping results Choosing the appropriate value Kyushu University Appropriate values for all parameters SSV 2009 30 Preliminary Results LSRDP Specifications: Width & Height # of Input ports # of Output ports Width Height LSRDP-S 19 12 16 16 LSRDP-M 19 12 32 16 LSRDP-L 38 24 64 32 LSRDP Dimensions and the number of input/output ports Kyushu University SSV 2009 32 LSRDP Specifications: MCL MCL (Max. Conn. Len.)= 2 (i, j) ... FU T FU T FU T ... ... FU T FU T FU T FU T ... MCL = L No. of Inputs =(2xMCL+1)x2 = 10 (i, j+1) 10 to 3 (i+1, j+1) (i+2, (i+L, j+1) ・・・ j+1) No. of Outputs= 3 FU T FU T FU T LSRDP FU T FU T FU T FU MCL (avg/max) T ... ORN SizeNo of Inps (avg/max), Outs LSRDP-S 4/8 18/34, 3 LSRDP-M 5/9 22/38, 3 LSRDP-L 5/9 22/34, 3 Kyushu University Further MCL optimization needed SSV 2009 33 Analyzing Various LSRDP Layouts Heat Viration Poisson ERI1 ERI2 Layout I II III I II III I II III I II III I II III Size 8x3 8x3 8x4 10x8 10x8 10x11 10x10 10x12 15x18 6x2 9x3 6x2 10x10 10x10 15x8 (Except ERI1 DFG which gives better size for Layout III) Layout II can be used instead of Layout I to obtain a smaller LSRDP Kyushu University SSV 2009 34 LSRDP at One Glance (1/2) Functional units ADD/SUB, MUL Layout Type II (checker pattern) Operations 64-bit floating point Processing structure Pipelined PE structure FU, T, FU+T, T+T LSRDP Size Small Medium Large No. of inp/out ports 19/12 19/12 38/24 Width/Height 16/16 32/16 64/32 16*16*64 32*16*64 64*32*64 16*BSS(ORN) 32* BSS(ORN) 64*BSS(ORN) 16*16* 2 32*16*2 64*32* 2 22 , 3 26 , 3 26 , 3 Conf. bit-stream size Imm. Regs ORNs PEs ORN inputs, outputs Structure Cross-bar switch Conn. Type One-directional Kyushu University SSV 2009 35 LSRDP at One Glance (2/2) Internal memory External memory Reconf. mechanism Kyushu University Type Immediate registers Size and count 64-bit registers, One reg. for each PE Communication mechanism Serial No. of memory modules 16 Date trans. rate 1800Mbps/pin Overall data trans. rate 24 GB/s Mem. to LSRDP bus width 64 bit Channels per module Two Bit serial configuration through a serial chain SSV 2009 36 Preliminary Performance Evaluation Base processor configuration Processor type Out-of-order GPP operating frequency 3.2GHz Inst. issue width 4 instruction/cc Inst. decode width 4 instruction/cc Cache configuration L1 data 64KB(128B Entry, 2way, 2cc) L1 instruction 64KB(64B Entry, 1way, 1cc) L2 unified 4MB(128B Entry, 4way, 16cc) Latency of main memory 300cc L2 to main memory Bus width 64 Bytes Freq 800 MHz GPP+LSRDP configuration LSRDP operating frequency 80 GHz Reconfiguration Latency 1cc Latency SPM LSRDP latency 1cc Latency Main Memory SPM 7500cc Bandwidth SPMLSRDP Max. 64 * 8 Bytes/cc Bandwidth Main Memory SPM 102.4GB/sec GPP: Exec. time measurement by means of a processor simulator Kyushu University SSV 2009 LSRDP: Estimation by performance modeling 37 Preliminary Performance Evaluation (Heat) Normalized by GPP Exec. Time 0.4 0.35 Re c o n f. 0.3 Comm. Re ar r an ge 0.25 St all 0.2 LSRDP GPP 0.15 Basic: SB only Reuse: SB + SPM 0.1 0.05 0 Basic Reuse Heat (M) Basic Reuse Heat (L) Data reusing is employed to avoid the need for data rearrangement as well as frequently data retrieval from the scratchpad memory. Kyushu University SSV 2009 38 Preliminary Performance Evaluation (Poisson) Normalized by GPP Exec. Time 0.4 0.35 Re c o n f. 0.3 Comm. Re ar r an ge 0.25 St all LSRDP 0.2 GPP 0.15 0.1 0.05 0 poisson(S) poisson(M) poisson(L) A small fraction is related to processing time on LSRDP and the main fraction concerns to various overhead times as well Kyushu University SSV as 2009the execution time on GPP 39 Conclusions & Future Work A high-performance computer comprising an accelerator (LSRDP) implemented by superconducting circuits was introduced. 24 benchmark Data Flow Graphs (DFGs) were manually generated. LSRDP micro-architecture is designed based on characteristics of scientific applications via a quantitative approach. LSRDP is promising for resolving issues originated from CMOS technology as well as achieving considerable performances. Future Work: •To achieve higher performance it is required to reduce various overhead costs mainly related to data management part. •To reduce the implementation cost of LSRDP, we will focus on reducing maximum connection length and ORN size. Kyushu University SSV 2009 40 Acknowledgement This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST). Kyushu University SSV 2009 41 2x3 RDP processor prototype ALU Controller SR_IN 8-bit ALUs implementing: SR_OUT ORN1 ALU1 ORN3 ALU3 ORN5 ALU5 ORN2 ALU2 ORN4 ALU4 ORN6 ALU6 ADD, SUB, AND, OR, XOR 25GHz Frequency 6-bit Data transfer shift registers 16-bit I/O shift registers 21 Pipeline stages 7-bit Data width Area: 6.84 x 6.72 mm2 Total number of Junctions:14040JJs Bias current: 1.652A Kyushu University SSV 2009 Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008 42
© Copyright 2024 ExpyDoc