Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator F. Mehdipour, Hiroaki Honda*, H. Kataoka, K. Inoue and K. Murakami Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan *Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan E-mail: [email protected] Agenda Introduction SFQ-LSRDP General Architecture The Design Procedure and Tool Chain Input/ Output Nodes Placement Area Minimization Experimental Results Conclusions 2009/01/29 CREST-JST SFQ-RDP Project (2006~): A Low-power, high-performance reconfigurable processor based on single-flux quantum circuits Yokohama National Univ. SFQ-FPU chip, cell library Prof. N. Yoshikawa et al. Nagoya Univ. Kyushu Univ. Architecture, Compiler and Applications SFQ-RDP chip, cell library, and wiring Prof. A. Fujimaki et al. Nagoya Univ. CAD for logic design and arithmetic circuits Prof. N. Takagi (Leader) et al. Prof. K. Murakami et al. SFQ-LSRDP Superconducting Research Lab. (SRL) SFQ process Dr. S. Nagasawa et al. 2009/01/29 Goals Discovering appropriate scientific applications Developing compiler tools Developing performance analyzing tools Designing and Implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuits 2009/01/29 How a reconfigurable processor works Non-critical code Computation-intensive (critical) code Non-critical code GPP LSRDP PE ... PE LSRDP PE ... PE PE Computation-intensive (critical) code Non-critical code . . . Application code PE ORN … PE PE ORN PE Main Memory PE PE ... PE 2009/01/29 Single-flux quantum (SFQ) against CMOS CMOS main issues in implementing a large accelerator: High electric power consumption High heat radiation Difficulties in high-density packing 磁束量子 超伝導ループ Superconductivity loop Single Flux Quantum SFQ Features: High-speed switching and signal transmission Low power consumption Compact implementation (smaller area) Suitable for pipeline processing of data stream ジョセフソン接合 Josephson junction 2009/01/29 Outline of large-scale reconfigurable data-path (LSRDP) processor LSRDP PE GPP PE PE ... PE ORN : Operand Routing Network : : : PE PE PE : ... PE ... PE Reconfigurable data-path components: A matrix of large number of floatingpoint Functional Units (FUs) Reconfigurable Operand Routing Network : (ORN) Dynamic reconfiguration facilities Streaming Buffer (SB) for I/O ports ORN PE PE PE : : ... SMAC Main Memory Scratchpad Memory Features: Handling data flow graphs (DFGs) extracted from scientific applications Pipeline execution Burst transfer of input /output rearranged data from/to memory Reduced no. of memory accesses SB : : (alleviating the memory wall problem) 2009/01/29 SFQ-LSRDP General Architecture LSRDP architecture Input ports Processing Elements FU (Functional Unit): implements basic 64-bit double-precision floating point operations including: ADD/SUB and MUL TU(transfer unit): as a routing resource for transferring data b/w inconsecutive rows 7 4 MUL TU 15 FU TU FU FU TU PE including two components Node 15 13 12 TU TU FU TU Four functionalities Output ports 2009/01/29 PE structures FU - FU TU - FU - - FU TU - - TU TU PE Basic arch. 3-inps/2-outs - TU FU TU - FU TU TU TU TU FU TU PE arch. I TU FU TU 4-inps/3-outs TU TU FU TU TU TU FU TU TU - PE arch. II 3-inps/3-outs TU TU-TU TU FU TU 2009/01/29 Layout types- Type I W A M T A M T A M T … A M T A M T Each PE implements ADD/SUB and MUL ORN A M T A M T A M H M … A M T A M T ORN : MUL A M A T T A M T A M T … A M T ADD/SUB A T M MUL TU . : ADD/SUB . T : Transfer Unit . ORN A M T A M T A M T … A M T A M T Flexible but consumes a lot of resources 2009/01/29 Layout types- Type II W A T M T Each PE implements ADD/SUB or MUL A T … A T M T Each PE implements ADD/SUB or MUL ORN A T M T A T … A T M T … A T MUL M T … A T M T ORN ADD/SUBA T TU M T A T TU . . . H ORN A T M T A T 2009/01/29 Maximum connection length (MCL)Definition Connection Length= 0 (i, 0) ... Connection Length= 2 (i,j) (i, j+1) (i, j+2) Longest Connection Length= L ... ... ... ... ORN (i+1, 0) ... (i+1, j) (i+1, j+1) (i+1, j+2) (i+1, j+L) MCL: maximum horizontal distance b/w two PEs located in two subsequent rows 2009/01/29 An ORN structure ORN T FPU FPU T ½CB ½CB ½CB T2 CB CB T2 CB CB CB FPU ½CB CB CB CB T FPU T FPU T FPU ½CB ½CB ½CB ½CB ½CB ½CB CB CB CB CB CB CB T2 CB T FPU T FPU T2 CB CB CB CB CB CB CB CB CB T2 T CB CB CB T2 CB CB T2 CB T2 CB T FPU 2bit shift register ORN is FPU consisted FPU of 2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches T T2 CB CB A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008. 2009/01/29 Dynamic reconfiguration architecture Initial State Input-B Input-A Input-C PE idle Immediate Register (64b) Immediate Register End of Execution Starting of Reconfiguration Reconfiguration Transfer Unit MUX ORN Execution Immediate FU (A op B) Conf. Reg. [bit] PE Starting of Execution wait ・・・ ORN End of Reconfiguration ・・・ log(2x (2MCL+1)) x 3 [b] Three bit-stream lines for dynamic reconfiguration of: • Immediate registers (64bit) in each PE • Selector bits for muxes selecting the input data of FUs • Cross-bar switches in ORNs 2009/01/29 What should be decided during the design procedure Streaming Buffer (SB) ... Operand Routing Network (ORN) PE1 PE2 PE3 Width ... The number of I/O ports? Maximum Connection Length (MCL)? ORN size and structure? ... Height Width and Height ? ORN ... ORN . . . . . . . . . ... PEm Layout: FU types (ADD/SUB and MUL)? Reconfiguration mechanism? (PE, ORN, Immediate data) ORN ... On-chip memory configuration? Streaming Buffer (SB) 2009/01/29 The Design Procedure and Tool Chain Compiler and design flow Application Code Mapping Placement Bit-Stream Generation Routing Configurations (for LSRDP) LSRDP Architecture Port positioning Analyzing DFG mapping results Design Phase Hardware-Software Partitioning (manual or automatic) Critical Parts (h/w part) DFG Genration DFGs DFGs Non-Critical Parts (s/w part) s/w Part Modification Binary Code (for GPP) • DFGs are manually generated • DFG mapping results are employed for: • Analyzing LSRDP architecture statistics (a quantitative approach) • Generating LSRDP configuration bit-streams 2009/01/29 Benchmark applications Finite differential method calculation of 2nd order partial differential equations 1dim-Heat equation 1dim-Vibration equation 2dim-Poisson equation (Heat) (Vibration) (Poisson) Quantum chemistry application Recursive parts of Electron Repulsion Integral calculation (ERI-Rec) Types of operations in the calculations: ADD/SUB and MUL 2009/01/29 DFG extraction- Heat equation 1-dim. heat equation for T(x,t) T ( x, t ) 2T ( x, t ) A t x 2 (A is const.) T(i-1,j) T(i,j) T(i+1,j) + Calculation by Finite Difference Method (FDM) T ( xi , t j 1 ) D * Basic DFG B * D * T ( xi , t j ) B * T ( xi 1 , t j ) T ( xi 1 , t j ) + T(i,j+1) Basic DFG can be extended to horizontal and vertical directions to make a larger DFG 2009/01/29 A sample DFG - Heat Inputs: Outputs: Operations: Immediates: 32 16 721 364 A sample DFG (Heat) 2009/01/29 DFG mapping flow LSRDP Architecture Description DFG Longest connections MCL= 2 Placing DFG nodes on LSRDP Placing IO nodes Re-placing DFG nodes on LSRDP (considering IO nodes positions) Routing connections Re-palcing output nodes Routing Inp/Out connections Modified Mapping Flow Configuration File 2009/01/29 Placing Input/Output Nodes Fan-out based I/O nodes placement 2: Boundaries of the location with minimum connection length to children Objective: Minimize TCL ni= 1 X= Ci1 ni= 2 Ci1 <= X <= Ci2 ni= 3 X = Ci2 ni>=2 X = Cij, j=2…ni-1 (i, 0) ... (i,j) (i, j+1) (i, j+2) (i, j+3) (i, j+4) ... (i+1, j+4) ... Reconfigurable Routing Network (i+1, 0) ... (i+1, j) (i+1, j+1) (i+1, j+2) (i+1, j+3) ... TCL= |Ci1-X|+ |Ci2X|+…|Ci,ni-X| INP3 3: The best location is on a middle point Reconfigurable Routing Network ... ... ... ... Total Connection Length: INP1 ... 1: There is only one location with minimum connection length to a child ... INP2 ... ni: the number of children of input node i Ci1, Ci2, Ci3, Ci,ni X: location of the input node i ... 2009/01/29 One main reason for the large MCL Inputs Ports are far from each other 2009/01/29 Proximity-factor based placement Proximity factor indicates how far a pair of input ports should be located from each other For a pair of input nodes The larger number of closer descendants, higher proximity factor is assigned Si,j: a set of common descendants for input nodes i and j Dk,i(=Dk,j): distance of common descendant node k to the input nodes i and j (it is equal to ASAP execution level of the node) p2,1 P pn,1 p1, 2 ... pn , 2 ... ... ... p1, n p2 , n p i , j p j ,i if i j 1 if i j kS Dk ,i i, j 2009/01/29 Proximity factor-Example I1 1 4 I2 I3 2 3 S1, 2 4,6,7 5 P( I1 , I 2 ) p1,2 S1,3 7 P( I1 , I 3 ) p1,3 1 1 1 13 2 3 4 12 S 2,3 7 1 1 P ( I 2 , I 3 ) p 2, 3 4 4 6 7 Inputs nodes I1 and I2 should be located closer than I3 2009/01/29 Input nodes placement alg.: Example r 1 C (l ) pij pij i l 1 r 1 C (r ) i l 1 1 i l if C(l)> C(r) l= l+1, L[l]=j else r= r+1, L[r]=j 1 r i Placing the 1st input node with the highest proximity factor … N/2-3 N/2-2 N/2-1 1 N/2+1 N/2+2 N/2+3 … Placing the 2nd input node with the highest proximity factor … N/2-3 N/2-2 2 1 N/2+1 N/2+2 N/2+3 … 2009/01/29 Input ports placement alg.: Example Placing i-th input node … N/2-K … 2 1 3 … l N/2+M … N/2+M … r If C(l)> C(r): … i … 2 1 3 … r l If C(r)> C(l): N/2-K … 2 l 1 3 … i … r 2009/01/29 Area Minimization Estimating the area of a PE Area(FU)= Area(ADD/SUB)= Area(MUL) Area(TU)= Area(MUX)~ 0.1 Area (FU) FU TU FU TU TU PE arch. I PE basic arch Layout I: Area(PE)= 2.1x Area(FU), Layout I: Area(PE)= 2.2x Area(FU), Layout II: Area(PE)= 1.1x Area(FU) Layout II: Area(PE)= 1.2x Area(FU) A B C A FU TU TU TU B C op PE arch. II mux TU sel Layout I: Area(PE)= 2.2x Area(FU) Layout II: Area(PE)= 1.2x Area(FU) 2009/01/29 FU TU Basic arch. 3-inps/2-outs Number of rows = 1.5×W Estimating the ORN area-PE Basic arch. MCL= 1 Number of columns = 4×MCL Area (ORN) = 1.5 x W x (4 x MCL) x Area (CB) W: the no. of the PEs in a RDP row 2009/01/29 Estimating the ORN area-PE arch. I Number of rows = 2×W TU FU TU PE arch. I 4-inps/3-outs MCL= 1 Number of columns = 6×MCL+2 Area (ORN) = 2 x W x (6 x MCL+ 2) x Area (CB) 2009/01/29 Number of rows = 1.5×W Estimating the ORN area-PE arch. II FU TU TU TU PE arch. II 3-inps/3-outs MCL= 2 Number of columns = 4×MCL+1 Area (ORN) = 1.5 x W x (4 x MCL + 1) x Area (CB) 2009/01/29 A modified connection length measurement New measurement technique for the net length src Connection length measurement: initial C.L.= dh dv modified C.L.= dh/ dv dest dh C.L.(previous)= 3 src C.L.(new)=3 dest1 C.L.(previous)= 3 C.L.(new)=1 dest2 2009/01/29 A modified connection length measurement- Example Parent 2 is chosen when C.L. is measured as dh/dv Parent 1 MCL= 1 dh dh/dv dh dh/dv 0, 4 0, 4/3 1, 3 1, 1 2, 2 2, 2/3 3, 1 3,1/3 4, 0 4, 0 0, 4 0, 1 1, 3 0.5, 0.75 2, 2 1, 0.5 3, 1 3/2, 1/4 4, 0 2, 0 is chosen when C.L. is measured as dh MCL= 2 2009/01/29 MCL minimization- Using a MCL threshold A maximum threshold is assumed for the MCL During the placement process: For each CL larger than the threshold, the vertical distance increases as: PE with the min. C.L to the source dv= CL/MCL_Threshold • max permitted length= 2 • dh =3 > max permitted length src dest • dv= 1 dest dv= dv+ [3/2]=dv+1= 2 2009/01/29 Basic placement and routing vs. integrated placement and routing DFG DFG Placing Input Nodes using PF-based alg. Placing Input Nodes LSRDP Architecture Description Placing Operational & Output Nodes LSRDP Architecture Description Placing Output Nodes Routing Nets Final Map Routing IO Nets Basic Placement and Routing Flow Placing Operational Nodes & Routing Nets (node by node) Final Map Routing Output Nets Integrated Placement and Routing Flow 2009/01/29 Experimental Results Specifications of the benchmark DFGs DFG # of nodes # of inputs # of outputs # of pure ops max. inp. nodes fan-out Max. fan-out Heat-8x1 34 6 4 16 2 1 Heat-8x2 60 8 4 32 3 3 Heat-16x2 172 16 12 96 3 3 Poisson-3x3 62 18 1 33 3 2 Vibration-4x2 48 8 4 24 3 2 Vibration-8x2 136 16 12 72 4 4 ERI-1 20 8 3 9 3 1 ERI-2 76 16 9 51 3 3 ERI-3 89 14 9 66 3 6 ERI-4 67 19 1 47 4 3 Max 172 19 12 96 4 6 2009/01/29 Evaluation results for various architecturesMCL and ORN sizes Layout-I S1 S2 PE basic arch. MCL PE arch. I PE arch. II ORN size PE basic arch (overall) PE arch. I x CB PE arch. II 14 8 10 25116 6 3 4 12600 15 9 12 29250 12 4 7 24336 30600 13680 38080 17680 18696 8237 23520 14790 nodes placement S1 S2 Layout-II S1 S2 fan-out based proximity-factor based Connection length measurement lh lhv S2 results in smaller MCL and ORN size for both layout types 2009/01/29 Evaluation results for various architecturesno. of utilized PEs No. of PEs (overall) x PE Layout-I S1 S2 Layout-II S1 S2 PE basic arch 580 683 330 344 PE arch. I PE arch. II 634 627 713 669 384 360 384 384 By using lhv, larger number of RDP rows are utilized larger number of PEs will be employed for S2 2009/01/29 Evaluation results for various architecturesoverall LSRDP area (KJJ) ADD/ SUB MUL ADD/ SUB MUL ADD/ SUB MUL ADD/ SUB MUL ADD/ SUB MUL ADD/ SUB MUL ... ADD/ SUB ORN ADD/ SUB MUL ADD/ SUB MUL ADD/ SUB MUL ADD/ SUB MUL ADD/ SUB MUL ... ADD/ SUB . . . ADD/ SUB MUL ... MUL ... FU TU TU FU TU ORN ADD/ SUB MUL ... ADD/ SUB MUL ... ADD/ SUB MUL . . . . . . ... ADD/ SUB MUL ADD/ SUB MUL ... . . . . . . . . . . . . FU TU TU TU Basic PE arch. PE arch. I PE arch. II 3-inps/2-outs 4-inps/3-outs 3-inps/3-outs ORN ADD/ SUB MUL Layout-I ADD/ SUB MUL ORN . . . MUL ORN ADD/ SUB MUL ORN ADD/ SUB MUL ADD/ SUB MUL Layout-II Layout-I Overall LSRDP PE basic arch Area x (KJJ) PE arch. I PE arch. II S1 S2 36923 48083 35307 34193 35995 31258 Layout-II S1 S2 29200 36190 27266 27040 25031 23451 S2 results in smaller overall area in terms of KJJ for both layout types Layout II results in smaller area PE arch. II gives smaller area 2009/01/29 A sample ORN implementation Block diagram of a high frequency test bench ladder clkin_lfin data_in input shift register clkin_lfout circuit under test output shift register data_out A photograph of a chip with 1-to-3 ORN prototype test bench circuit under test ladder input shift register 5 mm clkin_hf output shift register 2009/01/29 Conclusions SFQ-LSRDP is a basic core of a high-performance low-power computer Data Flow Graphs (DFGs) extracted from scientific applications are mapped on the LSRDP LSRDP micro-architecture is designed based on characteristics of DFGs via a quantitative approach LSRDP is promising for resolving issues originated from CMOS technology as well as achieving remarkable performance Acknowledgement: This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST). 2009/01/29 Thanks for your attention! Any questions?
© Copyright 2024 ExpyDoc