Reconfigurable Architectures
AMANO, Hideharu
hunga@am.ics.keio.ac.jp

Reconfigurable System (Custom Computing Machine)
A target algorithm is executed directly in hardware on SRAM-style FPGAs/PLDs.
High performance of special-purpose machines.
High degree of flexibility of general-purpose machines.
An execution mechanism completely different from that of stored-program computers.

PLD (Programmable Logic Device)
An integrated circuit whose logic function can be defined by users.
Standard IC, ASIC (Application Specific IC)
SPLD (Simple PLD) / PLA (Programmable Logic Array): small-scale IC with an AND-OR array
CPLD (Complex PLD): middle-scale IC with an AND-OR array
FPGA (Field Programmable Gate Array): large-scale IC with LUTs
Caution! The terms are not well defined!

Rapid development of PLDs
Gate count and performance are increasing.
From 1991 to 2000: gate count x45, speed x12, cost 1/100.
[Figure: gate count (10K to 10M) versus year (1980 to 2000), showing fuse PLAs, EEPROM SPLDs, CPLDs, anti-fuse FPGAs, and SRAM FPGAs, with trends toward hierarchical structure, embedded cores, and low voltage.]

SPLD (Simple PLD: AND-OR / product-term)
Arbitrary logic is realized by changing the AND-OR connections.
[Figure: NOT, AND, and OR stages with programmable connections.]

AND/OR connection example
Inputs A, B, C, D; output A&B | C&D.
[Figure: the AND stage forms A&B and C&D; the OR stage combines them.]

LUT: Look Up Table
A simple ROM/RAM can be used as random logic: the inputs are the address and the stored data is the output.
A combination of memory and multiplexers is commonly used.
ABC  Z
000  0
001  0
010  0
011  1
100  0
101  0
110  0
111  1
[Figure: memory cells holding the Z column, selected by multiplexers controlled by A, B, and C.]

An example using a LUT
[Figure: the same table and multiplexer tree with the selection path marked for one input combination.]

Devices for flexibility (1)
Anti-fuse type
  Programmed by breaking down the insulation with a high voltage.
  High speed, but one-time programmable.
  Actel, QuickLogic
EEPROM / Flash ROM
  Switches for connections are realized with floating gates.
  Re-programmable.
  Lattice, Altera's MAX series

Devices for flexibility (2)
SRAM
  Data in SRAM represents the look-up tables and the wire connections.
  ISP (In-System Programming) is available.
  The configuration data is erased when the power is turned off.
  Suitable for large-scale FPGAs; advancing rapidly in recent years.
  Xilinx XC, Altera FLEX, Lucent ORCA
  The advanced series: Xilinx Virtex, Altera APEX
Others
  Magnetic memory
  DRAM

AND-OR array vs. LUT
AND-OR array (product-term)
  Efficient for logic with multiple outputs.
  Some types of logic cannot be realized.
  Suitable for EEPROM and Flash ROM.
LUT
  Any logic can be realized.
  Efficient for logic with a single output.
  Suitable for Flash ROM, anti-fuse, and SRAM.

Sequential circuits
A sequential circuit (state machine) can be built by attaching flip-flops and feedback loops.
[Figure: an AND-OR array or LUT whose outputs pass through D flip-flops in an output module, with feedback to the inputs.]

CPLD (Complex PLD)
A matrix of SPLDs connected through programmable switches.
Altera's MAX; 2-dimensional array
[Figure: SPLD blocks arranged around a programmable switch.]

FPGA (Field Programmable Gate Array)
Island style.
The LUT contents and the interconnections are determined by configuration data.
[Figure: configurable logic blocks (LUT and F.F.), connection blocks, switch blocks, and IOBs.]

Architectures and devices
Architectures: SPLD, CPLD, FPGA. Devices: anti-fuse, EEPROM, Flash ROM, SRAM.
Anti-fuse: high speed, middle size, one-time programmable. Actel, QuickLogic
EEPROM / Flash ROM: high speed, small/middle size, re-programmable, predictable delay. Lattice, Altera, Xilinx
SRAM: large scale, rapid development. Xilinx, Altera

Recent PLDs
High-end: a large-scale chip with hierarchical structure (System on Programmable Device)
  Provides DLL, CPU, DSP, ROM, RAM, multipliers, high-speed links, and other hard IPs.
  Xilinx's Virtex-4/EX, FX; Altera's Stratix-3
Specialized for mass production
  Xilinx's Virtex II, Virtex-4/LX; Altera's Stratix-3
Low cost: Xilinx's Spartan, Altera's Cyclone
Low voltage, multiple voltages, and low power consumption

Process and parameters (Xilinx)
Process  Product        Device       LUTs             Supply voltage
350nm    XC4000         XC4085KLA    7448             3.3V
250nm    XC4000         XC40250KV    20102            2.5V
220nm    Virtex         XCV1000      27648            2.5V
180nm    Virtex-E       XCV2000E     43200            1.8V
150nm    Virtex-II      XC2V8000     104882           1.5V
130nm    Virtex-II Pro  XC2VP125     125136           1.5V
90nm     Virtex-4       XC4VLX200    200488           1.2V
65nm     Virtex-5       XC5VLX330    51840 slices     1.0V
40nm     Virtex-6       XC6VLX760    118560 slices    1.0V
28nm     Virtex-7       XC7VX1140T   1139200 slices   0.9V

Xilinx Virtex II
Slice: LUTs, carry logic, and D flip-flops. Slice x 2 → CLB (Configurable Logic Block)
[Figure: chip layout with global clock MUX, DCM, IOBs, configurable logic, RAM, multipliers, and programmable IOs. Annotations: 100000 CLBs, 3Mbit.]

Altera Stratix II
LAB (Logic Array Block): consists of 10 LEs (4-input LUT and F.F.)
Hierarchical interconnect
[Figure: chip layout with DSP blocks, PLLs, Mega RAM blocks, M4K RAM blocks, and M512 RAM blocks.]

SoPD (System on Programmable Device)
Various kinds of cores are embedded on an FPGA.
Xilinx Virtex-II Pro
[Figure: DCM, Rocket I/O multi-gigabit transceivers, PowerPC, multipliers, block RAM, and CLBs.]

FPGA vs. ASIC [Kuon:FPGA2006]
Pure FPGA without hard macros: area 40X, speed 1/3.2X, power 12X
FPGA with hard macros: area 21X, speed 1/2.1X, power 9X

Spartan-3 power consumption [tuan2006]
Dynamic power: about 200mW (3S1000); components: clock, logic, routing
Static power: about 60mW (3S1000); components: logic, routing, configuration SRAM

Recent technologies and products
Process nodes shown: 90nm, 65nm, 60nm, 45nm, 40nm
High-end
  Virtex-4 LX/FX/SX: 200000 LC
  Stratix-II/GX: 179400 LE
  Virtex-5 LX/LXT/SXT/FXT/TXT: 330000 LC
  Stratix-III/L/E: 338000 LE
  Virtex-6 LXT/SXT/HXT/CXT: 760000 LC
  Stratix-IV/E/GX/GT: 531200 LE
  Growth: x1.5 to x2.5 per generation
Low-cost
  Extended Spartan-3A N/DSP: 53000 LC
  Spartan-6 LX/LXT: 150000 LC
  Cyclone II: 68416 LE
  Cyclone III/LS: 119088 LE
  Cyclone IV/E/GX: 149760 LE
High-end / low-cost ratio: x3 to x5

Slice structure of Virtex-6
[Figure, from the Virtex-6 manual: four 6bit LUTs (each usable as 6-in x 1 or 5-in x 2), each followed by carry logic, MUXes, and two FFs.]

Virtex-6 CLBs
[Figure: CLBs containing slices (X1Y0, X1Y1, X2Y1, X3Y1, ...), with carry chains running from CIN to COUT.]
[Figure, continued: CLB slices X0Y0, X1Y0, X2Y0, X3Y0. From the Virtex-6 manual.]

Stratix-IV ALM structure
[Figure: two 6-input LUTs with 4bit data inputs each, adder1 and adder2, MUXes, and FFs, linked by carry, reg_carry, and shared_arith chains.]
LUT configurations:
  4-in LUT x 2
  5-in LUT + 3-in LUT
  5-in LUT + 4-in LUT (1 input shared)
  5-in LUT + 5-in LUT (2 inputs shared)
  6-in LUT
  6-in LUT + 6-in LUT (4 inputs shared)

Stratix-IV LAB structure
[Figure: ALMs, a LAB with its local interconnect, and an MLAB with its local interconnect.]

Power gating for Spartan-3
[Figure: FPGA core divided into tiles; a tile holds a CLB, an interconnect switch matrix, and their configuration SRAMs, and connects to a virtual ground through a power gate.]

Low-power FPGAs
Actel: ProASIC 3/E → IGLOO
  Flash ROM
  IGLOO: with ARM core
  Flash freeze: low-power stand-by mode (2μW)
Silicon Blue: ICE65 series
  Embedded flash memory (NVCM)
  5mA (1792 cells, 32MHz), 9mA (3520 cells, 32MHz)
Altera: Arria, Arria-II
  Low-power mid-range
  8-input LUT
QuickLogic
[Figure: example devices: Lattice GAL, Altera FLEX10K, Xilinx Virtex, QuickLogic.]

Design of PLDs
Mostly designed with common HDLs (Verilog-HDL, VHDL).
C-level entry has come into use recently: Handel-C (Celoxica).
Synthesis, optimization, and place and route are done automatically by vendors' tools.
Tools from various vendors are now integrated and combined.
For a large circuit, a long time is required, especially for place and route.
Use of IPs and clock/DLL adjustment are done manually.
Optimization techniques differ among vendors and products.

Reconfigurable System (Custom Computing Machine)
A target algorithm is executed directly in hardware on SRAM-style FPGAs/PLDs.
High performance of special-purpose machines.
High degree of flexibility of general-purpose machines.
An execution mechanism completely different from that of stored-program computers.
[Figure: performance versus flexibility. ASICs (Design A, Design B, Design C, Design D): high performance. CPUs running software (for i=0; i<K; i++ X[i]=X[i+j] .....): high flexibility. Reconfigurable systems (FPGAs): high performance and flexibility.]

How is the performance enhanced?
Performance enhancement by hardware execution itself
Hardware execution removes:
  The overhead of software execution (instruction fetch, loading data into registers, etc.).
  The overhead of using fixed-size data.
  The overhead of using only two-way branches.
However, these benefits are not so large, because embedded CPUs and DSPs are highly optimized.
The key to performance improvement is parallel processing.

Parallel processing in reconfigurable systems
Various techniques can be used:
  SIMD execution
  Pipelined structure
  Systolic algorithm
  Data-driven control
Parallel execution other than calculation:
  Parallel data access using internal memory units
  Parallel data transfer, including I/O accesses

SIMD (Single Instruction-stream / Multiple Data-stream)-like calculation
[Figure: stream data in → processing parts with internal memory modules → stream data out.]
The same instruction is applied to different data streams.
In reconfigurable systems, the operations are not required to be the same (SIMD-like calculation).

Pipelined structure
[Figure: stream data 1 to 5 passing through successive processing parts with internal memory modules.]
The stream is divided and inserted periodically.

Systolic algorithm
[Figure: data x and data y entering a computational array.]
Data streams x and y are inserted at a certain interval.
When the two streams meet each other, a calculation is executed.
→ Systolic: the beat of the heart

Band matrix multiply y=Ax
$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & 0 & 0 \\ a_{21} & a_{22} & a_{23} & 0 \\ 0 & a_{32} & a_{33} & a_{34} \\ 0 & 0 & a_{43} & a_{44} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}$$
Each cell takes a, x, and an incoming partial sum $y_i$, and outputs $y_o = a x + y_i$.

Band matrix multiply y=Ax, step by step
[Figures: successive snapshots of the array of multiply-add cells as the elements of A and of x move through it.]
Step 1: $x_1$ enters the array; $a_{11}, a_{21}, a_{12}, a_{22}, a_{32}, a_{23}$ are queued at the cells.
Step 2: $y_1 = a_{11} x_1$
Step 3: $y_1 = a_{11} x_1 + a_{12} x_2$; $y_2 = a_{21} x_1$
Step 4: $y_2 = a_{21} x_1 + a_{22} x_2$
Step 5: $y_2 = a_{21} x_1 + a_{22} x_2 + a_{23} x_3$; $y_3 = a_{32} x_2$

Data flow algorithm
A process is activated when its tokens (data) become available.
Example: (a+b)x(c+(dxe))
The overhead of synchronization is large.
[Figure: data flow graph with inputs a, b, c, d, e feeding + and x nodes.]

Data flow analysis and hardware generation
Suitable for automatic generation of hardware.
[Figure: tool flow involving a data flow language, a data flow graph, graph decomposition, an HDL description, and configuration data.]

Applications
Suitable when no flexible program change is needed, IEEE-standard floating point is not needed, and the problem is not memory bound.
  Image processing, analysis, pattern matching
  Logic simulation, fault simulation
  Neural network simulation
  Encryption / decryption
  Queuing models, Markov analysis
  Electric power flow
  Sensor processing
Efficient use of on-the-fly processing:
  Communication control, protocol control
  Software radio

Large-scale reconfigurable systems
Stand-alone: SPLASH, RASH, BEE2
  [Figure: a μP and many RUs on an interconnection / shared memory.]
Hetero nodes using homo cores: SRC6, SGI RASC
  [Figure: nodes of μPs and nodes of RUs on an interconnection / shared memory.]
Homo nodes using hetero cores: Cray XD-1, XT4 (XR-1)
  [Figure: nodes each combining μPs and an RU on an interconnection / shared memory.]

Splash-2 (Arnold et al. 92)
String matching, image processing, DNA matching
330 times faster than the Cray-II supercomputer.
Systolic algorithm
VHDL, Parallel C
Annapolis Micro Systems (WILDFIRE)

RM-IV (Kobe Univ.)
Shared memory with multiple FPGAs
[Figure: FPGAs, each with its own memory, connected through an FPIC and an interface.]

RASH (Mitsubishi)
6 boards make up one system unit.
[Figure: EXE board with PCI-bus I/F, PCI local bus, controller, FPGAs connected by mesh + bus, SRAM (2MB), 2 clock lines, clock/control signals, a large SRAM, and a DRAM daughter board.]
FPGA: Altera FLEX10K100A (62K-158K gates) → changed to Stratix, then to Virtex II

ATTRACTOR (NTT)
Combination of various units.
Special-purpose system for an ATM cut-through router.
[Figure: ATM I/O, ATM SW, FPGA with RAM (LUT), FPGA with buffer, RISC units, Ethernet, and MPU with memory, linked by high-speed serial links (1Gbps) on Compact PCI.]

CRAY-XD1
• AMD Opteron
• One board consists of 2 CPUs + an FPGA (Virtex II Pro).
• One rack holds 6 boards.
• A high-speed network called Rapid Array is used.
• FPGAs can be interconnected with Rocket I/O.

SGI RASC
• Accelerator for SGI's NUMA Altix
• Virtex II XC2V6000 and another Virtex for control
• Directly connected to the controller with NUMAlink4

ReCSiP (Keio Univ.)
Accelerator for bioinformatics
Powerful facility for simultaneous access to external RAMs
[Figure: ReCSiP board. Virtex-II XC2V6000 with 64MB SDRAM, 4MB SSRAM, a local clock generator, and a 64bit local bus; configuration via USB; configuration control via PCI; QuickPCI on a 64bit/66MHz PCI bus.]

ReCSiP-2
[Figure: Virtex-II Pro (XC2VP70) with QDR-SRAM (4MB x 4), DDR-SDRAM SO-DIMM, and PCI-IF (QL5064).]

ReCSiP: automatic generation of solvers
HDL for the solvers is generated.
[Figure: CAD tools (optimizer, scheduler) take an SBML description and a solver library and produce a solver set, with control between solvers, for the FPGA board.]

Dynamically Reconfigurable Processors
Coarse-grain structure
Parallel processing → reconfigurable processor array
Dedicated to stream processing
High-speed dynamic reconfiguration
Distributed memory
Multicontext
Multicast/broadcast of configuration data inside the chip
On-line configuration
C-based design

Short history of Dynamically Reconfigurable Processors
[Figure: timeline from 1990 to 2005, with the 1st generation followed by the 2nd generation.]
FPGAs with dynamic reconfiguration: MPLD (Fujitsu), WASMII (Keio), Time Multiplexed FPGA (Xilinx)
Processors with reconfigurable instructions: GARP (UCB), CHIMAERA (Northwestern Univ.), DISC (Brigham Young Univ.)
Also on the timeline: DFabric (Elixent), DAPDNA/2 (IPFlex), DAPDNA/IMX (IPFlex), Xpp (PACT), DRL (NEC), CS2112 (Chameleon), FE-GA (Hitachi), DRP (NEC Electronics), X-bridge (NEC Electronics), PipeRench (CMU), Kilocore (Rapport), S-5 (Stretch), S-6 (Stretch)
A lot of commercial systems

Dynamically Reconfigurable Processors
Product         Vendor           Context       Data width (bits)  PE
D-Fabric        Panasonic        Deliver       4                  Homo
Xpp             PACT             Deliver       24                 Homo
S5/S6 engine    Stretch          Deliver       4/8                Hetero
CS2112          Chameleon        Multi-C(8)    16/32              Homo
DAPDNA-2        IPFlex           Multi-C(4)    32                 Hetero
DRP-1           NEC Electronics  Multi-C(16)   8                  Homo
X-bridge        NEC Electronics  Multi-C(32)   8                  Homo
Kilocore        Rapport          Multi-C       8                  Homo
ADRES           IMEC             Multi-C(32)   16                 Homo
FE-GA           Hitachi          Multi-C       16                 Hetero
For car tuners  SANYO            Multi-C(4)    24                 Homo
Cluster         Fujitsu          Multi-C       16                 Hetero

Coarse-grain structure of PE
Examples: Kress Array II, Chameleon CS2112
[Figure: PE datapath with routing MUXes, register & mask units, barrel shifter, OP unit, instruction input, and registers.]

Xpp (PACT Informationstechnologie)
PAC: Processing Array Cluster
CM: Configuration Manager
SCM: Supervising CM
Xpp64 (8x8 PAC) is available.
Configuration requires hundreds of clocks.
The PAE is 24 bits wide; the clock frequency is 40MHz.
[Figure: four PACs of PAEs, each with I/O and a CM, under one SCM.]

Panasonic (Elixent) DFA1000
[Figure: array of 4bit ALUs and registers (R) joined by RAM-based switch boxes, with a configuration controller.]

Multicontext structure
A PE provides multiple sets of configuration RAM.
Context switching can be done in one clock.
[Figure: context memory with SRAM slots 1 to n; a context pointer drives a multiplexer that selects the configuration for the logic cells (PEs or switches) between input data and output data.]

Chameleon CS2112
[Figure: RISC core, memory controller (64-bit memory bus), PCI controller (32-bit PCI bus), configuration subsystem, and DMA subsystem on a 128-bit RoadRunner bus, feeding the reconfigurable processing fabric with 160-pin programmable I/O.]

Reconfigurable processing fabric in Chameleon
[Figure: LM, DPU, and CTL blocks.]
8 instructions stored in the CTL are executed in the DPU.
The CTL can select the next instruction in the same cycle.
The configuration can be changed by loading a bit stream.
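
The multicontext idea above (several stored configurations per PE, one of which is selected by a context pointer) can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `MulticontextPE`, its methods, and the choice of two-input arithmetic operations as "configurations" are hypothetical and do not describe any of the products listed.

```python
import operator


class MulticontextPE:
    """Toy processing element holding several configurations (contexts)."""

    def __init__(self, contexts):
        # Context memory: one two-input operation per slot (hypothetical model).
        self.context_memory = list(contexts)
        self.context_pointer = 0

    def switch_context(self, slot):
        # Modeled as a pointer update only; no configuration data is reloaded.
        if not 0 <= slot < len(self.context_memory):
            raise IndexError("no such context slot")
        self.context_pointer = slot

    def execute(self, a, b):
        # The pointer selects which stored configuration drives the logic.
        return self.context_memory[self.context_pointer](a, b)


pe = MulticontextPE([operator.add, operator.sub, operator.mul])
results = []
for slot in (0, 2, 1):
    pe.switch_context(slot)
    results.append(pe.execute(6, 3))
print(results)  # [9, 18, 3]
```

Because switching only moves the pointer, its cost does not depend on the size of a configuration; loading a new bit stream into a slot would be a separate and slower operation, which this sketch does not model.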
[Figure: fabric from Slice 0 to Slice 3, each starting with Tile 0.]
The 108 DPUs (Data Path Units) form 4 slices (3 tiles each).
1 tile: 9 DPUs = 32bit ALU x 7, 16bit + 16bit multiplier x 2

IPFlex DAP/DNA-2
DNA matrix: 368 PEs (ALU, memory, delay, etc.), heterogeneous
[Figure: DAP (RISC), DNA matrix with DNA load buffer, DNA store buffer, and DNA direct I/O (async. in / async. out), DDR SDR IF (64bit 166MHz), PCI IF (32bit 66MHz), interrupt controller, timer, SROM IF, GPIO, UART, serial IF, BSU, and DMA controller.]

An example PE structure
[Figure: FFs, shift/mask units, FFs, ALUs, and FFs in sequence.]

DRP (Dynamically Reconfigurable Processor: NEC)
[Figure: chip composed of tiles.]

DRP tile and PE structure
[Figure: array of PEs around a State Transition Controller, with VMEMs (2-port memory), VMEM controllers, and HMEMs (1-port memory) at the edges. Memory sizes shown: 8bit x 8092 entries and 256 entries.]

Context control for DRP
1. Context switching
2. Parallel processing in a context
3. Serial execution in a context
Description in BDL: the DRP compiler controls this 3-dimensional assignment.
[Figure: contexts 0 to 5 between data input and data output.]

Main advantage: low power consumption
Why low power?
1. No redundant hardware
   There are no instruction fetch mechanisms, caches, TLBs, etc.
   → Of course, it cannot be a general-purpose engine, but it is enough for an accelerator.
   A bare datapath works only for computation.
2. Parallel execution with a number of PEs
   A much lower clock frequency can be used to achieve the same performance as other architectures.
   The main problem is leakage power, but it can be suppressed by power gating techniques.
10X more energy efficient than DSPs; 5-50X more than FPGAs.
Sometimes similar to that of hardwired logic.

The main limitations as an accelerator in SoCs
The data must be stored in the memory modules placed around the PE array.
If more contexts are required than the context memory can hold, the operational speed is much degraded.
If the data is larger than the memory, it is hard to handle.
The virtual hardware mechanism is provided, but it has certain limitations.
The performance is not improved much for problems without parallelism.

Dynamically Reconfigurable Processors: the 2nd generation
Customized for a specific target application area
Multi-core structure with small PE arrays rather than one big array
  SANYO car tuner → tuner
  Fujitsu → wireless communication
  Toshiba SAKE → multimedia
  NEC Electronics X-bridge → multimedia
Cooperation with various types of cores
Integrated design environment
Low-power design → the main advantage!

X-bridge: NEC Electronics (2008)
Dynamically reconfigurable core: 512 PEs (8bit), 32 contexts
Provides the virtual hardware mechanism.
The DMA controller hides the communication overhead.
[Figure: MIPS CPU with I-cache and D-cache, INTC, DMA, STP Engine with SPLs and DMAs, work RAM (1kB), two PCIexp HB/EP (1-lane) ports, peripheral I/F, general port 8b x 4, UARTs, CSI, GPIO, JTAG, 10/100 Ether MAC, PCI host/target, and DDR2 SDRAM controller, on a 64bit on-chip bus (266MHz) and an Nconnect 64bit memory switch (266MHz).]
From the invited talk at Design Gaia 2008

Mixture of SIMD and DRP units: Toshiba's SAKE
Dynamically reconfigurable units optimized for stream processing (independently controlled)
[Figure: host processor and system memory connect through the host I/F (code, data) to the code buffer (code RAM) and I/O buffer (data RAM); write control; Formatter0, Formatter1, AUX0, AUX1, and SIMD units around an inter-unit buffer (data registers).]
From the FPT2007 tutorial session

The architecture (Formatter)
Simple hardware
• Pipeline registers only
• No intra-PE data transfer
• PE: 4 cfgs, Xbar: 16 cfgs
• ALU, shift & absolute ops only
16-bit ALU x 8
Suitable for butterfly operations
[Figure: Cfg controller with CodeMem and CfgMem; Xbar In (128) → PEs (data A, data B, 64), with or without Shuffle → Xbar Out.]
From the FPT2007 tutorial session
Xbar In: Formatter0 only; XBar
Out: Formatter1 only

SANYO's car tuner DRP
[Figure: ALU array between In and Out with a feedback path, a main memory, and a sequencer with command memory.]

Pipelined execution of 4 threads
[Figure: schedule of threads Th1 to Th4 over the ALU array rows L1 to L4, step by step (Th1-1, Th2-1, Th3-1, Th4-1, Th1-2, ...).]

Fine carrier frequency offset estimation/correction
a) Fine carrier frequency offset estimation for LT1
b) Fine carrier frequency offset estimation for LT2
c) Fine carrier frequency offset correction for SIGNAL and DATA
[Figure: mapping of the three phases onto Cluster0 to Cluster6, with I/Q inputs, self-correlation, DIV, ATAN, phase offset calculation, correction offset calculation in phase/polar form, complex multiply, data out control & clip, and output to FFT.]

Hitachi's FE-GA
[Figure: computational cell array of ALU and MLT cells; load/store (LS) cells connected through a crossbar network to local memories (MEM); sequence manager (interrupt/DMA request), configuration manager, I/O port, and bus interface.]

Heterogeneous multi-core using FE-GA
[Figure: CPU0 to CPU3 (SH-4) and DRP0 to DRP3 (FE-GA), each with DTU, LPM, LDM, FVR, DSM, and a network interface, sharing an on-chip CSM.]
The code is generated by a parallelizing compiler and standard APIs.
Examples
High-speed configuration
  PACT Xpp
  Elixent DFA
  NTT PCA
Multicontext
  Chameleon CS2112
  IPFlex DAP/DNA
  NEC DRP
  CMU PipeRench
  SONY Virtual Mobile Engine (embedded in the PSP)

Multicontext-style Dynamically Reconfigurable Processors
[Figure: design space with axes granularity (up to 32bit), parallelism (up to 1000), and time multiplexing (up to many contexts), placing FPGA, PipeRench, DRL, DRP, CS2112, DAP/DNA, rDSP, PC101, and ACM against traditional processors (common processor, VLIW, chip multiprocessor).]

Dynamically Reconfigurable Processors
Coarse-grain architecture, somewhat like on-chip multiprocessors and somewhat like FPGAs.
Rapid development since 2001.
No killer application has been found (Chameleon's failure).
High-level language development environments have not been well established.
A lot of competitors:
  High-performance embedded processors
  Chip multiprocessors
  Application-specific configurable processors
  DSPs
  Standard FPGAs/CPLDs
  System on Chip

Open problems
What is the difference between a program and configuration data?
  Reconfigurable processor array = a VLIW machine with extremely large instructions (configuration data)
How frequently should the configuration change?
  Context switching on every clock is not advantageous from the viewpoint of power consumption.
  However, if the configuration is rarely switched, the dynamic reconfiguration function is useless.
How should the grain size of a processing element be decided?
  Are 8-32bit calculators the correct solution?
  Or are they only a way to escape Xilinx's patents?
How should calculators and controllers be balanced?
  Since DRP focuses on calculators, it is difficult to implement complicated control.
  Is the node balance of ACM correct?

Summary
A computing system different from stored-program computers.
Not a perfect replacement for stored-program computers.
Advances in semiconductor technology directly enhance the performance.
A lot of problems and research subjects remain.
Historical flow of computer systems
[Figure: ENIAC; EDVAC, EDSAC; IBM machines; Reconfigurable Machine; RISC, Intel's microprocessors.]

Exercise
A systolic array multiplies an 8 x 8 tri-diagonal matrix A by a vector x of size 8.
Compute the number of clock cycles required for the multiplication.
Here, the time at which the first element of x reaches the left-most cell of the array is taken as time 0.
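
For checking a hand-derived schedule, the linear systolic array from the band-matrix slides can be modeled cycle by cycle. The sketch below is an illustration under assumptions that are not taken from the slides: the function name is hypothetical, the array has 3 cells, x moves left to right and the partial sums of y move right to left by one cell per cycle, and a new x element and a new zero partial sum are injected on every second cycle. The timing convention of the original figures may differ.

```python
def systolic_tridiagonal_matvec(A, x):
    """Model y = A x for a tri-diagonal A on a 3-cell linear systolic array.

    Assumed timing: x[k] enters the left-most cell and a zero partial sum
    for y[k] enters the right-most cell on every second cycle.
    """
    cells = 3
    n = len(x)
    x_regs = [None] * cells   # (column index j, value), moving left to right
    y_regs = [None] * cells   # (row index i, partial sum), moving right to left
    y = [None] * n
    cycle = 0
    while any(v is None for v in y):
        # A finished partial sum leaves the left-most cell.
        leaving = y_regs[0]
        if leaving is not None:
            y[leaving[0]] = leaving[1]
        # Shift both streams by one cell in opposite directions.
        y_regs = y_regs[1:] + [None]
        x_regs = [None] + x_regs[:-1]
        # Inject a new pair on every second cycle.
        if cycle % 2 == 0 and cycle // 2 < n:
            k = cycle // 2
            x_regs[0] = (k, x[k])
            y_regs[-1] = (k, 0)
        # Where the two streams meet, the cell computes y_out = a * x + y_in.
        for c in range(cells):
            if x_regs[c] is not None and y_regs[c] is not None:
                j, xv = x_regs[c]
                i, yv = y_regs[c]
                y_regs[c] = (i, yv + A[i][j] * xv)
        cycle += 1
    return y


A = [[1, 2, 0, 0],
     [3, 4, 5, 0],
     [0, 6, 7, 8],
     [0, 0, 9, 10]]
x = [1, 2, 3, 4]
print(systolic_tridiagonal_matvec(A, x))  # [5, 26, 65, 67]
```

With this injection pattern, the partial sum for row i meets x[i-1], x[i], and x[i+1] in that order, which matches the accumulation order shown in the step-by-step slides. Only the three diagonals of A are ever read, so the model is valid for tri-diagonal matrices only.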