The last lesson: Recent Embedded Architectures
Hideharu Amano

Embedded processors
• Cost/power-centric, with performance tuned for a specific application
• RISC processors
• Compact (shortened) instruction sets are provided
  – ARM (ARM)
  – MIPS (MIPS)
  – SH (Hitachi/Renesas)
• Clock frequency ranges from 60 MHz to 800 MHz, depending on the application
→ Performance was sufficient until the 1990s

MOPS (Million Operations Per Second) for various embedded applications
(Figure: required MOPS for each workload on a logarithmic axis from 10 to 10000.)
• Application areas shown: image processing, voice, music, graphics, communication
• Workloads shown: MPEG2/4 Enc., MPEG2/4 Dec., JPEG Enc./Dec., Dolby Enc./Dec., MP3 Enc./Dec., 5K sentence translation, 100K words identification, 3 dimensional image generation, 2 dimensional image generation, VoIP modem, CDMA modem
→ The performance of a simple RISC processor is not enough

Performance enhancement techniques for CPU
• Using a high clock frequency
• Instruction Level Parallel processing
  – Superscalar
  – Sophisticated branch prediction
  – VLIW (Very Long Instruction Word)
  – Dynamic scheduling of instructions
  – SMT (Simultaneous MultiThreading)
• Thread Level Parallel processing
  – SIMD (Single Instruction stream Multiple Data streams)
  – MIMD (Multiple Instruction streams Multiple Data streams)
  – Chip-multiprocessors
→ Effective for every CPU and, of course, useful for embedded CPUs, but they increase cost and power consumption

Embedded CPU + Hardware Accelerator
(Figure: an embedded CPU, a hardware accelerator, RAM, and I/O connected by an on-chip bus or on-chip network.)
• A hardware accelerator is suitable for achieving high performance in a specific application
→ Various types of architectures exist for embedded processing

Amdahl's Law
$$\text{Total speedup} = \frac{1}{(1 - r) + \dfrac{r}{s}}$$
where $r$ is the ratio of acceleration (the fraction of the work that can be accelerated) and $s$ is the speedup of the accelerated part.
• Suppose the accelerator gives a 100 times acceleration ($s = 100$).
• If the ratio of acceleration is 50% ($r = 0.5$), the total speedup is only $1/(0.5 + 0.005) \approx 1.98$ times.
• Fortunately, the ratio is large in media processing.
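The Amdahl's Law slide above can be checked numerically. The following is a minimal sketch, not part of the original lecture; the function name `total_speedup` and its parameter names are chosen here for illustration only.

```python
def total_speedup(accel_ratio: float, accel_speedup: float) -> float:
    """Amdahl's Law: overall speedup when a fraction of the work is accelerated.

    accel_ratio   -- fraction of the original run time that can be accelerated (0..1)
    accel_speedup -- how many times faster the accelerated part runs (> 0)
    """
    if not 0.0 <= accel_ratio <= 1.0:
        raise ValueError("accel_ratio must be between 0 and 1")
    if accel_speedup <= 0.0:
        raise ValueError("accel_speedup must be positive")
    # The non-accelerated part keeps its original time; the rest shrinks by accel_speedup.
    return 1.0 / ((1.0 - accel_ratio) + accel_ratio / accel_speedup)


if __name__ == "__main__":
    # A 100x accelerator applied to different fractions of the workload.
    for ratio in (0.5, 0.9, 0.99):
        print(f"ratio={ratio:.2f}: total speedup = {total_speedup(ratio, 100.0):.2f}x")
```

With a 100 times accelerator, a 50% ratio gives a total speedup just under 2, which is why the ratio of acceleration matters more than the raw accelerator speed.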
Various embedded architectures
(Figure: architectures arranged from "high performance for a narrow application field" to "high performance for a wide application field".)
• Special purpose processor: dedicated hardware, DSP, stream processor, graphic processor, network processor
• Multiple cores: heterogeneous multiprocessor
• Programmable hardware: FPGA, reconfigurable systems, dynamically reconfigurable processors
• Multiple cores: tile processor, homogeneous chip-multiprocessor
• Configurable processor: special instructions
• General purpose CPU

Hardware/Software Co-design
(Figure: design flow. Specification analysis → system spec. → hardware/software division. Hardware side: hardware spec. → hardware functional synthesis → hardware design. Software side: software spec. → program generation → program. In between: interface generation → interface design. Everything is joined by co-verification into the system design.)
• High level design cost can be reduced.
• Recently, low level design cost has increased.

Configurable Processor / Integrated Platform
• Configurable Processor
  – Hardware accelerators and special purpose processors can be combined as special instructions.
    • ARC (ARC)
    • Xtensa (Tensilica)
    • MeP (Toshiba)
    • Triton (Poseidon Design Systems)
  – Various types of interconnection are possible.
  – Integrated software environment
• Integrated Platform → Standard components
  – UniPhier (Matsushita)

Configurable Processor MeP
(Figure: MeP block diagram. MeP core: 32-bit processor core with configuration options (optional instructions, memory size, interrupt, debugging, ...). MeP modules MM1, MM2, ..., MMn, each with a bus IF. Extension: extended instructions, UCI, DSP, VLIW, ..., and a hardware engine. Modules are connected by a local bus and a global bus.)

Multi-Core/Multiprocessor
• Heterogeneous processors
  – Special purpose processors for each application
  – High performance per cost
  – Different programming for each processor → complicated bugs!
• Homogeneous processors
  – Multiple general purpose processors
  – Programming environments for servers can be introduced.
    • Parallel OS, parallel compilers
  – Dynamic voltage control / dynamic frequency control → the necessary performance with optimized power
• Each processor executes its own task ⇔ different from tile processors

NEC MP211 (block diagram components): Sec. Acc.,
DMAC, USB OTG, 3D Acc., Rotater, Camera, Image Acc., ARM926 PE0, ARM926 PE1, ARM926 PE2, SPX-K602 DSP, Bus Interface, Scheduler, SDRAM Controller, APB Bridge0, APB Bridge1, Async Bridge0, Async Bridge1, TIM0, TIM1, TIM2, TIM3, WDT, Cam, DTV I/F, Mem. card, FLASH, LCD I/F, SRAM Interface, Inst. RAM, On-chip SRAM (640KB), PMU, PLL, OSC, PCM, IIC, LCD, SMU, uWIRE, UART, INTC, GPIO, SIO, DDR SDRAM

Cell (IBM/SONY/Toshiba)
(Figure: Cell block diagram. Eight SPEs, each with an SXU, a local store (LS), and a DMA unit; one PPE (CPU core, IBM Power) with a PXU, 32KB+32KB L1 cache, and 512KB L2 cache; MIC to external DRAM; BIC to Flex I/O; all connected by the EIB.)
• SPE: Synergistic Processing Element (SIMD core), 256KB local store
• EIB: 2+2 ring bus
• NUMA machines which share a single address space

MPCore (ARM+NEC)
(Figure: MPCore block diagram. Four CPU/VFP cores, each with L1 memory, a timer, a watchdog, a CPU interface, and an IRQ line; an interrupt distributor with private FIQ lines; a Snoop Control Unit (SCU) with duplicated L1 tags; a private peripheral bus; a coherence control bus; a private AXI R/W 64-bit bus to the L2 cache.)

Tile Processor/Processor Array
• Each PE provides its own PC and fetches instructions from its own instruction memory.
  → Falls into the category of NORMA machines
• However, it is close to the dynamically reconfigurable processors shown later.
  – A single task is executed with all PEs ⇔ multiprocessors
  – Heterogeneous PEs
  – A lot of homogeneous PEs
  – The program is embedded.
  – Simple interconnection network
  – The concept of context switching
  – The target is image processing and media processing.
  – MIT RAW
  – Quicksilver's ACM
  – MorphTech's rDSP
  – PicoChip's PC101

MIT's RAW
(Figure: an array of tiles, each made of a computing processor and a communication processor.)
• Computing processor: 8 stages, 32-bit, single issue, in order
  – 96KB I-cache, 32KB D-cache
  – 4-stage pipelined FPU
• Communication processor: 8 32-bit channels
• An on-chip NORMA system for embedded applications

ACM (Quicksilver)
(Figure: a matrix interconnect network connecting adaptive nodes, programmable nodes, and domain nodes, grouped hierarchically into level 1, level 2, and level 3 clusters.)

Dynamically Reconfigurable Processors
• Reconfigurable systems → previous lesson
  – Flexible, but dynamic reconfiguration takes tens of milliseconds.
• Dynamically Reconfigurable Processors
  – Improve area efficiency by changing the hardware structure.
  – IPs used in various SoCs.
  – History
    • Reconfigurable co-processors: Garp (1997), CHIMAERA (2000)
    • Multicontext reconfigurable devices: WASMII (1992), Time-multiplexing FPGA (1997), PipeRench (1998), DRL (1998)
    • Functional-level synthesis
  – Various commercial products have been available since 2000
    • IPFlex DAPDNA-2, NEC Electronics DRP-1, PACT Xpp, Elixent D-Fabrix
  – SONY's VME (Virtual Mobile Engine) is embedded in the Network Walkman and PSP
  – Recently, many Japanese vendors have started to develop commercial products
    • Fujitsu
    • Hitachi
    • Lucent
    • Sanyo
    • Toshiba (MeP + D-Fabrix)

Processing Element
• Specialized for media/stream processing
  Coarse grain ⇔ fine grain: the LUTs of FPGAs
• Components
  – ALU
  – Shifter + mask unit
  – Multiplexers
  – Registers
• Operations and the interconnection between components are changeable
• No instruction fetch mechanism: each PE is a part of a large datapath

Chameleon CS2112 DPU
(Figure: DPU datapath, 32-bit/16-bit. Two routing MUXes feed register & mask units and a barrel shifter, which feed the OP unit controlled by the instruction; results go to output registers.)
• OP: operations described in C or Verilog
• SIMD arrays and pipelines are formed with multiple DPUs.

Dynamic reconfiguration
• Compared with FPGAs, a coarse grain PE is area-efficient for media/stream processing.
  → However, the flexible part still requires semiconductor area: not comparable with hardware accelerators
• But it is flexible!
  → Dynamic reconfiguration: by changing the hardware structure, the same semiconductor area can be used for multiple tasks.

Instructions/Configuration data delivery
(Figure: configuration data is delivered from on-chip memory to the PEs.)
• Takes tens of microseconds
• PACT Xpp
• Elixent's D-Fabrix
• Multiple tasks can be switched → high area efficiency

Xpp (PACT Informationstechnologie)
(Figure: PACs surrounded by I/O blocks; each PAC contains PAEs built from RAM, ALU, and I/O elements, and has its own CM under one SCM.)
• PAC: Processing Array Cluster
• CM: Configuration Manager
• SCM: Supervising CM
• Xpp64 (8x8 PAC) is available
• Configuration requires hundreds of clock cycles
• 24-bit data, 40MHz clock

Elixent D-Fabrix
(Figure: the D-Fabrix processing array, a regular array of 4-bit ALUs and registers (R) connected through RAM based switch boxes, with RAM blocks of 8-bit address and 8-bit data.)

Stretch S5 engine
(Figure: S5 block diagram. Instruction cache, data cache, MMU, instruction unit, load/store unit; register files FR, AR, WR; FP unit (FPU), integer unit (ALU), and extension unit (ISEF).)

Multicontext reconfiguration
• Multiple sets of configuration data can be switched within a single clock cycle.
• The context memory is integrated into the PEs and switches.
• Fujitsu's MPLD using ROMs (1990), WASMII using RAM (1992), Xilinx's proposal (1997), NEC's DRL (1998), Chameleon CS2112 (2000)
(Figure: logic cells (PEs and switches) between input data and output data; a multiplexer driven by a context pointer selects one of n SRAM slots of context memory.)

Double buffering using multicontext devices
• Tasks are switched without overhead.
(Figure: while Task N is executing, the configuration data of Task N+1 is loaded; while Task N+1 is executing, the configuration data of Task N+2 is loaded.)

IPFlex's DAPDNA-2 (block diagram components): DDR SDR IF (64bit 166MHz), DNA load buffer, DAP (RISC), PCI IF (32bit 66MHz), Interrupt Controller, Timer, SROM IF, GPIO, UART, Serial IF, BSU, DMA Controller, DNA direct I/O (Async. In), DNA Matrix, DNA store buffer, DNA direct I/O (Async.
out)
• 368 heterogeneous PEs: ALU, memory, delay

Time-multiplexing execution of a single task
(Figure: the target hardware is divided into n parts that are executed one after another on a smaller reconfigurable device.)
• If the performance becomes 1/n, the performance/area is not increased.
• Even in dedicated hardware, not everything can be done within a single clock cycle. In this example, the dedicated hardware takes 4 clock cycles.
• The dynamically reconfigurable processor requires 8 clock cycles → the performance/area is improved.

NEC Electronics' DRP (Dynamically Reconfigurable Processor)
• Multicontext reconfiguration
  – 16 contexts
  – Controlled by an FSM (Finite State Machine)
  – Background loading of configuration data
• 8x8 PEs + distributed memory modules → a tile
• DRP-1 consists of 8 tiles → 512 PEs
• 8-bit data width
• State transition and configuration are controlled per tile.
• A single task is executed with multiple contexts.

DRP-1
(Figure: DRP-1 chip layout showing the tiles with their VMEM and HMEM modules.)

DRP Tile Structure
(Figure: an 8x8 array of PEs with VMEMs on the left and right sides, HMEMs along the top and bottom, VMEM controllers, and a State Transition Controller at the center. HMEM is a 1-port memory and VMEM is a 2-port memory; the sizes shown are 8 bit × 8092 entries and 256 entries.)

Context control with an FSM
1. Context switching
2. Parallel processing in a context
3. Sequential execution in a context
(Figure: state transition diagram with states 0 to 5, from data input to data output.)
• The DRP compiler automatically generates the diagram from a C-like language: BDL.
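The FSM-based context control described above can be illustrated with a small software model. This is a simplified sketch under stated assumptions, not the DRP's actual mechanism or tool output: it assumes exactly one context is active per clock cycle, models each context as a Python function acting on shared registers, and all names (`run_contexts`, `CONTEXTS`, the register keys) are hypothetical.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Registers = Dict[str, Any]
# A context pairs a datapath (work done in one cycle) with a transition rule
# that plays the role of the state transition controller.
Context = Tuple[Callable[[Registers], None], Callable[[Registers], Optional[int]]]


def run_contexts(contexts: Dict[int, Context], regs: Registers,
                 start: int = 0, max_cycles: int = 10_000) -> int:
    """Run contexts until a transition returns None; return the cycle count."""
    current: Optional[int] = start
    cycles = 0
    while current is not None:
        if cycles >= max_cycles:
            raise RuntimeError("FSM did not terminate")
        datapath, next_context = contexts[current]
        datapath(regs)                # the active context's datapath runs for one cycle
        current = next_context(regs)  # the FSM selects the context for the next cycle
        cycles += 1
    return cycles


def init(regs: Registers) -> None:        # context 0: data input / initialisation
    regs["i"], regs["acc"] = 0, 0


def accumulate(regs: Registers) -> None:  # context 1: one loop iteration per cycle
    regs["acc"] += regs["data"][regs["i"]]
    regs["i"] += 1


def output(regs: Registers) -> None:      # context 2: data output
    regs["out"] = regs["acc"]


CONTEXTS: Dict[int, Context] = {
    0: (init, lambda r: 1),
    1: (accumulate, lambda r: 1 if r["i"] < len(r["data"]) else 2),
    2: (output, lambda r: None),
}

if __name__ == "__main__":
    regs: Registers = {"data": [1, 2, 3, 4, 5, 6, 7, 8]}
    print(run_contexts(CONTEXTS, regs), regs["out"])
```

Registers persist across contexts while the datapath changes every cycle, which mirrors the idea that one task is spread over several contexts sharing the same PE array.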
IMEC ADRES
(Figure: ADRES architecture. VLIW view: instruction fetch, instruction dispatch, instruction decode, data cache, and a row of FUs sharing a register file (RF). Reconfigurable array view: an array of FUs, each with its own local RF.)

Rapport Kilocore
(Figure: Kilocore fabric of 16 PEs x 16 PEs organized in stripes. Each stripe is a row of PEs with an interconnect; an input controller and an output controller use 128-bit paths; a configuration controller uses a 32-bit input and 672-bit configuration words.)

Hitachi's FE-GA
(Figure: FE-GA block diagram. A computational cell array of ALU and MLT cells; load/store (LS) cells, each with a local memory (MEM); a crossbar switch between them; an I/O port; a bus interface; a sequence manager handling interrupt/DMA requests; a configuration manager.)

Dynamically Reconfigurable Processors

Product          Vendor     Configuration  Data width (bits)  PE
Xpp-64           PACT       Delivery       24                 Homo
D-Fabrix         Elixent    Delivery       4                  Homo
S5 engine        Stretch    Delivery       4/8                Hetero
PCA-2            NTT        Delivery       9                  Homo
CS2112           Chameleon  Multi-c (8)    16/32              Hetero
DAPDNA-2         IPFlex     Multi-c (4)    32                 Hetero
DRP-1            NECEL      Multi-c (16)   8                  Homo
Kilocore         Rapport    Multi-c        8                  Homo
ADRES            IMEC       Multi-c (32)   16                 Homo
FE-GA            Hitachi    Multi-c (4)    16                 Hetero
Cluster machine  Fujitsu    Multi-c        16                 Hetero

"Delivery": configuration data delivery. "Multi-c (n)": multicontext with n contexts.

Dynamically reconfigurable processors
(Figure: design space chart. One axis is the number of gates per node, from 10 to 10M, with granularity labels 4/5-input LUT, 8-bit ALU/registers, and 32-bit ALU/registers; the other axis is the number of nodes, from 1 to many; cost and time-multiplexing are also indicated. Plotted items: superscalar, VLIW, chip-multiprocessor, simple RISC, DAPDNA-2, CS2112, rDSP, PC101, ACM, DRP-1, ADRES, DRL, PARS, Kilocore, FPGA.)

C-level design (DRP)
(Figure: design flow from C source code to object code.)
• Behavioral Description Language (BDL): C-like
  – Bit width, pragma
  – Pointers are limited.
• Functional synthesis: an FSM and a datapath are generated.
  – Synthesis tools for ASICs can be used.
• Mapping: FSM → STC, datapath → PE array
• Place & routing
• Configuration data generation
(Figure, continued: high level synthesis produces the FSM and the datapath → technology mapper → place & router → code generation → object code.)

BDL code example

  mem(0:16) d0[8], d1[8], d2[8], d3[8], d4[8], d5[8], d6[8], d7[8];
  void row() {
    ter(0:16) SUMT0, SUMT1, SUMT2, SUMT3;
    reg(0:16) SUB0, SUB1, SUB2, SUB3;
    ter(0:16) z0, z1, z2, z3, z4, z5, z6, z7;
    reg(0:8) i=0;
    $
    for(; i < 8; i++) {
      d0[i], d1[i], d2[i], d3[i], d4[i], d5[i], d6[i], d7[i];
      $
      SUMT0 = d0[i] + d7[i]; SUB0 = d0[i] - d7[i];
      SUMT1 = d1[i] + d6[i]; SUB1 = d1[i] - d6[i];
      . . . . .
      z0 = A * SUMT0 + A * SUMT1 + A * SUMT2 + A * SUMT3;
      z2 = B * SUMT0 + C * SUMT1 - C * SUMT2 - B * SUMT3;
      . . . . .
      $
      z1 = D * SUB0 + E * SUB1 + F * SUB2 + G * SUB3;
      z3 = E * SUB0 - G * SUB1 - D * SUB2 - F * SUB3;
      . . . . .
      $
    }
  }

Notes on the example:
• mem(0:16): 16-bit memory, allocated to VMEM
• ter, reg: terminals and registers
• $: delimiter for the state/context
• The line listing d0[i] ... d7[i]: memory access for giving an address
• Terminals must be used in the assigned state/context.
• Registers can be used in the next states/contexts.

Various embedded architectures
(Figure: the architecture map again, from "high performance for a narrow application field" to "high performance for a wide application field", now annotated with trends.)
• Special purpose processor: dedicated hardware, DSP, stream processor, graphic processor, network processor
• Programmable hardware (going major?):
FPGA, reconfigurable systems, dynamically reconfigurable processors
• Multiple cores (next going major): tile processor, homogeneous chip-multiprocessor, heterogeneous multiprocessor
• Configurable processor (now going major): special instructions
• General purpose CPU

Glossary
• Most of the terms this time have already appeared in earlier lessons, and most are used in Japanese as they are.
• Tile Processor: タイルプロセッサ
• Dynamically reconfigurable processor: 動的再構成可能(リコンフィギャラブル)プロセッサ
• FSM (Finite State Machine): 有限状態マシン
• Multicontext: マルチコンテキスト型 (possibly also written マルチコンテクスト)
• Functional Synthesis: 機能合成
• Time multiplexed Execution: 時分割多重実行

Exercise
• Assume that the dynamically reconfigurable processor executes 1000 times faster than the host processor.
• Compute the total performance when it can be used for 90% of the total task.
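One way to set up the exercise, assuming it is intended as a direct application of Amdahl's Law from earlier in this lesson, is to take the ratio of acceleration as $r = 0.9$ and the speedup of the accelerated part as $s = 1000$. The symbols $r$ and $s$ are notation introduced here, not taken from the exercise itself.

```latex
\[
\text{Total speedup}
  = \frac{1}{(1 - r) + \dfrac{r}{s}}
  = \frac{1}{(1 - 0.9) + \dfrac{0.9}{1000}}
  = \frac{1}{0.1009}
  \approx 9.91
\]
```

Under this reading, even a 1000 times accelerator yields a little less than a 10 times total speedup, because the remaining 10% of the task still runs on the host processor.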