Function-Level Processor (FLP): Raising Efficiency by Operating at Function Granularity for Market-Oriented MPSoCs

Hamed Tabkhi*, Robert Bushey^, Gunar Schirner*
* Electrical and Computer Engineering, Northeastern University, Boston (MA), USA
^ Embedded Systems Products and Technology, Analog Devices Inc. (ADI), Norwood (MA), USA

High-Performance Low-Power Computing
• Increasing demand for high-performance, low-power computing
  – Tens of billions of operations per second
  – Less than a few watts of power
  – Example markets: embedded vision, Software-Defined Radio (SDR), Cyber-Physical Systems (CPS)
• Instruction-Level Processors (ILPs)
  – Great flexibility
  – Very high power per operation
    • Only 3% of power goes to actual computation [Keckler'11]
    • The rest: instruction/data fetch, memory hierarchy, decode

Multi-Processor Systems-on-Chip (MPSoCs)
• Heterogeneous composition for performance / power efficiency
  – ILPs (e.g. CPU, DSP, or even GPU) for flexibility
  – Custom HW for compute-intensive kernels for power efficiency
• Recover NRE: aim for a market (a domain of applications), e.g. vision, SDR, radar
• Just increase the number of accelerators?
[Figure: MPSoC block diagram — four 0.8 GHz ILP cores with L1/L2 caches and OS/driver interrupt handling (INT0..INT31), a GPU, replicated algorithm accelerators (Alg1–Alg5), DMAs, low-performance peripherals, a bridge, transducer, SRAM, and SDRAM controllers]

ILP + HW Accel.:
Scaling Limitations
• Approach:
  – HW-ACCs execute compute-intensive kernels
  – ILPs execute the remaining application, orchestrate the HW-ACCs, and coordinate data movement
  – On-chip scratchpad memory keeps data between ILP and ACCs on-chip, avoiding costly off-chip memory accesses
• Increasing the number of HW-ACCs is not scalable:
  – Large on-chip scratchpad memory, but often under-utilized ([Lyons'10] reports 30% utilization)
  – High volume of on-/off-chip traffic
  – Significant synchronization overhead on the ILPs
→ Need new solutions balancing flexibility and efficiency.
[Figure: ILP + accelerator template — ACC 0..N with DMAs, an ILP with cache, shared/LLC memory, and a memory interface]

Outline
• Intro: high-performance low-power computing
• Flexibility / efficiency trade-off
• Related work
• Function-level programming
• Function-Level Processor (FLP)
• Experimental results
• Conclusions

Flexibility / Efficiency Trade-off
• Architecture classes:
  – FPGAs: high flexibility (gate level), good efficiency [GOPs/Watt]
  – ILPs (control processors, DSPs, GPUs): high flexibility (instruction level), low efficiency
  – HW IPs: low flexibility (not programmable: reprogramming needs new fabrication!), best efficiency
→ Need new solutions to fill the flexibility / efficiency gap.
[Figure: efficiency vs. flexibility plot — HW IPs at the efficient/inflexible end, ILPs at the flexible/inefficient end, FPGAs, GPUs, and DSPs in between, with a flexibility/efficiency gap between HW IPs and FPGAs]

Related Work
• Systems with many accelerators
  – Compose larger applications out of many accelerators, e.g. Accelerator Store [Lyons'10], Accelerator-Rich CMPs [Cong'12], CHARM [Cong'12]
  – Problem: redundant on-chip traffic and scratchpad usage for exchanging data between accelerators
• Coarse-Grained Reconfigurable Architectures (CGRAs)
  – Interconnect homogeneous compute units via a mesh/crossbar
  – Great for instruction-level parallelism (SW loop acceleration) [Hartenstein'01][Park'11]
  – Challenges: higher power consumption than HW-ACCs; tight dependence on the host ILPs
• Application-Specific Instruction-set Processors (ASIPs)
  – Extend the ILP data-path with custom HW to execute C functions, e.g. [Venkatesh'14]
  – Higher efficiency with fairly good flexibility
  – Problem: still suffer from the limitations of ILPs

Raising Abstraction to the Function Level
• Programming vs. architecture abstraction
  – ILPs: the compiler maps applications down to instructions (Add, Sub, For)
  – HW-ACCs: realize whole functions (Filter, CNV, Sort)
• Function-Level Processor (FLP):
  – Match the programming abstraction at function-level granularity
  – Increase efficiency (coarser programmability)
  – Maintain flexibility
  – Simplify application composition
• Main target: streaming applications
• Focus on market-oriented MPSoCs (for a group of applications), e.g. vision, SDR, radar
Function-Level Programming
• Applications are composed of functions
• Requirements:
  – Identify common functions within the market's applications (similar to libraries such as OpenCV, OpenSDR)
  – Identify common function interactions in the desired applications
[Figure: function set (Functions A–K) mapped to the applications of a market (domain)]
• Function-Level Processor (FLP): an architecture for function-level programming

FLP Overview
• FLP principles:
  – Target stream-processing applications, e.g. vision, SDR, radar
  – Construct a macro pipeline out of functions to build the application
  – Self-controlled function blocks; power-gate unused blocks
  – Compute contiguously inside the FLP: avoid costly data movement, limit ILP interaction, operate as an independent processing unit
• FLP architecture components:
  1. Optimized Function Blocks (FBs)
  2. MUX-based, market-specific communication
  3. Separation of data traffic
  4. Autonomous control and synchronization
[Figure: FLP with a stream-in/stream-out path through chained function blocks (Function0..Function5)]

Function Block: Traffic Separation
• Insight: not all traffic is equal!
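As a software analogy for the macro-pipeline principle described in the FLP overview — chaining function blocks so that stream data flows directly FB-to-FB — consider this minimal sketch. The function names and the toy 1D data are illustrative only, not the actual FB set:

```python
# Sketch: compose a streaming application as a macro pipeline of function
# blocks. Each "FB" is a generator stage; data streams FB-to-FB without
# landing in a shared memory. Names are hypothetical, not the real FSA.

def convolve(stream, kernel):
    # toy 1D convolution over a pixel stream (valid positions only),
    # standing in for a CNV-style function block
    k = len(kernel)
    window = []
    for px in stream:
        window.append(px)
        if len(window) == k:
            yield sum(w * c for w, c in zip(window, kernel))
            window.pop(0)

def threshold(stream, level):
    # threshold-style function block reduced to a simple binarizer
    for v in stream:
        yield 1 if v >= level else 0

# Compose an application by chaining the stages (cf. the HB/LB mapping)
pixels = [3, 7, 2, 9, 4, 8]
out = list(threshold(convolve(pixels, [1, 1, 1]), 15))  # -> [0, 1, 1, 1]
```

Because each stage only consumes and produces a stream, swapping or reordering stages — the software analogue of reconfiguring the MUX interconnect — requires no change to the stages themselves.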
  – Traffic differs in routes, roles, importance, and access patterns
• Streaming traffic: the input / output data stream
  – Independent of how the computation is realized
  – Can be hidden by FB-to-FB connections
• Operational traffic: generated by the algorithm itself (other than the output)
  – Algorithm-intrinsic, e.g. a model of the stream
  – Reuse potential (cache?) and size determine whether it is emitted to the memory hierarchy

Function Block (FB)
• Decomposition:
  (a) Function-specific (but configurable) data-path: implements a function y = f(x), e.g. CNV, Histogram, Gaussian
  (b) Configuration / control registers: alter / parameterize the functionality, e.g. threshold values, dimensions
  (c) Stream-in / stream-out: stream data for direct FB-to-FB communication, e.g. data samples, frame pixels
  (d) Stream buffer: temporary storage, e.g. for an integral operation
  (e) Operational data / local buffer: accesses the memory hierarchy, e.g. Gaussian parameters, CNV coefficient values

MUX-Based Interconnection
• Point-to-point communication between FBs
  – Streaming data routed through a set of configurable MUXes
  – Hides stream traffic from the memory hierarchy
• Sparse connectivity
  – Only between compatible data types, and only the connectivity required by the application set
  – Arranged in pipeline stages with forward and backward paths
• Data types and flow control
  – Marshaling / de-marshaling of streaming data
  – Side-channel signals for flow control

Autonomous Control / FSA
• Autonomous control
  – FLP with a centralized Control Unit (CU)
  – Fetches configuration code from system memory through DMA
[Figure: control unit with DMA and per-FB configuration/control interfaces]
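A toy software model of this configuration flow — the CU applying one configuration to the function blocks' registers and to the MUX-based interconnect — might look like the following. All class, block, and field names are hypothetical; the real CU and FSA are hardware:

```python
# Sketch: a Control Unit (CU) distributes a fetched configuration to
# function blocks (register settings) and to the interconnect (routes),
# with no ILP involvement once the configuration is loaded. Hypothetical.

class FunctionBlock:
    def __init__(self, name):
        self.name = name
        self.registers = {}          # configuration / control registers

    def configure(self, regs):
        self.registers.update(regs)  # e.g. thresholds, kernel coefficients

class ControlUnit:
    def __init__(self, blocks):
        self.blocks = {fb.name: fb for fb in blocks}
        self.routes = []             # MUX settings: (src FB, dst FB) pairs

    def apply_config(self, config):
        # distribute per-FB register settings and interconnect routes
        for name, regs in config["fb_regs"].items():
            self.blocks[name].configure(regs)
        self.routes = list(config["routes"])

cu = ControlUnit([FunctionBlock("CNV0"), FunctionBlock("THC0")])
cu.apply_config({
    "fb_regs": {"CNV0": {"kernel": [1, 2, 1]}, "THC0": {"level": 128}},
    "routes": [("CNV0", "THC0")],   # stream CNV0 output into THC0
})
```

Switching applications then amounts to calling `apply_config` with a different configuration blob — the software analogue of the CU transitioning across configurations fetched via DMA.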
  – Transitions across configurations
  – Distributes the interconnect and FB configuration
  – Coordinates the FBs
  – Minimizes or even eliminates the need for ILP-centric control
• Function-Set Architecture (FSA)
  – Exposes the type and number of FBs, the FB configuration, and the FB connectivity
  – Coarse-grained programmability at the function level
    • Significantly more abstract than an ISA
    • Comparable to developing applications out of API functions

FLP Architecture
• Efficiency contributors:
  – Optimized data-path per function block
  – Eliminates instruction-fetch overhead
  – Eliminates system-level traffic for streaming data
  – Direct system inputs/outputs
  – Dedicates the memory hierarchy only to reusable data (operational traffic)
  – Minimizes ILP interaction
[Figure: FLP architecture — input/output encoders/formatters, a streaming-pipe controller, forward/backward MUX stages connecting Function0..FunctionN, parameter buffers/caches, and DMAs]

System Integration
• FLP as an independent processor, paired with ILP cores
  – Together they create complete control and analytic applications
  – FLPs handle the compute-intensive processing
  – ILPs provide the top-level adaptive control and intelligence
[Figure: system with FLP 0..N (config caches, I/O interfaces) and ILP 0..N (caches) sharing a last-level cache and memory interface]

Experimental Setup
• Pipeline Vision Processor (PVP)
  – Contains 11 FBs, manually selected through application analysis
  – Vision filters: edge detection / enhancement
  – Target market: Advanced Driver Assistance Systems (ADAS)
  – Fabricated at 65nm @ 83MHz, as part of the Blackfin BF60x series (alongside 2 Blackfin DSP cores)
[Figure: PVP block diagram — input formatters, convolution blocks (CNV Conv0–Conv3), THC, PMA, PEC, and IIM blocks, control registers, and output formatters]
• The FLP is a generalization of the PVP
  – Result of joint work with the PVP chief architect (2nd author); generalized and improved
  – The PVP can be considered an early FLP instance; we use the PVP for component results (due to the similarity)

Flexibility: Application Mapping
• Example: parallel execution of high-beam / low-beam (HB/LB) detection and Canny edge detection, mapped onto the CNV, THC, PMA, and PEC blocks
• Mapping flexibility: 10 ADAS applications mapped, with up to 2 parallel streams

Comparison Setup
1. ILP-BFDSP: Blackfin DSP cores only
2. ILP-BFDSP + HWACC: Blackfin DSPs plus custom HW accelerators (low-pass filter / convolution, color/illumination extraction)
3. FLP-PVP: the PVP only
• VERY optimistic assumptions for the ILP:
  – Fused implementation
  – Optimal pipeline utilization (no stalls)
  – 90% of data traffic hidden from off-chip memory
  – Stream data between ACC and ILP kept in scratchpad
• Power:
  – PVP function-block numbers taken from the fabricated chip (TSMC 65nm LP)
  – Single Blackfin DSP core: 280 mW

Results Comparison
• Computation: to sustain 22.5 GOPs (HB/LB + TSR), one FLP-PVP suffices, ILP+ACC requires 2 ILP cores, and ILP alone requires 7 ILP cores
• Communication: FLP-PVP's off-chip traffic is 1/5 that of ILP and 2/5 that of ILP+ACC
• Power: FLP-PVP consumes 14x–18x less than ILP and 5x less than ILP+ACC
[Figure: bar charts of operations (GOPs), off-chip traffic (GB/s), and power (W, split into communication and computation) for ILP, ILP+ACC, and FLP]

Conclusions
• Function-Level Processor (FLP)
  – Raises the architecture abstraction to functions
  – Targets streaming application markets (domain-specific)
  – Avoids the inefficiencies of traditional ILP + HW-ACC designs:
    • No instruction fetch (flexibility reduced to the function level)
    • Custom traffic management (streaming vs. operational)
    • Stream traffic hidden from the system fabric
    • Autonomous control, higher independence
• Research challenges / future work
  – Selection of FBs and their potential compositions: a minimal but sufficiently contiguous set of FBs; tools are needed to explore and analyze the market's applications
  – Enhance from spatial multi-stream to timed multi-stream
  – Simplify FLP programming / traffic optimization

Thank you!