Function-Level Processor (FLP):
Raising Efficiency by Operating at Function
Granularity for Market-Oriented MPSoC

Hamed Tabkhi*, Robert Bushey^, Gunar Schirner*
* Electrical and Computer Engineering, Northeastern University, Boston (MA), USA
^ Embedded Systems Products and Technology, Analog Devices Inc. (ADI), Norwood (MA), USA
High-Performance Low-Power Computing
• Increasing demand for high-performance, low-power computing
  – Tens of billions of operations per second
  – Less than a few watts of power
  – Example markets:
    • Embedded vision
    • Software Defined Radio (SDR)
    • Cyber-Physical Systems (CPS)
• Instruction-Level Processors (ILPs)
  – Great flexibility
  – Very high power per operation
    • Only ~3% of power goes to computation [Keckler'11]
    • The remainder: instruction/data fetch, memory hierarchy, decode
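To make the 3% figure concrete, a back-of-envelope calculation (our illustration; only the 3% share is from the slide) shows the energy overhead it implies per useful operation:

```python
# Illustrative arithmetic: if only ~3% of ILP power performs the actual
# computation [Keckler'11], each useful operation carries roughly a 33x
# energy overhead relative to a compute-only datapath.
compute_fraction = 0.03           # share of ILP power spent on computation
overhead_per_op = 1.0 / compute_fraction
print(round(overhead_per_op))     # ~33x energy per operation
```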
Function-Level Processor (FLP)
H. Tabkhi, R. Bushey, G. Schirner
2
Multi-Processor Systems-on-Chip (MPSoCs)
• Heterogeneous composition for performance / power efficiency
  – ILPs (e.g. CPU, DSP, or even GPU) → flexibility
  – Custom HW for compute-intense kernels → power efficiency
• Recover NRE: aim for a market (domain of applications)
  – E.g. vision, SDR, radar
• Just increase the number of accelerators?
[Figure: heterogeneous MPSoC — four 0.8 GHz ILP cores (OS/drivers, interrupts INT0–INT31, I/D L1 and L2 caches), a GPU, custom accelerators (Alg1–Alg5), DMA engines, low-performance peripherals, a transducer bridge, SRAM, SDRAM controllers, and IP components on a shared interconnect]
ILP + HW Accel. Scaling Limitations
• Approach:
  – HW ACCs execute compute-intense kernels
  – ILPs execute the remaining application, orchestrate the HW ACCs, and coordinate data movement
  – On-chip scratchpad memory keeps data between ILPs and ACCs on-chip
    • Avoids costly off-chip memory accesses
[Figure: ILP core with cache, HW ACCs 0..N with per-ACC DMAs, a shared/LLC ACC memory, and the memory interface]
• Flexibility: increasing the number of HW ACCs is not scalable
• Efficiency:
  – Large on-chip scratchpad memory, yet often under-utilized ([Lyons'10] reports 30% utilization)
  – High volume of on-/off-chip traffic
  – Significant synchronization overhead on ILPs
⇒ Need new solutions balancing flexibility and efficiency.
Outline
• Intro: High-performance low-power computing
• Flexibility / Efficiency trade-off
• Related work
• Function-Level Programming
• Function-Level Processor (FLP)
• Experimental results
• Conclusions
Flexibility / Efficiency Trade-off
• Architecture classes:
  – FPGAs: high flexibility (gate level), good efficiency
  – ILPs: high flexibility (instruction level), low efficiency
  – HW IPs: low flexibility (not programmable — reprogramming needs new fabrication!), best efficiency
[Figure: efficiency [GOPs/Watt] vs. flexibility (gate → instruction → application); HW IPs sit at high efficiency / low flexibility, FPGAs in between, DSPs/GPUs/ILPs/control processors at high flexibility / low efficiency, leaving a flexibility/efficiency gap; insets contrast an FPGA bit-level fabric, an ILP (ALU + register file), and custom HW]
⇒ Need new solutions to fill the flexibility / efficiency gap.
Related Work
• Systems with many accelerators
  – Compose larger applications out of many accelerators
  – E.g. Accelerator Store [Lyons'10], Accelerator-Rich CMPs [Cong'12], CHARM [Cong'12]
  – Problem: redundant on-chip traffic and scratchpads for exchanging data between accelerators
• Coarse-Grained Reconfigurable Architectures (CGRAs)
  – Interconnect homogeneous compute units via mesh/crossbar
  – Great for instruction-level parallelism (SW loop acceleration) [Hartenstein'01][Park'11]
  – Challenges: higher power consumption than HW ACCs; tight dependence on host ILPs
• Application-Specific Instruction-set Processors (ASIPs)
  – Extend the ILP data-path with custom HW to execute C functions, e.g. [Venkatesh'14]
  – Higher efficiency with fairly good flexibility
  – Problem: still suffer from the limitations of ILPs
Raising Abstraction to Function-Level
• Function-Level Processor (FLP):
  – Match the programming abstraction at function-level granularity
  – Main target: streaming applications
  – Increase efficiency (coarser programmability)
  – Maintain flexibility
  – Simplify application composition
  – Focus on market-oriented MPSoCs (for a group of apps)
    • E.g. vision, SDR, radar
• Programming vs. architecture abstraction
[Figure: abstraction stack — application > function (Filter, CNV, Sort) > instruction (Add, Sub, For) > gate, bridged by compilers; efficiency [GOPs/Watt] vs. flexibility chart placing the FLP between HW ACCs and FPGAs/DSPs/GPUs/ILPs/control processors]
Function-Level Programming
• Applications are composed of functions
• Requirements:
  – Identify the common functions within the market's applications
    • Similar to libraries such as OpenCV, OpenSDR
  – Identify the common function interactions in the desired applications
[Figure: a function set (Functions A–K) on the left; the applications of the market (domain) on the right, each built as a different composition of functions drawn from that shared set]
• Function-Level Processor (FLP)
  – Architecture for function-level programming
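The composition idea can be sketched in a few lines (our illustration, not the paper's API; the function bodies are stand-ins): each application in the market is a different chain over one shared function set, much like a sequence of OpenCV calls.

```python
# Hypothetical market function set; real FBs would be CNV, THC, PMA, etc.
FUNCTION_SET = {
    "CNV": lambda xs: [2 * x for x in xs],       # stand-in for convolution
    "THC": lambda xs: [int(x > 4) for x in xs],  # stand-in for thresholding
}

def compose(chain):
    """Build one streaming application as a pipeline of named functions."""
    def run(stream):
        for name in chain:
            stream = FUNCTION_SET[name](stream)
        return stream
    return run

# One application drawn from the common set: convolve, then threshold.
edge_like = compose(["CNV", "THC"])
print(edge_like([1, 2, 3]))  # [0, 0, 1]
```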
FLP Overview
• FLP principles:
  – Target stream-processing applications (e.g. vision, SDR, radar)
  – Construct a macro pipeline out of functions to build an application
  – Self-controlled function blocks
  – Power-gate unused blocks
  – Compute contiguously inside the FLP
    • Avoid costly data movement
    • Limited ILP interaction
  – Independent processing unit
[Figure: the FLP as a stream-in / stream-out macro pipeline built from Function0–Function5]
• FLP architecture components:
  1. Optimized Function Blocks (FBs)
  2. MUX-based / market-specific communication
  3. Separation of data traffic
  4. Autonomous control and synchronization
Function Block: Traffic Separation
• Insight: not all traffic is equal!
  – Routes, roles, and importance differ
  – Access patterns differ
• Streaming: input / output data stream
  – Independent of the computation's realization
  – Can be hidden by the FB-to-FB connection
• Operational: traffic generated by the algorithm itself (other than output)
  – Part of the algorithm (algorithm-intrinsic), e.g. a model of the stream
  – Reuse potential: cache?
  – Size? May be spilled to the memory hierarchy
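A minimal model of this split (our illustration; names are hypothetical): streaming accesses ride the FB-to-FB links and never reach the memory hierarchy, so only operational data contributes to memory traffic.

```python
# Route each access by traffic class; byte counts here are made up.
def route(access):
    kind, nbytes = access
    if kind == "stream":
        return ("fb_link", 0)         # hidden by the direct FB-to-FB link
    return ("mem_hierarchy", nbytes)  # operational: cache or spill to memory

accesses = [("stream", 4096), ("operational", 256), ("stream", 4096)]
mem_bytes = sum(route(a)[1] for a in accesses)
print(mem_bytes)  # only the operational data touches the hierarchy
```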
Function Block (FB) Decomposition
(a) Function-specific (but configurable) data-path
  • Implements a function y = f(x)
  • E.g. CNV, Histogram, Gaussian
(b) Configuration / control registers
  • Alter / parameterize functionality
  • E.g. threshold values, dimensions
(c) Stream-in / stream-out ports
  • Stream data for direct FB-to-FB communication
  • E.g. data samples, frame pixels
(d) Stream buffer
  • Temporary storage
  • E.g. for an integral operation
(e) Operational data / local buffer
  • Access to the memory hierarchy
  • E.g. Gaussian parameters, CNV coefficient values
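Parts (a)-(e) can be mirrored in a small software model (a sketch of the decomposition; all names here are ours, not the hardware's): note how a control register changes behavior without touching the data-path.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class FunctionBlock:
    datapath: Callable[[int, Dict[str, int]], int]        # (a) y = f(x)
    config: Dict[str, int] = field(default_factory=dict)  # (b) control regs
    stream_buffer: List[int] = field(default_factory=list)      # (d) temp storage
    operational: Dict[str, list] = field(default_factory=dict)  # (e) local data

    def step(self, x: int) -> int:      # (c) stream-in -> stream-out
        y = self.datapath(x, self.config)
        self.stream_buffer.append(y)    # e.g. running state for an integral op
        return y

# A threshold FB: the register alters the function, not the data-path.
thresh = FunctionBlock(datapath=lambda x, cfg: int(x > cfg["threshold"]),
                       config={"threshold": 5})
out = [thresh.step(x) for x in (3, 7, 9)]
print(out)  # [0, 1, 1]
```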
MUX-Based Interconnection
• Point-to-point communication between FBs
  – Streaming data over a set of configurable MUXes
  – Hides stream traffic from the memory hierarchy
• Sparse connectivity
  – Only between compatible data types
  – Only the connectivity required by the application set
  – Arranged in pipeline stages with forward and backward links
• Data type and flow control
  – Marshaling / de-marshaling of streaming data
  – Side-channel signals for flow control
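Sparse connectivity can be modeled as a per-MUX whitelist (our encoding; the MUX and FB names are hypothetical): each MUX may only select the few upstream sources the target application set actually needs.

```python
# Hypothetical legal sources per MUX (sparse, type-compatible connectivity).
ALLOWED = {
    "mux_THC": {"CNV0", "CNV1"},
    "mux_OUT": {"THC", "CNV1"},
}

def configure(selections):
    """Validate and apply one point-to-point routing configuration."""
    for mux, src in selections.items():
        if src not in ALLOWED[mux]:
            raise ValueError(f"{mux} cannot select {src}")
    return selections

cfg = configure({"mux_THC": "CNV0", "mux_OUT": "THC"})
print(cfg["mux_OUT"])  # THC
```

An illegal route (one the application set never needs) is rejected at configuration time rather than wired into the fabric.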
Autonomous Control / FSA
• Autonomous control
  – FLP with a centralized Control Unit (CU)
  – Fetches configuration code from system memory through DMA
  – Transitions across configurations
  – Distributes configurations: interconnect and FB configuration
  – Coordinates FBs
  – Minimizes or even eliminates the need for ILP-centric control
[Figure: the control unit fetching config code from system memory via DMA and driving the config/control registers of Function0, Function1, ...]
• Function-Set Architecture (FSA)
  – Exposes:
    • Type and number of FBs
    • FB configuration
    • FB connectivity
  – Coarse-grained programmability at the function level
    • Significantly more abstract than an ISA
    • Comparable to developing applications out of API functions
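One way to picture an FSA configuration (our encoding, not the real config format; the "canny" entry and its fields are hypothetical): a named configuration lists FB register settings plus connectivity, and the control unit fetches and distributes it with no per-instruction ILP involvement.

```python
# Stand-in for configuration code resident in system memory.
CONFIG_MEM = {
    "canny": {
        "blocks":  {"CNV0": {"kernel": "sobel"}, "THC": {"threshold": 40}},
        "connect": [("in", "CNV0"), ("CNV0", "THC"), ("THC", "out")],
    },
}

class ControlUnit:
    def __init__(self):
        self.fbs, self.routes = {}, []

    def load(self, name):
        cfg = CONFIG_MEM[name]              # fetched through DMA in the FLP
        self.fbs = dict(cfg["blocks"])      # distribute FB configuration
        self.routes = list(cfg["connect"])  # configure the interconnect

cu = ControlUnit()
cu.load("canny")
print(len(cu.routes))  # 3
```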
FLP Architecture
• Efficiency contributors:
  – Optimized data-path per function block
  – Eliminate instruction-fetch overhead
  – Eliminate system-level traffic for streaming data
  – Direct system inputs/outputs
  – Dedicate the memory hierarchy only to reusable data (operational traffic)
  – Minimize ILP interaction
[Figure: FLP streaming-pipe controller over pipeline stages of function blocks (Function0–FunctionN, including an arithmetic unit) joined by forward/backward MUXes; input/output encoder-formatters at the system interfaces; per-stage parameter buffers/caches served by DMA engines for operational data]
System Integration
• FLP as an independent processor
  – Paired with ILP cores to create complete control and analytic applications
  – FLP for compute-intense processing
  – ILPs for top-level adaptive control and intelligence
[Figure: FLP 0..N (each with an I/O interface and config cache) and ILP 0..N (each with a cache) sharing a last-level cache and the memory interface]
Experimental Setup
• Pipeline Vision Processor (PVP)
  – Contains 11 FBs, manually selected via application analysis
    • Vision filters: edge detection / enhancement
  – Target market: Advanced Driver Assistance Systems (ADAS)
  – Fabricated at 65nm, running at 83MHz
  – Part of the Blackfin 60x series (+ 2x Blackfin DSP cores)
[Figure: PVP block diagram — input formatters feeding CNV blocks (Conv0–3), THC, IIM, and PEC blocks, with PMA control registers (0x0–0x9), down to output formatters]
• The FLP is a generalization of the PVP
  – Result of joint work with the PVP chief architect (2nd author)
  – Generalized & improved
  – The PVP can be considered an early FLP instance
  – We use the PVP for component results (due to the similarity)
Flexibility: Application Mapping
• Example: parallel execution of
  – High-beam / low-beam (HB/LB) detection
  – Canny edge detection
[Figure: both applications mapped onto the PVP at once — HB/LB and Canny each routed through chains of CNV, THC, PMA, and PEC blocks]
• Mapping flexibility:
  – 10 ADAS apps mapped
  – Up to 2 parallel streams
Comparison Setup
1. ILP-BFDSP
  – Blackfin DSP cores only
2. ILP-BFDSP+HWACC
  – Blackfin DSPs + custom HW accelerators (convolution, color extraction)
3. FLP-PVP
  – PVP only
[Figure: ILP-BFDSP cores with caches and DMA-fed accelerators (low-pass filter / convolution, color & illumination extraction) vs. the FLP-PVP with cache, DMA, and configuration]
• VERY optimistic assumptions for the ILP:
  – Fused implementation
  – Optimal pipeline utilization (no stalls)
  – 90% of data traffic hidden from off-chip memory
  – Stream data between ACCs and ILPs kept in scratchpad
• Power:
  – PVP function-block power taken from the fabricated chip: TSMC 65nm LP
  – Single Blackfin DSP core: 280mW
Results Comparison
• Computation [GOPs]:
  – FLP-PVP delivers up to 22.5 GOPs (HB/LB + TSR)
  – Matching it requires 2 ILP cores with ACCs, or 7 ILP cores alone
• Communication (off-chip traffic [GB/s]):
  – FLP-PVP produces 1/5th the off-chip traffic of the ILP setup
  – and 2/5th that of the ILP+ACC setup
• Power [W]:
  – FLP-PVP consumes 14x-18x less than ILP
  – and 5x less than ILP+ACC
[Figure: bar charts comparing ILP, ILP+ACC, and FLP on operations [GOPs], off-chip traffic [GB/s], and power [W] split into communication and computation]
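A quick sanity check on the core counts, combining only numbers stated on the slides (7 vs. 2 Blackfin cores needed, 280 mW per core; accelerator and FLP power are measured on-chip values and are not derived here):

```python
# DSP-core power implied by the stated core counts, before any ACC/FLP power.
core_mw = 280                 # single Blackfin DSP core, from the slides
ilp_w = 7 * core_mw / 1000    # ILP-only setup: 7 cores
acc_w = 2 * core_mw / 1000    # ILP+ACC setup: 2 cores
print(ilp_w, acc_w)  # 1.96 0.56
```

Even under the very optimistic ILP assumptions, the DSP cores alone put the ILP-only setup near 2 W, consistent with the order-of-magnitude power gap reported above.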
Conclusions
• Function-Level Processor (FLP)
  – Raises the architecture abstraction to functions
    • Targets streaming application markets (domain-specific)
  – Avoids the inefficiencies of traditional ILP+HWACC designs
    • No instruction fetch (reduced flexibility, now at function level)
    • Custom traffic management (stream vs. operational)
    • Stream traffic hidden from the system fabric
    • Autonomous control, higher independence
• Research challenges / future work
  – Selection of FBs & their potential compositions
    • A minimal but sufficient set of FBs
    • Need tools to explore and analyze the market's applications
  – Enhance from spatial multi-stream to timed multi-stream
  – Simplify FLP programming / traffic optimization
Thank you!