The last lesson: Recent embedded Architectures

Hideharu Amano
Embedded processors
• Cost/power-centric; performance is tuned for a specific application
• RISC processors
• Compact (shrunk) instruction sets are provided
– ARM (ARM)
– MIPS (MIPS)
– SH (Hitachi/Renesas)
• Works at 60MHz-800MHz depending on the application
→ Performance was sufficient until the 1990s
MOPS (Million Operations Per Second) for various embedded applications
[Figure: applications placed on a logarithmic MOPS axis (10, 100, 1000, 10000). Categories: Image processing, Voice, Music, Graphics, Communication. Applications shown: MPEG2/4 Enc., JPEG Enc./Dec., MPEG2/4 Dec., Dolby Enc./Dec., MP3 Enc./Dec., 5K sentence translation, 100K words identification, 3 dimensional image generation, 2 dimensional image generation, VoIP modem, CDMA modem.]
→ The performance of a simple RISC processor is not enough.
Performance enhancement techniques for CPU
• Using high clock frequency
• Instruction Level Parallel processing
– Superscalar
– Sophisticated Branch Prediction
– VLIW (Very Long Instruction Word)
– Dynamic scheduling of instructions
• SMT (Simultaneous MultiThreading)
• Thread Level Parallel Processing
– SIMD (Single Instruction stream Multiple Data streams)
– MIMD (Multiple Instruction streams Multiple Data streams)
– Chip-multiprocessors
• Efficient for every CPU and, of course, useful for embedded CPUs
→ But they increase cost/power consumption
Embedded CPU + Hardware Accelerator
[Figure: an embedded CPU, a hardware accelerator, RAM and I/O connected by an on-chip bus or on-chip network.]
• A hardware accelerator is suitable for achieving high performance in a specific application.
• Various types of architectures exist for embedded processing.
Amdahl’s Law
• Total SpeedUp = 1 / ((1 - ratio of acceleration) + (ratio of acceleration) / (SpeedUp of acceleration))
• Example: 100 times acceleration.
• If the ratio of acceleration is 50%, the total speedup is only about 1.98 times.
• Fortunately, the ratio is large in media processing.
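Amdahl's Law above can be checked numerically. The following is a minimal Python sketch; the function and parameter names are illustrative assumptions, not part of the lecture.

```python
def amdahl_speedup(accelerated_ratio: float, accel_speedup: float) -> float:
    """Total speedup when `accelerated_ratio` of the work runs `accel_speedup` times faster."""
    if not 0.0 <= accelerated_ratio <= 1.0:
        raise ValueError("accelerated_ratio must be within [0, 1]")
    if accel_speedup <= 0.0:
        raise ValueError("accel_speedup must be positive")
    # The unaccelerated part keeps its original time; the accelerated part shrinks.
    return 1.0 / ((1.0 - accelerated_ratio) + accelerated_ratio / accel_speedup)


if __name__ == "__main__":
    # A 100x accelerator applied to increasing fractions of the work.
    for ratio in (0.5, 0.8, 0.95, 0.99):
        print(f"ratio={ratio:.2f}  total speedup={amdahl_speedup(ratio, 100.0):.2f}")
```

Even with a 100 times accelerator, the total speedup stays below 2 when only half of the work is accelerated, which is why a large ratio of acceleration matters.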
Various embedded architectures
[Figure: design space ranging from "High performance for narrow application field" to "High performance for wide application field".]
• Special purpose processor / Dedicated hardware: DSP, Stream processor, Graphic processor, Network processor
• Multiple cores: Heterogeneous Multiprocessor
• Programmable hardware: FPGA, Reconfigurable systems, Dynamically Reconfigurable Processors
• Multiple cores: Tile Processor, Homogeneous Chip-multiprocessor
• Configurable Processor: special instructions
• General purpose CPU
Hardware/Software Co-design
[Figure: co-design flow.
 Specification Analysis → System Spec. → Hardware/Software division
 Hardware side: Hardware Spec. → Hardware Functional Synthesis → Hardware design
 Software side: Software Spec. → Program Generation → Program
 Interface: Interface Generation → Interface design
 Co-verification → System design]
• High level design cost can be reduced.
• Recently, low level design cost has increased.
Configurable Processor/Integrated Platform
• Configurable Processor
– Hardware accelerators and special purpose processors can be combined as special instructions.
  • ARC (ARC)
  • Xtensa (Tensilica)
  • MeP (Toshiba)
  • Triton (Poseidon Design Systems)
– Various types of interconnection are possible.
– Integrated software environment
• Integrated Platform → Standard components
– UniPhier (Matsushita)
Configurable Processor MeP
[Figure: MeP module structure.
 MeP Core: 32bit Processor Core, Configuration (Optional Inst., Memory Size, Interrupt, Debugging, ...), Bus IF
 Extension: Extended Inst., UCI, DSP, VLIW, ..., Hardware engine
 MeP modules MM1, MM2, ..., MMn; Local bus; Global bus]
Multi-Core/Multiprocessor
• Heterogeneous Processors
– Special purpose processors for each application
– High performance/cost
– Different programming for different processor
→ Complicated BUGs!
• Homogeneous Processors
– Multiple general purpose processors
– Programming environment for servers can be
introduced.
• Parallel OS, Parallel Compilers
– Dynamic Voltage Control/Dynamic Frequency Control
→Necessary performance with optimized power.
• Each processor executes its own task ⇔
Different from Tile processors
NEC MP211
[Figure: block diagram of the MP211.
 Processors: ARM926 PE0, ARM926 PE1, ARM926 PE2, SPX-K602 DSP
 Accelerators: Sec. Acc., 3D Acc., Image Acc., Rotater, DMAC
 Memory: On-chip SRAM (640KB), Inst. RAM, SRAM Interface, SDRAM Controller, DDR SDRAM, FLASH, Mem. card
 Interconnect: Bus Interface, Async Bridge0, Async Bridge1, APB Bridge0, APB Bridge1, Scheduler
 Peripherals: USB OTG, Camera, Cam DTV I/F, LCD I/F, LCD, PCM, IIC, uWIRE, UART, SIO, GPIO, INTC, TIM0, TIM1, TIM2, TIM3, WDT, PMU, SMU, PLL, OSC]
Cell (IBM/SONY/Toshiba)
[Figure: block diagram of Cell.
 PPE: CPU core (IBM Power) with PXU, 32KB+32KB L1 cache and 512KB L2 cache
 SPE (Synergistic Processing Element, SIMD core) × 8: each with SXU, LS (256KB Local Store) and DMA
 EIB: 2+2 Ring Bus
 MIC: interface to external DRAM
 BIC: Flex I/O]
• NUMA machine which shares a single address space
MPCore (ARM+NEC)
[Figure: block diagram of MPCore.
 Interrupt Distributor with private FIQ lines; an IRQ is delivered to each CPU
 Four CPU/VFP cores, each with Timer, Wdog, CPU interface and L1 Memory
 Snoop Control Unit (SCU) with duplicated L1 tags, private peripheral bus and coherence control bus
 Private AXI R/W 64bit bus to the L2 Cache]
Tile Processor/Processor Array
• Each PE provides its own PC, and fetches instructions from its own instruction memory.
→ Falls into NORMA machines
• However, it is close to the dynamically reconfigurable processors shown later.
– A single task is executed with all PEs ⇔ Multiprocessors
– Heterogeneous PEs
– A lot of homogeneous PEs
– Program is embedded.
– Simple interconnection network.
– The concept of context switching
– The target is image processing and media processing.
– MIT RAW
– Quicksilver’s ACM
– MorphTech’s rDSP
– PicoChip’s PC101
MIT’s RAW
[Figure: structure of one RAW tile.
 Computing Processor (8 stages, 32bit, single issue, in order) with 96KB I-Cache, 32KB D-Cache and a 4-stage pipelined FPU
 Communication Processor with 8 32-bit channels]
• On-chip NORMA system for embedded applications
ACM (Quicksilver)
[Figure: Adaptive Nodes, Programmable Nodes and Domain Nodes connected by a Matrix Interconnect Network and grouped into Level1, Level2 and Level3 clusters.]
Dynamically Reconfigurable Processors
• Reconfigurable systems → Previous lesson
– Flexible, but dynamic reconfiguration takes tens of milliseconds.
• Dynamically Reconfigurable Processors
– Improves area efficiency by changing the hardware structure.
– IPs used in various SoCs.
– History
  • Reconfigurable co-processors: Garp (1997), CHIMAERA (2000)
  • Multicontext reconfigurable devices: WASMII (1992), Time-multiplexing FPGA (1997), PipeRench (1998), DRL (1998)
  • Functional-level synthesis
– Various commercial products have been available since 2000
  • IPFlex DAPDNA-2, NEC Electronics DRP-1, PACT Xpp, Elixent D-Fabrix
– SONY’s VME (Virtual Mobile Engine) is embedded in the Network Walkman and PSP
– Recently, many Japanese vendors have started to develop commercial products
  • Fujitsu
  • Hitachi
  • Lucent
  • Sanyo
  • Toshiba (MeP+D-Fabrix)
Processing Element
• Specialized for media/stream processing
– Coarse grain ⇔ Fine grain: LUT of FPGAs
• Components
– ALU
– Shifter + Mask unit
– Multiplexers
– Registers
• Operations and interconnection between components are changeable
• No instruction fetch mechanism: a PE is a part of a large datapath
Chameleon CS2112 DPU
[Figure: DPU datapath (32bit/16bit). Components: Routing MUX × 2, Instruction, Register & Mask × 2, Barrel Shifter, OP, Register × 2.]
• OP: Operations in C or Verilog
• SIMD arrays and pipelines are formed with multiple DPUs.
Dynamic reconfiguration
• Compared with FPGAs, a coarse grain PE is area-efficient for media/stream processing.
→ However, the flexible part requires semiconductor area: not comparable with hardware accelerators
• But it is flexible!
→ Dynamic reconfiguration
By changing the hardware structure, the same semiconductor area can be used for multiple tasks.
Instructions/Configuration data delivery
[Figure: configuration data is delivered from the On-Chip Memory to the PEs.]
• Takes tens of microseconds
• PACT Xpp
• Elixent’s D-Fabrix
• Multiple tasks can be switched → High area efficiency
Xpp (PACT Informationstechnologie)
[Figure: PACs surrounded by I/O blocks, each PAC with its own CM, and an SCM at the center. A PAC consists of PAEs (ALU, RAM, I/O) and a Configuration Manager.]
• PAC: Processing Array Cluster
• CM: Configuration Manager
• SCM: Supervising CM
• Xpp64 (8x8 PAC) is available
• Configuration requires hundreds of clock cycles
• 24bits Data, 40MHz Clock
Elixent D-Fabrix
[Figure: D-Fabrix Processing Array: an array of 4bit ALUs and registers (R) connected through RAM based switch boxes, with RAM blocks (8bit address, 8bit data).]
Stretch S5 engine
[Figure: block diagram. Inst Cache, Data Cache, MMU, Inst Unit, Load/Store Unit; register files FR, AR and WR; FP Unit (FPU), Integer Unit (ALU) and Extension Unit (ISEF).]
Multicontext reconfiguration
• Multiple sets of configuration data can be switched within a clock cycle.
• Context memory is combined into PEs/switches.
• Fujitsu’s MPLD using ROMs (1990), WASMII using RAM (1992), Xilinx’s proposal (1997), NEC’s DRL (1998), Chameleon CS2112 (2000)
[Figure: logic cells (PE and switches) between input data and output data. A multiplexer selects one of contexts 1, 2, ..., n held in the SRAM slots of the context memory, according to the context pointer.]
Double buffering using multicontext devices
• Task is switched without overhead
[Figure: timeline. While Task N is executing, the configuration data of Task N+1 is loaded; while Task N+1 is executing, the configuration data of Task N+2 is loaded.]
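The effect of double buffering can be estimated with a small timing model. This is only a sketch under assumptions not stated in the lecture: two context slots, the configuration of the next task starts loading when the current task starts executing, and all cycle counts are made-up illustrative numbers.

```python
from typing import Sequence


def total_cycles(exec_cycles: Sequence[int], load_cycles: Sequence[int],
                 double_buffered: bool) -> int:
    """Cycles needed to run all tasks in order (toy model, hypothetical numbers)."""
    if len(exec_cycles) != len(load_cycles):
        raise ValueError("one load time is needed per task")
    if not exec_cycles:
        return 0
    if not double_buffered:
        # Single context: every configuration load stalls execution.
        return sum(exec_cycles) + sum(load_cycles)
    # The first configuration cannot be hidden behind any execution.
    total = load_cycles[0]
    for i, run in enumerate(exec_cycles):
        # The next configuration is loaded in the background while task i runs.
        next_load = load_cycles[i + 1] if i + 1 < len(load_cycles) else 0
        total += max(run, next_load)
    return total


if __name__ == "__main__":
    execs, loads = [100, 100, 100], [40, 40, 40]
    print("without double buffering:", total_cycles(execs, loads, False))
    print("with double buffering:   ", total_cycles(execs, loads, True))
```

In this model the switch is free of overhead only while loading a configuration takes no longer than executing the preceding task; otherwise the load time becomes the bottleneck.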
IPFlex’s DAPDNA-2
[Figure: block diagram.
 DAP (RISC) with Interrupt Controller, Timer, SROM IF, GPIO, UART, Serial IF
 DNA Matrix with DNA load buffer, DNA store buffer and DNA direct I/O (Async. in / Async. out)
 BSU, DMA Controller, DDR SDR IF (64bit 166MHz), PCI IF (32bit 66MHz)]
• DNA Matrix: 368 heterogeneous PEs (ALU, Memory, Delay)
Time-multiplexing execution of a single task
[Figure: the target hardware is divided and executed on the reconfigurable device in a time-multiplexed manner.]
• If the performance becomes 1/n, the performance/area is not increased.
• Even in the dedicated hardware, everything cannot be done in a single clock cycle. In this example, it takes 4 clock cycles.
• The dynamically reconfigurable processor requires 8 clock cycles
→ The performance/area is improved.
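The performance/area argument can be put into numbers. In the sketch below only the cycle counts (4 and 8) come from the slide; the area values are assumptions chosen for illustration (the time-multiplexed device is assumed to need one quarter of the area), and the function name is hypothetical.

```python
def perf_per_area_gain(dedicated_cycles: int, dedicated_area: float,
                       multiplexed_cycles: int, multiplexed_area: float) -> float:
    """Performance/area of the time-multiplexed design relative to dedicated hardware."""
    if min(dedicated_cycles, multiplexed_cycles) <= 0:
        raise ValueError("cycle counts must be positive")
    if min(dedicated_area, multiplexed_area) <= 0.0:
        raise ValueError("areas must be positive")
    performance_ratio = dedicated_cycles / multiplexed_cycles  # below 1 means slower
    area_ratio = dedicated_area / multiplexed_area              # above 1 means smaller
    return performance_ratio * area_ratio


if __name__ == "__main__":
    # Ideal single-cycle hardware split into n = 4 contexts: 1/4 speed, 1/4 area (assumed).
    print(perf_per_area_gain(1, 4.0, 4, 1.0))
    # Slide example: 4 cycles vs. 8 cycles, with the same assumed 1/4 area.
    print(perf_per_area_gain(4, 4.0, 8, 1.0))
```

Under these assumptions the first case gives no gain, while the second case gains because the dedicated hardware already needs several cycles, so time multiplexing costs less than a factor of n in speed.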
NEC Electronics’ DRP (Dynamically Reconfigurable Processor)
• Multicontext reconfiguration
– 16 contexts
– Controlled by an FSM (Finite State Machine)
– Background loading of configuration data
• 8x8 PEs + distributed memory modules → a Tile
• DRP-1 consists of 8 tiles → 512 PEs
• 8bit data width
• State transition/configuration is controlled per tile.
• A single task is executed with multiple contexts.
DRP-1
[Figure: DRP-1 chip layout showing the tiles with their Vmem and Hmem blocks.]
DRP Tile Structure
[Figure: an 8x8 array of PEs with the State Transition Controller at the center, VMEMs and VMEM ctrl units on both sides, and HMEMs at the top and bottom.
 HMEM (1-port memory): 8bit × 8192entry
 VMEM (2-port memory): 8bit × 256entry]
Context control with an FSM
[Figure: state transition diagram with states 0 to 5 between data input and data output, illustrating:
 1. Context switching
 2. Parallel processing in a context
 3. Sequential execution in a context]
• The DRP compiler automatically generates the diagram from a C-like language: BDL.
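FSM-based context control can be illustrated with a toy software model. This is a sketch only: the three contexts, the register names and the `run_contexts` function are hypothetical and do not reflect the actual DRP hardware or the output of its compiler. Each context is modelled as a function that updates registers and returns the number of the next context, which plays the role of the state transition.

```python
from typing import Callable, Dict, List, Optional, Tuple

Registers = Dict[str, int]
# A context updates the registers and returns the next context (None = finished).
Context = Callable[[Registers, List[int]], Optional[int]]


def ctx_init(regs: Registers, data: List[int]) -> Optional[int]:
    regs["i"] = 0
    regs["acc"] = 0
    return 1 if data else 2          # skip the loop when there is no input


def ctx_accumulate(regs: Registers, data: List[int]) -> Optional[int]:
    regs["acc"] += data[regs["i"]]
    regs["i"] += 1
    return 1 if regs["i"] < len(data) else 2   # stay in this context or move on


def ctx_output(regs: Registers, data: List[int]) -> Optional[int]:
    regs["out"] = regs["acc"]
    return None


def run_contexts(contexts: List[Context], data: List[int],
                 max_cycles: int = 1000) -> Tuple[Registers, List[int]]:
    """Run from context 0, switching contexts every cycle as the FSM dictates."""
    regs: Registers = {}
    trace: List[int] = []
    pointer: Optional[int] = 0
    while pointer is not None:
        if len(trace) >= max_cycles:
            raise RuntimeError("context FSM did not terminate")
        trace.append(pointer)
        pointer = contexts[pointer](regs, data)
    return regs, trace


if __name__ == "__main__":
    regs, trace = run_contexts([ctx_init, ctx_accumulate, ctx_output], [1, 2, 3])
    print("context trace:", trace, "result:", regs["out"])
```

Registers survive a context switch in this model, which mirrors the idea that a single task is executed across multiple contexts.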
IMEC ADRES
[Figure: ADRES architecture.
 VLIW view: Instruction Fetch, Instruction Dispatch, Instruction Decode, Data Cache, a shared RF and a row of 8 FUs
 Reconfigurable array view: 56 further FUs (7 rows of 8), each paired with a local RF]
Rapport Kilocore
[Figure: a fabric of 16 PEs × 16 PEs organized as stripes of PEs alternating with interconnects, with an Input Controller, a Configuration Controller and an Output Controller. Bus widths labeled in the figure: 32bits, 672 bits, 128bits.]
Hitachi’s FE-GA
[Figure: block diagram. Computational cell array of ALU and MLT cells; Load/Store cells (LS); Crossbar switch; Local memory (MEM); Sequence manager (Interrupt/DMA request); Configuration Manager; I/O port; Bus interface.]
Dynamically Reconfigurable Processors
Product          Vendor     Conf.        Data Width (bits)  PE
Xpp-64           PACT       Delivery     24                 Homo
D-Fabrix         Elixent    Delivery     4                  Homo
S5 engine        Stretch    Delivery     4/8                Hetero
PCA-2            NTT        Delivery     9                  Homo
CS2112           Chameleon  Multi-c(8)   16/32              Hetero
DAPDNA-2         IPFlex     Multi-c(4)   32                 Hetero
DRP-1            NECEL      Multi-c(16)  8                  Homo
Kilocore         Rapport    Multi-c      8                  Homo
ADRES            IMEC       Multi-c(32)  16                 Homo
FE-GA            Hitachi    Multi-c(4)   16                 Hetero
Cluster machine  Fujitsu    Multi-c      16                 Hetero
(Conf.: Delivery = configuration data delivery, Multi-c(n) = multicontext with n contexts; PE: homogeneous or heterogeneous)
Dynamically reconfigurable Processors
[Figure: design space of processor arrays.
 Axis labels: Gates Number (10, 100, 1000, 10K, 100K, 1M, 10M), Number of nodes (1, 10, 100, 1000), Cost, Time-multiplexing (3, 8, 16, Many)
 Node granularity: 4/5-input LUT, 8bit ALU/registers, 32bit ALU/registers, Simple RISC, VLIW, Superscalar
 Devices shown: FPGA, DRL, PARS, Kilocore, DRP-1, ADRES, CS2112, DAPDNA-2, rDSP, PC101, ACM, Chip-Multiprocessor]
C-level design (DRP)
• Behavioral Description Language (BDL): C-like
– Bit width, Pragma
– Pointer use is limited.
• Functional synthesis: FSM and Datapath are generated.
– Synthesis tools for ASIC can be used.
• Mapping: FSM → STC, Datapath → PE array
• Place & Routing
• Configuration data generation
[Figure: design flow. C Source Code → High Level Synthesis (FSM, Datapath) → Technology Mapper → Place & Router → Code Generation → Object Code]
BDL code example
mem(0:16) d0[8], d1[8], d2[8], d3[8], d4[8], d5[8], d6[8], d7[8];
void row() {
    ter(0:16) SUMT0, SUMT1, SUMT2, SUMT3;
    reg(0:16) SUB0, SUB1, SUB2, SUB3;
    ter(0:16) z0, z1, z2, z3, z4, z5, z6, z7;
    reg(0:8) i=0;
    $
    for(; i < 8; i++) {
        d0[i], d1[i], d2[i], d3[i], d4[i], d5[i], d6[i], d7[i];
        $
        SUMT0 = d0[i] + d7[i]; SUB0 = d0[i] - d7[i];
        SUMT1 = d1[i] + d6[i]; SUB1 = d1[i] - d6[i];
        . . . . .
        z0 = A * SUMT0 + A * SUMT1 + A * SUMT2 + A * SUMT3;
        z2 = B * SUMT0 + C * SUMT1 - C * SUMT2 - B * SUMT3;
        . . . . .
        $
        z1 = D * SUB0 + E * SUB1 + F * SUB2 + G * SUB3;
        z3 = E * SUB0 - G * SUB1 - D * SUB2 - F * SUB3;
        . . . . .
        $
    }
}
• mem(0:16): 16bit memory, allocated to VMEM
• ter / reg: Terminals & Registers
• $: Delimiter for the state/context
• d0[i], d1[i], ... : Memory access by giving an address
• Terminals must be used in the assigned state/context
• Registers can be used in the next states/contexts
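The arithmetic in row() (sums and differences of mirrored inputs, followed by multiply-accumulate with constants) has the shape of the butterfly decomposition of an 8-point DCT. The slide does not define A to G, so the cosine values below are an assumption; with them, z0 to z3 written as sums of products match the unnormalized DCT-II outputs 0 to 3. Only the outputs that appear on the slide are modelled, and the Python function names are hypothetical.

```python
import math
from typing import List, Tuple


def _c(k: int) -> float:
    return math.cos(k * math.pi / 16)


# Assumed constants (not given in the lecture): unnormalized DCT-II cosines.
A = 1.0
B, C = _c(2), _c(6)
D, E, F, G = _c(1), _c(3), _c(5), _c(7)


def row_butterfly(x: List[float]) -> Tuple[float, float, float, float]:
    """z0..z3 computed like the BDL example: butterflies first, then products."""
    sumt = [x[n] + x[7 - n] for n in range(4)]   # SUMT0..SUMT3
    sub = [x[n] - x[7 - n] for n in range(4)]    # SUB0..SUB3
    z0 = A * sumt[0] + A * sumt[1] + A * sumt[2] + A * sumt[3]
    z2 = B * sumt[0] + C * sumt[1] - C * sumt[2] - B * sumt[3]
    z1 = D * sub[0] + E * sub[1] + F * sub[2] + G * sub[3]
    z3 = E * sub[0] - G * sub[1] - D * sub[2] - F * sub[3]
    return z0, z1, z2, z3


def dct_direct(x: List[float], k: int) -> float:
    """Reference: k-th output of the unnormalized 8-point DCT-II."""
    return sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8))


if __name__ == "__main__":
    sample = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
    for k, z in enumerate(row_butterfly(sample)):
        print(f"z{k}: butterfly={z:.6f}  direct={dct_direct(sample, k):.6f}")
```

The butterfly form needs 4 multiplications per output instead of 8, which is one reason this structure maps well onto a PE array.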
Various embedded architectures
[Figure: design space ranging from "High performance for narrow application field" to "High performance for wide application field", annotated with trends.]
• Special purpose processor / Dedicated hardware: DSP, Stream processor, Graphic processor, Network processor
• Programmable hardware: FPGA, Reconfigurable systems, Dynamically Reconfigurable Processors → Becoming mainstream?
• Multiple cores: Tile Processor, Homogeneous Chip-multiprocessor, Heterogeneous Multiprocessor → Next to become mainstream
• Configurable Processor: special instructions → Now becoming mainstream
• General purpose CPU
Glossary
• This time, most of the terms have already appeared, and most of them are used in Japanese as they are.
• Tile Processor: タイルプロセッサ
• Dynamically reconfigurable processor: 動的再構成可能(リコンフィギャラブル)プロセッサ
• FSM (Finite State Machine): 有限状態マシン
• Multicontext: マルチコンテキスト型 (possibly also written マルチコンテクスト)
• Functional Synthesis: 機能合成
• Time multiplexed Execution: 時分割多重実行
Exercise
• Assume that the dynamically reconfigurable processor executes 1000 times faster than the host processor.
• Compute the total speedup when it can be used for 90% of the total task.