System LSI and Architecture Technology (part III: Inter-chip

Reconfigurable Architectures
AMANO, Hideharu
hunga@am.ics.keio.ac.jp
Reconfigurable System
(Custom Computing Machine)

A target algorithm is executed directly in hardware on SRAM-based FPGAs/PLDs.

The high performance of special-purpose machines.
The high flexibility of general-purpose machines.
An execution mechanism completely different from that of stored-program computers.
PLD(Programmable Logic Device)


Integrated Circuit whose logic function can be
defined by users.
Standard IC,ASIC(Application Specific IC)
SPLD (Simple PLD) / PLA (Programmable Logic Array): small-scale IC with an AND-OR array
CPLD (Complex PLD): medium-scale IC with AND-OR arrays
FPGA (Field Programmable Gate Array): large-scale IC with LUTs
Caution! Terms are not well defined!
Rapid development of PLDs
Increasing performance from 1991 to 2000: gate count x45, speed x12, cost 1/100.
[Figure: gate count (10K to 10M) versus year (1980 to 2000). Device families plotted: fuse PLA, EEPROM SPLD, CPLD, anti-fuse FPGA, SRAM FPGA. Trends noted: hierarchical structure, embedded cores, low voltage.]
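SPLD (Simple PLD: AND-OR / product-term)
[Figure: inputs pass through NOT gates into an AND array, whose outputs feed an OR array.]
Arbitrary logic is realized by changing the AND-OR connections.
AND/OR connection example
[Figure: with inputs A, B, C, D, the AND terms A&B and C&D are connected to one OR gate, giving A&B | C&D.]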
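The connection example above can be sketched in Python. This is only an illustrative model, not a description of any real device: the names `PRODUCT_TERMS`, `and_or_array`, and `truth_table` are hypothetical, and each product term is written as a list of (input, polarity) pairs, where a polarity of False stands for the inverted input from the NOT gate.

```python
from itertools import product

# Hypothetical "programming" of the AND array: one list per AND term.
# (name, True) connects the input itself, (name, False) its inverted form.
PRODUCT_TERMS = [
    [("A", True), ("B", True)],   # A & B
    [("C", True), ("D", True)],   # C & D
]


def and_or_array(inputs, product_terms):
    """Return the OR of all programmed AND terms for one input assignment."""
    def term_value(term):
        return all(inputs[name] == polarity for name, polarity in term)
    return any(term_value(term) for term in product_terms)


def truth_table(product_terms, names=("A", "B", "C", "D")):
    """Enumerate every input combination together with the array output."""
    rows = []
    for bits in product([False, True], repeat=len(names)):
        rows.append((bits, and_or_array(dict(zip(names, bits)), product_terms)))
    return rows
```

Changing the function only means editing `PRODUCT_TERMS`, which mirrors how the AND-OR connections are reprogrammed while the gates themselves stay fixed.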
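LUT: Look Up Table
[Figure: a ROM/RAM whose address lines are the logic inputs and whose data output is the logic output.]
A simple ROM/RAM can be used as random logic.

ABC  Z
000  0
001  0
010  0
011  1
100  0
101  0
110  0
111  1

[Figure: the eight stored Z values are selected by multiplexers controlled by A, B, and C.]
A combination of memory and multiplexers is commonly used.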
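The table above (Z = 1 only for ABC = 011 and 111) can be reproduced with a short Python sketch. The helper names `make_lut` and `lut_read` are hypothetical; the sketch assumes that A is the most significant address bit, as in the table.

```python
def make_lut(func, n_inputs):
    """Fill a 2**n-entry table by evaluating func on every input combination."""
    table = []
    for address in range(2 ** n_inputs):
        # Bit k of the argument list is taken MSB first, so A is the top bit.
        bits = [(address >> (n_inputs - 1 - k)) & 1 for k in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_read(table, *bits):
    """Use the input bits as a memory address and return the stored output."""
    address = 0
    for bit in bits:
        address = (address << 1) | (bit & 1)
    return table[address]


# The function stored in the slide's table: Z = B & C (A is ignored).
Z_TABLE = make_lut(lambda a, b, c: b & c, 3)
```

Any other 3-input function is obtained by storing different table contents; the read path does not change.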
An example using LUT: Look Up Table
[Figure: the same 3-input table (Z = 1 for ABC = 011 and 111) read through the multiplexers for one example input, producing Z = 1.]
Device for flexibility(1)

Anti-fuse type




Programmed by breaking down the insulation with a high voltage
High speed but one-time programmable
Actel, QuickLogic
EEPROM / Flash-ROM

Switches for connections are realized by floating gates.
Re-programmable
Lattice, Altera's MAX series
Device for flexibility(2)

SRAM







Data in SRAM represents the look-up tables and the wire connections.
ISP (In-System Programming) is available.
The configuration data is lost when the power is turned off.
Suitable for large-scale FPGAs; advancing rapidly in recent years.
Xilinx XC, Altera FLEX, Lucent ORCA
The advanced series: Xilinx Virtex, Altera APEX
Others

Magnetic memory
DRAM
AND-OR array vs. LUT

AND-OR array(product-term)




Efficient for logic with multiple outputs
There is a type of logic which cannot be realized.
Suitable for EEPROM and Flash-ROM
LUT



Any logic can be realized.
Efficient for logic with a single output
Suitable for Flash-ROM, Anti-fuse, and SRAM.
Sequential circuits
[Figure: the outputs of the AND-OR array or LUT drive the D inputs of flip-flops; the Q outputs go to the output module and are fed back to the inputs of the array.]
Sequential circuits (state machines) can be built by attaching flip-flops and feedback loops.
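As a minimal sketch of this structure, the following Python model uses a lookup table for the combinational part and a variable for the flip-flops. The 2-bit counter with an enable input is an assumed example, and the names `NEXT_STATE` and `run` are hypothetical.

```python
# Combinational part (AND-OR array or LUT): next state of a 2-bit counter.
# The table is addressed by (state << 1) | enable, like a small ROM.
NEXT_STATE = [0] * 8
for state in range(4):
    NEXT_STATE[(state << 1) | 0] = state            # enable = 0: hold
    NEXT_STATE[(state << 1) | 1] = (state + 1) % 4  # enable = 1: count up


def run(enables, state=0):
    """Clock the circuit once per input value and return (outputs, final state)."""
    outputs = []
    for enable in enables:
        outputs.append(state)                           # Q of the flip-flops
        state = NEXT_STATE[(state << 1) | (enable & 1)]  # D latched at the clock edge
    return outputs, state
```

The feedback loop is the reuse of `state` as part of the table address on the next clock.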
CPLD (Complex PLD)
[Figure: a matrix of SPLDs connected through a programmable switch. Example: Altera's MAX.]
FPGA (Field Programmable Gate Array)
[Figure: island-style 2-dimensional array of Configurable Logic Blocks (LUT and F.F.), with connection blocks, switch blocks, and IOBs.]
The LUT contents and the interconnections are determined by the configuration data.
Architectures and devices
Anti-fuse (FPGA): high speed, medium size, one-time programmable. Actel, QuickLogic
EEPROM / Flash-ROM (SPLD, CPLD): high speed, small/medium size, re-programmable, predictable delay. Lattice, Altera, Xilinx
SRAM (FPGA): large scale, rapid development. Xilinx, Altera
Recent PLDs

High-end: a large scale chip with hierarchical
structure:





System on Programmable Device
Providing DLL, CPU, DSP, ROM, RAM, multipliers, high-speed links, and other hard IPs.
Xilinx's Virtex-4/EX, FX, Altera's Stratix-3
Specialized for mass-production


Xilinx's Virtex II, Virtex-4/LX, Altera's Stratix-3
Low cost: Xilinx's Spartan, Altera's Cyclone
Low voltage, Multiple voltages, and Low power
consumption
Process and parameters (Xilinx)
Process  Product        Device       LUTs (unless noted)   Supply voltage
350nm    XC4000         XC4085XLA    7448                  3.3V
250nm    XC4000         XC40250XV    20102                 2.5V
220nm    Virtex         XCV1000      27648                 2.5V
180nm    Virtex-E       XCV2000E     43200                 1.8V
150nm    Virtex-II      XC2V8000     104882                1.5V
130nm    Virtex-II Pro  XC2VP125     125136                1.5V
90nm     Virtex-4       XC4VLX200    200488                1.2V
65nm     Virtex-5       XC5VLX330    51840 slices          1.0V
40nm     Virtex-6       XC6VLX760    118560 slices         1.0V
28nm     Virtex-7       XC7VX1140T   1139200 logic cells   0.9V
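Xilinx Virtex II
[Figure: a slice contains two LUTs, each followed by carry logic and a D flip-flop.]
Slice x 2 → CLB (Configurable Logic Block)
[Figure: chip floorplan with global clock MUX, DCM, IOBs, slices, configurable logic, RAM, multipliers, and programmable I/Os. Labels: 100000 CLBs, 3Mbit.]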
Altera Stratix II
[Figure: chip floorplan with DSP blocks, PLLs, Mega RAM blocks, M4K RAM blocks, M512 RAM blocks, and hierarchical interconnect.]
LAB (Logic Array Block): consists of 10 LEs (4-input LUT and F.F.)
SoPD (System on Programmable Device)
[Figure: Xilinx Virtex-II Pro with PowerPC cores, multipliers, block RAM, CLBs, DCMs, and Rocket I/O multi-gigabit transceivers.]
Various kinds of cores are embedded in an FPGA.
FPGA vs. ASIC[Kuon:FPGA2006]

Pure FPGA without hard macros




Area:40X
Speed:1/3.2X
Power: 12X
FPGA with hard macros



Area: 21X
Speed: 1/2.1X
Power: 9X
Spartan-3 Power Consumption [tuan2006]
Dynamic power: about 200mW (3S1000), consumed by clock, logic, and routing.
Static power: about 60mW (3S1000), consumed by logic, routing, and configuration SRAM.
Recent technologies and products
Process generations shown: 90nm, 65nm/60nm, 45nm/40nm.
High-end (x1.5 to x2.5 per generation):
  Xilinx: Virtex-4 LX/FX/SX (200000 LC) → Virtex-5 LX/LXT/SXT/FXT/TXT (330000 LC) → Virtex-6 LXT/SXT/HXT/CXT (760000 LC)
  Altera: Stratix-II/GX (179400 LE) → Stratix-III/L/E (338000 LE) → Stratix-IV/E/GX/GT (531200 LE)
Low-cost:
  Xilinx: Extended Spartan-3A N/DSP (53000 LC) → Spartan-6 LX/LXT (150000 LC)
  Altera: Cyclone II (68416 LE) → Cyclone III/LS (119088 LE) → Cyclone IV/E/GX (149760 LE)
High-end / low-cost: x3 to x5
Slice structure of Virtex-6
[Figure (Virtex-6 manual): a slice holds four 6-input LUTs. Each LUT can be used as one 6-input function (6in x 1) or two 5-input functions (5in x 2), and is followed by carry logic, MUXes, and two flip-flops.]
Virtex-6 CLBs
[Figure (Virtex-6 manual): CLBs containing slices addressed by coordinates (X0Y0, X1Y0, X2Y0, X3Y0, X1Y1, X2Y1, X3Y1); the carry chain runs from the COUT of one slice to the CIN of the next.]
Stratix-IV ALM Structure
[Figure: two groups of 4bit data inputs feed two LUTs (up to 6 inputs), followed by adder1 and adder2 with carry, reg_carry, and shared_arith chains, MUXes, and two flip-flops.]
Possible LUT configurations:
4-in LUT x 2
5-in LUT + 3-in LUT
5-in LUT + 4-in LUT (1 input shared)
5-in LUT + 5-in LUT (2 inputs shared)
6-in LUT
6-in LUT + 6-in LUT (4 inputs shared)
Stratix-IV LAB structure
[Figure: a LAB and an MLAB, each consisting of ALMs with a local interconnect.]
Power Gating for Spartan-3
[Figure: the FPGA core is divided into tiles. Each tile (CLB, interconnect switch matrix, and their configuration SRAMs) is connected to a virtual ground through a power gate.]
Low-power FPGAs

Actel: ProASIC 3/E → IGLOO
Flash ROM
IGLOO: with ARM core
Flash freeze: low-power stand-by mode (2μW)
SiliconBlue iCE65 series
Embedded Flash memory (NVCM)
5mA (1792 cells, 32MHz)
9mA (3520 cells, 32MHz)
Altera Arria, Arria-II
Low-power mid-range
8-input LUT
[Figure: example devices: QuickLogic, Lattice GAL, Altera FLEX10K, Xilinx Virtex]
Design of PLDs

Mostly designed with common HDLs (Verilog-HDL, VHDL).

C-level entry has come into use recently: Handel-C (Celoxica)
Synthesis, optimization, and place and route are done automatically by vendors' tools.

Combinations of tools from various vendors are now commonly used.
For a large circuit, a long time is required, especially for place and route.
When using IPs, clock/DLL adjustment is done manually.
Optimization techniques differ between vendors and products.
Reconfigurable System
(Custom Computing Machine)

A target algorithm is executed directly in hardware on SRAM-based FPGAs/PLDs.

The high performance of special-purpose machines.
The high flexibility of general-purpose machines.
An execution mechanism completely different from that of stored-program computers.
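[Figure: performance versus flexibility. ASICs are at the high-performance end; CPUs running software (for example a loop such as "for i=0; i<K; i++ X[i]=X[i+j]") are at the high-flexibility end. Reconfigurable systems on FPGAs, which can be loaded with Design A, B, C, or D, aim at both high performance and flexibility.]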
How is the performance enhanced?

Performance enhancement by hardware execution itself, which removes:

The overhead of software execution (instruction fetch, data load to registers, etc.)
The overhead of using fixed-size data.
The overhead of using only two-way branches.
However, these benefits are not so large, because embedded CPUs and DSPs are highly optimized.
The key to performance improvement is parallel processing.
Parallel processing in reconfigurable
systems

Various techniques can be used





SIMD execution
Pipelined structure
Systolic algorithm
Data driven control
Parallel execution other than calculation


Parallel data access using internal memory units
Parallel data transfer including I/O accesses
SIMD (Single Instruction-stream/
Multiple Data-stream)-like calculation
The same instruction is applied to different data streams.
In reconfigurable systems, the operations are not required to be the same (SIMD-like calculation).
[Figure: stream data in → processing parts with internal memory modules → stream data out.]
Pipelined structure
The stream is divided and inserted periodically.
[Figure: stream data items 1 to 5 move through a chain of processing parts, each with an internal memory module.]
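Systolic Algorithm
[Figure: data streams x and y enter a computational array.]
Data streams x and y are inserted at a fixed interval.
When two streams meet, a calculation is executed.
→ Systolic: the beat of the heart
Band matrix multiply y=Ax
\[
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & 0 & 0 \\
a_{21} & a_{22} & a_{23} & 0 \\
0 & a_{32} & a_{33} & a_{34} \\
0 & 0 & a_{43} & a_{44}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
\]
Each cell receives a, x, and y_i, and computes
\[
y_o = a x + y_i
\]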
Band matrix multiply y=Ax (step by step)
[Figure sequence: the elements of A enter the cells diagonal by diagonal while x1, x2, x3 move through the array. The partial results at each step are:]
Step 1: x1 enters the array; a11 is the first coefficient to arrive.
Step 2: y1 = a11 x1
Step 3: y1 = a11 x1 + a12 x2, y2 = a21 x1
Step 4: y2 = a21 x1 + a22 x2
Step 5: y2 = a21 x1 + a22 x2 + a23 x3, y3 = a32 x2
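The data movement in the band matrix multiply can be imitated with a small cycle-by-cycle simulation. This is a sketch under stated assumptions rather than the exact array drawn on the slides: x moves left to right, y moves right to left, new elements enter on every other cycle, indices are 0-based, and the function name `systolic_band_matvec` is hypothetical.

```python
def systolic_band_matvec(A, x, lower, upper):
    """Multiply a band matrix A (dense list of lists) by x on a simulated linear array.

    lower / upper are the numbers of sub- and super-diagonals.
    Cell k handles the diagonal with offset upper - k, i.e. j = i + upper - k.
    """
    n = len(x)
    width = lower + upper + 1      # one cell per diagonal of the band
    x_reg = [None] * width         # (j, x_j) tokens, moving left to right
    y_reg = [None] * width         # (i, partial y_i) tokens, moving right to left
    # Start offsets chosen so that x_j and y_i meet in cell upper - (j - i).
    x_start = max(0, lower - upper)
    y_start = x_start + upper - lower
    y = [None] * n
    finished = 0
    t = 0
    while finished < n:
        # 1. A completed y_i leaves the left-most cell.
        if y_reg[0] is not None:
            i, value = y_reg[0]
            y[i] = value
            finished += 1
        # 2. Shift: y one cell to the left, x one cell to the right.
        y_reg = y_reg[1:] + [None]
        x_reg = [None] + x_reg[:-1]
        # 3. New elements are inserted on every other cycle.
        if t >= x_start and (t - x_start) % 2 == 0 and (t - x_start) // 2 < n:
            j = (t - x_start) // 2
            x_reg[0] = (j, x[j])
        if t >= y_start and (t - y_start) % 2 == 0 and (t - y_start) // 2 < n:
            i = (t - y_start) // 2
            y_reg[width - 1] = (i, 0)
        # 4. Where an x and a y meet, the cell computes y_o = a * x + y_i.
        for k in range(width):
            if x_reg[k] is not None and y_reg[k] is not None:
                j, x_value = x_reg[k]
                i, y_value = y_reg[k]
                y_reg[k] = (i, y_value + A[i][j] * x_value)
        t += 1
    return y
```

For a tri-diagonal matrix, call it with `lower=1, upper=1`; the first few cycles then produce the same partial sums as the steps listed above (with indices shifted by one).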
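Data flow algorithm
[Figure: data flow graph for (a+b)x(c+(dxe)): a and b feed an adder; d and e feed a multiplier whose result is added to c; the two sums feed the final multiplier.]
A node is activated when its tokens (data) become available.
The overhead of synchronization is large.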
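The firing rule can be sketched in Python for the expression (a+b)x(c+(dxe)). The node names n1 to n4, the `GRAPH` layout, and `run_dataflow` are hypothetical; the sketch simply fires every node whose two input tokens are present and records which nodes fire together.

```python
import operator

# Each node: (operation, left input, right input). Inputs are named arcs.
GRAPH = {
    "n1": (operator.add, "a", "b"),     # a + b
    "n2": (operator.mul, "d", "e"),     # d x e
    "n3": (operator.add, "c", "n2"),    # c + (d x e)
    "n4": (operator.mul, "n1", "n3"),   # (a + b) x (c + (d x e))
}


def run_dataflow(graph, inputs):
    """Fire nodes as soon as both input tokens exist; return tokens and firing steps."""
    tokens = dict(inputs)      # arcs that currently carry a token
    pending = dict(graph)      # nodes that have not fired yet
    steps = []
    while pending:
        ready = [name for name, (_, left, right) in pending.items()
                 if left in tokens and right in tokens]
        if not ready:
            raise ValueError("some tokens never arrive")
        results = {}
        for name in ready:
            op, left, right = pending[name]
            results[name] = op(tokens[left], tokens[right])
        tokens.update(results)  # results become tokens for the next step
        for name in ready:
            del pending[name]
        steps.append(sorted(ready))
    return tokens, steps
```

Nodes listed in the same step have no dependence on each other and could run in parallel in hardware; the check for available tokens is the synchronization whose cost the slide points out.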
Data flow analysis and hardware generation
[Figure: data flow language → data flow graph → graph decomposition → HDL description → configuration data.]
Suitable for automatic generation of hardware
Applications



No flexible program change
No IEEE standard floating point
Not memory bounded








Image processing, analysis, pattern matching
Logic simulation, fault simulation
Neural network simulation
Encryption / decryption
Queuing models, Markov analysis
Electric power flow
Sensor processing
Efficient use of on-the-fly processing:


Communication control, protocol control
Software radio
Large Scale Reconfigurable Systems
Stand-alone: SPLASH, RASH, BEE2
[Figure: reconfigurable units (RU) with a μP, connected by an interconnection network / shared memory.]
Hetero nodes using homo cores: SRC6, SGI RASC
[Figure: nodes of μPs and nodes of RUs connected by an interconnection network / shared memory.]
Homo nodes using hetero cores: Cray XD-1, XT4 (XR-1)
[Figure: nodes that each combine μPs and RUs, connected by an interconnection network / shared memory.]
Splash-2 (Arnold et al. 92)

String matching, image processing, DNA matching: 330 times faster than the supercomputer Cray-II.
Systolic algorithm
VHDL, Parallel C
Annapolis Micro Systems (WILDFIRE)
RM-IV (Kobe Univ.)
[Figure: multiple FPGAs, each with memory (mem.), connected through an FPIC and an interface.]
Shared memory with multiple FPGAs
RASH (Mitsubishi)
Six boards make up one system unit.
[Figure: EXE board with FPGAs connected by a mesh and buses, an EXE-board controller, PCI bus I/F, local bus, 2 clock lines and control signals, a large SRAM (2MB), and a DRAM daughter board.]
FPGA: Altera FLEX10K100A (62K-158K gates)
→ Changed to Stratix, then to Virtex II
ATTRACTOR (NTT)
Combination of various units
[Figure: boards on Compact PCI carrying FPGAs, RAM (LUT), RISC processors, ATM I/O, an ATM switch, buffers, an MPU with memory, and Ethernet, linked by high-speed serial links (1Gbps).]
Special-purpose system for an ATM cut-through router
CRAY-XD1:
• AMD Opteron
• One board consists of 2 CPUs + FPGA (Virtex II Pro)
• One rack holds 6 boards
• A high-speed network called RapidArray is used
• FPGAs can be interconnected with Rocket I/O
SGI RASC
•Accelerator for SGI’s NUMA Altix
•Virtex II XC2V6000 and another Virtex for control
•Directly connected into the controller with NUMAlink4
ReCSiP (Keio Univ.)
Accelerator for bioinformatics
Powerful facility for simultaneous access to external RAMs
[Figure: ReCSiP board: Virtex-II XC2V6000 with 64MB SDRAM, 4MB SSRAM, and a local clock generator on a 64bit local bus; QuickPCI to a 64bit/66MHz PCI bus; configuration via USB, configuration control via PCI.]

ReCSiP-2
[Figure: Virtex-II Pro (XC2VP70) with QDR-SRAM (4MB x 4), DDR-SDRAM SO-DIMM, and PCI-IF (QL5064).]
ReCSiP

Automatic generation of solvers
[Figure: CAD tools: an SBML description and a solver library pass through the optimizer and scheduler to give a solver set; HDL for the solvers and for the control between solvers is generated for the FPGA board.]
Dynamically Reconfigurable Processors



Coarse grain structure
Parallel processing →Reconfigurable Processor
Array
Dedicated for stream processing


High speed dynamic reconfiguration




Distributed memory
Multicontext
Multicast/Broadcast of configuration data inside the chip
On-line Configuration
C-base design
Short history of Dynamically Reconfigurable Processors
[Figure: timeline from 1990 to 2005, from the 1st generation to the 2nd generation.]
FPGA with dynamic reconfiguration: MPLD (Fujitsu), WASMII (Keio), Time Multiplexed FPGA (Xilinx), DRL (NEC)
Processor with reconfigurable instructions: GARP (UCB), CHIMAERA (Northwestern Univ.), DISC (Brigham Young Univ.), S-5 (Stretch), S-6 (Stretch)
A lot of commercial systems: DFabric (Elixent), DAPDNA/2 (IPFlex), DAPDNA/IMX (IPFlex), Xpp (PACT), CS2112 (Chameleon), FE-GA (Hitachi), DRP (NEC elec.), X-bridge (NEC elec.), Kilocore (Rapport)
Others: PipeRench (CMU)
Dynamically Reconfigurable processors
Product         Vendor           Context      Data (bit)  PE
D-Fabric        Panasonic        Deliver      4           Homo
Xpp             PACT             Deliver      24          Homo
S5/S6 engine    Stretch          Deliver      4/8         Hetero
CS2112          Chameleon        Multi-C(8)   16/32       Homo
DAPDNA-2        IPFlex           Multi-C(4)   32          Hetero
DRP-1           NEC electronics  Multi-C(16)  8           Homo
X-bridge        NEC electronics  Multi-C(32)  8           Homo
Kilocore        Rapport          Multi-C      8           Homo
ADRES           IMEC             Multi-C(32)  16          Homo
FE-GA           Hitachi          Multi-C      16          Hetero
For Car-tuners  SANYO            Multi-C(4)   24          Homo
Cluster         Fujitsu          Multi-C      16          Hetero
Coarse Grain Structure of PE
[Figure: PE structures of Kress Array II and Chameleon CS2112. Blocks shown: routing MUXes, instruction, register & mask units, barrel shifter, OP, and registers.]
Xpp (PACT Informationstechnologie)
[Figure: four PACs, each with a CM and I/O, and a central SCM; each PAC is an array of PAEs with a configuration controller.]
PAC: Processing Array Cluster
CM: Configuration Manager
SCM: Supervising CM
Xpp64 (8x8 PAC) is available.
Configuration requires hundreds of clock cycles.
The PAE is 24 bits wide; the clock frequency is 40MHz.
Panasonic (Elixent) DFA1000
[Figure: a 2-dimensional array of 4bit ALUs and registers (R), connected through RAM-based switch boxes.]
Multicontext structure
A PE provides multiple sets of configuration RAM.
Context switching can be done in one clock cycle.
[Figure: a context memory with SRAM slots 1 to n; the context pointer drives a multiplexer that selects one slot as the configuration of the logic cells (PEs or switches) between input data and output data.]
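A minimal Python model of this idea is shown below. It is a sketch only: the class name `MulticontextPE` and its methods are hypothetical, and each context is reduced to a single two-operand operation, whereas a real context also holds the interconnection settings.

```python
import operator


class MulticontextPE:
    """A PE whose function is chosen by a context pointer into a context memory."""

    def __init__(self, contexts):
        self.contexts = list(contexts)  # context memory: one configuration per slot
        self.pointer = 0                # context pointer: selects the active slot

    def switch(self, index):
        """Change the active context; no configuration data is reloaded."""
        if not 0 <= index < len(self.contexts):
            raise IndexError("no such context")
        self.pointer = index

    def clock(self, a, b):
        """Process one pair of input data with the currently selected context."""
        return self.contexts[self.pointer](a, b)


pe = MulticontextPE([operator.add, operator.sub, operator.mul])
```

Switching is cheap because all configurations are already stored inside the PE; only the pointer changes.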
Chameleon CS2112
[Figure: RISC core, memory controller (64-bit memory bus), PCI controller (32-bit PCI bus), configuration subsystem, and DMA subsystem on a 128-bit RoadRunner bus, connected to the Reconfigurable Processing Fabric with 160-pin programmable I/O.]
Reconfigurable Processing Fabric
Eight instructions stored in the CTL are executed in the DPU.
The CTL can select the next instruction in the same cycle.
The configuration can be changed by loading a bit stream.
[Figure: Slice 0 to Slice 3, each divided into tiles containing DPUs, LMs, and CTLs.]
108 DPUs (Data Path Units) are organized into 4 slices of 3 tiles each.
1 tile: 9 DPUs = 32bit ALU x 7 + 16bit x 16bit multiplier x 2
IPFlex DAP/DNA-2
[Figure: DAP (RISC) and the DNA matrix with DNA load/store buffers and DNA direct I/O (async. in/out); peripherals: DDR SDR IF (64bit 166MHz), PCI IF (32bit 66MHz), interrupt controller, timer, SROM IF, GPIO, UART, serial IF, BSU, and DMA controller.]
DNA matrix: 368 PEs (ALU, memory, delay, etc.), heterogeneous
An example PE structure
[Figure: flip-flops feeding shift/mask units, followed by flip-flops, ALUs, and output flip-flops.]
DRP (Dynamically Reconfigurable Processor: NEC)
DRP Tile and PE structure
[Figure: a tile is an array of PEs with a State Transition Controller, VMEMs with VMEM controllers at the sides, and HMEMs at the edges.]
HMEM (1-port memory): 8bit × 8192 entries
VMEM (2-port memory): 8bit × 256 entries
Context control for DRP
[Figure: data input passes through contexts 0 to 5 to data output.]
1. Context switching
2. Parallel processing in a context
3. Serial execution in a context
Description in BDL
The DRP compiler controls the 3-dimensional assignment.
Main Advantage:
Low power consumption
Why low power?
1. No redundant hardware
• There are no instruction fetch mechanisms, caches, TLBs, etc.
→ Of course, it cannot be a general-purpose engine, but it is enough for an accelerator.
• A bare datapath works only for computation.
2. Parallel execution with a number of PEs
• A much lower clock frequency can be used to achieve the same performance as other architectures.
• The main problem is leakage power, but it can be suppressed by power gating techniques.
10X more energy efficient than DSPs.
5-50X compared with FPGAs.
Sometimes similar to hardwired logic.
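The effect of the lower clock frequency can be illustrated with the usual first-order model of CMOS dynamic power. This is a sketch under assumptions that are not on the slide: the switching activity \alpha and capacitance C are the same for every PE, N PEs running at f/N give the same throughput as one unit at f, and V'_{dd} denotes the (possibly reduced) supply voltage that suffices at the lower frequency.

\[
P_{\mathrm{single}} = \alpha C V_{dd}^{2} f
\]
\[
P_{\mathrm{parallel}} = N \cdot \alpha C V_{dd}'^{2} \cdot \frac{f}{N} = \alpha C V_{dd}'^{2} f
\]
\[
\frac{P_{\mathrm{parallel}}}{P_{\mathrm{single}}} = \left( \frac{V_{dd}'}{V_{dd}} \right)^{2}
\]

Under this model the dynamic power falls only to the extent that the lower frequency allows a lower supply voltage, while leakage grows with the number of PEs, which is consistent with the remark that leakage is the main problem.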
The main limitations as an accelerator in SoCs

The data must be stored in the memory modules placed around the PE array.
• If the data is larger than the memory, it is hard to handle.

If more contexts are required than the context memory can hold, the operational speed degrades considerably.
• The virtual hardware mechanism is provided, but it has certain limitations.

The performance does not improve much for problems without parallelism.
Dynamically Reconfigurable Processors: The 2nd generation

Customized for a specific target application area
• SANYO car tuner → Tuner
• Fujitsu → Wireless communication
• Toshiba SAKE → Multi-media
• NEC electronics X-bridge → Multi-media
Multi-core structure with small PE arrays rather than a big array
• Cooperation with various types of cores
• Integrated design environment
Low power design
→ The main advantage!
X-bridge: NEC electronics (2008)
From an invited talk at Design Gaia 2008
[Figure: MIPS CPU (I-cache, D-cache, INTC, DMA) and STP Engine on a 64bit on-chip bus (266MHz) and an Nconnect 64bit memory switch (266MHz), with SPLs, DMA, work RAM (1kB), two PCIexp HB/EP (1-lane) ports, peripheral I/F (general port 8b x 4, UART x 2, CSI, GPIO, JTAG), 10/100 Ether MAC, PCI host/target, and DDR2 SDRAM controller.]
Dynamically reconfigurable core: 512 PEs (8bit), 32 contexts, providing the virtual hardware mechanism.
The SPL DMA controller hides the communication overhead.
Mixture of SIMD and DRP units: Toshiba's SAKE
From FPT2007 Tutorial session
[Figure: the host processor and system memory connect through the host I/F to an I/O buffer (data RAM) and a code buffer (code RAM); Formatter0, the SIMD units, and Formatter1 are linked by the inter-unit buffer (data registers), with AUX0, AUX1, and write control.]
Dynamically reconfigurable units: optimized for stream processing (independently controlled)
The Architecture (Formatter)
From FPT2007 Tutorial session
[Figure labels: Cfg Controller, CodeMem (19), ID, valid, Xbar In (128), Shuffle, data A, data B (64), PE with CfgMem, 16-bit ALU x 8, Xbar Out.]
Simple hardware
•Pipeline registers only
•No intra-PE data transfer
•PE: 4 cfgs, Xbar: 16 cfgs
•ALU, shift & absolute ops only
Suitable for butterfly operations
PE w/o Shuffle
Xbar In: Formatter0 only
Xbar Out: Formatter1 only
SANYO's Car tuner DRP
[Figure: an ALU array of 24 ALUs between In and Out, with a sequencer, command memory, main memory, and a feedback path.]
Pipelined execution of 4 threads
[Figure: the array is used as four rows L1 to L4 of six ALUs each. Four threads Th1 to Th4 are interleaved: Th1-1, Th2-1, Th3-1, Th4-1 enter L1 in successive cycles, and each thread's later steps (Th1-2, Th1-3, ...) proceed through the following rows.]
Fine carrier frequency offset estimation/correction
[Figure: I/Q data paths mapped onto Cluster0 to Cluster6, including self-correlation, phase offset calculation (DIV, ATAN), correction offset calculation in phase (polar), complex multiply, and data out control & clip, with outputs to the FFT.]
a) Fine carrier frequency offset estimation for LT1
b) Fine carrier frequency offset estimation for LT2
c) Fine carrier frequency offset correction for SIGNAL and DATA
Hitachi's FE-GA
[Figure: a computational cell array of ALU and MLT cells, with load/store (LS) cells and local memories (MEM) connected through a crossbar network; a sequence manager (interrupt/DMA requests), a configuration manager, an I/O port, and a bus interface surround the array.]
Heterogeneous Multi-Core using FE-GA
[Figure: CPU0 to CPU3 (SH-4) and DRP0 to DRP3 (FE-GA), with per-core DTU, LPM, LDM, FVR, DSM blocks and network interfaces, sharing an on-chip CSM.]
The code is generated by a parallelizing compiler and standard APIs.
Examples

High speed configuration
• PACT xpp
• Elixent DFA
• NTT PCA
Multicontext
• Chameleon CS2112
• IPFlex DAP/DNA
• NEC DRP
• CMU PipeRench
• SONY Virtual Mobile Engine (embedded in PSP)
[Figure: granularity (4bit, 8bit, 16bit, 32bit) versus parallelism (1 to 1000; 1, 3, 8, 16, many). Multicontext-style and time-multiplexing dynamically reconfigurable processors (DAP/DNA, CS2112, DRP, DRL, PipeRench, rDSP, PC101, ACM) are plotted together with FPGAs and traditional processors (common processors, VLIW, chip-multiprocessors).]
Dynamically Reconfigurable Processors





Coarse-grain architecture, partly like on-chip multiprocessors and partly like FPGAs.
Rapid development since 2001.
They have not found a killer application (Chameleon's failure).
High-level language development environments have not been well established.
A lot of competitors






High performance embedded processors
Chip multiprocessors
Application Specific Configurable Processors
DSP
Standard FPGA/CPLD
System On Chip
Open Problems




What is the difference between a program and configuration data?
• Reconfigurable processor array = a VLIW machine with extremely large instructions (configuration data)
How frequently should the configuration change?
• Context switching at every clock is not advantageous from the viewpoint of power consumption.
• However, if the configuration is rarely switched, the dynamic reconfiguration function is useless.
How should the grain size of a processing element be decided?
• Are 8-32bit calculators the correct solution?
• Is it only a way to escape Xilinx's patents?
How should calculators and controllers be balanced?
• Since DRP focuses on calculators, it is difficult to implement complicated control.
• Is the node balance of ACM correct?
Summary




A computing system different from stored-program computers.
Not a complete replacement for stored-program computers.
Advances in semiconductor technology directly enhance the performance.
Many problems and research subjects remain.
Historical flow of computer systems
[Figure: ENIAC → EDVAC, EDSAC → IBM machines → RISC, Intel's microprocessors, with reconfigurable machines shown as a separate branch.]
Exercise

A systolic array multiplies an 8 x 8 tri-diagonal matrix A by a vector x of size 8.
Compute the number of clock cycles required for the multiplication.
Assume that time 0 is the moment when the first element of x reaches the left-most cell of the array.