Document

Developing an Architecture for a
Single-Flux Quantum Based
Reconfigurable Accelerator
F. Mehdipour, Hiroaki Honda*, H. Kataoka,
K. Inoue and K. Murakami
Graduate School of Information Science and Electrical
Engineering, Kyushu University, Japan
*Institute of Systems, Information Technologies and
Nanotechnologies (ISIT), Fukuoka, Japan
E-mail: [email protected]
Agenda
Introduction
SFQ-LSRDP General Architecture
The Design Procedure and Tool Chain
Input/ Output Nodes Placement
Area Minimization
Experimental Results
Conclusions
2009/01/29
CREST-JST SFQ-RDP Project (2006~):
A Low-power, high-performance reconfigurable processor
based on single-flux quantum circuits
Yokohama National Univ.
SFQ-FPU chip, cell library
Prof. N. Yoshikawa et al.
Nagoya Univ.
Kyushu Univ.
Architecture, Compiler
and Applications
SFQ-RDP chip, cell library,
and wiring
Prof. A. Fujimaki et al.
Nagoya Univ.
CAD for logic design
and arithmetic circuits
Prof. N. Takagi (Leader)
et al.
Prof. K. Murakami et al.
SFQ-LSRDP
Superconducting
Research Lab. (SRL)
SFQ process
Dr. S. Nagasawa et al.
2009/01/29
Goals

Discovering appropriate scientific applications

Developing compiler tools

Developing performance analyzing tools

Designing and Implementing SFQ-LSRDP
architecture considering the features and
limitations of SFQ circuits
2009/01/29
How a reconfigurable processor works
Non-critical code
Computation-intensive
(critical) code
Non-critical code
GPP
LSRDP
PE
...
PE
LSRDP
PE ...
PE
PE
Computation-intensive
(critical) code
Non-critical code
.
.
.
Application code
PE
ORN
…
PE
PE
ORN
PE
Main
Memory
PE
PE
...
PE
2009/01/29
Single-flux quantum (SFQ)
against CMOS

CMOS main issues in implementing a large accelerator:



High electric power consumption
High heat radiation
Difficulties in high-density packing
磁束量子
超伝導ループ
Superconductivity
loop
Single Flux Quantum
SFQ Features:




High-speed switching and signal transmission
Low power consumption
Compact implementation (smaller area)
Suitable for pipeline processing of data stream
ジョセフソン接合
Josephson
junction
2009/01/29
Outline of large-scale reconfigurable
data-path (LSRDP) processor

LSRDP
PE
GPP
PE
PE
...
PE
ORN : Operand Routing Network
:
:
:
PE
PE
PE
:
...
PE
...
PE
Reconfigurable data-path components:
 A matrix of large number of floatingpoint Functional Units (FUs)
 Reconfigurable Operand Routing
Network : (ORN)
 Dynamic reconfiguration facilities

Streaming Buffer (SB) for I/O ports
ORN
PE
PE
PE
:
:
...
SMAC
Main
Memory
Scratchpad
Memory
Features:
Handling data flow graphs (DFGs)
extracted from scientific applications
 Pipeline execution
 Burst transfer of input /output
rearranged data from/to memory
 Reduced no. of memory accesses

SB
:

:
(alleviating the memory wall problem)
2009/01/29
SFQ-LSRDP General Architecture
LSRDP architecture

Input ports
Processing Elements
FU (Functional Unit): implements basic 64-bit
double-precision floating point operations
including: ADD/SUB and MUL

TU(transfer unit): as a routing resource for
transferring data b/w inconsecutive rows 7

4
MUL TU
15
FU TU
FU
FU TU
PE including two
components
Node 15
13
12
TU
TU
FU TU
Four functionalities
Output ports
2009/01/29
PE structures
FU
-
FU TU
-
FU
-
-
FU TU
-
-
TU
TU
PE Basic arch.
3-inps/2-outs
-
TU FU TU
-
FU TU
TU TU TU
FU TU
PE arch. I
TU FU TU
4-inps/3-outs
TU TU
FU
TU
TU TU
FU
TU TU
-
PE arch. II
3-inps/3-outs
TU
TU-TU
TU
FU TU
2009/01/29
Layout types- Type I
W
A
M
T
A
M
T
A
M
T
…
A
M
T
A
M
T
Each PE implements
ADD/SUB and MUL
ORN
A
M
T
A
M
T
A
M
H
M
…
A
M
T
A
M
T
ORN
: MUL
A
M
A
T
T
A
M
T
A
M
T
…
A
M
T
ADD/SUB
A
T
M MUL
TU
.
: ADD/SUB
.
T
: Transfer Unit
.
ORN
A
M
T
A
M
T
A
M
T
…
A
M
T
A
M
T
Flexible but consumes a lot of resources
2009/01/29
Layout types- Type II
W
A T
M T
Each PE implements
ADD/SUB or MUL
A T
…
A T
M T
Each PE implements
ADD/SUB or MUL
ORN
A T
M T
A T
…
A T
M T
…
A T
MUL
M T
…
A T
M T
ORN
ADD/SUBA T
TU M T
A T
TU
.
.
.
H
ORN
A T
M T
A T
2009/01/29
Maximum connection length (MCL)Definition
Connection
Length= 0
(i, 0)
...
Connection
Length= 2
(i,j)
(i,
j+1)
(i,
j+2)
Longest Connection
Length= L
...
...
...
...
ORN
(i+1,
0)
...
(i+1,
j)
(i+1,
j+1)
(i+1,
j+2)
(i+1,
j+L)
MCL:
maximum horizontal distance b/w
two PEs located in two subsequent rows
2009/01/29
An ORN structure
ORN
T
FPU
FPU
T
½CB
½CB
½CB
T2
CB
CB
T2
CB
CB
CB
FPU
½CB
CB
CB
CB
T
FPU
T
FPU
T
FPU
½CB
½CB
½CB
½CB
½CB
½CB
CB
CB
CB
CB
CB
CB
T2
CB
T
FPU
T
FPU
T2
CB
CB
CB
CB
CB
CB
CB
CB
CB
T2
T
CB
CB
CB
T2
CB
CB
T2
CB
T2
CB
T
FPU
2bit shift
register
ORN is FPU
consisted FPU
of 2-bit shift registers,
1-by-2 and 2-by-2 cross bar switches
T
T2
CB
CB
A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.
2009/01/29
Dynamic reconfiguration architecture
Initial
State
Input-B Input-A Input-C
PE
idle
Immediate
Register (64b)
Immediate
Register
End of
Execution
Starting of
Reconfiguration
Reconfiguration
Transfer
Unit
MUX
ORN
Execution
Immediate
FU
(A op B)
Conf. Reg.
[bit]
PE
Starting of
Execution
wait
・・・
ORN
End of
Reconfiguration
・・・
log(2x (2MCL+1)) x 3 [b]
Three bit-stream lines for dynamic reconfiguration of:
• Immediate registers (64bit) in each PE
• Selector bits for muxes selecting the input data of FUs
• Cross-bar switches in ORNs
2009/01/29
What should be decided during the design
procedure
Streaming Buffer (SB)
...
Operand Routing Network (ORN)
PE1
PE2
PE3
Width
...
The number of I/O ports?
Maximum Connection Length (MCL)?
ORN size and structure?
...
Height
Width and Height ?
ORN
...
ORN
.
.
.
.
.
.
.
.
.
...
PEm
Layout: FU types
(ADD/SUB and MUL)?
Reconfiguration mechanism?
(PE, ORN, Immediate data)
ORN
...
On-chip memory configuration?
Streaming Buffer (SB)
2009/01/29
The Design Procedure and
Tool Chain
Compiler and design flow
Application Code
Mapping
Placement
Bit-Stream Generation
Routing
Configurations
(for LSRDP)
LSRDP Architecture
Port positioning
Analyzing DFG mapping
results
Design Phase
Hardware-Software Partitioning
(manual or automatic)
Critical Parts
(h/w part)
DFG Genration
DFGs
DFGs
Non-Critical Parts
(s/w part)
s/w Part Modification
Binary Code
(for GPP)
• DFGs are manually generated
• DFG mapping results are employed for:
• Analyzing LSRDP architecture statistics (a quantitative approach)
• Generating LSRDP configuration bit-streams
2009/01/29
Benchmark applications

Finite differential method calculation of
2nd order partial differential equations




1dim-Heat equation
1dim-Vibration equation
2dim-Poisson equation
(Heat)
(Vibration)
(Poisson)
Quantum chemistry application

Recursive parts of Electron Repulsion Integral calculation
(ERI-Rec)
Types of operations in the calculations:
ADD/SUB and MUL
2009/01/29
DFG extraction- Heat equation

1-dim. heat equation for T(x,t)
T ( x, t )
 2T ( x, t )
A
t
x 2
(A is const.)
T(i-1,j)
T(i,j)
T(i+1,j)
+

Calculation by Finite Difference
Method (FDM)
T ( xi , t j 1 )
D
*
Basic DFG
B
*

 D * T ( xi , t j )  B * T ( xi 1 , t j )  T ( xi 1 , t j )

+
T(i,j+1)
Basic DFG can be extended to horizontal
and vertical directions to make a larger DFG
2009/01/29
A sample DFG - Heat
Inputs:
Outputs:
Operations:
Immediates:
32
16
721
364
A sample DFG (Heat)
2009/01/29
DFG mapping flow
LSRDP Architecture
Description
DFG
Longest connections
MCL= 2
Placing DFG nodes on LSRDP
Placing IO nodes
Re-placing DFG nodes on
LSRDP (considering IO nodes
positions)
Routing connections
Re-palcing output nodes
Routing Inp/Out connections
Modified Mapping Flow
Configuration File
2009/01/29
Placing Input/Output Nodes
Fan-out based I/O nodes placement
2: Boundaries of the location with
minimum connection length to children

Objective: Minimize TCL




ni= 1  X= Ci1
ni= 2  Ci1 <= X <= Ci2
ni= 3  X = Ci2
ni>=2  X = Cij, j=2…ni-1
(i, 0)
...
(i,j)
(i,
j+1)
(i,
j+2)
(i,
j+3)
(i,
j+4)
...
(i+1,
j+4)
...
Reconfigurable Routing Network
(i+1,
0)
...
(i+1,
j)
(i+1,
j+1)
(i+1,
j+2)
(i+1,
j+3)
...
TCL= |Ci1-X|+ |Ci2X|+…|Ci,ni-X|
INP3 3: The best location is on a middle point
Reconfigurable Routing Network
...
...
...
...
Total Connection Length:
INP1
...

1: There is only one location with
minimum connection length to a child
...

INP2
...

ni: the number of children of input
node i
Ci1, Ci2, Ci3, Ci,ni
X: location of the input node i
...

2009/01/29
One main reason for the large MCL
Inputs Ports are far
from each other
2009/01/29
Proximity-factor based placement


Proximity factor indicates how far a pair of input ports should be
located from each other
For a pair of input nodes
 The larger number of closer descendants, higher proximity factor is
assigned


Si,j: a set of common descendants for input nodes i and j
Dk,i(=Dk,j): distance of common descendant node k to the input nodes
i and j (it is equal to ASAP execution level of the node)
 

p2,1

P


 pn,1
p1, 2

...
pn , 2
...
...
...
p1, n 

p2 , n 


 
p i , j  p j ,i

if i  j


1

if i  j
kS Dk ,i
 i, j

2009/01/29
Proximity factor-Example
I1
1
4
I2
I3
2
3
S1, 2  4,6,7
5
P( I1 , I 2 )  p1,2 
S1,3  7
P( I1 , I 3 )  p1,3 
1 1 1 13
  
2 3 4 12
S 2,3  7
1
1
P ( I 2 , I 3 )  p 2, 3 
4
4
6
7
Inputs nodes I1 and I2 should
be located closer than I3
2009/01/29
Input nodes placement alg.: Example
r 1
C (l ) 

pij 

pij 
i l 1
r 1
C (r ) 
i l 1
1
i l
if C(l)> C(r)
l= l+1, L[l]=j
else
r= r+1, L[r]=j
1
r i
Placing the 1st input node with the highest proximity factor
…
N/2-3
N/2-2
N/2-1
1
N/2+1
N/2+2
N/2+3
…
Placing the 2nd input node with the highest proximity factor
…
N/2-3
N/2-2
2
1
N/2+1
N/2+2
N/2+3
…
2009/01/29
Input ports placement alg.: Example
Placing i-th input node
…
N/2-K
…
2
1
3
…
l
N/2+M
…
N/2+M
…
r
If C(l)> C(r):
…
i
…
2
1
3
…
r
l
If C(r)> C(l):
N/2-K
…
2
l
1
3
…
i
…
r
2009/01/29
Area Minimization
Estimating the area of a PE
Area(FU)= Area(ADD/SUB)= Area(MUL)
Area(TU)= Area(MUX)~ 0.1 Area (FU)
FU TU
FU TU TU
PE arch. I
PE basic arch
Layout I: Area(PE)= 2.1x Area(FU),
Layout I: Area(PE)= 2.2x Area(FU),
Layout II: Area(PE)= 1.1x Area(FU)
Layout II: Area(PE)= 1.2x Area(FU)
A B
C
A
FU
TU
TU TU
B
C
op
PE arch. II
mux
TU
sel
Layout I: Area(PE)= 2.2x Area(FU)
Layout II: Area(PE)= 1.2x Area(FU)
2009/01/29
FU TU
Basic arch.
3-inps/2-outs
Number of rows = 1.5×W
Estimating the ORN area-PE Basic arch.
MCL= 1
Number of columns = 4×MCL
Area (ORN) = 1.5 x W x (4 x MCL) x Area (CB)
W: the no. of the PEs in a RDP row
2009/01/29
Estimating the ORN area-PE arch. I
Number of rows = 2×W
TU FU TU
PE arch. I
4-inps/3-outs
MCL= 1
Number of columns = 6×MCL+2
Area (ORN) = 2 x W x (6 x MCL+ 2) x Area (CB)
2009/01/29
Number of rows = 1.5×W
Estimating the ORN area-PE arch. II
FU
TU
TU TU
PE arch. II
3-inps/3-outs
MCL= 2
Number of columns = 4×MCL+1
Area (ORN) = 1.5 x W x (4 x MCL + 1) x Area (CB)
2009/01/29
A modified connection length measurement

New measurement technique for the net length
src
Connection length measurement:
initial  C.L.= dh
dv
modified  C.L.= dh/ dv
dest
dh
C.L.(previous)= 3
src
C.L.(new)=3
dest1
C.L.(previous)= 3
C.L.(new)=1
dest2
2009/01/29
A modified connection length
measurement- Example
Parent 2
is chosen when C.L.
is measured as dh/dv
Parent 1
MCL= 1
dh
dh/dv
dh
dh/dv
0, 4
0, 4/3
1, 3
1, 1
2, 2
2, 2/3
3, 1
3,1/3
4, 0
4, 0
0, 4
0, 1
1, 3
0.5, 0.75
2, 2
1, 0.5
3, 1
3/2, 1/4
4, 0
2, 0
is chosen when C.L.
is measured as dh
MCL= 2
2009/01/29
MCL minimization- Using a MCL threshold


A maximum threshold is assumed for the MCL
During the placement process:

For each CL larger than the threshold, the vertical distance
increases as:

PE with the min.
C.L to the source
dv= CL/MCL_Threshold
• max permitted length= 2
• dh =3 > max permitted length
src
dest
• dv= 1
dest
dv= dv+ [3/2]=dv+1= 2
2009/01/29
Basic placement and routing vs.
integrated placement and routing
DFG
DFG
Placing Input Nodes
using PF-based alg.
Placing Input Nodes
LSRDP
Architecture
Description
Placing Operational &
Output Nodes
LSRDP
Architecture
Description
Placing Output Nodes
Routing Nets
Final Map
Routing IO Nets
Basic Placement and Routing Flow
Placing Operational Nodes &
Routing Nets (node by node)
Final Map
Routing Output Nets
Integrated Placement and Routing Flow
2009/01/29
Experimental Results
Specifications of the benchmark DFGs
DFG
# of
nodes
# of
inputs
# of
outputs
# of pure
ops
max. inp.
nodes
fan-out
Max.
fan-out
Heat-8x1
34
6
4
16
2
1
Heat-8x2
60
8
4
32
3
3
Heat-16x2
172
16
12
96
3
3
Poisson-3x3
62
18
1
33
3
2
Vibration-4x2
48
8
4
24
3
2
Vibration-8x2
136
16
12
72
4
4
ERI-1
20
8
3
9
3
1
ERI-2
76
16
9
51
3
3
ERI-3
89
14
9
66
3
6
ERI-4
67
19
1
47
4
3
Max
172
19
12
96
4
6
2009/01/29
Evaluation results for various architecturesMCL and ORN sizes
Layout-I
S1
S2
PE basic arch.
MCL
PE arch. I
PE arch. II
ORN size PE basic arch
(overall) PE arch. I
x CB
PE arch. II
14
8
10
25116
6
3
4
12600
15
9
12
29250
12
4
7
24336
30600
13680
38080
17680
18696
8237
23520
14790
nodes placement
S1
S2
Layout-II
S1
S2
fan-out based
proximity-factor based
Connection length
measurement
lh
lhv
S2 results in smaller MCL and ORN size for both layout types
2009/01/29
Evaluation results for various architecturesno. of utilized PEs
No. of PEs
(overall)
x PE
Layout-I
S1
S2
Layout-II
S1
S2
PE basic arch
580
683
330
344
PE arch. I
PE arch. II
634
627
713
669
384
360
384
384
By using lhv, larger number of RDP rows are utilized
larger number of PEs will be employed for S2
2009/01/29
Evaluation results for various architecturesoverall LSRDP area (KJJ)
ADD/
SUB
MUL
ADD/
SUB
MUL
ADD/
SUB
MUL
ADD/
SUB
MUL
ADD/
SUB
MUL
ADD/
SUB
MUL
...
ADD/
SUB
ORN
ADD/
SUB
MUL
ADD/
SUB
MUL
ADD/
SUB
MUL
ADD/
SUB
MUL
ADD/
SUB
MUL
...
ADD/
SUB
.
.
.
ADD/
SUB
MUL
...
MUL
...
FU
TU
TU
FU
TU
ORN
ADD/
SUB
MUL
...
ADD/
SUB
MUL
...
ADD/
SUB
MUL
.
.
.
.
.
.
...
ADD/
SUB
MUL
ADD/
SUB
MUL
...
.
.
.
.
.
.
.
.
.
.
.
.
FU
TU TU
TU
Basic PE arch.
PE arch. I
PE arch. II
3-inps/2-outs
4-inps/3-outs
3-inps/3-outs
ORN
ADD/
SUB
MUL
Layout-I
ADD/
SUB
MUL
ORN
.
.
.
MUL
ORN
ADD/
SUB
MUL
ORN
ADD/
SUB
MUL
ADD/
SUB
MUL
Layout-II
Layout-I
Overall LSRDP PE basic arch
Area x (KJJ) PE arch. I
PE arch. II
S1
S2
36923
48083
35307
34193
35995
31258
Layout-II
S1
S2
29200
36190
27266
27040
25031
23451
S2 results in smaller overall area in terms of KJJ for both layout types
Layout II results in smaller area
PE arch. II gives smaller area
2009/01/29
A sample ORN implementation
Block diagram of a high frequency test bench
ladder
clkin_lfin
data_in
input shift register
clkin_lfout
circuit
under
test
output shift register
data_out
A photograph of a chip with 1-to-3
ORN prototype test bench
circuit under test
ladder
input shift register
5 mm
clkin_hf
output shift register
2009/01/29
Conclusions

SFQ-LSRDP is a basic core of a high-performance low-power
computer

Data Flow Graphs (DFGs) extracted from scientific applications are
mapped on the LSRDP

LSRDP micro-architecture is designed based on characteristics of
DFGs via a quantitative approach

LSRDP is promising for resolving issues originated from CMOS
technology as well as achieving remarkable performance
Acknowledgement:
This research was supported in part by Core Research for Evolutional Science
and Technology (CREST) of Japan Science
and Technology Corporation (JST).
2009/01/29
Thanks for your attention!
Any questions?