Document

An Accelerator Based on Single-Flux
Quantum Circuits for a High-Performance
Reconfigurable Computer
F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*
and K. Murakami*
*Graduate School of Information Science and Electrical
Engineering, Kyushu University, Japan
**Institute of Systems, Information Technologies and
Nanotechnologies (ISIT), Fukuoka, Japan
E-mail: [email protected], [email protected]
1
Agenda
 Introduction
 Large-Scale Reconfigurable Data-Path (LSRDP)
General Architecture and Specifications
 Design Procedure and Tool Chain
 Preliminary Results
 Conclusions and Future Work
Kyushu University
WAHA 2009
2
Introduction

Parallel computer clusters with General-Purpose Processors (GPP)
are often used for HPC

Various accelerators are used with GPPs for further performance
improvement


PowerXcell, GPGPU, GRAPE-DR, ClearSpeed, etc.
Small size and low power consumption comparing to processors with similar
performance
Roadrunner with
PowerXcell
TSUBAME
NVIDIA Tesla S1070
http://it.nikkei.co.jp/
http://www.elsa-jp.co.jp/products/hpc/tesla/s1070/index.html
Kyushu University
WAHA 2009
http://www.top500.org/system/9485
3
Single Flux Quantum Large Scale
Reconfigurable Data-Path (SFQ-LSRDP)

A large memory bandwidth is demanded in conventional
accelerators for high-performance computation

On chip memories are often used to hide memory access latency
Large-Scale Reconfigurable Data-Path (LSRDP):
• is introduced as an alternative accelerator
• reduces the no. of memory accesses
• is implemented by Single-Flux Quantum (SFQ) circuits
instead of CMOS circuits
• is suitable for high performance scientific computations
Kyushu University
WAHA 2009
4
Outline of Large-Scale Reconfigurable
Data-Path (LSRDP) processor

LSRDP
Reconfigurable data-path includes:
A large number of floating point
Functional Units (FUs)
 Reconfigurable Operand Routing
Network : ORN
 Dynamic reconfiguration facilities

FU
GPP
FU
FU
...
FU
ORN : Operand Routing Network
:
:
:
FU
FU
FU
:
...
FU

ORN
FU
FU
 Implementation
FU
...
:
:
...
by SFQ circuits
FU
SB
:
Streaming Buffers (SB) for I/O ports
:

Features:
Data Flow Graphs (DFGs) extracted
from critical calculation parts are
directly mapped
 Pipeline execution
 Burst transfer is used for input /output
rearranged data from/to memory

SMAC
Main
Memory
Kyushu University
Scratchpad
Memory
WAHA 2009
5
Single-Flux Quantum (SFQ)
against CMOS

CMOS issues:




high electric power consumption
high heat radiation and difficulties in high-density packing
memory wall problem which limits the processing speed
SFQ Features:






High-speed switching and signal transmission
Low power consumption
Compact implementation of a system (small area)
No cost for latch
Suitable for pipeline processing of data stream
Serial bit-level processing
Kyushu University
WAHA 2009
6
CREST-JST (2006~): Low-power,
high-performance, reconfigurable processor
using single-flux quantum circuits
Yokohama National Univ.
SFQ-FPU chip, cell library
Kyushu Univ.
Prof. N. Yoshikawa et al.
Nagoya Univ.
SFQ-RDP chip, cell library,
and wiring
Prof. A. Fujimaki et al.
SFQ-LSRDP
Architecture, Compiler
and Applications
Prof. K. Murakami
Dr. K. Inoue
Dr. H. Honda
Dr. F. Mehdipour
H. Kataoka
Nagoya Univ.
CAD for logic design
and arithmetic circuits
Superconducting
Research Lab. (SRL)
Prof. N. Takagi (Leader)
et al.
SFQ process
Dr. S. Nagasawa et al.
Kyushu University
WAHA 2009
7
Goals of the Project

Discovering appropriate applications

Developing compiler tools

Developing performance analyzing tools
Designing and Implementing SFQ-LSRDP architecture
considering the features and limitations of SFQ circuits
Kyushu University
WAHA 2009
8
LSRDP General Architecture and
Specifications
9
Parameters Should Be Decided
Within the LSRDP Design Procedure
Streaming Buffer (SB)
• Core structure a matrix of PEs
...
Operand Routing Network (ORN)
PE1
PE2
PE3
Width
...
• PE: combination of a Functional Unit (FU)
and a data Transfer Unit (TU)
Width and Height ?
...
Height
Maximum Connection Length (MCL)
between consecutive rows?
ORN
...
ORN
.
.
.
.
.
.
.
.
.
...
ORN
...
Streaming Buffer (SB)
Kyushu University
PEm
Layout: FU types
(ADD/SUB and MUL)?
• Reconfiguration
mechanism?
(PE, ORN, Immediate data)
• On-chip memory configuration?
WAHA 2009
10
LSRDP Architecture

Processing Elements

FU


implements basic 64-bit double-precision floating point
operations including: ADD, SUB and MUL
TU (transfer unit) as a routing resource for transferring data
from a row to an inconsecutive row
FU
FU TU
FU TU
TU
TU
FU TU
PE including Two components
Four functionalities
Kyushu University
WAHA 2009
11
Layout Types- Type I
W
A
M
T
A
M
T
A
M
T
…
A
M
T
A
M
T
Each PE implements
ADD/SUB and MUL
ORN
A
M
T
A
M
T
A
M
H
M
: MUL
…
A
M
T
A
M
T
ORN
A
M
A
T
T
A
M
T
A
M
T
…
A
M
T
ADD/SUM
A
T
M MUL
TU
.
: ADD/SUB
.
T
: Transfer Unit
.
ORN
A
M
T
Kyushu University Flexible
A
M
T
A
M
T
…
A
M
T
A
M
T
but consume
a lot of resources
WAHA 2009
12
Layout Types- Type II (Checkered)
W
A T
M T
Each PE implements
ADD/SUB or MUL
A T
…
A T
M T
Each PE implements
ADD/SUB or MUL
ORN
A T
M T
A T
…
A T
M T
…
A T
MUL
M T
…
A T
M T
ORN
ADD/SUMA T
TU M T
A T
TU
.
.
.
H
ORN
A T
Kyushu University
M T
A T
WAHA 2009
13
Layout Types- Type III (Striped)
Each PE implements
ADD/SUB or MUL
W
M T
M T
M T
…
M T
M T
Each PE implements
ADD/SUB or MUL
ORN
ADD/SUMA T
TU
A T
A T
…
A T
A T
…
M T
MUL
M T
…
A T
A T
ORN
M T
M T
H
M T
TU
.
.
.
ORN
A T
Type
Kyushu University
A T
A T
II or III, which WAHA
one is
more efficient?
2009
14
Maximum Connection Length (MCL)
Connection
Length= 0
(i, 0)
...
Connection
Length= 2
(i,j)
(i,
j+1)
(i,
j+2)
Longest Connection
Length= L
...
...
...
...
ORN
(i+1,
0)
...
(i+1,
j)
(i+1,
j+1)
(i+1,
j+2)
(i+1,
j+L)
MCL:
maximum horizontal distance
between two PEs located in two consecutive rows
Kyushu University
WAHA 2009
15
An ORN Structure
ORN
T
FPU
FPU
T
½CB
½CB
½CB
T2
CB
CB
T2
CB
CB
CB
FPU
½CB
CB
CB
CB
T
FPU
T
FPU
T
FPU
½CB
½CB
½CB
½CB
½CB
½CB
CB
CB
CB
CB
CB
CB
T2
CB
T
FPU
T
FPU
T2
CB
CB
CB
CB
CB
CB
CB
CB
CB
T2
T
CB
CB
CB
T2
CB
CB
T2
CB
T2
CB
T
FPU
2bit shift
register
ORN is FPU
consisted FPU
of 2-bit shift registers,
1-by-2 and 2-by-2 cross bar switches
T
T2
CB
CB
Kyushu University
A. Fujimaki, et al., Demonstration ofWAHA
an SFQ-Based
2009Accelerator Prototype for a High-Performance Computer,” ASC08,162008.
Dynamic Reconfiguration Mechanism
Input-B Input-A Input-C
PE
Immediate
Register (64b)
Immediate
Register
Transfer
Unit
MUX
FU
(A op B)
Conf. Reg.
[bit]
・・・
ORN
・・・
log(2x (2MCL+1)) x 3 [b]
Three bit-stream lines for dynamic reconfiguration of:
• Immediate registers (64bit) in each PE
• Selector bits for muxes selecting the input data of FUs
• Cross-bar switches in ORNs
Kyushu University
WAHA 2009
17
Design Procedure and
Tool Chain
18
Compiler and Design Flow
Application Code
Mapping
Placement
Bit-Stream Generation
Hardware-Software Partitioning
(manual or automatic)
Routing
Configurations
(for LSRDP)
LSRDP Architecture
Critical Parts
(h/w part)
Port positioning
Analyzing DFG mapping
results
DFG Genration
Design Phase
DFGs
DFGs
Non-Critical Parts
(s/w part)
s/w Part Modification
Binary Code
(for GPP)
• DFGs are manually generated from critical parts of applications
• DFG mapping results are used for
• Analyzing LSRDP architecture statistics
• Generating LSRDP configuration bit-streams
Kyushu University
WAHA 2009
19
LSRDP Design Procedure
DFGs & LSRDP HW constraints
Choosing a design parameter
For each
parameter
Mapping DFGs onto the LSRDP
Obtaining required statistics
Analyzing the mapping results
Choosing the appropriate value
Kyushu University
Appropriate value for each parameter
WAHA 2009
20
Benchmark Applications
for Design Procedures

Finite differential method calculation of
2nd order partial differential equations




1dim-Heat equation
1dim-Vibration equation
2dim-Poisson equation
(Heat)
(Vibration)
(Poisson)
Quantum chemistry application

Recursive parts of Electron Repulsion Integral calculation
(ERI-Rec)
Only ADD/SUB and MUL operations are used
in the critical calculations of all above applications
Kyushu University
WAHA 2009
21
DFG Extraction- Heat Equation

1-dim. heat equation for T(x,t)
T ( x, t )
 2T ( x, t )
A
t
x 2
(A is const.)
T(i-1,j)
T(i,j)
T(i+1,j)
+

Calculation by Finite Difference
Method (FDM)
T ( xi , t j 1 )
D
*
B
*

 D * T ( xi , t j )  B * T ( xi 1 , t j )  T ( xi 1 , t j )

+
T(i,j+1)
Basic DFG can be extended to horizontal
and vertical directions to make a larger DFG
Kyushu University
WAHA 2009
Basic DFG corresponding to
Minimum FDM calculation
22
Example of extracted DFGs- Heat
Inputs:
Outputs:
Operations:
Immediates:
32
16
721
364
A huge sample DFG (Heat)
Kyushu University
WAHA 2009
23
DFG Classification
# of
# of
Inputs Outputs
Class
# of FUs
RDP-S
128
19
12
RDP-M
512
19
12
RDP-L
1024
38
24
RDPXL
> 1024
64
52
# of
DFGs
Heat (3)
Poi (1)
Vib (2)
Eri (4)
Heat (1)
Poi (1)
Vib (1)
Eri (4)
Heat (2)
Poi (1)
Vib (2)
Eri (5)
Heat (1)
Poi (1)
Vib (2)
Eri (5)
Totally,
24 DFGs are prepared
for benchmark DFG
Due to broad range of DFG sizes
DFGs are classified as S, M, L, XL with respect to their size
the number of Input/Output
nodes
Kyushuand
University
WAHA 2009
24
Mapping DFGs onto LSRDP
LSRDP
Architecture
Description
DFG
Placing DFG
nodes on LSRDP
Routing
Connections
Placing IO nodes
Routing Inp/Out
Connections
Configuration
File
Kyushu University
Longest connections
WAHA 2009
25
Preliminary Results
26
LSRDP Specifications: Width & Height
# of Input
ports
# of Output
ports
Width
Height
LSRDP-S
19
12
16
16
LSRDP-M
19
12
32
16
LSRDP-L
38
24
64
32
LSRDP Dimensions and the number of Input/Output Ports
Kyushu University
WAHA 2009
27
LSRDP Specifications: MCL
(i, j)
MCL = L
(i,
j+1)
LSRDP
MCL
(avg/max)
(i+1,
j+1)
(i+2,
(i+L,
・・・
j+1)
j+1)
ORN SizeNo of Inps (avg/max),
Outs
LSRDP-S
4/8
18/34, 3
LSRDP-M
5/9
22/38, 3
LSRDP-L
5/9
22/34, 3
Needs further MCL optimization
Kyushu University
WAHA 2009
28
Analyzing Various LSRDP Layouts
Heat
Viration
Poisson
ERI1
ERI2
Layout
I
II
III
I
II
III
I
II
III
I
II
III
I
II
III
Size
8x3
8x3
8x4
10x8
10x8
10x11
10x10
10x12
15x18
6x2
9x3
6x2
10x10
10x10
15x8
Layout I
Layout II
(Except ERI1 DFG which gives
better size for Layout III)
Layout II can be used instead of Layout I to obtain a smaller LSRDP
architecture with lower power consumption and implementation cost.
Kyushu University
WAHA 2009
29
LSRDP at One Glance (1/2)
Functional units
ADD/SUB, MUL
Layout
Type II (checker pattern)
Operations
64-bit floating point
Processing structure
Pipelined
PE structure
FU, T, FU+T, T+T
LSRDP Size
Small
Medium
Large
No. of inp/out ports
19/12
19/12
38/24
Width/Height
16/16
32/16
64/32
16*16*64
32*16*64
64*32*64
16*BSS(ORN)
32* BSS(ORN)
64*BSS(ORN)
16*16* 2
32*16*2
64*32* 2
22 , 3
26 , 3
26 , 3
Conf. bit-stream
size
Imm. Regs
ORNs
PEs
ORN
inputs, outputs
Structure
Cross-bar switch
Conn. Type
One-directional
Kyushu University
WAHA 2009
30
LSRDP at One Glance (2/2)
Internal memory
External memory
Reconf. mechanism
Kyushu University
Type
Immediate registers
Size and count
64-bit registers,
One reg. for each PE
Communication mechanism
Serial
No. of memory modules
16
Date trans. rate
1800Mbps/pin
Overall data trans. rate
24 GB/s
Mem. to LSRDP bus width
64 bit
Channels per module
Two
Bit serial configuration through a serial chain
WAHA 2009
31
Preliminary Performance Evaluation
Base processor configuration
Processor type
Out-of-order
GPP operating frequency
3.2GHz
Inst. issue width
4 instruction/cc
Inst. decode width
4 instruction/cc
Cache configuration
L1 data
64KB(128B Entry, 2way, 2cc)
L1 instruction
64KB(64B Entry, 1way, 1cc)
L2 unified
4MB(128B Entry, 4way, 16cc)
Latency of main memory
300cc
L2 to main memory
Bus width
64 Bytes
Freq
800 MHz
GPP+LSRDP configuration
LSRDP operating frequency
80 GHz
Reconfiguration Latency
1cc
Latency SPM LSRDP latency
1cc
Latency Main Memory SPM
7500cc
Bandwidth SPMLSRDP
Max. 64 * 8 Bytes/cc
Bandwidth Main Memory SPM
102.4GB/sec
GPP: Exec. time measurement by means of a processor simulator
Kyushu University
WAHA 2009
LSRDP: Estimation by performance modeling
32
Preliminary Performance Evaluation
(Heat)
Normalized by GPP Exec. Time
0.4
0.35
Re c o n f.
0.3
Comm.
Re ar r an ge
0.25
St all
0.2
LSRDP
GPP
0.15
Basic: SB only
Reuse: SB + SPM
0.1
0.05
0
Basic
Reuse
Heat (M)
Basic
Reuse
Heat (L)
Data reusing is employed to avoid the need for data
rearrangement as well as frequently data retrieval
from
the scratchpad memory.
Kyushu
University
WAHA 2009
33
Preliminary Performance Evaluation
(Poisson)
Normalized by GPP Exec. Time
0.4
0.35
Re c o n f.
0.3
Comm.
Re ar r an ge
0.25
St all
LSRDP
0.2
GPP
0.15
0.1
0.05
0
poisson(S)
poisson(M)
poisson(L)
A small fraction is related to processing time on
LSRDP and the main fraction concerns to various
overhead times as well
the execution time on GPP
Kyushu University
WAHAas
2009
34
Conclusions & Future Work

A high-performance computer comprising an accelerator (LSRDP)
implemented by superconducting circuits was introduced.

24 benchmark Data Flow Graphs (DFGs) were manually generated.

LSRDP micro-architecture is designed based on characteristics of scientific
applications via a quantitative approach.

LSRDP is promising for resolving issues originated from CMOS
technology as well as achieving considerable performances.
Future Work:
•To achieve higher performance it is required to reduce various
overhead costs mainly related to data management part.
•To reduce the implementation cost of LSRDP, we will focus on
reducing maximum connection length and ORN size.
Kyushu University
WAHA 2009
35
Acknowledgement
This research was supported
in part by Core Research for Evolutional Science
and Technology (CREST) of Japan Science
and Technology Corporation (JST).
Kyushu University
WAHA 2009
36
Thanks!
Any Questions?
37
Backup Slides

Backup Slides
Kyushu University
WAHA 2009
38
SFQ (Single Flux Quantum) Circuit
High speed, Low power consumption, and
Operating by a different principle from the CMOS
磁束量子
Single Flux Quantum
超伝導ループ
Superconductivity
loop
ジョセフソン接合
Josephson
junction
Ib
L
Ic
Tunneling effect
Φ0
2mV
Kyushu University
2ps
WAHA 2009
39
Mapping Results
# of
FUs
# of
Inputs
# of
Outputs
Width
Height
RDP-S
128
19
RDP-M
512
RDP-L
RDP-XL
Total # of Extra TUs
PEs
(max/avg) (max/avg) (max/avg) (max/avg)
12
26/14.9
10/6.7
98/51.7
56/23.5
19
12
26/17.1
16/9.3
170/77.8
92/37.1
1024
38
24
58/40
24/14.4
730/260.1 428/141.3
> 1024
64
52
122/45.3
25/12.4
1217/350.4 1065/240
For each class, a lot of extra TUs
are needed to map all DFGs
Kyushu University
WAHA 2009
PE types FU
FU T
T
T T
40
Connection Length Minimization- Results
Final optimized Maximum Connection Length (MCL) results
MCL (ave/max)
RDP-S
RDP-M
4/9
(i, j)
MCL = L
5/9
(i,
j+1)
RDP-L
(i+1,
j+1)
(i+2,
(i+L,
j+1) ・・・ j+1)
9.3/19
ORNs should provide the connection length of 9
in LSRDP-S/M (MCL= 9).
Possible to decrease?
For LSRDP-L, MCL = 19 !!!
⇒ Serious Implementation Cost
Kyushu University
WAHA 2009
41
Distributions of Connection Lengths
Average Fraction of Connection Lengths in
the RDP/S Maps
Connection
length
4%
3%
0%
2%1%1%
0
0%
1
10%
2
3
4
5
6
7
8
9
79%
93% of connection lengths are 0 ~ 2
Only small fractions of connections results in larger ORNs
Kyushu University
WAHA 2009
42
Analyzing Various LSRDP Layouts
Heat
Viration
Poisson
ERI1
ERI2
Layout
I
II
III
I
II
III
I
II
III
I
II
III
I
II
III
Size
8x3
8x3
8x4
10x8
10x8
10x11
10x10
10x12
15x18
6x2
9x3
6x2
10x10
10x10
15x8
•
Almost a similar small size
values are achieved
for Layout I and II
for the majority of DFGs (except
ERI1 DFG which gives better
size for Layout III)
Layout II can be used instead of Layout I to obtain a
smaller LSRDP architecture with lower power
Kyushu University
WAHA 2009
consumption and implementation
cost as well
43
Why only ERI1 DFG is suitable to Layout III ?
ADD/
SUB
MUL
ADD/
SUB
MUL
...
MUL
...
MUL
...
...
ORN
ADD/
SUB
MUL
ADD/
SUB
ORN
ADD/
SUB
MUL
ADD/
SUB
ORN
ADD/
SUB
MUL
ADD/
SUB
MUL
.
.
.
.
.
.
.
.
.
.
.
.
Heat
RDP-Layout Type II
Layout II
MUL
MUL
MUL
MUL
...
ADD/
SUB
...
MUL
...
...
ORN
ADD/
SUB
ADD/
SUB
ADD/
SUB
ORN
MUL
MUL
MUL
ORN
ERI 1
Kyushu University
ADD/
SUB
ADD/
SUB
ADD/
SUB
ADD/
SUB
.
.
.
.
.
.
.
.
.
.
.
.
RDP-Layout Type III
Layout III
WAHA 2009
44
FU Layout for DIV, SQRT, EXP operations
16Bits Floating point DIV, SQRT, and EXP Functional unit have
been already developed by SFQ current technology.
FU
FU
FU
...
FU
ORN : Operand Routing Network
FU
DIV
FU
FU
...
FU
...
FU
ORN
Where ?
Three times
larger latency
FU
FU
FU
:
:
:
FU
FU
FU
:
...
FU
...
FU
ORN
FU
FU
FU
Pipeline execution based
on ADD and MUL latency
Where should we place different latency FU ?
Kyushu University
WAHA 2009
Heterogeneous configuration
of FU array ?
45
Estimated performance improvement of 2-dim Poisson
equation by LSRDP calc.
8
Normalized exec. time
by GPP(3GHz) calc.
7
6
Treconf
5
Ttra
4
Trearr
3
Tst
2
Tcal
1
Tgpp
0
12.8
102.4
poisson(S)
Kyushu University
12.8
102.4
poisson(M)
WAHA 2009
12.8
102.4
poisson(L)
Main Mem.
bandwidth
[GByte/sec]
46
Estimated performance improvement of ERI
calculation by LSRDP
1
0.8
Treconf
0.6
Ttra
Trearr
0.4
Tst
0.2
Tcal
Tgpp
0
GPPonly
(3GHz)
Kyushu University
LSRDP
ERI
WAHA 2009
47
Recursive Parts of Electron Repulsion
Integral Formula (ERI-Rec)
No. of
Operations
No. of
Inputs
No. of
Output
(ps,ss)
9
8
3
(ps,ps)
51
16
9
(pp,ss)
66
14
9
(pp,ps)
252
22
27
(pp,pp)
1004
28
81
DFG sizes have already determined
from original recursive formula
Kyushu University
WAHA 2009
48
What types of software/algorithms are
suitable for LSRDP ?

When same calculations have to be calculated
repeatedly.


LSRDP is used for high throughput accelerator.
Input/Output data size is small compared with the
amount of the operations.
memory access
X
small size
of input
Kyushu University
WAHA 2009
Large amount
of calculations
small size
of output
49
LSRDP
Exploration of suitable applications
for LSRDP

Application

matrix elements calculation




Molecular integral calculations in molecular orbital method
Monte Carlo type simulation
etc…
Numerical calculation library






special function (promising?)
differential equation
numerical integration
matrix operation (difficult ??)
Triangular matrix simultaneous equation
etc…
Investigating applicability
against various applications
WAHA 2009
Kyushu University
50
Recursive Parts of Electron Repulsion Integral Formula in
Molecular Orbital Calc.
~Up to (pp,pp) Recursive Calculation~
(ss,ss)(m) and all coefficients
are given as input
( pi s, ss)(0)  PA i ( ss, ss)(0)  WPi ( ss, ss)(1)
( pi s, pk s )(0)  QCk ( pi s, ss )(0)  WQk ( pi s, ss )(1) 
( pi p j , ss )(0)  PBi ( pi s, ss )(0)  WPj ( pi s, ss )(1) 
 ik Zab
2
 ij Za
2
( ss, ss)(1)
(ss, ss)
(0)
  Za( ss, ss )(1)

( pi p j , pk s )(0)  QCk ( pi p j , ss )( 0)  WQk ( pi p j , ss )(1)


Zab
 ik ( sp j , ss )(1)   jk ( pi s, ss )(1)
2

( pi p j , pk pl )(0)  QDk ( pi p j , pk s )(0)  WQl ( pi p j , pk s)(1) 

 ij Zb
2
( p p , ss)
i
(0)
j
  Zb( pi p j , ss )(1)

Zab
 il ( sp j , pk s)(1)   jl ( pi s, pk s)(1)
2

(i,j,k,l = x,y,z): p function has 3 components (as 1dim array)
DFG
sizes
areFUs.
determined by
Each DFG has only ADD
(SUB)
and MUL
Kyushu University
algorithm
WAHA 2009
# of Inputs:
# of Outputs:

Max. 28
1 ~ 81
original calculation
51
DFG Distribution for each application
# of FUs
1000
ERI-Rec
(8 DFGs)
Vibration (7)
800
Heat (6)
600
400
Poisson (3)
200
0
0
10
20
30
40
50
60
70
# of Inputs
DFGs have different qualities in terms of the
# of FUs, # of Inputs
and
Kyushu University
WAHA
2009 Outputs
52
Example of MCL (Heat)
MCL
Heat original DFG
(I/O: 8/4, FUs: 32)
Kyushu University
Mapping result
WAHA 2009
53
Example of extracted DFGs (ERI-Rec)

Maximum DFG of ERI-Rec: (pipj,pkpl)
Inputs:
28
Outputs:
81
FUs:
1004
Immediates:
0
Vertical Partitioning
Kyushu University
WAHA 2009
Inputs:
24
Outputs:
1
FUs:
108
Immediates:
0
54
Poisson Equation
 2u( x, y)  2u( x, y)

 f ( x, y)
2
2
x
y
2D – Poisson Eq.
Successive Over Relaxation method
u ( n 1) ( xi , y j )
ω is const.
 (1   ) * u ( n ) ( xi , y j )



u
4
(n)
( xi 1 , y j )  u ( n ) ( xi 1 , y j )
 u ( n ) ( xi , y j 1 )  u ( n ) ( xi , y j 1 )  h 2 f ( xi , y j )

Red/Black
Gauss Seidel
In order to obtain u(n+1) (xi,yj) in the next iteration,
current values of five variables i.e. u(n) (xi,yj), u(n) (xi±1,yj), u(n)
55
55
(xi,yjUniversity
Kyushu
WAHA 2009
±1) are needed
Example of extracted DFGs (Poisson)

Maximum Poisson DFG
Inputs:
32
Outputs:
1
FUs:
721
Immediates: 364
Kyushu University
WAHA 2009
56
Performance Evaluation:
Simulation Environment
Variable parameters:
GPP
LSRDP
Main
Memory
• Freq. of GPP and LSRDP
• Bandwidth between main memory
and LSRDP
• Latency of reconfiguration time
• # of FPUs in LSRDP
• Supporting FPU types
(Add, Mul, Div, Exp, Sqrt,
Error function units are supported)
Use streaming buffer in the LSRDP chip
I/O data is sorted in the main memory.
GPP: Exec. time measurement by processor simulator
LSRDP: Estimation by performance modeling
Kyushu University
WAHA 2009
57
57
Estimated performance improvement
of 1-dim heat equation by LSRDP calc.
1.2
Trec
1
Ttra
0.8
Trearr
0.6
Tst
0.4
Tcal
0.2
ETgpp
0
Only
GPP
12.8
51.2
25.6
LSRDP
Kyushu University
WAHA 2009
102.4
Main Mem.
bandwidth
[GByte/sec]
58
Estimated performance improvement of 1-dim heat
equation by LSRDP calc.
Normalized exec. time
by GPP(3GHz) calc.
0.35
0.3
0.25
Trec
0.2
Ttra
Trearr
0.15
Tst
0.1
Tcal
0.05
ETgpp
0
12.8
25.6
51.2
Heat
Kyushu University
WAHA 2009
102.4
Main Mem.
bandwidth
[GByte/sec]
59
Poisson Red/Black 法における
DFGの拡大による繰り返し回数の増加
4+1ノード
の入力
9+4ノード
の入力
SOR式
1回の計算
SOR式
2回の繰り返し
中心1ノードの出力
中心1ノードの出力
•DFGの拡大により1度に計算可能な繰り返し回数が増加
これに伴い必要な入力数も増加
Kyushu University
WAHA 2009
60
60
Implementation of Heat calculation to LSRDP
Original GPP code
Loop j
Loop i
T(xi,tj)
End Loop
End Loop
Kyushu University
LSRDP code
LSRDP Reconfiguration
Loop j’
Input Data Rearrangement
Loop N
LSRDP pipeline exec.
(FDM DFG calc.)
End Loop
Output Data Rearrangement
End Loop
WAHA 2009
61
61
Implementation of Poisson calculation
to LSRDP
Original GPP code
LSRDP code
Loop Iter
Loop i
loop j
u(xi,yj)
End Loop
End Loop
End Loop
LSRDP Reconfiguration
Loop Iter’
Input Data rearrangement
Loop N
LSRDP pipeline exec.
(FDM DFG calc.)
End Loop
Output Data rearrangement
End Loop
Kyushu University
WAHA 2009
62
62
Implementation of ERI-Rec calculation to
LSRDP
original GPP code
Loop I,J,K,L
Loop contraction
Initial Integral Calc.
Recursive Calc.
End Loop
Partial Fock Calc.
End Loop
Initial Integral Calc.:
1/Sqrt, Exp, Fm(T) are utilized
=> GPP calculation
Recursive Calc.:
only ADD/SUB, MUL
=> LSRDP calculation
Kyushu University
LSRDP code
Loop I,J,K,L
LSRDP Reconfiguration
Loop contraction
Initial Integral Calc.
End Loop
Input Data rearrangement
Loop N
LSRDP pipeline calc.
(Recursive DFG calc.)
End Loop
Output Data rearrangement
Partial Fock Calc.
End Loop
WAHA 2009
63
63
Vertical vs. Horizontal
DFG Decomposition
Original
Loop N
Reconfiguration
Loop M
LSRDP pipeline calc.
End Loop
End Loop
Vertical Decomp.
Loop n ( > N)
Reconfiguration
Loop M
LSRDP pipeline calc.
End Loop
End Loop
Kyushu University
Horizontal Decomp.
Loop N
Reconfiguration
Loop M
1st LSRDP pipeline calc.
End Loop
End Loop
Loop N
Reconfiguration
Loop M
2nd LSRDP pipeline calc.
End Loop
End Loop
WAHA 2009
64
64
Example of extracted DFGs

Maximum DFG of ERI-Rec: (pipj,pkpl)
Inputs:
28
Outputs:
81
FUs:
1004
Immediates:
0
Vertical Partitioning
Kyushu University
WAHA 2009
Inputs:
24
Outputs:
1
FUs:
108
Immediates:
0
65
Example of extracted DFGs- Heat
Inputs:
Outputs:
Operations:
Immediates:
32
16
721
364
A huge sample DFG (Heat)
Kyushu University
WAHA 2009
66
Performance Evaluation:
Simulation Environment
Variable parameters:
GPP
LSRDP
Main
Memory
• Freq. of GPP and LSRDP
• Bandwidth between main memory
and LSRDP
• Latency of reconfiguration time
• # of FPUs in LSRDP
• Supporting FPU types
(Add, Mul, Div, Exp, Sqrt,
Error function units are supported)
Use streaming buffer in the LSRDP chip
I/O data is sorted in the main memory.
GPP: Exec. time measurement by processor simulator
LSRDP: Estimation by performance modeling
Kyushu University
WAHA 2009
67
67
Performance Evaluation:
Execution Time Modeling
Execution time
ETTOTAL  ETGPP  ETLSRDP  TOverhead
ETLSRDP  TCal  STLSRDP
Sort data + Reconfig.
+ Send signal for comm.
Calculation time
Tcal 
Total pipeline depth
# of rows of LSRDP
+
in the given program
(latency of LSRDP)
Stall time
of LSRDP <->Mem
STLSRDP  Latency
For first Input and last output
Kyushu University
WAHA 2009
+
Stall from
Bandwidthreq > Bandwidthmem
68 68
Layout Types- Type I
W
A
M
T
A
M
T
A
M
T
…
A
M
T
A
M
T
Each PE implements
ADD/SUB and MUL
ORN
A
M
T
A
M
T
A
M
H
M
A
: ADD/SUB
T
: Transfer Unit
…
A
M
T
A
M
T
ORN
: MUL
A
T
M
T
A
M
T
A
M
T
…
A
M
T
ADD/SUM
A
T
M MUL
TU
.
.
.
ORN
Total No. of PEs= WA * H
M
T
A
M
T
A
M
T
…
A
M
T
A
M
T
Total Area= W*H* [Area(MUL)+Area(ADD/SUB)+ Area(TU)]+ Area(ORNs)
Kyushu University
WAHA 2009
69
Layout Types- Type II
W
A T
M T
Each PE implements
ADD/SUB or MUL
A T
…
A T
M T
Each PE implements
ADD/SUB or MUL
ORN
A T
M T
A T
…
A T
M T
…
A T
MUL
M T
…
A T
M T
ORN
ADD/SUMA T
TU M T
A T
TU
.
.
.
H
ORN
Total No. of PEs= W * H
A T
M T
A T
Total Area= ½* W*H*[Area(MUL)+Area(TU)]+
½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)
Kyushu University
WAHA 2009
70
Layout Types- Type III
Each PE implements
ADD/SUB or MUL
W
M T
M T
M T
…
M T
M T
Each PE implements
ADD/SUB or MUL
ORN
ADD/SUMA T
TU
A T
A T
…
A T
A T
…
M T
MUL
M T
…
A T
M T
ORN
M T
M T
H
M T
TU
.
.
.
ORN
Total No. of PEs= W * H
A T
M T
A T
Total Area= ½* W*H*[Area(MUL)+Area(TU)]+
½*W*H*[Area(ADD/SUB)+Area(TU)]+ Area (ORNs)
Kyushu University
WAHA 2009
71
CB Various Functionalities

CB




two inputs/outputs
four cases are possible
reconfigurable
1/2CB



one input/ two outputs
four cases are possible
reconfigurable
CB
00
01
½CB
10
Kyushu University
11
00
WAHA 2009
01
10
11
72
An ORN Structure
CB
T
CB
½CB
T
large number of Josephson junctions
M number of ½ CB and (2×M+1)×MCL
number of CB


The number of FPUs is M, the number of Transfer
Units (T) is also M;
MCL is a maximum connection length if we consider
FPUs only
=>
 ½ CB – 2×M
 T2 – (M+4×MCL+2)
 CB – (2×MCL+1) ×(4×M-1)
CB: 351 JJs
½ CB: 216 JJs
Reduction of the number of
Josephson junctions is essential!
* T2 is a 2-bit shift register
CB
T2
CB
T2
T2
FPU
FPU
½CB
CB
CB
CB
CB
T
T
CB
CB
½CB
T2
“–”:
CB
CB
½CB
CB
FPU
FPU
CB
CB
T
T
CB
CB
½CB
T2
CB
CB
½CB
CB
FPU
FPU
CB
CB
T

CB
CB
½CB
T
CB
CB
½CB
T2
FPU
FPU
CB
CB
T
CB
CB
CB
½CB
scalable
 pipelined
 easily re-designed for any number of N and M
CB
CB
“+”:

FPU
FPU
½CB

T2
T2
T
T2
73
Kyushu University
2009
A. Fujimaki, et al., Demonstration of anWAHA
SFQ-Based
Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.
Reconfiguration Mechanism
Input-B Input-A Input-C
PE
Immediate
Register (64b)
Immediate
Register
Transfer
Unit
MUX
Initial
State
FU
(A op B)
Conf. Reg.
[bit]
idle
End of
Execution
・・・
Starting of
Reconfiguration
Reconfiguration
・・・
ORN
ORN
Execution
log(2x (2MCL+1)) x 3 [b]
Immediate
PE & ORN reconfiguration structure
PE
Starting of
Execution
wait
M
U
X
Imm.
Reg.
Imm.
Reg.
FU
・・・
M
U
X
TU
MUX
Imm.
Reg.
TU
・・・
FU
・・・
ORN
MUX
TU
Imm.
Reg.
FU
・・・
MUX
End of
Reconfiguration
ORN
MUX
State Diagram
TU
FU
ORN
・・・
ORN
Reconfiguration in LSRDP
Kyushu University
WAHA 2009
74
LSRDP Design Procedure
Start
Choosing a design parameter
based on the priority of
parameters
Mapping DFGs onto the LSRDP
(placement, routing. I/O
positioning)
Obtaining required statistics
from the mapping results
Analyzing the mapping results
Choosing the appropriate value
of the respective design
parameter
Kyushu University
WAHA 2009
75
10TFLOPS SFQ-RDP computer
4.2 K
2TB memory module
(FB-DIMM
[DDR3@1333MHz, 128GB]
×16 modules)
CMOS
CPU
(1chip)
SFQ 0.5um process
ORN
...
FPU
ORN
:
:
:
:
SFQ RDP
(32FPU×32chips)
(4GFLOPS/FPU)
...
ORN
SFQ Streaming Buffer
(64Kb×2chips)
...
ORN
SB
:
:
:
SMAC
SMAC
...
:
1024FPU@MCM
(34chips)×4MCM
SMAC
Kyushu University
Memory band width per MCM:256GB/s
(=16GB/s ×16 channels)
WAHA 2009
76
Power Consumption Comparison
for 10TFLOPS computers
MPU
Mem.
HD
CMOS RISC
(90 nm)
125 kW
12.5 kW
5 KW
SFQ RDP
(0.5 um)
3W
+3.3 W
250 W
100 W
Kyushu University
WAHA 2009
Freezer
1 kW
Air
Conditioning
Total
43 kW
186 kW
0.1 kW
1.5 kW
77
Power Consumption / Performance
Performance
(GFlops)
Power
Consumption (W)
Power Cons./
Perf.(W/GFlops)
SFQ-RDP
~10K
~1.5K
~0.15
0.5 um,
Whole system
SFQ-RDP
~10K
~6.3 (MPU)
~0.63*10^(-3)
0.5 um, MPU
GRAPE-DR
512
65
0.12
Chip
CSX600
25
10
0.40
ClearSpeed Chip
Cell
192 (single)
32
0.17
Inside Chip, SPE core
Cell (eDP)
1.33PF(#12960)?
12960*110W?
GeForce8800GTX
518 (single)
150
SX-9
CMOS RISC
~10K
(90nm)
Kyushu University
~186K
WAHA 2009
Roadrunner
0.29
Chip
18.3
512 nodes whole
system: 15.4MW
~18.6
78