Document

Optimizing the Architecture of
SFQ-RDP (Single Flux QuantumReconfigurable Datapath)
F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*
and K. Murakami*
*Graduate School of Information Science and Electrical
Engineering, Kyushu University, Japan
**Institute of Systems, Information Technologies and
Nanotechnologies (ISIT), Fukuoka, Japan
E-mail: [email protected]
CREST-JST (2006~): Low-power,
high-performance, reconfigurable processor
using single-flux quantum circuits
Yokohama National Univ.
SFQ-FPU chip, cell library
Kyushu Univ.
Prof. N. Yoshikawa et al.
Nagoya Univ.
SFQ-RDP chip, cell library,
and wiring
Prof. A. Fujimaki et al.
SFQ-LSRDP
Architecture, Compiler
and Applications
Prof. K. Murakami
Dr. K. Inoue
Dr. H. Honda
Dr. F. Mehdipour
H. Kataoka
Nagoya Univ.
CAD for logic design
and arithmetic circuits
Superconducting
Research Lab. (SRL)
Prof. N. Takagi (Leader)
et al.
SFQ process
Dr. S. Nagasawa et al.
Kyushu University
SSV 2009
2
Agenda
 Introduction
 Large-Scale Reconfigurable Data-Path (LSRDP)
General Architecture and Specifications
 Design Procedure and Tool Chain
 Preliminary Results
 Conclusions and Future Work
Kyushu University
SSV 2009
3
Introduction

For performance improvement various accelerators are used with GPPs


PowerXcell, GPU, GRAPE-DR, ClearSpeed, etc.
Small size and low power consumption comparing to processors with similar
performance
NVIDIA Tesla S1070
http://www.nvidia.com
Kyushu University
SSV 2009
4
Acceleration Through a Data-Path Processor

Mechanism



Acceleration by using a data-path accelerator
Augmenting the accelerator to the base processor
Executes hot portions of applications on the accelerator
Kyushu University
SSV 2009
5
How a Reconfigurable Processor Works
Non-critical code
critical code
GPP
LSRDP
Non-critical code
critical code
Non-critical code
.
.
.
Application code
Kyushu University
SSV 2009
Main
Memory
6
Coupling an Accelerator to a Processor
Tight
Coupling
Processor
RFU
Tight
Coupling
Memory
Coprocessor
Bridge
Attached
Processor
Loose
Coupling
Kyushu University
SSV 2009
7
Motivation
Conventional accelerators:

A large memory bandwidth is demanded in conventional
accelerators for high-performance computation

On chip memories are often used to hide memory access latency
Large-Scale Reconfigurable Data-Path (LSRDP):
• is introduced as an alternative accelerator
• reduces the no. of memory accesses by utilizing data-path
Kyushu University
SSV 2009
8
Outline of Large-Scale Reconfigurable
Data-Path (LSRDP) processor

LSRDP
Reconfigurable data-path includes:
A large number of floating point
Functional Units (FUs)
Arranged as arrays
 Reconfigurable Operand Routing
Network : (ORN)
 Dynamic reconfiguration facilities

FU
GPP
FU
FU
...
FU
ORN : Operand Routing Network
:
:
:
FU
FU
FU
:
...
FU
ORN
FU
FU

FU
...
FU

Features:
SB
:
:
:
...
Data Flow Graphs (DFGs) extracted
from critical calculation parts are
directly mapped
 Pipeline execution
 Burst transfer is used for input /output
rearranged data from/to memory

:
SMAC
Main
Memory
Kyushu University
Scratchpad
Memory
Streaming Buffer (SB) for I/O ports
SSV 2009
9
Single-Flux Quantum (SFQ)
against CMOS

CMOS issues: (if LSRDP has 32x32 FUs)


high electric power consumption
high heat radiation and difficulties in high-density packing
磁束量子
超伝導ループ
Superconductivity
loop

SFQ Features:






Single Flux Quantum
ジョセフソン接合
Josephson
junction
High-speed switching and signal transmission
Low power consumption
Compact implementation of a system (small area)
No cost for latch
Suitable for pipeline processing of data stream
Serial bit-level processing
Kyushu University
SSV 2009
10
Goals of the Project

Discovering appropriate scientific applications

Developing compiler tools

Developing performance analyzing tools
Designing and Implementing SFQ-LSRDP architecture
considering the features and limitations of SFQ circuits
Kyushu University
SSV 2009
11
LSRDP General Architecture and
Specifications
Parameters Should Be Decided
Within the LSRDP Design Procedure
Streaming Buffer (SB)
• Core structure: a rectangular matrix of PEs
...
Operand Routing Network (ORN)
PE1
PE2
PE3
Width
...
• PE: combination of a Functional Unit (FU)
and a data Transfer Unit (TU)
Width and Height ?
...
Height
Maximum Connection Length (MCL)
between consecutive rows?
(impossible to implement full cross bar)
ORN
...
ORN
.
.
.
.
.
.
.
.
.
...
ORN
PEm
Layout: FU types
(ADD/SUB and MUL)?
Reconfiguration mechanism?
(PE, ORN, Immediate data)
...
Streaming Buffer (SB)
Kyushu University
• On-chip memory configuration?
SSV 2009
13
LSRDP Architecture

Processing Elements

FU


implements basic 64-bit double-precision floating point
operations including: ADD, SUB and MUL
TU (transfer unit) as a routing resource for transferring data
from a row to an inconsecutive row
FU
FU TU
FU TU
TU
TU
FU TU
PE including Two components
Four functionalities
Kyushu University
SSV 2009
14
Layout Types- Type I
W
A
M
T
A
M
T
A
M
T
…
A
M
T
A
M
T
Each PE implements
ADD/SUB and MUL
ORN
A
M
T
A
M
T
A
M
H
M
: MUL
…
A
M
T
A
M
T
ORN
A
M
A
T
T
A
M
T
A
M
T
…
A
M
T
ADD/SUB
A
T
M MUL
TU
.
: ADD/SUB
.
T
: Transfer Unit
.
ORN
A
M
T
Kyushu University Flexible
A
M
T
A
M
T
…
A
M
T
A
M
T
but consume
a lot of resources
SSV 2009
15
Layout Types- Type II (Checkered)
W
A T
M T
Each PE implements
ADD/SUB or MUL
A T
…
A T
M T
Each PE implements
ADD/SUB or MUL
ORN
A T
M T
A T
…
A T
M T
…
A T
MUL
M T
…
A T
M T
ORN
ADD/SUBA T
TU M T
A T
TU
.
.
.
H
ORN
A T
Kyushu University
M T
A T
SSV 2009
16
Layout Types- Type III (Striped)
Each PE implements
ADD/SUB or MUL
W
M T
M T
M T
…
M T
M T
Each PE implements
ADD/SUB or MUL
ORN
ADD/SUBA T
TU
A T
A T
…
A T
A T
…
M T
MUL
M T
…
A T
A T
ORN
M T
M T
H
M T
TU
.
.
.
ORN
A T
Type
Kyushu University
A T
A T
II or III, which one
is more efficient?
SSV 2009
17
Maximum Connection Length (MCL)
Connection
Length= 0
(i, 0)
...
Connection
Length= 2
(i,j)
(i,
j+1)
(i,
j+2)
Longest Connection
Length= L
...
...
...
...
ORN
(i+1,
0)
...
(i+1,
j)
(i+1,
j+1)
(i+1,
j+2)
(i+1,
j+L)
MCL:
maximum horizontal distance
between two PEs located in two subsequent rows
Kyushu University
SSV 2009
18
An ORN Structure
ORN
T
FPU
FPU
T
½CB
½CB
½CB
T2
CB
CB
T2
CB
CB
CB
FPU
½CB
CB
CB
CB
T
FPU
T
FPU
T
FPU
½CB
½CB
½CB
½CB
½CB
½CB
CB
CB
CB
CB
CB
CB
T2
CB
T
FPU
T
FPU
T2
CB
CB
CB
CB
CB
CB
CB
CB
CB
T2
T
CB
CB
CB
T2
CB
CB
T2
CB
T2
CB
T
FPU
2bit shift
register
ORN is FPU
consisted FPU
of 2-bit shift registers,
1-by-2 and 2-by-2 cross bar switches
T
T2
CB
CB
Kyushu University
A. Fujimaki, et al., Demonstration of an
SFQ-Based
SSV
2009 Accelerator Prototype for a High-Performance Computer,” ASC08,192008.
Dynamic Reconfiguration Mechanism
Initial
State
idle
End of
Execution
Starting of
Reconfiguration
Reconfiguration
ORN
Execution
Immediate
PE
Starting of
Execution
wait
Kyushu University
SSV 2009
End of
Reconfiguration
20
Dynamic Reconfiguration Architecture
Input-B Input-A Input-C
PE
Immediate
Register (64b)
Immediate
Register
Transfer
Unit
MUX
FU
(A op B)
Conf. Reg.
[bit]
・・・
ORN
・・・
log(2x (2MCL+1)) x 3 [b]
Three bit-stream lines for dynamic reconfiguration of:
• Immediate registers (64bit) in each PE
• Selector bits for muxes selecting the input data of FUs
• Cross-bar switches in ORNs
Kyushu University
SSV 2009
21
Design Procedure and
Tool Chain
Compiler and Design Flow
Application Code
Mapping
Placement
Bit-Stream Generation
Hardware-Software Partitioning
(manual or automatic)
Routing
Configurations
(for LSRDP)
LSRDP Architecture
Critical Parts
(h/w part)
Port positioning
Analyzing DFG mapping
results
DFG Genration
Design Phase
DFGs
DFGs
Non-Critical Parts
(s/w part)
s/w Part Modification
Binary Code
(for GPP)
• DFGs are manually generated from critical parts of applications
• DFG mapping results are used for
• Analyzing LSRDP architecture statistics
• Generating LSRDP configuration bit-streams
Kyushu University
SSV 2009
23
Benchmark Applications
for Design Procedures

Finite differential method calculation of
2nd order partial differential equations




1dim-Heat equation
1dim-Vibration equation
2dim-Poisson equation
(Heat)
(Vibration)
(Poisson)
Quantum chemistry application

Recursive parts of Electron Repulsion Integral calculation
(ERI-Rec)
Only ADD/SUB and MUL operations are used
in the critical calculations of all above applications
Kyushu University
SSV 2009
24
DFG Extraction- Heat Equation

1-dim. heat equation for T(x,t)
T ( x, t )
 2T ( x, t )
A
t
x 2
(A is const.)
T(i-1,j)
T(i,j)
T(i+1,j)
+

Calculation by Finite Difference
Method (FDM)
T ( xi , t j 1 )
D
*
B
*

 D * T ( xi , t j )  B * T ( xi 1 , t j )  T ( xi 1 , t j )

+
T(i,j+1)
Basic DFG can be extended to horizontal
and vertical directions to make a larger DFG
Kyushu University
SSV 2009
Basic DFG corresponding to
minimum FDM calculation
25
Example of extracted DFGs- Heat
Inputs:
Outputs:
Operations:
Immediates:
32
16
721
364
A huge sample DFG (Heat)
Kyushu University
SSV 2009
26
DFG Distribution for each application
# of FUs
1000
ERI-Rec
(8 DFGs)
Vibration (7)
800
24DFGs
Heat (6)
600
400
Poisson (3)
200
0
0
10
20
30
40
50
60
70
# of Inputs
DFGs have different qualities in terms of the
# of FUs, # of Inputs
Kyushu University
SSVand
2009 Outputs
27
DFG Classification
Class
# of FUs
RDP-S
128
# of
# of
Inputs Outputs
19
12
RDP-M
512
19
12
RDP-L
1024
38
24
RDPXL
> 1024
64
52
# of
DFGs
Heat (3)
Poi (1)
Vib (2)
Eri (4)
Heat (1)
Poi (1)
Vib (1)
Eri (4)
Heat (2)
Poi (1)
Vib (2)
Eri (5)
Heat (1)
Poi (1)
Vib (2)
Eri (5)
Totally,
24 DFGs are prepared
for benchmark Apps.
Due to broad range of DFG sizes
DFGs are classified as S, M, L, XL with respect to their size
and the number of Input/Output nodes
=> LSRDP designing processes
for S, M, L, XL, respectively
Kyushu University
SSV 2009
28
Mapping DFGs onto LSRDP
LSRDP
Architecture
Description
DFG
Placing DFG
nodes on LSRDP
Routing
Connections
Placing IO nodes
Routing Inp/Out
Connections
Configuration
File
Kyushu University
Longest connections
SSV 2009
29
LSRDP Design Procedure
DFGs & LSRDP HW constraints
Choosing a design parameter
For each
parameter
Mapping DFGs onto the LSRDP
Obtaining required statistics
Analyzing the mapping results
Choosing the appropriate value
Kyushu University
Appropriate values for all parameters
SSV 2009
30
Preliminary Results
LSRDP Specifications: Width & Height
# of Input
ports
# of Output
ports
Width
Height
LSRDP-S
19
12
16
16
LSRDP-M
19
12
32
16
LSRDP-L
38
24
64
32
LSRDP Dimensions and the number of input/output ports
Kyushu University
SSV 2009
32
LSRDP Specifications: MCL
MCL (Max. Conn. Len.)= 2
(i, j)
...
FU
T
FU
T
FU
T
...
...
FU
T
FU
T
FU
T
FU
T
...
MCL = L
No. of Inputs
=(2xMCL+1)x2
= 10
(i,
j+1)
10 to 3
(i+1,
j+1)
(i+2,
(i+L,
j+1) ・・・ j+1)
No. of Outputs= 3
FU
T
FU
T
FU
T
LSRDP
FU
T
FU
T
FU
T
FU
MCL
(avg/max)
T
...
ORN SizeNo of Inps (avg/max), Outs
LSRDP-S
4/8
18/34, 3
LSRDP-M
5/9
22/38, 3
LSRDP-L
5/9
22/34, 3
Kyushu University
Further MCL optimization
needed
SSV 2009
33
Analyzing Various LSRDP Layouts
Heat
Viration
Poisson
ERI1
ERI2
Layout
I
II
III
I
II
III
I
II
III
I
II
III
I
II
III
Size
8x3
8x3
8x4
10x8
10x8
10x11
10x10
10x12
15x18
6x2
9x3
6x2
10x10
10x10
15x8
(Except ERI1 DFG which gives
better size for Layout III)
Layout II can be used instead of Layout I
to obtain a smaller LSRDP
Kyushu University
SSV 2009
34
LSRDP at One Glance (1/2)
Functional units
ADD/SUB, MUL
Layout
Type II (checker pattern)
Operations
64-bit floating point
Processing structure
Pipelined
PE structure
FU, T, FU+T, T+T
LSRDP Size
Small
Medium
Large
No. of inp/out ports
19/12
19/12
38/24
Width/Height
16/16
32/16
64/32
16*16*64
32*16*64
64*32*64
16*BSS(ORN)
32* BSS(ORN)
64*BSS(ORN)
16*16* 2
32*16*2
64*32* 2
22 , 3
26 , 3
26 , 3
Conf. bit-stream
size
Imm. Regs
ORNs
PEs
ORN
inputs, outputs
Structure
Cross-bar switch
Conn. Type
One-directional
Kyushu University
SSV 2009
35
LSRDP at One Glance (2/2)
Internal memory
External memory
Reconf. mechanism
Kyushu University
Type
Immediate registers
Size and count
64-bit registers,
One reg. for each PE
Communication mechanism
Serial
No. of memory modules
16
Date trans. rate
1800Mbps/pin
Overall data trans. rate
24 GB/s
Mem. to LSRDP bus width
64 bit
Channels per module
Two
Bit serial configuration through a serial chain
SSV 2009
36
Preliminary Performance Evaluation
Base processor configuration
Processor type
Out-of-order
GPP operating frequency
3.2GHz
Inst. issue width
4 instruction/cc
Inst. decode width
4 instruction/cc
Cache configuration
L1 data
64KB(128B Entry, 2way, 2cc)
L1 instruction
64KB(64B Entry, 1way, 1cc)
L2 unified
4MB(128B Entry, 4way, 16cc)
Latency of main memory
300cc
L2 to main memory
Bus width
64 Bytes
Freq
800 MHz
GPP+LSRDP configuration
LSRDP operating frequency
80 GHz
Reconfiguration Latency
1cc
Latency SPM LSRDP latency
1cc
Latency Main Memory SPM
7500cc
Bandwidth SPMLSRDP
Max. 64 * 8 Bytes/cc
Bandwidth Main Memory SPM
102.4GB/sec
GPP： Exec. time measurement by means of a processor simulator
Kyushu University
SSV 2009
LSRDP： Estimation by performance modeling
37
Preliminary Performance Evaluation
(Heat)
Normalized by GPP Exec. Time
0.4
0.35
Re c o n f.
0.3
Comm.
Re ar r an ge
0.25
St all
0.2
LSRDP
GPP
0.15
Basic: SB only
Reuse: SB + SPM
0.1
0.05
0
Basic
Reuse
Heat (M)
Basic
Reuse
Heat (L)
Data reusing is employed to avoid the need for data
rearrangement as well as frequently data retrieval
from
the scratchpad memory.
Kyushu
University
SSV 2009
38
Preliminary Performance Evaluation
(Poisson)
Normalized by GPP Exec. Time
0.4
0.35
Re c o n f.
0.3
Comm.
Re ar r an ge
0.25
St all
LSRDP
0.2
GPP
0.15
0.1
0.05
0
poisson(S)
poisson(M)
poisson(L)
A small fraction is related to processing time on
LSRDP and the main fraction concerns to various
overhead times as well
Kyushu University
SSV as
2009the execution time on GPP
39
Conclusions & Future Work

A high-performance computer comprising an accelerator (LSRDP)
implemented by superconducting circuits was introduced.

24 benchmark Data Flow Graphs (DFGs) were manually generated.

LSRDP micro-architecture is designed based on characteristics of scientific
applications via a quantitative approach.

LSRDP is promising for resolving issues originated from CMOS
technology as well as achieving considerable performances.
Future Work:
•To achieve higher performance it is required to reduce various
overhead costs mainly related to data management part.
•To reduce the implementation cost of LSRDP, we will focus on
reducing maximum connection length and ORN size.
Kyushu University
SSV 2009
40
Acknowledgement
This research was supported
in part by Core Research for Evolutional Science
and Technology (CREST) of Japan Science
and Technology Corporation (JST).
Kyushu University
SSV 2009
41
2x3 RDP processor prototype
ALU Controller

SR_IN
8-bit ALUs implementing:

SR_OUT
ORN1 ALU1
ORN3
ALU3
ORN5
ALU5







ORN2
ALU2 ORN4
ALU4
ORN6
ALU6

ADD, SUB, AND, OR, XOR
25GHz Frequency
6-bit Data transfer shift registers
16-bit I/O shift registers
21 Pipeline stages
7-bit Data width
Area: 6.84 x 6.72 mm2
Total number of Junctions：14040JJs
Bias current： 1.652A
Kyushu University
SSV 2009
Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008
42

Download Report