Tightly-Coupled FPGA Cluster with TERASIC DE5-NET boards
Custom Computing Framework for Real Applications
Kentaro Sano
Sano Laboratory
Graduate School of Information Sciences,
Tohoku University
T.C.F.C
21 Feb, 2014
Sano Lab
Why Tightly-Coupled FPGA Cluster?
Low-power and scalable custom computing with FPGAs
Low-power : dedicated data-paths, memory systems, networks on FPGAs
Scalable : low-latency HW-to-HW direct communication/synchronization via accelerator-domain network (ADN)
Testbed for development and production runs of “real” applications
Qsys-based hardware framework on FPGA
Linux driver, API, FPGA-class library for software development
Research on compilers, tools, and applications
Experience with running an actual system (troubleshooting, etc.)
[Figure: Architecture of tightly-coupled FPGA cluster. Labels: computing node w/ FPGA boards (CPU, DRAM, 4 FPGAs); intra-node network, PCI-Express (x8); general-purpose network; inter-FPGA network (accelerator-domain network, ADN)]
Tightly-Coupled FPGA Cluster Overview
System configuration
4 x host PCs
4 x FPGAs / PC
4 x 10G SFP+ ports / FPGA
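As a quick check of the configuration above, the following minimal sketch multiplies out the three counts from the slide. The function and constant names are illustrative only and are not part of the framework.

```python
# Counts taken from the "System configuration" list on this slide.
HOST_PCS = 4
FPGAS_PER_PC = 4
SFP_PORTS_PER_FPGA = 4


def cluster_totals(hosts, fpgas_per_host, ports_per_fpga):
    """Return the total number of FPGAs and 10G SFP+ ports in the cluster."""
    fpgas = hosts * fpgas_per_host
    ports = fpgas * ports_per_fpga
    return {"fpgas": fpgas, "sfp_ports": ports}


totals = cluster_totals(HOST_PCS, FPGAS_PER_PC, SFP_PORTS_PER_FPGA)
print(totals)  # {'fpgas': 16, 'sfp_ports': 64}
```

The 64 SFP+ ports equal the port count of the 64 port 10GbE switch shown in the photo slides.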
Implementation
Linux on nodes
Qsys framework on FPGAs
[Figure: rack layout with Node 01 to Node 04 (DE5-NET x4 each), 10GbE SW, and UPS]
[Figure: DE5-NET board block diagram]
FPGA board (Stratix V) : ALTERA Stratix V FPGA 5SGXEA7N2F45C2
PCI-Express : PCIe 3.0 x8, 8GB/s (Tx, Rx)
SFP+ 10G Ether : 10G SFP+ A, B, C, D, 10Gbps+ each (Tx, Rx), connected to 10GbE SW
DDR3 memory : DDR3 DRAM A, B, PC3-12800 (DDR3-1600), x64 @ 800MHz (DDR), up to 1066MHz, 12.8 GB/s, 2GB as default (up to 8GB)
QDRII SRAM : QDR II+ SRAM A, B, C, D, 18 Mbits each (20-bit addressing for 18-bit data), x18 @ 500MHz, 1GB/s for read/write
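The bandwidth and capacity figures on this slide can be reproduced with simple arithmetic. The sketch below is only a sanity check; the helper names are made up for illustration. It assumes the standard PCIe 3.0 parameters (8 GT/s per lane, 128b/130b encoding), which the slide does not state explicitly.

```python
def ddr3_peak_gbytes(mega_transfers_per_s, bus_bits):
    """Peak DDR3 bandwidth in GB/s: transfers per second times bus width in bytes."""
    return mega_transfers_per_s * 1e6 * bus_bits / 8 / 1e9


def sram_capacity_mbits(address_bits, data_bits):
    """Capacity in Mbits (2**20 bits) of a memory with the given address and data widths."""
    return (2 ** address_bits) * data_bits / 2 ** 20


def pcie3_gbytes_per_direction(lanes):
    """PCIe 3.0 payload rate in GB/s per direction (assumed 8 GT/s, 128b/130b)."""
    return 8e9 * lanes * (128 / 130) / 8 / 1e9


# DDR3-1600 on a x64 interface (800MHz clock, double data rate).
print(f"DDR3 DRAM : {ddr3_peak_gbytes(1600, 64):.1f} GB/s")
# QDR II+ SRAM with 20-bit addressing for 18-bit data.
print(f"QDR II+   : {sram_capacity_mbits(20, 18):.0f} Mbits")
# PCIe 3.0 x8; the slide rounds the raw 8 GB/s figure.
print(f"PCIe 3 x8 : {pcie3_gbytes_per_direction(8):.2f} GB/s")
```

The first two results match the 12.8 GB/s and 18 Mbits on the slide. The PCIe value comes out slightly below the quoted 8GB/s because the line encoding overhead is subtracted.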
Front and Back
[Photos: front and back of the rack, showing Node 01 to Node 04, the 64 port 10GbE switch (10GbE SW), and the UPS]
More Photos
[Photos: 4 FPGAs on node; temperature sensors; 10GbE SFP+ ports; status LEDs on boards; 64 port 10GbE switch]
Hardware/Software Stack
SW : Application Software
SW : FPGA class, FPGAs class
SW : DMA API, 10G MAC API
SW : PCI-Express & DMA Driver
SW : Linux Kernel
HW (FPGA) : PCIe I/F, DMAs, DDR3 Ctrls, QDRII+ Ctrls, 10G MACs, Application Logic
[Figure: layered stack diagram with a "Developed Layers" label]
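To make the layering concrete, here is a purely hypothetical sketch of how an FPGA class and an FPGAs class could sit on top of a DMA layer. None of the names, methods, or signatures below come from the actual framework. `LoopbackDma` is an in-memory stand-in for the DMA API so that the example runs without hardware.

```python
class LoopbackDma:
    """Hypothetical stand-in for the DMA API: one board's memory as a bytearray."""

    def __init__(self, size_bytes):
        self._mem = bytearray(size_bytes)

    def _check(self, addr, length):
        if addr < 0 or length < 0 or addr + length > len(self._mem):
            raise ValueError("transfer outside board memory")

    def write(self, addr, data):
        self._check(addr, len(data))
        self._mem[addr:addr + len(data)] = data

    def read(self, addr, length):
        self._check(addr, length)
        return bytes(self._mem[addr:addr + length])


class FPGA:
    """Hypothetical per-board class: wraps one DMA backend."""

    def __init__(self, board_id, dma):
        self.board_id = board_id
        self._dma = dma

    def send(self, addr, data):
        self._dma.write(addr, data)

    def receive(self, addr, length):
        return self._dma.read(addr, length)


class FPGAs:
    """Hypothetical collection class: applies an operation to every board in a node."""

    def __init__(self, boards):
        self.boards = list(boards)

    def broadcast(self, addr, data):
        for board in self.boards:
            board.send(addr, data)

    def gather(self, addr, length):
        return [board.receive(addr, length) for board in self.boards]


# Four boards per node, as in the system configuration.
node = FPGAs(FPGA(i, LoopbackDma(1024)) for i in range(4))
node.broadcast(0, b"\x01\x02\x03\x04")
print(node.gather(0, 4))
```

The split mirrors the stack only loosely: application code talks to the class layer, and only the class layer touches the DMA layer, so a real driver-backed DMA object could replace the loopback without changing the application.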
Future Work
Scalable and low-power computation
Parallel fluid simulation with building cube method
Deep learning for image/video recognition
Molecular dynamics simulation
Gene info processing
Further development of framework
OS management of FPGA resources
Stream processor generator : SPGen
[Figure: LBM processing element (LBM PE). Units: ST Splitter; Macro, Equi., Col. Unit (MECU) containing the Macro calc pipeline (MCP) and the Equilibrium Calc & Collision Pipelines (ECPs: ECP 0 to ECP 8), designed with FloPoCo; Translation Unit (TLU); Delay Unit; Boundary Unit (BDU); ST Merger. Streams between units carry 1 word (cell attribute), 3 words, or 9 words. Input: 10 words (from memory or the previous PE). Output: 10 words (to memory or the next PE). Side links: to / from adjacent PE or inter-FPGA transfer unit.]
System tools
Partial reconfiguration support
FPGA-direct communication via PCIe
Inter FPGA communication with SATA cables
Remote DMA among FPGAs
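The LBM PE figure on this slide names a macro calc pipeline and nine equilibrium calc and collision pipelines. As a software illustration of what such pipelines compute per cell, the sketch below uses the common D2Q9 lattice with a single-relaxation-time (BGK) collision. Both choices are assumptions: the slide does not name the lattice or the collision model, and the 9 words per cell merely suggest nine distribution values. The translation and boundary units are not modeled.

```python
# D2Q9 lattice velocities and weights (assumed model, not stated on the slide).
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def macroscopic(f):
    """Density and velocity of one cell (role of a macro calc pipeline)."""
    rho = sum(f)
    ux = sum(fi * ex for fi, (ex, _) in zip(f, E)) / rho
    uy = sum(fi * ey for fi, (_, ey) in zip(f, E)) / rho
    return rho, ux, uy


def equilibrium(rho, ux, uy):
    """Equilibrium distribution, one value per lattice direction."""
    usq = ux * ux + uy * uy
    feq = []
    for (ex, ey), w in zip(E, W):
        eu = ex * ux + ey * uy
        feq.append(w * rho * (1.0 + 3.0 * eu + 4.5 * eu * eu - 1.5 * usq))
    return feq


def collide(f, tau):
    """BGK collision: relax each of the 9 values toward equilibrium."""
    rho, ux, uy = macroscopic(f)
    feq = equilibrium(rho, ux, uy)
    return [fi - (fi - fe) / tau for fi, fe in zip(f, feq)]


# One cell: 9 distribution values (the 10th word, the cell attribute, is omitted).
cell = [0.40, 0.12, 0.10, 0.11, 0.10, 0.03, 0.02, 0.03, 0.03]
updated = collide(cell, tau=0.8)
print(sum(cell), sum(updated))  # density before and after collision
```

Each direction's update depends only on the cell's density, velocity, and its own distribution value, which is why nine independent pipelines can process the nine values in parallel once the macroscopic quantities are available.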