非最短経路を用いたチップ内 ネットワーク向け経路

Non-Minimal Routing Strategy for
Application-Specific Networks-on-Chips
Hiroki Matsutani
Michihiro Koibuchi
Yutaka Yamada
Jouraku Akiya
Hideharu Amano
Keio Univ.
National Institute of Informatics
Toshiba RDC
Keio Univ.
Keio Univ.
Network-on-Chip (NoC)
• Tile-based Multi-Core
– Core: Execution
– Router: Packet delivery
• RAW
1
2
3
4
5
6
7
8
[Furtek, FPL2004]
– Tree
• aSoC
0
[Taylor, Micro2002]
– 2D Mesh
• ACM
Tile (RISC, RAM, I/O)
[Liang, TVLSI2004]
– 2D Mesh
Network-on-bChip (NoC)
• Tile-based Multi-Core
– Core: Execution
– Router: Packet delivery
• RAW
[Taylor, Micro2002]
– 2D Mesh
• ACM
[Furtek, FPL2004]
– Tree
• aSoC
MIPS
Memory
[Liang, TVLSI2004]
– 2D Mesh
Router
Network-on-Chip (NoC)
○ Advantage
• Better Wiring Delay
– Global wiring
– Limited-length Links
• Improve Modularity
– Standard Network I/F
× Drawback
Tile (RISC, RAM, I/O)
0
1
2
3
4
5
6
7
8
• Overhead
SoC is growing!  NoC is one of Scalable on-chip interconnects
Stream Processing ~Simulation~
• MPEG, JPEG, Viterbi
– System Level Design
• No Clock for execution
Data
Module(a)
• Communication is cycle
accurate
Data
Module(a)
Clock
Module(b)
UTF Model
UnTimed Functional
Module(b)
High Abstraction
BCA Model
Bus Cycle Accurate
RTL Model
Detail Design
Application
is divided into some Tasks based on Simulation.
Stream Processing ~Map, Route~
• Shared Links
– Link Congestion  Throughput is degraded
• Optimization (in general)
– Mapping: Minimum Communication Length
Strong access locality !!
– Routing: Minimal Paths
Physical Tile of NoC
(2)
(1)
(2)
(2)
(3)
(4)
(1)
(2)
(2)
(2)
(4)
(3)
Graph
Too Task
shortFlow
to distribute
path congestion by Minimal paths.
Existing Routing ~Is non-minimal path useful?~
Common feature of SAN & NoC
• Packet delivery
• Deadlock freedom
– WH Switching
Feature of SAN
• Various applications,
Various traffic patterns
– Non-minimal paths
make unstable state
– Turn-Model, …
Feature of NoC
[Ho, HPCA2003]
• Fixed application,
Fixed traffic patterns
– System level simulation
Predictable communication  Load balancing with non-minimal
Flee ~Non-minimal routing strategy~
• Stream processing in NoCs
– Strong access locality !!
– Too short to distribute path congestions
Increase # of alternative paths
by introducing non-minimal paths
• Partially non-minimal paths
Non-minimal paths are basically inefficient…
• Path establishment based on Traffic Amount
– Heavy Traffic Comm.  Minimal Path
– Light Traffic Comm.  Avoiding Congestion
Flee ~Traffic pattern Analysis~
Traffic Pattern
Analysis Record
# time, src, dst, size
10000 (0) (1) 32
# srcdst, TotalSize
Traffic Analysis
Heavy!
(0)  (1)
8192
10000 (0) (2)
4
(1)  (2)
8192
10000 (0) (3)
4
(2)  (3)
8192
(0)  (2)
1024
(0)  (3)
1024
10010 (1) (2) 32
10010 (0) (1) 32
10010 (0) (2)
4
10010 (0) (3)
4
10020 (2) (3) 32
10020 (1) (2) 32
10030 (2) (3)
4
1. For each src-dst pair,
–Totalize packet size
E.g., src-dst pair(0,1)
32 + 32  64
2. Sorting in
descending order
…
Src-dst pair with largest
TotalSize is in first line
–In order of TotalSize
Each src-dst pair gets a path in order of Analysis Record.
Flee ~Establishing Paths~
• In order of Traffic Amount:
There will be several
alternative paths …
– Search for lowest cost path
– Increase the cost of links selected
 Link with high cost is hotspot …
# srcdst, TotalSize
(0)  (1)
8192
(1)  (2)
8192
(2)  (3)
8192
(0)  (2)
1024
(0)  (3)
1024
Each link has “Cost”
(0)
(1)
(2)
(3)
…
解析結果
Analysis
Analysis
Analysis
Record
Record
Record
Paths are assigned not to disturb previously established paths
Simulation Environments
• Router Model
– 4 ports for adj. Routers
– 1 port for Core
• Network Topology
– 4×4 Mesh
– 4×4 Torus
Packet size
259 flit (2 flit header)
Switching method
Wormhole switching
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
16 node 2D mesh
# of Virtual channels Mesh:1, Torus:2
Router
Simulation time
Core
1,000,000 cycle
15
Applications for Evaluation
• App. Traces
– Viterbi Decoder
– JPEG Codec
– IPsec
– Uniform
(0)
Header
Analysis
(1)
Huffman
Decode
(2)
Inverse
Quant.
(3)
I-DCT
for Row
(4)
(5)
Yuv-rgb
Convert
(6)
MCU
Mapping
(7)
I-DCT
for Col
(8)
Rgb-yuv
Convert
(9)
MCU
Samping
(10)
I-DCT
for Col
(11)
I-DCT
for Row
(12)
(13)
Stream
Gen.
(14)
Huffman
Code
(15)
Quant.
Tile mapping example of JPEG Codec
(
for Decoder,
for Encoder)
Results ~Viterbi @ 2D Mesh~
• Flee
• DOR
– Avg Hop count:2.52
(Dimension-Order Routing)
– Avg Hop count:1.84
Y-axis: Latency [cycle]
Communication in Viterbi trace includes Fork and Join.
14.2% Improved
X-axis:Accepted Traffic [flit/cycle/node]
Results ~Viterbi @ 2D Torus~
• Flee
• DOR
– Avg Hop count:1.87
(Dimension-Order Routing)
– Avg Hop count:1.48
Y-axis: Latency [cycle]
Communication in Viterbi trace includes Fork and Join.
22.2% Improved
X-axis:Accepted Traffic [flit/cycle/node]
Flee improves 22.2% of throughput with non-minimal paths.
Results ~JPEG @ 2D Mesh~
• Flee
• DOR
– Avg Hop count:1.01
(Dimension-Order Routing)
– Avg Hop count:1.00
Y-axis: Latency [cycle]
In JPEG trace, data is sequentially process. No fork and join pattern.
No difference
X-axis:Accepted Traffic [flit/cycle/node]
Communication is between neighbors  No need non-minimal
Results ~Effect of Traffic Analysis~
Viterbi @ 2D Mesh
• Flee
• Flee (Incomplete)
– Known data amount
– Unknown data amount
Y-axis: Latency [cycle]
 All data transfer size is “1”
Incomplete Flee:
Not Improved
X-axis:Accepted Traffic [flit/cycle/node]
Results ~Effect of Traffic Analysis~
Viterbi @ 2D Torus
• Flee
• Flee (Incomplete)
– Known data amount
– Unknown data amount
Y-axis: Latency [cycle]
 All data transfer size is “1”
Incomplete Flee:
Partially Improved
X-axis:Accepted Traffic [flit/cycle/node]
Communication size is key factor to improve performance.
Summary ~Non-minimal routing strategy~
• Stream Processing in NoCs
– Strong access locality !!
– Too short to distribute path congestions
Increase # of alternative paths
by introducing non-minimal paths
• Flee: Non-minimal routing strategy
– Heavy Traffic Comm.  Minimal Paths
– Light Traffic Comm.  Avoiding Congestions
• Improve 22.2% of Throughput
Thank you for your listening