Slides - ASP

A Comprehensive and Accurate Latency Model for
Network-on-Chip Performance Analysis
Zhiliang Qian1, Da-Cheng Juan2, Paul Bogdan3, Chi-Ying Tsui1,
Diana Marculescu2 and Radu Marculescu2
1The
Hong Kong University of Science and Technology, Hong Kong
2Carnegie Mellon University, Pittsburgh, U.S.A
3University of Southern California, Los Angeles, U.S.A
IEEE/ACM ASP-DAC’14, 22 Jan., 2014, Singapore
Outline
 Introduction
 NoC Modeling for Performance Analysis




NoC end-to-end delay calculation
Link dependency analysis
GE-type traffic modeling
Wormhole router based NoC latency model
 Experimental results
 Simulation setup
 Evaluation under synthetic traffic patterns
 Evaluation under realistic benchmarks
 Conclusion
2
Network-on-Chips (NoCs)
 With technology scaling down, more and more components
can be integrated on a single chip.
1.0um
0.25um
Point-to-Point
interconnect
Communication
Network
Shared Bus
Cell
Cell
Cell
Cell
Cell
Cell
Total wire
length
<0.05um
<100cm
FB
FB
FB
IP
IP
FB
FB
<100 meters
IP
NoC
IP
IP
IP
>1Km
 An efficient way to manage the communication of on-chip
resources plays the key role in future system design.
3
NoC design space exploration
 A large design space needs to be explored for an optimal design
 Task mapping, allocation, buffer sizing, routing algorithm etc.
 Accurate and fast performance evaluation is required during the exploration
-> analytical performance evaluation model
vld
70
357
36
2
idct
Application
(traffic pattern,
injection rate)
iQuant
353
Inverse
Scan
16
27
Up
samp
362
Stripe
memory
362
30
0
49
AC/DC
500
VOP
NoC Platform
(topology, router
design)
RLD
313
padding
Router model
AR
M
Task
scheduling/
mapping
Inner loop
Vop
Core
mapping
313
Routing
allocation
16
Traffic analysis
94
Performance
feedback
Performance
evaluation
Outer loop
Simulation/
Prototyping
Design space
exploration
Detailed performance
evaluation
4
Introduction- queuing-theory-based
analytical model
 Queuing-theory-based delay estimation





A
A
Customer (packet) arrival process
System (server) service process
Number of servers
Service discipline (FCFS, Round-robin etc.)
System time and waiting time
Queue
Server
A
Service time
Waiting time
System time
Probability density
Bernoulli injection process
1
f(𝑥𝑡 )=𝜆𝑒 −𝜆𝑥𝑡 (λ = )
𝑇
1/T
Time
5
Queuing-theory-based NoC latency model
 Previous arts and motivation of this work
NoC
latency
model
Previous NoC analytical models
[VLSI 2007]
[TCAD’12,
ICCAD’09]
[TVLSI’13]
This work
[NoCs’11]
Traffic model for the application
Queue
M/M/1
M/G/1/K
G/G/1
M/M/m/K
G/G/1/K
Arrival
Poisson
Poisson
General
Poisson
General
Service
Markov
General
General
Markov
General
Small
𝐵 flits
NoC architecture modeled
Buffer
Small
𝐾 packets
PB ratio1
𝑚 (≫ 1)
<1
Arbitration
Round robin
Round robin
𝐵 flits
arbitrary
𝑚 (≫ 1)
Fixed priority Round robin
arbitrary
Round robin
1 PB ratio is defined as the ratio of average packet size (𝑚 flits) to the buffer depth (𝐵 flits)
6
Outline
 Introduction
 NoC Modeling for Performance Analysis




NoC end-to-end delay calculation
Link dependency analysis
GE-type traffic modeling
Wormhole router based NoC latency model
 Experimental results
 Simulation setup
 Evaluation under synthetic traffic patterns
 Evaluation under realistic benchmarks
 Conclusion
7
Input to the NoC latency model
 The application has been scheduled and mapped onto the NoC.
 A deterministic routing algorithm is used to avoid deadlock.
T11
T1
T1
T10
T11
T10
T2
T2
T7
T9
T7
T8
T3
T4
T5
T6
T9
T8
Task scheduling
and allocation
T3
T4
T5
T6
Mapping and
Floorplan
Path 3
Path 2
Tile A
Path 1
T1
Routing
algorithm
design
Input to the
latency
model
Tile A
T3
𝑃𝑓 : Set of links in the
routing path
(𝜆𝑓 , 𝐶𝑓 ): GE-type traffic
model
Tile B
Tile B
Performance metrics: average latency 𝐿 = (
𝑓∈𝐹 𝜆𝑓
× 𝐿𝑓 )/
𝑓∈𝐹 𝜆𝑓
8
NoC end-to-end delay calculation
 The end-to-end flow latency 𝑳𝒔,𝒅 of a specific flow 𝒇𝒔,𝒅
consists of three parts: 𝐿𝑠,𝑑 = 𝑣𝑠 + 𝜂𝑠,𝑑 + ℎ𝑠,𝑑
 The queuing time at the source 𝑣𝑠
 The packet transfer time in the path 𝜂𝑠,𝑑 = 𝑚 + 1 +
 The path acquisition time ℎ𝑠,𝑑 =
𝑑𝑓
ℎ
𝑖=1 𝑙 𝑓 𝑖
f
8
f
2
l1
f0,8
0
ηl2
l
1
2
f
4
l3
(a)
3
4
5
0
5
l2
R1
R2
v0
7
1
𝑑𝑓
𝜂
𝑖=1 𝑙 𝑓 𝑖
hl2
ql2
f
4
l
6
(b)
7
Local
current flow
8
f
5
l
contention flow
(
(c)
9
Outline
 Introduction
 NoC Modeling for Performance Analysis




NoC end-to-end delay calculation
Link dependency analysis
GE-type traffic modeling
Wormhole router based NoC latency model
 Experimental results
 Simulation setup
 Evaluation under synthetic traffic patterns
 Evaluation under realistic benchmarks
 Conclusion
10
Link dependency graph
 Building the channel (link) dependency graph
L2_5
C4
C3
L0_3
P1
P3
P2
P4
L4_7
L6_7
L7_6
L4_1
L1_4
L1_2
L5_2
L5_4
L4_5
L3_6
L4_7
L4_5
L4_3
L6_3
P6
Core Communication graph
L1_2
L1_0
L2_1
L1_4 L4_1
L3_0
L2_5 L5_2
L3_4
L3_6
C6
L0_1
P0
C2
C7
L2_1
Routing path
Traffic model
Edge in the
dependency graph
L5_4
P5
L7_4 L5_8
P7
L7_8
L8_7
NOC mesh
L8_5
P8
L3_4
L7_6
L6_3
L4_3
L6_7
L7_4
Channel Dependency Graph (CDG)
of the application communication
11
Link dependency analysis
 Topological sort algorithm is applied on the obtained CDG to
find out the proper order to analyze the queuing delays.
A sample channel dependency graph
Source
queuing 𝑣
Three
parameters:
𝜂, ℎ, 𝑠
12
Outline
 Introduction
 NoC Modeling for Performance Analysis




NoC end-to-end delay calculation
Link dependency analysis
GE-type traffic modeling
Wormhole router based NoC latency model
 Experimental results
 Simulation setup
 Evaluation under synthetic traffic patterns
 Evaluation under realistic benchmarks
 Conclusion
13
Modeling the bursty traffic input
Exponential branch
τ
M
1-τ
Direct branch
Packet generating point
0.07
Packet injection rate (packets/cycle)
 Generalized exponential (GE) distribution
Poisson injection (CV=1)
GE-type injection (CV=3)
GE-type injection (CV=4)
0.06
0.05
0.04
0.03
0.02
0.01
0
10
20
30
40
50
60
70
Time horizontal index
80
90
100
14
GE-type traffic modeling
 The GE-type cumulative distribution function (cdf) of interarrival time is:
𝐹 𝑡 = 𝑃 𝑋 ≤ 𝑡 = 1 − 𝜏𝑒 −𝜏𝜆𝑡 ,
Where the parameter 𝜏 =
2
1+𝐶 2
𝑡≥0
and 𝐶 2 is the square coefficient of variation
 In this work, we use the GE distribution to model the traffic input of
each flow, which is characterized by two parameters:
 𝜆 : the average packet arrival rate (packets/cycle)
𝜎
 𝐶: the coefficient of variation of this traffic flow, i.e., 𝐶 = , where 𝜎 is the
𝜆
standard derivation of the packet inter-arrival times.
 Accordingly, the GE/G/1/K queuing model is used to analyze
the channel waiting time by considering the traffic burstness.
15
Outline
 Introduction
 NoC Modeling for Performance Analysis




NoC end-to-end delay calculation
Link dependency analysis
GE-type traffic modeling
Wormhole router based NoC latency model
 Experimental results
 Simulation setup
 Evaluation under synthetic traffic patterns
 Evaluation under realistic benchmarks
 Conclusion
16
Flit transfer time 𝜼 calculation
 The flit transfer time 𝜼 of link 𝒍𝒂𝒃 is defined as the time taken
for the header flit after being granted link access to reach the
buffer front in link 𝒍𝒂𝒃
contention flow
ηli
S
S
S
A Packet
(m flits)
Crossbar
S
Any flow
𝑓 passing
the link
𝑙𝑎𝑏
current link li
Mean flit rate arriving at the buffer is: 𝜆𝑙𝑎𝑏 = 𝑚 × 𝜆𝑝𝑎𝑐𝑘𝑒𝑡 𝑙𝑎𝑏 = 𝑚 ×
𝑓∈𝐹𝑙𝑎𝑏 𝜆𝑓
Mean time to serve a flit in this queuing system is the weighted average of the service
ℎ𝑓
Mean
𝑙
𝑖+1
𝜆𝑓 × 𝑚 +1
∀𝑓∈𝐹𝑙
service
𝑙𝑎𝑏
𝑎𝑏
time from all flows passing through 𝑙𝑎𝑏 : 𝑠𝑓𝑙𝑖𝑡 =
time for
𝜆𝑓
∀𝑓∈𝐹𝑙
𝑎𝑏
flow 𝑓
17
Illustration of path acquisition time
 Service time in wormhole NoC to obtain 𝒉:
Waiting time W includes: 1) the link contention time H and 2) the time for
the header flit to reach the buffer head Q
H
W
Link contention
S
S
Finte size buffer
Q
M/G/1/K+v queue
Service time S is bounded by the time where the header reaches the node that
the accumulated buffer spaces between can hold the whole worm packet.
contention flow
Point A
ηli
Point D
ηli+1+hli+1 ηli+2+hli+2
S
Crossbar
Point C
current link li
S
Point B
hli+3
S
zli+3
S
A Packet
(m flits)
18
Path acquisition time 𝒉 calculation
 Number of effective subsequent links of link 𝒍 with respect to the path 𝒑:
𝑚
𝑚
𝑖𝑓 𝑟 𝑝, 𝑙 >
𝛬𝑝 𝑙 =
𝐵
𝐵
𝑟 𝑝, 𝑙
𝑒𝑙𝑠𝑒𝑤𝑖𝑠𝑒
where 𝑟(𝑝, 𝑙) is the function returns the number of remaining hops from link 𝑙 towards the
destination of path 𝑝.
 The service time of link 𝒍 with respect to the path 𝒑 is [1]:
𝑓
𝑠𝑙𝑓 =
𝑖
𝑓
𝑓
𝑓
𝑚 𝑚 + 𝑥𝑙𝑖 + 2𝑥𝑙𝑖 𝑚 /(𝑚 + 2𝑥𝑙𝑖 ) 𝑖𝑓 𝑥𝑙𝑖 < 𝑚
𝑚 𝑚+
𝑓
𝑥𝑙𝑖
+2
𝑓
𝑥𝑙𝑖
2
𝑓
/(𝑚 + 2𝑥𝑙𝑖 ) 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
 The channel service time of link 𝒍 :
𝑠𝑙𝑎𝑏 =
𝐶𝑠 2𝑙 =
𝑎𝑏
𝑠𝑙2𝑎𝑏
𝑠𝑙𝑎𝑏
∀𝑓∈𝐹𝑙𝑎𝑏 (𝜆𝑓
2−1=(
× 𝑠𝑙𝑓 )/
𝑖
∀𝑓∈𝐹𝑙𝑎𝑏 𝜆𝑓
∀𝑓∈𝐹𝑙𝑎𝑏
∀𝑓∈𝐹𝑙𝑎𝑏 𝜆𝑓
× 𝑠 2𝑓
𝑙𝑖
𝜆𝑓
)/ 𝑠𝑙𝑎𝑏
2
−1
[1] P.-C. Hu, L. Kleinrock, An Analytical model for wormhole routing with finite size input Buffers,
15th International Telegraphic Congress, 1998
19
GE/G/1/K queue based 𝒉 calculation
 Diffusion approximation for the steady state distribution probability 𝑷𝒏 of the
M/G/1/K queue with arrival rate 𝝀 and service rate 𝝁 :
𝑐 × 𝑝′𝑛 (0 ≤ 𝑛 ≤ 𝐾)
𝜆
𝑃𝑛 =
1−
1−𝑐 1−𝜇
𝜆
𝜇
𝜆
𝜇
(𝑛 = 𝐾 + 1)
Where the normalization constant 𝑐 = (1 − (1 −
𝐾
−1
𝑗=0 𝑃𝑛 ))
and 𝑝′𝑛 is the steady
state probability of M/G/1/∞ queue [2]
 Applying Little’s formula to obtain the waiting time : ℎ𝑙 ′ = (
𝐾+1
𝑖=1 𝑖
× 𝑃𝑖 )/ 𝜆
 Taking the arrival traffic burstiness in GE/G/1/K model by refining the results of
M/G/1/K queue:
ℎ𝑙𝑎𝑏 =
(𝐶𝑠 2𝑙 +𝐶𝑎 2𝑙 )
𝑎𝑏
𝑎𝑏
ℎ′ 𝑙𝑎𝑏
2
(1+𝐶𝑠 𝑙 )
𝑎𝑏
[2] M.C. Lai, et.al. An accurate and efficient performance analysis approach based on queuing
model for Network on Chip. In Proceedings of ICCAD,2009
20
Source queuing time 𝒗𝒔
 The source queue is modeled as a GE/G/1/∞ system:
𝑠𝑙𝑠
𝑣𝑠 =
2
𝑠𝑙𝑠 − 𝑚
2
𝐶𝑎 +𝜆𝑎 ×
𝑠𝑙𝑠
1+
1 − 𝜆𝑎 × 𝑠𝑙𝑠
2
− 𝑠𝑙𝑠
where the arrival process is characterized by (λa , Ca2 ) in the GE type traffic
model and the service time at source is represented as 𝑠𝑙𝑠 .
Packet
source
Open loop source queuing
measurement setup
Input timing
Infinite
source buffer
Output timing
Network
on chips
Application traffic generation trace
Traffic terminal (source and sink)
Traffic terminal (source and sink)
21
Proposed NoC latency analysis flow
 Link dependency
analysis to obtain the
link order 𝑮
 For each link 𝒍𝒂𝒃 in 𝑮:
 Calculate the flit transfer
time 𝜂
 Calculate the link service
time 𝑠
 Compute the path
acquisition time ℎ
 Calculate the source
queuing time 𝑣
 Form the latency for
each flow in application
1: foreach 𝒍𝒂𝒃 ∈ 𝑮
2: (𝝀𝒍𝒂𝒃 , 𝑪𝒂 𝟐𝒍 )= traffic_model (𝑭𝒍𝒂𝒃 )
𝒂𝒃
3:
4:
𝒍
𝒂𝒃
𝜼𝒍𝒂𝒃 = calculate_transfer_time(𝝀𝒍𝒂𝒃 ,𝒔𝒇𝒍𝒊𝒕
, 𝒎, 𝑩)
𝒇
foreach 𝒇 ∈ 𝑭𝒍𝒂𝒃 and 𝒍𝒊 = 𝒍𝒂𝒃
𝒇
5:
𝒔𝒍𝒊 = calculate_link_service_time ( )
6:
end
7: ( 𝒔𝒍𝒂𝒃 , 𝑪𝒔 𝟐𝒍 ) = service_time ( )
𝒂𝒃
8: if 𝒂 ≠ 𝒃 // the links between the routers
9: 𝒉𝒍𝒂𝒃 = GE_G_1_K_queue (𝝀𝒍𝒂𝒃 , 𝑪𝒂 𝟐𝒍 , 𝒔𝒍𝒂𝒃 , 𝑪𝟐𝒍𝒂𝒃 , 𝒌)
𝒂𝒃
10: else // the link is the source link
11: 𝒗𝒂 = GE_G_1 _queue (𝝀𝒍𝒂𝒃 , 𝑪𝒂 𝟐𝒍 , 𝒔𝒍𝒂𝒃 , 𝑪𝒔 𝟐𝒍 )
𝒂𝒃
𝒂𝒃
12: endif
13:endfor
14: foreach 𝒇 ∈ 𝑭
15: 𝑳𝒔,𝒅 =calculate_flow_latency( )
16: end
22
Outline
 Introduction
 NoC Modeling for Performance Analysis




NoC end-to-end delay calculation
Link dependency analysis
GE-type traffic modeling
Wormhole router based NoC latency model
 Experimental results
 Simulation setup
 Evaluation under synthetic traffic patterns
 Evaluation under realistic benchmarks
 Conclusion
23
Simulation setup
 The proposed analytical latency model is implemented in
MATLAB and its accuracy is compared with Booksim simulator.
 Each router takes two cycles to route a flit and the link traversal
stage takes an additional one cycle.
 Different buffer depth (𝑩 flits) and packet length (𝒎 flits)
combinations are evaluated.
 Both synthetic and real applications are adopted:





Random and shuffle traffic on 8 × 8 and 12 × 12 meshes
MMS (Multimedia system)
DVOPD (Video object plane decoder)
MPEG4 (MPEG decoder)
SPECweb99 applications
24
Evaluation under random traffic patterns
800
800
700
Latency (cycles)
600
500
8 × 8 𝑚𝑒𝑠ℎ, 𝑅𝑎𝑛𝑑𝑜𝑚
400
300
600
500
300
200
100
100
0
0.01
0.02
0.03
0.04
0.05
0.06
Packet injection rate (packets/cycle)
0.07
12 × 12 𝑚𝑒𝑠ℎ, 𝑅𝑎𝑛𝑑𝑜𝑚
400
200
0
Simulation (B=3,m=14)
Proposed model (B=3,m=14)
Simulation (B=4,m=9)
Proposed model (B=4,m=9)
Simulation(B=9,m=4)
Proposed model (B=9,m=4)
700
Latency (cycles)
Simulation (B=3,m=14)
Proposed model (B=3,m=14)
Simulation (B=4,m=9)
Proposed model (B=4,m=9)
Simulation(B=9,m=4)
Proposed model (B=9,m=4)
0
0
0.01
0.02
0.03
0.04
Packet injection rate (packets/cycle)
0.05
 The proposed latency model works for a variety of buffer depth
and packet size combinations.
 For random traffic, about 5.2%-9.9% errors are introduced in
predicting the network saturation point.
25
Evaluation under shuffle traffic patterns
800
800
700
Latency (cycles)
600
500
8 × 8 𝑚𝑒𝑠ℎ, 𝑠ℎ𝑢𝑓𝑓𝑙𝑒
400
300
600
200
500
12 × 12 𝑚𝑒𝑠ℎ, 𝑠ℎ𝑢𝑓𝑓𝑙𝑒
400
300
200
100
0
Simulation (B=3,m=14)
Proposed model (B=3,m=14)
Simulation (B=4,m=9)
Proposed model (B=4,m=9)
Simulation(B=9,m=4)
Proposed model (B=9,m=4)
700
Latency (cycles)
Simulation (B=3,m=14)
Proposed model (B=3,m=14)
Simulation (B=4,m=9)
Proposed model (B=4,m=9)
Simulation(B=9,m=4)
Proposed model (B=9,m=4)
100
0
0.01
0.02
0.03
0.04
Packet injection rate (packets/cycle)
0.05
0
0
0.002
0.004
0.006
0.008
0.01
Packet injection rate (packets/cycle)
0.012
 For the traffic patterns such as shuffle, a little larger error
(10.8%-13%) is introduced due to the uneven traffic arrival
rates across the channels.
 Overall, the analytical model achieves 70X speedup over the
simulations for both traffic patterns.
26
Evaluation under burst and real traffic
 Comparison of Poisson and GE-type traffic injection:
1000
Poisson injection (CV=1)
GE-type injection (CV=3)
GE-type injection (CV=4)
0.06
0.05
Latency (cycles)
Packet injection rate (packets/cycle)
0.07
0.04
0.03
0.02
Simulation (Poisson injection, CV=1)
Simulation (GE-type injection, CV=3)
Simulation (GE-type injection, CV=4)
Proposed model (Poisson injection, CV=1)
Proposed model (GE-type injection, CV=3)
Proposed model (GE-type injection, CV=4)
800
600
400
200
0.01
0
0
10
20
30
40
50
60
70
Time horizontal index
80
90
100
0.01
0.015
0.02
0.025
0.03
0.035
0.04
Packet injection rate (packets/cycle)
 Evaluation under real application traces:
80
60
70
End-to-end delay (cycles)
50
End-to-end delay (cycles)
Simulation
Simulation
Proposed model
40
30
20
10
0
0
Proposed model
60
50
40
30
20
10
5
10
15
20
Flow index in DVOPD application
25
30
0
DVOPD VOPD
MMS MPEG4 Apache Ocean Oracle DVB2 Sparse
27
Outline
 Introduction
 NoC Modeling for Performance Analysis




NoC end-to-end delay calculation
Link dependency analysis
GE-type traffic modeling
Wormhole router based NoC latency model
 Experimental results
 Simulation setup
 Evaluation under synthetic traffic patterns
 Evaluation under realistic benchmarks
 Conclusion
28
Conclusion
 In this work, we propose a new NoC latency model which
generalizes the previous work by modeling:
 The arrival traffic burstiness
 The general service time distribution
 The finite buffer depth and arbitrary packet length combinations
 A link dependency analysis technique is proposed to
determine the order of applying queuing analysis
 The accuracy of the model is demonstrated using both the
synthetic traffic and real applications.
 A 70X speedup over simulation is achieved with less than 13%
error in the proposed analytical model, which benefit the NoC
synthesis process.
29
 Thank you!!
 Q&A
30