A Comprehensive and Accurate Latency Model for Network-on-Chip Performance Analysis Zhiliang Qian1, Da-Cheng Juan2, Paul Bogdan3, Chi-Ying Tsui1, Diana Marculescu2 and Radu Marculescu2 1The Hong Kong University of Science and Technology, Hong Kong 2Carnegie Mellon University, Pittsburgh, U.S.A 3University of Southern California, Los Angeles, U.S.A IEEE/ACM ASP-DAC’14, 22 Jan., 2014, Singapore Outline Introduction NoC Modeling for Performance Analysis NoC end-to-end delay calculation Link dependency analysis GE-type traffic modeling Wormhole router based NoC latency model Experimental results Simulation setup Evaluation under synthetic traffic patterns Evaluation under realistic benchmarks Conclusion 2 Network-on-Chips (NoCs) With technology scaling down, more and more components can be integrated on a single chip. 1.0um 0.25um Point-to-Point interconnect Communication Network Shared Bus Cell Cell Cell Cell Cell Cell Total wire length <0.05um <100cm FB FB FB IP IP FB FB <100 meters IP NoC IP IP IP >1Km An efficient way to manage the communication of on-chip resources plays the key role in future system design. 3 NoC design space exploration A large design space needs to be explored for an optimal design Task mapping, allocation, buffer sizing, routing algorithm etc. Accurate and fast performance evaluation is required during the exploration -> analytical performance evaluation model vld 70 357 36 2 idct Application (traffic pattern, injection rate) iQuant 353 Inverse Scan 16 27 Up samp 362 Stripe memory 362 30 0 49 AC/DC 500 VOP NoC Platform (topology, router design) RLD 313 padding Router model AR M Task scheduling/ mapping Inner loop Vop Core mapping 313 Routing allocation 16 Traffic analysis 94 Performance feedback Performance evaluation Outer loop Simulation/ Prototyping Design space exploration Detailed performance evaluation 4 Introduction- queuing-theory-based analytical model Queuing-theory-based delay estimation A A Customer (packet) arrival process System (server) service process Number of servers Service discipline (FCFS, Round-robin etc.) System time and waiting time Queue Server A Service time Waiting time System time Probability density Bernoulli injection process 1 f(𝑥𝑡 )=𝜆𝑒 −𝜆𝑥𝑡 (λ = ) 𝑇 1/T Time 5 Queuing-theory-based NoC latency model Previous arts and motivation of this work NoC latency model Previous NoC analytical models [VLSI 2007] [TCAD’12, ICCAD’09] [TVLSI’13] This work [NoCs’11] Traffic model for the application Queue M/M/1 M/G/1/K G/G/1 M/M/m/K G/G/1/K Arrival Poisson Poisson General Poisson General Service Markov General General Markov General Small 𝐵 flits NoC architecture modeled Buffer Small 𝐾 packets PB ratio1 𝑚 (≫ 1) <1 Arbitration Round robin Round robin 𝐵 flits arbitrary 𝑚 (≫ 1) Fixed priority Round robin arbitrary Round robin 1 PB ratio is defined as the ratio of average packet size (𝑚 flits) to the buffer depth (𝐵 flits) 6 Outline Introduction NoC Modeling for Performance Analysis NoC end-to-end delay calculation Link dependency analysis GE-type traffic modeling Wormhole router based NoC latency model Experimental results Simulation setup Evaluation under synthetic traffic patterns Evaluation under realistic benchmarks Conclusion 7 Input to the NoC latency model The application has been scheduled and mapped onto the NoC. A deterministic routing algorithm is used to avoid deadlock. T11 T1 T1 T10 T11 T10 T2 T2 T7 T9 T7 T8 T3 T4 T5 T6 T9 T8 Task scheduling and allocation T3 T4 T5 T6 Mapping and Floorplan Path 3 Path 2 Tile A Path 1 T1 Routing algorithm design Input to the latency model Tile A T3 𝑃𝑓 : Set of links in the routing path (𝜆𝑓 , 𝐶𝑓 ): GE-type traffic model Tile B Tile B Performance metrics: average latency 𝐿 = ( 𝑓∈𝐹 𝜆𝑓 × 𝐿𝑓 )/ 𝑓∈𝐹 𝜆𝑓 8 NoC end-to-end delay calculation The end-to-end flow latency 𝑳𝒔,𝒅 of a specific flow 𝒇𝒔,𝒅 consists of three parts: 𝐿𝑠,𝑑 = 𝑣𝑠 + 𝜂𝑠,𝑑 + ℎ𝑠,𝑑 The queuing time at the source 𝑣𝑠 The packet transfer time in the path 𝜂𝑠,𝑑 = 𝑚 + 1 + The path acquisition time ℎ𝑠,𝑑 = 𝑑𝑓 ℎ 𝑖=1 𝑙 𝑓 𝑖 f 8 f 2 l1 f0,8 0 ηl2 l 1 2 f 4 l3 (a) 3 4 5 0 5 l2 R1 R2 v0 7 1 𝑑𝑓 𝜂 𝑖=1 𝑙 𝑓 𝑖 hl2 ql2 f 4 l 6 (b) 7 Local current flow 8 f 5 l contention flow ( (c) 9 Outline Introduction NoC Modeling for Performance Analysis NoC end-to-end delay calculation Link dependency analysis GE-type traffic modeling Wormhole router based NoC latency model Experimental results Simulation setup Evaluation under synthetic traffic patterns Evaluation under realistic benchmarks Conclusion 10 Link dependency graph Building the channel (link) dependency graph L2_5 C4 C3 L0_3 P1 P3 P2 P4 L4_7 L6_7 L7_6 L4_1 L1_4 L1_2 L5_2 L5_4 L4_5 L3_6 L4_7 L4_5 L4_3 L6_3 P6 Core Communication graph L1_2 L1_0 L2_1 L1_4 L4_1 L3_0 L2_5 L5_2 L3_4 L3_6 C6 L0_1 P0 C2 C7 L2_1 Routing path Traffic model Edge in the dependency graph L5_4 P5 L7_4 L5_8 P7 L7_8 L8_7 NOC mesh L8_5 P8 L3_4 L7_6 L6_3 L4_3 L6_7 L7_4 Channel Dependency Graph (CDG) of the application communication 11 Link dependency analysis Topological sort algorithm is applied on the obtained CDG to find out the proper order to analyze the queuing delays. A sample channel dependency graph Source queuing 𝑣 Three parameters: 𝜂, ℎ, 𝑠 12 Outline Introduction NoC Modeling for Performance Analysis NoC end-to-end delay calculation Link dependency analysis GE-type traffic modeling Wormhole router based NoC latency model Experimental results Simulation setup Evaluation under synthetic traffic patterns Evaluation under realistic benchmarks Conclusion 13 Modeling the bursty traffic input Exponential branch τ M 1-τ Direct branch Packet generating point 0.07 Packet injection rate (packets/cycle) Generalized exponential (GE) distribution Poisson injection (CV=1) GE-type injection (CV=3) GE-type injection (CV=4) 0.06 0.05 0.04 0.03 0.02 0.01 0 10 20 30 40 50 60 70 Time horizontal index 80 90 100 14 GE-type traffic modeling The GE-type cumulative distribution function (cdf) of interarrival time is: 𝐹 𝑡 = 𝑃 𝑋 ≤ 𝑡 = 1 − 𝜏𝑒 −𝜏𝜆𝑡 , Where the parameter 𝜏 = 2 1+𝐶 2 𝑡≥0 and 𝐶 2 is the square coefficient of variation In this work, we use the GE distribution to model the traffic input of each flow, which is characterized by two parameters: 𝜆 : the average packet arrival rate (packets/cycle) 𝜎 𝐶: the coefficient of variation of this traffic flow, i.e., 𝐶 = , where 𝜎 is the 𝜆 standard derivation of the packet inter-arrival times. Accordingly, the GE/G/1/K queuing model is used to analyze the channel waiting time by considering the traffic burstness. 15 Outline Introduction NoC Modeling for Performance Analysis NoC end-to-end delay calculation Link dependency analysis GE-type traffic modeling Wormhole router based NoC latency model Experimental results Simulation setup Evaluation under synthetic traffic patterns Evaluation under realistic benchmarks Conclusion 16 Flit transfer time 𝜼 calculation The flit transfer time 𝜼 of link 𝒍𝒂𝒃 is defined as the time taken for the header flit after being granted link access to reach the buffer front in link 𝒍𝒂𝒃 contention flow ηli S S S A Packet (m flits) Crossbar S Any flow 𝑓 passing the link 𝑙𝑎𝑏 current link li Mean flit rate arriving at the buffer is: 𝜆𝑙𝑎𝑏 = 𝑚 × 𝜆𝑝𝑎𝑐𝑘𝑒𝑡 𝑙𝑎𝑏 = 𝑚 × 𝑓∈𝐹𝑙𝑎𝑏 𝜆𝑓 Mean time to serve a flit in this queuing system is the weighted average of the service ℎ𝑓 Mean 𝑙 𝑖+1 𝜆𝑓 × 𝑚 +1 ∀𝑓∈𝐹𝑙 service 𝑙𝑎𝑏 𝑎𝑏 time from all flows passing through 𝑙𝑎𝑏 : 𝑠𝑓𝑙𝑖𝑡 = time for 𝜆𝑓 ∀𝑓∈𝐹𝑙 𝑎𝑏 flow 𝑓 17 Illustration of path acquisition time Service time in wormhole NoC to obtain 𝒉: Waiting time W includes: 1) the link contention time H and 2) the time for the header flit to reach the buffer head Q H W Link contention S S Finte size buffer Q M/G/1/K+v queue Service time S is bounded by the time where the header reaches the node that the accumulated buffer spaces between can hold the whole worm packet. contention flow Point A ηli Point D ηli+1+hli+1 ηli+2+hli+2 S Crossbar Point C current link li S Point B hli+3 S zli+3 S A Packet (m flits) 18 Path acquisition time 𝒉 calculation Number of effective subsequent links of link 𝒍 with respect to the path 𝒑: 𝑚 𝑚 𝑖𝑓 𝑟 𝑝, 𝑙 > 𝛬𝑝 𝑙 = 𝐵 𝐵 𝑟 𝑝, 𝑙 𝑒𝑙𝑠𝑒𝑤𝑖𝑠𝑒 where 𝑟(𝑝, 𝑙) is the function returns the number of remaining hops from link 𝑙 towards the destination of path 𝑝. The service time of link 𝒍 with respect to the path 𝒑 is [1]: 𝑓 𝑠𝑙𝑓 = 𝑖 𝑓 𝑓 𝑓 𝑚 𝑚 + 𝑥𝑙𝑖 + 2𝑥𝑙𝑖 𝑚 /(𝑚 + 2𝑥𝑙𝑖 ) 𝑖𝑓 𝑥𝑙𝑖 < 𝑚 𝑚 𝑚+ 𝑓 𝑥𝑙𝑖 +2 𝑓 𝑥𝑙𝑖 2 𝑓 /(𝑚 + 2𝑥𝑙𝑖 ) 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 The channel service time of link 𝒍 : 𝑠𝑙𝑎𝑏 = 𝐶𝑠 2𝑙 = 𝑎𝑏 𝑠𝑙2𝑎𝑏 𝑠𝑙𝑎𝑏 ∀𝑓∈𝐹𝑙𝑎𝑏 (𝜆𝑓 2−1=( × 𝑠𝑙𝑓 )/ 𝑖 ∀𝑓∈𝐹𝑙𝑎𝑏 𝜆𝑓 ∀𝑓∈𝐹𝑙𝑎𝑏 ∀𝑓∈𝐹𝑙𝑎𝑏 𝜆𝑓 × 𝑠 2𝑓 𝑙𝑖 𝜆𝑓 )/ 𝑠𝑙𝑎𝑏 2 −1 [1] P.-C. Hu, L. Kleinrock, An Analytical model for wormhole routing with finite size input Buffers, 15th International Telegraphic Congress, 1998 19 GE/G/1/K queue based 𝒉 calculation Diffusion approximation for the steady state distribution probability 𝑷𝒏 of the M/G/1/K queue with arrival rate 𝝀 and service rate 𝝁 : 𝑐 × 𝑝′𝑛 (0 ≤ 𝑛 ≤ 𝐾) 𝜆 𝑃𝑛 = 1− 1−𝑐 1−𝜇 𝜆 𝜇 𝜆 𝜇 (𝑛 = 𝐾 + 1) Where the normalization constant 𝑐 = (1 − (1 − 𝐾 −1 𝑗=0 𝑃𝑛 )) and 𝑝′𝑛 is the steady state probability of M/G/1/∞ queue [2] Applying Little’s formula to obtain the waiting time : ℎ𝑙 ′ = ( 𝐾+1 𝑖=1 𝑖 × 𝑃𝑖 )/ 𝜆 Taking the arrival traffic burstiness in GE/G/1/K model by refining the results of M/G/1/K queue: ℎ𝑙𝑎𝑏 = (𝐶𝑠 2𝑙 +𝐶𝑎 2𝑙 ) 𝑎𝑏 𝑎𝑏 ℎ′ 𝑙𝑎𝑏 2 (1+𝐶𝑠 𝑙 ) 𝑎𝑏 [2] M.C. Lai, et.al. An accurate and efficient performance analysis approach based on queuing model for Network on Chip. In Proceedings of ICCAD,2009 20 Source queuing time 𝒗𝒔 The source queue is modeled as a GE/G/1/∞ system: 𝑠𝑙𝑠 𝑣𝑠 = 2 𝑠𝑙𝑠 − 𝑚 2 𝐶𝑎 +𝜆𝑎 × 𝑠𝑙𝑠 1+ 1 − 𝜆𝑎 × 𝑠𝑙𝑠 2 − 𝑠𝑙𝑠 where the arrival process is characterized by (λa , Ca2 ) in the GE type traffic model and the service time at source is represented as 𝑠𝑙𝑠 . Packet source Open loop source queuing measurement setup Input timing Infinite source buffer Output timing Network on chips Application traffic generation trace Traffic terminal (source and sink) Traffic terminal (source and sink) 21 Proposed NoC latency analysis flow Link dependency analysis to obtain the link order 𝑮 For each link 𝒍𝒂𝒃 in 𝑮: Calculate the flit transfer time 𝜂 Calculate the link service time 𝑠 Compute the path acquisition time ℎ Calculate the source queuing time 𝑣 Form the latency for each flow in application 1: foreach 𝒍𝒂𝒃 ∈ 𝑮 2: (𝝀𝒍𝒂𝒃 , 𝑪𝒂 𝟐𝒍 )= traffic_model (𝑭𝒍𝒂𝒃 ) 𝒂𝒃 3: 4: 𝒍 𝒂𝒃 𝜼𝒍𝒂𝒃 = calculate_transfer_time(𝝀𝒍𝒂𝒃 ,𝒔𝒇𝒍𝒊𝒕 , 𝒎, 𝑩) 𝒇 foreach 𝒇 ∈ 𝑭𝒍𝒂𝒃 and 𝒍𝒊 = 𝒍𝒂𝒃 𝒇 5: 𝒔𝒍𝒊 = calculate_link_service_time ( ) 6: end 7: ( 𝒔𝒍𝒂𝒃 , 𝑪𝒔 𝟐𝒍 ) = service_time ( ) 𝒂𝒃 8: if 𝒂 ≠ 𝒃 // the links between the routers 9: 𝒉𝒍𝒂𝒃 = GE_G_1_K_queue (𝝀𝒍𝒂𝒃 , 𝑪𝒂 𝟐𝒍 , 𝒔𝒍𝒂𝒃 , 𝑪𝟐𝒍𝒂𝒃 , 𝒌) 𝒂𝒃 10: else // the link is the source link 11: 𝒗𝒂 = GE_G_1 _queue (𝝀𝒍𝒂𝒃 , 𝑪𝒂 𝟐𝒍 , 𝒔𝒍𝒂𝒃 , 𝑪𝒔 𝟐𝒍 ) 𝒂𝒃 𝒂𝒃 12: endif 13:endfor 14: foreach 𝒇 ∈ 𝑭 15: 𝑳𝒔,𝒅 =calculate_flow_latency( ) 16: end 22 Outline Introduction NoC Modeling for Performance Analysis NoC end-to-end delay calculation Link dependency analysis GE-type traffic modeling Wormhole router based NoC latency model Experimental results Simulation setup Evaluation under synthetic traffic patterns Evaluation under realistic benchmarks Conclusion 23 Simulation setup The proposed analytical latency model is implemented in MATLAB and its accuracy is compared with Booksim simulator. Each router takes two cycles to route a flit and the link traversal stage takes an additional one cycle. Different buffer depth (𝑩 flits) and packet length (𝒎 flits) combinations are evaluated. Both synthetic and real applications are adopted: Random and shuffle traffic on 8 × 8 and 12 × 12 meshes MMS (Multimedia system) DVOPD (Video object plane decoder) MPEG4 (MPEG decoder) SPECweb99 applications 24 Evaluation under random traffic patterns 800 800 700 Latency (cycles) 600 500 8 × 8 𝑚𝑒𝑠ℎ, 𝑅𝑎𝑛𝑑𝑜𝑚 400 300 600 500 300 200 100 100 0 0.01 0.02 0.03 0.04 0.05 0.06 Packet injection rate (packets/cycle) 0.07 12 × 12 𝑚𝑒𝑠ℎ, 𝑅𝑎𝑛𝑑𝑜𝑚 400 200 0 Simulation (B=3,m=14) Proposed model (B=3,m=14) Simulation (B=4,m=9) Proposed model (B=4,m=9) Simulation(B=9,m=4) Proposed model (B=9,m=4) 700 Latency (cycles) Simulation (B=3,m=14) Proposed model (B=3,m=14) Simulation (B=4,m=9) Proposed model (B=4,m=9) Simulation(B=9,m=4) Proposed model (B=9,m=4) 0 0 0.01 0.02 0.03 0.04 Packet injection rate (packets/cycle) 0.05 The proposed latency model works for a variety of buffer depth and packet size combinations. For random traffic, about 5.2%-9.9% errors are introduced in predicting the network saturation point. 25 Evaluation under shuffle traffic patterns 800 800 700 Latency (cycles) 600 500 8 × 8 𝑚𝑒𝑠ℎ, 𝑠ℎ𝑢𝑓𝑓𝑙𝑒 400 300 600 200 500 12 × 12 𝑚𝑒𝑠ℎ, 𝑠ℎ𝑢𝑓𝑓𝑙𝑒 400 300 200 100 0 Simulation (B=3,m=14) Proposed model (B=3,m=14) Simulation (B=4,m=9) Proposed model (B=4,m=9) Simulation(B=9,m=4) Proposed model (B=9,m=4) 700 Latency (cycles) Simulation (B=3,m=14) Proposed model (B=3,m=14) Simulation (B=4,m=9) Proposed model (B=4,m=9) Simulation(B=9,m=4) Proposed model (B=9,m=4) 100 0 0.01 0.02 0.03 0.04 Packet injection rate (packets/cycle) 0.05 0 0 0.002 0.004 0.006 0.008 0.01 Packet injection rate (packets/cycle) 0.012 For the traffic patterns such as shuffle, a little larger error (10.8%-13%) is introduced due to the uneven traffic arrival rates across the channels. Overall, the analytical model achieves 70X speedup over the simulations for both traffic patterns. 26 Evaluation under burst and real traffic Comparison of Poisson and GE-type traffic injection: 1000 Poisson injection (CV=1) GE-type injection (CV=3) GE-type injection (CV=4) 0.06 0.05 Latency (cycles) Packet injection rate (packets/cycle) 0.07 0.04 0.03 0.02 Simulation (Poisson injection, CV=1) Simulation (GE-type injection, CV=3) Simulation (GE-type injection, CV=4) Proposed model (Poisson injection, CV=1) Proposed model (GE-type injection, CV=3) Proposed model (GE-type injection, CV=4) 800 600 400 200 0.01 0 0 10 20 30 40 50 60 70 Time horizontal index 80 90 100 0.01 0.015 0.02 0.025 0.03 0.035 0.04 Packet injection rate (packets/cycle) Evaluation under real application traces: 80 60 70 End-to-end delay (cycles) 50 End-to-end delay (cycles) Simulation Simulation Proposed model 40 30 20 10 0 0 Proposed model 60 50 40 30 20 10 5 10 15 20 Flow index in DVOPD application 25 30 0 DVOPD VOPD MMS MPEG4 Apache Ocean Oracle DVB2 Sparse 27 Outline Introduction NoC Modeling for Performance Analysis NoC end-to-end delay calculation Link dependency analysis GE-type traffic modeling Wormhole router based NoC latency model Experimental results Simulation setup Evaluation under synthetic traffic patterns Evaluation under realistic benchmarks Conclusion 28 Conclusion In this work, we propose a new NoC latency model which generalizes the previous work by modeling: The arrival traffic burstiness The general service time distribution The finite buffer depth and arbitrary packet length combinations A link dependency analysis technique is proposed to determine the order of applying queuing analysis The accuracy of the model is demonstrated using both the synthetic traffic and real applications. A 70X speedup over simulation is achieved with less than 13% error in the proposed analytical model, which benefit the NoC synthesis process. 29 Thank you!! Q&A 30
© Copyright 2024 ExpyDoc