Advanced Topics in Computer Architecture, Second Half, Lecture 5
GPGPU: General-Purpose Computing on GPUs
Lecturer: Shinpei Kato
These materials are for class use only. Reproduction without permission is prohibited.

Last Lecture's Comprehension Quiz (not yet given)

Q1. Briefly explain the roles of the shared-memory system and the snoop cache in chip multiprocessing.
Answer: The shared-memory system lets the multiple cores on a single processor share the same memory space. Snoop caching is the technique used to keep the data held in each core's private cache coherent.

Q2. Briefly explain the four states of the MESI protocol (Modified / Exclusive / Shared / Invalid).
Answer:
• Modified: the data in this cache is correct; the copies elsewhere (memory, other cores) are not up to date.
• Exclusive: the data is correct, and no other core holds it.
• Shared: the data is correct, and other cores may also hold it.
• Invalid: the data is not correct; it must be refetched before it can be read.

Q3. Two processor cores (P0 and P1) each have a private L1 cache, and there is also a shared memory. Give the MESI state of P0 and P1 after each of the following accesses to memory address X.
Answer:
  Access                 P0 state   P1 state
  (1) P0: Read from X       E          I
  (2) P1: Read from X       S          S
  (3) P1: Write to X        I          M
  (4) P0: Write to X        M          I
  (5) P1: Read from X       S          S

Today's Lecture
• GPGPU
• Comprehension quiz

GPU? A Graphics Processing Unit, originally built to render beautiful graphics.

GPU vs. Many Cores
[Figure: block diagrams contrasting an Intel many-core processor (x86 cores with vector units, private L1 and L2 caches, an interprocessor network, and a memory & I/O interface to main memory) with an NVIDIA GPU (many simple cores with L1 caches sharing an L2 cache and device memory, attached to a host CPU with host memory).]

[Figure: GPU generations and core counts, programmable from C, C++, and Java: Tesla (2008, 250 cores), Fermi (2010, 500 cores), Kepler (2012, 3000 cores), Maxwell (2014, 5000 cores).]

GPU vs. CPU
[Figure: two charts of single-precision performance (GFLOPS) and performance per watt (GFLOPS/W) versus release year, 2006-2014. The NVIDIA GTX line (8800 GTX, 9800 GTX, GTX 285, GTX 480, GTX 580, GTX 680, GTX Titan, GTX Titan Black) climbs toward roughly 5000-6000 GFLOPS and over 20 GFLOPS/W, while Intel Xeon processors (X7350, X7460, X7560, E7-8870, E7-8890) remain far lower on both axes.]

Today's Computer Systems
Even an embedded GPU reaches 396 GFLOPS at 10 W. (Source: http://www.nvidia.com)

GPU Internals and Command Submission
[Figure: the GPU contains a kernel execution unit, the processing cores, and other units. The CPU writes GPU commands through control registers in the MMIO space (PCI); commands sit in a command buffer that refers to an indirect buffer (IB), which an on-chip microcontroller reads. An IB packet consists of a 24-bit code / buffer-size field and a 40-bit address offset pointing to the command data. Unified addressing maps device memory and host memory into a single memory space via the page table and GART.]

GPGPU Computing Model
• CMD_HtoD: copy the GPU code from host memory to device memory.
• CMD_HtoD: copy the input data from host memory to device memory.
• CMD_LAUNCH: execute the GPU code on the device, producing output data in device memory.
• CMD_DtoH: copy the output data from device memory back to host memory.

GPGPU Execution
[Figure: timeline of two programs issuing GPU commands through the GPU driver; while the GPU executes one program's commands, that program's CPU thread is blocked, and the other program's commands wait their turn.]

GPGPU Computing Stack
• Application layer: user programs.
• API layer: CUDA Runtime, HMPP, OpenCL, OpenGL, on top of a GPGPU runtime backend.
• OS layer: the Linux kernel and the device driver, reached through system calls.
• Hardware layer: CPUs and GPUs, connected through I/O.

CUDA: Compute Unified Device Architecture
• The device is abstracted by the concepts of Grid, Block, and Thread (e.g., Grid = (2, 2), Block = (3, 3)).
• The programming model never changes as the device changes.

CUDA Programming Example
A matrix-multiplication kernel in which each thread computes one element of the n x n product matrix:

__global__ void multiply(double *a, double *b, double *c, int n)
{
    double product = 0.0;
    /* Each thread computes the output element at (row, col). */
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int i;

    for (i = 0; i < n; i++)
        product += a[row * n + i] * b[i * n + col];
    c[row * n + col] = product;
}
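The slide shows only the device code. As a minimal sketch of how such a kernel might be driven from the host (not from the lecture; the matrix size, the 16 x 16 block shape, and the lack of error checking are assumptions), following the CMD_HtoD / CMD_LAUNCH / CMD_DtoH model above and assuming the multiply kernel is defined in the same .cu file:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 64;                      /* assume n is a multiple of 16 */
    size_t bytes = (size_t)n * n * sizeof(double);
    double *ha = (double *)malloc(bytes);
    double *hb = (double *)malloc(bytes);
    double *hc = (double *)malloc(bytes);
    double *da, *db, *dc;
    int i;

    for (i = 0; i < n * n; i++) { ha[i] = 1.0; hb[i] = 2.0; }

    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   /* CMD_HtoD */
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);   /* CMD_HtoD */

    dim3 block(16, 16);                    /* 256 threads per block = 8 warps */
    dim3 grid(n / block.x, n / block.y);   /* one thread per output element  */
    multiply<<<grid, block>>>(da, db, dc, n);             /* CMD_LAUNCH */

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);    /* CMD_DtoH */
    printf("c[0][0] = %f (expected %f)\n", hc[0], 2.0 * n);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}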
Threads and Warps
How many threads does each context have?
  Threads per grid = Block.x * Block.y * Block.z * Grid.x * Grid.y * Grid.z
That is a very large number, so switching context per thread would cost far too much. Instead, context switching is done in units of 32 threads, called a warp.

[Figure: warp scheduling timeline (TB = thread block, W = warp). Warps from thread blocks TB1-TB3 are interleaved on the same execution unit: whenever the running warp stalls (e.g., TB1 W1, TB2 W1, TB3 W2), the scheduler switches to another ready warp, so the numbered instructions of the different warps fill the timeline and the stalls are hidden.]

NVIDIA Kepler GK110 Architecture (chip)
(As used in the K20 GPU server coit-grid08.uncc.edu.)
Highlights, to discuss in class. Extracted directly from: "Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110", NVIDIA, 2012.
http://www.nvidia.com/content/PDF/kepler/NVIDIA-KeplerGK110-Architecture-Whitepaper.pdf
ITCS 4/5010 GPU Programming, B. Wilkinson, GK110ArchNotes.ppt, Feb 11, 2013.

• Designed for performance and power efficiency
• 7.1 billion transistors
• Over 1 TFLOPS of double-precision throughput
• 3x the performance per watt of Fermi

New features in Kepler GK110:
• Dynamic Parallelism
• Hyper-Q with the GK110 Grid Management Unit (GMU)
• NVIDIA GPUDirect RDMA

Kepler GK110 Chip
[Figure: Kepler GK110 full-chip block diagram.]
• Kepler GK110 supports the new CUDA Compute Capability 3.5.
• For comparison, the GTX 470/480 use GF100 (Fermi) chips; the C2050s on grid06 and grid07 are compute capability 2.0.
• A full Kepler GK110 has 15 SMXs; some products may have 13 or 14 SMXs.

Quad Warp Scheduler
• The SMX schedules threads in groups of 32 parallel threads called warps.
• Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps (128 threads) to be issued and executed concurrently.
• Kepler GK110 allows double-precision instructions to be paired with other instructions.

One Warp Scheduler Unit
• Each thread can access up to 255 registers (4x that of Fermi).
• New shuffle instruction, which allows threads within a warp to share data without passing it through shared memory.
• Atomic operations: improved by 9x, to one operation per clock, fast enough to use frequently inside kernel inner loops.

Texture unit improvements
• Not considered in class; used for image processing.
• Speed improvements when programs need to operate on image data.

New: 48 KB read-only data cache
• The compiler/programmer can use it to advantage; it is faster than L2.

Shared memory / L1 cache split
Each SMX has 64 KB of on-chip memory that can be configured as:
• 48 KB of shared memory with 16 KB of L1 cache, or
• 16 KB of shared memory with 48 KB of L1 cache, or
• (new) a 32 KB / 32 KB split between shared memory and L1 cache.

Dynamic Parallelism
• Fermi could only launch one kernel at a time on a single device, and a kernel had to complete before the CPU could request another GPU task.
• "In Kepler GK110 any kernel can launch another kernel, and can create the necessary streams, events and manage the dependencies needed to process additional work without the need for host CPU interaction."
• "... makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on GPU."

[Figure: without dynamic parallelism (Fermi), control must be transferred back to the CPU before a new kernel can execute; with dynamic parallelism (Kepler), control returns to the CPU only when all GPU operations are completed.]

Why is this faster? "With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data dependent manner. Starting with a coarse grid, the simulation can 'zoom in' on areas of interest while avoiding unnecessary calculation in areas with little change ..."
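To make this concrete, here is a minimal sketch of a device-side launch under dynamic parallelism (hypothetical code, not from the whitepaper or the lecture; the kernel names, the activity array, and the threshold test are invented). It assumes a compute-capability 3.5 device and compilation with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true ... -lcudadevrt:

#include <cuda_runtime.h>
#include <stdio.h>

/* Fine-grained work for one "interesting" region. */
__global__ void child(int region)
{
    printf("child: region %d, thread %d\n", region, (int)threadIdx.x);
}

/* Each parent thread inspects its region and, only if the data demands it,
 * launches a finer child grid -- no round trip to the host CPU. */
__global__ void parent(const float *activity, float threshold)
{
    int region = threadIdx.x;
    if (activity[region] > threshold)
        child<<<1, 32>>>(region);
}

int main(void)
{
    float h[8] = {0.1f, 0.9f, 0.2f, 0.8f, 0.0f, 0.3f, 0.95f, 0.1f};
    float *d;

    cudaMalloc((void **)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    parent<<<1, 8>>>(d, 0.5f);     /* child grids are launched from the GPU */
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}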
Hyper-Q
• "The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue."
• "Kepler GK110 ... Hyper-Q increases the total number of connections (work queues) ... by allowing 32 simultaneous, hardware-managed connections."
• "... allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase" with Hyper-Q.
• "Each CUDA stream is managed within its own hardware work queue ..."
• "The redesigned Kepler HOST to GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch and hold pending and suspended grids."
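From the host API, Hyper-Q needs no new calls: work submitted on separate CUDA streams can be picked up by separate hardware work queues on GK110 instead of being falsely serialized. A minimal sketch (hypothetical code, not from the whitepaper; the kernel, buffer sizes, and stream count are arbitrary):

#include <cuda_runtime.h>

#define NSTREAMS 8

/* A trivial kernel that does some per-stream work. */
__global__ void busy(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; k++)
            data[i] = data[i] * 1.0001f + 0.5f;
}

int main(void)
{
    const int n = 1 << 16;
    cudaStream_t streams[NSTREAMS];
    float *buf[NSTREAMS];

    /* Each stream gets its own buffer and its own kernel launch.
     * On Fermi these launches were multiplexed into one hardware queue;
     * with Hyper-Q each stream can feed its own hardware work queue,
     * so independent kernels can run concurrently. */
    for (int s = 0; s < NSTREAMS; s++) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void **)&buf[s], n * sizeof(float));
        cudaMemsetAsync(buf[s], 0, n * sizeof(float), streams[s]);
        busy<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }

    cudaDeviceSynchronize();   /* wait for all streams to finish */

    for (int s = 0; s < NSTREAMS; s++) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}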
NVIDIA GPUDirect
"Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect provides the following important features:
• Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering. (A huge improvement for GPU-only servers.)
• Significantly improved MPISend/MPIRecv efficiency between GPU and other nodes in a network.
• Eliminates CPU bandwidth and latency bottlenecks.
• Works with a variety of 3rd-party network, capture, and storage devices."
"GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well."

Comprehension Quiz
To be distributed in class.