
Advanced Computer Architecture
Second Half, Lecture 5
GPGPU
General Purpose Computing on GPUs
Lecturer: Shinpei Kato
This material is for class use only. Reproduction without permission is prohibited.
Review Quiz from the Last Lecture (not administered)
Question 1
Briefly explain the roles of the shared-memory system and the snoop cache in chip multiprocessing.
Answer
The shared-memory system lets the multiple cores on a single processor share the same memory space.
The snoop cache is the technique used to keep the data held in each core's private cache coherent.
Review Quiz from the Last Lecture (not administered)
Question 2
Briefly explain the four states of the MESI protocol (Modified/Exclusive/Shared/Invalid).
Answer
Modified:
The data in this cache is correct and has been modified; the copy in main memory is stale, and no other cache holds a valid copy.
Exclusive:
The data is correct, and this core is the only one holding it.
Shared:
The data is correct, and other cores also hold copies of it.
Invalid:
The data is not valid; it must be fetched again before it is read.
Review Quiz from the Last Lecture (not administered)
Question 3
Two processor cores (P0 and P1) each have a private cache (L1 cache), and in addition there is a shared memory. When P0 and P1 access memory address X in the order shown below, give the MESI states of P0 and P1 after each access.
Answer
                       P0 state   P1 state
① P0: Read from X         E          I
② P1: Read from X         S          S
③ P1: Write to X          I          M
④ P0: Write to X          M          I
⑤ P1: Read from X         S          S
Today's Lecture
• GPGPU
• Comprehension quiz
GPU? Graphics Processing Unit
Beautiful Graphics
GPU vs Many Cores
[Block diagrams: an Intel many-core processor built from x86 cores with vector units, per-core L1 caches, L2 caches, an interprocessor network, and a memory & I/O interface to main memory, next to an NVIDIA GPU built from many small cores sharing L1 caches and an L2 cache, backed by its own device memory and attached to a CPU that has host memory.]
NVIDIA GPU
[Diagram: the GPU's cores share L1 caches and an L2 cache backed by device memory, and the GPU is attached to a CPU with host memory. Timeline of NVIDIA GPU generations and core counts: Tesla (2008, ~250 cores), Fermi (2010, ~500 cores), Kepler (2012, ~3,000 cores), Maxwell (2014, ~5,000 cores), programmable from languages such as C, C++, and Java.]
GPU vs CPU
[Two plots against release year (2006–2014): single-precision performance (GFLOPS) and performance per watt (GFLOPS/W). NVIDIA GTX GPUs (8800 GTX, 9800 GTX, GTX 285, GTX 480, GTX 580, GTX 680, GTX Titan, GTX Titan Black) climb to roughly 5,000 GFLOPS and 20 GFLOPS/W, while Intel Xeon CPUs (X7350, X7460, X7560, E7-8870, E7-8890) remain far below on both measures.]
Today’s Computer Systems
Embedded GPU
396 GFLOPS @ 10 W
http://www.nvidia.com
GPU
• Kernel execution unit
• Processing cores
• Others
Details:
[Diagram: the CPU writes GPU commands into a command buffer in host memory and accesses control registers through the MMIO space (PCI). A microcontroller on the GPU reads the commands; indirect buffer (IB) packets, each carrying an address offset (40 bits) and a buffer size (24 bits), refer to the GPU commands (code and data) in the command buffer. A unified addressing memory space is provided by the page tables and the GART, spanning both device memory and host memory.]
GPGPU Computing Model
1. CMD_HtoD: copy the GPU code from host memory to device memory.
2. CMD_HtoD: copy the input data from host memory to device memory.
3. CMD_LAUNCH: run the GPU code on the device; it reads the input data and writes the output data in device memory.
4. CMD_DtoH: copy the output data from device memory back to host memory.
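The same flow expressed with the CUDA runtime API might look like the sketch below. The kernel name myKernel, the buffer names, and the sizes are illustrative assumptions; with the runtime API the transfer of the GPU code itself is handled implicitly by the driver, so the explicit steps are the data copies and the launch.

#include <cuda_runtime.h>

__global__ void myKernel(const float *in, float *out, int n);  /* assumed kernel */

void run(const float *h_in, float *h_out, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);                                 /* device memory */
    cudaMalloc(&d_out, bytes);

    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    /* CMD_HtoD */
    myKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);       /* CMD_LAUNCH */
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  /* CMD_DtoH */

    cudaFree(d_in);
    cudaFree(d_out);
}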
GPGPU Execution
[Timeline: Program 1 and Program 2 on the CPU submit GPU commands through the GPU driver; while its commands are executing on the GPU, each program is blocked on the CPU.]
GPGPU Computing Stack
• Application: user programs
• API: CUDA Runtime, HMPP, OpenCL, OpenGL, layered on a GPGPU runtime backend
• OS (entered via system calls): Linux kernel and device driver
• Hardware (driven via I/O): CPUs and GPUs
CUDA
Compute Unified Device Architecture
[Figure: a Grid = (2, 2) of Blocks, each Block = (3, 3) of Threads]
The device is abstracted by the concepts of Grid, Block, and Thread.
The programming model never changes as the device changes.
CUDA Programming Example
__global__ void multiply(double *a, double *b, double *c, int n)
{
    double product = 0.0;
    /* each thread computes one element of c, at the (row, col)
       position derived from its block and thread indices */
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int i;
    for (i = 0; i < n; i++)
        product += a[row * n + i] * b[i * n + col];
    c[row * n + col] = product;
}
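The host-side launch that matches the Grid = (2, 2), Block = (3, 3) configuration from the previous slide might look like the following sketch, covering a 6x6 matrix (n = 6); d_a, d_b, and d_c are assumed to be device buffers already filled as in the computing-model sketch above.

dim3 block(3, 3);                             /* 3x3 threads per block            */
dim3 grid(2, 2);                              /* 2x2 blocks per grid -> 6x6 threads */
multiply<<<grid, block>>>(d_a, d_b, d_c, 6);
cudaDeviceSynchronize();                      /* wait for the kernel to finish    */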
Threads and Warp
How many threads does each context have?
= Block.x * Block.y * Block.z * Grid.x * Grid.y * Grid.z
= a very large number…
Switching each thread individually would cost too much overhead,
so context switching is done in units of 32 threads, called a warp.
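A quick worked example of the formula above (the launch configuration is an assumption):

dim3 block(16, 16, 1);   /* 16*16*1 = 256 threads per block              */
dim3 grid(8, 8, 1);      /* 8*8*1   = 64 blocks per grid                 */
/* total threads = 256 * 64 = 16,384; each block holds 256/32 = 8 warps  */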
[Figure: warp scheduling over time. Whenever the running warp stalls, the scheduler switches to another warp that is ready to issue:
TB1 W1 (instructions 1–6) → stall → TB2 W1 (instructions 1–2) → stall → TB3 W1 (instructions 1–2) → TB3 W2 (instructions 1–2) → stall → TB2 W1 (instructions 3–4) → TB1 W1 (instructions 7–8) → TB1 W2 (instructions 1–2) → TB1 W3 (instructions 1–2) → TB3 W2 (instructions 3–4) → …
TB = Thread Block, W = Warp]
NVIDIA Kepler GK110
Architecture (chip)
(As used in coit-grid08.uncc.edu K20 GPU server)
Highlights – To discuss in class
Extracted directly from:
“Whitepaper: NVIDIA’s Next Generation CUDA™ Compute Architecture Kepler™ GK110”, NVIDIA, 2012.
http://www.nvidia.com/content/PDF/kepler/NVIDIA-KeplerGK110-Architecture-Whitepaper.pdf
ITCS 4/5010 GPU Programming, B. Wilkinson, GK110ArchNotes.ppt, Feb 11, 2013.
Designed for performance and power efficiency
7.1 billion transistors
Over 1 TFLOPS of double-precision throughput
3x the performance per watt of Fermi
New features in Kepler GK110:
• Dynamic Parallelism
• Hyper-Q with the GK110 Grid Management Unit (GMU)
• NVIDIA GPUDirect™ RDMA
Kepler GK110 Chip
Kepler GK110 full-chip block diagram
Kepler GK110 supports the new CUDA Compute Capability 3.5
GTX 470/480s have GF100s (Fermi)
C2050s on grid06 and grid07 are compute capability 2.0
The full Kepler GK110 has 15 SMXs
Some products may have 13 or 14 SMXs
Quad Warp Scheduler
The SMX schedules threads in groups of 32 parallel threads called warps.
Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently (128 threads).
Kepler GK110 allows double-precision instructions to be paired with other instructions.
One Warp Scheduler Unit
• Each thread can access up to 255 registers (4x that of Fermi)
• New shuffle instruction, which allows threads within a warp to share data without passing it through shared memory (see the sketch below)
• Atomic operations: improved by 9x, to one operation per clock – fast enough to use frequently in kernel inner loops
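A hedged illustration (not taken from the whitepaper) of the last two points: a warp-level sum using the shuffle intrinsic, followed by one atomic add per warp. __shfl_down_sync is the CUDA 9+ spelling; on the Kepler-era CUDA 5 toolchain the same idea was written with __shfl_down.

/* Each warp reduces its 32 values entirely in registers via shuffles
   (no shared memory), then lane 0 accumulates the warp's sum with a
   single atomicAdd. Assumes blockDim.x is a multiple of 32. */
__global__ void warp_sum(const float *data, float *result)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;                   /* lane index within the warp */
    float val = data[tid];

    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    if (lane == 0)
        atomicAdd(result, val);                    /* one atomic per warp */
}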
Texture unit improvements
• Not considered in class
• For image processing
• Speed improvements when programs need to operate on image data
New: 48 KB Read-only memory cache
Compiler/programmer can use to advantage
Faster than L2
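As a hedged sketch (not from the whitepaper): on Kepler, loads through pointers marked const __restrict__, or issued with the __ldg() intrinsic, may be routed through this read-only cache.

/* Hinting that `in` is read-only for the whole kernel, so its loads can
   go through the 48 KB read-only cache on Kepler (compute capability 3.5). */
__global__ void scale(const float * __restrict__ in,
                      float * __restrict__ out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = factor * __ldg(&in[i]);   /* __ldg() forces a read-only-cache load */
}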
Shared memory / L1 cache split:
Each SMX has 64 KB of on-chip memory, which can be configured as:
• 48 KB of shared memory with 16 KB of L1 cache, or
• 16 KB of shared memory with 48 KB of L1 cache, or
• (new) a 32 KB / 32 KB split between shared memory and L1 cache.
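The split is typically selected from the host through the CUDA runtime, as in the sketch below; the kernel name multiply refers to the earlier example, and the per-kernel call is an illustration rather than a required step.

cudaFuncSetCacheConfig(multiply, cudaFuncCachePreferShared); /* 48 KB shared / 16 KB L1 */
/* Other options: cudaFuncCachePreferL1    -> 16 KB shared / 48 KB L1
                  cudaFuncCachePreferEqual -> 32 KB / 32 KB split
   cudaDeviceSetCacheConfig() sets a device-wide preference instead. */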
Dynamic Parallelism
• Fermi could only launch one kernel at a time on a single device; a kernel had to complete before another GPU task could be requested.
• “In Kepler GK110 any kernel can launch another kernel, and can create the necessary streams, events and manage the dependencies needed to process additional work without the need for host CPU interaction.”
• “.. makes it easier for developers to create and optimize recursive and data‐dependent execution patterns, and allows more of a program to be run directly on GPU.”
Without Dynamic Parallelism, control must be transferred back to the CPU before a new kernel can execute.
With Dynamic Parallelism, control returns to the CPU only when all GPU operations are completed. Why is this faster?
“With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data dependent manner. Starting with a coarse grid, the simulation can “zoom in” on areas of interest while avoiding unnecessary calculation in areas with little change ….”
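A minimal sketch of a device-side launch (the kernel names and the refinement condition are assumptions, not from the whitepaper); Dynamic Parallelism requires compute capability 3.5+ and building with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true ... -lcudadevrt.

__global__ void refine_cell(float *grid_data, int cell);   /* child kernel (assumed) */

__global__ void coarse_pass(float *grid_data, int n, float threshold)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < n && grid_data[cell] > threshold) {
        /* "zoom in" on an interesting cell without returning to the CPU */
        refine_cell<<<1, 32>>>(grid_data, cell);
    }
}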
Hyper‐Q
“The Fermi architecture supported 16‐way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue.”
“Kepler GK110 … Hyper‐Q increases the total number of connections (work queues) … by allowing 32 simultaneous, hardware‐managed connections..”
“… allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase without …”
Hyper‐Q
“Each CUDA stream is managed within its own hardware work queue …”
“The redesigned Kepler HOST to GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch and hold pending and suspended grids.”
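A hedged sketch of how independent work is put into separate CUDA streams so that Hyper-Q can keep each stream in its own hardware work queue; the kernel name small_kernel, the buffer d_data, and NSTREAMS are illustrative assumptions.

#define NSTREAMS 8
cudaStream_t streams[NSTREAMS];
for (int i = 0; i < NSTREAMS; i++)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < NSTREAMS; i++)
    small_kernel<<<1, 64, 0, streams[i]>>>(d_data, i);   /* independent work per stream */

cudaDeviceSynchronize();                                 /* wait for all streams */
for (int i = 0; i < NSTREAMS; i++)
    cudaStreamDestroy(streams[i]);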
NVIDIA GPUDirect™
“Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third‐party devices such as IB adapters, NICs, and SSDs.
When using CUDA 5.0, GPUDirect provides the following important features:
· Direct memory access (DMA) between NIC and GPU without the need for CPU‐side data buffering. (A huge improvement for GPU-only servers)
· Significantly improved MPISend/MPIRecv efficiency between GPU and other nodes in a network.
· Eliminates CPU bandwidth and latency bottlenecks.
· Works with a variety of 3rd‐party network, capture, and storage devices.”
“GPUDirect RDMA allows direct access to GPU memory from 3rd‐party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.”
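With an MPI library built to be CUDA-aware and to use GPUDirect RDMA, the improved MPI path can be used simply by passing device pointers to MPI calls, as in the sketch below; the CUDA-aware build is an assumption, and the ranks, tag, and message size are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

/* Exchange a buffer that lives in GPU memory: the NIC DMAs directly
   to/from device memory, with no staging copy into a host buffer. */
void exchange(int rank, int peer, int count)
{
    double *d_buf;
    cudaMalloc(&d_buf, count * sizeof(double));

    if (rank == 0)
        MPI_Send(d_buf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    else
        MPI_Recv(d_buf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
}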
Comprehension Quiz
Handed out in class