The Future of HPC and The Path to Exascale
How GPUs Will Help Us Get There
Arnaldo Tavares
Senior Business Development Manager
NVIDIA TEGRA K1
Mar 2, 2014
NVIDIA Confidential
The Era of Accelerated Computing is Here
[Figure: timeline from 1980 to 2020 showing three successive eras: Era of Vector Computing, Era of Distributed Computing, Era of Accelerated Computing]
CUDA Accelerating
19% of FLOPS from GPU Systems
[Figure: Total Performance (PFLOPS, axis 0 to 40) by year, 2007 to 2012, broken down by accelerator type: NVIDIA Kepler, NVIDIA Fermi, Intel Xeon Phi, IBM Cell, Other]
GPUs Power World’s 10 Greenest Supercomputers
Green500 Rank   MFLOPS/W   Site
1               4,503.17   GSIC Center, Tokyo Tech
2               3,631.86   Cambridge University
3               3,517.84   University of Tsukuba
4               3,185.91   Swiss National Supercomputing (CSCS)
5               3,130.95   ROMEO HPC Center
6               3,068.71   GSIC Center, Tokyo Tech
7               2,702.16   University of Arizona
8               2,629.10   Max-Planck
9               2,629.10   (Financial Institution)
10              2,358.69   CSIRO
37              1,959.90   Intel Endeavor (top Xeon Phi cluster)
49              1,247.57   Météo France (top CPU cluster)
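As a rough worked comparison using only the MFLOPS/W values listed above (the ratios themselves are not stated on the slide and are rounded to one decimal place):

```latex
\[
\frac{4503.17}{1247.57} \approx 3.6
\qquad \text{(rank 1 system vs. top CPU cluster)}
\]
\[
\frac{4503.17}{1959.90} \approx 2.3
\qquad \text{(rank 1 system vs. top Xeon Phi cluster)}
\]
```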
Fast Paced CUDA GPU Roadmap
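[Figure: roadmap chart. Y axis: GFLOPS per Watt, logarithmic scale from 0.5 to 32. X axis: 2008 to 2018. Architectures, oldest to newest: Tesla, Fermi, Kepler, Maxwell, Pascal, Volta. Feature labels: CUDA, FP64, Dynamic Parallelism, Higher Perf/Watt, Unified Memory, Stacked DRAM, NVLINK]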
Artificial Neural Network at a Fraction of the Cost with GPUs
“Now You Can Build Google’s $1M Artificial Brain on the Cheap” -Wired
GOOGLE BRAIN
1,000 CPU Servers
2,000 CPUs • 16,000 cores
600 kWatts
$5,000,000
STANFORD AI LAB
3 GPU-Accelerated Servers
12 GPUs • 18,432 cores
4 kWatts
$33,000
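Taking the slide's figures at face value, the gap between the two systems can be worked out directly. The ratios below are derived here, not quoted from the slide, and the core counts are left out because CPU cores and GPU cores are not directly comparable:

```latex
\[
\text{Cost: } \frac{\$5{,}000{,}000}{\$33{,}000} \approx 152\times
\qquad
\text{Power: } \frac{600\ \text{kW}}{4\ \text{kW}} = 150\times
\qquad
\text{Servers: } \frac{1000}{3} \approx 333\times
\]
```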
IBM Partners with NVIDIA to Build Next-Generation Supercomputers
Tesla GPU + POWER8 CPU
Stacked Memory
4x Higher Bandwidth (~1 TB/s)
3x Larger Capacity
4x More Energy Efficient per bit
[Figure: package cross-section showing the GP100 die and HBM stacks on a passive silicon interposer, mounted on the package substrate]
Today…
TESLA GPU: GDDR5 Memory, 250 GB/s
GPU to CPU link: PCIe, 16 GB/s
CPU: DDR4 (DDR Memory), 50-75 GB/s
NVLink
TESLA GPU: Stacked Memory, 1 TB/s
GPU to CPU link: NVLink, 80 GB/s
CPU: DDR4 (DDR Memory), 50-75 GB/s
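Using only the bandwidth figures on the two diagrams above, the implied improvements work out as follows. This is an illustrative calculation that treats 1 TB/s as 1000 GB/s:

```latex
\[
\text{GPU--CPU link: } \frac{80\ \text{GB/s}}{16\ \text{GB/s}} = 5\times
\qquad
\text{GPU memory: } \frac{1000\ \text{GB/s}}{250\ \text{GB/s}} = 4\times
\]
```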
Connecting with CPUs via NVLink
Kepler: PCIe, 16 GB/s (x86 | ARM64 | Power8)
Pascal: NVLink, 80 GB/s (ARM64 | Power8+)
[Timeline: 2014, 2015, 2016]
Tesla SXM 2.0* – 3x Performance Density
[Figure: SXM 2.0 module with dimensions labeled 140 mm and 80 mm]
SXM 2.0* : Double Performance Per Node
[Figure: node diagram with eight blocks labeled CPU]
*Marketing Code Name. Name is not final.
Mobile Roadmap Meets GeForce
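[Figure: architecture advancements over time, with the mobile line converging on the GeForce line at Tegra K1]
GEFORCE ARCHITECTURE: Tesla, Fermi, Kepler, Maxwell
MOBILE ARCHITECTURE: Tegra 3, Tegra 4, Tegra K1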
Tegra K1
GPU
Kepler GPU (192 CUDA Cores)
OpenGL 4.4, OpenGL ES 3.0, DX11, CUDA 6
CPU
Quad Core Cortex A15 “r3”
With 5th Battery-Saver Core; 2MB L2 cache
CAMERA
Dual High Performance ISP
1.2 Gigapixel throughput, 100MP sensor
POWER
Lower Power
28HPM, Battery Saver Core
DISPLAY
4K panel, 4K HDMI
DSI, eDP, LVDS, High Speed HDMI 1.4a
[Figure: Tegra K1 block diagram. Labeled blocks: CPU with Battery Saver Core, Kepler GPU, 2x ISP, 2160p30 video encoder, 2160p30 video decoder, ARM7, security engine, audio, dual display, HDMI, USB 3.0, MIPI DSI/CSI/HSI, eMMC 4.5, DDR3L/LPDDR2/LPDDR3, UART, SPI, SDIO, I2S, I2C]
The Most Advanced GPU Comes to Mobile
Kepler Graphics   TEGRA 5   GEFORCE TITAN
OpenGL ES 3.0     Yes       Yes
OpenGL 4.4        Yes       Yes
DX11              Yes       Yes
Tessellation      Yes       Yes
CUDA 6.0          Yes       Yes
Power             <2 W      250 W
Facial Recognition
JETSON TK1
THE WORLD’S 1st EMBEDDED SUPERCOMPUTER
Development Platform for Embedded Computer Vision, Robotics, Medical
192 Cores · 326 GFLOPS
CUDA Enabled
Available Now
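Brazil x Other BRICS on Top 500
[Figure: number of Top 500 systems per BRICS country, labeled ERAD 2012. Values as extracted, without country labels: 3, 1, 5, 5, 68]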
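Brazil x Other BRICS on Top 500
[Figure: number of Top 500 systems and best rank per BRICS country, labeled HPC Advisory Council 2014. Values as extracted, without country labels: Top: 156, 3, 0, Top: 37, Top: 44, Top: 1, 5, 12, 63]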
Questions?
Thank you!
Machine Learning using Deep Neural Networks
[Figure: deep neural network, from Input to Result]
Hinton et al., 2006; Bengio et al., 2007; Bengio & LeCun, 2007; Lee et al., 2008; 2009
Visual Object Recognition Using Deep Convolutional Neural Networks
Rob Fergus (New York University / Facebook) http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php#2985
Google “Brain Project”
Building High-level Features Using Large Scale Unsupervised Learning
Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng
Stanford / Google
1 billion connections
10 million 200x200 pixel images
1,000 servers (16,000 cores)
3 days to train
Accelerating Machine Learning
“Now You Can Build Google’s $1M Artificial Brain on the Cheap” -Wired
GOOGLE BRAIN
1,000 CPU Servers
2,000 CPUs • 16,000 cores
600 kWatts
$5,000,000
STANFORD AI LAB
3 GPU-Accelerated Servers
12 GPUs • 18,432 cores
4 kWatts
$33,000
Deep Learning with COTS HPC Systems
A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro
Stanford / NVIDIA • ICML 2013
“10 Billion Parameter Neural Networks In Your Basement”, Adam Coates
http://on-demand.gputechconf.com/gtc/2014/video/S4694-10-billion-parameter-neural-networks.mp4
Machine Learning Comes of Age
Image Detection
Face Recognition
Gesture Recognition
Video Search & Analytics
Speech Recognition & Translation
Recommendation Engines
Indexing & Search
Talks at GTC
Auto Tagging in Creative Cloud
Speech/Image Recognition
Hadoop-based Clustering
Recommendation Engine
Database Queries
Search Ranking
GPUs Accelerate Machine Learning & Data Analytics
Unified Memory
Dramatically Lower Developer Effort
Developer View Today: System Memory | GPU Memory
Developer View With Unified Memory: Unified Memory
Super Simplified Memory Management Code
CPU Code
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}
CUDA 6 Code with Unified Memory
void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data,N,1,compare);
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);
}
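The slide's `qsort<<<...>>>` launch is elided, so a complete program cannot be reconstructed from it. As a minimal sketch of the same managed-memory pattern, the example below replaces the sort with a hypothetical stand-in kernel, `increment`, that adds 1 to every byte. The kernel name, buffer size, and launch configuration are illustrative assumptions; only `cudaMallocManaged`, `cudaDeviceSynchronize`, and `cudaFree` come from the slide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the slide's GPU sort: adds 1 to every byte.
__global__ void increment(char *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int N = 1 << 20;  // illustrative buffer size
    char *data = nullptr;

    // One allocation, one pointer, usable from both CPU and GPU code.
    cudaError_t err = cudaMallocManaged(&data, N);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The CPU fills the buffer directly; no explicit host-to-device copy.
    for (int i = 0; i < N; ++i) data[i] = 1;

    // The GPU works on the same pointer.
    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    increment<<<blocks, threads>>>(data, N);

    // Wait for the kernel to finish before the CPU touches the data again.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(data);
        return 1;
    }

    // The CPU reads the results directly; no device-to-host copy.
    int mismatches = 0;
    for (int i = 0; i < N; ++i) {
        if (data[i] != 2) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(data);
    return mismatches != 0;
}
```

Assuming a CUDA 6 or later toolkit and a GPU that supports managed memory, this should build with `nvcc` as a single `.cu` file. The synchronize call before the CPU read mirrors the `cudaDeviceSynchronize()` line on the slide.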
Unprecedented Value to Scientific Computing
AMBER Molecular Dynamics Simulation
DHFR NVE Benchmark
64 Sandy Bridge CPUs: 58 ns/day
1 Tesla K40 GPU: 102 ns/day
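The two benchmark figures above imply the following speedups. The per-processor figure is a rough illustration only, since it assumes the 64-CPU throughput divides evenly across the CPUs, which the slide does not claim:

```latex
\[
\text{System vs. system: } \frac{102\ \text{ns/day}}{58\ \text{ns/day}} \approx 1.76\times
\]
\[
\text{One GPU vs. one CPU (even-split assumption): }
\frac{102}{58 / 64} = \frac{102 \times 64}{58} \approx 113\times
\]
```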
Accelerating Mainstream Datacenters
Oil & Gas
Higher Ed
Chinese Academy of Sciences
Government
Air Force Research Laboratory
Supercomputing
Swiss National Supercomputing Centre
Tokyo Institute of Technology
Naval Research Laboratory
Finance
Web 2.0