The Future of HPC and the Path to Exascale
How GPUs Will Help Us Get There
Arnaldo Tavares, Senior Business Development Manager, NVIDIA
TEGRA K1 Mar 2, 2014 NVIDIA Confidential

The Era of Accelerated Computing is Here
[Figure: timeline from 1980 to 2020 showing the Era of Vector Computing, the Era of Distributed Computing, and the Era of Accelerated Computing]

CUDA Accelerating 19% of FLOPS from GPU Systems
[Figure: Total Performance (PFLOPS, 0 to 40) by year, 2007 to 2012. Series: NVIDIA Kepler, NVIDIA Fermi, Intel Xeon Phi, IBM Cell, Other]

GPUs Power World’s 10 Greenest Supercomputers

Green500 Rank   MFLOPS/W    Site
1               4,503.17    GSIC Center, Tokyo Tech
2               3,631.86    Cambridge University
3               3,517.84    University of Tsukuba
4               3,185.91    Swiss National Supercomputing (CSCS)
5               3,130.95    ROMEO HPC Center
6               3,068.71    GSIC Center, Tokyo Tech
7               2,702.16    University of Arizona
8               2,629.10    Max-Planck
9               2,629.10    (Financial Institution)
10              2,358.69    CSIRO
37              1,959.90    Intel Endeavor (top Xeon Phi cluster)
49              1,247.57    Météo France (top CPU cluster)

Fast Paced CUDA GPU Roadmap
[Figure: GFLOPS per Watt (axis ticks 0.5, 1, 2, 4, 8, 16, 32) by year, 2008 to 2018. Architectures: Tesla, Fermi, Kepler, Maxwell, Pascal, Volta. Feature labels: CUDA, FP64, Dynamic Parallelism, Higher Perf/Watt, Unified Memory, Stacked DRAM, NVLINK]

Artificial Neural Network at a Fraction of the Cost with GPUs
“Now You Can Build Google’s $1M Artificial Brain on the Cheap” (Wired)

              GOOGLE BRAIN                 STANFORD AI LAB
Servers       1,000 CPU Servers            3 GPU-Accelerated Servers
Processors    2,000 CPUs • 16,000 cores    12 GPUs • 18,432 cores
Power         600 kW                       4 kW
Cost          $5,000,000                   $33,000

IBM Partners with NVIDIA to Build Next-Generation Supercomputers
Tesla GPU + POWER8 CPU

Stacked Memory
4x Higher Bandwidth (~1 TB/s)
3x Larger Capacity
4x More Energy Efficient per bit
[Figure: GP100 with HBM stacks on a passive silicon interposer, mounted on the package substrate]

Today vs. NVLink

Link                 Today               NVLink
GPU to CPU           PCIe, 16 GB/s       NVLink, 80 GB/s
GPU to GPU memory    GDDR5, 250 GB/s     Stacked Memory, 1 TB/s
CPU to DDR memory    DDR4, 50-75 GB/s    DDR4, 50-75 GB/s

Connecting with CPUs via NVLink
Pascal:
NVLink, 80 GB/s, ARM64 | Power8+
Kepler: PCIe, 16 GB/s, x86 | ARM64 | Power8
[Figure: timeline, 2014 to 2016]

Tesla SXM 2.0*: 3x Performance Density
[Figure: module with dimension labels 140 mm and 80 mm]
*Marketing Code Name. Name is not final.

SXM 2.0*: Double Performance Per Node
[Figure: node layouts, with eight blocks labeled CPU]

Mobile Roadmap Meets GeForce
[Figure: advancements over time. GEFORCE ARCHITECTURE: Tesla, Fermi, Kepler, Maxwell. MOBILE ARCHITECTURE: Tegra 3, Tegra 4, Tegra K1]

Tegra K1
GPU: Kepler GPU (192 CUDA Cores); OpenGL 4.4, OpenGL ES 3.0, DX11, CUDA 6
CPU: Quad Core Cortex A15 “r3” with 5th Battery-Saver Core; 2 MB L2 cache
CAMERA: Dual High Performance ISP; 1.2 Gigapixel throughput, 100MP sensor
POWER: Lower Power; 28HPM, Battery Saver Core
DISPLAY: 4K panel, 4K HDMI; DSI, eDP, LVDS, High Speed HDMI 1.4a
[Figure: block diagram. Labels: ARM7, Battery Saver Core, Kepler GPU, 2x ISP, 2160p30 video encoder, 2160p30 video decoder, security engine, dual display, audio, HDMI, USB 3.0, MIPI DSI/CSI/HSI, eMMC 4.5, DDR3L/LPDDR2/LPDDR3, UART, SPI, SDIO, I2S, I2C]

The Most Advanced GPU Comes to Mobile
GEFORCE TITAN and TEGRA 5 share: Kepler Graphics, OpenGL ES 3.0, OpenGL 4.4, DX11, Tessellation, CUDA 6.0
Power: GEFORCE TITAN 250 W; TEGRA 5 <2 W

Facial Recognition

JETSON TK1
THE WORLD’S 1st EMBEDDED SUPERCOMPUTER
Development Platform for Embedded Computer Vision, Robotics, Medical
192 Cores · 326 GFLOPS
CUDA Enabled
Available Now

Brazil vs. Other BRICS on Top 500 (ERAD 2012)
[Figure: values as extracted, country labels not recoverable: 3, 1, 5, 5, 68]

Brazil vs. Other BRICS on Top 500 (HPC Advisory Council 2014)
[Figure: values as extracted, country labels not recoverable: Top: 156, 3, 0, Top: 37, Top: 44, Top: 1, 5, 12, 63]

Questions?
Thank you!

Machine Learning using Deep Neural Networks
[Figure: network diagram from Input to Result]
Hinton et al., 2006; Bengio et al., 2007; Bengio & LeCun, 2007; Lee et al., 2008; 2009
Visual Object Recognition Using Deep Convolutional Neural Networks, Rob Fergus (New York University / Facebook)

Google “Brain Project”
Building High-level Features Using Large Scale Unsupervised Learning
Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A.
Ng (Stanford / Google)
1 billion connections
10 million 200x200 pixel images
1,000 servers (16,000 cores)
3 days to train

Accelerating Machine Learning
Deep Learning with COTS HPC Systems
A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro (Stanford / NVIDIA • ICML 2013)
“Now You Can Build Google’s $1M Artificial Brain on the Cheap” (Wired)
“10 Billion Parameter Neural Networks In Your Basement”, Adam Coates

              GOOGLE BRAIN                 STANFORD AI LAB
Servers       1,000 CPU Servers            3 GPU-Accelerated Servers
Processors    2,000 CPUs • 16,000 cores    12 GPUs • 18,432 cores
Power         600 kW                       4 kW
Cost          $5,000,000                   $33,000

Machine Learning Comes of Age
Image Detection
Face Recognition
Gesture Recognition
Video Search & Analytics
Speech Recognition & Translation
Recommendation Engines
Indexing & Search

GPUs Accelerate Machine Learning & Data Analytics
Talks at GTC:
Auto Tagging in Creative Cloud
Speech/Image Recognition
Hadoop-based Clustering
Recommendation Engine
Database Queries
Search Ranking

Unified Memory: Dramatically Lower Developer Effort
Developer View Today: System Memory and GPU Memory
Developer View With Unified Memory: Unified Memory

Super Simplified Memory Management Code

CPU Code

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);        /* host allocation */
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }

CUDA 6 Code with Unified Memory

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);     /* one pointer, usable on CPU and GPU */
        fread(data, 1, N, fp);
        qsort<<<...>>>(data,N,1,compare);
        cudaDeviceSynchronize();         /* wait for the GPU before the CPU reads data */
        use_data(data);
        cudaFree(data);
    }

Unprecedented Value to Scientific Computing
AMBER Molecular Dynamics Simulation, DHFR NVE Benchmark
64 Sandy Bridge CPUs: 58 ns/day
1 Tesla K40 GPU: 102 ns/day

Accelerating Mainstream Datacenters
Segments: Oil & Gas, Higher Ed, Government, Supercomputing, Finance, Web 2.0
Named sites: Chinese Academy of Sciences, Air Force Research Laboratory, Swiss National Supercomputing Centre, Tokyo Institute of Technology, Naval Research Laboratory
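
Returning to the Unified Memory code slide: the kernel launch there is elided (`qsort<<<...>>>`), so it cannot be compiled as shown. The following is a minimal sketch of the same pattern (allocate once with `cudaMallocManaged`, write on the CPU, launch a kernel on the same pointer, synchronize, read on the CPU). The kernel `increment` and the helper `count_mismatches` are hypothetical names introduced only for illustration; they are not from the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the slides): adds 1 to every element.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: returns the number of wrong elements,
// or -1 if a CUDA call fails.
int count_mismatches(int n) {
    int *data = nullptr;

    // One managed allocation, accessible from both CPU and GPU code.
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) return -1;

    // The CPU writes through the pointer directly; no explicit copy.
    for (int i = 0; i < n; ++i) data[i] = i;

    // The GPU uses the very same pointer.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    increment<<<blocks, threads>>>(data, n);

    // Wait for the kernel to finish before the CPU reads the data again.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(data);
        return -1;
    }

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] != i + 1) ++mismatches;

    cudaFree(data);
    return mismatches;
}

int main() {
    int bad = count_mismatches(1 << 20);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Assuming a CUDA 6 or later toolkit and a GPU that supports managed memory, this should build with something like `nvcc unified.cu -o unified` (the file name is arbitrary). As on the slide, the `cudaDeviceSynchronize()` call is what makes it safe for the CPU to touch the data after the kernel launch.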