Opening Remarks
朝海 孝, Executive Officer and General Manager, High-End Systems Business Unit, Systems Hardware Division, IBM Japan, Ltd.

IBM Enterprise Computing: Leading Digital Transformation
• No. 1 in customer satisfaction in both the mainframe and enterprise-server categories: IBM z Systems and IBM Power Systems
• "Minsky" (IBM Power Systems S822LC for High Performance Computing): NVIDIA Tesla P100, connecting with NVLink

Today's Sessions
• Sumit Gupta, Vice President, High Performance Computing & Analytics, IBM Corporation
• Marc Hamilton, Vice President of Solutions Architecture and Engineering, NVIDIA Corporation
• 森下 亨, Associate Professor, Graduate School of Informatics and Engineering, The University of Electro-Communications

High Performance Computing & Data Analytics with POWER
Sumit Gupta, VP, High-Performance Computing & Analytics, IBM Systems, October 2016

Today's Challenges Demand Innovation
• Data holds competitive value: data is growing toward 44 zettabytes by 2020, with unstructured data far outpacing structured data
• Price/performance gains from Moore's Law are flattening ("you are here")
• Full-system and full-stack open innovation is required, spanning processor technology, firmware/OS, accelerators, software, storage, and network
(© 2016 OpenPOWER Foundation)

200B Core Hours of Lost Science
Data-center throughput is the most important thing for HPC.
[Chart: NSF XSEDE supercomputing resources, 2009-2015: computing resources requested vs. computing resources available, in billions of normalized units; the gap widens every year. Source: https://portal.xsede.org/#/gallery]
NU = Normalized Computing Units, used to compare compute resources across supercomputers; based on each system's High Performance LINPACK result.

High Performance Data Analytics: HPC to HPDA
Applications driving the need for increased performance:
• High-performance (scientific) computing
• Deep learning
• In-memory databases
The collision of HPC and HPDA is a disruptive force impacting the modern enterprise datacenter.

OpenPOWER: Open Architecture for HPC & Analytics
• Processors: the latest POWER CPUs and Tesla GPUs, maximizing performance per system
• Open interfaces: tight accelerator integration using NVIDIA NVLink and CAPI
• Systems and software:
ecosystem partners building OpenPOWER servers with open-source software, including the firmware and hypervisor

250+ OpenPOWER Foundation Members
Spanning implementation, HPC and research; software; system integration; I/O, storage, and acceleration; boards and systems; and chips and SoCs.

Introducing the New "POWER8 with NVLink" Chip
The first CPU designed for accelerated computing.
• High-performance cores and a fast, large memory system: faster cores than x86, and larger caches per core than x86
• Fast PowerAccel interconnects for accelerators: CAPI, NVLink, and PCIe
• 5x faster data communication between POWER8 and GPUs

Pascal Tesla P100
The most advanced data-center GPU for strong-scale HPC.
• 21.2 TF half precision, 10.6 TF single precision, 5.3 TF double precision
• New deep learning instructions; more registers and cache per SM
• NVLink GPU interconnect for maximum scalability
• CoWoS with HBM2: up to 720 GB/s of bandwidth and up to 16 GB of capacity, with ECC at full performance and capacity
• Page Migration Engine (for NVLink-based Tesla P100 servers): unified memory for simpler parallel programming, virtually unlimited data sizes, and performance with data locality

Introducing the S822LC Power System for HPC
The first custom-built GPU accelerator server with NVLink.
• High-speed NVLink connections between CPU and GPU, and from GPU to GPU
• Optimized for the NVIDIA Tesla P100 GPU
• Bottleneck-free data movement from system memory to the GPU

Minsky: Four Pascal GPUs NVLink-Attached to Two POWER8 CPUs
• NVLink between the CPUs and GPUs enables fast access to large data sets held in system memory
• Two NVLink connections (2 × 40 GB/s) between each GPU and its CPU, and between the GPUs in each pair, speed data exchange
• Per-socket system memory bandwidth is 115 GB/s, per-GPU HBM2 bandwidth is 720 GB/s, and the CPU-to-CPU link provides 25+ GB/s
https://www.ibm.com/blogs/systems/ibmnvidia-present-nvlink-server-youve-waiting/

Deep Learning on POWER Servers
• All major deep learning frameworks have been ported to POWER: Caffe, TensorFlow, Torch,
Chainer, Theano, and more, available as source and (soon) as binaries
• A POWER8 server with 4× P100 GPUs is 2.2x faster than a server with 4× M40 GPUs
[Chart: training time in minutes, AlexNet with Caffe to 50% top-1 accuracy, 4× M40 / PCIe vs. 4× P100 / NVLink (2.2x faster); scale 0-140 minutes]
• Optimization for CPU-GPU and GPU-GPU NVLink is under way; for early results, see today's talk by IBM Tokyo Research Lab

2,300+ Linux Applications on POWER
• HPC: CHARMM, GROMACS, NAMD, AMBER, RTM, GAMESS, WRF, HYCOM, HOMME, LES, MiniGhost, AMG2013, OpenFOAM, miniDFT, CTH, BLAST, Bowtie, BWA, FASTA, HMMER, GATK, SOAP3, STAC-A2, SHOC, Graph500, ILOG
• Plus cloud, big data and machine learning, mobile, and enterprise workloads, on the major Linux distributions

AI and HPDA Acceleration

The AI Race Is On
[Timeline, 2010-2015: IBM Watson wins Jeopardy!; Google Brain; NVIDIA cuDNN; ImageNet; the Google car passes 1M miles; Caffe, Torch, and Theano; Microsoft Azure ML and CNTK; Amazon ML; Google TensorFlow; Facebook Big Sur; Toyota's $1B AI lab; OpenAI; machine learning beats humans on ImageNet]

What Is Deep Learning? (In One Slide)
Models are trained on vast amounts of historical data, and the network weights learn to map inputs to outputs; successive layers build up a hierarchy of features, from raw data through low-level and mid-level features. These trained models are then deployed to take new inputs and produce outputs (called inference).
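The training/inference split above can be sketched in a few lines of NumPy; the tiny two-layer network and its random weights are illustrative stand-ins, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training" has already happened: the learned weights are fixed matrices.
# Here they are random stand-ins for weights learned from historical data.
W1 = rng.standard_normal((4, 8))   # input (8 features) -> hidden (4 units)
W2 = rng.standard_normal((2, 4))   # hidden -> output (2 class scores)

def infer(x):
    """Inference: map a new input to an output using fixed weights.

    Each layer is exactly the matrix-vector product the slide mentions,
    followed by a simple nonlinearity (ReLU).
    """
    h = np.maximum(W1 @ x, 0.0)    # lower-level features
    return W2 @ h                  # class scores

x_new = rng.standard_normal(8)     # a new, unseen input
scores = infer(x_new)
print(scores.shape)                # (2,): one score per class
```

Training would adjust `W1` and `W2` against historical input/output pairs; deployment simply calls `infer` on new data.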
At the top of the hierarchy sit high-level features. Deep learning is the neural-network branch of machine learning, and its workload is dominated by matrix-vector products and convolutions.

NVIDIA GPU with the POWER CPU: The Engine of Deep Learning
• POWER+GPU enabled: Watson, Chainer, TensorFlow, CNTK
• Pre-built binaries for POWER+GPU available: Theano, MatConvNet, Torch, Caffe
• All built on the NVIDIA CUDA accelerated computing platform
http://openpowerfoundation.org/blogs/openpower-deep-learning-distribution/

Trial on Chainer for the Next Step
DNNs are becoming deeper and larger, so an out-of-core methodology must be considered (*):
• Network-wise memory allocation: keep all data in GPU memory (16 GB on the P100)
• Layer-wise memory allocation: swap data in from, and out to, host memory (out-of-core)
(*) Minsoo et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design", https://arxiv.org/pdf/1602.08124.pdf

Chainer Deep Learning: POWER8-P100 CPU-GPU NVLink Shows a 1.3x Speedup
In-memory (all data kept in GPU memory) vs. swap-in/out (out-of-core) with the Chainer out-of-core branch (*); VGG16, batch size 32, cuDNN 5.1:
[Chart: per-iteration time in ms for K40 (PCIe3), M6000 (PCIe3), P100 (PCIe3), and P100 (POWER8 with 2× NVLink at 80 GB/s). On the PCIe3-attached P100, swapping adds 133% communication overhead; on the NVLink-attached P100 the overhead falls to 38% (500 ms swapped vs. 363 ms in memory).]
(*) Prepared and tested by A. Naruse, NVIDIA DevTech: https://github.com/anaruse/chainer/tree/test.out-of-core

GPU-Based In-Memory Databases
A traditional database lives on (many) disks. Databases can be made faster by moving them into memory, and GPUs accelerate that trend by using even faster parallel GPU memory.
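The geospatial queries benchmarked later in the deck are, at heart, branch-free filters over columnar arrays, which is why they map so well onto thousands of GPU cores. A minimal CPU-side sketch of a bounding-box query in NumPy (the column names, record count, and box coordinates are illustrative, not taken from the benchmark):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000                      # small stand-in for the 1B-record benchmark

# Columnar in-memory layout: one contiguous array per column.
lon = rng.uniform(-180.0, 180.0, n)
lat = rng.uniform(-90.0, 90.0, n)

def bounding_box_count(lon, lat, lon_min, lon_max, lat_min, lat_max):
    """Count records inside a geospatial bounding box.

    One branch-free, element-wise pass over each column; on a GPU the
    same comparisons run in parallel across thousands of cores, and the
    scan speed is bounded by memory bandwidth.
    """
    inside = ((lon >= lon_min) & (lon <= lon_max) &
              (lat >= lat_min) & (lat <= lat_max))
    return int(inside.sum())

hits = bounding_box_count(lon, lat, 135.0, 140.0, 34.0, 36.0)
print(hits)
```

Max/min and histogram queries reduce over columns in the same one-pass, data-parallel style, which is why all three benchmarks reward memory bandwidth so heavily.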
[Chart: approximate data throughput, 0-1000 GB/s, rising from a single disk, through an Ethernet link, a dual-socket commodity server, and a dual-socket POWER8 server, to a single P100 GPU.] Fewer servers with more bandwidth means better scaling.

Minsky: POWER8 System Bandwidth Delivers Peak DB Performance
[Chart: available server-level memory bandwidth, Minsky vs. a commodity server; scale 0-3500 GB/s]

Kinetica Architecture
• Reliable, available, and scalable: memory-based with disk-based persistence, linear scalability, and an HA architecture
• Performance: GPU-accelerated (thousands of cores per GPU), ingesting billions of records per minute
• Massive data sizes: tens of terabytes in memory, billions of entries
• Rich APIs: C++, Node.js, Python, Java, JavaScript, and REST, plus ODBC/JDBC
• Open-source integration: Apache NiFi, Apache Kafka, Apache Spark connectors, and Apache Storm
• Standard geospatial capabilities: geometric objects, tracks, geospatial endpoints, WMS, and WKT
• Visualization via ODBC/JDBC
• Cluster layout: HTTP head nodes in front of columnar in-memory stores with disk persistence, running on commodity hardware with GPUs; scales on demand

GPU Database Performance Benchmarks
• Bounding box on 1B records: 561x faster on average than MemSQL and NoSQL
• Max/min on 1B records: 712x faster on average than MemSQL and NoSQL
• Histogram on 1B records: 327x faster on average than MemSQL and NoSQL
• SQL query on 30 nodes: 104x faster than SAP HANA

THANK YOU!
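Speedups of this order are plausible because such queries are memory-bandwidth bound. A back-of-the-envelope check using only figures quoted earlier in the deck (720 GB/s of HBM2 bandwidth per P100, a 1B-record column), with an assumed 8-byte column width:

```python
# Back-of-the-envelope: how fast can one P100 stream a 1B-value column?
records = 1_000_000_000
bytes_per_value = 8                      # assumed column width (e.g. float64)
hbm2_bandwidth = 720e9                   # bytes/s, from the P100 slide

scan_seconds = records * bytes_per_value / hbm2_bandwidth
print(f"{scan_seconds * 1e3:.1f} ms")    # ~11.1 ms for one full-column pass
```

So a single bandwidth-limited pass such as a max/min completes in roughly 11 ms per GPU, which is consistent with the hundreds-fold gaps reported against disk- and CPU-bound systems.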
Closing Remarks
朝海 孝, Executive Officer and General Manager, High-End Systems Business Unit, Systems Hardware Division, IBM Japan, Ltd.

Today's Summary
• IBM (Sumit Gupta): the arrival of open-standard, general-purpose servers
• NVIDIA (Marc Hamilton): outstanding performance for deep learning and in-memory databases
• 森下 亨 (The University of Electro-Communications): performance results from running large-scale matrix computations in a POWER + GPGPU environment

Contributing to Human Progress and the Future of Society
• Ultra-fast scientific computing: developing new materials and technologies through chemistry and materials science
• HPDA delivered by fast in-memory databases: maximizing business value from data
• Cognitive computing: creating social and economic value with deep learning technology

THANK YOU!