Opening Remarks
朝海 孝, Executive Officer and General Manager, High-End Systems Business Unit, Systems Hardware Division, IBM Japan, Ltd.

IBM Enterprise Computing: Leading Digital Transformation
• No. 1 in customer satisfaction in both the mainframe and enterprise-server categories: IBM z Systems and IBM Power Systems
• "Minsky" (IBM Power Systems S822LC for High Performance Computing): NVIDIA Tesla P100, connecting with NVLink

Today's Sessions
• Sumit Gupta, Vice President, High Performance Computing & Analytics, IBM Corporation
• Marc Hamilton, Vice President of Solutions Architecture and Engineering, NVIDIA Corporation
• 森下 亨, Associate Professor, Graduate School of Informatics and Engineering, The University of Electro-Communications

High Performance Computing & Data Analytics with POWER
Sumit Gupta, VP, High-Performance Computing & Analytics, IBM Systems, October 2016

Today's Challenges Demand Innovation
• Data holds competitive value: data is growing toward 44 zettabytes by 2020, with unstructured data far outpacing structured data
• Price/performance gains from Moore's Law are flattening ("you are here")
• Full-system and full-stack open innovation is required, spanning processor technology, firmware/OS, accelerators, software, storage, and network
(© 2016 OpenPOWER Foundation)

200B Core Hours of Lost Science
Data-center throughput is the most important thing for HPC.
[Chart: NSF XSEDE supercomputing resources, 2009-2015: computing resources requested vs. computing resources available, in billions of normalized units; the gap widens every year. Source: https://portal.xsede.org/#/gallery]
NU = Normalized Computing Units, used to compare compute resources across supercomputers; based on each system's High Performance LINPACK result.

High Performance Data Analytics: HPC to HPDA
Applications driving the need for increased performance:
• High-performance (scientific) computing
• Deep learning
• In-memory databases
The collision of HPC and HPDA is a disruptive force impacting the modern enterprise datacenter.

OpenPOWER: Open Architecture for HPC & Analytics
• Processors: the latest POWER CPUs and Tesla GPUs, maximizing performance per system
• Open interfaces: tight accelerator integration using NVIDIA NVLink and CAPI
• Systems and software:
ecosystem partners building OpenPOWER servers with open-source software, including the firmware and hypervisor

250+ OpenPOWER Foundation Members
Spanning implementation, HPC and research; software; system integration; I/O, storage, and acceleration; boards and systems; and chips and SoCs.

Introducing the New "POWER8 with NVLink" Chip
The first CPU designed for accelerated computing.
• High-performance cores and a fast, large memory system: faster cores than x86, and larger caches per core than x86
• Fast PowerAccel interconnects for accelerators: CAPI, NVLink, and PCIe
• 5x faster data communication between POWER8 and GPUs

Pascal Tesla P100
The most advanced data-center GPU for strong-scale HPC.
• 21.2 TF half precision, 10.6 TF single precision, 5.3 TF double precision
• New deep learning instructions; more registers and cache per SM
• NVLink GPU interconnect for maximum scalability
• CoWoS with HBM2: up to 720 GB/s of bandwidth and up to 16 GB of capacity, with ECC at full performance and capacity
• Page Migration Engine (for NVLink-based Tesla P100 servers): unified memory for simpler parallel programming, virtually unlimited data sizes, and performance with data locality

Introducing the S822LC Power System for HPC
The first custom-built GPU accelerator server with NVLink.
• High-speed NVLink connections between CPU and GPU, and from GPU to GPU
• Optimized for the NVIDIA Tesla P100 GPU
• Bottleneck-free data movement from system memory to the GPU

Minsky: Four Pascal GPUs NVLink-Attached to Two POWER8 CPUs
• NVLink between the CPUs and GPUs enables fast access to large data sets held in system memory
• Two NVLink connections (2 × 40 GB/s) between each GPU and its CPU, and between the GPUs in each pair, speed data exchange
• Per-socket system memory bandwidth is 115 GB/s, per-GPU HBM2 bandwidth is 720 GB/s, and the CPU-to-CPU link provides 25+ GB/s
https://www.ibm.com/blogs/systems/ibmnvidia-present-nvlink-server-youve-waiting/

Deep Learning on POWER Servers
• All major deep learning frameworks have been ported to POWER: Caffe, TensorFlow, Torch,
Chainer, Theano, and more, available as source and (soon) as binaries
• A POWER8 server with 4× P100 GPUs is 2.2x faster than a server with 4× M40 GPUs
[Chart: training time in minutes, AlexNet with Caffe to 50% top-1 accuracy, 4× M40 / PCIe vs. 4× P100 / NVLink (2.2x faster); scale 0-140 minutes]
• Optimization for CPU-GPU and GPU-GPU NVLink is under way; for early results, see today's talk by IBM Tokyo Research Lab

2,300+ Linux Applications on POWER
• HPC: CHARMM, GROMACS, NAMD, AMBER, RTM, GAMESS, WRF, HYCOM, HOMME, LES, MiniGhost, AMG2013, OpenFOAM, miniDFT, CTH, BLAST, Bowtie, BWA, FASTA, HMMER, GATK, SOAP3, STAC-A2, SHOC, Graph500, ILOG
• Plus cloud, big data and machine learning, mobile, and enterprise workloads, on the major Linux distributions

AI and HPDA Acceleration

The AI Race Is On
[Timeline, 2010-2015: IBM Watson wins Jeopardy!; Google Brain; NVIDIA cuDNN; ImageNet; the Google car passes 1M miles; Caffe, Torch, and Theano; Microsoft Azure ML and CNTK; Amazon ML; Google TensorFlow; Facebook Big Sur; Toyota's $1B AI lab; OpenAI; machine learning beats humans on ImageNet]

What Is Deep Learning? (In One Slide)
Models are trained on vast amounts of historical data, and the network weights learn to map inputs to outputs; successive layers build up a hierarchy of features, from raw data through low-level and mid-level features. These trained models are then deployed to take new inputs and produce outputs (called inference).
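The training/inference split above can be sketched in a few lines of NumPy; the tiny two-layer network and its random weights are illustrative stand-ins, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training" has already happened: the learned weights are fixed matrices.
# Here they are random stand-ins for weights learned from historical data.
W1 = rng.standard_normal((4, 8))   # input (8 features) -> hidden (4 units)
W2 = rng.standard_normal((2, 4))   # hidden -> output (2 class scores)

def infer(x):
    """Inference: map a new input to an output using fixed weights.

    Each layer is exactly the matrix-vector product the slide mentions,
    followed by a simple nonlinearity (ReLU).
    """
    h = np.maximum(W1 @ x, 0.0)    # lower-level features
    return W2 @ h                  # class scores

x_new = rng.standard_normal(8)     # a new, unseen input
scores = infer(x_new)
print(scores.shape)                # (2,): one score per class
```

Training would adjust `W1` and `W2` against historical input/output pairs; deployment simply calls `infer` on new data.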
At the top of the hierarchy sit high-level features. Deep learning is the neural-network branch of machine learning, and its workload is dominated by matrix-vector products and convolutions.

NVIDIA GPU with the POWER CPU: The Engine of Deep Learning
• POWER+GPU enabled: Watson, Chainer, TensorFlow, CNTK
• Pre-built binaries for POWER+GPU available: Theano, MatConvNet, Torch, Caffe
• All built on the NVIDIA CUDA accelerated computing platform
http://openpowerfoundation.org/blogs/openpower-deep-learning-distribution/

Trial on Chainer for the Next Step
DNNs are becoming deeper and larger, so an out-of-core methodology must be considered (*):
• Network-wise memory allocation: keep all data in GPU memory (16 GB on the P100)
• Layer-wise memory allocation: swap data in from, and out to, host memory (out-of-core)
(*) Minsoo et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design", https://arxiv.org/pdf/1602.08124.pdf

Chainer Deep Learning: POWER8-P100 CPU-GPU NVLink Shows a 1.3x Speedup
In-memory (all data kept in GPU memory) vs. swap-in/out (out-of-core) with the Chainer out-of-core branch (*); VGG16, batch size 32, cuDNN 5.1:
[Chart: per-iteration time in ms for K40 (PCIe3), M6000 (PCIe3), P100 (PCIe3), and P100 (POWER8 with 2× NVLink at 80 GB/s). On the PCIe3-attached P100, swapping adds 133% communication overhead; on the NVLink-attached P100 the overhead falls to 38% (500 ms swapped vs. 363 ms in memory).]
(*) Prepared and tested by A. Naruse, NVIDIA DevTech: https://github.com/anaruse/chainer/tree/test.out-of-core

GPU-Based In-Memory Databases
A traditional database lives on (many) disks. Databases can be made faster by moving them into memory, and GPUs accelerate that trend by using even faster parallel GPU memory.
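The geospatial queries benchmarked later in the deck are, at heart, branch-free filters over columnar arrays, which is why they map so well onto thousands of GPU cores. A minimal CPU-side sketch of a bounding-box query in NumPy (the column names, record count, and box coordinates are illustrative, not taken from the benchmark):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000                      # small stand-in for the 1B-record benchmark

# Columnar in-memory layout: one contiguous array per column.
lon = rng.uniform(-180.0, 180.0, n)
lat = rng.uniform(-90.0, 90.0, n)

def bounding_box_count(lon, lat, lon_min, lon_max, lat_min, lat_max):
    """Count records inside a geospatial bounding box.

    One branch-free, element-wise pass over each column; on a GPU the
    same comparisons run in parallel across thousands of cores, and the
    scan speed is bounded by memory bandwidth.
    """
    inside = ((lon >= lon_min) & (lon <= lon_max) &
              (lat >= lat_min) & (lat <= lat_max))
    return int(inside.sum())

hits = bounding_box_count(lon, lat, 135.0, 140.0, 34.0, 36.0)
print(hits)
```

Max/min and histogram queries reduce over columns in the same one-pass, data-parallel style, which is why all three benchmarks reward memory bandwidth so heavily.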
[Chart: approximate data throughput, 0-1000 GB/s, rising from a single disk, through an Ethernet link, a dual-socket commodity server, and a dual-socket POWER8 server, to a single P100 GPU.] Fewer servers with more bandwidth means better scaling.

Minsky: POWER8 System Bandwidth Delivers Peak DB Performance
[Chart: available server-level memory bandwidth, Minsky vs. a commodity server; scale 0-3500 GB/s]

Kinetica Architecture
• Reliable, available, and scalable: memory-based with disk-based persistence, linear scalability, and an HA architecture
• Performance: GPU-accelerated (thousands of cores per GPU), ingesting billions of records per minute
• Massive data sizes: tens of terabytes in memory, billions of entries
• Rich APIs: C++, Node.js, Python, Java, JavaScript, and REST, plus ODBC/JDBC
• Open-source integration: Apache NiFi, Apache Kafka, Apache Spark connectors, and Apache Storm
• Standard geospatial capabilities: geometric objects, tracks, geospatial endpoints, WMS, and WKT
• Visualization via ODBC/JDBC
• Cluster layout: HTTP head nodes in front of columnar in-memory stores with disk persistence, running on commodity hardware with GPUs; scales on demand

GPU Database Performance Benchmarks
• Bounding box on 1B records: 561x faster on average than MemSQL and NoSQL
• Max/min on 1B records: 712x faster on average than MemSQL and NoSQL
• Histogram on 1B records: 327x faster on average than MemSQL and NoSQL
• SQL query on 30 nodes: 104x faster than SAP HANA

THANK YOU!
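Speedups of this order are plausible because such queries are memory-bandwidth bound. A back-of-the-envelope check using only figures quoted earlier in the deck (720 GB/s of HBM2 bandwidth per P100, a 1B-record column), with an assumed 8-byte column width:

```python
# Back-of-the-envelope: how fast can one P100 stream a 1B-value column?
records = 1_000_000_000
bytes_per_value = 8                      # assumed column width (e.g. float64)
hbm2_bandwidth = 720e9                   # bytes/s, from the P100 slide

scan_seconds = records * bytes_per_value / hbm2_bandwidth
print(f"{scan_seconds * 1e3:.1f} ms")    # ~11.1 ms for one full-column pass
```

So a single bandwidth-limited pass such as a max/min completes in roughly 11 ms per GPU, which is consistent with the hundreds-fold gaps reported against disk- and CPU-bound systems.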
Closing Remarks
朝海 孝, Executive Officer and General Manager, High-End Systems Business Unit, Systems Hardware Division, IBM Japan, Ltd.

Today's Summary
• IBM (Sumit Gupta): the arrival of open-standard, general-purpose servers
• NVIDIA (Marc Hamilton): outstanding performance for deep learning and in-memory databases
• 森下 亨 (The University of Electro-Communications): performance results from running large-scale matrix computations in a POWER + GPGPU environment

Contributing to Human Progress and the Future of Society
• Ultra-fast scientific computing: developing new materials and technologies through chemistry and materials science
• HPDA delivered by fast in-memory databases: maximizing business value from data
• Cognitive computing: creating social and economic value with deep learning technology

THANK YOU!