Latest Chip and Application Development Status in the Era of In Silico Drug Discovery

Solutions Architect
郡司 茂樹
[email protected]
BioGrid Research Meeting 2015
Product Portfolio Supporting Advances in the Life Sciences
Intel® True Scale
Intel® Omni-Path
Intel® 10/40GbE
Intel® Xeon®
Intel® Xeon Phi™
Intel® iGFX
Intel® Xeon®
Intel® Xeon Phi™
Iris Pro™ Graphics
Embree Ray-Tracing
STORAGE
COMPUTE
FABRIC
NETWORKING
VISUALIZATION
Intel® Software Developer Tools
Intel® Cluster Ready
Boards/Systems
Intel® Data Center Manager
IA Programming Model & Code Base
The Broadest Technical Computing Ecosystem
Intel® Lustre*
Intel® SSD/NVMe
RAID Controller
Intel® Xeon®
Intel® 10/40GbE
Intel® Switch Si
Intel® Xeon® Processor
E5 Family
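The foundation of Intel HPC performance.
Ideal for nearly the full range of workloads
Industry-leading performance and performance per watt
A general-purpose processor for serial and parallel workloads,
with a mainstream core count and
a focus on fast serial performance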
www.intel.com/xeon
Intel® Xeon® Processor E5 Family
Deep learning is a piece of cake, too
Intel® Xeon Phi™
Coprocessor 7120P
www.intel.com/xeonphi
61 Cores, 244 Threads
1.238 GHz
1.21 TFLOPS (peak double-precision floating-point performance)
512-bit SIMD instructions
32 vector registers per thread
16 GB GDDR5 memory, 352 GB/s
300 W (passive cooling)
PCIe x16 (requires an IA host processor)
22nm with the world’s first
3-D Tri-Gate transistors
Linux* operating system
IP addressable
Common x86/IA
Programming Models and SW-Tools
Is Xeon Phi† performance compelling Vs Xeon† E5v2?
“2-socket Xeon E5v2 system” Vs “2-socket Xeon E5v2 system + Xeon Phi 7120”
http://www.intel.com/performance
Xeon Phi delivers up to 165% higher performance (with 1 card) versus 2-socket Xeon E5v2
† Xeon = Intel® Xeon® processor
† Xeon Phi = Intel® Xeon Phi™ coprocessor
Intel® Xeon Phi™ Product Family
Knights Corner: 1 TFLOPS (footnote 1)
Knights Landing: 3+ TFLOPS (footnote 2)
- Processor (bootable)
- High-bandwidth memory on package
- Integrated interconnect
- First commercial systems: 2H'15
- >50 systems providers expected (footnote 3), plus many more card-based systems
Knights Hill: 3rd-generation Intel® Xeon Phi™ Product Family
- 2nd-generation Intel Omni-Path Architecture
- 10 nm process technology
Intel® Xeon Phi™ Coprocessor – Applications and Solutions Catalog
>100 PFLOPS customer system compute commits to date (footnote 3)
1 Claim based on calculated theoretical peak double precision performance capability for a single coprocessor. 16 DP FLOPS/clock/core * 61 cores * 1.238 GHz = 1.208 TeraFLOPS
2 Over 3 teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating point operations per cycle. FLOPS = cores x clock frequency x floating-point operations per cycle.
3 Intel internal estimate
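The peak figure in footnote 1 follows from the formula FLOPS = cores x clock frequency x floating-point operations per cycle. A minimal Python sketch of that arithmetic, using the 7120P figures listed on the coprocessor slide (61 cores, 1.238 GHz) and the 16 double-precision FLOPS per clock per core from the footnote; the function name is illustrative only:

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOPS: cores x clock (GHz) x FLOPS per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


# Figures quoted on the Xeon Phi 7120P slide and in footnote 1.
knc = peak_gflops(cores=61, clock_ghz=1.238, flops_per_cycle=16)
print(f"{knc / 1000:.3f} TFLOPS")  # prints 1.208 TFLOPS
```

This is a theoretical ceiling only; the measured application speedups later in the deck are far below the ratio of peak FLOPS.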
Intel® Omni-Path Architecture
- High system scalability: 48-port switch chip architecture (vs. 36 in InfiniBand)
- High application performance scalability: 100 Gbps line speed
- 56% lower latency than InfiniBand (footnote 4; lower is better)
- Small clusters, high port density: 48 ports support up to 12 additional nodes by adding only cables (footnote 1); 1.3x; maximize single-switch investment
- Mainstream clusters, fewer switches (footnote 2): up to ½
- Supercomputers, scalable: over 27k nodes in a 2-tier, 5-hop fabric (footnote 3); 2.3x
Coming 2H'15
www.intel.com/omnipath
1 As compared to a shipping 36-port edge InfiniBand switch.
2 The claim of up to ½ fewer switches is based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters.
3 2.3X is based on 27,648 nodes in a cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes.
4 Latency reductions are based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches compared to preliminary Intel simulations for Intel® Omni-Path switches, based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration (2-tier, 5 total switch hops), using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and are provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
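The node counts quoted in the footnotes (27,648 for 48-port ASICs, 11,664 for 36-port ASICs) match the usual capacity formula for a full-bisection fat tree, 2 x (k/2)^L end nodes for switch radix k and L levels of switch ASICs, with L = 3 (five switch hops end to end). That the slide's numbers were derived this way is an assumption; the sketch below only checks the arithmetic, and the function name is illustrative.

```python
def fat_tree_max_nodes(radix: int, levels: int) -> int:
    """Maximum end nodes of a full-bisection fat tree: 2 * (radix / 2) ** levels."""
    half = radix // 2
    return 2 * half ** levels


opa = fat_tree_max_nodes(48, 3)  # 48-port switch ASIC, three ASIC levels (assumed)
ib = fat_tree_max_nodes(36, 3)   # 36-port switch ASIC, three ASIC levels (assumed)
print(opa, ib, round(opa / ib, 2))  # prints 27648 11664 2.37
```

Under the same formula, two levels give 1,152 nodes at radix 48 but only 648 at radix 36, which would be consistent with a 1024-node configuration needing an extra switch tier with 36-port ASICs.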
Intel® EE for Lustre
Connectivity with Hadoop
Open source
Intel proprietary extensions
Breakdown of connectors to Hadoop distributions
Lustre that can connect to Hadoop
ANL Selects Intel for World’s Biggest Supercomputer
2-system CORAL award extends IA leadership in extreme scale HPC
Aurora (Argonne National Laboratory): >180PF, April ’15
Theta (Argonne National Laboratory): >8.5PF
Trinity (NNSA†): >40PF, July ’14
Cori (NERSC‡): >30PF, April ’14
>$200M
‡ Cray* XC* Series at National Energy Research Scientific Computing Center (NERSC).
† Cray XC Series at National Nuclear Security Administration (NNSA).
The Most Advanced Supercomputer Ever Built
An Intel-led collaboration with ANL and Cray to accelerate discovery & innovation
>180 PFLOPS
(option to increase up to 450 PF)
>50,000 nodes
13MW
2018 delivery
18X higher performance†
>6X more energy efficient†
Prime Contractor
Subcontractor
Source: Argonne National Laboratory and Intel.
†Comparison of theoretical peak double precision FLOPS and power consumption to ANL’s largest current system, MIRA (10PFs and 4.8MW)
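The 18X and >6X claims can be checked from the figures on this slide (Aurora at 180 PFLOPS and 13 MW, MIRA at 10 PFLOPS and 4.8 MW). A minimal sketch, with an illustrative function name, assuming energy efficiency is taken as peak PFLOPS per MW:

```python
def compare_systems(new_pf: float, new_mw: float, old_pf: float, old_mw: float):
    """Return (performance ratio, energy-efficiency ratio) of new vs. old system."""
    performance = new_pf / old_pf
    efficiency = (new_pf / new_mw) / (old_pf / old_mw)  # PFLOPS per MW, assumed metric
    return performance, efficiency


perf, eff = compare_systems(new_pf=180, new_mw=13, old_pf=10, old_mw=4.8)
print(f"{perf:.0f}X performance, {eff:.1f}X energy efficiency")
```

Both ratios use theoretical peak values, as the slide's footnote states, so they say nothing about sustained application performance.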
“Intel’s leading technology & product provide
great high performance computing power
which enable us achieve more genome
scientific research success for genome
application development for China and for the
whole human being.”
Wang Bingqiang
Head of High Performance Computing, BGI
Application Support Status
Life Sciences
AMBER* 14
1 NODE
Particle Mesh Ewald (PME) Tobacco Virus
[Chart: AMBER* 14 PME: Tobacco Virus, 1 Million Atoms. Y-axis: comparative performance. Bar labels: 1, 1.52X, 1.93X, 2X, 2.26X, 2.41X. Series:]
Intel® Xeon® processor E5-2697 v2 (baseline)
Intel® Xeon® processor E5-2697 v2 (optimized)
Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A
Xeon E5-2697 v2 (optimized) + NVIDIA* K40 DPFP
Intel® Xeon® processor E5-2697 v3
Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A
APPROVED FOR PUBLIC PRESENTATION
Application: AMBER* 14
Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org/
Availability:
- Code: Available as a patch.
- Recipe: Available here (Section 18.7 of the manual).
Usage Model:
- The baseline is the Intel® Xeon® processor E5-2697 v2, compared with the Intel® Xeon® processor E5-2697 v2 plus the Intel® Xeon Phi™ coprocessor 7120A.
- Offload processing uses both host and coprocessor, running the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.
Highlights: The code was optimized and delivered to the AMBER community (license holders), and it is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV/
Results: The optimized Intel Xeon processor E5-2697 v3 with Intel Xeon Phi coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel Xeon processor E5-2697 v2. The optimized offload process demonstrated 1.07X increased performance compared to NVIDIA K40* performance.
“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3
For configuration details, go here.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See
benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance
* Other names and brands may be claimed as the property of others.
SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014
AMBER* 14
CLUSTER BENCHMARK
Particle Mesh Ewald (PME) Cellulose NPT
[Chart: AMBER* PME Cellulose NPT (408K Atoms). Y-axis: comparative performance. X-axis: 2 nodes, 3 nodes. Bar labels: 1, 1.11X, 1.14X, 1.32X, 1.37X, 1.57X. Series:]
Intel® Xeon® processor E5-2697 v2 (baseline)
Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A
Xeon E5-2697 v2 + NVIDIA* K40 DPFP
3 NODES
Application: AMBER* 14
Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org/
Availability:
- Code: Available as a patch.
- Recipe: Available here (Section 18.7 of the manual).
Usage Model:
- The baseline is the Intel® Xeon® processor E5-2697 v2 host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks), and the speedup is shown with offload processing on both the Intel Xeon processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A.
- Performance shown is for the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.
Highlights: The code has been optimized, will be delivered to the AMBER community (license holders), and will be available as an update patch during code configuration.
Results: Optimized offload process demonstrated compelling cluster performance improvement, up to 2.6X, over the baseline Intel® Xeon® processor E5-2697 v2.
“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014
1 NODE
GROMACS*
512K H2O with RF
[Chart: GROMACS* 512K H2O with RF Speed Up. Y-axis: comparative performance. Bar labels: 1, 1.03X, 1.56X, 1.72X, 1.79X. Series:]
Intel® Xeon® processor E5-2697 v2
1 Intel® Xeon Phi™ coprocessor 7120P/X
2 Intel® Xeon Phi™ coprocessor 7120P/X
Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120P/X
Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120P/X
For configuration details, go here.
Application: GROMACS* 5.0-RC1; Workload: 512K H2O with RF method
Description: GROMACS is a versatile package for molecular dynamics, i.e., it simulates the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.
Availability:
- Code: Version 5.0-rc1 available here and here.
- Recipe: Available here.
Highlights:
- Highly optimized for Intel® Xeon® processors (AVX intrinsics).
- Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor together with the host processor using a symmetric model.
- Optimized with intrinsics for 512-bit vectorization on Intel Xeon Phi coprocessors.
Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.
SOURCE: INTEL MEASURED RESULTS AS OF APRIL, 2014
CLUSTER BENCHMARK
NWChem*
32 NODES
CCSD(T) Method
[Chart: NWChem* 6.3rev2 and 6.5 CCSD(T) Method 32 Node Speed Up. Y-axis: comparative performance.]
NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2 (baseline): 1
NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2: 1.24X
NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A: 1.52X
Application: NWChem* is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org
Availability:
- Code: Available here and from the SVN repository.
- Recipe: Available here.
Usage Model: Offload using LEO and OpenMP*
Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, including at cluster scale.
Results: Compared to the NWChem* 6.3rev2 and Intel® Xeon® processor E5-2697 v2 baseline:
1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2.
2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi coprocessor 7120A.
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2014
CLUSTER BENCHMARK
NAMD* 2.10 Pre-Release
32 NODES
STMV
[Chart: NAMD* 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms). Y-axis: comparative performance (0 to 30). X-axis: 1 Node, 8 Nodes, 32 Nodes. Bar labels: 1, 1.2X, 2X, 2.1X, 6.8X, 7.9X, 12.2X, 13.1X, 20X, 24.2X, 27.2X, 32X. Series:]
Intel® Xeon® processor E5-2697 v2 (Baseline: 1 node, 23 or 47 PPN)
Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN)
Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T)
Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T)
“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3
Application: NAMD 2.10 pre-release; STMV
Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd/
Availability:
- Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build.
- Recipe: Available here.
Usage Model: Single rank on host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.
Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of the NAMD 2.10 pre-release.
Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014
CLUSTER BENCHMARK
NAMD* 2.10 Pre-Release
2 NODES
ApoA1
[Chart: NAMD* 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms); 55 PPN. Y-axis: comparative performance. X-axis: 1 Node, 2 Nodes. Bar labels: 1, 1.52X, 1.94X, 2.61X. Series:]
Intel® Xeon® processor E5-2697 v3 (Baseline: 1 node, 55 PPN)
Application: NAMD* 2.10 pre-release; ApoA1
Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd/
Availability:
- Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build.
- Recipe: Available here.
Usage Model: Single rank on host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.
Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of the NAMD 2.10 pre-release.
Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.
Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B17110A (240T)
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014
CLUSTER BENCHMARK
LAMMPS*
32 NODES
NEW
Stillinger-Weber Water Benchmark
[Chart: LAMMPS* Liquid Crystal Benchmark Performance (Mixed Precision). Y-axis: comparative performance. X-axis: 1 Node, 32 Nodes. Bar labels: 0.9X, 1, 1, 3X, 3.05X, 3.41X, 3.6X. Chart note: "No testing on Tesla". Series:]
2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline)
2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)
2S Xeon E5-2697 v3 + Tesla K40c*, boost off, ECC on
2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package)
“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3
“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A
Application: LAMMPS*
Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov/
Availability:
- Code: In main LAMMPS repository.
- Recipe: Available here.
Usage Model: Load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.
Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent:
- Data transfer between host and coprocessor.
- Calculations of neighbor-list, non-bond, bond, and long-range terms.
The same routines in the LAMMPS Intel Package also run faster on the CPU.
Results: The simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup and higher node counts.
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2015
CLUSTER BENCHMARK
LAMMPS*
32 NODES
Rhodopsin Benchmark; 512K Atoms
[Chart: LAMMPS* Rhodopsin Benchmark Performance (Mixed Precision). Y-axis: comparative performance. X-axis: 1 Node, 32 Nodes. Bar labels: 1, 1, 1.07X, 1.27X, 1.47X, 1.68X. Series:]
2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline)
2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)
2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A Turbo Off
(LAMMPS IA Package)
Application: LAMMPS*
Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov/
Availability:
- Code: In main LAMMPS repository.
- Recipe: Available here.
Usage Model: Load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.
Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing allows for concurrent:
- Data transfer between host and coprocessor.
- Calculations of neighbor-list, non-bond, bond, and long-range terms.
The same routines in the LAMMPS Intel Package also run faster on the CPU.
Results: Up to 1.68X performance improvement on a single node, compared to the baseline configuration, using Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization. The gain remains 1.47X when scaling up to 32 nodes.
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF AUGUST, 2014
1 NODE
Johns Hopkins Bowtie 2*
NEW
Multiple workloads
[Chart: Johns Hopkins Bowtie 2 TGen Workload Speed Up. Y-axis: comparative increase. Workloads shown: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Bar labels: .88X, 1, 1.08X, 1.59X, 1.87X. Series:]
Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla* K40
Application: Bowtie2 version 2.2.3; Intel® AVX2 port.
Description: nvBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html
Availability:
- Code: Available here.
- Recipe: Not available. Check for future availability here.
Usage Model: ERR161544, SRR002273_1, HEK001(TGen), ERR000589_1, SRR033552_1, SRR034966_1, & ERR024139_1
Highlights: See more here.
Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port was faster than nvBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40* for 6 of 7 workloads. NVIDIA published data for the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.
Intel® Xeon® processor E5-2697 v3
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, 2015
Burrows-Wheeler Aligner (BWA-ALN)*
1 NODE
Human Genome
[Chart: BWA-ALN* Speed Up. Y-axis: comparative performance.]
2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN): 1
2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN): 1.24X
2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A: 1.86X
Application: Burrows-Wheeler Aligner*, version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 3.5 GB, reference database 3.0 GB).
Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net/.
Availability:
- Code: Available here.
- Recipe: Available here.
Usage Model: Hybrid MPI + OpenMP* using symmetric mode.
Highlights: Results are identical to the unmodified run of BWA-ALN.
Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor symmetric process demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, 2014
1 NODE
BLAST*
BLASTn v.30
[Chart: BLASTn* v.30 Speed Up. Y-axis: comparative performance. Bar labels: 1, 1.22X, 1.26X, 1.41X, 1.49X, 1.52X. Series:]
2S Xeon E5-2697 v2 (BLASTn v.30 baseline)
2S Xeon E5-2697 v2 + Xeon Phi 7120A
2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized
2S Xeon E5-2697 v3 (BLASTn v.30 baseline)
2S Xeon E5-2697 v3 + Xeon Phi 7120A2
2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized
“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3
“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A
For configuration details, go here.
NEW
Application: Basic Local Alignment Search Tool (BLASTn) v.30.
Description: Searching for alignments of nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI*). More at http://blast.ncbi.nlm.nih.gov/.
Availability:
- Code: Available here.
- Recipe: Available here.
Usage Model: #4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna.00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor at the split that gives maximum speedup (the sweet spot). The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 80/20 and 59/23/18.
Highlights: Throughput for this load sharing model has a small sweet spot for a sufficiently large query set.
Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.
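The usage model above shares a fixed set of queries among devices at a fixed split (80/20, or 59/23/18 for three shares), with the pick of queries randomized on each repetition. A minimal sketch of such a static split follows; the query names, the seed, and the assignment of shares to host and coprocessor are hypothetical and do not come from the benchmark recipe.

```python
import random
from typing import List, Sequence


def split_queries(queries: Sequence[str], shares: Sequence[int]) -> List[List[str]]:
    """Partition queries into consecutive chunks whose sizes follow `shares`."""
    if sum(shares) != len(queries):
        raise ValueError("shares must add up to the number of queries")
    chunks, start = [], 0
    for size in shares:
        chunks.append(list(queries[start:start + size]))
        start += size
    return chunks


queries = [f"query_{i:03d}" for i in range(100)]  # placeholder query names
shuffled = list(queries)
random.Random(0).shuffle(shuffled)                # randomized pick, fixed seed

# Assumption: the larger share goes to the host, the smaller to the coprocessor.
host, coprocessor = split_queries(shuffled, [80, 20])
print(len(host), len(coprocessor))
```

A static split like this only pays off when the ratio matches the relative throughput of the devices, which is one way to read the slide's remark that the sweet spot is small.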
SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2015
1 NODE
BLAST*
NEW
BLASTp v.30
[Chart: BLASTp* v.30 Speed Up. Y-axis: comparative performance. Bar labels: 1, 1.15X, 1.21X, 1.3X, 1.39X, 1.41X. Series:]
2S Xeon E5-2697 v2 (BLASTn v.30 baseline)
2S Xeon E5-2697 v2 + Xeon Phi 7120A
2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized
2S Xeon E5-2697 v3 (BLASTn v.30 baseline)
2S Xeon E5-2697 v3 + Xeon Phi 7120A2
2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized
“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3
“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A
For configuration details, go here.
Application: Basic Local Alignment Search Tool (BLASTp) v.30
Description: Searching for alignment in protein query sequence against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov/.
Availability:
- Code: Available here.
- Recipe: Available here.
Usage Model: #4 (multiple queries multiple db) 40 NCBI queries (concatenated) against db nr_sorted.00-02 are distributed to Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for maximum speedup sweet spot. Experiment was repeated 20 times with the pick of queries randomized for a sweet spot split 33/7 and 28/5/7.
Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited due to GAT stage not parallelized.
Results: Compared to the baseline, simulation rate speed up with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.
SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2015
Legal Information
All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number/.
Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance, or other benefits of virtualization technology will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html.
No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html.
Requires a system with Intel® Turbo Boost Technology. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost/.
Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advancedencryption-standard-instructions-aes-ni/ (English).
Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries.
© 2012, Intel Corporation. Unauthorized quotation or reproduction is prohibited.
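Legal Disclaimer: Performance
Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. When considering the purchase of systems or components, consult other sources of information to evaluate performance comprehensively. For more information on the performance of Intel products, see http://www.intel.com/performance.
Intel does not control or audit the design or implementation of third-party benchmarks or websites referenced in this document. Intel encourages readers to visit the referenced websites, or other websites where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.
The relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number that is proportional to the reported performance improvement.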
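The relative-performance calculation described above (baseline fixed at 1.0, every other result divided by the baseline result) can be sketched as follows. The platform names and throughput values are hypothetical placeholders, not figures from these slides, and the sketch assumes a higher-is-better metric such as simulated nanoseconds per day.

```python
from typing import Dict


def relative_performance(results: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Divide every result by the baseline result so the baseline becomes 1.0."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


# Hypothetical throughput measurements (higher is better); not from the slides.
measured = {"baseline": 2.0, "optimized": 3.0, "optimized + coprocessor": 5.0}
rel = relative_performance(measured, "baseline")
for name, value in rel.items():
    print(f"{name}: {value:.2f}X")
```

For a lower-is-better metric such as wall-clock time, the ratio would need to be inverted (baseline time divided by platform time) to give the same kind of speedup figure.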
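SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English).
TPC* benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org/ (English).
SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark/ (English).
The information in this document is provided as is. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. When considering a purchase, consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance comprehensively.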