Latest Chip and Application Development Status in the Era of In Silico Drug Discovery
Shigeki Gunji (郡司 茂樹), Solution Architect
BioGrid Research Meeting 2015 (バイオグリッド研究会2015)

A Product Portfolio That Supports Advances in the Life Sciences
[Diagram: "The Broadest Technical Computing Ecosystem", grouped under STORAGE, COMPUTE, FABRIC, NETWORKING, and VISUALIZATION on a common IA programming model and code base. Products named: Intel® Lustre*, Intel® SSD/NVMe, RAID Controller, Intel® Xeon®, Intel® Xeon Phi™, Intel® True Scale, Intel® Omni-Path, Intel® 10/40GbE, Intel® Switch Si, Intel® iGFX, Iris Pro™ Graphics, Embree Ray-Tracing, Intel® Software Developer Tools, Intel® Cluster Ready, Boards/Systems, Intel® Data Center Manager.]

Intel® Xeon® Processor E5 Family
The foundation of Intel's HPC performance.
Ideal for nearly the full range of workloads.
Industry-leading performance and performance per watt.
A general-purpose processor for serial and parallel workloads, with a standard range of core counts and a focus on fast serial performance as well.
www.intel.com/xeon

Intel® Xeon® Processor E5 Family
Deep learning is a piece of cake, too.

Intel® Xeon Phi™ Coprocessor 7120P
www.intel.com/xeonphi
61 cores, 244 threads
1.238 GHz
1.21 TFLOPS (peak double-precision floating-point performance)
512-bit SIMD instructions
32 vector registers per thread
16 GB GDDR5 memory, 352 GB/s
300 W (passive cooling)
PCIe x16 (requires an IA host processor)
22nm with the world’s first 3-D Tri-Gate transistors
Linux* operating system, IP addressable
Common x86/IA programming models and software tools

Is Xeon Phi† performance compelling vs. Xeon† E5v2?
“2-socket Xeon E5v2 system” vs. “2-socket Xeon E5v2 system + Xeon Phi 7120”
Xeon Phi delivers up to 165% higher performance (with 1 card) versus a 2-socket Xeon E5v2 system.
http://www.intel.com/performance
† Xeon = Intel® Xeon® processor
† Xeon Phi = Intel® Xeon Phi™ coprocessor

Intel® Xeon Phi™ Product Family
Knights Corner: 1 TFLOPS [1]. Intel® Xeon Phi™ Coprocessor – Applications and Solutions Catalog.
Knights Landing (2nd generation): 3+ TFLOPS [2]; processor (bootable); high-bandwidth memory on package; integrated interconnect. First commercial systems 2H’15. >50 systems providers expected [3], plus many more card-based systems. >100 PFLOPS customer system compute commits to date [3].
Knights Hill (3rd generation): 10nm process technology; Intel Omni-Path Architecture.
1 Claim based on calculated theoretical peak double precision performance capability for a single coprocessor. 16 DP FLOPS/clock/core * 61 cores * 1.238GHz = 1.208 TeraFLOPS
2 Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating point operations per cycle. FLOPS = cores x clock frequency x floating-point operations per second per cycle.
3 Intel internal estimate.

Intel® Omni-Path Architecture
www.intel.com/omnipath
Coming 2H’15. Targets small clusters, mainstream clusters, and supercomputers.
High system scalability: 48-port switch chip architecture vs. 36 in InfiniBand; 100 Gbps line speed.
High application performance scalability: 56% lower latency [4] (lower is better).
High port density: 1.3x [1]. 48 ports supports up to 12 add’l nodes by only adding cables [1]; maximize single-switch investment.
Fewer switches: up to ½ [2].
Scalable: 2.3x [3]. Over 27k nodes in a 2-tier 5-hop fabric [3].
1 As compared to a shipping 36-port edge InfiniBand switch.
2 Reduction in up to ½ fewer switches claim based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for Intel® Omni-Path cluster and 36-port switch ASIC for either Mellanox or Intel® True Scale clusters.
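The two pieces of arithmetic behind these slides can be checked with a short sketch. It assumes the standard "cores x clock x FLOPs per cycle" peak formula and the textbook capacity of a full-bisection fat tree, in which every level but the top splits its ports half down and half up. The function names are hypothetical, and the fat-tree formula is an assumption about how the node counts were derived, chosen because it reproduces the 27,648 and 11,664 figures quoted for the Omni-Path comparison.

```python
def peak_dp_gflops(cores: int, clock_ghz: float, dp_flops_per_cycle: int) -> float:
    """Theoretical peak double-precision GFLOPS: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * clock_ghz * dp_flops_per_cycle


def fat_tree_max_nodes(ports: int, levels: int) -> int:
    """Maximum end nodes in a full-bisection fat tree built from `ports`-port switch chips.

    Assumed model (not taken from the slides): each level below the top uses
    half its ports downward and half upward; the top level uses all ports
    downward. That gives 2 * (ports / 2) ** levels end nodes.
    """
    if ports < 2 or ports % 2 != 0 or levels < 1:
        raise ValueError("ports must be a positive even number and levels >= 1")
    return 2 * (ports // 2) ** levels


if __name__ == "__main__":
    # Xeon Phi 7120P: 61 cores, 1.238 GHz, 16 DP FLOPs per clock per core.
    print(f"7120P peak: {peak_dp_gflops(61, 1.238, 16) / 1000:.3f} TFLOPS")

    # Three levels of switch chips means at most five switch hops end to end.
    for ports in (48, 36):
        print(f"{ports}-port chips, 3 levels: {fat_tree_max_nodes(ports, 3)} nodes")
```

With three levels the model gives 27,648 nodes for 48-port chips and 11,664 for 36-port chips, a ratio of about 2.37, which is in line with the "2.3x" claim.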
3 2.3X based on 27,648 nodes in a cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes.
4 Latency reductions based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches compared to preliminary Intel simulations for Intel® Omni-Path switches, based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration (2-tier, 5 total switch hops), using a 48-port switch for Intel® Omni-Path cluster and 36-port switch ASIC for either Mellanox or Intel® True Scale clusters. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Intel® EE for Lustre: Connectivity with Hadoop
Open source
Intel proprietary extensions
Breakdown of connectors to Hadoop distributions
Lustre that can connect to Hadoop

ANL Selects Intel for World’s Biggest Supercomputer
2-system CORAL award (>$200M) extends IA leadership in extreme scale HPC
Aurora, Argonne National Laboratory: >180PF (April ’15)
Theta, Argonne National Laboratory: >8.5PF
Trinity, NNSA†: >40PF (July ’14)
Cori, NERSC‡: >30PF (April ’14)
‡ Cray* XC* Series at National Energy Research Scientific Computing Center (NERSC).
† Cray XC Series at National Nuclear Security Administration (NNSA).

The Most Advanced Supercomputer Ever Built
An Intel-led collaboration with ANL and Cray to accelerate discovery & innovation
>180 PFLOPS (option to increase up to 450 PF)
>50,000 nodes
13MW
2018 delivery
18X higher performance†
>6X more energy efficient†
[Logos: Prime Contractor, Subcontractor]
Source: Argonne National Laboratory and Intel.
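The 18X and >6X figures on the Aurora slide can be reproduced from the quoted numbers. The sketch below is only a sanity check: it takes Aurora at its stated lower bound (180 PF, 13 MW) and Mira at the 10 PF and 4.8 MW given in the slide's footnote. The function name is hypothetical.

```python
def pflops_per_megawatt(peak_pflops: float, power_mw: float) -> float:
    """Energy efficiency expressed as peak PFLOPS per megawatt."""
    return peak_pflops / power_mw


# Figures quoted on the slide (Aurora values are lower bounds: ">180 PF").
AURORA_PF, AURORA_MW = 180.0, 13.0
MIRA_PF, MIRA_MW = 10.0, 4.8

performance_ratio = AURORA_PF / MIRA_PF
efficiency_ratio = (
    pflops_per_megawatt(AURORA_PF, AURORA_MW) / pflops_per_megawatt(MIRA_PF, MIRA_MW)
)

if __name__ == "__main__":
    print(f"Peak performance ratio: {performance_ratio:.1f}X")
    print(f"Energy efficiency ratio: {efficiency_ratio:.2f}X")
```

The performance ratio comes out at exactly 18, and the efficiency ratio at roughly 6.6, consistent with the ">6X" wording.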
† Comparison of theoretical peak double precision FLOPS and power consumption to ANL’s largest current system, MIRA (10PFs and 4.8MW).

“Intel’s leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being.”
Wang Bingqiang, Head of High Performance Computing, BGI

Application Support Status: Life Sciences

AMBER* 14 (1 node)
Particle Mesh Ewald (PME), Tobacco Virus
APPROVED FOR PUBLIC PRESENTATION
[Chart: AMBER* 14 PME: Tobacco Virus, 1 Million Atoms; y-axis: Comparative Performance. Chart values as extracted (bar labels and axis ticks): 2.41X, 2.26X, 2X, 2, 1.93X, 1.52X, 1, 1, 0. Series: Intel® Xeon® processor E5-2697 v2 (baseline); Intel® Xeon® processor E5-2697 v2 (optimized); Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 (optimized) + NVIDIA* K40 DPFP; Intel® Xeon® processor E5-2697 v3; Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A.]
Application: AMBER* 14
Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org/
Availability: Code: Available as a patch. Recipe: Available here (Section 18.7 of the manual).
Usage Model: The baseline is the Intel® Xeon® processor E5-2697 v2, compared with the Intel® Xeon® processor E5-2697 v2 plus the Intel® Xeon Phi™ coprocessor 7120A. Offload processing runs on both, using the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.
Highlights: The code was optimized, delivered to the AMBER community (license holders), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV/
Results: Optimized Intel Xeon processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel Xeon processor E5-2697 v2.
Optimized offload process demonstrated 1.07X increased performance compared to NVIDIA K40* performance.
“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3
For configuration details, go here.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance
* Other names and brands may be claimed as the property of others.
SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014

AMBER* 14 (cluster benchmark, 3 nodes)
Particle Mesh Ewald (PME), Cellulose NPT
[Chart: AMBER* PME Cellulose NPT (408K Atoms); y-axis: Comparative Performance; groups: 2 nodes, 3 nodes. Chart values as extracted (bar labels and axis ticks): 1.57X, 1.37X, 1.32X, 1.14X, 1.11X, 1, 1, 0. Series: Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA* K40 DPFP.]
Application: AMBER* 14
Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org/
Availability: Code: Available as a patch. Recipe: Available here (Section 18.7 of the manual).
Usage Model: The baseline runs on the Intel® Xeon® processor E5-2697 v2 host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks). The speedup is shown with offload processing on both the Intel Xeon processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released code, double precision across the platforms, with 50% of the workload on the host and 50% on the coprocessor.
Highlights: The code has been optimized, will be delivered to the AMBER community (license holders), and will be available as an update patch during code configuration.
Results: Optimized offload process demonstrated compelling cluster performance improvement, up to 2.6X, over the baseline Intel® Xeon® processor E5-2697 v2.
“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014

GROMACS* (1 node)
512K H2O with RF
[Chart: GROMACS* 512K H2O with RF Speed Up; y-axis: Comparative Performance. Chart values as extracted (bar labels and axis ticks): 1.79X, 1.72X, 1.56X, 1, 1, 1.03X, 0. Series: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120P/X; 2 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120P/X.]
For configuration details, go here.
Application: GROMACS* 5.0-RC1; Workload: 512K H2O with RF method
Description: GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and the most popular molecular dynamics packages.
Availability: Code: Version 5.0-rc1 available here and here. Recipe: Available here.
Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor together with the host processor using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel Xeon Phi coprocessors.
Results: Symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.
SOURCE: INTEL MEASURED RESULTS AS OF APRIL, 2014

NWChem* (cluster benchmark, 32 nodes)
CCSD(T) Method
[Chart: NWChem* 6.3rev2 and 6.5 CCSD(T) Method 32 Node Speed Up; y-axis: Comparative Performance. Chart values as extracted (bar labels and axis ticks): 1.52X, 1.24X, 1, 1, 0. Series: NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A.]
Application: NWChem* is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org
Availability: Code: Available here and from the SVN repository.
Recipe: Available here.
Usage Model: Offload using LEO and OpenMP*
Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, including at cluster scale.
Results: Compared to the NWChem* 6.3rev2 and Intel® Xeon® processor E5-2697 v2 baseline:
1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2.
2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi coprocessor 7120A.
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2014

NAMD* 2.10 Pre-Release (cluster benchmark, 32 nodes)
STMV
[Chart: NAMD* 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms); y-axis: Comparative Performance; groups: 1 Node, 8 Nodes, 32 Nodes. Chart values as extracted (bar labels and axis ticks): 30, 27.2X, 24.2X, 25, 20X, 20, 15, 12.2X, 13.1X, 10, 6.8X, 5, 0, 32X, 1, 1.2X, 2X, 7.9X, 2.1X. Series: Intel® Xeon® processor E5-2697 v2 (Baseline: 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T).]
“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3
Application: NAMD 2.10 pre-release; STMV
Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd/
Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.
Usage Model: Single rank on host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.
Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.
Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014

NAMD* 2.10 Pre-Release (cluster benchmark, 2 nodes)
ApoA1
[Chart: NAMD* 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN; y-axis: Comparative Performance; groups: 1 Node, 2 Nodes. Chart values as extracted (bar labels and axis ticks): 2.61X, 1.94X, 2, 1.52X, 1, 1, 0. Series: Intel® Xeon® processor E5-2697 v3 (Baseline: 1 node, 55 PPN); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B17110A (240T).]
Application: NAMD* 2.10 pre-release; ApoA1
Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd/
Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.
Usage Model: Single rank on host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.
Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.
Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014

LAMMPS* (cluster benchmark, 32 nodes)
Stillinger-Weber Water Benchmark (NEW)
[Chart: LAMMPS* Liquid Crystal Benchmark Performance (Mixed Precision); y-axis: Comparative Performance; groups: 1 Node, 32 Nodes. Chart values as extracted (bar labels and axis ticks): 3.6X, 3.41X, 3.05X, 3X, 3, 2, 1, 1, 0.9X, 1, 0; one bar is marked "No testing on Tesla". Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c*, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package).]
“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3
“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A
Application: LAMMPS*
Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov/
Availability: Code: In main LAMMPS repository. Recipe: Available here.
Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.
Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows data transfer between host and coprocessor to run concurrently with calculation of the neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.
Results: Simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup and higher node counts.
For configuration details, go here.
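The dynamic load balancing described for the LAMMPS offload model can be illustrated with a small sketch. This is not the LAMMPS Intel Package algorithm; it is a minimal model that assumes work time scales linearly with the share of work assigned, and all names (`rebalance`, `simulate`, the rates) are hypothetical. Each step, the host and the coprocessor work concurrently, the two timings are measured, and the offload fraction is moved to the value that would have made both finish together.

```python
def rebalance(fraction: float, host_seconds: float, coproc_seconds: float) -> float:
    """Return the offload fraction that equalizes host and coprocessor time.

    `fraction` is the share of work sent to the coprocessor in the last step.
    Assumes a linear cost model: time is proportional to the share of work.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    host_rate = (1.0 - fraction) / host_seconds   # work units per second on the host
    coproc_rate = fraction / coproc_seconds       # work units per second on the coprocessor
    return coproc_rate / (host_rate + coproc_rate)


def simulate(host_rate: float, coproc_rate: float, steps: int = 4, fraction: float = 0.5):
    """Run a toy timestep loop and record (fraction, step_time) for each step."""
    history = []
    for _ in range(steps):
        host_seconds = (1.0 - fraction) / host_rate
        coproc_seconds = fraction / coproc_rate
        # Both sides run concurrently, so the slower one sets the step time.
        history.append((fraction, max(host_seconds, coproc_seconds)))
        fraction = rebalance(fraction, host_seconds, coproc_seconds)
    return history


if __name__ == "__main__":
    # Made-up rates: the coprocessor is 1.5x as fast as the host on this work.
    for step, (fraction, seconds) in enumerate(simulate(1.0, 1.5)):
        print(f"step {step}: offload fraction {fraction:.3f}, step time {seconds:.3f} s")
```

In this linear model the balancer settles after one measurement; a real workload would be noisy, so a practical version would likely damp the update rather than jump straight to the new fraction.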
SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2015

LAMMPS* (cluster benchmark, 32 nodes)
Rhodopsin Benchmark; 512K Atoms
[Chart: LAMMPS* Rhodopsin Benchmark Performance (Mixed Precision); y-axis: Comparative Performance; groups: 1 Node, 32 Nodes. Chart values as extracted (bar labels and axis ticks): 1.68X, 1.47X, 1.27X, 1, 1, 1, 1.07X, 0. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, turbo off (LAMMPS IA Package).]
Application: LAMMPS*
Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov/
Availability: Code: In main LAMMPS repository. Recipe: Available here.
Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.
Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing allows data transfer between host and coprocessor to run concurrently with calculation of the neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.
Results: Up to 1.68X performance improvement on a single node, compared to the baseline configuration, using Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization. Performance gains hold at 1.47X when scaling up to 32 nodes.
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF AUGUST, 2014

Johns Hopkins Bowtie 2* (1 node)
Multiple workloads (NEW)
[Chart: Johns Hopkins Bowtie 2 TGen Workload Speed Up; y-axis: Comparative Increase; workloads: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Chart values as extracted (bar labels and axis ticks): 1.87X, 1.59X, 1, 1.08X, 1, .88X, 0. Series: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla* K40; Intel® Xeon® processor E5-2697 v3.]
Application: Bowtie2 version 2.2.3; Intel® AVX2 port.
Description: nvBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html
Availability: Code: Available here. Recipe: Not available. Check for future availability here.
Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1, and ERR024139_1
Highlights: See more here.
Results: Bowtie2 with the Intel® AVX2 port running on the Intel® Xeon® processor E5-2697 v3 was faster than nvBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40* for 6 of 7 workloads. NVIDIA published data for the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, 2015

Burrows-Wheeler Aligner (BWA-ALN)* (1 node)
Human Genome
[Chart: BWA-ALN* Speed Up; y-axis: Comparative Performance. Chart values as extracted (bar labels and axis ticks): 1.86X, 1.24X, 1, 1, 0. Series: 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A.]
Application: Burrows-Wheeler Aligner*, version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 3.5 GB, reference database 3.0 GB).
Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net/.
Availability: Code: Available here. Recipe: Available here.
Usage Model: Hybrid MPI + OpenMP* using symmetric mode.
Highlights: Results are identical to the unmodified run of BWA-ALN.
Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor symmetric process demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, 2014

BLAST* (1 node)
BLASTn v.30 (NEW)
[Chart: BLASTn* v.30 Speed Up; y-axis: Comparative Performance. Chart values as extracted (bar labels and axis ticks): 1.49X, 1.41X, 1.26X, 1, 1.52X, 1.22X, 1, 0. Series as listed: 2S Xeon E5-2697 v2 (BLASTn v.30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v.30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized.]
“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3
“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A
For configuration details, go here.
Application: Basic Local Alignment Search Tool (BLASTn) v.30.
Description: Searching for alignment in nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI*). More at http://blast.ncbi.nlm.nih.gov/.
Availability: Code: Available here. Recipe: Available here.
Usage Model: #4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna.00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for a sweet spot split of 80/20 and 59/23/18.
Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.
Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.
SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2015

BLAST* (1 node)
BLASTp v.30 (NEW)
[Chart values as extracted (bar labels and axis ticks): 1.41X, 1.39X, 1.3X, 1.21X, 1.15X, 1, 1.]
Application: Basic Local Alignment Search Tool (BLASTp) v.30
Description: Searching for alignment in protein query sequence against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov/.
Availability: Code: Available here. Recipe: Available here.
Usage Model: #4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted.00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for a sweet spot split of 33/7 and 28/5/7.
Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.
Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.
[Chart: BLASTp* v.30 Speed Up; y-axis: Comparative Performance. Series as listed: 2S Xeon E5-2697 v2 (BLASTn v.30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v.30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized.]
“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3
“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A
For configuration details, go here.
SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2015

Legal Information
All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within a processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number/.
Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.
Intel® Virtualization Technology requires a computer system with an Intel® processor, BIOS, and virtual machine monitor (VMM) that support the technology. Functionality, performance, or other benefits of virtualization technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html.
No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html.
Requires a system with Intel® Turbo Boost Technology. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost/.
Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/ (English).
Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries.
© 2012, Intel Corporation.
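Unauthorized quotation or reproduction is prohibited.

Legal Disclaimer: Performance
Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. When considering the purchase of systems or components, consult other sources of information as well to evaluate performance comprehensively. For more information on the performance evaluation of Intel products, see http://www.intel.com/performance.
Intel does not control or audit the design or implementation of the third-party benchmarks or websites referenced in this document. Intel encourages readers to visit the referenced websites, or other websites where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.
Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number proportional to the reported performance improvement.
SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English).
TPC* Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org/ (English).
SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark/ (English).
The information in this document is provided as is. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. When considering a purchase, consult other information and performance tests as well, including the performance of the product when combined with other products, to evaluate performance comprehensively.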
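The relative-performance method described in the legal disclaimer (baseline set to 1.0, every other result divided by the baseline result) can be sketched in a few lines. The function name and the sample figures are hypothetical and chosen only for illustration; they are not measurements from this deck. The sketch assumes a "higher is better" metric such as simulated nanoseconds per day; for a time-to-solution metric the ratio would need to be inverted.

```python
def relative_performance(results: dict[str, float], baseline: str) -> dict[str, float]:
    """Normalize benchmark results so that the baseline platform scores 1.0.

    `results` maps a platform label to a raw score where higher is better.
    """
    if baseline not in results:
        raise KeyError(f"baseline platform {baseline!r} not found in results")
    base_score = results[baseline]
    if base_score <= 0:
        raise ValueError("baseline score must be positive")
    return {platform: score / base_score for platform, score in results.items()}


if __name__ == "__main__":
    # Illustrative raw scores only (e.g. ns/day); not data from the slides.
    raw_scores = {
        "host only (baseline)": 2.0,
        "host only (optimized)": 2.5,
        "host + coprocessor": 4.0,
    }
    for platform, ratio in relative_performance(raw_scores, "host only (baseline)").items():
        print(f"{platform}: {ratio:.2f}X")
```

Reading the charts this way makes clear that every "X" figure is relative to that slide's own baseline, so values from different slides are not directly comparable.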