Hot Chips & SC14 Topics; Current Status and Future of the CAE Prototype Board
Hiroshima City University, Graduate School of Information Sciences
Toshiaki Kitamura
2014/12/10

From the presentations at Hot Chips 26

What is Hot Chips?
✤ A conference centered on semiconductors such as microprocessors, held every summer since 1989
✤ Most presentations come from companies; in recent years new products are often announced there
✤ Sessions range from mobile to PC, server, and supercomputer processors
✤ There are also FPGA sessions

HOT CHIPS 26 ADVANCE PROGRAM
August 10-12, 2014 (Sunday, August 10; Monday, August 11; Tuesday, August 12)
A Symposium on High-Performance Chips
Flint Center for the Performing Arts, Cupertino, CA
http://www.hotchips.org
Register now at: https://www.123signup.com/register?id=drvzv
HOTCHIPS brings together designers and architects of high-performance chips, software, and systems. The tutorial and presentation sessions focus on up-to-the-minute developments in leading-edge industrial designs and research projects.

Tutorial 1: Emerging Trends in Hardware Support for Security
• Security Basics (Princeton)
• Mobile HW Security (ARM)
• Secure Systems Design (AMD)
• Mitigating Exploits, Rootkits and Advanced Persistent Threats (Intel)
• University Research in Hardware Security (Princeton)

Tutorial 2: Internet of Things
• Powering the Internet of Things (TI)
• Ultra Low Power Design Approaches for IoT (National University of Singapore)
• Connecting the IoT (Qualcomm)
• Standards for Constrained IoT Devices (ARM)

High-Performance Computing
• SX-ACE Processor: NEC's Brand-New Vector Processor (NEC)
• SPARC64 XIfx: Fujitsu’s Next Generation Processor for HPC (Fujitsu)
• Anton 2: A 2nd-Generation ASIC for Molecular Dynamics Simulation (D.E. Shaw Research)

Keynote 1
Power Constraints: From Sensors to Servers
Michael Muller, ARM

Mobile Processors
• NVIDIA’s Tegra K1 System-on-Chip (NVIDIA)
• Applying AMD’s “Kaveri” APU for Heterogeneous Computing (AMD)
• NVIDIA’s Denver Processor (NVIDIA)

Technology
• HBM: Memory Solution for Bandwidth-Hungry Processors (SK Hynix Inc)
• Improved 3D Chip Stacking with ThruChip Wireless Connections (ThruChip Communications)
• CMOS Biochips for Point-of-Care Molecular Diagnostics (InSilixa)

ARM Servers
• The AMD Opteron “Seattle”: A 64b ARM Dense Server Processor (AMD)
• ARM Next-Generation IP Supporting LSI’s High-End Networking (ARM, LSI Logic)
• X-Gene2: 28nm Scale-Out Processor (Applied Micro)

FPGAs
• Design of a High-Density SOC-FPGA at 20nm (Altera)
• Large-Scale Reconfigurable Computing in a Microsoft Datacenter (Microsoft)
• Xilinx FPGAs Case Study: High Capacity and Performance 20nm FPGAs (Xilinx)
• SDA: Software-Defined Accelerator for Large-Scale DNN Systems (Baidu)

High-Performance ASICs
• Hardware-Accelerated Text Analytics (IBM)
• Myriad2 “Eye” of the Computational-Vision Storm (Movidius)
• Goldstrike 1: A 1st Generation Cryptocurrency Processor for Bitcoin Mining (Cointerra)
• RayChip: Real-Time Ray Tracing Chip for Embedded Applications (Siliconarts)

Keynote 2
The Internet of Everything: What is it? What’s driving it? What comes next?
Rob Chandhok, Qualcomm

Dense Servers and Server Technology
• SCORPIO: 36-Core Shared-Memory Processor with a Coherent Mesh (MIT)
• Oracle’s Next-Generation SPARC Processor Cache Hierarchy (Oracle)
• Unchaining the Datacenter with OpenPOWER: Reengineering a Server Ecosystem (IBM)
• Intel C2000 Atom Microserver: Power Efficient Processing for the Data Center (Intel)

Big-Iron Servers
• Performance Characteristics of the POWER8 Processor (IBM)
• Next-Generation Oracle SPARC Processor (Oracle)
• IvyBridge Server: Delivering Performance from Workstations to Mission Critical (Intel)

Organizing Committee
Chair: Krste Asanovic (UC Berkeley)
Vice Chair: Fred Weber
Finance: Lily Jow (HP)
Advertising: Don Draper (Oracle)
Sponsorship: Amr Zaky (Invensense)
Publications: Randall Neff
Press: Ralph Wittig (Xilinx)
Registration: Charlie Neuhauser (Neuhauser Associates)
Location Services: John Sell (Microsoft)
Allen Baum
Volunteer Coordinator: Gary Brown (Tensilica)
Webmaster, IT: Kevin Broch
Production: Lance Hammond
Mike Albaugh, Keith Diefendorff

Steering Committee
Chair: Alan Jay Smith
Committee Members: Allen Baum, Don Draper (Oracle), Pradeep Dubey (Intel), Lily Jow (HP), John Mashey (Techviser), John Sell (Microsoft), Keith Diefendorff

Program Committee
Program Co-Chairs: Sam Naffziger (AMD), Guri Sohi (U. Wisconsin)
Committee Members: Forest Baskett (NEA), Pradeep Dubey (Intel), John Davis (Microsoft), Alan Jay Smith (UC Berkeley), Steve Miller (NetApp), Subhasish Mitra (Stanford), Stefan Rusu (Intel), Tom McWilliams (BayStorage), Behnam Robatmili (Qualcomm), Ralph Wittig (Xilinx), Mike Taylor (UCSD), Bill Dally (NVIDIA)
Founder: Bob Stewart (SRE)

Technical Writers: Warthman Associates, www.warthman.com
A Symposium of the Technical Committee on Microprocessors and Microcomputers of the IEEE Computer Society and the Solid-State Circuits Society

AMD's ARM Server
✤ Designed by AMD, not by ARM
✤ Targets server use with the ARM architecture rather than x86

THE AMD OPTERON™ A1100 PROCESSOR CODENAMED "SEATTLE"
SEAN WHITE
11 AUGUST 2014

“SEATTLE” – WHAT IS IT AND WHY?
What is it?
‒ “Seattle” is AMD’s first 64-bit ARM-based processor
‒ 8 ARM Cortex™-A57 cores
‒ 2 DDR3/4 DRAM channels
‒ 10G Ethernet, PCI-Express, SATA
‒ GlobalFoundries 28nm process
Why did AMD build it?
‒ “Seattle” is a dense server processor for datacenter applications
‒ Performance/dollar/watt drives today’s datacenter designs
‒ A significant number of datacenter workloads have inherently low Instructions Per Clock (IPC) and high cache miss rates
‒ For such workloads, processors like “Seattle,” with smaller cores and caches, can deliver performance equivalent to traditional server processors with large cores and caches, while using much less power and area
‒ The 32-bit to 64-bit transition for the ARM architecture is a major shift in the industry, as the 32-bit to 64-bit transition in x86 was
‒ AMD is taking a leadership role in the 64-bit ARM space, as it did in the 64-bit x86 space
AMD “SEATTLE” | HOT CHIPS 26 | 11 AUGUST 2014

“SEATTLE” SOC OVERVIEW
28nm Process Technology
Power Efficient Cores
• Up to eight ARM Cortex-A57 cores
• Up to 4MB shared L2 cache total
Cache Coherent Network
• Full cache coherency
• 8MB L3 cache
• SMMU: I/O address mapping and protection
High Performance, Flexible Memory
• Two 64-bit DDR3/4 channels with ECC
• Two DIMMs/channel, up to 1866 MHz
• SODIMM, UDIMM, RDIMM support
• Up to 128GB per CPU
Highly Integrated I/O
• 8x SATA 3 (6Gb/s) ports
• Two 10GBASE-KR Ethernet ports
• 8 lanes PCI-Express® Gen 3, supports x8, x4, x2
System Control Processor
• TrustZone® technology for enhanced security
• Dedicated 1GbE system management port (RGMII)
• SPI, UART, I2C interfaces
Cryptographic Coprocessor
• Separate cryptographic algorithm engine for offloading encryption, decryption, compression, and decompression computations
Package
• 27mm x 27mm, SP1 BGA
[Block diagram: eight 64-bit Cortex-A57 cores in pairs, each pair with a 1MB L2 cache; 8MB L3 cache; Cortex-A5 System Control Processor; Cryptographic Coprocessor; two DDR3/4 memory controllers; I2C, UART, SPI, 1Gbit Ethernet (RGMII), 10Gbit Ethernet (KR), SATA 3, PCIe Gen 3]

“SEATTLE” REFERENCE SYSTEM
Standalone uATX board
• 1P standalone platform intended to meet the needs of partners (ISV, OSV, IHV)
• Off-the-shelf 2U rack mount chassis
• DDR3 DIMMs only
• x8 PCIe Gen3 lanes supporting (1) x8 slot or alternatively (2) x4 slots
• NIC supported through add-in card option
• Supports up to 8 hard drives
• Provisions for remote access to start, stop, and remote console will be provided

“SEATTLE” REFERENCE SYSTEM BOARD
• uATX form factor
• 1 “Seattle” SP1 BGA processor
• DDR3 2-DIMM per memory channel config (up to 4 DIMMs per CPU)
• 1 x8 PCIe slot
• 2 x4 PCIe slots as an alternative via mux
• 8 SATA3 ports
• 2 10GBase-T connectors
• 4 I2C ports
• 2 UARTs
• Supports required debug features

FPGAs with ARM Cores
✤ Products built on a 20nm process
✤ An entire SoC, including ARM cores, on a single chip

Design of a High-Density SoC FPGA at 20nm
Brad Vest, Sean Atsatt, Mike Hutton
Altera, San Jose

High Capacity and High Performance 20nm FPGAs
Steve Young, Dinesh Gaitonde
August 2014
© Copyright 2014 Xilinx

Device Goals
Mid-Range FPGA: balance of performance/power/cost targeting Key Market Applications
Key Targets and Metrics:
− 491 MHz fixed-point DSP datapath for Wireless RRU
− 1M+ LEs at 350 MHz for 4xOTU4 (400G) OTN networks, with Partial Reconfig
− Cloud Server Acceleration: Hardened Floating-Point
− 28G transceivers to support 200G to 400G networking/routing
− Dramatic die-size reduction

Overview and Floorplan
TSMC 20SOC Process
− 5.3B Tx, 11LM
Resources
− 1.15M LEs, 1.7M FFs
− 64Mb embedded SRAM
− 32 fPLL, 16 PLLs, 32 GCLK
− 1.5 TFlops IEEE754 DSP
− Dual-Core ARM A9
− Row-based redundancy
I/O
− 28G SERDES, >1.7Tb b/w
− x72 2.667Gbps DDR4 w/ Hard Memory Controller
− Hardened PCIe/ILKN/10GE

Hardened Floating Point DSP
Hardened IEEE 754 Floating Point Adder & Multiplier
− 12% DSP area increase (<<1% die area)
100% Fixed Point backwards compatible
− No performance or power penalty
‘Have your cake and eat it too’
How is this possible?
− Overlaid FP algorithms on Fixed point circuits
Major Innovation – Hard Floating Point on a Commercial FPGA
[Figure: DSP block with 32-bit inputs feeding a multiplier (X) and an adder (+)]

DSP Block – 1000s of blocks at very low latency
1.5 TFLOPS of aggregate computation; 50 GFLOPS/W
− 1678 blocks @ 2 FLOPS/clock @ 450 MHz ≈ 1.5 TFLOPS
− Can run individually or as large integrated DSP system
Hardware recursive structure support (Vector Mode)
− 10s/100s of DSP blocks can be seamlessly integrated
− Internal/External pipelining of individual DSP elements
Very small latency
− Floating Point used for iterative algorithms, which require small latency
− Arria 10 Floating Point: 256 length dot products ~25 clocks
− Standard FPGA Technology: 256 length systolic FIR filter ~750 clocks
[Figure: chained DSP blocks summing pairwise products (AB+CD, EF+GH, IJ+KL, MN+OP) into AB+CD+EF+GH+IJ+KL+MN+OP]

UltraScale Results
Vivado® routes more complex designs on UltraScale
UltraScale shows lower congestion on complex designs
As a result, timing closure is accelerated
Delivers 1 speedgrade higher Fmax
[Figure: two routing-complexity plots; legend: No routing congestion / High routing congestion / Cannot route]
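As a rough check of the Arria 10 floating-point figures quoted in the Altera DSP slides above, the arithmetic can be sketched as follows. The block count, FLOPS per clock, clock rate, GFLOPS/W, and latency numbers are taken from the slides; the function names are illustrative only, and the implied power figure is an assumption that the 50 GFLOPS/W rating holds at peak throughput, which the slides do not state.

```python
# Back-of-the-envelope check of the Arria 10 DSP figures quoted on the slides.
# Function names are hypothetical; only the input numbers come from the slides.

def peak_gflops(dsp_blocks: int, flops_per_clock: int, clock_mhz: float) -> float:
    """Peak throughput in GFLOPS = blocks * FLOPS/clock * clock (MHz) / 1000."""
    return dsp_blocks * flops_per_clock * clock_mhz / 1000.0


def implied_power_watts(gflops: float, gflops_per_watt: float) -> float:
    """Power implied by an efficiency rating (assumes it applies at peak)."""
    return gflops / gflops_per_watt


# Slide values: 1678 blocks, 2 FLOPS/clock, 450 MHz, 50 GFLOPS/W.
arria10_gflops = peak_gflops(1678, 2, 450.0)
arria10_watts = implied_power_watts(arria10_gflops, 50.0)

# Slide values: ~25 clocks (hard FP dot product) vs ~750 clocks (systolic FIR).
latency_ratio = 750 / 25

print(f"peak throughput: {arria10_gflops:.1f} GFLOPS")
print(f"implied power at 50 GFLOPS/W: {arria10_watts:.1f} W")
print(f"latency advantage: {latency_ratio:.0f}x")
```

With the slide's inputs the product comes to about 1510 GFLOPS, in line with the 1.5 TFLOPS headline figure, and the two latency numbers differ by a factor of 30.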
Power Optimizations
• Architectural optimizations
• Low power mode
• I/O multi-mode control (cont’d from 28nm)
• DDR4 voltage reduction
• CLB packing & reduced wire length
• HW based clock gating on leaf cells
• BRAM hardened data cascading
• BRAM dynamic power gating
• DSP hardened features
• MMCM & PLL lower supply voltage
• Process node
• Power binning & lower voltage scaling
• 3D IC static power binned slices
[Chart: power by component (Transceiver, I/O, Dynamic, Static) for Spartan-6/Virtex-6 (45nm/40nm), 7 Series (28nm), and UltraScale (20nm/16nm), annotated with per-component reductions between 25-45% and up to 65%]

From One Component of a System to the Whole System
✤ Following the SoC trend, the approach is shifting from using FPGA-implemented functions as one component of a system toward building the entire SoC on the FPGA
✤ What makes this possible is the increase in semiconductor integration density
✤ Faster circuits are demanded, together with lower power consumption

Hot Chips 26 Materials
✤ Materials from past years are available at http://www.hotchips.org.
✤ For the last several years, videos of the presentations are also available.
✤ For Hot Chips 26, only the keynotes are publicly available so far.
✤ Everything is scheduled to be made public in December.

Topics from SuperComputing 2014

SuperComputing
✤ Held every November
✤ This year it was held November 16-22 at the convention center in New Orleans
✤ In addition to the paper sessions, there are an exhibition and BoF sessions.