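Hot Chips & SC14 Topics,
and the Current Status and Future of the CAE Prototype Board
Toshiaki Kitamura, Graduate School of Information Sciences, Hiroshima City University
2014/12/10
From the presentations at Hot Chips 26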
What is Hot Chips?
✤ A conference centered on semiconductors such as microprocessors, held every summer since 1989
✤ Most presentations come from industry, and new products are often announced there these days
✤ Sessions cover processors for everything from mobile devices to PCs, servers, and supercomputers
✤ There are also FPGA sessions
ADVANCE PROGRAM
August 10-12, 2014
A Symposium on High-Performance Chips
Flint Center for the Performing Arts, Cupertino, CA
http://www.hotchips.org
HOTCHIPS brings together designers and architects of high-performance chips, software, and systems. The tutorial and
presentation sessions focus on up-to-the-minute developments in leading-edge industrial designs and research projects.
Program days: Sunday August 10, Monday August 11, Tuesday August 12
Tutorial 1: Emerging Trends in Hardware Support for Security
• Security Basics
Princeton
• Mobile HW Security
ARM
• Secure Systems Design
AMD
• Mitigating Exploits, Rootkits and Advanced Persistent Threats
Intel
• University Research in Hardware Security
Princeton
Tutorial 2: Internet of Things
• Powering the Internet of Things
TI
• Ultra Low Power Design Approaches for IoT
National University of Singapore
• Connecting the IoT
Qualcomm
• Standards for Constrained IoT Devices
ARM
High-Performance Computing
• SX-ACE Processor: NEC's Brand-New Vector Processor
NEC
• SPARC64 XIfx: Fujitsu’s Next Generation Processor for HPC
Fujitsu
• Anton 2: A 2nd-Generation ASIC for Molecular Dynamics Simulation
D.E. Shaw Research
Keynote 1 Power Constraints: From Sensors to Servers
Michael Muller, ARM
Mobile Processors
• NVIDIA’s Tegra K1 System-on-Chip
NVIDIA
• Applying AMD’s “Kaveri” APU for Heterogeneous Computing
AMD
• NVIDIA’s Denver Processor
NVIDIA
Technology
• HBM: Memory Solution for Bandwidth-Hungry Processors
SK Hynix Inc
• Improved 3D Chip Stacking with ThruChip Wireless Connections
ThruChip Communications
• CMOS Biochips for Point-of-Care Molecular Diagnostics
InSilixa
ARM Servers
• The AMD Opteron “Seattle”: A 64b ARM Dense Server Processor
AMD
• ARM Next-Generation IP Supporting LSI’s High-End Networking
ARM, LSI Logic
• X-Gene2: 28nm Scale-Out Processor
Applied Micro
FPGAs
• Design of a High-Density SOC-FPGA at 20nm
Altera
• Large-Scale Reconfigurable Computing in a Microsoft Datacenter
Microsoft
• Xilinx FPGAs Case Study: High Capacity and Performance 20nm FPGAs
Xilinx
• SDA: Software-Defined Accelerator for Large-Scale DNN Systems
Baidu
High-Performance ASICs
• Hardware-Accelerated Text Analytics
IBM
• Myriad2 “Eye” of the Computational-Vision Storm
Movidius
• Goldstrike 1: A 1st Generation Cryptocurrency Processor for Bitcoin Mining
Cointerra
• RayChip: Real-Time Ray Tracing Chip for Embedded Applications
Siliconarts
Keynote 2 The Internet of Everything: What is it? What’s driving it? What comes next?
Rob Chandhok, Qualcomm
Dense Servers and Server Technology
• SCORPIO: 36-Core Shared-Memory Processor with a Coherent Mesh
MIT
• Oracle’s Next-Generation SPARC Processor Cache Hierarchy
Oracle
• Unchaining the Datacenter with OpenPOWER: Reengineering a Server Ecosystem
IBM
• Intel C2000 Atom Microserver: Power Efficient Processing for the Data Center
Intel
Big-Iron Servers
• Performance Characteristics of the POWER8 Processor
IBM
• Next-Generation Oracle SPARC Processor
Oracle
• IvyBridge Server: Delivering Performance from Workstations to Mission Critical
Intel
Organizing Committee
Chair: Krste Asanovic, UC Berkeley
Vice Chair: Fred Weber
Finance: Lily Jow, HP
Advertising: Don Draper, Oracle
Sponsorship: Amr Zaky, Invensense
Publications: Randall Neff
Press: Ralph Wittig, Xilinx
Registration: Charlie Neuhauser, Neuhauser Associates
Location Services: John Sell, Microsoft; Allen Baum
Volunteer Coordinator: Gary Brown, Tensilica
Webmaster, IT: Kevin Broch
Production: Lance Hammond; Mike Albaugh; Keith Diefendorff
Steering Committee
Chair: Alan Jay Smith
Committee Members: Allen Baum; Don Draper, Oracle; Pradeep Dubey, Intel; Lily Jow, HP; John Mashey, Techviser; John Sell, Microsoft; Keith Diefendorff
Program Committee
Program Co-Chairs: Sam Naffziger, AMD; Guri Sohi, U. Wisconsin
Committee Members: Forest Baskett, NEA; Pradeep Dubey, Intel; John Davis, Microsoft; Alan Jay Smith, UC Berkeley; Steve Miller, NetApp; Subhasish Mitra, Stanford; Stefan Rusu, Intel; Tom McWilliams, BayStorage; Behnam Robatmili, Qualcomm; Ralph Wittig, Xilinx; Mike Taylor, UCSD; Bill Dally, NVIDIA
Founder: Bob Stewart, SRE
A Symposium of the Technical Committee on Microprocessors and Microcomputers
of the IEEE Computer Society and the Solid-State Circuits Society
AMD's ARM server
✤ Designed by AMD, not by ARM
✤ Targets server use with the ARM architecture rather than x86
THE AMD OPTERON™
A1100 PROCESSOR
CODENAMED "SEATTLE"
SEAN WHITE
11 AUGUST 2014
“SEATTLE” – WHAT IS IT AND WHY?
What is it?
‒ “Seattle” is AMD’s first 64-bit ARM-based processor
‒ 8 ARM Cortex™-A57 cores
‒ 2 DDR3/4 DRAM channels
‒ 10G Ethernet, PCI-Express, SATA
‒ GlobalFoundries 28nm process
Why did AMD build it?
‒ “Seattle” is a dense server processor for datacenter applications
‒ Performance/dollar/watt drives today’s datacenter designs
‒ A significant number of datacenter workloads have inherently low Instructions Per Clock
(IPC) and high cache miss rates
‒ For such workloads, processors like “Seattle,” with smaller cores and caches, can deliver
the equivalent performance as traditional server processors with large cores and caches,
but using much less power and area
‒ The 32-bit to 64-bit transition for the ARM architecture is a major shift in the
industry, like the 32-bit to 64-bit transition in x86 was
‒ AMD is taking a leadership role in the 64-bit ARM space, as it did in the 64-bit x86 space
2 | AMD “SEATTLE” | HOT CHIPS 26 | 11 AUGUST 2014
“SEATTLE” SOC OVERVIEW
28nm Process Technology
Power Efficient Cores
• Up to Eight ARM Cortex-A57 cores
• Up to 4MB shared L2 cache total
Cache Coherent Network
• Full cache coherency
• 8MB L3 cache
• SMMU: I/O address mapping and protection
High Performance, Flexible Memory
• Two 64-bit DDR3/4 channels with ECC
• Two DIMMs/channel up to 1866MHz
• SODIMM, UDIMM, RDIMM support
• Up to 128GB per CPU
Highly Integrated I/O
• 8x SATA 3 (6Gb/s) ports
• Two 10GBASE-KR Ethernet ports
• 8 lanes PCI-Express® Gen 3, supports x8, x4, x2
System Control Processor
• TrustZone® technology for enhanced security
• Dedicated 1GbE system management port (RGMII)
• SPI, UART, I2C interfaces
Cryptographic Coprocessor
• Separate Cryptographic algorithm engine for
offloading encryption, decryption, compression,
decompression computations
[Block diagram: eight 64-bit Cortex-A57 cores; four 1MB L2 caches; 8MB L3 cache; two DDR3/4 memory controllers; Cortex-A5 System Control Processor; Cryptographic Coprocessor; PCIe Gen 3; SATA 3; 10Gbit Ethernet (KR); 1Gbit Ethernet (RGMII); I2C, UART, SPI]
Package
• 27mm x 27mm, SP1 BGA
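The memory and I/O figures on this slide imply a few derived numbers. The following is a back-of-the-envelope sketch in Python, not AMD data. It assumes that the quoted "1866MHz" means 1866 MT/s (as in DDR3-1866), that all four DIMM slots hold equal-size modules, and that PCIe Gen 3 runs at 8 GT/s per lane with 128b/130b encoding. ECC bits are left out of the bandwidth figure, and the function names are made up for illustration.

```python
# Back-of-the-envelope figures derived from the "Seattle" SoC overview slide.
# Assumptions (not stated on the slide): "1866MHz" is read as 1866 MT/s,
# DIMM slots are populated equally, PCIe Gen 3 = 8 GT/s with 128b/130b coding.

CHANNELS = 2                 # two 64-bit DDR3/4 channels
DIMMS_PER_CHANNEL = 2        # two DIMMs per channel
MAX_MEMORY_GB = 128          # "Up to 128GB per CPU"
DATA_BITS_PER_CHANNEL = 64   # ECC bits are not counted as payload
TRANSFERS_PER_SEC = 1866e6   # assumed transfer rate per pin


def per_dimm_capacity_gb() -> float:
    """Largest DIMM size implied by the 128GB maximum."""
    return MAX_MEMORY_GB / (CHANNELS * DIMMS_PER_CHANNEL)


def peak_dram_bandwidth_gbs() -> float:
    """Theoretical peak DRAM bandwidth in GB/s across both channels."""
    bytes_per_transfer = DATA_BITS_PER_CHANNEL / 8
    return CHANNELS * bytes_per_transfer * TRANSFERS_PER_SEC / 1e9


def pcie_gen3_bandwidth_gbs(lanes: int = 8) -> float:
    """Theoretical one-direction PCIe Gen 3 bandwidth in GB/s."""
    line_rate_bits = 8e9                  # 8 GT/s per lane
    payload_fraction = 128 / 130          # 128b/130b encoding
    return lanes * line_rate_bits * payload_fraction / 8 / 1e9


if __name__ == "__main__":
    print(f"DIMM size at max capacity: {per_dimm_capacity_gb():.0f} GB")
    print(f"Peak DRAM bandwidth:       {peak_dram_bandwidth_gbs():.1f} GB/s")
    print(f"PCIe Gen 3 x8 bandwidth:   {pcie_gen3_bandwidth_gbs():.2f} GB/s")
```

Under these assumptions the slide's numbers work out to 32GB per DIMM, roughly 30 GB/s of peak DRAM bandwidth, and a little under 8 GB/s per direction over the x8 PCIe link. These are theoretical peaks, not measured throughput.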
“SEATTLE” REFERENCE SYSTEM
Standalone uATX board
• 1P standalone platform intended to meet the
needs of partners (ISV, OSV, IHV)
• Off-the-shelf 2U rack mount chassis
• DDR3 DIMMS only
• x8 PCIe Gen3 lanes supporting (1) x8 slot or
alternatively (2) x4 slots
• NIC supported through add-in card option
• Supports up to 8 hard drives
• Provisions for remote access to start, stop, and
remote console will be provided
“SEATTLE” REFERENCE SYSTEM BOARD
• uATX form factor
• 1 “Seattle” SP1 BGA processor
• DDR3 2-DIMM per memory
channel config (up to 4 DIMMs per
CPU)
• 1 x8 PCIe slot
• 2 x4 PCIe slots as an alternative via mux
• 8 SATA3 ports
• 2 10GBase-T connectors
• 4 I2C ports
• 2 UARTs
• Supports required debug features
FPGAs with built-in ARM cores
✤ Products built on a 20nm process
✤ An entire SoC, including ARM cores, on a single chip
Design of a High-Density SoC FPGA at 20nm
Brad Vest, Sean Atsatt, Mike Hutton
Altera, San Jose
High Capacity and High Performance 20nm FPGAs
Steve Young, Dinesh Gaitonde
August 2014
© Copyright 2014 Xilinx
Device Goals
Mid-Range FPGA: balance of performance/power/cost
targeting Key Market Applications
Key Targets and Metrics:
− 491 MHz fixed-point DSP datapath for Wireless RRU
− 1M+LEs at 350 MHz for 4xOTU4 (400G) OTN networks, with Partial Reconfig
− Cloud Server Acceleration – Hardened Floating-Point
− 28G transceivers to support 200G to 400G networking/routing
− Dramatic die-size reduction
Overview and Floorplan
TSMC 20SOC Process
− 5.3B Tx, 11LM
Resources
− 1.15M LEs, 1.7M FFs
− 64Mb embedded SRAM
− 32 fPLL, 16 PLLs, 32 GCLK
− 1.5 TFlops IEEE754 DSP
− Dual-Core ARM A9
− Row-based redundancy
I/O
− 28G SERDES, >1.7Tb b/w
− x72 2.667Gbps DDR4 w/
Hard memory Controller
− Hardened PCIe/ILKN/10GE
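As a rough reading of the "x72 2.667Gbps DDR4" line, the sketch below estimates the bandwidth of one such interface. It is an illustration, not an Altera figure: it assumes 2.667 Gbps is the per-pin data rate and that 64 of the 72 bits carry data while 8 carry ECC, neither of which the slide states.

```python
# Rough bandwidth of the x72 DDR4 interface listed on the slide.
# Assumptions (not on the slide): 2.667 Gbps is the per-pin data rate,
# and the 72 bits split into 64 data bits plus 8 ECC bits.

PIN_RATE_GBPS = 2.667
TOTAL_BITS = 72
DATA_BITS = 64


def raw_bandwidth_gbytes() -> float:
    """Bandwidth over all 72 pins, in GB/s."""
    return TOTAL_BITS * PIN_RATE_GBPS / 8


def payload_bandwidth_gbytes() -> float:
    """Bandwidth over the assumed 64 data pins only, in GB/s."""
    return DATA_BITS * PIN_RATE_GBPS / 8


if __name__ == "__main__":
    print(f"raw:     {raw_bandwidth_gbytes():.1f} GB/s")
    print(f"payload: {payload_bandwidth_gbytes():.1f} GB/s")
```

With those assumptions a single interface peaks at about 21 GB/s of payload (24 GB/s including the ECC bits).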
Hardened Floating Point DSP
Hardened IEEE 754 Floating Point adder & Multiplier
− 12% DSP Area increase (<<1% die area)
100% Fixed Point backwards compatible
− No performance or power penalty
‘Have your cake and eat it too’
How is this possible?
− Overlaid FP algorithms on Fixed point circuits
Major Innovation – Hard Floating Point on a Commercial FPGA
[Figure: DSP block datapath with 32-bit inputs feeding a multiplier (X) and an adder (+), 32-bit output]
DSP Block – 1000s of blocks at very low latency
1.5 TFLOPS of aggregate computation; 50 GFLOPS/W
− 1678 blocks @ 2 FLOPS/clock @ 450 MHz ≈ 1.5 TFLOPS
− Can run individually or as large integrated DSP system
Hardware recursive structure support (Vector Mode)
− 10s/100s of DSP blocks can be seamlessly integrated
− Internal/External pipelining of individual DSP elements
Very small latency
− Floating Point used for iterative algorithms – require small latency
− Arria 10 Floating Point - 256 length dot products ~ 25 clocks
− Standard FPGA Technology - 256 length systolic FIR filter ~750 clocks
[Figure: DSP blocks in vector mode, each with a multiplier (X) and an adder (+), combining the partial sums AB+CD, EF+GH, IJ+KL, and MN+OP into AB+CD+EF+GH+IJ+KL+MN+OP]
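The aggregate throughput quoted on the DSP Block slide can be checked from its own inputs (1678 blocks, 2 FLOPS per clock, 450 MHz). The sketch below does that arithmetic in Python. The power estimate is an assumption layered on top: it supposes the 50 GFLOPS/W figure applies at peak throughput, which the slide does not say.

```python
# Peak floating-point throughput implied by the DSP Block slide.
# The power estimate assumes 50 GFLOPS/W holds at peak throughput;
# the slide does not state the conditions for that figure.

DSP_BLOCKS = 1678
FLOPS_PER_BLOCK_PER_CLOCK = 2
CLOCK_HZ = 450e6
GFLOPS_PER_WATT = 50


def peak_gflops() -> float:
    """Aggregate peak throughput in GFLOPS."""
    return DSP_BLOCKS * FLOPS_PER_BLOCK_PER_CLOCK * CLOCK_HZ / 1e9


def implied_power_watts() -> float:
    """Power implied by the quoted efficiency, if it applies at peak."""
    return peak_gflops() / GFLOPS_PER_WATT


if __name__ == "__main__":
    print(f"peak:  {peak_gflops():.1f} GFLOPS")
    print(f"power: {implied_power_watts():.1f} W (assumed at peak)")
```

The product comes to about 1,510 GFLOPS, in line with the slide's headline of 1.5 TFLOPS, and under the stated assumption it corresponds to roughly 30 W for the floating-point datapath.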
UltraScale Results
Vivado® routes more complex designs on UltraScale
UltraScale shows lower congestion on complex designs
As a result, timing closure is accelerated
Delivers 1 speedgrade higher Fmax
[Figure: two charts plotted against routing complexity; legend: No routing congestion, High routing congestion, Cannot route]
Power Optimizations
Transceiver
• Architectural optimizations
• Low power mode
I/O
• I/O multi-mode control (cont’d from 28nm)
• DDR4 voltage reduction
Dynamic
• CLB packing & reduced wire length
• HW based clock gating on leaf cells
• BRAM hardened data cascading
• BRAM dynamic power gating
• DSP hardened features
Static
• MMCM & PLL lower supply voltage
• Process node
• Power binning & lower voltage scaling
• 3D IC static power binned slices
[Chart: Transceiver, I/O, Dynamic, and Static power compared across Spartan-6/Virtex-6 (45nm/40nm), 7 Series (28nm), and UltraScale (20nm/16nm); labeled reductions of up to 30%, 40%, 50%, 60%, and 65%, and of 25-45%]
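From one component of a device to the whole device
✤ Following the SoC trend, the approach is shifting from using FPGA functions as one building block of a system toward building the SoC itself on the FPGA
✤ What makes this possible is the increase in semiconductor integration density
✤ Faster circuits are demanded, together with lower power consumption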
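HOT CHIPS 26 materials
✤ Materials from past years are available at http://www.hotchips.org.
✤ For the last several years, videos of the presentations are also available.
✤ For Hot Chips 26, only the keynotes are open to the public so far.
✤ Everything is scheduled to be made public in December.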
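Topics from SuperComputing 2014
SuperComputing
✤ Held every November
✤ This year it ran November 16-22 at the convention center in New Orleans
✤ In addition to the paper sessions for research presentations, there are an exhibition and BoF sessions.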