Case Study: MareNostrum 3

MareNostrum 3 slides from
Javier Bartolomé
BSC System Head
Cloud Computing – MIRI (CLC-MIRI)
UPC Master in Innovation & Research in Informatics
Spring 2013
Jordi Torres, UPC - BSC
www.JordiTorres.eu
MareNostrum 3
36 x IBM iDataPlex Compute racks
–  84 x IBM dx360 M4 compute nodes
•  2x Intel SandyBridge-EP E5-2670 (8-core, 2.6 GHz, 1600 MHz, 20 MB cache, 115 W)
•  8x 4 GB DDR3-1600 DIMMs (2 GB/core)
•  500 GB 7200 rpm SATA II local HDD
4x IBM dx360 M4 compute nodes in the management rack
3028 compute nodes
–  48,448 Intel cores
Memory 94.62 TB
–  32GB/node
Peak performance: 1.0 Pflop/s
–  Node performance: 332.8 Gflops
–  Rack Performance: 27.95 Tflops
–  Rack Consumption: 28.04 kW/rack (nominal under HPL)
Estimated power consumption: 1.08 MW
Infiniband FDR10 non-blocking Fat Tree network topology
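These headline figures follow from the per-node configuration; below is a minimal Python sketch of the arithmetic, assuming the 36 compute racks plus the 4 management-rack nodes listed above (variable names are mine).

# Back-of-the-envelope check of the MareNostrum 3 headline figures,
# using only the per-node numbers listed on this slide.
racks, nodes_per_rack, mgmt_nodes = 36, 84, 4
cores_per_node, gflops_per_core = 16, 20.8      # 2x 8-core E5-2670, AVX peak
mem_per_node_gb = 8 * 4                         # 8x 4 GB DDR3 DIMMs

nodes = racks * nodes_per_rack + mgmt_nodes     # 3028 compute nodes
cores = nodes * cores_per_node                  # 48,448 cores
peak_pflops = cores * gflops_per_core / 1e6     # ~1.0 Pflop/s
mem_tib = nodes * mem_per_node_gb / 1024        # ~94.6; the slide's 94.62 TB matches binary (TiB) units

print(nodes, cores, round(peak_pflops, 2), round(mem_tib, 2))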
MareNostrum 1 - 2 - 3
                    MN1 (2004)   Ratio   MN2 (2006)   Ratio   MN3 (2012)
Compute
  Cores/chip                 1      x2            2      x4            8
  Chips/node                 2                    2                    2
  Cores/node                 2      x2            4      x4           16
  Nodes                   2406    +154         2560    +468         3028
  Total cores             4812      x2        10240   x4.73        48448
Performance
  Freq. (GHz)              2.2                  2.3                  2.6
  Gflops/core              8.8                  9.2                 20.8
  Gflops/node             17.6                 36.8                332.8
  Total Tflops            42.3      x2         94.2  x10.61       1000.0
Memory
  GB/core                    2                    2                    2
  GB/node                    4      x2            8      x4           32
  Total (TB)               9.6      x2           20   x4.84        96.89
Network
  Topology          Non-blocking         Non-blocking         Non-blocking
                      Fat Tree             Fat Tree             Fat Tree
  Latency (µs)               4                    4    x5.7          0.7
  Bandwidth (Gb/s)           4                    4     x10           40
Storage (TB)               236      x2          460   x4.34         2000
Consumption (kW)           650    x1.1          750   x1.44         1080
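The Ratio columns are each generation's value divided by the previous one (the Nodes row shows the absolute increase instead); a minimal Python sketch with a few rows transcribed from the table:

# Generation-over-generation ratios for a few rows of the MN1/MN2/MN3 table.
# The slide rounds some of these (e.g. x2 where the exact value is x2.13 or x2.23).
specs = {
    "Total cores":       (4812, 10240, 48448),
    "Total Tflops":      (42.3, 94.2, 1000.0),
    "Total memory (TB)": (9.6, 20.0, 96.89),
    "Storage (TB)":      (236, 460, 2000),
    "Consumption (kW)":  (650, 750, 1080),
}

for row, (mn1, mn2, mn3) in specs.items():
    print(f"{row:18s} MN1->MN2 x{mn2 / mn1:.2f}   MN2->MN3 x{mn3 / mn2:.2f}")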
MN3 HW Layout
[Floor plan of the machine room: compute rack positions C1-C40 (labelled s01r1-s18r2), Infiniband racks IB1-IB7, plus racks D1-D5 and M1-M3 (the storage and management racks shown on later slides).]
MN3 Compute Racks
[Same floor plan as the HW layout slide, here highlighting the compute rack positions (C1-C40).]
MN3 iDataPlex Compute Rack
84x IBM System x iDataPlex dx360 M4 servers
4x Mellanox 36-port Managed FDR10 IB switches
–  12x compute nodes connected to leaf switches at the IB core racks.
GPFS Network
–  2x BNT RackSwitch G8052F
Management Network
–  2x BNT RackSwitch G8052F
iDPX rack with RDHX (water cooling)
Performance
–  2.60 GHz x 8 flops/cycle (AVX) = 20.8 Gflops/core
–  16 cores x 20.8 Gflops/core = 332.8 Gflops/node
–  84 nodes x 332.8 Gflops/node = 27.95 Tflops/rack
[Figure: rack elevation showing the dx360 M4 nodes, the BNT G8052 GPFS and management switches, the Mellanox FDR10 36-port switches and the 3P 32A PDUs.]
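A minimal Python sketch of the peak-performance chain above, from clock rate and AVX width up to the rack level:

# Peak-performance chain for one iDataPlex compute rack.
ghz            = 2.60    # E5-2670 nominal clock
flops_per_cyc  = 8       # AVX: 4-wide double precision, add + multiply per cycle
cores_per_node = 16      # 2 sockets x 8 cores
nodes_per_rack = 84

gflops_core = ghz * flops_per_cyc                   # 20.8 Gflops/core
gflops_node = gflops_core * cores_per_node          # 332.8 Gflops/node
tflops_rack = gflops_node * nodes_per_rack / 1000   # ~27.96 Tflops/rack (the slide quotes 27.95)

print(gflops_core, gflops_node, round(tflops_rack, 2))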
MN3 dx360 M4 Compute Node
•  1 chassis (2U) for 2 nodes with shared:
   •  Power (2x 900 W, redundant N+N)
   •  Cooling (80 mm fans)
•  Each node:
   •  2x 1GbE interfaces
   •  1x IMM interface (IMMv2 firmware)
   •  Mellanox ConnectX-3 dual-port FDR10 QSFP IB mezzanine card
   •  One 3.5'' SATA drive
[Photos: front and rear of the chassis, showing the dual-port QSFP FDR10 IB mezzanine card in each node.]
Compute Node Block diagram
[Figure: block diagram of the dx360 M4 compute node.]
MN3 Network configuration
iDataPlex rack switches (each dx360 M4 compute node connects to all three networks):
–  2x Management & Boot Ethernet switches (BNT RackSwitch G8052F)
   •  Management Network (IMM and xCAT/Boot), 1 Gb/s copper
   •  41x dx360 M4 per switch, plus the IB leaf switch management ports
–  2x GPFS Ethernet switches (BNT RackSwitch G8052F)
   •  GPFS Network, 1 Gb/s copper
   •  41x dx360 M4 per switch
–  4x Infiniband leaf switches (Mellanox 36-port Managed FDR10 IB Switch)
   •  Infiniband FDR10, 40 Gb/s copper to the nodes
   •  17x dx360 M4 per switch; 18 optical uplinks per switch (FDR10 = 4x 10 Gb/s)
Each node uses 2x 1 Gb/s copper connections to the rack Ethernet switches.
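A rough port-budget check for one 36-port FDR10 leaf switch, assuming the counts on this slide (17 node links, 18 optical uplinks); the exact port assignment is an assumption:

# Port budget of one Mellanox 36-port FDR10 leaf switch in a compute rack.
ports_total = 36
node_links  = 17     # 17x dx360 M4 on 40 Gb/s copper (this slide)
uplinks     = 18     # 18 optical links toward the core switches (this slide)

used = node_links + uplinks
print(f"used {used}/{ports_total} ports, {ports_total - used} spare")
print(f"downlinks:uplinks = {node_links}:{uplinks}, i.e. "
      f"{node_links / uplinks:.2f}:1 - at most 1:1 keeps the fat tree non-blocking")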
MN3 Network configuration
iDataPlex rack switches, by traffic type (each dx360 M4 compute node sees all three networks):
–  Management Network (IMM and xCAT/Boot): OS services traffic (xCAT, boot, LSF, Ganglia)
   •  2x Management & Boot Ethernet switches (BNT RackSwitch G8052F), 41x dx360 M4 per switch, 1 Gb/s copper, plus the IB leaf switch management ports
–  GPFS Network: I/O traffic
   •  2x GPFS Ethernet switches (BNT RackSwitch G8052F), 41x dx360 M4 per switch, 1 Gb/s copper
–  Infiniband FDR10: MPI application traffic
   •  4x Infiniband leaf switches (Mellanox 36-port Managed FDR10 IB Switch), 17x dx360 M4 per switch on 40 Gb/s copper, 18 optical uplinks per switch
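A compact restatement of the three per-rack networks and the traffic each carries, as a small Python structure (a summary sketch, not configuration read from the machine):

# The three networks seen by every dx360 M4 compute node and the traffic they carry
# (summarised from the two network-configuration slides above).
networks = {
    "Management (IMM, xCAT/Boot)": {
        "switches": "2x BNT RackSwitch G8052F per rack",
        "links":    "1 Gb/s copper",
        "traffic":  "OS services: xCAT, boot, LSF, Ganglia",
    },
    "GPFS": {
        "switches": "2x BNT RackSwitch G8052F per rack",
        "links":    "1 Gb/s copper",
        "traffic":  "GPFS I/O traffic",
    },
    "InfiniBand FDR10": {
        "switches": "4x Mellanox 36-port FDR10 leaf switches per rack",
        "links":    "40 Gb/s copper to the nodes, 18 optical uplinks per leaf",
        "traffic":  "MPI application traffic",
    },
}

for name, cfg in networks.items():
    print(f"{name}: {cfg['traffic']} ({cfg['links']})")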
MN3 Infiniband Racks
[Same floor plan, here highlighting the Infiniband racks (IB1-IB7).]
MN3 Infiniband Network
6x Infiniband racks
–  Mellanox 648-port FDR10 Infiniband core switch (29U)
1x Infiniband rack: leaf IB switches + UFM servers
–  18x Mellanox 36-port Managed FDR10 IB Switch
–  2x Infiniband UFM servers
   •  Unified Fabric Manager: provision, monitor and operate the data center fabric
144x Mellanox 36-port Managed FDR10 IB Switch
[Photos: front and back (cabling) of the Infiniband racks.]
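A rough fabric-level tally, multiplying out the switch counts on these slides and assuming 18 optical uplinks per compute-rack leaf switch as on the network-configuration slide:

# Fabric-level tally of the MN3 InfiniBand FDR10 fat tree.
compute_racks    = 36
leaves_per_rack  = 4
uplinks_per_leaf = 18      # optical links per leaf (network-configuration slide)
core_switches    = 6       # one 648-port core switch per Infiniband rack
core_ports       = 648

rack_leaves  = compute_racks * leaves_per_rack   # 144, matching the "144x" above
leaf_uplinks = rack_leaves * uplinks_per_leaf    # 2592 optical links toward the core
core_total   = core_switches * core_ports        # 3888 core ports available

print(rack_leaves, leaf_uplinks, core_total)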
MN3 Management racks
[Same floor plan, here highlighting the management racks (M1-M3).]
MN3 Storage racks
[Same floor plan, here highlighting the storage racks (D1-D5).]
Storage HW
3x Data Building Blocks, each with:
–  8x Data Servers (x3550 M3) with 48 GB main memory
–  1x DS5300 controller couplet
–  8x EXP5060 enclosures
   •  400x SATA 2 TB 7,200 rpm disks (50 disks/enclosure)
   •  10 empty disk slots per enclosure
   •  Total capacity: 800 TB
   •  Net capacity: 640 TB (RAID6 8+2P)
Total data capacity:
–  1200x SATA 2 TB 7,200 rpm disks
–  Net capacity: 1920 TB (RAID6 8+2P)
1x Metadata Building Block
–  6x Metadata servers (x3650 M3) with 128 GB main memory
–  1x DS5300 controller couplet (4U)
–  8x Storage Enclosure Expansion Units
–  112x FC 600 GB 15,000 rpm disks (16 disks/enclosure)
–  Total capacity: 67.2 TB
–  Net capacity: 33.6 TB (RAID 1)
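A minimal Python sketch of the capacity arithmetic above, assuming RAID6 8+2P keeps 8 of every 10 disks' worth of data and RAID 1 keeps half:

# Capacity arithmetic for the GPFS storage building blocks on this slide.
# Data building blocks: 3 blocks x 8 EXP5060 enclosures x 50 SATA 2 TB disks.
disks_per_block = 8 * 50                   # 400 disks per data building block
raw_per_block   = disks_per_block * 2      # 800 TB raw per block
net_per_block   = raw_per_block * 8 / 10   # 640 TB net with RAID6 8+2P
net_total       = 3 * net_per_block        # 1920 TB net over 1200 disks

# Metadata building block: 112x FC 600 GB disks, mirrored (RAID 1).
meta_raw_tb = 112 * 600 / 1000             # 67.2 TB raw
meta_net_tb = meta_raw_tb / 2              # 33.6 TB net

print(net_per_block, net_total, meta_raw_tb, meta_net_tb)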