This project and the research leading to these results have received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.
http://www.montblanc-project.eu
The Mont-Blanc prototype
Alex Ramirez, Barcelona Supercomputing Center
Commodity components drive HPC
• Microprocessors replaced Vector/SIMD supercomputers
• They were not faster
• They were cheaper
Top500 1993, 1st edition:
• Cray vector: 41%
• MasPar SIMD: 11%
• Convex/HP vector: 5%
Mobile SoC vs Server processor
• Performance: 5.2 GFLOPS (mobile SoC, CPU only) vs. 153 GFLOPS (server processor) — roughly x30
• Cost: $21 (2) vs. $1500 (3) — roughly x70
• Adding the embedded GPU: 32.3 GFLOPS (1) at the same ~$21 (?) — the performance gap shrinks to roughly x5 while the cost gap stays at x70

1. 6.8 GFLOPS from the CPU + 25.5 GFLOPS from the embedded GPU
2. Leaked Tegra 3 price from the Nexus 7 Bill of Materials
3. Non-discounted list price for the 8-core Intel E5 Sandy Bridge
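The gap factors follow directly from the figures quoted above; a quick, purely illustrative check using only the numbers on this slide:

```python
# Quick check of the performance and cost gaps quoted above.
soc_cpu_gflops = 5.2      # mobile SoC, CPU only
soc_total_gflops = 32.3   # 6.8 GFLOPS (CPU) + 25.5 GFLOPS (embedded GPU)
server_gflops = 153.0     # 8-core Intel E5 Sandy Bridge
soc_cost_usd = 21.0       # leaked Tegra 3 price (Nexus 7 BoM)
server_cost_usd = 1500.0  # non-discounted list price

print(f"CPU-only performance gap: x{server_gflops / soc_cpu_gflops:.0f}")    # ~x29
print(f"CPU+GPU performance gap:  x{server_gflops / soc_total_gflops:.1f}")  # ~x4.7
print(f"Cost gap:                 x{server_cost_usd / soc_cost_usd:.0f}")    # ~x71
```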
Samsung Exynos 5 Dual Superphone SoC
• 32 nm HKMG process
• Dual-core ARM Cortex-A15 @ 1.7 GHz
• Quad-core ARM Mali-T604
  • OpenCL 1.1 (see the device-discovery sketch below)
• Dual-channel DDR3
• USB 3.0 to 1 GbE bridge
• All in a low-power mobile socket
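Since the Mali-T604 is exposed through OpenCL 1.1, device discovery is the natural first step; a minimal sketch using pyopencl (running pyopencl on top of the vendor OpenCL driver is an assumption here, not something the slides describe):

```python
# Enumerate OpenCL platforms and devices; on the Exynos 5 Dual the
# Mali-T604 should show up as a GPU device under the OpenCL 1.1 driver.
import pyopencl as cl  # assumption: pyopencl installed over the vendor OpenCL driver

for platform in cl.get_platforms():
    print("Platform:", platform.name, "|", platform.version)
    for device in platform.get_devices():
        print("  Device:        ", device.name)
        print("  Type:          ", cl.device_type.to_string(device.type))
        print("  Compute units: ", device.max_compute_units)
        print("  Global mem MB: ", device.global_mem_size // (1024 * 1024))
```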
[Board photo callouts: Exynos 5 Dual (2x ARM Cortex-A15 + ARM Mali-T604), 4 GB of DDR3-1600, uSD slot (up to 64 GB), USB 3.0 to 1 GbE bridge]
Samsung Daughter Board (SDB)
• CPU + GPU + DRAM + storage + network … all in a compute card just 8.5x5.6 cm
[Blade photo callouts: 15 compute cards (30x ARM Cortex-A15, 15x ARM Mali-T604 GPU, 120 GB DDR3), 1 GbE crossbar switch, cluster management, 2x 10 GbE links]
Embedded Mother Board (EMB)
• 15-node cluster in a standard Bull B505 enclosure
[Chassis photo callouts: 9x compute blades (135x compute cards + 36x 10 GbE links; 270x ARM Cortex-A15, 135x ARM Mali-T604 GPU, 540 GB DDR3-1600), chassis management panel]
Mont-Blanc server chassis
• 9 blades in a standard 7U BullX chassis
• Shared cooling, PSU, and chassis management
The Mont-Blanc prototype
• 6 BullX chassis
• 54 compute blades
• 810 compute cards
• 1620 CPUs
• 810 GPUs
• 3.2 TB of DRAM
• 52 TB of Flash
• 26 TFLOPS
• 18 kW
(see the arithmetic check below)
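The totals above are consistent with the per-card and per-chassis figures on the previous slides; a small illustrative check of that arithmetic:

```python
# Derive the prototype totals from the per-card figures (2x Cortex-A15,
# 1x Mali-T604, 4 GB DDR3, up to 64 GB uSD, ~32.3 GFLOPS per compute card).
chassis = 6
blades_per_chassis = 9
cards_per_blade = 15

cards = chassis * blades_per_chassis * cards_per_blade  # 810 compute cards
cpus = cards * 2                                         # 1620 CPUs
gpus = cards * 1                                         # 810 GPUs
dram_tb = cards * 4 / 1000                               # ~3.2 TB of DRAM
flash_tb = cards * 64 / 1000                             # ~52 TB of flash (uSD)
tflops = cards * 32.3 / 1000                             # ~26 TFLOPS

print(cards, cpus, gpus, round(dram_tb, 1), round(flash_tb, 1), round(tflops))
```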
Interconnection Network
[Network diagram: each of the 15 SDBs in a blade has a 1 Gb/s link into the EMB's GbE crossbar switch (15 x 1 Gb/s); each EMB has 2x 10 Gb/s uplinks; the EMBs of each chassis connect towards the core through bundles of 18 x 10 Gb/s links, with 160 Gb/s at the core level; the Lustre server(s) attach via N x 10 Gb/s links]
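One number worth reading off the diagram is the blade-level injection vs. uplink bandwidth; a small, purely illustrative calculation from the link counts above:

```python
# Blade-level (EMB) bandwidth balance, from the link counts in the diagram.
injection_gbps = 15 * 1   # 15 SDBs x 1 Gb/s into the GbE crossbar
uplink_gbps = 2 * 10      # 2 x 10 GbE uplinks per EMB

print("Injection per blade:", injection_gbps, "Gb/s")
print("Uplinks per blade:  ", uplink_gbps, "Gb/s")
print("Uplink / injection ratio:", round(uplink_gbps / injection_gbps, 2))  # ~1.33
```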
Per-node power monitoring
• The EMB features a power sensor for every SDB
  • TI INA209 power meter
  • 1 sample every 16 ms, 5% accuracy
• Data is aggregated on an FPGA on the EMB
  • 15 aggregated samples every 1.12 s
• Samples are offloaded to the BMC via I2C every 500 ms
• Users can access the readings in the BMC through the management network using IPMI calls (see the polling sketch below)
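A minimal polling sketch for reading those BMC values over the management network; the host name, credentials, and sensor naming below are assumptions for illustration, and whether the readings appear under the standard `ipmitool sensor` listing is not confirmed by the slides:

```python
# Poll per-node power readings from the EMB's BMC over the management network.
# Assumptions: ipmitool is available, and the BMC exposes the per-SDB power
# values as regular IPMI sensors (host, credentials, and names are hypothetical).
import subprocess
import time

BMC_HOST = "emb01-bmc"   # hypothetical management-network address
BMC_USER = "admin"       # hypothetical credentials
BMC_PASS = "secret"

def read_power_sensors():
    out = subprocess.check_output(
        ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
         "-U", BMC_USER, "-P", BMC_PASS, "sensor"],
        text=True)
    # Keep only sensor lines that look like power readings (hypothetical naming).
    return [line for line in out.splitlines() if "Power" in line]

while True:
    for line in read_power_sensors():
        print(line)
    time.sleep(0.5)  # new aggregated samples reach the BMC every 500 ms
```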
Partners + roles
• BSC
  • Concept + (shifting) architecture requirements
  • SoC benchmarking
• ARM
  • System software stack
  • Linux kernel + OpenCL drivers
• Network benchmarking
• System software stack
• LRZ
  • Power monitoring requirements
• Bull
  • System architecture
  • PCB design
Prototype development
• Oct ’12: SoC selected for prototype integration
• Nov ’12: Prototype architecture to use the SoM / microserver concept (instead of a flat multi-node blade)
• Mar ’13: Finalization of the interconnect network
  • 10 GbE switch integrated in the Bull chassis
  • 10 GbE pass-through module integrated in the Bull chassis
  • 10 GbE cables from the top of the blade
  • 10 GbE cables from the front of the blade
• Jul ’13: First SDB samples received (EVT boards)
  • Never-ending OS bring-up and driver work starts …
• Sep ’13: First EMB samples received (EVT boards)
• Mar ’14: Second round of SDB and EMB boards (DVT boards); green light for prototype procurement
Prototype procurement
• Complex procurement due to BSC + FP7 rules
  • The prototype provider can't be Bull
    • Because Bull is a partner …
    • … and this is going to be BSC property
  • BSC can't pay Bull with project funds
• Bypass Bull and order the hardware directly from their provider
  • Bull will still integrate the hardware and deploy the prototype
• Still fighting with bureaucracy to publish the prototype procurement
  • The budget is above €150,000 => the tender must be public
  • Exclusive contract assigned to one provider
    • Must argue technical + IP reasons
    • Timing is NOT a valid reason!
  • But still, anyone could apply …
Prototype deployment
• Large-scale prototype not deployed yet …
  • … trouble still needs to happen there
• SDB bring-up has been a never-ending process
  • Mismatch between kernel version + OpenCL driver + GbE driver
  • Then add the Lustre client version
    • The Lustre client did not work for kernels 3.7 to 3.10 …
    • … so we dropped Lustre for GlusterFS …
    • We perform all the benchmarking on a GlusterFS server
    • … and Lustre works again in kernel 3.11
• The hardware platform for application developers has been in short supply
  • Arndale kits with the Exynos 5 Dual out of stock since June ’13
Prototype evaluation
• What works
  • Everything works:
    • Multi-core CPU
    • Embedded GPU accelerator + OpenCL
    • GbE interconnect
    • Lustre client
    • HPC software stack
    • Cluster management console
    • Performance monitoring + analysis
    • Debugger
    • …
• What doesn't work
  • Wait until we deploy the large-scale version …
  • … Ethernet use in HPC isn't simple
Exynos 5 Dual vs. Quad-core Intel i7
[Bar chart: performance and energy relative to the quad-core Cortex-A9, for Quad Cortex-A9, Dual Cortex-A15, Quad Mali-T604, and Quad i7]
Exynos 5 Octa (5420)
• Quad-core ARM Cortex-A15, for performance
• Quad-core ARM Cortex-A7, for energy efficiency
• Six-core ARM Mali-T628, for OpenCL accelerator
  • 50% more GPU cores than the Exynos 5 Dual
  • 50% higher compute performance per core
• Higher CPU and DDR frequencies
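From the two 50% figures above, a rough, purely illustrative projection of the GPU-side gain over the Exynos 5 Dual (speculative, in the same spirit as the chart on the next slide; no vendor commitment implied):

```python
# Rough GPU-side projection: Exynos 5 Octa (5420) vs. Exynos 5 Dual,
# using only the two 50% figures quoted above. Illustrative, not measured.
core_ratio = 6 / 4       # six-core Mali-T628 vs. four-core Mali-T604: 50% more cores
per_core_gain = 1.5      # 50% higher compute performance per core

print(f"Projected GPU compute gain: ~{core_ratio * per_core_gain:.2f}x")  # ~2.25x
```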
Exynos 5 Octa projected performance …
[Bar chart: projected performance relative to the quad-core Cortex-A9, for Quad Cortex-A9, Dual Cortex-A15, Quad Mali-T604, Quad i7, and Exynos 5 Octa]
Speculative data, no commitment from ARM or Samsung implied
Interconnect evaluation: latency
• TCP/IP adds significant CPU overhead
• The OpenMX driver interfaces "directly" to the Ethernet NIC
• USB in the Exynos 5 adds extra latency on top of the network stack
Interconnect evaluation: bandwidth
• TCP/IP overhead prevents the Tegra 2 from achieving full bandwidth
  • OpenMX does achieve peak bandwidth
• USB overheads prevent the Exynos 5 from achieving full bandwidth, even with OpenMX (see the ping-pong sketch below)
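Latency and bandwidth figures like these come from simple point-to-point ping-pong runs; a minimal sketch of such a measurement, written here with mpi4py (the actual Mont-Blanc benchmarks and their OpenMX/TCP configuration are not shown in the slides, so treat this as an independent illustration):

```python
# Minimal MPI ping-pong between ranks 0 and 1: reports one-way latency for
# small messages and effective bandwidth for large ones.
# Run with exactly two ranks, e.g.: mpirun -np 2 python pingpong.py
import numpy as np
from mpi4py import MPI   # assumption: mpi4py available on top of the MPI stack

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
REPS = 100

for size in (1, 1024, 64 * 1024, 1024 * 1024):
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        else:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    rtt = (MPI.Wtime() - start) / REPS           # round-trip time per iteration
    if rank == 0:
        bw_mbps = 2 * size / rtt / 1e6           # MB/s, counting both directions
        print(f"{size:8d} B  one-way latency {rtt / 2 * 1e6:8.1f} us  "
              f"bandwidth {bw_mbps:8.1f} MB/s")
```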
Advances over State of the Art
• Developed an entire HPC software ecosystem on ARM
  • Including an OpenCL driver for the GPU accelerator
  • Performance monitoring + tracing + analysis
  • Advanced parallel programming models
    • MPI + OmpSs @ OpenCL + FORTRAN
• Fine-grain per-node power monitoring
  • Opens the door to future per-node power management
  • Combined with SoC power management features
• Microserver architecture for HPC
  • Built on a commodity SoC + a commodity network (Ethernet)
Limitations of current mobile processors for HPC
• 32-bit memory controller
  • Even though the ARM Cortex-A15 offers a 40-bit address space
• No ECC protection in memory
  • Limited scalability; errors will appear beyond a certain number of nodes
• No standard server I/O interfaces
  • They do NOT provide native Ethernet or PCI Express
  • They provide USB 3.0 and SATA (required for tablets)
• No network protocol offload engine
  • TCP/IP, OpenMX, and USB protocol stacks run on the CPU
• Thermal package not designed for sustained full-power operation
• All of these are implementation decisions, not unsolvable problems
  • We only need a business case to justify the cost of including the new features … such as the HPC and server markets
Future directions
• Continue exploiting commodity SoCs (from tablets / smartphones)
  • No ECC memory protection
    • Software error checking + checkpointing
  • No server I/O interfaces + no protocol offload
    • Custom interconnect protocol + interface + switch
• Develop server versions of those commodity SoCs
  • Small + low-power SoCs using the same IP
  • Integrating ECC checksums
  • Integrating 10/40 Gb/s Ethernet + TCP/IP protocol offload
• Develop HPC-class versions of those SoCs
  • Large + high-end many-core versions using the same IP
  • Develop custom packaging + liquid cooling
• All of them rely on the same Mont-Blanc software stack
2016 SuperPhone SoC: 80 + 160 GFLOPS
• Quad-core ARM Cortex-A57, for performance
  • 2.5 GHz x 8 ops/cycle = 20 GFLOPS / core
• Quad-core ARM Cortex-A53, for energy efficiency
• 16-core ARM Mali-T760, for OpenCL accelerator
  • 2x more cores than the Mali-T678 + higher performance/watt
  • 833 MHz x 12 ops/cycle = 10 GFLOPS / core
• 10 Gb/s I/O interface (USB 3.1)
• DDR4: higher frequency + bus width / extra channels
  • 25.6 GB/s
Speculative data extrapolated from public sources, no commitment from ARM is implied
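The 80 + 160 GFLOPS headline follows directly from the per-core estimates above; a trivial check (same caveat as the slide: speculative extrapolation, no vendor commitment implied):

```python
# Check of the 80 + 160 GFLOPS headline from the per-core estimates above.
a57_gflops = 4 * 2.5 * 8            # 4 cores x 2.5 GHz x 8 ops/cycle    = 80
mali_t760_gflops = 16 * 0.833 * 12  # 16 cores x 0.833 GHz x 12 ops/cycle ~= 160
print(a57_gflops, round(mali_t760_gflops))  # 80.0 160
```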
Conclusions
• The convergence of embedded and HPC technologies has already happened
  • We have already enabled the software
• Leverage all of the embedded systems technology to build a new class of HPC system
  • Automated SoC design
  • Automatic core customization
  • SoC power management
  • Decoupled IP provider from semiconductor provider