SYSTEMS LANDSCAPE FOR OPTICAL INTERCONNECT
Mike Woodacre
HPE Fellow, CTO for High-Performance Computing and Mission Critical
[email protected]
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity
Enterprise System Challenges
• Power density of processors going up
• DRAM reliability sensitive to temperature
• Need for air cooling in typical enterprise data centers
• Transactional applications look like random memory access, driving large memory footprints
• Reliability at small scale is key
• Minimize stranded resources / increase resource utilization
FACTORS AT WORK FOR OPTICAL INTERCONNECT
• Electrical reach is falling
  – HPE exascale systems use 2.3m electrical links; IEEE 802.3ck (100 Gbps/lane) is targeting 2m
  – The power required to drive these links is high
  – Optical cables are required for inter-rack links
• Optical link costs are improving
  – COGS is falling, but significant investment is required to enable SiPh in broad use with low-cost, reliable endpoints
• New generations of systems will grow the optical footprint
• Power density of systems is reaching the limit
• Changing resource utilization requires increasing flexibility
HPC System Challenges
• Power density of processors going up
• Scale is key, driving cost reduction of the resources supporting the processor (memory capacity, IO)
• Integrated HBM capacity opens up the opportunity for additional memory/persistent memory further away from the processor
• High end dominated by direct liquid cooling
• Reliability at scale is key
MODELING & SIMULATION × BIG DATA ANALYTICS × ARTIFICIAL INTELLIGENCE → THE EXASCALE ERA
Traditional Ethernet Networks
• Ubiquitous & interoperable
• Broad connectivity ecosystem
• Broadly converged network
• Native IP protocol
• Efficient for large payloads only
• High latency
• Limited scalability for HPC
• Limited HPC features

HPE Slingshot
• Standards based / interoperable
• Broad connectivity
• Converged network
• Native IP support
• Low latency
• Efficient for small to large payloads
• Full set of HPC features
• Very scalable for HPC & Big Data

Traditional HPC Interconnects
• Proprietary (single vendor)
• Limited connectivity
• HPC interconnect only
• Expensive / slow gateways
• Low latency
• Efficient for small to large payloads
• Full set of HPC features
• Very scalable for HPC & Big Data
INTRODUCING HPE SLINGSHOT
✓ Consistent, predictable, reliable high performance: high bandwidth + low latency, from one rack to exascale
✓ Excellent for emerging infrastructures: mix tightly coupled HPC, AI, analytics, and cloud workloads
✓ Native connectivity to data center resources
EXASCALE TECHNOLOGY WINS FOR HPE CRAY EX SYSTEMS
Announced: 30-OCT-2018, 18-MAR-2019, 7-MAY-2019, 5-MAR-2020, 1-OCT-2020
SLINGSHOT OVERVIEW
“Rosetta” Switch ASIC (Rosetta 64×200 switch)
• 100GbE/200GbE interfaces
• 64 ports at 100/200 Gbps
• 12.8 Tbps total bandwidth
Slingshot Integrated Blade Switch – direct liquid cooled
Slingshot Top of Rack Switch – air cooled
High-performance switch microarchitecture – over 250K endpoints with a diameter of just three hops
Ethernet compliant – easy connectivity to datacenters and third-party storage
World-class adaptive routing and QoS – high utilization at scale; flawless support for hybrid workloads
Effective and efficient congestion control – performance isolation between workloads
Low, uniform latency – focus on tail latency, because real apps synchronize
SLINGSHOT PACKAGING – SHASTA MOUNTAIN/OLYMPUS (SCALE OPTIMIZED)
[Figure: switch chassis and compute chassis; “Colorado” switch mounted horizontally and compute blades mounted vertically, so switch and NICs mate at a 90-degree orientation; dual-NIC mezzanine card internal to each blade; QSFP-DD group and global cables exit the rear to the fabric (group/global links)]
CABLING THE SLINGSHOT NETWORK
WHAT IS A DRAGONFLY NETWORK? A WAY TO MINIMIZE LONG CABLES
• Nodes are organized into groups, usually racks or cabinets
  – Electrical links from NIC to switch
• All-to-all amongst switches in a group
  – Electrical links between switches
• All-to-all between groups
  – Optical links
• Groups can have different characteristics
• The global network can be tapered to reduce cost
[Figure: eight groups (Group 0–7), mixing flexible compute/I/O and high-density compute; electrical links within each group, global optical links between groups]
• Enabled by a high-radix router
• Host:local:global = 1:2:1
• Max ports ≈ 4×(radix/4)^4
• E.g. 64 ports: host links (16), local links (32), global links (16); max ports = 262,656 (see the sketch below)
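To make the arithmetic behind the 262,656 figure concrete, here is a minimal sketch (Python) that derives the endpoint count from the switch radix and the 1:2:1 port split. The group size of 32 switches and the one-global-link-per-group-pair rule are assumptions chosen to reproduce the number on the slide, not details taken from it.

```python
def dragonfly_max_endpoints(radix: int) -> int:
    """Approximate maximum endpoints for a dragonfly with a 1:2:1 host:local:global split."""
    host_ports = radix // 4     # links to NICs                (16 for radix 64)
    local_ports = radix // 2    # links to switches in a group (32 for radix 64)
    global_ports = radix // 4   # links to other groups        (16 for radix 64)

    # Assumption: one switch per local port, i.e. 32 switches per group.
    switches_per_group = local_ports
    hosts_per_group = switches_per_group * host_ports        # 512
    globals_per_group = switches_per_group * global_ports    # 512

    # Assumption: one global link to every other group, so groups = globals + 1.
    num_groups = globals_per_group + 1                       # 513

    return num_groups * hosts_per_group                      # 513 * 512


print(dragonfly_max_endpoints(64))  # 262656, matching the slide
```

The closed-form approximation on the slide, 4×(radix/4)^4, gives 262,144 for radix 64; the exact count above differs only by the extra "+1" group.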
WHY DRAGONFLY?
• Dragonfly lets us use more copper cables, which has several advantages
  – Cost – each link is about 10X less expensive when you use copper
  – Reliability – links built with copper cables are more than an order of magnitude more reliable than links built with AOCs
  – Power – copper cables are passive and use no additional power; active electrical cables use about half the power of an AOC currently
• Dragonfly allows you to cable a network of a given global bandwidth with about half the optical cables needed in a fat tree
• With a high-radix switch, Dragonfly gives very low-diameter topologies, meaning we don’t care about placement of workloads
BENEFITS OF LOW DIAMETER NETWORKS
Fat-tree (classical topology), 512 servers
• 16 “in-row” 64-port switches
• 16 top-of-rack 64-port switches
• 512 optical cables
• 512 copper cables
• 512 NICs

All-to-all topology, 512 servers
• 0 “in-row” switches
• 16 top-of-rack 64-port switches
• 240 optical cables
• 512 copper cables
• 512 NICs

The sketch below shows where the two optical-cable counts come from.
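As a check on the counts above, a minimal sketch: the fat tree turns every non-host top-of-rack port into an optical uplink, while the all-to-all design connects the 16 top-of-rack switches directly to each other. The two-cables-per-switch-pair figure is an assumption that reproduces the 240 number; the slide does not state it.

```python
from math import comb

SERVERS = 512
TOR_SWITCHES = 16
RADIX = 64                                        # 64-port switches

hosts_per_tor = SERVERS // TOR_SWITCHES           # 32 copper host links per ToR switch
copper_cables = SERVERS                           # NIC-to-ToR links, same in both designs

# Fat tree: every remaining ToR port runs optically up to an "in-row" switch.
fat_tree_optical = TOR_SWITCHES * (RADIX - hosts_per_tor)      # 16 * 32 = 512

# All-to-all: ToR switches connect directly to one another.
# Assumption: two parallel optical cables between each pair of ToR switches.
cables_per_pair = 2
all_to_all_optical = comb(TOR_SWITCHES, 2) * cables_per_pair   # 120 * 2 = 240

print(fat_tree_optical, all_to_all_optical, copper_cables)     # 512 240 512
```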
Superdome Flex Scales up Seamlessly as a Single System
• 4-socket building block
• 4 sockets (up to 6TB), 8 sockets (up to 12TB), 16 sockets (up to 24TB), 20 sockets (up to 30TB), 32 sockets (up to 48TB)
• 12-socket, 24-socket, and 28-socket configurations not pictured
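The capacities scale linearly with the 4-socket chassis count. A minimal sketch, assuming each socket carries 12 DIMMs (per the chassis slide that follows) populated with 128 GiB DIMMs, i.e. 1.5 TiB per socket; the DIMM size is an assumption, not stated on the slide.

```python
SOCKETS_PER_CHASSIS = 4        # the 4-socket building block
DIMMS_PER_SOCKET = 12          # from the chassis description
DIMM_TIB = 128 / 1024          # assumption: 128 GiB DIMMs -> 0.125 TiB each

for chassis in range(1, 9):    # 1 to 8 chassis, i.e. 4 to 32 sockets
    sockets = chassis * SOCKETS_PER_CHASSIS
    memory_tib = sockets * DIMMS_PER_SOCKET * DIMM_TIB
    print(f"{sockets:2d} sockets: up to {memory_tib:.0f} TB")
```

This reproduces the 6/12/.../48 TB progression, including the 12-, 20-, 24-, and 28-socket steps.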
ENTERPRISE SHARED MEMORY TECHNOLOGY: MODULAR BUILDING BLOCKS WITH MEMORY FABRIC
SUPERDOME FLEX CHASSIS
• Minimizing the PCB cost of the UPI interconnect between the 4 sockets (with 12 DIMMs/socket) pushes the longer PCIe connections to the rear of the chassis
• DRAM memory wants to remain cool for reliability
• Custom PCIe ribbon cables to the rear of the chassis enable use of all the PCIe pins from each socket
• QSFP interconnect ports
HPE SUPERDOME FLEX SERVER 8/16/32 SOCKET ARCHITECTURE
SUPERDOME FLEX PROTOTYPE USE OF OPTICAL CABLING (64-socket SD-Flex, 2-rack)
• Production copper cabling: up to 30 QSFP 100Gb cables per chassis
• Optical cable example: 4:1 reduction in cable count
• Optical cabling would significantly reduce the complexity and support burden of the fabric, with an MBO solution similar to the one shown in the ‘Machine’ prototype below
Improving Serviceability
TRADITIONAL VS. MEMORY-DRIVEN COMPUTING ARCHITECTURE
Today’s architecture is constrained by the CPU: DDR, Ethernet, PCI, and SATA devices all attach to it, and if you exceed what can be connected to one CPU, you need another CPU.
Memory-Driven Computing: mix and match at the speed of memory.
DRIVING OPEN PROCESSOR INTERFACES: PCIE/CXL/GEN-Z
Memory Semantic Protocols
• Replace processor-local interconnects: (LP)DDR, PCIe, SAS/SATA, etc.
• CXL/Gen-Z enable memory semantics outside the processor's proprietary domain
• CXL speeds are based on PCIe Gen5/Gen6
• Disaggregation of devices, especially memory/persistent memory and accelerators
[Figure: Past – CPU/SoC with (LP)DDR, PCIe, SAS/SATA, Ethernet/IB, and proprietary links; Future – SoC with PCIe/CXL and Gen-Z links]
PCIe generation (per-lane rate): Gen4 (16Gb), Gen5 (32Gb), Gen6 (64Gb)
PCIe x16 supports: 200Gb, 400Gb, 800Gb respectively
[Figure: x16 PCIe card or x16 CXL card in an x16 connector, attached to the processor over the PCIe channel SERDES]
• CXL runs across the standard PCIe physical layer with new protocols optimized for cache and memory
• CXL uses a flexible processor Port that can auto-negotiate to either the standard PCIe transaction protocol or the alternate CXL transaction protocols
• First generation CXL aligns to 32 GT/s PCIe 5.0 specification
• CXL usages expected to be key driver for an aggressive timeline to PCIe 6.0 architecture
PCIe enables processors to drive higher Ethernet speeds (see the sketch below)
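As a rough check on the table above, a minimal sketch relating per-lane PCIe signaling rate to the Ethernet rate an x16 port can feed. The 128b/130b line-code overhead used for Gen4/Gen5 is a general PCIe fact, and applying roughly the same overhead to Gen6 (which actually uses PAM4 with FLIT mode and FEC) is a simplification; neither detail comes from the slide.

```python
# Per-lane raw signaling rate (GT/s) for each PCIe generation, and the
# Ethernet NIC speed (Gb/s) the slide pairs with an x16 slot of that generation.
PCIE_GEN_GT_PER_LANE = {"Gen4": 16, "Gen5": 32, "Gen6": 64}
ETHERNET_TARGET_GBPS = {"Gen4": 200, "Gen5": 400, "Gen6": 800}

LANES = 16

for gen, gt_per_lane in PCIE_GEN_GT_PER_LANE.items():
    raw_gbps = gt_per_lane * LANES              # raw x16 signaling rate
    usable_gbps = raw_gbps * 128 / 130          # approximate usable rate after line coding
    print(f"{gen}: x16 raw {raw_gbps} Gb/s, ~{usable_gbps:.0f} Gb/s usable "
          f"-> enough headroom for a {ETHERNET_TARGET_GBPS[gen]}GbE port")
```

Each PCIe generation doubles the lane rate, which is why the x16 slot tracks the 200/400/800GbE steps.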
GEN-Z CONSORTIUM: DEMONSTRATED ACHIEVEMENTS
• Flash Memory Summit & Supercomputing (2019)
• Fully operational Smart Modular 256GB ZMM modules
• Prototype Gen-Z Media Box enclosure
  – FPGA module implementing a 12-port switch (x48 lanes total)
  – ZMM midplane
  – 8 ZMM module bays (only 6 electrically connected in the demo)
  – x4 25G Gen-Z links (copper cabling internal to the box)
  – Four QSFP28 Gen-Z uplink ports/links
  – Standard 5m QSFP DAC cables between boxes, switches, and servers
• Looking to the future, fabric-attached memory can supplement processor memory (link-rate arithmetic sketched below)
  – Fastest ‘storage’ with SCM (storage class memory), e.g. 3D XPoint
  – Memory pool used to grow the processor memory footprint dynamically
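For a sense of the link rates in the demo, a minimal sketch converting the x4 25G Gen-Z links and the 12-port switch into per-link and aggregate bandwidth; the byte conversion and the aggregate are simple arithmetic, not figures quoted in the demo.

```python
LANES_PER_LINK = 4
GBPS_PER_LANE = 25                      # x4 25G Gen-Z links inside the media box
SWITCH_PORTS = 12                       # FPGA switch, x48 lanes total

link_gbps = LANES_PER_LINK * GBPS_PER_LANE         # 100 Gb/s raw, same rate class as QSFP28
link_gbytes = link_gbps / 8                        # 12.5 GB/s per direction, before overhead
switch_aggregate_gbps = SWITCH_PORTS * link_gbps   # 1,200 Gb/s across the 12-port switch

print(f"x4 25G link: {link_gbps} Gb/s (~{link_gbytes:.1f} GB/s per direction)")
print(f"12-port switch aggregate: {switch_aggregate_gbps} Gb/s raw")
```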
Demonstration of Media Modules and “Box of Slots” Enclosure
256GB DRAM ZFF Smart Modular module using FPGA and Samsung DRAM
Gen-Z Media Box “Box of slots” with 6–8 ZMM modules
Memory fabric QSFP connectors
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity
• The “Cambrian explosion” is driven by:
  – Demand side: diversification of workloads
  – Supply side: CMOS process limitations
• Achieving performance through specialization: 50+ start-ups working on custom ASICs
• Key requirements looking forward:
  – System architecture that can adapt to a wide range of compute and storage devices
  – System software that can rapidly adopt new silicon
  – Programming environment that can abstract specialized ASICs to accelerate workload productivity
PROCESSING TECHNOLOGY - HETEROGENEITY
Cerebras Wafer Scale Engine
• HBM-enabled processors delivering memory bandwidth needs
• Opens up the opportunity to package without DIMMs
• Drives the need for greater interconnect bandwidth per node to balance the system
PSC’s Neocortex system
• 2x Cerebras CS-1, each with:
  – 400,000 sparse linear algebra “cores”
  – 18 GB on-chip SRAM memory
  – 9.6 PB/s memory bandwidth
  – 100 Pb/s on-chip interconnect bandwidth
  – 1.2 Tb/s I/O bandwidth
  – 15 RU
• HPE Superdome Flex
  – 32 Xeon “Cascade Lake” CPUs
  – 24.5 TB system memory
  – 200 TB NVMe local storage
  – 2 x 12 x 100G Ethernet (to the 2x CS-1)
  – 16 x 100G HDR100 to Bridges-II and the Lustre FS
  – 120 memory fabric connections; 104 PCIe fabric connections (64 internal, 40 external)
COMPLEX WORKFLOWS – MANY INTERCONNECTS
Neocortex: One Data Scientist, One DL Engine, One Workstation
[Figure: Neocortex interconnect diagram – 100GbE switches and HDR200 IB leaf switches; 3 x 100Gb Ethernet; 2 x 100Gb HDR; 12 x 100Gb Ethernet to each CS-1 (1200Gb raw each); 16 x 100Gb HDR100 to Bridges-II & the Lustre filesystem (1600Gb raw); SuperDome Flex with a 200TB raw NVMe flash filesystem; aggregate throughput shown per stage]
Courtesy of Nick Nystrom, PSC – http://psc.edu
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity
OPTICAL INTERCONNECT OPPORTUNITIES
• Host links: 16×400 Gbps links, 64×112 Gbps lanes, path length < 1m; might replace the orthogonal connector with a passive optical equivalent
• Local links: 30×400 Gbps links, 1m < path length < 2.5m
• Global & I/O links: 18×400 Gbps links, 5m < path length < 50m; Ethernet interoperability on external links
• The optimal number of optical links varies between use cases:
  – Relative cost of Cu and SiPh links
  – Interoperability requirements
A sketch of the per-switch link budget follows.
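To make the trade-off concrete, here is a minimal sketch of the per-switch link budget implied by the figures above (16 host + 30 local + 18 global/I/O = 64 ports at 400 Gbps) and the share of bandwidth that goes optical under two hypothetical cut-offs: optics only beyond 2.5m (global/I/O links) versus optics beyond 1m (local links too). The cut-offs are illustrative assumptions, not recommendations from the talk.

```python
PORT_GBPS = 400
LINKS = {"host": 16, "local": 30, "global_io": 18}    # per switch, from the slide

total_gbps = sum(LINKS.values()) * PORT_GBPS           # 64 * 400 = 25,600 Gb/s
print(f"Total per-switch bandwidth: {total_gbps / 1000:.1f} Tb/s")

# Hypothetical scenario A: only global/I/O links (>2.5m) go optical.
optical_a = LINKS["global_io"] * PORT_GBPS
# Hypothetical scenario B: local links (>1m) go optical as well.
optical_b = (LINKS["global_io"] + LINKS["local"]) * PORT_GBPS

for name, optical in [("global/I/O only", optical_a), ("local + global/I/O", optical_b)]:
    print(f"Optical share ({name}): {100 * optical / total_gbps:.0f}%")
```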
WHAT DOES THE FUTURE HOLD FOR SYSTEM DESIGN?
• HPC systems are reaching the limit of power density
  – Optical interconnect reach opens up innovation opportunities to re-think system packaging
• Enterprise systems want to increase resource utilization
  – Optical interconnect opens up disaggregation of resources and RAS (Reliability, Availability, Serviceability) improvements
• Cost is always a factor, and people will cling to the technology offering lower cost until it breaks
• Reliability is key – existing AOCs have shown themselves to be at least an order of magnitude less reliable in large deployments
• Interconnect power is important, but it needs to be considered in the context of overall system power consumption
• There is a key need to continue driving optical technology innovation toward production-ready, reliable deployment across connectors, MBO, and co-packaging at 400Gb, to enable critical use cases at 800Gb as well as PCIe/CXL use cases