SYSTEMS LANDSCAPE FOR OPTICAL INTERCONNECT
Mike Woodacre
HPE Fellow, CTO for High-Performance Computing and Mission Critical
[email protected]
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity
Enterprise System Challenges
• Power density of processors going up
• DRAM reliability sensitive to temperature
• Need for air cooling in typical enterprise data centers
• Transactional applications look like random memory access, driving large memory footprints
• Reliability at small scale is key
• Minimize stranded resources / increase resource utilization
FACTORS AT WORK FOR OPTICAL INTERCONNECT
• Electrical reach is falling
  – HPE exascale systems use 2.3m electrical links; IEEE 802.3ck (100 Gbps/lane) is targeting 2m
  – The power required to drive these links is high
  – Optical cables are required for inter-rack links
• Optical link costs are improving
  – COGS is falling, but significant investment is required to enable SiPh in broad use with low-cost, reliable endpoints
• New generations of systems will grow the optical footprint
• Power density of systems is reaching the limit
• Changing resource utilization requires increasing flexibility
HPC System Challenges
• Power density of processors going up
• Scale is key, driving cost reduction of the resources supporting the processor (memory capacity, IO)
• Integrated HBM capacity opens up the opportunity for additional memory/persistent memory further away from the processor
• High end dominated by direct liquid cooling
• Reliability at scale is key
MODELING & SIMULATION × BIG DATA ANALYTICS × ARTIFICIAL INTELLIGENCE → THE EXASCALE ERA
Traditional Ethernet Networks
• Ubiquitous & interoperable
• Broad connectivity ecosystem
• Broadly converged network
• Native IP protocol
• Efficient for large payloads only
• High latency
• Limited scalability for HPC
• Limited HPC features

HPE Slingshot
• Standards based / interoperable
• Broad connectivity
• Converged network
• Native IP support
• Low latency
• Efficient for small to large payloads
• Full set of HPC features
• Very scalable for HPC & Big Data

Traditional HPC Interconnects
• Proprietary (single vendor)
• Limited connectivity
• HPC interconnect only
• Expensive / slow gateways
• Low latency
• Efficient for small to large payloads
• Full set of HPC features
• Very scalable for HPC & Big Data
INTRODUCING HPE SLINGSHOT
✓ Consistent, predictable, reliable high performance: high bandwidth + low latency, from one rack to exascale
✓ Excellent for emerging infrastructures: mix tightly coupled HPC, AI, analytics, and cloud workloads
✓ Native connectivity to data center resources
EXASCALE TECHNOLOGY WINS FOR HPE CRAY EX SYSTEMS
Announced: 30-OCT-2018, 18-MAR-2019, 7-MAY-2019, 5-MAR-2020, 1-OCT-2020
SLINGSHOT OVERVIEW
“Rosetta” Switch ASIC (Rosetta 64×200 switch)
• 100GbE/200GbE interfaces
• 64 ports at 100/200 Gbps
• 12.8 Tbps total bandwidth
Slingshot Integrated Blade Switch – direct liquid cooled
Slingshot Top of Rack Switch – air cooled
High-performance switch microarchitecture – over 250K endpoints with a diameter of just three hops
Ethernet compliant – easy connectivity to datacenters and third-party storage
World-class adaptive routing and QoS – high utilization at scale; flawless support for hybrid workloads
Effective and efficient congestion control – performance isolation between workloads
Low, uniform latency – focus on tail latency, because real apps synchronize
SLINGSHOT PACKAGING – SHASTA MOUNTAIN/OLYMPUS (SCALE OPTIMIZED)
[Figure: switch chassis and compute chassis; “Colorado” switch mounted horizontally and compute blades mounted vertically, so switch and NICs mate at a 90-degree orientation; dual-NIC mezzanine card internal to each blade; QSFP-DD group and global cables exit the rear to the fabric (group/global links)]
CABLING THE SLINGSHOT NETWORK
WHAT IS A DRAGONFLY NETWORK? A WAY TO MINIMIZE LONG CABLES
• Nodes are organized into groups, usually racks or cabinets
  – Electrical links from NIC to switch
• All-to-all amongst switches in a group
  – Electrical links between switches
• All-to-all between groups
  – Optical links
• Groups can have different characteristics
• The global network can be tapered to reduce cost
[Figure: eight groups (Group 0–7), mixing flexible compute/I/O and high-density compute; electrical links within each group, global optical links between groups]
• Enabled by a high-radix router
• Host:local:global = 1:2:1
• Max ports ≈ 4×(radix/4)^4
• E.g. 64 ports: host links (16), local links (32), global links (16); max ports = 262,656 (see the sketch below)
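To make the arithmetic behind the 262,656 figure concrete, here is a minimal sketch (Python) that derives the endpoint count from the switch radix and the 1:2:1 port split. The group size of 32 switches and the one-global-link-per-group-pair rule are assumptions chosen to reproduce the number on the slide, not details taken from it.

```python
def dragonfly_max_endpoints(radix: int) -> int:
    """Approximate maximum endpoints for a dragonfly with a 1:2:1 host:local:global split."""
    host_ports = radix // 4     # links to NICs                (16 for radix 64)
    local_ports = radix // 2    # links to switches in a group (32 for radix 64)
    global_ports = radix // 4   # links to other groups        (16 for radix 64)

    # Assumption: one switch per local port, i.e. 32 switches per group.
    switches_per_group = local_ports
    hosts_per_group = switches_per_group * host_ports        # 512
    globals_per_group = switches_per_group * global_ports    # 512

    # Assumption: one global link to every other group, so groups = globals + 1.
    num_groups = globals_per_group + 1                       # 513

    return num_groups * hosts_per_group                      # 513 * 512


print(dragonfly_max_endpoints(64))  # 262656, matching the slide
```

The closed-form approximation on the slide, 4×(radix/4)^4, gives 262,144 for radix 64; the exact count above differs only by the extra "+1" group.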
WHY DRAGONFLY?
• Dragonfly lets us use more copper cables, which has several advantages
  – Cost – each link is about 10X less expensive when you use copper
  – Reliability – links built with copper cables are more than an order of magnitude more reliable than links built with AOCs
  – Power – copper cables are passive and use no additional power; active electrical cables use about half the power of an AOC currently
• Dragonfly allows you to cable a network of a given global bandwidth with about half the optical cables needed in a fat tree
• With a high-radix switch, Dragonfly gives very low-diameter topologies, meaning we don’t care about placement of workloads
BENEFITS OF LOW DIAMETER NETWORKS
Fat-tree (classical topology), 512 servers
• 16 “in-row” 64-port switches
• 16 top-of-rack 64-port switches
• 512 optical cables
• 512 copper cables
• 512 NICs

All-to-all topology, 512 servers
• 0 “in-row” switches
• 16 top-of-rack 64-port switches
• 240 optical cables
• 512 copper cables
• 512 NICs

The sketch below shows where the two optical-cable counts come from.
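As a check on the counts above, a minimal sketch: the fat tree turns every non-host top-of-rack port into an optical uplink, while the all-to-all design connects the 16 top-of-rack switches directly to each other. The two-cables-per-switch-pair figure is an assumption that reproduces the 240 number; the slide does not state it.

```python
from math import comb

SERVERS = 512
TOR_SWITCHES = 16
RADIX = 64                                        # 64-port switches

hosts_per_tor = SERVERS // TOR_SWITCHES           # 32 copper host links per ToR switch
copper_cables = SERVERS                           # NIC-to-ToR links, same in both designs

# Fat tree: every remaining ToR port runs optically up to an "in-row" switch.
fat_tree_optical = TOR_SWITCHES * (RADIX - hosts_per_tor)      # 16 * 32 = 512

# All-to-all: ToR switches connect directly to one another.
# Assumption: two parallel optical cables between each pair of ToR switches.
cables_per_pair = 2
all_to_all_optical = comb(TOR_SWITCHES, 2) * cables_per_pair   # 120 * 2 = 240

print(fat_tree_optical, all_to_all_optical, copper_cables)     # 512 240 512
```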
Superdome Flex Scales up Seamlessly as a Single System
• 4-socket building block
• 4 sockets (up to 6TB), 8 sockets (up to 12TB), 16 sockets (up to 24TB), 20 sockets (up to 30TB), 32 sockets (up to 48TB)
• 12-socket, 24-socket, and 28-socket configurations not pictured
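The capacities scale linearly with the 4-socket chassis count. A minimal sketch, assuming each socket carries 12 DIMMs (per the chassis slide that follows) populated with 128 GiB DIMMs, i.e. 1.5 TiB per socket; the DIMM size is an assumption, not stated on the slide.

```python
SOCKETS_PER_CHASSIS = 4        # the 4-socket building block
DIMMS_PER_SOCKET = 12          # from the chassis description
DIMM_TIB = 128 / 1024          # assumption: 128 GiB DIMMs -> 0.125 TiB each

for chassis in range(1, 9):    # 1 to 8 chassis, i.e. 4 to 32 sockets
    sockets = chassis * SOCKETS_PER_CHASSIS
    memory_tib = sockets * DIMMS_PER_SOCKET * DIMM_TIB
    print(f"{sockets:2d} sockets: up to {memory_tib:.0f} TB")
```

This reproduces the 6/12/.../48 TB progression, including the 12-, 20-, 24-, and 28-socket steps.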
ENTERPRISE SHARED MEMORY TECHNOLOGY: MODULAR BUILDING BLOCKS WITH MEMORY FABRIC
SUPERDOME FLEX CHASSIS
• Minimizing the PCB cost of the UPI interconnect between the 4 sockets (with 12 DIMMs/socket) pushes the longer PCIe connections to the rear of the chassis
• DRAM memory wants to remain cool for reliability
• Custom PCIe ribbon cables to the rear of the chassis enable use of all the PCIe pins from each socket
• QSFP interconnect ports
HPE SUPERDOME FLEX SERVER 8/16/32 SOCKET ARCHITECTURE
SUPERDOME FLEX PROTOTYPE USE OF OPTICAL CABLING (64-socket SD-Flex, 2-rack)
• Production copper cabling: up to 30 QSFP 100Gb cables per chassis
• Optical cable example: 4:1 reduction in cable count
• Optical cabling would significantly reduce the complexity and support burden of the fabric, with an MBO solution similar to the one shown in the ‘Machine’ prototype below
Improving Serviceability
TRADITIONAL VS. MEMORY-DRIVEN COMPUTING ARCHITECTURE
Today’s architecture is constrained by the CPU: DDR, Ethernet, PCI, and SATA devices all attach to it, and if you exceed what can be connected to one CPU, you need another CPU.
Memory-Driven Computing: mix and match at the speed of memory.
DRIVING OPEN PROCESSOR INTERFACES: PCIE/CXL/GEN-Z
Memory Semantic Protocols
• Replace processor-local interconnects: (LP)DDR, PCIe, SAS/SATA, etc.
• CXL/Gen-Z enable memory semantics outside the processor's proprietary domain
• CXL speeds are based on PCIe Gen5/Gen6
• Disaggregation of devices, especially memory/persistent memory and accelerators
[Figure: Past – CPU/SoC with (LP)DDR, PCIe, SAS/SATA, Ethernet/IB, and proprietary links; Future – SoC with PCIe/CXL and Gen-Z links]
PCIe generation (per-lane rate): Gen4 (16Gb), Gen5 (32Gb), Gen6 (64Gb)
PCIe x16 supports: 200Gb, 400Gb, 800Gb respectively
[Figure: x16 PCIe card or x16 CXL card in an x16 connector, attached to the processor over the PCIe channel SERDES]
• CXL runs across the standard PCIe physical layer with new protocols optimized for cache and memory
• CXL uses a flexible processor Port that can auto-negotiate to either the standard PCIe transaction protocol or the alternate CXL transaction protocols
• First generation CXL aligns to 32 GT/s PCIe 5.0 specification
• CXL usages expected to be key driver for an aggressive timeline to PCIe 6.0 architecture
PCIe enables processors to drive higher Ethernet speeds (see the sketch below)
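As a rough check on the table above, a minimal sketch relating per-lane PCIe signaling rate to the Ethernet rate an x16 port can feed. The 128b/130b line-code overhead used for Gen4/Gen5 is a general PCIe fact, and applying roughly the same overhead to Gen6 (which actually uses PAM4 with FLIT mode and FEC) is a simplification; neither detail comes from the slide.

```python
# Per-lane raw signaling rate (GT/s) for each PCIe generation, and the
# Ethernet NIC speed (Gb/s) the slide pairs with an x16 slot of that generation.
PCIE_GEN_GT_PER_LANE = {"Gen4": 16, "Gen5": 32, "Gen6": 64}
ETHERNET_TARGET_GBPS = {"Gen4": 200, "Gen5": 400, "Gen6": 800}

LANES = 16

for gen, gt_per_lane in PCIE_GEN_GT_PER_LANE.items():
    raw_gbps = gt_per_lane * LANES              # raw x16 signaling rate
    usable_gbps = raw_gbps * 128 / 130          # approximate usable rate after line coding
    print(f"{gen}: x16 raw {raw_gbps} Gb/s, ~{usable_gbps:.0f} Gb/s usable "
          f"-> enough headroom for a {ETHERNET_TARGET_GBPS[gen]}GbE port")
```

Each PCIe generation doubles the lane rate, which is why the x16 slot tracks the 200/400/800GbE steps.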
GEN-Z CONSORTIUM: DEMONSTRATED ACHIEVEMENTS
• Flash Memory Summit & Supercomputing (2019)
• Fully operational Smart Modular 256GB ZMM modules
• Prototype Gen-Z Media Box enclosure
  – FPGA module implementing a 12-port switch (x48 lanes total)
  – ZMM midplane
  – 8 ZMM module bays (only 6 electrically connected in the demo)
  – x4 25G Gen-Z links (copper cabling internal to the box)
  – Four QSFP28 Gen-Z uplink ports/links
  – Standard 5m QSFP DAC cables between boxes, switches, and servers
• Looking to the future, fabric-attached memory can supplement processor memory (link-rate arithmetic sketched below)
  – Fastest ‘storage’ with SCM (storage class memory), e.g. 3D XPoint
  – Memory pool used to grow the processor memory footprint dynamically
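For a sense of the link rates in the demo, a minimal sketch converting the x4 25G Gen-Z links and the 12-port switch into per-link and aggregate bandwidth; the byte conversion and the aggregate are simple arithmetic, not figures quoted in the demo.

```python
LANES_PER_LINK = 4
GBPS_PER_LANE = 25                      # x4 25G Gen-Z links inside the media box
SWITCH_PORTS = 12                       # FPGA switch, x48 lanes total

link_gbps = LANES_PER_LINK * GBPS_PER_LANE         # 100 Gb/s raw, same rate class as QSFP28
link_gbytes = link_gbps / 8                        # 12.5 GB/s per direction, before overhead
switch_aggregate_gbps = SWITCH_PORTS * link_gbps   # 1,200 Gb/s across the 12-port switch

print(f"x4 25G link: {link_gbps} Gb/s (~{link_gbytes:.1f} GB/s per direction)")
print(f"12-port switch aggregate: {switch_aggregate_gbps} Gb/s raw")
```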
Demonstration of Media Modules and “Box of Slots” Enclosure
256GB DRAM ZFF Smart Modular module using FPGA and Samsung DRAM
Gen-Z Media Box “Box of slots” with 6–8 ZMM modules
Memory fabric QSFP connectors
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity
• The “Cambrian explosion” is driven by:
  – Demand side: diversification of workloads
  – Supply side: CMOS process limitations
• Achieving performance through specialization: 50+ start-ups working on custom ASICs
• Key requirements looking forward:
  – System architecture that can adapt to a wide range of compute and storage devices
  – System software that can rapidly adopt new silicon
  – Programming environment that can abstract specialized ASICs to accelerate workload productivity
PROCESSING TECHNOLOGY - HETEROGENEITY
Cerebras Wafer Scale Engine
• HBM-enabled processors delivering memory bandwidth needs
• Opens up the opportunity to package without DIMMs
• Drives the need for greater interconnect bandwidth per node to balance the system
PSC’s Neocortex system
• 2x Cerebras CS-1, each with:
  – 400,000 sparse linear algebra “cores”
  – 18 GB on-chip SRAM memory
  – 9.6 PB/s memory bandwidth
  – 100 Pb/s on-chip interconnect bandwidth
  – 1.2 Tb/s I/O bandwidth
  – 15 RU
• HPE Superdome Flex
  – 32 Xeon “Cascade Lake” CPUs
  – 24.5 TB system memory
  – 200 TB NVMe local storage
  – 2 x 12 x 100G Ethernet (to the 2x CS-1)
  – 16 x 100G HDR100 to Bridges-II and the Lustre FS
  – 120 memory fabric connections; 104 PCIe fabric connections (64 internal, 40 external)
COMPLEX WORKFLOWS – MANY INTERCONNECTS
Neocortex: One Data Scientist, One DL Engine, One Workstation
[Figure: Neocortex interconnect diagram – 100GbE switches and HDR200 IB leaf switches; 3 x 100Gb Ethernet; 2 x 100Gb HDR; 12 x 100Gb Ethernet to each CS-1 (1200Gb raw each); 16 x 100Gb HDR100 to Bridges-II & the Lustre filesystem (1600Gb raw); SuperDome Flex with a 200TB raw NVMe flash filesystem; aggregate throughput shown per stage]
Courtesy of Nick Nystrom, PSC – http://psc.edu
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity
OPTICAL INTERCONNECT OPPORTUNITIES
• Host links: 16×400 Gbps links, 64×112 Gbps lanes, path length < 1m; might replace the orthogonal connector with a passive optical equivalent
• Local links: 30×400 Gbps links, 1m < path length < 2.5m
• Global & I/O links: 18×400 Gbps links, 5m < path length < 50m; Ethernet interoperability on external links
• The optimal number of optical links varies between use cases:
  – Relative cost of Cu and SiPh links
  – Interoperability requirements
A sketch of the per-switch link budget follows.
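To make the trade-off concrete, here is a minimal sketch of the per-switch link budget implied by the figures above (16 host + 30 local + 18 global/I/O = 64 ports at 400 Gbps) and the share of bandwidth that goes optical under two hypothetical cut-offs: optics only beyond 2.5m (global/I/O links) versus optics beyond 1m (local links too). The cut-offs are illustrative assumptions, not recommendations from the talk.

```python
PORT_GBPS = 400
LINKS = {"host": 16, "local": 30, "global_io": 18}    # per switch, from the slide

total_gbps = sum(LINKS.values()) * PORT_GBPS           # 64 * 400 = 25,600 Gb/s
print(f"Total per-switch bandwidth: {total_gbps / 1000:.1f} Tb/s")

# Hypothetical scenario A: only global/I/O links (>2.5m) go optical.
optical_a = LINKS["global_io"] * PORT_GBPS
# Hypothetical scenario B: local links (>1m) go optical as well.
optical_b = (LINKS["global_io"] + LINKS["local"]) * PORT_GBPS

for name, optical in [("global/I/O only", optical_a), ("local + global/I/O", optical_b)]:
    print(f"Optical share ({name}): {100 * optical / total_gbps:.0f}%")
```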
WHAT DOES THE FUTURE HOLD FOR SYSTEM DESIGN?
• HPC systems are reaching the limit of power density
  – Optical interconnect reach opens up innovation opportunities to re-think system packaging
• Enterprise systems want to increase resource utilization
  – Optical interconnect opens up disaggregation of resources and RAS (Reliability, Availability, Serviceability) improvements
• Cost is always a factor, and people will cling to the technology offering lower cost until it breaks
• Reliability is key – existing AOCs have shown themselves to be at least an order of magnitude less reliable in large deployments
• Interconnect power is important, but it needs to be considered in the context of overall system power consumption
• There is a key need to continue driving optical technology innovation toward production-ready, reliable deployment across connectors, MBO, and co-packaging at 400Gb, to enable critical use cases at 800Gb as well as PCIe/CXL use cases