ARPA-E MSFT CoPkg


Mark Filer, Principal Engineer, Azure Hardware Architecture (AHA)

Framing the discussion: how we build data center networks today, and where we're headed…

[Figure: data center network interconnect tiers, today and next generation.
100G era (now): 50G DAC (<2 m) within the rack; 100G AOC (<30 m) within the row; 100G PSM4 (<1 km) across the DC; 100G DWDM (PAM4) (<100 km) between regional network gateways (RNGs); long-haul DWDM (>100 km) beyond.
400G era (2021-): 100G DAC (<2 m) within the rack; 400G AOC (<30 m) within the row; 400G DR4 (<1 km) across the DC; 400G DWDM (400ZR) (<100 km) between RNGs; long-haul DWDM (>100 km) beyond.]

Elephant in the room… Power

Possible Solutions

Takeaway: we can’t just keep scaling link bandwidths…

“Next gen” systems will require all of the above.

[Figure: Ethernet switch ASIC with CPO modules co-packaged around all four sides of the package.]

The path to CPO…

What are the key drivers and applications?
What are application-specific requirements?
What must happen to make it viable?

Interconnect Figure of Merit (source: DARPA Photonics in the Package for Extreme Scalability, PIPES)

[Figure: interconnect figure of merit, bandwidth density × energy efficiency in (Gbps/mm)/(pJ/bit), versus maximum interconnect distance in meters, spanning in-package, on-board, and off-board regimes. In-package and on-board electrical links (HBM, LVDS, NVLink 2, NVIDIA GRS 2013 and ISSCC 2018, 32 nm and 65 nm SerDes, PCIe, on-board PCI-E PMA + PCS, CHIPS SR and LR targets) achieve high FoM at short reach; off-board optics (Avago MicroPod, Finisar 100GBASE-SR4, Mellanox 100GbE SR4 QSFP28, Mellanox 100GbE CPRI QSFP28, Finisar BOA/MBOM, Finisar optical backplane) reach far but at much lower FoM.]

CPO potential: FoM comparable to chip-to-chip electrical interconnects while retaining optical reach.
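To make the figure of merit concrete, here is a minimal sketch computing FoM = bandwidth density / energy per bit for a few link classes; all numbers are illustrative placeholders, not data from the chart.

```python
# Interconnect figure of merit (FoM) = (Gbps/mm) / (pJ/bit).
# All values below are hypothetical placeholders for illustration.
links = {
    # name: (bw_density_gbps_per_mm, energy_pj_per_bit, reach_m)
    "in-package (HBM-like)": (1000.0,  1.0,   0.01),
    "on-board SerDes":       ( 100.0,  5.0,   0.5),
    "pluggable optics":      (  10.0, 15.0, 100.0),
    "CPO (target)":          ( 200.0,  5.0, 100.0),
}

for name, (bw_density, energy, reach) in links.items():
    fom = bw_density / energy
    print(f"{name:22s} FoM = {fom:7.1f} (Gbps/mm)/(pJ/bit) at {reach} m")
```

The sketch mirrors the chart's message: CPO aims to keep electrical-class FoM while holding optical-class reach.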

CPO key drivers

Next-gen optics
Memory/accelerator/IO disaggregation
New DC applications (functions, emerging storage technologies)
Machine learning

Factors: increasing bandwidth, 100G/200G signaling, redundancy, increasing use of optics in pluggable form factors. Cost, energy, and bandwidth are the common currency: optical innovations are key to future datacenters.

CPO benefits are multifaceted

Interconnect metric | Desired characteristics
Optical reach | 10s to 1000s of meters
Bandwidth density | 100s of Gbps/mm
Component cost (HW cost) | << $1/Gbps
Reliability (oper. cost) | < 10 FIT
Energy efficiency | < 10 pJ/bit
Latency | few ns (+ propagation delay)

(Energy per bit includes the module/chiplet-side electrical interface, DSP, PIC components, and laser source; excludes the switch side. Assumes XSR for CPO.)

CPO is a distinct step forward in datacenter communication efficiency.
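As a minimal sketch, the targets in this table can be screened programmatically; the thresholds come from the slide, while the candidate design point below is hypothetical.

```python
# Screen a hypothetical CPO design point against the slide's targets.
targets = {
    "reach_m":            (lambda v: v >= 10,  "10s-1000s of meters"),
    "bw_density_gbps_mm": (lambda v: v >= 100, "100s of Gbps/mm"),
    "cost_usd_per_gbps":  (lambda v: v < 1.0,  "<< $1/Gbps"),
    "reliability_fit":    (lambda v: v < 10,   "< 10 FIT"),
    "energy_pj_per_bit":  (lambda v: v < 10,   "< 10 pJ/bit"),
}

candidate = {   # hypothetical design point, not a real product
    "reach_m": 500, "bw_density_gbps_mm": 250,
    "cost_usd_per_gbps": 0.50, "reliability_fit": 5,
    "energy_pj_per_bit": 7,
}

for metric, (check, desc) in targets.items():
    verdict = "PASS" if check(candidate[metric]) else "FAIL"
    print(f"{metric:20s} {candidate[metric]:>6}  target {desc:20s} {verdict}")
```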

CPO applications

[Figure: three CPO application classes. Traditional datacenter networking: an Ethernet switch ASIC ringed by CPO modules. AI training / HPC: an NPU/GPU package with HBM, CPE, and CPO attached to a high-performance switch ASIC, also ringed by CPO. Future systems with disaggregation: server CPU packages (server chip, local memory, CPO) joined by a compute fabric, with a CPO-connected switch, multi-host NIC, Ethernet, and remote memory behind a memory controller (MC). A conventional server (server chip, NIC, DRAM, PCIe) is shown for contrast.]

CPO: application domains

[The application diagram repeats, with the disaggregation case split into memory and non-memory resources.]

Requirement | Traditional networking | AI training / HPC | Disaggregation (memory)
End-point bandwidth | low (400G) | high (4-8T) | med-high (2-4T)
Latency tolerance | high (<ms) | medium (µs) | ultra-low (100s of ns)
“Link quality” | med-high | high | ultra-high
Power consumption | medium (10-15 pJ/bit) | low (5-10 pJ/bit) | ultra-low (<5 pJ/bit)
Loss budget | low (IEEE-compliant) | medium-high (~8 dB) | medium-high (~8 dB)
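The ultra-low latency tolerance in the memory column bounds reach directly, since light in fiber propagates at roughly 5 ns per meter. A worked sketch, with a hypothetical round-trip budget chosen to match the "100s of ns" tolerance:

```python
# Reach implied by a round-trip memory-access latency budget.
# ~5 ns/m one-way propagation in fiber (index ~1.5); the budget and
# overhead numbers below are hypothetical, not from the slide.
NS_PER_M  = 5.0     # one-way fiber propagation delay
budget_ns = 300.0   # total round-trip latency budget
fixed_ns  = 100.0   # SerDes/controller/CPO overhead, both directions

max_reach_m = (budget_ns - fixed_ns) / (2 * NS_PER_M)
print(f"max one-way reach ~ {max_reach_m:.0f} m")   # ~20 m: rack/row scale
```

This is why memory disaggregation stays at rack/row scale even with optics, while non-memory and traditional traffic can span the datacenter.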


CPO: system functions


[Figure: open CPO packaging questions mapped onto qualitative axes (cooler vs. warmer, low vs. high): soldered vs. socketed attach? integrated vs. remote laser?]


Server-ToR-Tier1 topology vs. ToR bypass (multi-homed NIC)

[Figure: in today's server-ToR-Tier1 topology, servers attach to the ToR over DAC (<2 m), connections within the row use AOC (<30 m), and uplinks use 4x100G DR4. With ToR bypass, multi-homed NICs connect servers directly to row-level switches over CPO links (<50 m).]

ToR bypass efficiencies (100G lane speeds)

| Tier1-ToR-server | ToR bypass
Failure domain | ToR is SPOF for rack | no SPOF (multi-homed NIC)
Switch ASIC count | 4X-8X | 1X
Switch space + power | baseline | reduced space, ~1/3 power
Switch radix | can't leverage higher-radix chips (stranded capacity at ToR) | can leverage full switch radix (multi-chip T1 box)
Oversubscription | 3:1 typical | fully non-blocking in row
Reach limits | DAC < 3 m; AOC < 30 m | 1 m to 2 km+
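A back-of-envelope sketch of where the ASIC-count savings come from; the row dimensions and ASIC capacity are hypothetical, and the internal fabric links of a multi-chip Tier-1 box are ignored for simplicity.

```python
import math

# Hypothetical row: 16 racks x 32 servers x 400G NICs; 51.2T switch ASICs.
racks, servers_per_rack, nic_gbps = 16, 32, 400
asic_gbps = 51_200

server_bw = racks * servers_per_rack * nic_gbps   # 204,800 Gbps

# Tier1-ToR-server: one ToR ASIC per rack, 3:1 oversubscribed uplinks.
with_tor = racks + math.ceil((server_bw / 3) / asic_gbps)

# ToR bypass: multi-homed NICs go straight to a non-blocking Tier-1 box.
bypass = math.ceil(server_bw / asic_gbps)

print(f"with ToR: {with_tor} ASICs; bypass: {bypass} ASICs "
      f"(~{with_tor / bypass:.1f}x fewer)")
```

With these inputs the reduction is ~4.5x, consistent with the 4X-8X range claimed in the table.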

Application example: resource disaggregation

[Figure: compute resources (2x NPU, 2x GPU, 4x CPU) cross-connected to disaggregated resources (4x MEM, 2x NIC, 2x STOR).]

NPU: neural processing unit
GPU: graphics processing unit
CPU: central processing unit
MEM: memory
NIC: network interface card
STOR: storage

~8x8 maximum with electrical interconnect

Application example: resource disaggregation

[Figure: NPU, GPU, and CPU clusters connected through a switch to memory, storage, and NICs.]

Scaling >8x8 with CPO-based interconnect
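One way to see why a switched, CPO-based fabric scales past the ~8x8 point-to-point limit is to count links; a minimal sketch with hypothetical endpoint counts:

```python
# Links required to attach n compute endpoints to m resource endpoints.
def crossbar_links(n: int, m: int) -> int:
    return n * m        # point-to-point: every compute/resource pair wired

def switched_links(n: int, m: int) -> int:
    return n + m        # one (CPO) link per endpoint into the switch

for n, m in [(8, 8), (32, 32), (128, 128)]:
    print(f"{n:3d}x{m:<3d}  crossbar: {crossbar_links(n, m):6d} links,"
          f"  switched: {switched_links(n, m):4d} links")
```

Point-to-point wiring grows quadratically while the switched fabric grows linearly, which is what makes larger disaggregated pools practical once the links have optical reach.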

Approximate timelines

[Figure: quarterly roadmap, CY2022 through CY2025. Each application track moves through dev, POC, pilot, and deploy phases in sequence: traditional networking, AI/ML, resource disaggregation (memory), and resource disaggregation (non-memory). To be refined further…]


Changes to Ecosystem & Deployment Model

Today (front-panel pluggables): component manufacturers supply packaged switch silicon and the rBOM (CPU, PCB, fans, mechanicals, etc.); the OEM/ODM handles PCB and system integration, assembly, and test; software is added to produce the final switch system for deployment, and the end user integrates the FPP optical modules.

CPO: component manufacturers supply CPO-ready packaged switch silicon, CPO optical modules, an external laser source, an intra-box fiber solution, and the rBOM; OSATs and OEM/ODM manufacturers perform optics and PCB system integration and test (CPO switch integration and test, absorbing yield loss / RMA), add software, and deliver the final switch system for deployment; RMA flows back from end users.

Slide credit: Rob Stone (Facebook) – ECOC 2020 WS3

Standardization of CPO

OIF launches Co-Packaging Framework Implementation Agreement project - [oiforum.com]

COBO announces formation of Co-packaged Optics Working Group - [onboardoptics.org]

Summary

mark.filer@microsoft.com