ARPA-E MSFT CoPkg

Mark Filer Principal Engineer, Azure Hardware Architecture (AHA)

Transcript of ARPA-E MSFT CoPkg

Page 1: ARPA-E MSFT CoPkg

Mark Filer, Principal Engineer, Azure Hardware Architecture (AHA)

Page 2: ARPA-E MSFT CoPkg

Framing the discussion

How we build data center networks today, and where we’re headed…

Page 3: ARPA-E MSFT CoPkg

[Figure: data center network interconnects at 100G (now). Tiers shown: rack, row, DC, RNG (x2).]

50G DAC, <2m

100G AOC, <30m

100G PSM4, <1km

100G DWDM (PAM4), <100km (x2)

long-haul DWDM, >100km (x2)

Page 4: ARPA-E MSFT CoPkg

[Figure: data center network interconnects at 400G (2021-). Tiers shown: rack, row, DC, RNG (x2).]

100G DAC, <2m

400G AOC, <30m

400G DR4, <1km

400G DWDM (400ZR), <100km (x2)

long-haul DWDM, >100km (x2)

Page 5: ARPA-E MSFT CoPkg

Elephant in the room… Power

Page 6: ARPA-E MSFT CoPkg

Possible Solutions

Takeaway: we can’t just keep scaling link bandwidths…

“next gen” systems will require all of the above

[Figure: Ethernet Switch ASIC surrounded on all sides by CPO modules.]

Page 7: ARPA-E MSFT CoPkg

The path to CPO…

What are the key drivers and applications?

What are application-specific requirements?

What must happen to make it viable?

Page 8: ARPA-E MSFT CoPkg

Interconnect Figure of Merit

Source: DARPA Photonics in the Package for Extreme Scalability (PIPES)

[Figure: log-log scatter plot. X-axis: Max Interconnect Distance (meters), 0.0001 to 100, spanning in-package, on-board, and off-board regions. Y-axis: BW Density * Energy Efficiency, in (Gbps/mm)/(pJ/bit), 0.1 to 1,000,000.]

Points plotted: NVLink 2; NVIDIA GRS 2013; NVIDIA GRS, ISSCC 2018; On-board (PCI-E PMA + PCS); LVDS; 32nm SerDes; 65nm SerDes; PCIe; HBM; CHIPS LR target; CHIPS SR target; Avago Micropod; Finisar 100 GBASE SR4; Mellanox 100GbE SR4 QSFP28; Mellanox 100GbE CPRI QSFP28; Finisar BOA/MBOM; Finisar Optical Backplane

CPO Potential: FoM comparable to chip-to-chip electrical interconnects while retaining optical reach
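As a rough illustration of the y-axis quantity, the figure of merit is bandwidth density in Gbps/mm divided by energy per bit in pJ/bit. The sketch below is only a worked version of that ratio; the function name and both sample links are hypothetical placeholders, not values read from the PIPES plot.

```python
def figure_of_merit(bw_density_gbps_per_mm: float, energy_pj_per_bit: float) -> float:
    """Return the interconnect figure of merit in (Gbps/mm)/(pJ/bit)."""
    if bw_density_gbps_per_mm < 0 or energy_pj_per_bit <= 0:
        raise ValueError("density must be >= 0 and energy per bit must be > 0")
    return bw_density_gbps_per_mm / energy_pj_per_bit


# Hypothetical sample points, chosen only to exercise the formula.
# Each entry is (bandwidth density in Gbps/mm, energy in pJ/bit).
SAMPLE_LINKS = {
    "hypothetical pluggable optic": (10.0, 20.0),
    "hypothetical co-packaged optic": (200.0, 5.0),
}

if __name__ == "__main__":
    for name, (density, energy) in SAMPLE_LINKS.items():
        print(f"{name}: FoM = {figure_of_merit(density, energy):g}")
```

A higher value means more bandwidth per millimeter of edge for each picojoule spent per bit, which is why the plot pairs it with reach on the other axis.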

Page 9: ARPA-E MSFT CoPkg

CPO key drivers

Next Gen Optics

Memory/Accelerator/IO Disaggregation

New DC Applications (functions, emerging storage technologies)

Machine Learning

Cost, Energy, Bandwidth

Optical innovations are key to future datacenters

Factors: increasing bandwidth, 100G/200G signaling, redundancy, increasing use of optics in pluggable FFs

Page 10: ARPA-E MSFT CoPkg

CPO benefits are multifaceted

              Interconnect Metric    Desired Characteristics
              Optical Reach          10-1000s of meters
HW Cost       Component Cost         << $1 / Gbps
              Bandwidth Density      100s of Gbps / mm
Oper. Cost    Reliability            < 10 FIT &
              Energy Efficiency      < 10 pJ / bit
              Latency                Few ns + (prop. delay)

(Energy / bit data includes module/chiplet-side electrical interface, DSP, PIC components, laser source. Excludes switch-side. Assume XSR for CPO.)

CPO is a distinct step forward in datacenter communication efficiency
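To make the reliability target concrete: one FIT is one failure per 10^9 device-hours, so a FIT rate converts directly into expected failures per year across a fleet. The sketch below applies that definition to the 10 FIT figure from the table; the fleet size of 100,000 modules is a hypothetical number picked for illustration.

```python
HOURS_PER_YEAR = 8760          # 365 days * 24 hours
FIT_DEVICE_HOURS = 1e9         # 1 FIT = 1 failure per 1e9 device-hours


def expected_failures_per_year(fit: float, device_count: int) -> float:
    """Expected number of failures per year for a fleet at a given FIT rate."""
    if fit < 0 or device_count < 0:
        raise ValueError("fit and device_count must be non-negative")
    return fit * device_count * HOURS_PER_YEAR / FIT_DEVICE_HOURS


if __name__ == "__main__":
    # Hypothetical fleet size; the 10 FIT figure is the bound from the table.
    fleet = 100_000
    print(f"{expected_failures_per_year(10, fleet):.2f} failures/year at 10 FIT")
```

At 10 FIT, that hypothetical fleet would expect fewer than nine module failures per year.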

Page 11: ARPA-E MSFT CoPkg

CPO applications

[Figure: CPO in three application areas: Traditional Datacenter Networking; AI Training / HPC; Future Systems w/ Disaggregation. Labeled blocks: Ethernet Switch ASIC, High Perf. Switch ASIC, Server Chip, Server CPU Package, NIC, DRAM, PCIe, HBM, NPU/GPU, Local Memory, Remote Memory, MC, CPE, Compute Fabric, Switch, Ethernet, and CPO modules.]

Page 12: ARPA-E MSFT CoPkg

CPO: application domains

[Figure: the three application areas (Traditional Datacenter Networking; AI Training / HPC; Future Systems w/ Disaggregation), annotated with memory and non-memory domains.]

Page 13: ARPA-E MSFT CoPkg

CPO: application domains

end-point bandwidth    high (4-8T)           low (400G)              med-high (2-4T)
latency tolerance      medium (µs)           high (<ms)              ultra-low (100s ns)
“link quality”         high                  med-high                ultra-high
power consumption      low (5-10 pJ/bit)     medium (10-15 pJ/bit)   ultra-low (<5 pJ/bit)
loss budget            medium-high (~8 dB)   low (IEEE-compliant)    medium-high (~8 dB)

[Figure: the three application areas (Traditional Datacenter Networking; AI Training / HPC; Future Systems w/ Disaggregation), annotated with memory and non-memory domains.]
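The bandwidth and pJ/bit rows above combine into a per-endpoint optical I/O power budget, since Tb/s multiplied by pJ/bit gives watts directly. The sketch below computes only the worst case (top of each range) and assumes the first, second, and third entries of each row belong to the same column; the dictionary keys simply reuse the bandwidth labels from the slide, and all function and variable names are hypothetical.

```python
def io_power_watts(bandwidth_tbps: float, energy_pj_per_bit: float) -> float:
    """Optical I/O power in watts: (1e12 bit/s) * (1e-12 J/bit) = 1 W per Tb/s per pJ/bit."""
    return bandwidth_tbps * energy_pj_per_bit


# (max end-point bandwidth in Tb/s, max energy in pJ/bit), read off the table.
# For the "<5 pJ/bit" entry, 5 is used as an upper bound.
WORST_CASE = {
    "high (4-8T)": (8.0, 10.0),
    "low (400G)": (0.4, 15.0),
    "med-high (2-4T)": (4.0, 5.0),
}

if __name__ == "__main__":
    for label, (tbps, pj) in WORST_CASE.items():
        print(f"{label}: up to about {io_power_watts(tbps, pj):.0f} W of optical I/O")
```

The spread between columns shows why the pJ/bit target tightens as end-point bandwidth grows.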

Page 14: ARPA-E MSFT CoPkg

CPO: system functions

[Figure: the three application areas (Traditional Datacenter Networking; AI Training / HPC; Future Systems w/ Disaggregation) with their CPO modules.]

Page 15: ARPA-E MSFT CoPkg

CPO: system functions

Scales shown: cooler to warmer; low to high; low to high

soldered ?

integrated laser ?

socketed ?

remote laser ?

[Figure: the three application areas (Traditional Datacenter Networking; AI Training / HPC; Future Systems w/ Disaggregation) with their CPO modules.]

Page 16: ARPA-E MSFT CoPkg

Server-ToR-Tier1 topology

[Figure: topology diagram. Within the rack: DAC, <2m. Within the row: AOC, <30m.]

Page 17: ARPA-E MSFT CoPkg

ToR bypass – multi-homed NIC

[Figure: topology diagram across rack and row. Links: 4x100G DR4; CPO, <50m.]

Page 18: ARPA-E MSFT CoPkg

ToR bypass efficiencies (100G lane speeds)

                        Tier1-ToR-server                      ToR-bypass
failure domain          ToR is SPOF for rack                  no SPOF – multi-homed NIC
switch ASIC count       4X-8X                                 1X
switch space + power    baseline                              reduced space and ~1/3 power
switch radix            can’t leverage higher radix chips     can leverage full switch radix
                        (stranded capacity at ToR)            (multi-chip T1 box)
oversubscription        3:1 typical                           fully non-blocking in row
reach limits            DAC < 3m; AOC < 30m                   1m-2km+
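The oversubscription row can be read as a simple ratio of server-facing capacity to uplink capacity at a switch. The sketch below shows that arithmetic; the port counts are hypothetical and chosen only so the first case reproduces the 3:1 figure, and the function names are illustrative.

```python
def oversubscription_ratio(down_ports: int, down_gbps: float,
                           up_ports: int, up_gbps: float) -> float:
    """Ratio of total downlink (server-facing) capacity to total uplink capacity."""
    upstream = up_ports * up_gbps
    if upstream <= 0:
        raise ValueError("uplink capacity must be positive")
    return (down_ports * down_gbps) / upstream


def is_non_blocking(ratio: float) -> bool:
    """A tier is non-blocking when uplink capacity covers all downlink capacity."""
    return ratio <= 1.0


if __name__ == "__main__":
    # Hypothetical ToR: 24 x 100G to servers, 8 x 100G to Tier1.
    tor = oversubscription_ratio(24, 100, 8, 100)
    print(f"ToR oversubscription {tor:.0f}:1, non-blocking: {is_non_blocking(tor)}")
    # Hypothetical non-blocking case: equal capacity in both directions.
    flat = oversubscription_ratio(32, 100, 32, 100)
    print(f"Equal capacity {flat:.0f}:1, non-blocking: {is_non_blocking(flat)}")
```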

Page 19: ARPA-E MSFT CoPkg

Application example: resource disaggregation

MEM MEM MEM MEM NIC NIC STOR STOR

NPU NPU GPU GPU CPU CPU CPU CPU

NPU: neural processing unit

GPU: graphics processing unit

CPU: central processing unit

MEM: memory

NIC: network interface card

STOR: storage

~8x8 max with electrical interconnect

Page 20: ARPA-E MSFT CoPkg

Application example: resource disaggregation

Memory Storage

NPU cluster GPU cluster CPU cluster

switch

NIC NIC

scaling >8x8 with CPO-based interconnect

Page 21: ARPA-E MSFT CoPkg

Approximate timelines

[Figure: Gantt-style timeline spanning CY2022 Q1 through CY2025 Q4. Each row runs through the phases dev, POC, pilot, deploy.]

Rows: Traditional Networking; AI/ML; Resource Disagg (Mem); Resource Disagg (Non-Mem)

To be refined further…

Page 22: ARPA-E MSFT CoPkg


Changes to Ecosystem & Deployment Model

[Figure: two supply-chain flows, Today (Front Panel Pluggable) and CPO, each running from Component Manufacturers through OSATs & OEM / ODM Manufacturers to End Users.]

Today: FPP Optical Modules; Packaged Switch Silicon; rBOM (CPU, PCB, Fans, Mechanicals etc.); Software; OEM / ODM PCB & System Integration, Assembly & Test; Final Switch System; End User Integration; Deployment

CPO: Packaged Switch Silicon (CPO ready); CPO Optical Modules; CPO Switch Integration & Test; External Laser Source; Intra-box Fiber Solution; rBOM (CPU, PCB, Fans, Mechanicals etc.); Software; OEM / ODM Optics & PCB System Integration / Test; Final Switch System Deployment

Return paths shown: Yield Loss / RMA; RMA

Slide credit: Rob Stone (Facebook) – ECOC 2020 WS3

Page 23: ARPA-E MSFT CoPkg

Standardization of CPO

OIF Launches Co-packaging Framework Implementation Agreement Project - [oiforum.com]

COBO announces formation of Co-packaged Optics Working Group - [onboardoptics.org]

Page 24: ARPA-E MSFT CoPkg

Summary