ARPA-E MSFT CoPkg

Mark Filer, Principal Engineer, Azure Hardware Architecture (AHA)

Framing the discussion: How we build data center networks today, and where we're headed…

100G (now):
- rack: 50G DAC, <2 m
- row: 100G AOC, <30 m
- DC: 100G PSM4, <1 km
- DC to RNG: 100G DWDM (PAM4), <100 km
- long-haul: DWDM, >100 km

400G (2021-):
- rack: 100G DAC, <2 m
- row: 400G AOC, <30 m
- DC: 400G DR4, <1 km
- DC to RNG: 400G DWDM (400ZR), <100 km
- long-haul: DWDM, >100 km

Elephant in the room… Power

Possible Solutions
Takeaway: we can't just keep scaling link bandwidths; "next gen" systems will require all of the above.
[Diagram: Ethernet switch ASIC with co-packaged optics (CPO) modules on all four sides of the package.]

The path to CPO…
- What are the key drivers and applications?
- What are application-specific requirements?
- What must happen to make it viable?

Interconnect Figure of Merit (source: DARPA Photonics in the Package for Extreme Scalability (PIPES))
[Plot: bandwidth density * energy efficiency, in (Gbps/mm)/(pJ/bit), versus maximum interconnect distance in meters, spanning in-package, on-board, and off-board regimes. Plotted points include HBM, PCIe, LVDS, 65 nm and 32 nm SerDes, on-board PCI-E PMA + PCS, NVLink 2, NVIDIA GRS (2013 and ISSCC 2018), CHIPS SR and LR targets, Avago MicroPOD, Finisar 100GBASE-SR4, Mellanox 100GbE SR4 and CPRI QSFP28, Finisar BOA/MBOM, and Finisar optical backplane.]
CPO potential: FoM comparable to chip-to-chip electrical interconnects while retaining optical reach.
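
The vertical axis is bandwidth density divided by energy per bit (equivalently, density times energy efficiency), in (Gbps/mm)/(pJ/bit). A minimal sketch of computing that FoM for a few illustrative points; the numbers below are rough assumptions for the sketch, not values read off the plot:

# Minimal sketch of the interconnect figure of merit used in the plot:
#   FoM = bandwidth density (Gbps/mm) / energy per bit (pJ/bit)
# The example numbers are assumptions for illustration, not measured data.

def interconnect_fom(bw_density_gbps_per_mm: float, energy_pj_per_bit: float) -> float:
    """Return the figure of merit in (Gbps/mm)/(pJ/bit)."""
    return bw_density_gbps_per_mm / energy_pj_per_bit

examples = {
    # name: (bandwidth density [Gbps/mm], energy [pJ/bit], reach [m])
    "in-package (HBM-like)":      (1000.0, 1.0,  0.01),
    "on-board SerDes (assumed)":  (100.0,  5.0,  0.5),
    "pluggable optics (assumed)": (30.0,   15.0, 2000.0),
    "CPO target (assumed)":       (200.0,  5.0,  100.0),
}

for name, (bw, energy, reach) in examples.items():
    fom = interconnect_fom(bw, energy)
    print(f"{name:28s} FoM = {fom:7.1f} (Gbps/mm)/(pJ/bit), reach ~ {reach} m")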

CPO key drivers
- Next-gen optics (factors: increasing bandwidth, 100G/200G signaling, redundancy, increasing use of optics in pluggable form factors)
- Memory/accelerator/IO disaggregation
- New DC applications (functions, emerging storage technologies)
- Machine learning
All converge on cost, energy, and bandwidth. Optical innovations are key to future datacenters.

CPO benefits are multifaceted

Metric: desired characteristics
- Interconnect: optical reach 10s to 1000s of meters; bandwidth density 100s of Gbps/mm; latency few ns + propagation delay
- HW cost: component cost << $1/Gbps
- Operating cost: reliability < 10 FIT; energy efficiency < 10 pJ/bit

(Energy per bit includes the module/chiplet-side electrical interface, DSP, PIC components, and laser source; excludes the switch side. Assumes XSR for CPO.)
CPO is a distinct step forward in datacenter communication efficiency.
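
A quick back-of-envelope reading of those targets for a hypothetical 51.2 Tbps switch; the switch capacity and the script below are illustrative assumptions, not figures from the slides:

# Back-of-envelope sketch of what the CPO targets above imply for an
# assumed 51.2 Tbps switch (not a figure from the slides).

SWITCH_CAPACITY_GBPS = 51_200          # assumed 51.2 Tbps switch
ENERGY_TARGET_PJ_PER_BIT = 10          # "< 10 pJ/bit" target
COST_TARGET_USD_PER_GBPS = 1           # "<< $1/Gbps" target

# Power: pJ/bit * Gbps = mW, so divide by 1000 for watts.
optics_power_w = ENERGY_TARGET_PJ_PER_BIT * SWITCH_CAPACITY_GBPS / 1000
optics_cost_usd = COST_TARGET_USD_PER_GBPS * SWITCH_CAPACITY_GBPS

print(f"Optics power at {ENERGY_TARGET_PJ_PER_BIT} pJ/bit: {optics_power_w:.0f} W")
print(f"Optics cost ceiling at ${COST_TARGET_USD_PER_GBPS}/Gbps: ${optics_cost_usd:,.0f}")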

CPO applications
[Diagram: three application domains.
- Traditional datacenter networking: Ethernet switch ASIC with co-packaged CPO modules; server chip with NIC, DRAM, PCIe, and CPO.
- AI training / HPC: high-performance switch ASIC with co-packaged CPO; NPU/GPU packages with HBM, CPE, and CPO, connected over Ethernet.
- Future systems with disaggregation: server CPU packages (server chip, local memory, CPO) attached to a compute fabric; a switch and MC/NIC connect to remote memory (MC, CPO) over Ethernet.]

CPO: application domains
[Same three-domain diagram as above, with links categorized as memory vs. non-memory.]

CPO: application domains

                      Traditional DC networking   AI training / HPC      Future systems w/ disaggregation
end-point bandwidth   low (400G)                  high (4-8T)            med-high (2-4T)
latency tolerance     high (<ms)                  medium (µs)            ultra-low (100s of ns)
"link quality"        med-high                    high                   ultra-high
power consumption     medium (10-15 pJ/bit)       low (5-10 pJ/bit)      ultra-low (<5 pJ/bit)
loss budget           low (IEEE-compliant)        medium-high (~8 dB)    medium-high (~8 dB)

[Same three-domain diagram as above, with links categorized as memory vs. non-memory.]
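
The "ultra-low (100s of ns)" latency tolerance for disaggregated memory is dominated by fiber propagation once the CPO and SerDes themselves add only a few ns. A rough sketch, using assumed numbers (5 ns/m fiber delay, 50 m row-scale reach, a single switch hop; none of these are from the slides):

# Rough latency-budget sketch for row-scale memory disaggregation.
# All numbers here are assumptions for illustration, not slide data.

FIBER_DELAY_NS_PER_M = 5.0    # ~5 ns/m in silica fiber (group index ~1.5)
REACH_M = 50.0                # assumed row-scale CPO reach
CPO_LATENCY_NS = 5.0          # assumed per-end CPO + SerDes latency ("few ns")
SWITCH_LATENCY_NS = 100.0     # assumed latency of one switch hop

one_way_ns = (REACH_M * FIBER_DELAY_NS_PER_M    # fiber propagation
              + 2 * CPO_LATENCY_NS              # CPO at each end of the link
              + SWITCH_LATENCY_NS)              # single switch hop
round_trip_ns = 2 * one_way_ns

print(f"One-way latency:    ~{one_way_ns:.0f} ns")
print(f"Round-trip latency: ~{round_trip_ns:.0f} ns")
# Propagation alone is ~250 ns each way at 50 m, which is why memory
# disaggregation targets short, row-scale reach and ultra-low link latency.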

CPO: system functions
[Same three-domain diagram as above.]

CPO: system functions
[Same three-domain diagram, annotated with system-level questions that vary across the domains: thermal environment (cooler vs. warmer), module attach (soldered? vs. socketed?), and laser placement (integrated laser? vs. remote laser?).]

Server-ToR-Tier1 topology
[Diagram: within the rack, servers connect to the ToR over DAC (<2 m); within the row, ToR uplinks run over AOC (<30 m).]

ToR bypass – multi-homed NIC
[Diagram: multi-homed server NICs connect directly to row-level (Tier-1) switches over 4x100G DR4 / CPO links (<50 m), bypassing the ToR.]

ToR bypass efficiencies (100G lane speeds)

                      Tier1-ToR-server                     ToR-bypass
failure domain        ToR is SPOF for rack                 no SPOF (multi-homed NIC)
switch ASIC count     4X-8X                                1X
switch space + power  baseline                             reduced space and ~1/3 power
switch radix          can't leverage higher-radix chips    can leverage full switch radix
                      (stranded capacity at ToR)           (multi-chip T1 box)
oversubscription      3:1 typical                          fully non-blocking in row
reach limits          DAC < 3 m; AOC < 30 m                1 m to 2 km+
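
A rough sketch of where the ASIC-count and power line items come from, using assumed row dimensions (16 racks per row, one ToR ASIC per rack, a four-chip Tier-1 box; these are illustrative numbers, not from the slides):

# Rough sketch of the switch ASIC count comparison in the table above.
# Row dimensions are assumptions for illustration, not slide data.

RACKS_PER_ROW = 16
TOR_ASICS_PER_RACK = 1        # one ToR switch ASIC per rack
T1_ASICS_PER_ROW = 4          # assumed multi-chip Tier-1 box serving the row

# Tier1-ToR-server: every rack carries a ToR ASIC in addition to the
# shared Tier-1 ASICs.
tor_design = RACKS_PER_ROW * TOR_ASICS_PER_RACK + T1_ASICS_PER_ROW

# ToR bypass: multi-homed NICs reach the Tier-1 box directly over
# <50 m CPO links, so only the Tier-1 ASICs remain.
bypass_design = T1_ASICS_PER_ROW

print(f"ASICs per row, Tier1-ToR-server: {tor_design}")
print(f"ASICs per row, ToR bypass:       {bypass_design}")
print(f"ASIC reduction: {tor_design / bypass_design:.0f}X")
# Power does not scale strictly with ASIC count (Tier-1 chips are larger),
# which is why the slide quotes ~1/3 power rather than the full count ratio.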

Application example: resource disaggregation
[Diagram: a system built from directly attached resources: 4x MEM, 2x NIC, 2x STOR connected to 2x NPU, 2x GPU, 4x CPU.]
NPU: neural processing unit; GPU: graphics processing unit; CPU: central processing unit; MEM: memory; NIC: network interface card; STOR: storage
~8x8 max with electrical interconnect

Application example: resource disaggregation
[Diagram: memory, storage, NPU cluster, GPU cluster, and CPU cluster pools attached through NICs to a switch.]
Scaling >8x8 with CPO-based interconnect
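
The step from "~8x8 max with electrical interconnect" to ">8x8" rests largely on reach: copper keeps the resource pool inside a chassis or rack, while CPO reach lets pools span a row. A toy sketch under assumed rack pitch and resource density (my numbers, purely illustrative):

# Toy sketch of how interconnect reach bounds a disaggregated resource
# pool. Rack pitch, resources per rack, and reaches are all assumptions.

def reachable_resources(reach_m: float,
                        rack_pitch_m: float = 0.8,
                        resources_per_rack: int = 16) -> int:
    """Resources reachable from a host within a given cabling reach."""
    racks = max(1, int(reach_m // rack_pitch_m))
    return racks * resources_per_rack

electrical_pool = reachable_resources(reach_m=2.0)    # DAC-class electrical reach
cpo_pool        = reachable_resources(reach_m=50.0)   # row-scale CPO reach

print(f"Electrical reach (~2 m): pool of ~{electrical_pool} resources (chassis/rack scale)")
print(f"CPO reach (<50 m):       pool of ~{cpo_pool} resources (row scale)")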

Approximate timelines (to be refined further)
[Roadmap from CY2022 through CY2025, by quarter: each application area moves through dev, POC, pilot, and deploy phases.]
- Traditional networking
- AI/ML
- Resource disaggregation (memory)
- Resource disaggregation (non-memory)

Changes to Ecosystem & Deployment Model

Today (front-panel pluggable): component manufacturers supply packaged switch silicon, FPP optical modules, and the rBOM (CPU, PCB, fans, mechanicals, etc.); OEM/ODM handles PCB & system integration, assembly & test, and software; the end user deploys the final switch system and integrates the optics.

CPO: component manufacturers supply CPO-ready packaged switch silicon, CPO optical modules, an external laser source, an intra-box fiber solution, and the rBOM; OEM/ODM handles optics & PCB system integration/test, CPO switch integration & test, and software; the end user deploys the final switch system.

[Flow diagram also marks where yield loss / RMA sits across component manufacturers, OSATs & OEM/ODM manufacturers, and end users in each model.]

Slide credit: Rob Stone (Facebook) – ECOC 2020 WS3

Standardization of CPO
- OIF launches Co-Packaging Framework Implementation Agreement project [oiforum.com]
- COBO announces formation of Co-Packaged Optics Working Group [onboardoptics.org]

Summary