Posted on 11-May-2022
Waferscale Architectures
Saptadeep Pal, Subramanian S. Iyer and Puneet Gupta University of California, Los Angeles
In collaboration with Rakesh Kumar
University of Illinois at Urbana-Champaign
Table of Contents
• Why and How Waferscale?
• Design Challenges - A Waferscale GPU Case Study
• Prototype - Waferscale Graph Processor
Massively Parallel Applications
Social Networks
Molecular Dynamics Simulation
Artificial Intelligence
Weather Modelling
Big Data
Multi-Physics Simulation
Communication is Expensive!!
One Large Chip
A Brief History of Waferscale Computing
Gene Amdahl's Trilogy Systems, Tandem Computers, Fujitsu
Other efforts: ITT Corporation, Texas Instruments. Recent efforts: SpiNNaker
What Happened to Waferscale Integration?
Their approach to waferscale: monolithic integration
It didn't work out (e.g., Trilogy Systems was one of the biggest financial disasters in Silicon Valley before 2001)
As the area of a chip grows, the probability of defects and the variability grow with it
Some mitigation is possible through voting logic, TMR, etc., but it is prohibitively expensive
Deemed commercially unviable
Time to Give Waferscale Another Go?
However, to achieve waferscale integration, we need to solve the yield problem
Re-imagining WSI Technology
UCLA Silicon Interconnect Fabric (Si-IF)*
[Figure: Si-IF cross-section; feature dimensions of 2 µm, 10 µm, and 100 µm]
*UCLA CHIPS Programme: https://www.chips.ucla.edu/research/project/4
Allows heterogeneous waferscale integration with high yield. Measured bond yield >99%.
Q: What do we need from waferscale integration? A: High-density interconnection.
A Case for Waferscale GPU
GPU applications scale well with compute and memory resources
Waferscale GPU Overview
• GPU die = 500 mm²
• 3D DRAM die = 100 mm² (two per GPM)
• Total area = 700 mm²
• GPU die power = 200 W
• DRAM die power = 35 W each
• Total power = 270 W
A 300 mm wafer has enough area for about 72 GPU modules (GPMs), as the sketch below illustrates.
A GPU Module: GPM
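A minimal sketch of the dies-per-wafer arithmetic behind the ~72-GPM figure, using the classic dies-per-wafer approximation; the 3 mm edge exclusion and the two-DRAM-dies-per-GPM reading of the area budget are assumptions inferred from the totals above, not numbers stated on the slide.

```python
import math

def modules_per_wafer(wafer_diameter_mm, module_area_mm2, edge_exclusion_mm=3.0):
    """Classic dies-per-wafer approximation for square-ish modules."""
    usable_d = wafer_diameter_mm - 2 * edge_exclusion_mm
    gross = math.pi * (usable_d / 2) ** 2 / module_area_mm2          # area term
    edge_loss = math.pi * usable_d / math.sqrt(2 * module_area_mm2)  # partial modules lost at the rim
    return int(gross - edge_loss)

gpm_area = 500 + 2 * 100   # GPU die + two 3D DRAM dies = 700 mm^2 (implied by the area totals)
print(modules_per_wafer(300, gpm_area))   # ~72 GPMs on a 300 mm wafer
```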
Architecting a Waferscale GPU
Q: Can we build a 72-GPM waferscale GPU?
Three major physical constraints:
1. Thermal
➢ Waferscale GPU would dissipate kWs of power
2. Power Delivery
➢ How to supply kWs of power to the GPU modules?
➢ Voltage Regulator Module (VRM) overhead?
3. Network of GPMs
➢ Si-IF has up to 4 metal layers; what network topology to build?
Thermal Design Power
➢ Forced air-cooling with two heat sinks
➢ ~12 kW TDP, i.e., 34 GPMs can be supported (see the budgeting sketch below)
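A rough sketch of the thermal budgeting arithmetic; the 12 kW budget and 270 W per GPM come from the slides, while the gap between naive division and the quoted 34 GPMs is only noted in a comment, since the slide's number comes from the authors' detailed thermal and power-delivery analysis.

```python
TDP_BUDGET_W = 12_000      # forced air-cooling with two heat sinks (slide figure)
GPM_POWER_W = 270          # GPU die + DRAM dies per GPM

naive_max = TDP_BUDGET_W // GPM_POWER_W
print(naive_max)           # 44 by pure division

# The slides quote 34 GPMs: the detailed analysis (heat-sink limits,
# delivery losses, margins) is stricter than this one-line division,
# so 34 is the binding number.
```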
Power Delivery

Input Voltage (V)   Number of Metal Layers   VRM+DeCap Area per GPM (mm²)   Number of GPMs
1                   68                       300                            50
3.3                 8                        1020                           29
12                  2                        1380                           24
48                  2                        2460                           15

[Charts: Input Voltage vs. Number of Layers; Input Voltage vs. Area Overhead]
Stacked Power Delivery
Number of GPMs supported:

Input Voltage (V)   No Stack   2-GPM Stack   4-GPM Stack
12                  24         33            40
48                  15         24            34
• One VRM shared by multiple GPMs in a stack
• Decreases the step-down ratio of the VRM (see the sketch below)
• GPMs need to be activity-balanced for equal voltage drop
• Intermediate regulators for voltage stability
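An illustrative calculation of how stacking changes the conversion ratio and rail current; the 1 V core voltage is an assumption, since the slides only specify the 12 V and 48 V input rails.

```python
V_CORE = 1.0   # assumed GPM core voltage (not stated on the slides)

def stack_metrics(v_in, stack_height, gpm_power_w=270):
    """Conversion ratio seen by the VRM and current drawn from the input rail."""
    ratio = v_in / (stack_height * V_CORE)        # stacking divides the step-down ratio
    rail_current = stack_height * gpm_power_w / v_in
    return ratio, rail_current

print(stack_metrics(48, 1))   # no stack   : 48:1 conversion, ~5.6 A per GPM
print(stack_metrics(48, 4))   # 4-GPM stack: 12:1 conversion, ~22.5 A shared by the stack
```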
Waferscale Inter-GPM Network

[Figure: Mesh, 1D-Torus, and 2D-Torus inter-GPM topologies]

Num. Metal Layers   Topology    Inter-GPM BW (TB/s)   Si-IF Yield
1                   Mesh        0.75                  95.9%
2                   Mesh        1.5                   91.9%
2                   1-D Torus   1.5                   84.3%
3                   2D Torus    1.9                   74%
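A rough sketch of why the topologies differ in wiring cost: link counts for a mesh, a 1-D torus (a single ring), and a 2-D torus on an R x C array. The 5 x 5 array size is illustrative only, not the actual GPM layout.

```python
def mesh_links(r, c):
    return r * (c - 1) + c * (r - 1)   # nearest-neighbour links only

def torus_1d_links(r, c):
    return r * c                       # one ring threading all GPMs

def torus_2d_links(r, c):
    return 2 * r * c                   # mesh links plus wrap-around rows/columns

r, c = 5, 5
print(mesh_links(r, c), torus_1d_links(r, c), torus_2d_links(r, c))  # 40, 25, 50
# More links, and long wrap-around links in particular, consume more Si-IF
# metal layers, which is why the torus variants cost routing layers and yield.
```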
Final WS-GPU Architectures
24-GPM floorplan without voltage stacking
40-GPM floorplan with voltage stacking
Inter-GPM Network: Mesh
Results – WS-GPU Performance Improvement
• WSI with 24 GPMs performs 2.97x better than the multi-MCM configuration (EDP improvement: 9.3x)
• WSI with 40 GPMs performs 5.2x better than the multi-MCM configuration (EDP improvement: 22.5x)
Waferscale integration can provide large performance and energy benefits (the implied energy gains are sketched below)
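Unpacking the numbers above, and assuming EDP here means energy x delay, the implied energy improvements are roughly 3.1x and 4.3x:

```python
def energy_improvement(edp_improvement, speedup):
    # EDP = energy x delay, so energy gain = EDP gain / speedup
    return edp_improvement / speedup

print(round(energy_improvement(9.3, 2.97), 2))   # ~3.13x energy benefit at 24 GPMs
print(round(energy_improvement(22.5, 5.2), 2))   # ~4.33x energy benefit at 40 GPMs
```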
Waferscale Graph Accelerator Prototype
Graph Acceleration using Waferscale Architectures
Bandwidth beyond 50 TB/s needed at 10 ns per iteration
[Figure: (left) Utilization with Perfect Scheduling vs. Parallel Workers (log2); (right) Memory Throughput Required (Bytes/Iteration) vs. Parallel Workers (log2)]
Massive number of processing cores needed
Graph workloads exhibit irregular memory access patterns
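A one-line conversion relating per-iteration traffic to the bandwidth headline; the 5e5 bytes/iteration value is chosen only so that, at the slide's 10 ns per iteration, it reproduces the >50 TB/s number.

```python
bytes_per_iteration = 5e5          # illustrative per-iteration traffic
iteration_time_s = 10e-9           # 10 ns per graph iteration (slide figure)

required_bandwidth = bytes_per_iteration / iteration_time_s
print(required_bandwidth / 1e12, "TB/s")   # 50.0 TB/s
```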
ARM Cortex-M3 based Prototype
(1) Scalable up to a 64x64 tile array (4,096 tiles); 57,344 cores in total
(2) Unified memory architecture with >100 TB/s total memory bandwidth
(3) Custom implementation of waferscale read, write, and synchronization primitives
(4) Fault-tolerant dual-network scheme
(5) GALS (globally asynchronous, locally synchronous) clocking using the waferscale network
(6) Simple software libraries for easy and intuitive programming: programmers can treat the entire wafer as a single multi-processor (see the sketch below)
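A sketch of what treating the wafer as a single multi-processor could look like from software; the names (ws_read, ws_write, ws_barrier) and the toy per-tile memories are hypothetical illustrations of the primitives' semantics, not the prototype's actual library.

```python
class WaferscaleFabric:
    def __init__(self, tiles):
        self.mem = {t: bytearray(1024) for t in range(tiles)}   # toy per-tile memory

    def ws_read(self, tile, offset, length):
        """Remote read from any tile's memory, as if it were local."""
        return bytes(self.mem[tile][offset:offset + length])

    def ws_write(self, tile, offset, data):
        """Remote write into any tile's memory."""
        self.mem[tile][offset:offset + len(data)] = data

    def ws_barrier(self, workers):
        """Synchronization point; the real design implements this over the
        waferscale network rather than in host software."""
        pass

fabric = WaferscaleFabric(tiles=4096)
fabric.ws_write(tile=2047, offset=0, data=b"\x2a")
print(fabric.ws_read(tile=2047, offset=0, length=1))   # b'*'
```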
Planned tape-out: late 2019; bring-up: Q2 2020
[Figure: prototype assembly with compute dies and memory dies]
Summary and Conclusion
• Communication between packaged processors is a major bottleneck
• Si-IF technology enables waferscale integration at high yield
• Waferscale GPU versus multi-MCM system:
• 5.2x performance improvement
• 22.5x EDP improvement
• Graph applications are a good fit for waferscale processing
• ARM Cortex-M3 based waferscale prototype under active development
Acknowledgement
• This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) through ONR grant N00014-16-1-263, UCOP through grant MRP-17-454999, and the UCLA CHIPS Consortium.
• Collaborators:
UCLA: SivaChandra Jangam, Adeel A. Bajwa, Shi Bu, Prof. Sudhakar Pamarti
UIUC: Matthew Tomei, Daniel Petrisko, Nick Cebry, Jinyang Liu
Thank You
Backup
Waferscale Graph Processor Architecture
• 256 3D-stacked nodes with simple cores in the logic layer (2 TB total memory)
• Large intra-node bandwidth in the 3D memories (up to 1 TB/s)
• Mesh-based inter-node network; 1.5 TB/s link bandwidth using Si-IF
• Unified memory architecture: any core can directly access any memory (see the address-mapping sketch below)
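An illustrative global-to-local address mapping for the unified memory architecture; the 8 GB per node follows from 2 TB spread over 256 nodes, while the simple contiguous interleaving below is an assumption.

```python
NODES = 256
NODE_MEM_BYTES = 8 * 2**30             # 8 GiB per node -> 2 TiB total

def locate(global_addr):
    """Map a flat global address to (node id, offset within that node)."""
    node = global_addr // NODE_MEM_BYTES
    offset = global_addr % NODE_MEM_BYTES
    return node, offset

print(locate(0))                        # (0, 0)
print(locate(3 * NODE_MEM_BYTES + 42))  # (3, 42): any core can address node 3's memory
```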
Re-imagining Waferscale Integration
Small, known-good dies are bonded onto a wafer that contains interconnect wiring only.
Why Interposers or Packages Do not Scale
[Figures: a standard package cross-section (silicon die, package substrate, 0.5-1 mm BGA solder balls, organic PCB) and an interposer-based package cross-section (silicon die on an interposer with TSVs)]
Experimental Methodology
Simulator: In-house Trace-based GPU Simulator (Validated against Gem5-GPU)
Baselines: MCM-GPU, iso-GPM multi-MCM GPU integrated on PCB
Graph Processing Requires a New Architecture
[Figure: architecture spectrum, moving away from cores; CPUs and GPUs assume memory locality]
Graph Applications Exhibit Poor Memory Locality
[Figure: the same spectrum extended; CPUs and GPUs require locality, accelerators require a predictable layout and expensive pre-processing, near-memory computing and HPC clusters sit further along, and the ideal architecture requires nothing of the workload while providing high random-access bandwidth]