3D Technologies and Architectures for High ... - IRT Nanoelec
INSTITUT DE RECHERCHE TECHNOLOGIQUE
3D Technologies and
Architectures for High
Performance Computing
13/03/2020
3D Technologies and Architectures for High Performance
Computing, P. Vivet, DATE'2020 IRT Workshop
High Performance Computing & Big Data
• More cores + more accelerators + more memory
– Similar constraints are appearing for embedded HPC (Automotive, etc.)
– Need both highly optimized generic and specialized functions (e.g. ML/AI accelerators)
– Need a « go-to-market » solution for sustainable system differentiation
• System designers must offer :
– Modular and cost-effective solutions
– Energy efficiency of the system infrastructure
– More on-chip memory bandwidth per core
Given the issues of advanced CMOS nodes, a « Single Die » solution is no longer viable
Chiplet Partitioning
• Chiplet motivations
– Cost driven
– Modularity driven
– Heterogeneous integration
• Chiplet challenges ?
– Eco-system maturity
– Technology & Architecture partitioning
– Chiplet interfaces, testability, 3D CAD flow, etc.
using 3D technologies
[D. Dutoit, Keynote, 3DIC’2014]
Chiplet Partitioning : Solutions and Limitations
• Existing technologies
– Organic substrates (AMD, 4-chiplet circuit, ISSCC’2018)
– Passive interposer (2.5D) (TSMC, CoWoS, VLSI’2019)
– Silicon bridges (INTEL, EMIB bridge, ISSCC’2017)
• But, some limitations
– Chiplet communication limited to side-by-side communication, not scalable
– How to integrate heterogeneous chiplets & differentiating functions ?
– How to integrate less-scalable functions (IOs, analog, power management) ?
Active Interposer : Principle
• Chiplets : clusters of cores
• Active interposer : mature CMOS technology (with low logic density to preserve system cost)
– Scalable & distributed NoCs : any chiplet-to-chiplet traffic
– Power management : close to cores
– SoC infrastructure : analog, IOs, PHY, DFT
– Additional features
Outline
• Introduction on Chiplet partitioning
• INTACT : An Active Interposer
– Circuit Overview
– Design Building blocks
– 3D Design Flow
• INTACT : Circuit Results
• Conclusions & Perspectives
6 Chiplets 3D-stacked on an Active Interposer
Chiplet Overview
• 4 clusters of 4 cores
• Distributed L1$ + L2$ + L3$
• Scalable Cache Coherency
Active Interposer
• Distributed flexible interconnects
• Integrated SCVRs (1/chiplet)
• Memory Controller & System IO’s
• SOC Infrastructure, DFT
[Figure: cross-section of two 16-core chiplets (clusters 0 to 3, four L3 tiles, SoC infrastructure, 3D Plugs) stacked on the active interposer (distributed NoCs with routers & pipelined links, power management, memory-IO, Cfg, Clk, Rst, Config, Test) on the package substrate. Labels: µ-bumps Ø10µm, C4 bumps Ø90µm, balls Ø500µm, off-chip links, and supply labels "1.5 - 2.5", "VDD-chiplet", "1.2", "VDD-interpo".]
6 Chiplets 3D-stacked on an Active Interposer
96 cores, in 6 chiplets, 3D-stacked on an active CMOS interposer
• 6 chiplets (FDSOI28)
• Active interposer (CMOS65)
• 2 technology nodes difference between chiplets & bottom die
Chiplet Main Features
• 16 x MIPS® 32-bit scalar cores
• Memory is physically distributed through chiplet L2-caches + Virtual Memory support
– L1 I-caches + D-caches (16 kB / core)
– Distributed Shared L2-caches (256 kB / cluster)
– Adaptive & fault tolerant L3-caches (4 tiles of 1 MB)
• Directory-based cache coherence with linked-list directory [5]
• 2D-mesh NoCs, extended through the interposer
• FDSOI 28nm, LPLV, [0.5-1.3V], with Body Biasing
– FLLs, Timing Fault Sensors, Thermal Sensors
[5] E. Guthmuller et al, “A 29 Gops/Watt 3D-Ready 16-Core Computing Fabric with Scalable Cache Coherent Architecture Using Distributed L2 and Adaptive L3 Caches”, ESSCIRC’2018.
[Figure: chiplet floorplan with the L1-L2, L2-L3 and L3-ExtMem links from/to the active interposer.]
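The cache figures above can be totalled per chiplet and for the 96-core system. The sketch below is only bookkeeping on the numbers quoted on the slides; the constant names are illustrative, and the L1 figure is counted once per core exactly as written (if the I-cache and D-cache are 16 kB each, the L1 totals double).

```python
# Hierarchy sizes quoted on the slides.
CHIPLETS = 6
CLUSTERS_PER_CHIPLET = 4
CORES_PER_CLUSTER = 4
L1_KB_PER_CORE = 16        # as written on the slide; may be per cache (I or D)
L2_KB_PER_CLUSTER = 256
L3_TILES_PER_CHIPLET = 4
L3_MB_PER_TILE = 1

cores_per_chiplet = CLUSTERS_PER_CHIPLET * CORES_PER_CLUSTER
total_cores = CHIPLETS * cores_per_chiplet

# Per-chiplet capacities, in kB.
l1_kb_per_chiplet = cores_per_chiplet * L1_KB_PER_CORE
l2_kb_per_chiplet = CLUSTERS_PER_CHIPLET * L2_KB_PER_CLUSTER
l3_kb_per_chiplet = L3_TILES_PER_CHIPLET * L3_MB_PER_TILE * 1024

# System-wide L2 and L3 capacities, in MB.
l2_mb_system = CHIPLETS * l2_kb_per_chiplet / 1024
l3_mb_system = CHIPLETS * l3_kb_per_chiplet / 1024

print(f"{total_cores} cores, L2 = {l2_mb_system:.0f} MB, L3 = {l3_mb_system:.0f} MB")
```

With these inputs each chiplet carries 1 MB of L2 and 4 MB of L3, which gives 6 MB of L2 and 24 MB of L3 across the six chiplets.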
System Level Interconnects
• Distributed & flexible interconnects within the active interposer
– Multiple Network-on-Chips (routers + links)
– 3D-Plug communication IPs, in synchronous & asynchronous versions
• Chiplet-to-chiplet communication schemes
– Passive links, short reach (L1-L2)
– Active links, long reach (L2-L3, L3-ExtMem)
These links allow scalable traffic from any chiplet to any chiplet
[Figure: two 16-core chiplets (clusters of PE0 to PE3 with L1I/L1D caches and a shared L2 $, four L3 $ tiles, TG, TAP, FLL, thermal sensor, SPI, UART, 3D Plugs) connected through routers (R) in the active interposer, with links to the previous and next chiplets and to Memory-IO. Link legend:
– L1-L2 : short reach, passive
– L2-L3 : long reach, async., active
– L3-Ext-Mem : sync., active]
3D-Plug Communication IP : layout overview
• µ-bumps : 20µm pitch
• 3D-Plug :
– Logic interface
– µ-bumps
– µ-buffer std-cells
– DFT
• µ-buffer std-cell : BiDir Driver + ESD + Pull-Up + Level-Shifter
[Figure: chiplet layout showing the 3D-Plug interfaces and the µ-buffer std-cells.]
System Level Interconnects : Comparison
• 3D-Plug
– Best throughput for the synchronous version (1.25 GHz)
• Interposer
– Similar throughput between SNOC & ANOC (~1 GHz)
– Best latency for ANOC, 0.6 ns/mm (3-5x wrt. SNOC)
Latency reduction, for cache coherency traffic, at the cost of energy

                     | L1-L2          | L2-L3          | L3-EXT-MEM    | Units
Link type            | Passive, sync. | Active, async. | Active, sync. |
3D Plug frequency    | 1.25           | 0.52           | 1.21          | GHz
2D NoC frequency     | 1.00           | 0.97           | 0.75          | GHz
End to end latency * | 44             | 4 + async.     | 37            | cycles
                     | 44.0           | 15.2           | 49.5          | ns
Propagation speed    | 2.9            | 0.6            | 2.0           | ns/mm
Energy / bit / mm    | 0.15           | 0.52           | 0.24          | pJ/bit/mm

* End-to-end latency measured from point A to point B [Figure: path from A to B across the interposer]

Combination of interconnect types to achieve performance trade-offs
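As a rough cross-check of the comparison figures, the sketch below converts the cycle counts of the two synchronous links into nanoseconds and compares propagation speeds against the asynchronous link. It assumes, which the slide does not state, that the cycles are counted at the 2D NoC frequency; the function and variable names are illustrative.

```python
def cycles_to_ns(cycles: float, freq_ghz: float) -> float:
    """Latency in ns for a cycle count at a clock frequency given in GHz."""
    return cycles / freq_ghz

# Values copied from the comparison table (synchronous links only).
links = {
    "L1-L2":      {"cycles": 44, "noc_ghz": 1.00, "ns_per_mm": 2.9},
    "L3-EXT-MEM": {"cycles": 37, "noc_ghz": 0.75, "ns_per_mm": 2.0},
}
ANOC_NS_PER_MM = 0.6  # L2-L3 asynchronous link, from the same table

# Assumption: cycles are counted in the 2D NoC clock domain.
latency_ns = {name: cycles_to_ns(v["cycles"], v["noc_ghz"])
              for name, v in links.items()}

# How many times faster (per mm) the asynchronous link propagates.
speedup = {name: v["ns_per_mm"] / ANOC_NS_PER_MM for name, v in links.items()}

for name in links:
    print(f"{name}: {latency_ns[name]:.1f} ns, "
          f"async link {speedup[name]:.1f}x faster per mm")
```

Under that assumption the L1-L2 latency reproduces the quoted 44.0 ns and the L3-EXT-MEM latency comes out near 49.3 ns, close to the quoted 49.5 ns. The speed ratios (about 4.8x and 3.3x) fall within the quoted 3-5x range.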
Switched Cap Voltage Regulators : Principle
• Distributed power supply units
– DVFS local scheme, below each chiplet
– Fast transitions & reduced IR-drop effects
– “High” input voltage (up to 2.5V), reduces #PG IOs in the package
• Fully integrated
– No external passive components, thick oxide transistors
– On-chip CAPs only (MOS+MOM+MIM, 8.9 nF/mm2)
– 50% of chiplet area, fault tolerant, in the interposer
– PowerGrid delivery as a µ-bump flip-chip matrix
[Figure: power unit in the interposer (DC-DC converter fed at VIN through TSVs) delivering VOUT as P/G to the chiplet (digital IC) through µ-bumps.]
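The claim that a "high" input voltage reduces the number of power/ground IOs follows from I = P / (η · V): for the same delivered power, a higher supply voltage draws less current through the package. The sketch below illustrates this with hypothetical operating values (a 3 W load, a 1.1 V direct core supply, and an 82% conversion efficiency borrowed from figures quoted elsewhere in this deck); it is not a model of the actual SCVR.

```python
def supply_current_a(load_power_w: float, supply_v: float,
                     efficiency: float = 1.0) -> float:
    """Current drawn from a supply rail to deliver load_power_w to the load."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return load_power_w / (efficiency * supply_v)

LOAD_W = 3.0     # hypothetical per-chiplet load
CORE_V = 1.1     # hypothetical direct core supply, no on-interposer regulator
VIN_V = 2.5      # "high" input voltage from the slide
SCVR_EFF = 0.82  # hypothetical operating efficiency (quoted as a peak value)

direct_a = supply_current_a(LOAD_W, CORE_V)
scvr_a = supply_current_a(LOAD_W, VIN_V, SCVR_EFF)

print(f"direct 1.1 V supply : {direct_a:.2f} A")
print(f"2.5 V input to SCVR : {scvr_a:.2f} A "
      f"({scvr_a / direct_a:.0%} of the direct current)")
```

With these illustrative numbers the package carries roughly half the current, even after accounting for conversion losses, which is why fewer power/ground IOs are needed.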
Design-for-Test (DFT) : Challenges and Solutions
Test challenges
• Chiplet Known-Good-Die (KGD) sorting
• Testability for active interposer & final package
• Reduced test access due to 3D fine pitch interconnect
– Chiplet EWS test performed on a regular IO ring
Flexible 3D-DFT architecture, including :
• Parallel scan chains, with test compression
– Test time & IO driven
• IJTAG test interface, using the IEEE 1687 standard
– Boundary Scan Chains, for testing 3D interconnections
– SIB, for Memory BISTs & Repair
• 3D Design for Test tool
– Tessent tool, for DFT insertion & ATPG
[J. Durupt et al., ETS’2016]
3D Design Flow : Main Achievements
• 3D Architecture
– 3D Netlist partitioning
– 3D Circuit/package co-design
– 3D Thermal & Power Profile
• 3D Physical Implementation
– 2D Synthesis
– 3D Floorplanning
– 2D/3D Place & Route
– 3D DFT & ATPG
• 3D Sign-Off
– 3D DRC & LVS verif
– 3D Thermal validation
– 3D Parasitic extraction/TA
– 3D Electromigration
– 3D IR-Drop analysis
• Standardization Efforts
• Tools
– XPEDITION tools : for future large 3D systems (HPC application)
– SAHARA tool : for thermal profiling at system level
– TESSENT tools : DFT archi & ATPG tools
– Sign-Off with CALIBRE tools : DRC/LVS => 3DStack, Thermal => SAHARA, Parasitics => PEX, xACT, Electromigration => DENALI
• More challenges to come for next 3D Technologies
– Fine pitch Hybrid Bonding (1µm)
– CoolCube (100nm)
Close co-operation & optimization between 3D CAD tools & 3D circuit design
3D Design Flow : Thermal Exploration
• Thermal exploration & flow using SAHARA tool prototype
• Full circuit thermal model, including circuit layer, detailed power maps, package description, and heat sink selection.
• Layer stack labels in the thermal model
– BEOL (10 Cu layers + 1 Al layer) ~8µm
– BEOL (7 Cu layers + 1 Al layer) ~7µm
– SiON passivation 2µm + RDL Cu 2µm + organic passivation 3µm
[Figure: design flow coupling Physical Implementation (partitioning, floorplanning, place & CTS, signal routing, timing closure, sign-off) and Package Design (package selection, package optimization) with thermal exploration and thermal sign-off.]
• Chiplet power sources
– Clusters = 4 x 0.4W
– Logic = 1.4W
– Total chiplet = 3W
– Total chiplet x6 : 18W
• Interposer power sources
– IO ring = 4.5W
– DCDC = 0.6W
– Logic = 2.2W
– Total interposer : 10.3W
Total power : 28.3W
Thermal modelling methodology : [C. Santos, P. Vivet, DAC’17 Design Track]
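The power budget used for the thermal model can be re-derived from its components. In the sketch below, the variable names are illustrative, and one step is an assumption: the 10.3 W interposer total is only reproduced if the 0.6 W DCDC figure is counted once per chiplet (six regulators), which the slide does not state explicitly.

```python
CHIPLET_COUNT = 6

# Chiplet power sources (W), as quoted on the slide.
clusters_w = 4 * 0.4
chiplet_logic_w = 1.4
chiplet_w = clusters_w + chiplet_logic_w

# Interposer power sources (W), as quoted on the slide.
io_ring_w = 4.5
dcdc_w = 0.6              # assumption: per regulator, one regulator per chiplet
interposer_logic_w = 2.2
interposer_w = io_ring_w + CHIPLET_COUNT * dcdc_w + interposer_logic_w

total_w = CHIPLET_COUNT * chiplet_w + interposer_w

print(f"chiplet    : {chiplet_w:.1f} W (x{CHIPLET_COUNT} = {CHIPLET_COUNT * chiplet_w:.1f} W)")
print(f"interposer : {interposer_w:.1f} W")
print(f"total      : {total_w:.1f} W")
```

Counted this way the components add up to the quoted 3 W per chiplet, 10.3 W for the interposer and 28.3 W in total.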
Circuit Overview
• Die technologies
– Chiplet : FDSOI 28nm, ULV + BodyBias, 22mm2
– Active Interposer : CMOS 65nm, MIM option, 200mm2
• 3D technology integration
– µ-bumps, 20µm pitch (150 k)
– TSV middle, 40 µm pitch
– Face2Face assembly on package substrate
– 6 chiplets
[Figure: active interposer front-face, chiplet front-face, 3D integration and final package, 3D cross-section.]
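To put the pitch figures in perspective, the sketch below converts a pitch into a connection density, assuming a regular square grid (an assumption; the actual bump and TSV layouts are not described on the slide). The helper name is illustrative.

```python
def sites_per_mm2(pitch_um: float) -> float:
    """Connection sites per mm2 for a square grid with the given pitch in µm."""
    pitch_mm = pitch_um / 1000.0
    return 1.0 / (pitch_mm * pitch_mm)

UBUMP_PITCH_UM = 20.0    # from the slide
TSV_PITCH_UM = 40.0      # from the slide
CHIPLET_AREA_MM2 = 22.0  # from the slide

ubump_density = sites_per_mm2(UBUMP_PITCH_UM)
tsv_density = sites_per_mm2(TSV_PITCH_UM)

# Upper bound if the whole chiplet face were covered by a full-pitch grid.
max_ubumps_per_chiplet = ubump_density * CHIPLET_AREA_MM2

print(f"µ-bumps : {ubump_density:.0f} per mm2")
print(f"TSVs    : {tsv_density:.0f} per mm2")
print(f"max µ-bump sites on a 22 mm2 chiplet : {max_ubumps_per_chiplet:.0f}")
```

A 20 µm pitch corresponds to about 2500 sites per mm2, so a fully populated 22 mm2 chiplet would offer about 55,000 sites, or about 330,000 over six chiplets. The 150 k µ-bumps quoted on the slide would then correspond to a partially populated array, if that figure is a system total.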
Active Interposer : 3D Cross Sections
[Figure: 3D cross-sections showing the chiplet, copper pillars (20 µm pitch), active interposer, TSVs (40 µm pitch), BGA and BGA via.]
- Correct assembly of interposer onto package
- Correct chiplet alignment stacking achieved
[P. Coudrain, ECTC’2019] [Best Paper Award]
Circuit Performance
• Main performances
– Frequency range : 130 MHz @ 0.5V to 1.15 GHz @ 1.1V, with FDSOI Back-Bias
– Peak performance : 220 GOPS for all 96 cores @ 1.15 GHz
– Best energy efficiency : 9.6 GOPS/W (Coremark) @ 246 MHz @ 0.6V
• Power consumption break-down
– Cores+L1 : ~50% power per chiplet
– Interposer logic & interconnect (w.o. IOs) : 3% only of overall budget
– SCVR : 17% of overall power budget
[Figure: die photo (Chiplet 1 to Chiplet 5 labelled) and chiplet power break-down chart : Cores + L1 55%, Clks 21%, L2 12%, L1-L2 5%, Misc 2%, L3 1%.]
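The peak performance figure can be sanity-checked as cores x frequency x operations per cycle. The sketch below assumes 2 operations per cycle per core (one multiply plus one accumulate, in line with the "peak mult./acc." note in the state-of-the-art comparison); that factor is an assumption, not something stated on this slide.

```python
def peak_gops(cores: int, freq_ghz: float, ops_per_cycle: int) -> float:
    """Peak throughput in giga-operations per second."""
    return cores * freq_ghz * ops_per_cycle

CORES = 96
MAX_FREQ_GHZ = 1.15
OPS_PER_CYCLE = 2  # assumption: one multiply-accumulate counted as 2 operations

peak = peak_gops(CORES, MAX_FREQ_GHZ, OPS_PER_CYCLE)
print(f"peak = {peak:.1f} GOPS")
```

Under this assumption the result is about 220.8 GOPS, consistent with the 220 GOPS quoted above.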
Comparison with State-of-the-Art
First Active Interposer, with fully integrated SCVR, up to 82% efficiency

                            | This work | [4] ISSCC'2018, INTEL | [1] ISSCC'2018, AMD | [2] VLSI'2019, TSMC | [3] ISSCC'2017, INTEL | Units
Technology
Chiplet technology          | FDSOI 28nm | FinFET 14nm | FinFET 14nm | FinFET 7nm | FinFET 14nm |
Interposer technology       | Active, CMOS 65nm | no | MCM substrate | Passive, CoWoS® | EMIB bridge |
Interposer extra features   | yes | N/A | no | no | no |
Total system yield          | High, using active interposer mature technology and low transistor count | N/A | high | high | high |
Die-to-die µbump pitch      | 20 | N/A | > 100 | 40 | 55 | µm
Power Mgt
Voltage Regulator (VR) type | Integrated in interposer, 1 SCVR per chiplet, with MOS+MOM+MIM | On-chip distributed SCVR with MIM | LDO per core, with MIM | no | no |
VR area                     | 34% of active interposer | MIM above 40% of core area | - | N/A | N/A |
VR peak efficiency          | 82% | 72% | LDO limited | N/A | N/A |

[P. Vivet et al., ISSCC’2020]
Comparison with State-of-the-Art
First Active Interposer, with distributed NoC meshes and 3.0 Tb/s/mm2 interfaces, offering a total of 96 cores

                         | This work | [4] ISSCC'2018, INTEL | [1] ISSCC'2018, AMD | [2] VLSI'2019, TSMC | [3] ISSCC'2017, INTEL | Units
Interconnect
Interconnect types       | Distributed NoC meshes for scalable chip-to-chip cache-coherency traffic | N/A | Scalable Data Fabric (SDF) | LIPINCON™ links | AIB interconnect |
3D Plug power efficiency | 0.59 | N/A | 2.0 | 0.56 | 1.2 | pJ/bit
BW density               | 3.0 | N/A | - | 1.6 | 1.5 | Tb/s/mm2
Aggregate 3D bandwidth   | 527 | N/A | - | 640 | 504 | GByte/s
CPU
Number of chiplets       | 6 | 1 | 1 - 4 | 2 | 1 FPGA fabric, 6 transceivers |
Number of cores          | 96 | 18 | 8 - 32 | 8 | FPGA fabric |
Max frequency            | 1.15 | 0.4 | 4.1 | 4 | 1 | GHz
Gops (32b-Integer)       | 220 (peak mult./acc.) | 14.4 | 131.2 - 524.8 | 128 | N/A | Gop/s

[P. Vivet et al., ISSCC’2020]
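One way to read the interconnect rows together is to multiply the aggregate bandwidth by the energy per bit, which gives the power the 3D links would draw if they ran at full bandwidth at the quoted efficiency. This is a hedged illustration rather than a reported measurement; it assumes 1 GByte = 1e9 bytes, and the helper name is illustrative.

```python
def link_power_w(bandwidth_gbyte_s: float, energy_pj_per_bit: float) -> float:
    """Power in W for a link moving bandwidth_gbyte_s at energy_pj_per_bit."""
    bits_per_s = bandwidth_gbyte_s * 1e9 * 8  # assumption: 1 GByte = 1e9 bytes
    return bits_per_s * energy_pj_per_bit * 1e-12

# (aggregate 3D bandwidth in GByte/s, power efficiency in pJ/bit), from the table.
designs = {
    "This work":             (527, 0.59),
    "[2] VLSI'2019, TSMC":   (640, 0.56),
    "[3] ISSCC'2017, INTEL": (504, 1.2),
}

power_w = {name: link_power_w(bw, e) for name, (bw, e) in designs.items()}

for name, p in power_w.items():
    print(f"{name}: {p:.2f} W at full aggregate bandwidth")
```

With the table values this gives roughly 2.5 W for this work, 2.9 W for [2] and 4.8 W for [3], as an upper bound with every link fully active.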
Conclusions and Perspectives
• Active Interposer & chiplet partitioning
– Integration of : interconnects, power management, IOs
– Scalable cache coherency protocol
– 3.0 Tb/s/mm2 3D interface achieved
– Low latency 0.6 ns/mm long-reach asynchronous interconnect
– Power management @ 82% efficiency, close to the cores, w.o. passives
Increase the system energy efficiency and the on-chip memory bandwidth per core
• Mature 3D Design Flow & Tools
• Chiplet Eco-system
– Progressive setup of a chiplet eco-system, for HPC but also e-HPC (Automotive, AI, etc.)
– Active interposer, an enabler for differentiation : integrating heterogeneous functions & chiplets
• Perspectives
– Hybrid Bonding fine pitch (3-5µm target) for optimized chiplet partitioning
– Photonic Interposers, for more chip-to-chip bandwidth & lower latency
Acknowledgments
• Acknowledgments
– Many thanks to all the technology and circuit design teams for their huge contributions
– Many thanks to our partners
• Funding
– This work was partly funded by the French National Program Programme d’Investissements d’Avenir IRT Nanoelec under Grant ANR-10-AIRT-05
• Thank you for your attention
Thank you for your attention