3D Technologies and Architectures for High ... - IRT Nanoelec

INSTITUT DE RECHERCHE TECHNOLOGIQUE 3D Technologies and Architectures for High Performance Computing 13/03/2020 3D Technologies and Architectures for High Performance Computing, P. Vivet, DATE'2020 IRT Workshop


Page 1: 3D Technologies and Architectures for High ... - IRT Nanoelec

INSTITUT DE RECHERCHE TECHNOLOGIQUE

3D Technologies and Architectures for High Performance Computing

13/03/2020

3D Technologies and Architectures for High Performance Computing, P. Vivet, DATE'2020 IRT Workshop

Page 2: 3D Technologies and Architectures for High ... - IRT Nanoelec

High Performance Computing & Big Data

• More cores + more accelerators + more memory
– Similar constraints are appearing for embedded HPC (Automotive, etc.)
– Need both highly optimized generic and specialized functions (e.g. ML/AI accelerators)
– Need a « go-to-market » solution for sustainable system differentiation

• System designers must offer :
– Modular and cost-effective solutions
– Energy efficiency of the system infrastructure
– More on-chip memory bandwidth per core

With advanced CMOS issues, the « Single Die » solution is not viable anymore


Page 3: 3D Technologies and Architectures for High ... - IRT Nanoelec

Chiplet Partitioning

• Chiplet motivations
– Cost driven
– Modularity driven
– Heterogeneous integration

• Chiplet challenges ?
– Eco-system maturity
– Technology & Architecture partitioning
– Chiplet interfaces, testability, 3D CAD flow, etc.

using 3D technologies

[D. Dutoit, Keynote, 3DIC’2014]


Page 4: 3D Technologies and Architectures for High ... - IRT Nanoelec

Chiplet Partitioning : Solutions and Limitations

• Existing technologies
– Organic substrates : AMD, 4-chiplet circuit, ISSCC’2018
– Passive interposer (2.5D) : TSMC, CoWoS, VLSI’2019
– Silicon bridges : INTEL, EMIB bridge, ISSCC’2017

• But, some limitations
– Chiplet communication limited to side-by-side communication, not scalable
– How to integrate heterogeneous chiplets & differentiating functions ?
– How to integrate less-scalable functions (IOs, analog, power management) ?

Page 5: 3D Technologies and Architectures for High ... - IRT Nanoelec

Active Interposer : Principle

• Chiplets : clusters of cores

• Active interposer : mature CMOS technology (with low logic density to preserve system cost)

• Additional features
– SoC infrastructure : Analog, IOs, PHY, DFT
– Power management : close to cores
– Scalable & distributed NoCs : any chiplet-to-chiplet traffic

Page 6: 3D Technologies and Architectures for High ... - IRT Nanoelec

Outline

• Introduction on Chiplet partitioning

• INTACT : An Active Interposer
– Circuit Overview
– Design Building blocks
– 3D Design Flow

• INTACT : Circuit Results

• Conclusions & Perspectives

Page 7: 3D Technologies and Architectures for High ... - IRT Nanoelec

6 Chiplets 3D-stacked on an Active Interposer

Chiplet Overview
• 4 clusters of 4 cores
• Distributed L1$ + L2$ + L3$
• Scalable Cache Coherency

Active Interposer
• Distributed flexible interconnects
• Integrated SCVRs (1/chiplet)
• Memory Controller & System IOs
• SoC Infrastructure, DFT

[Figure: cross-section of the 3D stack. Each chiplet (16 cores) holds Cluster 0 to Cluster 3, four L3 tiles, SoC infrastructure, and 3D Plug(s). The active interposer holds Power Management, Cfg, Memory-IO, and Distributed NoCs (routers & pipelined links), plus Clk, Rst, Config, Test. Labels: µ-bumps Ø10µm, C4 bumps Ø90µm, Package Substrate, Balls Ø500µm, off-chip links, 1.5 - 2.5 VDD-chiplet, 1.2 VDD-interpo]

Page 8: 3D Technologies and Architectures for High ... - IRT Nanoelec

6 Chiplets 3D-stacked on an Active Interposer

96 cores, in 6 chiplets, 3D-stacked on an active CMOS interposer

• 6 Chiplets (FDSOI28)
• Active Interposer (CMOS65)

Two technology nodes of difference between the chiplets & the bottom die


Page 9: 3D Technologies and Architectures for High ... - IRT Nanoelec

Chiplet Main Features

• 16 x MIPS ® 32-bit scalar cores
• Memory is physically distributed through chiplet L2-caches + Virtual Memory support
– L1 I-caches + D-caches (16 kB / core)
– Distributed Shared L2-caches (256 kB / cluster)
– Adaptive & fault tolerant L3-caches (4 tiles of 1 MB)
• Directory-based cache coherence with linked-list directory [5]
• 2D-mesh NoCs, extended through the interposer
• FDSOI 28nm, LPLV, [0.5-1.3V], with Body Biasing
– FLLs, Timing Fault Sensors, Thermal Sensors

[5] E. Guthmuller et al., “A 29 Gops/Watt 3D-Ready 16-Core Computing Fabric with Scalable Cache Coherent Architecture Using Distributed L2 and Adaptive L3 Caches”, ESSCIRC’2018.

[Figure: chiplet links from/to the active interposer: L1-L2, L2-L3, L3-ExtMem]
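As a rough cross-check, the per-chiplet cache figures can be scaled to the whole 6-chiplet system. The sketch below is illustrative only: the constant and function names are hypothetical, and it assumes the 16 kB L1 figure applies to each of the I-cache and D-cache, which the slide does not state explicitly.

```python
# Hypothetical helper that scales the per-chiplet cache figures to the
# full system. Organisation (6 chiplets, 4 clusters, 4 cores) is taken
# from the circuit overview slides.
CHIPLETS = 6
CLUSTERS_PER_CHIPLET = 4
CORES_PER_CLUSTER = 4

L1_KB_PER_CACHE = 16       # assumption: 16 kB for each of L1-I and L1-D
L2_KB_PER_CLUSTER = 256    # distributed shared L2
L3_TILES_PER_CHIPLET = 4
L3_KB_PER_TILE = 1024      # 1 MB tiles


def cache_totals_kb():
    """Return the core count and total cache capacity per level, in kB."""
    clusters = CHIPLETS * CLUSTERS_PER_CHIPLET
    cores = clusters * CORES_PER_CLUSTER
    return {
        "cores": cores,
        "L1": cores * 2 * L1_KB_PER_CACHE,   # I-cache + D-cache per core
        "L2": clusters * L2_KB_PER_CLUSTER,
        "L3": CHIPLETS * L3_TILES_PER_CHIPLET * L3_KB_PER_TILE,
    }


totals = cache_totals_kb()
print(totals)
```

Under these assumptions the system has 96 cores and 24 MB of L3 in total. If the 16 kB figure is instead the combined I+D capacity, the L1 total halves.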

Page 10: 3D Technologies and Architectures for High ... - IRT Nanoelec

System Level Interconnects

• Distributed & flexible interconnects within the active interposer
– Multiple Network-on-Chips (routers + links)
– 3D-Plug communication IPs : synchronous & asynchronous versions

• Chiplet-to-Chiplet Communication Schemes
– Passive links, short reach (L1-L2)
– Active links, long reach (L2-L3, L3-ExtMem), allow scalable traffic from any chiplet to any chiplet

[Figure: two chiplets (16 cores each) on the active interposer. Each chiplet shows four clusters of PE0 to PE3 with L1I and L1D caches and a shared L2 $, four L3 $ tiles, TG, TAP, FLL, Thermal Sensor, SPI, UART, and 3D Plug(s). The active interposer shows routers (R), links from the previous chiplet and to the next chiplet, and Memory-IO. Link legend: L1-L2, short reach, passive; L2-L3, long reach, async., active; L3-Ext-Mem, sync., active]

Page 11: 3D Technologies and Architectures for High ... - IRT Nanoelec

3D-Plug Communication IP : layout overview

3D-Plug :
• Logic interface
• µ-bumps, 20µm pitch
• µ-buffer std-cells : BiDir Driver + ESD + Pull-Up + Level-Shifter
• DFT

[Figure: chiplet layout, showing 3D-Plug interfaces and µ-buffer std-cells]

Page 12: 3D Technologies and Architectures for High ... - IRT Nanoelec

System Level Interconnects : Comparison

• 3D-Plug
– Best throughput for synchronous version (1.25 GHz)
• Interposer
– Similar throughput between SNOC & ANOC (~1 GHz)
– Best latency for ANOC, 0.6 ns/mm (3-5x wrt. SNOC)
Latency reduction, for cache coherency traffic, at the cost of energy

Parameter | L1-L2 | L2-L3 | L3-EXT-MEM | Units
Link type | Passive, sync. | Active, async. | Active, sync. |
3D Plug frequency | 1.25 | 0.52 | 1.21 | GHz
2D NoC frequency | 1.00 | 0.97 | 0.75 | GHz
End to end latency * | 44 | 4 + async. | 37 | cycles
End to end latency * | 44.0 | 15.2 | 49.5 | ns
Propagation speed | 2.9 | 0.6 | 2.0 | ns/mm
Energy / bit / mm | 0.15 | 0.52 | 0.24 | pJ/bit/mm

* A => B end-to-end latency (endpoints A and B are marked on the slide figure)

Combination of interconnect types to achieve performance trade-offs
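The trade-off between the three link types can be made concrete with a small calculation. The sketch below copies the propagation and energy figures from the comparison table; the dictionary layout, the function names, and the 10 mm / 64-bit example are hypothetical. Converting cycles to nanoseconds with the 2D NoC frequency is also an assumption: it reproduces the 44.0 ns of the L1-L2 link exactly and gives about 49.3 ns for L3-EXT-MEM against the 49.5 ns reported.

```python
# Figures copied from the comparison table; the structure is illustrative.
LINKS = {
    "L1-L2":      {"noc_ghz": 1.00, "ns_per_mm": 2.9, "pj_per_bit_mm": 0.15},
    "L2-L3":      {"noc_ghz": 0.97, "ns_per_mm": 0.6, "pj_per_bit_mm": 0.52},
    "L3-EXT-MEM": {"noc_ghz": 0.75, "ns_per_mm": 2.0, "pj_per_bit_mm": 0.24},
}


def cycles_to_ns(cycles, freq_ghz):
    """Convert a cycle count to nanoseconds at the given clock (GHz)."""
    return cycles / freq_ghz


def wire_cost(link, length_mm, bits):
    """Return (propagation delay in ns, energy in pJ) for one transfer."""
    p = LINKS[link]
    delay_ns = p["ns_per_mm"] * length_mm
    energy_pj = p["pj_per_bit_mm"] * length_mm * bits
    return delay_ns, energy_pj


# Assumption: the latency in cycles is counted at the 2D NoC frequency.
print(cycles_to_ns(44, LINKS["L1-L2"]["noc_ghz"]))

# Hypothetical 10 mm hop carrying one 64-bit word on each link type.
for name in LINKS:
    delay_ns, energy_pj = wire_cost(name, 10.0, 64)
    print(f"{name}: {delay_ns:.1f} ns, {energy_pj:.1f} pJ")
```

For the same distance, the asynchronous L2-L3 link has the shortest propagation delay but the highest energy per bit, which matches the slide's remark that latency is reduced at the cost of energy.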

Page 13: 3D Technologies and Architectures for High ... - IRT Nanoelec

Switched Cap Voltage Regulators : Principle

• Distributed power supply units
– DVFS local scheme, below each chiplet
– Fast transitions & reduced IR-drop effects
– “High” input voltage (up to 2.5V), reduces #PG IOs in the package

• Fully Integrated
– No external passive components, thick oxide transistors
– On-chip CAPs only (MOS+MOM+MIM, 8.9 nF/mm²)
– 50% of chiplet area, fault tolerant, in the interposer
– PowerGrid delivery as a µ-bump flip-chip matrix

[Figure: power delivery path. Labels: VIN, TSVs, DC-DC converter (Power Unit), µ-bumps, VOUT P/G to chiplet, Digital IC]
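The link between a higher input voltage and fewer power/ground IOs follows from I = P / V: for a fixed delivered power, the current through the package drops as the input voltage rises. The sketch below is a back-of-the-envelope illustration, not the authors' sizing method. The 3 W load, the 1.0 V direct supply, and the 82% conversion efficiency are illustrative assumptions, and the function name is hypothetical.

```python
def input_current_a(load_w, vin_v, efficiency=1.0):
    """Current drawn at the package for a given power delivered to the load."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    if vin_v <= 0.0:
        raise ValueError("vin_v must be positive")
    return load_w / (efficiency * vin_v)


LOAD_W = 3.0  # assumption: illustrative per-chiplet load

# Hypothetical baseline: chiplet supplied directly at 1.0 V, no regulator.
direct_a = input_current_a(LOAD_W, vin_v=1.0)

# Regulator fed at 2.5 V, with an assumed 82% conversion efficiency.
via_scvr_a = input_current_a(LOAD_W, vin_v=2.5, efficiency=0.82)

print(f"direct: {direct_a:.2f} A, via SCVR: {via_scvr_a:.2f} A")
```

Even after paying for the conversion loss, the current entering the package is roughly halved in this example, so fewer power and ground connections are needed for the same current per connection.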

Page 14: 3D Technologies and Architectures for High ... - IRT Nanoelec

Design-for-Test (DFT) : Challenges and Solutions

Test challenges
• Chiplet Known-Good-Die (KGD) sorting
• Testability for active interposer & final package
• Reduced test access due to 3D fine pitch interconnect
– Chiplet EWS test performed on a regular IO ring

Flexible 3D-DFT architecture, including :
• Parallel scan chains, with test compression
– Test time & IO driven
• IJTAG test interface, using the IEEE 1687 standard
– Boundary Scan Chains, for testing 3D interconnections
– SIB, for Memory BISTs & Repair
• 3D Design for Test tool
– Tessent tool, for DFT insertion & ATPG

[J. Durupt et al., ETS’2016]


Page 16: 3D Technologies and Architectures for High ... - IRT Nanoelec

3D Design Flow : Main Achievements

3D Architecture
- 3D Netlist partitioning
- 3D Circuit/package co-design
- 3D Thermal & Power Profile

3D Physical Implementation
- 2D Synthesis
- 3D Floorplanning
- 2D/3D Place & Route
- 3D DFT & ATPG

3D Sign-Off
- 3D DRC & LVS verif
- 3D Thermal validation
- 3D Parasitic extraction/TA
- 3D Electromigration
- 3D IR-Drop analysis

Tools
- XPEDITION tools : for future large 3D systems (HPC application)
- SAHARA tool : for thermal profiling at system level
- TESSENT tools : DFT archi & ATPG tools
- Sign-Off with CALIBRE tools
  - DRC/LVS => 3DStack
  - Thermal => SAHARA
  - Parasitics => PEX, xACT
  - Electromigration => DENALI

Standardization Efforts

More challenges to come for next 3D Technologies
- Fine pitch Hybrid Bonding (1µm)
- CoolCube (100nm)

Close co-operation & optimization between 3D CAD tools & 3D circuit design

Page 17: 3D Technologies and Architectures for High ... - IRT Nanoelec

3D Design Flow : Thermal Exploration

• Thermal exploration & flow using SAHARA tool prototype
• Full circuit thermal model, including circuit layer, detailed power maps, package description, and heat sink selection.

[Figure: layer stack. BEOL (10 Cu layers + 1 Al layer) ~8µm; BEOL (7 Cu layers + 1 Al layer) ~7µm; SiON passivation 2µm + RDL Cu 2µm + organic passivation 3µm]

[Figure: design flow. Physical Implementation: partitioning, floorplanning, place & CTS, signal routing, timing closure, sign-off. Package Design: package selection, package optimization. Thermal: thermal exploration, thermal sign-off]

Chiplet power sources
– Clusters = 4x 0.4W
– Logic = 1.4W
– Total chiplet = 3W
– Total chiplet x6 : 18W

Interposer power sources
– IO ring = 4.5W
– DCDC = 0.6W
– Logic = 2.2W
– Total interposer : 10.3W

Total power : 28.3W

Thermal modelling methodology [C. Santos, P. Vivet, DAC’17 Design Track]
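The power budget used for the thermal model can be re-added as a sanity check. The sketch below is illustrative, and its variable names are hypothetical. It assumes the 0.6 W DCDC figure is per regulator, with one regulator per chiplet: that reading is not stated on the slide, but it is the one under which the listed items add up to the 10.3 W interposer total.

```python
N_CHIPLETS = 6

# Per-chiplet power sources, in watts, as listed on the slide.
chiplet_sources = {"clusters": 4 * 0.4, "logic": 1.4}
chiplet_w = sum(chiplet_sources.values())
all_chiplets_w = N_CHIPLETS * chiplet_w

# Interposer power sources, in watts.
# Assumption: DCDC = 0.6 W per regulator, one regulator per chiplet.
interposer_sources = {
    "io_ring": 4.5,
    "dcdc": N_CHIPLETS * 0.6,
    "logic": 2.2,
}
interposer_w = sum(interposer_sources.values())

total_w = all_chiplets_w + interposer_w

print(f"chiplet: {chiplet_w:.1f} W, all chiplets: {all_chiplets_w:.1f} W")
print(f"interposer: {interposer_w:.1f} W, total: {total_w:.1f} W")
```

With a single 0.6 W DCDC entry the interposer items would sum to 7.3 W, so the per-chiplet reading is the one consistent with the slide's totals.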


Page 19: 3D Technologies and Architectures for High ... - IRT Nanoelec

Circuit Overview

• Die technologies
– Chiplet : FDSOI 28nm, ULV + BodyBias, 22 mm²
– Active Interposer : CMOS 65nm, MIM option, 200 mm²

• 3D technology integration
– µ-bumps, 20µm pitch (150 k)
– TSV middle, 40µm pitch
– Face2Face assembly on package substrate
– 6 chiplets

[Figure: active interposer front-face; chiplet front-face; 3D integration and final package; 3D cross-section]

Page 20: 3D Technologies and Architectures for High ... - IRT Nanoelec

Active Interposer : 3D Cross Sections

[Figure: 3D cross-sections of the stack, showing the chiplet, active interposer, and BGA; copper pillars (20 µm pitch), TSVs (40 µm pitch), BGA via]

- Correct assembly of interposer onto package
- Correct chiplet alignment stacking achieved

[P. Coudrain, ECTC’2019] [Best Paper Award]

Page 21: 3D Technologies and Architectures for High ... - IRT Nanoelec

Circuit Performance

• Main performances
– Freq in [130 MHz @ 0.5V – 1.15 GHz @ 1.1V] with FDSOI Back-Bias
– Peak performance : 220 GOPS for all 96 cores @ 1.15 GHz
– Best energy efficiency : 9.6 GOPS/W (Coremark) @ 246 MHz @ 0.6V

• Power consumption break-down
– Cores+L1 : ~50% power per chiplet
– Interposer logic & interconnect (w.o. IOs) : only 3% of overall budget
– SCVR : 17% of overall power budget

[Figure: power break-down chart, with chiplet labels Chiplet 1 to Chiplet 5. Chiplet shares: Cores + L1 55%, Clks 21%, L2 12%, L1-L2 5%, L3 1%, Misc 2%]
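The peak-performance figure can be related to the core count and clock frequency. The sketch below is a hedged reconstruction, not a statement of how the authors computed it: it assumes each core retires one multiply-accumulate per cycle, counted as 2 operations, and the function name is hypothetical.

```python
def peak_gops(cores, freq_ghz, ops_per_cycle):
    """Peak throughput in Gop/s: cores x clock (GHz) x operations per cycle."""
    return cores * freq_ghz * ops_per_cycle


# Assumption: 1 multiply-accumulate per cycle per core, counted as 2 ops.
OPS_PER_CYCLE = 2

peak = peak_gops(cores=96, freq_ghz=1.15, ops_per_cycle=OPS_PER_CYCLE)
print(f"{peak:.1f} Gop/s")
```

Under that assumption the result is about 220.8 Gop/s, in line with the 220 GOPS quoted for all 96 cores at 1.15 GHz.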

Page 22: 3D Technologies and Architectures for High ... - IRT Nanoelec

Comparison with State-of-the-Art

First Active Interposer, with fully integrated SCVR, up to 82% efficiency

Parameter | This work | [4] ISSCC'2018 INTEL | [1] ISSCC'2018 AMD | [2] VLSI'2019 TSMC | [3] ISSCC'2017 INTEL | Units

Technology
Chiplet Technology | FDSOI 28nm | FinFET 14nm | FinFET 14nm | FinFET 7nm | FinFET 14nm |
Interposer Technology | Active CMOS 65nm | no | MCM substrate | Passive CoWoS ® | EMIB bridge |
Interposer extra features | yes | N/A | no | no | no |
Total system yield | High, using active interposer mature technology and low transistor count | N/A | high | high | high |
Die-to-Die µbump pitch | 20 | N/A | > 100 | 40 | 55 | µm

Power Mgt
Voltage Regulator (VR) type | Integrated in interposer, 1 SCVR per chiplet with MOS+MOM+MIM | on-chip distributed SCVR with MIM | LDO per core, with MIM | no | no |
VR area | 34% of active interposer | MIM above 40% of core area | - | N/A | N/A |
VR peak efficiency | 82% | 72% | LDO limited | N/A | N/A |

[P. Vivet et al., ISSCC’2020]

Page 23: 3D Technologies and Architectures for High ... - IRT Nanoelec

Comparison with State-of-the-Art

First Active Interposer, with distributed NoC meshes and 3.0 Tb/s/mm² interfaces, offering a total of 96 cores

Parameter | This work | [4] ISSCC'2018 INTEL | [1] ISSCC'2018 AMD | [2] VLSI'2019 TSMC | [3] ISSCC'2017 INTEL | Units

Interconnect
Interconnect types | Distributed NoC meshes for scalable chip-to-chip cache-coherency traffic | N/A | Scalable Data Fabric (SDF) | LIPINCON™ links | AIB interconnect |
3D Plug power efficiency | 0.59 | N/A | 2.0 | 0.56 | 1.2 | pJ/bit
BW density | 3.0 | N/A | - | 1.6 | 1.5 | Tb/s/mm²
Aggregate 3D bandwidth | 527 | N/A | - | 640 | 504 | GByte/s

CPU
Number of chiplets | 6 | 1 | 1 - 4 | 2 | 1 FPGA fabric, 6 transceivers |
Number of cores | 96 | 18 | 8 - 32 | 8 | FPGA fabric |
Max Frequency | 1.15 | 0.4 | 4.1 | 4 | 1 | GHz
Gops (32b-Integer) | 220 (peak mult./acc.) | 14.4 | 131.2 - 524.8 | 128 | N/A | Gop/s

[P. Vivet et al., ISSCC’2020]


Page 25: 3D Technologies and Architectures for High ... - IRT Nanoelec

Conclusions and Perspectives

• Active Interposer & chiplet partitioning
– Integration of : interconnects, power management, IOs
– Scalable cache coherency protocol
– 3 Tbit/s/mm² 3D interface achieved
– Low latency 0.6 ns/mm long-reach asynchronous interconnect
– Power management @ 82% efficiency, close to the cores, w.o. passives
Increase the system energy efficiency and the on-chip memory bandwidth per core

• Mature 3D Design Flow & Tools

• Chiplet Eco-system
– Progressive setup of a chiplet eco-system, for HPC but also e-HPC (Automotive, AI, etc.)
– Active interposer, an enabler for differentiation : integrating heterogeneous functions & chiplets

• Perspectives
– Hybrid Bonding fine pitch (3-5µm target) for optimized chipletization
– Photonic Interposers, for more chip-2-chip bandwidth & lower latency

Page 26: 3D Technologies and Architectures for High ... - IRT Nanoelec

Acknowledgments

• Acknowledgments
– Many thanks to all the technology and circuit design teams for their huge contributions
– Many thanks to our partners

• Funding
This work was partly funded by the French National Program Programme d’Investissements d’Avenir IRT Nanoelec under Grant ANR-10-AIRT-05

• Thank you for your attention

Page 27: 3D Technologies and Architectures for High ... - IRT Nanoelec

Thank you for your attention