RF-Interconnect for Communications On-Chip

RF-Interconnect for Communications On-Chip

Frank Chang1, Jason Cong2, Adam Kaplan2, Mishali Naik2, Glenn Reinman2

Eran Socher1, Rocco Tam1 Department of Electrical Engineering1 Department

of Computer Science2

Current Trend in CMP - NoC

• 65nm CMOS 80 tile NoC• 10X8 2D mesh network-

on-chip running @ 4GHz• Bisection bandwidth

256GB/s• 1 TFLOPS @ 1V about

98W

ISSCC 2007: An 80-Tile 1.28TFLOPS Network-on-Chipin 65nm CMOS (Sriram Vangal et al., Intel)

What is The Challenge?

• Cores would keep shrinking in size but maintain the same operation frequency (2~4GHz) due to thermal constraints

• More cores would be integrated on the same chip to achieve performance boost through parallelism

• Performance would be limited by the communication efficiency between cores and memories on- and off-chip

The Scaling Trend• Scaling reduces delay of logic gates but not wires

Transistor and Wire Delay Trend in CMOS

0102030405060708090

100

180n

m

130n

m90

nm65

nm45

nm32

nm

Technology Node

Del

ay [

ps] FO4

1mm RC global wire

Repeated 1mm RC globalwire

Traditional Interconnect• Units communicate through a parallel bus using

voltage signaling (charging and discharging the wire capacitance)

• Latency is RC limited (~L2)• Using CMOS repeaters reduces latency (~L) but does

not benefit from scaling• Supply no longer scales due to leakage • Baseband-only signaling requires extensive

equalization• Waste of broad bandwidth available from modern

CMOS devices (ft>150GHz, fmax>250GHz)

Sig

nal P

ower

BasebandSignal

fT

Available Bandwidth

10Tf

Major Interconnect Issues• Latency is large across chip• Bandwidth is RC limited (~1Gbps/wire)• Communication pattern is fixed• Energy consumption is high and not scalable

(~10pJ/bit)• Future microprocessors may encounter

communication congestion and most of the energy will be spent on “talking” instead of computing

How Can RF Help?

• EM waves travel at the (effective) speed of light (~10ps/mm)

• Carrier frequencies can be modulated by modern CMOS with high data rates

• Transmission lines on- or off-chip can guide the waves (RF modulated data) from the transmitter to receiver with recoverable attenuation

RF-Interconnect Concept

f0 f0

Transmission Line

Output BufferMixer Mixer LPF

datain dataout

frequency

Sig

na

l Po

we

r

Datain

0f frequency

Sig

na

l Po

we

r

frequency

Sig

na

l Po

we

r

TransmittedSignal

dataout

• Data transmit through transmission lines at the speed of light, with less dispersion across the band and less baseband interference

• data rate is only limited by CMOS mixer modulation speed

RF-I using Multi-band FDMA

f1

f2

f3

f4

f1

f2

f3

f4

Transmission Line


frequency

Sig

nal P

ow

er Data1

Sig

na

l P

ow

er

frequency

frequency

Sig

na

l P

ow

er

frequency

Sig

nal P

ow

er

frequency

Sig

na

l P

ow

er

Data2

Data3

Data4

frequency

Sig

nal P

ow

er Data1

frequency

Sig

nal P

ow

er

frequency

Sig

nal P

ow

er

frequency

Sig

nal P

ow

er

Data2

Data3

Data4

f1 f2 f3 f4

• More bands are used with same modulation speed at each band

• Higher aggregate data rates can be achieved on the same transmission line

3.6Gbps Multi-drop Multiband Bi-directional RF-I *

0.15ns/div

100m

V/d

iv

4ns/div

100m

V/d

iv

4ns/div

Recovered data eye diagramsRecovered data waveforms

Input data patterns

Data-B

FDMAchip4

Data-B

FDMAchip2

Data-R

FDMAchip3

Data-R

FDMAchip1

Data-B : 1.8Gb/s PRBS through basebandData-R : 1.8Gb/s PRBS through RF-band

Data-R

Data-R

Data-B

Data-B

Data-R

Data-B

Data-R Data-B

10 cm FR4 Interconnect

* World’s 1st Multiband RF-I, Ko & Chang, 2005 ISSCC

Can We Implement RF-I in CMOS?

• Today’s RF-CMOS circuits are in the wireless communication “sweet spots” of 500MHz-5GHz– Insufficient bandwidth for RF-I to be

effective!

• Millimeter-wave CMOS circuits have been developed for 60GHz and recently for 324 GHz bands

CMOS 324GHz Generator

-76dBm before calibration

-46dBm after calibration

*Huang, Larocca and Chang, “324GHz CMOS Frequency Generator using Linear Superposition Technique,” pp. 476- 477, 2008 ISSCC

Frequency Generation in Multiband RF-Interconnect

10GHz 20GHz 30GHz 40GHz 50GHz 60GHzf

Sig

nal S

pect

rum

f6 = 60GHz

f5 = 50GHz

f4 = 40GHz

f3 = 30GHz

f2 = 20GHz

f1 = 10GHz60GHz

10GHz

60GHz

Transmission Line


frequency

Sig

nal

Pow

er

Data1

frequency

Sig

nal

Pow

er

Data6

frequency

Sig

nal

Pow

er

Data1

frequency

Sig

nal

Pow

er

Data6

10GHz

X 6 TX

X6 RX

Multi-Band Synthesizer

Simultaneous Sub-harmonic Injection Locked mm-Wave Frequency Generation

• Using sub-harmonic injection-locked VCOs simultaneous lock to one single reference frequency

• Advantages:– Eliminate PLLs– Low Power

Consumption– Small Area Master VCO

Non-linear HarmonicGenerator

Slave VCOs

Sub-harmonic Injection Locked VCO*

• LC-based VCO core

• Differential pair for odd harmonic generation

• Single-ended even harmonic generation

• Injection locking to high harmonic within locking range of the VCO

M1 M2

M3 M4

Ibias,VCO

Ibias,inj

Vinj

+

-

f

Out+ Out-

VCC,1V

Out buffer

Out buffer

f

3f 3f

ProcessFree Running

Frequency (GHz)

Max locking Range (GHz)

Locking Harmonics Power (mW)

This Work* 90nm CMOS 29.3 5.62nd,4th, 6th, 8th

3rd, 5th, 7th 4

*Sai-Wang Tam, M.-C. Frank Chang, etc…, "Simultaneous Sub-harmonic Injection-Locked mm-Wave Frequency Generators for Multi-band Communications in CMOS", IEEE RFIC Sym., 2008

RF-I using Amplitude shift-Key (ASK) Modulation

• TX: Use transformer couples output of VCO to ASK modulator and use simple modulator to generate RF signal in ASK.

• RX: Use self-mixer for envelope detection. Afterwards a simple buffer and Schmitt Trigger recover the signal to rail-to-rail swing.

Differential Transmission Line

• Loss of 0.6-1.6 dB/mm

Differential TML

RF-I using Amplitude Shift-Key (ASK) Modulation

VCO Output: 60GHZ

ASK modulated Signal

Mixer output5Gbit/s Data input

3DIC ASK RF-I Tested at 11Gbps*Output Eye diagram Output versus input

Input

Output

10ps/div

50mV/div

500ps/div

Coupling Capacito

rTX in Layer

2RX in Layer

1

Die Photo

*Gu and Chang, pp.448-449, 2007

ISSCC (0.33pJ/bit)

Single Channel ASK RF-I Performance Summary

• Simple Architecture: One TX VCO, One Mixer, One RX Buffer

• No synchronization circuits such as PLL or clock data recovery needed in ASK RF-I

• Can expand the same architecture to multi-band RF-I

Process IBM 90nm CMOS Digital Process

RF-Carrier Freq. 60GHz

Data Rate 5Gbit/s

Power TX:2mW RX: 3mW

Energy per bit 1pJ/Bit

Active Area 1300 µm2

21

Future Trends in Multi-band ASK RF-I

Technology # of Carriers data rate per carrier (Gb/s) Total Data rate per wire (Gb/s) Power (mW) Energy per bit(pJ) Area (TX+RX) mm2

Area/Gbit

(µm2/Gbit)

90nm 3RF + 1 BB 5 20 20 1.00 0.022 1100

65nm 4RF + 1 BB 6 30 25 0.83 0.0238 800

45nm 5RF + 1 BB 7 42 30 0.71 0.0228 540

32nm 6RF + 1 BB 8 56 35 0.63 0.0211 380

22nm 7RF + 1 BB 9 72 40 0.56 0.0193 260

22

Interconnect Topology Comparison

2cm Interconnect Data Rate Density

0

2

4

6

8

10

12

14

90nm 65nm 45nm 32nm 22nm

Technology Node

Da

ta R

ate

De

ns

ity

[G

bp

s/u

m]

Bus

RF-I

Optical-I

2cm Interconnect Energy

0

5

10

15

20

25


Technology Node

En

erg

y [p

J/b

it]

Bus

RF-I

Optical-I

2cm Interconnect Latency

0

200

400

600

800

1000

1200

1400

1600


Technology Node

Lat

ency

[p

s]

Bus

RF-I

Optical-I

• Comparison across process technology of…

– Traditional RC parallel bus– RF-Interconnect– Optical Interconnect

• As process technology scales toward 22nm…

– RF-I has lowest latency– RF-I consumes least energy– RF-I has highest data rate density

• RF-I is fully compatible with modern CMOS technology

Advantages of RF-Interconnects

• Latency

• Bandwidth

• Energy

• Reconfigurability

Example: RF-I for CMP NoC Design

• 10x10 mesh of 5-cycle pipelined routers– NoC runs at 2GHz– XY/YX routing

• 64 4GHz 3-wide processor cores containing– 8KB L1 Data Cache– 8KB L1 Instruction Cache

• 32 L2 Cache Banks– 256KB each– Organized as shared

NUCA cache• 4 Main Memory Interfaces

– Labeled with + in the figure

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

R

$R

$R

$R

$R

$R

$

R

$R

$R

$R

$

R

$R

$R

$R

$R

$R

$

R

$R

$R

$R

$R

$R

$

R

$R

$R

$R

$

R

$R

$R

$R

$R

$R

$

R R

R R

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

RC

R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$

R

$R

$R

$R

$R

$R

$R

$R

$

R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$

R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$

R

$R

$R

$R

$R

$R

$R

$R

$

R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$R

$

RR RR

RR RR

R (square) = router

C (circle) = processor core

$ (diamond) = L2 cache bank

+ (plus) = main memory interface

MORFIC: Mesh Overlaid with RF-InterConnect

• Shared Z-shaped RF waveguide• Organized as 8 bidirectional

shortcut links• Each direction of each shortcut

can transmit simultaneously over shared medium

• Router A can send a flit to other router A, B to B, … H to H in a single cycle

• Router labeled X cannot directly send to any router not labeled X – E.g. Router B in upper left cannot

send to router E in upper right directly

– However, B in upper left can send to B in upper right, and then north to E using normal mesh link

A

F

A

C

C

B B

D

D E

E

G

G

H

H

F

PHYSICAL ORGANIZATIONLOGICAL ORGANIZATION

A

F

A

C

C

B B

D

D E

E

G

G

H

H

F

MORFIC Results For 256B Total RF-I [HPCA’2008]

• 256B RF-I consumes 0.18% silicon overhead on 400mm2 die– RF-I components: 0.13%, Router overhead: 0.05%

• Normalized Splash-2 Execution Time and Average Packet Latency Results

– Normalized to baseline mesh run-cycles/latency at 1– Average 13% (max 18%) performance improvement– Average 22% (max 24%) packet latency improvement

0.740.760.780.800.82

fft

radi

x

wat

er-s

p

wat

ern^

2 lu

ocea

n

barn

es

Nor

mal

ized

Avg

Pac

ket

Lat 256B RF-I

0.750.800.850.900.95

fft

radi

x

wat

er-s

p

wat

ern^

2 lu

ocea

n

barn

es

Nor

mal

ized

Run

Cyc

les 256B RF-I

The Bad News …Most Interconnect Optimization Techniques May

Not be Relevant …• Performance-driven interconnect design based on distributed RC delay model - all 10 versions »

Jason Cong, Kwok-Shing Leung, and Dian Zhou, Design Automation Conference 1993, Cited by 141 - Related Articles - Web Search - Library Search

• Interconnect design for deep submicron ICs - all 25 versions »J Cong, L He, KY Khoo, CK Koh, Z Pan - Proc. Int. Conf. on Computer Aided Design, 1997 - doi.ieeecomputersociety.orgCited by 139 - Related Articles - Web Search

• Efficient algorithms for the minimum shortest path Steiner arborescence problem with applications to … - all 11 versions »Jason Cong, Andrew B. Kahng, and Kwok-Shing Leung, IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 17, NO. 1, JANUARY 1998

Cited by 127 - Related Articles - Web Search

• Buffer block planning for interconnect-driven floorplanning - all 21 versions »J Cong, T Kong, DZ Pan - Proc. Int. Conf. Computer-Aided Design, 1999 - doi.ieeecomputersociety.orgCited by 130 - Related Articles - Web Search

… (from Google Scholar)

http://scholar.google.com/scholar?hl=en&lr=&cluster=7757023979139065399

Good News -- Plenty of New Problems for Future PhD Students

• How many/which routers should be RF-enabled?

– How many RF-I ports should each router have?• Dedicated or multiplexed with other ports?

• How much RF-I bandwidth to allocate?

– Total? Per communicating pair?

– Impacts active layer area consumed by RF-I components

• Which routing strategy to employ in presence of RF-I express channels?

• Dynamic or static allocation of frequency bands to sources/destinations

– Dynamic: requires arbitration overhead for channel assignment

– Static: may miss opportunity to match changing communication demand

• Support of multi-cast

Example: Deadlock: To Avoid or Confront?

• South-Last Strategy [Ogras and Marculescu, 2006]

– Routes which can lead to circular buffer dependence are forbidden avoids deadlock

• Deadlock Detection & Recovery (DDR)– Based on Duato and Pinkston’s theory [Duato and

Pinkston 2001]

• If deadlock occurs, route all packets in the network on a spare virtual channel

• Use deadlock-free XY-routing• Packets entering network after this point may be

routed normally

Deadlock Results

– South-Last strategy too restrictive• Halves the average realizable performance

– Deadlock is best detected and recovered from when it occurs• Detection happens reasonably quickly• Performance during recovery no worse than baseline

Example: RF-I Topology and Bandwidth Optimization

• For each channel– Source and destination may

be reconfigured via frequency-band reassignment

• Can assign variable # of channels to each source, destination pair (s,d) – critical channels given more

bandwidth

• A flexible means to reconfigure topology

PHYSICAL

LOGICAL ALOGICAL B

Variance In Communication PatternsVariance In Communication Patterns

Mpeg2Enc time varying behavior

1

10

100

1,000

10,000

100,000

1,000,000

1 15 29 43 57 71 85 99 113 127 141 155 169 183 197 211 225 239

inte rv al (250k cycle s)

ev

en

t c

ou

nt

L2 ACCESS NW INJECT BW STALL FLITS SENT

WaterSpatial time varying behavior

1

10

100

1,000

10,000

100,000

1,000,000

10,000,000

1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85

inte rv al (250k cycle s)

ev

en

t c

ou

nt

L2 ACCESS NW INJECT BW STALL FLITS SENT

Conclusions

• RF-I on CMOS is real• RF-I is a very promising solution to global

interconnect bottleneck– Latency– Bandwidth– Energy– Reconfigurability

• RF-I introduces many interesting physical and architecture design problems in NoC designs

RF-Interconnect for Communications On-Chip

Documents

Transcript of RF-Interconnect for Communications On-Chip