Rev PA1Rev PA1 1

Performance-energy trade-offs with Silicon Photonics

Sébastien Rumley, Robert Hendry, Dessislava Nikolova, Keren Bergman

Goal of the study

• Suppose (silicon photonics based) optical data movement between end-points

– Small connectivity (4 – 16)

– Between chips (not on the same chip), potentially several meters apart

• What is the design space?

– Selection of the “topology”

– Choice of optical devices

– Amount of WDM parallelism

– Type of modulation and rate

Topology selection

[Figure: topology diagrams with Send and Receive end-points]

• Basically, all-to-all, switched, or bus

• … and all the possible combinations thereof, or hybrids

• But let us start by analyzing two “extremes” of this design space:

– All-to-all (a.k.a. full mesh)

– Switched (a.k.a. star network)

Other aspects

• Type of modulation and rate

– Simply 10 Gb/s per channel, OOK: considered a good trade-off between SERDES complexity and optical channel utilization

• To be extended in the future

• Choice of optical devices and amount of WDM parallelism:

– Interrelated!

– Optical device parameters have to be optimized for a given number of wavelengths AND for a given topology

• The worst-case path determines the parameters, and the maximal number of channels supported

Design space: between 1 and max

Selecting the max is NOT the obvious choice!

[Diagram: Topology, Optical devices, Number of channels]

[1] S. Rumley, et al., "Modeling Silicon Photonics in Distributed Computing Systems: From the Device to the Rack".

[2] R. Hendry, D. Nikolova, S. Rumley, N. Ophir, K. Bergman, "Physical layer analysis and modeling of silicon photonic WDM bus architectures".

[3] R. Hendry, et al., "Modeling and Evaluation of Chip-to-Chip Scale Silicon Photonic Networks," IEEE Symposium on High Performance Interconnects (HOTI), 2014.

Why shouldn’t the number of channels (hence the bandwidth) always be maximized?

• Each channel (color) needs its own modulator and detector devices

• Each channel needs its own amount of initial optical power

– Provided by a (so far, rather inefficient) laser

This laser power dominates other power requirements

More channels generally DO NOT make the system MORE energy efficient

• More channels induce inter-channel effects. To (partly) compensate for those, more initial optical power is required

– More channels also mean bigger, more “lossy” optical devices

More channels generally DO make the system LESS energy efficient

• Ideal (power-wise) number of channels: 1 (but adding a few will not drastically change the per-channel consumption)

– Except in cases involving devices whose consumption is independent of the number of channels (common to all channels)

• In these cases, the ideal (power-wise) number of channels is larger than one

Relation energy efficiency - channels

• When going from POWER to ENERGY-PER-BIT efficiency, the utilization plays a major role

• For a FIXED load (traffic, average network activity over time), the energy-per-bit looks like this

[Figure: energy-per-bit (J) versus number of channels; x-axis markers: 1 channel, average load / channel rate, max channels. Annotations:]

– Flat if the resulting bandwidth is lower than the load (resulting in 100% utilization, and buffer overflow)

– Proportional to the number of channels (each channel consumes power almost independently of the utilization)

– For a high number of channels, optical signal effects affect the power consumption super-linearly (for a low number, this is negligible)
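
The three regimes described above can be reproduced with a toy model. The sketch below is illustrative only: the function name `energy_per_bit`, the 10 mW per-channel power, and the quadratic penalty standing in for inter-channel effects are all assumptions, not values or models from the study.

```python
def energy_per_bit(n_channels, load_bps, channel_rate_bps=10e9,
                   power_per_channel_w=10e-3, penalty_coeff=1e-4):
    """Toy energy-per-bit (J/bit) of one WDM link under a fixed offered load.

    All default parameter values are hypothetical placeholders.
    """
    # Per-channel power grows with the channel count; the quadratic term is an
    # assumed stand-in for inter-channel effects and lossier devices.
    per_channel_w = power_per_channel_w * (1.0 + penalty_coeff * n_channels ** 2)
    total_power_w = n_channels * per_channel_w
    # Delivered throughput is capped by both the offered load and the capacity.
    delivered_bps = min(load_bps, n_channels * channel_rate_bps)
    return total_power_w / delivered_bps


if __name__ == "__main__":
    load = 40e9  # 40 Gb/s offered load, i.e. saturation up to 4 channels
    for n in (1, 2, 4, 8, 16, 50, 100):
        print(n, energy_per_bit(n, load))
```

With these placeholder values, the energy-per-bit stays nearly constant while the link is saturated, roughly doubles when the channel count doubles beyond that point, and grows faster than linearly for large channel counts.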

So, how many channels?

• From a computer architecture point of view, more channels, hence more bandwidth, are generally welcome

– Less queuing time when links are heavily loaded

– SHORTER serialization times

[Figure: latency (log) versus number of channels (log); x-axis markers: average load / channel rate, max channels. Curves: serialization time (inversely proportional to the bandwidth), queuing time, and their sum (head-to-tail latency)]
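
As a rough illustration of how the two latency components add up, the sketch below uses the textbook M/D/1 waiting-time formula (Poisson arrivals, fixed-size packets). This queuing model, the function name `head_to_tail_latency`, and the 8000-bit packet size are assumptions made for the example; the study's own simulation model may differ.

```python
import math


def head_to_tail_latency(n_channels, load_bps, packet_bits=8000,
                         channel_rate_bps=10e9):
    """Serialization plus mean queuing time (seconds) for one link.

    Assumes every packet is striped over all channels and that the link
    behaves as an M/D/1 queue; both are simplifying assumptions.
    """
    capacity_bps = n_channels * channel_rate_bps
    serialization_s = packet_bits / capacity_bps
    utilization = load_bps / capacity_bps
    if utilization >= 1.0:
        # At or above saturation the queue grows without bound.
        return math.inf
    # Mean M/D/1 waiting time: rho * S / (2 * (1 - rho)).
    queuing_s = utilization * serialization_s / (2.0 * (1.0 - utilization))
    return serialization_s + queuing_s


if __name__ == "__main__":
    for n in (4, 5, 8, 16, 64):
        print(n, head_to_tail_latency(n, load_bps=40e9))
```

Both terms shrink as channels are added: the serialization time falls inversely with the bandwidth, and the queuing time falls even faster once the link moves away from saturation.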

Performance-energy trade-off for a link

• Plotting one against the other

[Figure: axes are energy-per-bit (J) (log) and head-to-tail latency (log). Annotations:]

– Optical signal effects with a high number of channels

– High latency (saturation, overflow)

– Trade energy efficiency for latency

– Trade latency for energy efficiency

Going back to the topology choice

• In the all-to-all case, the total number of channels (i.e. the bisection bandwidth) is a multiple of N(N-1)

– So at least N(N-1)

• With N=16: 240 channels, i.e. 2.4 Tb/s

– At most the maximum number supported by a link (typically 100*) times N(N-1)

• With N=16: 24,000 channels, i.e. 240 Tb/s

• In the switched case, it is a multiple of N

– So at least N

• With N=16: 16 channels, i.e. 160 Gb/s

– At most the maximum number supported by the switch (e.g. 40*) times N

• With N=16: 640 channels, i.e. 6.4 Tb/s

The two topologies differ in the range of bandwidths they can offer

– For low loads, the all-to-all might be overkill, even with a single channel per link

– For high loads, the switch might be short of a few Tb/s, even with 40 channels

• But there are several other very important differences…

* Depending on the assumptions made on the device behavior; the numbers mentioned here are indicative
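
The bandwidth ranges quoted above follow from simple counting. The helper below is a minimal sketch; the function name and signature are hypothetical, and the limits of 100 and 40 channels are the indicative values from the slide.

```python
def bandwidth_range_gbps(n_endpoints, max_channels_per_link, topology,
                         channel_rate_gbps=10):
    """Return (min, max) total bandwidth in Gb/s for a given topology."""
    if topology == "all-to-all":
        # One directed link per ordered pair of end-points.
        links = n_endpoints * (n_endpoints - 1)
    elif topology == "switched":
        # One link per end-point into the switch.
        links = n_endpoints
    else:
        raise ValueError(f"unknown topology: {topology}")
    minimum = links * channel_rate_gbps                          # 1 channel per link
    maximum = links * max_channels_per_link * channel_rate_gbps  # fully populated
    return minimum, maximum


if __name__ == "__main__":
    print(bandwidth_range_gbps(16, 100, "all-to-all"))
    print(bandwidth_range_gbps(16, 40, "switched"))
```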

More differences – on the energy side

• Consider a case where we want to provide 4.8 Tb/s of bisect. BW between 16 endpoints

– Falls in the range of both all-to-all and switched (480 channels in total)

• All-to-all means 2 channels per link (we have 16x15 = 240 links)

• Switched means 30 channels per link (with 16 links)

– For a total traffic of (e.g.) 2.4 Tb/s, utilization is 50% in both cases

• Same total number of wavelengths, same traffic

• BUT more wavelengths per link in the switched case (and switch signal attenuation to be compensated)

Switched architecture is less energy efficient (more energy-per-bit)

• What about the latency?
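
The channel split and utilization in the 4.8 Tb/s example above can be checked with a few lines of arithmetic. This is a sketch only; the function name `provision` and the dictionary keys are hypothetical.

```python
def provision(total_bw_gbps, n_endpoints, pair_load_gbps, channel_rate_gbps=10):
    """Split a total bandwidth budget into per-link channel counts."""
    total_channels = total_bw_gbps // channel_rate_gbps
    pairs = n_endpoints * (n_endpoints - 1)  # ordered end-point pairs
    total_traffic_gbps = pair_load_gbps * pairs
    return {
        "total_channels": total_channels,
        # All-to-all: one link per ordered pair of end-points.
        "all_to_all_channels_per_link": total_channels // pairs,
        # Switched: one link per end-point.
        "switched_channels_per_link": total_channels // n_endpoints,
        # Same traffic and same capacity, hence same utilization for both.
        "utilization": total_traffic_gbps / total_bw_gbps,
    }


if __name__ == "__main__":
    print(provision(total_bw_gbps=4800, n_endpoints=16, pair_load_gbps=10))
```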

Topology impact – on the latency side

• In the previous example, all-to-all and switched provide the same bisect. BW

– Same asymptotic throughput, same saturation load

– But…

• Switched topology implies resource sharing among flows

– Impacts queuing latency

• Packets in a flow are delayed not only by previous packets in the same flow, but also by other flows’ packets

Less predictable, potentially higher latency distribution

– On the other hand, serialization latency is improved by the fact that all outgoing channels can be used in parallel for a single packet

• From 2 to 30 channels: up to 15x improvement in serialization latency
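
To put numbers on the 15x figure, the sketch below computes the serialization time of a single packet striped over all channels of a link. The 1000-byte packet size is an assumption chosen for illustration, and the function name is hypothetical.

```python
def serialization_ns(packet_bytes, n_channels, channel_rate_gbps=10):
    """Serialization time in ns when a packet is striped over all channels."""
    # Bits divided by Gb/s gives nanoseconds directly.
    return packet_bytes * 8 / (n_channels * channel_rate_gbps)


if __name__ == "__main__":
    all_to_all = serialization_ns(1000, n_channels=2)   # 2 channels per link
    switched = serialization_ns(1000, n_channels=30)    # 30 channels per link
    print(all_to_all, switched, all_to_all / switched)
```

The ratio depends only on the channel counts (30 / 2 = 15), not on the assumed packet size.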

Differences among topologies in terms of performance-energy trade-off

• At constant bisect. BW:

– All-to-all is intrinsically energy optimal

– Switched is intrinsically latency optimal (at least below the saturation load)

• But bisect. BW is not a requirement; we can “test” different degrees of WDM parallelism

– See how these populate the Pareto front for each topology
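
Extracting a Pareto front from a set of evaluated designs can be sketched as follows. The function name `pareto_front` and the sample points are made up for illustration; they are not results from the study.

```python
def pareto_front(points):
    """Return the (latency, energy) points not dominated by any other point.

    Both objectives are minimized. After sorting by latency, a point belongs
    to the front only if its energy is strictly lower than that of every
    point with lower or equal latency.
    """
    front = []
    best_energy = float("inf")
    for latency, energy in sorted(points):
        if energy < best_energy:
            front.append((latency, energy))
            best_energy = energy
    return front


if __name__ == "__main__":
    # Hypothetical (latency in ns, energy in pJ/bit) design points.
    designs = [(100, 5.0), (50, 8.0), (200, 2.0), (150, 6.0), (60, 9.0)]
    print(pareto_front(designs))
```

Running this once per topology, with one point per WDM parallelism setting, gives the per-topology fronts that can then be compared.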

Main result

• 16 endpoints

• 10 Gb/s average load between pairs (150 Gb/s per client, 2.4 Tb/s total)

• 1 KB packets

• Poisson arrivals

• Includes physical layer analysis

• Power consumption of components taken from the literature

• Switch performs round-robin arbitration

[Figure: energy per delivered bit (pJ, log scale, about 10^0 to 10^1) versus head-to-tail latency (ns, log scale, about 10^1 to 10^4); series: all-to-all, switched. Annotations:]

– All-to-all allows the shortest latencies

– Gap: switched offers solutions “in between”

– All-to-all achieves the best energy efficiencies

– 1 channel per link, 100% utilization = saturation load

Other loads

• No solution below 1pJ/bit

[Figure: two panels, 100 Gb/s per client and 225 Gb/s per client, each plotting energy per delivered bit (pJ, log scale) for all-to-all and switched]

• Switched topology is completely dominated

Measuring latency with Poisson traffic is only an indicator

Let’s test the designs with traffic generated by a simple application skeleton.

[Figure: two panels: all-to-all, 2 channels per link, 480 total; switched, 30 channels per link, 480 total. Annotations:]

– The initial broadcast takes the same time to complete

– In the switched case, some messages arrive earlier

– The “shift down” communication phase fully benefits from the switch (no congestion)

Pareto trade-off for application skeleton

[Figure: two panels plotting energy per delivered bit (pJ, log scale) versus time-to-solution (ns, log scale) for all-to-all and switched; the first panel spans roughly 10^3.3 to 10^3.6 ns and 10^1.22 to 10^1.31 pJ, the second roughly 10^3 to 10^4 ns and 10^1 to 10^3 pJ]

The switched architecture is almost totally dominated! Does this contradict the previous slide? No.

Performance and energy relations

[Figure: points labeled 2, 3, 4, 5, and 30; callouts for “All-to-all, 2 channels per link” and “Switched, 30 channels per link”]

The switched architecture, with the same total number of channels, DOES lead to a shorter time-to-solution than all-to-all

But the presence of the switch AND the multiple channels induce a penalty in terms of energy

In this particular case, a larger latency gain can be achieved by doubling the channels in the all-to-all, at a far smaller energy penalty.

Application results discussion

• Sensitive to physical layer parameters

– Depending on the assumptions made on future fabrication possibilities, results for the switched topology might improve slightly

• Sensitive to network size

– With 8 or even 4 clients, the switch penalty is far less important

• Sensitive to the application itself

• So far, arbitration latency is neglected

– Accounting for it will push the green curves to the right

• Arbitration power consumption is neglected, too

– Accounting for it will push the green curves up

• But the silicon area of all-to-all is neglected, too…

Conclusions

• Main conclusion: too close to call!

– Although the all-to-all architecture seems to do pretty well for a “brute force” approach, the switched architecture seems not to be far behind

– For a given context, one may be slightly better than the other

• Important to expose the “solution diversity” to the higher layers

Integration of the resulting models (all-to-all and switched) in SSTMicro!

• Probably a good potential lies in hybrid architectures

– Example: a switch for even end-points, another for odd end-points

• Doubles the number of links, shrinks the switch radix by a factor of two

Explore the possible hybrids and integrate them in SSTMicro, too.

• Sensitivity analysis

– Physical layer: around 20 parameters; arbitration: 3-4 parameters; application traffic type: 3-4 parameters…