Rev PA1Rev PA1 1
Performance-energy trade-offs with Silicon Photonics
Sébastien Rumley, Robert Hendry, Dessislava Nikolova, Keren Bergman
Goal of the study
• Suppose (silicon photonics based) optical data movement between end-points
– Small connectivity (4 – 16)
– Between chips (not on the same chip), potentially several meters apart
• What is the design space?
– Selection of the “topology”
– Choice of optical devices
– Amount of WDM parallelism
– Type of modulation and rate
Topology selection
• Basically, all-to-all, switched, or bus
• … and all the possible combinations thereof, or hybrids
• But let's start by analyzing two “extremes” of this design space:
– All-to-all (a.k.a Full-mesh)
– Switched (a.k.a star network)
Other aspects
• Type of modulation and rate
– Simply 10 Gb/s per channel, OOK, considered a good trade-off between SERDES complexity and optical channel utilization
• To be extended in the future
• Choice of optical devices and amount of WDM parallelism:
– Interrelated!
– Optical device parameters have to be optimized for a given number of wavelengths AND for a given topology
• The worst case path determines the parameters, and the maximal number of channels supported
Design space: between 1 and max
Selecting the max is NOT the obvious choice!
[Figure: interdependence between topology, optical devices, and number of channels]
[1] S. Rumley, et al., "Modeling Silicon Photonics in Distributed Computing Systems: From the Device to the Rack".
[2] R. Hendry, D. Nikolova, S. Rumley, N. Ophir, K. Bergman, "Physical layer analysis and modeling of silicon photonic WDM bus architectures".
[3] R. Hendry, et al., "Modeling and Evaluation of Chip-to-Chip Scale Silicon Photonic Networks," IEEE Symposium on High Performance Interconnects (HOTI), 2014.
Why shouldn’t the number of channels (hence the bandwidth) always be maximized?
• Each channel (color) needs its own modulator and detector devices
• Each channel needs its own amount of initial optical power
– Provided by a (so far, rather inefficient) laser
This laser power dominates other power requirements
More channels generally DO NOT make the system MORE energy efficient
• More channels induce inter-channel effects. To (partly) compensate for those, more initial optical power is required
– More channels also mean bigger, more “lossy” optical devices
More channels generally DO make the system LESS energy efficient
• Ideal (power-wise) number of channels: 1 (but adding a few will not drastically change the per-channel consumption)
– Except in cases involving devices whose consumption is independent of the number of channels (common to all channels)
• In these cases, the ideal (power-wise) number of channels is larger than one
Relation energy efficiency - channels
• When going from POWER to ENERGY-PER-BIT efficiency, the utilization plays a major role
• For a FIXED load (traffic, average network activity over time), the energy-per-bit looks like this
– Flat if the resulting bandwidth is lower than the load (resulting in 100% utilization, and buffer overflow)
– Proportional to the number of channels beyond that point (each channel consumes power almost independently of the utilization)
– For a high number of channels, optical signal effects affect the power consumption super-linearly (for a low number, this is negligible)
[Figure: energy-per-bit (J) versus number of channels, with x-axis marks at 1 channel, (average load)/(channel rate), and max channels]
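The three regions described above can be reproduced with a toy model. This is only a sketch: the per-channel power, the maximum channel count, and the cubic crowding penalty are hypothetical placeholders chosen for illustration, not values from the study.

```python
def energy_per_bit_pj(n_channels, load_gbps, rate_gbps=10.0,
                      power_per_channel_mw=20.0, n_max=100, penalty=1.0):
    """Energy per delivered bit (pJ) of one link under a fixed offered load.

    power_per_channel_mw, n_max and penalty are hypothetical placeholders.
    """
    # Toy super-linear term: negligible for few channels, large near n_max.
    crowding = 1.0 + penalty * (n_channels / n_max) ** 3
    power_mw = n_channels * power_per_channel_mw * crowding
    # The link cannot deliver more than its own bandwidth.
    delivered_gbps = min(load_gbps, n_channels * rate_gbps)
    # mW / (Gb/s) = 1e-3 W / 1e9 b/s = 1e-12 J/bit = pJ/bit
    return power_mw / delivered_gbps


if __name__ == "__main__":
    # Fixed 50 Gb/s load: saturation ends at 50 / 10 = 5 channels.
    for n in (1, 2, 5, 10, 20, 50, 100):
        print(f"{n:3d} channels: {energy_per_bit_pj(n, 50.0):8.3f} pJ/bit")
```

Below 5 channels the link is saturated, so power and delivered throughput both scale with the channel count and the ratio stays flat. Above 5 channels the delivered throughput is pinned at the load, so the energy per bit grows with the channel count.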
So, how many channels?
• From a computer architecture point of view, more channels, hence more bandwidth, are generally welcome
– Less queuing time when links are heavily loaded
– SHORTER serialization times
Serialization time is inversely proportional to the bandwidth
[Figure: latency (log) versus number of channels (log), showing queuing time, serialization time, and their sum (head-to-tail latency); x-axis marks at (average load)/(channel rate) and max channels]
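A closed-form approximation gives the same shape for a single link. The sketch below assumes an M/D/1 queue (Poisson arrivals, fixed-size packets) and a 1 KB packet of 8000 bits; both are assumptions made here for illustration and do not describe how the study's latencies were obtained.

```python
import math


def head_to_tail_latency_ns(n_channels, load_gbps, rate_gbps=10.0,
                            packet_bits=8000):
    """Serialization plus mean queuing time (ns) for one link, M/D/1 assumed."""
    bandwidth_gbps = n_channels * rate_gbps
    # bits / (Gb/s) = 1e-9 s, so the result is directly in nanoseconds.
    serialization_ns = packet_bits / bandwidth_gbps
    utilization = load_gbps / bandwidth_gbps
    if utilization >= 1.0:
        return math.inf  # saturated link: the queue grows without bound
    # Mean M/D/1 waiting time (Pollaczek-Khinchine, deterministic service).
    queuing_ns = utilization * serialization_ns / (2.0 * (1.0 - utilization))
    return serialization_ns + queuing_ns


if __name__ == "__main__":
    for n in (5, 6, 10, 20, 50, 100):
        print(f"{n:3d} channels: {head_to_tail_latency_ns(n, 50.0):.1f} ns")
```

With a 50 Gb/s load, 5 channels are saturated, 10 channels give 80 ns of serialization plus 40 ns of queuing, and further channels mostly shrink the serialization term.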
Performance-energy trade-off for a link
• Plotting one against the other
[Figure: energy-per-bit (J, log) versus head-to-tail latency (log). Annotated regions: optical signal effects with a high number of channels; high latency (saturation, overflow); trade energy efficiency for latency; trade latency for energy efficiency]
Going back to the topology choice
• In the all-to-all case, the total number of channels (i.e. bisectional bandwidth) is a multiple of N(N-1)
– So at least N(N-1)
• With N=16: 240 channels, i.e. 2.4 Tb/s
– At most the maximum number supported by a link (typically 100*) times N(N-1)
• With N=16: 24,000 channels, i.e. 240 Tb/s
• In the switched case, it is a multiple of N
– So at least N
• With N=16: 16 channels, i.e. 160 Gb/s
– At most the max number supported by the switch (e.g. 40*) times N
• With N=16: 640 channels, i.e. 6.4 Tb/s
The two topologies differ in the range of bandwidths they can offer
– For low loads, the all-to-all might be overkill, even with a single channel per link
– For high loads, the switch might be short of a few Tb/s, even with 40 channels per link
• But there are several other very important differences…
* Depending on the assumptions made on the device behavior; the numbers mentioned here are indicative
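The bandwidth ranges quoted on this slide follow from simple counting. The sketch below reproduces them; the function names are illustrative, and the per-link maxima (100 and 40) are the indicative values from the slide.

```python
CHANNEL_RATE_GBPS = 10  # per-channel rate assumed throughout the study


def all_to_all_channel_range(n, max_per_link=100):
    """(min, max) total channels: one link per ordered pair of endpoints."""
    links = n * (n - 1)
    return links, links * max_per_link


def switched_channel_range(n, max_per_link=40):
    """(min, max) total channels: one link per endpoint, through the switch."""
    return n, n * max_per_link


if __name__ == "__main__":
    for name, (lo, hi) in (("all-to-all", all_to_all_channel_range(16)),
                           ("switched", switched_channel_range(16))):
        lo_tbps = lo * CHANNEL_RATE_GBPS / 1000
        hi_tbps = hi * CHANNEL_RATE_GBPS / 1000
        print(f"{name}: {lo} to {hi} channels, {lo_tbps} to {hi_tbps} Tb/s")
```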
More differences – on the energy side
• Consider a case where we want to provide 4.8 Tb/s of bisect. BW between 16 endpoints
– Falls in the range of both all-to-all and switched (480 channels in total)
• All-to-all means 2 channels per link (we have 16x15 = 240 links)
• Switched means 30 channels per link (with 16 links)
– For a total traffic of (e.g.) 2.4 Tb/s, utilization is 50% in the two cases
• Same total number of wavelengths, same traffic
• BUT, more wavelengths per link in the switched case (and switch signal attenuation to be compensated)
Switched architecture is less energy efficient (more energy-per-bit)
• What about the latency?
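The sizing of the 4.8 Tb/s example can be checked with a few lines. This is a minimal sketch; the helper name and its topology strings are hypothetical.

```python
def channels_per_link(target_bw_gbps, n_endpoints, topology, rate_gbps=10):
    """Channels each link needs so that all links together reach target_bw_gbps."""
    total_channels = target_bw_gbps // rate_gbps
    if topology == "all-to-all":
        links = n_endpoints * (n_endpoints - 1)  # one link per ordered pair
    elif topology == "switched":
        links = n_endpoints                      # one link per endpoint
    else:
        raise ValueError(f"unknown topology: {topology}")
    per_link, remainder = divmod(total_channels, links)
    if remainder:
        raise ValueError("target bandwidth is not a multiple of the link count")
    return per_link


if __name__ == "__main__":
    for topology in ("all-to-all", "switched"):
        print(topology, channels_per_link(4800, 16, topology), "channels per link")
    # Same total bandwidth in both cases, hence the same utilization.
    print("utilization:", 2400 / 4800)
```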
Topology impact – on the latency side
• In the previous example, all-to-all and switched provide the same bisect. BW
– Same asymptotic throughput, same saturation load
– But…
• Switched topology implies resource sharing among flows
– Impacts queuing latency
• Packets in a flow are delayed not only by previous packets in the same flow, but also by other flows’ packets
Less predictable, potentially higher latency distribution
– On the other hand, serialization latency is improved by the fact that all outgoing channels can be used in parallel for a single packet
• From 2 to 30 channels: up to 15x improvement in serialization latency
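The 15x figure is the ratio of the per-link bandwidths. A short worked example follows, assuming a 1 KB packet of 8000 bits (whether 1 KB means 1000 or 1024 bytes is not stated on the slides, and the ratio does not depend on it).

```python
def serialization_time_ns(packet_bits, n_channels, rate_gbps=10.0):
    """Time to push one packet onto a link striped over n_channels (ns)."""
    # bits / (Gb/s) yields nanoseconds.
    return packet_bits / (n_channels * rate_gbps)


PACKET_BITS = 8000  # assumed size of a 1 KB packet

t_all_to_all = serialization_time_ns(PACKET_BITS, 2)   # 20 Gb/s per link
t_switched = serialization_time_ns(PACKET_BITS, 30)    # 300 Gb/s per link
speedup = t_all_to_all / t_switched

if __name__ == "__main__":
    print(f"all-to-all: {t_all_to_all:.1f} ns, switched: {t_switched:.1f} ns")
    print(f"speedup: {speedup:.1f}x")
```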
Differences among topologies in terms of performance-energy trade-off
• At constant bisect. BW:
– All-to-all is intrinsically energy optimal
– Switched is intrinsically latency optimal (at least below the saturation load)
• But bisect. BW is not a requirement: we can “test” different degrees of WDM parallelism
– See how these populate the Pareto front for each topology
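Extracting a Pareto front from a set of (latency, energy) design points can be sketched as follows. The sample points are made up for illustration and are not results from the study.

```python
def pareto_front(points):
    """Return the (latency, energy) points not dominated by any other point.

    Lower is better on both axes.
    """
    front = []
    best_energy = float("inf")
    # Scan by increasing latency (ties broken by energy); a point survives
    # only if it strictly beats the energy of every faster-or-equal design.
    for latency, energy in sorted(points):
        if energy < best_energy:
            front.append((latency, energy))
            best_energy = energy
    return front


if __name__ == "__main__":
    # Hypothetical designs: (head-to-tail latency in ns, energy in pJ/bit).
    designs = [(120, 4.0), (47, 8.0), (47, 9.0), (300, 3.9), (200, 5.0)]
    print(pareto_front(designs))
```

Running this once per topology, over all tested channel counts, gives one front per topology that can then be compared on the same axes.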
Main result
• 16 endpoints
• 10 Gb/s average load between pairs (150 Gb/s per client) (2.4 Tb/s total)
• 1KB packets
• Poisson arrivals
• Includes physical layer analysis
• Power consumption from components taken from the literature
• Switch realizes round-robin arbitration
[Figure: energy per delivered bit (pJ) versus head-to-tail latency (ns), log-log axes, for all-to-all and switched. Annotations follow.]
– All-to-all allows the shortest latencies
– Gap: switched offers solutions “in between”
– All-to-all achieves the best energy efficiencies
– 1 channel per link: 100% utilization = saturation load
Other loads
• No solution below 1pJ/bit
[Figure: two panels, 100 Gb/s per client and 225 Gb/s per client, each plotting energy per delivered bit (pJ) versus head-to-tail latency (ns) for all-to-all and switched]
• Switched topology is completely dominated
Measuring latency with Poisson traffic is only an indicator
Let’s test the designs with traffic generated by a simple application skeleton.
[Figure: skeleton communication on two designs: all-to-all, 2 channels per link, 480 total; switched, 30 channels per link, 480 total]
Initial broadcast takes the same time to complete
In the switched case, some messages arrive earlier.
The “shift down” communication phase fully benefits from the switch (no congestion)
Pareto trade-off for application skeleton
[Figure: two panels plotting energy per delivered bit (pJ) against time-to-solution (ns), for all-to-all and switched]
Switched architecture almost totally dominated! Does this contradict the previous slide? NO
Performance and energy relations
[Figure: performance-energy plot with points labeled 2, 3, 4, 5 and 30; highlighted designs: all-to-all with 2 channels per link, and switched with 30 channels per link]
Switched architecture, with the same total number of channels, DOES lead to a shorter time-to-solution than all-to-all
But the presence of the switch AND the multiple channels induce a penalty in terms of energy
In this particular case, a larger latency gain can be achieved by doubling the channels in the all-to-all, at a far smaller energy penalty.
Application results discussion
• Sensitive to physical layer parameters
– Depending on the assumptions made on future fabrication possibilities, results for the switched topology might improve slightly
• Sensitive to network size
– With 8 or even 4 clients, the switch penalty is far less important
• Sensitive to the application itself
• So far, arbitration latency neglected
– That will push green curves to the right
• Arbitration power consumption neglected, too
– That will push green curves up
• But silicon area of all-to-all neglected, too…
Conclusions
• Main conclusion: too close to call!
– Although the all-to-all architecture seems to do pretty well for a “brute force” approach, the switched architecture does not seem to be far behind
– For a given context, one may be slightly better than the other
• Important to expose the “solution diversity” to the higher layers
Integration of the resulting models (all-to-all and switched) in SSTMicro!
• Probably a good potential lies in hybrid architectures
– Example: a switch for even end-points, another for odd end-points
• Doubles the number of links, shrinks the switch radix by a factor of two
Explore the possible hybrids and integrate in SSTMicro, too.
• Sensitivity analysis
– Physical layer: around 20 parameters; arbitration: 3-4 parameters
Application traffic type: 3-4 parameters…