A Study of Cyclops64 Crossbar
Architecture and Performance
Yingping Zhang
April, 2005
Overview
1. Background
2. Architecture Of C64 Crossbar
3. Performance Simulation
4. Test Result
5. Performance Analysis
6. Conclusion
7. Future Work
Background
1. What is Cyclops64?
Cyclops64 (C64), also called Blue Gene/C, is part of the IBM Blue Gene project.
It is a supercomputer based on a cellular architecture. Each chip consists of 75~80 custom-designed 64-bit processors. Each processor will have two thread units, two integer units, and a floating-point unit.
C64 is expected to reach 1000 teraflops and to be one of the fastest supercomputers in the world.
The architecture was conceived by Cray award winner Monty Denneau. Verification testing and system software development are being done at our CAPSL group.
2. What is the project goal?
Study the architecture and performance of the C64 interconnection network, the crossbar (part of verification testing).
[Figure: Cyclops64 CHIP block diagram. C64 processors (each with two thread units, TU, and a floating-point unit, FP, with one ICache per group of 5 processors) connect through the Crossbar to the Host IF (FIFO, 64-bit x 64), the DDR2 SDRAM Controller, and the ASw (a part of the 3D cube network, leading to the other C64 chips). Other labels: Mickey tree, Gbit ethernet, Disk, Mickey tree (DMA), Gbit ethernet (DMA), DDR2 SDRAM DIMMs, FPGA, mpg.]
Crossbar port assignment:
• Port 0-79 for C64 processors
• Port 80-83 for mpg ICache
• Port 84, 85 for Host IF
• Port 86-89 for DRAM controller
• Port 90-95 for ASw
Processor# 80, ICache# 16
* The configuration pins are connected to all modules except DDR and Crossbar.
Architecture Of C64 Crossbar
1. On-chip crossbar: provides communication inside a single chip.
2. 96-way crossbar: 96 input ports, 96 output ports. Each port can connect with any other port and with itself. Any communication among processors, ICaches, SRAM, DRAM, and ASwitches has to go through the crossbar.
3. Pipelined crossbar: 7 pipeline stages. When fully pipelined, each port delivers one packet each cycle.
Bandwidth of the crossbar (per cycle) = port number * length of the packet
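The bandwidth formula above can be worked through in a few lines. This is only a sketch: the port count (96) comes from the slides, while the packet width and clock frequency used in the example are hypothetical placeholders, since the slides do not state them.

```python
PORTS = 96  # from the slides: 96-way crossbar


def peak_bits_per_cycle(ports: int, packet_bits: int) -> int:
    """Peak bandwidth in bits per cycle, assuming every port delivers one packet per cycle."""
    if ports <= 0 or packet_bits <= 0:
        raise ValueError("ports and packet_bits must be positive")
    return ports * packet_bits


def peak_bytes_per_second(ports: int, packet_bits: int, clock_hz: float) -> float:
    """Convert the per-cycle figure to bytes per second for a given clock frequency."""
    return peak_bits_per_cycle(ports, packet_bits) * clock_hz / 8


# Hypothetical values for illustration only; not taken from the slides.
EXAMPLE_PACKET_BITS = 96
EXAMPLE_CLOCK_HZ = 500e6

example_bits_per_cycle = peak_bits_per_cycle(PORTS, EXAMPLE_PACKET_BITS)
example_bytes_per_second = peak_bytes_per_second(PORTS, EXAMPLE_PACKET_BITS, EXAMPLE_CLOCK_HZ)
```

Substituting the real packet length and clock rate of the chip gives the actual peak figure; the formula itself is the one stated on the slide.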
[Figure: Crossbar Architecture. Blocks shown for each port: SrcSplit, SrcCtl, FIFO, TUnitA, TUnitB, TarCtl, Arbiter (Req/Ack), MUX (Sel), TarCombine, LC, and C64|MP|CORE. The seven pipeline stages are numbered 1 to 7. Port# 96.]
Performance Simulation
1. Performance Measurement
Latency: the time required for a packet to traverse the network from source to destination.
Throughput: the rate at which packets are delivered by the network for a particular traffic pattern.
2. Workloads
Synthetic: Random Distributed vs Poisson Distributed
Application Driven: Hello_World, Matrix_Cthread, Laplace_Cthread, Heat_Cthread, Cnet_get_nb, Cnet_put_nb, Dev_Align, Dev_Reset
3. Simulators
Csim_crossbar
LAST
(Both designed by Fei Chen at CAPSL)
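As a rough illustration of the two metrics defined above, the sketch below computes them from a list of per-packet records. The `PacketRecord` type and the normalization of throughput to packets per port per cycle are assumptions made for this example; they are not taken from Csim_crossbar or LAST.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PacketRecord:
    """Hypothetical trace entry: the cycle a packet entered and left the network."""
    inject_cycle: int
    deliver_cycle: int


def average_latency(records: Sequence[PacketRecord]) -> float:
    """Mean number of cycles between injection and delivery."""
    if not records:
        raise ValueError("no packets recorded")
    return sum(r.deliver_cycle - r.inject_cycle for r in records) / len(records)


def throughput(records: Sequence[PacketRecord], ports: int, cycles: int) -> float:
    """Delivered packets per port per cycle (assumed normalization)."""
    if ports <= 0 or cycles <= 0:
        raise ValueError("ports and cycles must be positive")
    return len(records) / (ports * cycles)
```

With this normalization a value of 1 means every port delivered a packet on every cycle.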
Parameters configuration
PARAMETERS
• Workloads
  - Synthetic
    Temporal characteristics (1): Uniform Random, Poisson
    Spatial distributions (2): Uniform Random, Permutation (Neighbor & Tornado)
  - Application-Driven Benchmarks
• Arbitration Schemes: Uniformly Random, Matrix, Circular, Segmented Matrix, Fixed Priority
(1) Describes the generation probability of messages over time
(2) Determines the communication paths between the sources and destinations
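A minimal sketch of how such synthetic workloads might be generated. The neighbor and tornado mappings follow the usual textbook definitions, and the two temporal generators (one independent coin flip per cycle, and exponential inter-arrival times) are assumptions about what "uniform random" and "Poisson" mean here; the simulators used in this study may differ.

```python
import math
import random
from typing import List


def neighbor_destination(src: int, ports: int) -> int:
    """Neighbor permutation: each source sends to the next port."""
    return (src + 1) % ports


def tornado_destination(src: int, ports: int) -> int:
    """Tornado permutation: each source sends almost halfway around the port ring."""
    return (src + math.ceil(ports / 2) - 1) % ports


def uniform_random_destination(rng: random.Random, ports: int) -> int:
    """Uniform random spatial distribution: any port with equal probability."""
    return rng.randrange(ports)


def bernoulli_injections(rng: random.Random, rate: float, cycles: int) -> List[int]:
    """Cycles in which a source injects, with an independent probability `rate` per cycle."""
    return [c for c in range(cycles) if rng.random() < rate]


def poisson_injections(rng: random.Random, rate: float, cycles: int) -> List[int]:
    """Injection cycles from a Poisson process: exponential gaps with mean 1/rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    arrivals: List[int] = []
    t = rng.expovariate(rate)
    while t < cycles:
        arrivals.append(int(t))
        t += rng.expovariate(rate)
    return arrivals
```

Both permutation patterns map the sources one-to-one onto the destinations, so no two sources target the same output port.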
Test Results: Latency - Synthetic Workloads
• Latency of the Uniform Random pattern grows without bound when the injection rate exceeds 0.6
• Latency of Permutation Traffic stays constant at 7 cycles
Test Results: Throughput - Synthetic Workloads (Cont)
• Uniform workload with permutation traffic pattern has linear throughput
• This network is a stable network
Test Results: Contention - Synthetic Workloads (Cont)
• Permutation Traffic has zero contention
• Uniform distribution has more contention than Poisson distribution
Performance Analysis One - Synthetic Workloads
The minimum latency in the crossbar is 7 cycles.
The crossbar is a stable network because its throughput does not degrade beyond the saturation point.
Contention at the output ports delays message transfer; permutation traffic has zero contention.
Uniformly random workload with permutation traffic has the best performance. When the injection rate reaches 1.0, its throughput reaches 1.
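The contention argument can be illustrated with a one-cycle toy model. It assumes that each output port accepts a single packet per cycle and that every other packet aimed at that port must wait; this is a simplification for illustration, not the arbitration logic of the real crossbar.

```python
import random
from collections import Counter
from typing import List, Sequence


def output_contention(destinations: Sequence[int]) -> int:
    """Packets that must wait in one cycle: all but one per requested output port."""
    counts = Counter(destinations)
    return sum(n - 1 for n in counts.values())


def permutation_destinations(ports: int, shift: int = 1) -> List[int]:
    """One packet per source, each aimed at a distinct output port."""
    return [(src + shift) % ports for src in range(ports)]


def uniform_random_destinations(ports: int, rng: random.Random) -> List[int]:
    """One packet per source, each aimed at an independently chosen output port."""
    return [rng.randrange(ports) for _ in range(ports)]
```

With 96 ports, a permutation pattern never puts two packets on the same output, while independently chosen destinations almost always collide somewhere, which is consistent with the latency gap reported above.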
Test Results: Latency - Arbitration Schemes
• Fixed Priority Scheme is the worst case: its latency grows without bound at rate 0.5
• Others have very similar latency behavior
Test Results: Throughput - Arbitration Schemes (Cont)
• Fixed Priority Scheme is the worst case: the network saturates at rate 0.5
• Others have very similar throughput behavior
Performance Analysis Two - Arbitration Schemes
SLRU, PLRU, CIRC and RAND arbitration schemes show very similar performance behavior under the uniformly random traffic pattern.
Fixed Priority arbitration scheme shows the worst performance behavior under the same conditions.
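One plausible reason for the gap is fairness. The sketch below contrasts a fixed-priority arbiter with a rotating (round-robin) one; treating the rotating arbiter as a stand-in for the CIRC scheme is an assumption, and neither function is taken from the C64 design.

```python
from typing import Optional, Sequence


def fixed_priority_grant(requests: Sequence[bool]) -> Optional[int]:
    """Grant the lowest-numbered requesting input; None if nobody requests."""
    for index, requesting in enumerate(requests):
        if requesting:
            return index
    return None


class RoundRobinArbiter:
    """Rotating-priority arbiter: the search starts just after the last winner."""

    def __init__(self, inputs: int) -> None:
        self.inputs = inputs
        self.next_start = 0

    def grant(self, requests: Sequence[bool]) -> Optional[int]:
        for offset in range(self.inputs):
            index = (self.next_start + offset) % self.inputs
            if requests[index]:
                self.next_start = (index + 1) % self.inputs
                return index
        return None
```

When every input requests on every cycle, the fixed-priority arbiter keeps granting input 0 and the rest wait indefinitely, whereas the rotating arbiter serves each input in turn.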
Test Results: Application-Driven Benchmarks

Application      Number of   Forward     Reverse     Forward      Reverse
                 Packets     Latency     Latency     Throughput   Throughput
                             (Avg)       (Avg)       (Avg)        (Avg)
Hello_World      5110        7.35        19.74       0.002        0.002
Heat_Cthread     7975863     46.00       4034.00     0.002        0.001
Matrix_Cthread   110218      21.59       939.00      0.002        0.002
Cnet_get_nb      10162       7.538       53.552      0.001        0.002
Cnet_put_nb      10052       7.619       50.027      0.001        0.002
Dev_Align        8924        7.286       37.381      0.002        0.002
Dev_Reset        10148       7.617       50.413      0.001        0.002
• Average reverse latency increases very fast as the number of packets increases
• Forward and reverse traffic have different latency behavior
Performance Analysis - Application-Driven Benchmarks
The C64 architecture classifies traffic into:
Class 0 (Forward traffic): messages sent out from processors, such as load requests and stores
Class 1 (Reverse traffic): messages sent back to processors, such as load returns
Reverse transfer delay is much larger than forward transfer delay
Forward and reverse transfers have similar throughput
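To put a number on "much larger", the average latencies reported in the application-driven results can be turned into reverse-to-forward ratios. The values are copied from the results table; the helper names are only illustrative.

```python
from typing import Dict, Tuple

# (forward average latency, reverse average latency), copied from the results table
AVG_LATENCY: Dict[str, Tuple[float, float]] = {
    "Hello_World": (7.35, 19.74),
    "Heat_Cthread": (46.00, 4034.00),
    "Matrix_Cthread": (21.59, 939.00),
    "Cnet_get_nb": (7.538, 53.552),
    "Cnet_put_nb": (7.619, 50.027),
    "Dev_Align": (7.286, 37.381),
    "Dev_Reset": (7.617, 50.413),
}


def latency_ratios(table: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Reverse latency divided by forward latency for each benchmark."""
    return {name: reverse / forward for name, (forward, reverse) in table.items()}


ratios = latency_ratios(AVG_LATENCY)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:15s} reverse/forward = {ratio:.1f}")
```

The ratio is above 1 for every benchmark and is largest for Heat_Cthread, the benchmark with by far the most packets.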
Conclusion
For Synthetic Workloads
Verified:
• The C64 crossbar is a stable network
• The minimum latency of the C64 crossbar is 7 cycles
Discovered:
• Traffic pattern, including temporal characteristics and spatial distribution, strongly affects the crossbar performance behavior
• Permutation spatial traffic has the best latency behavior. It stays at the minimum latency of 7 cycles because it has zero contention.
• Uniform random distributed workload has better throughput behavior.
• Fixed priority arbitration scheme has the worst performance behavior; the others are very similar
For Application-Driven Workloads
Discovered:
• Forward and reverse traffic have different latency behavior but similar throughput behavior
• Reverse traffic has worse latency behavior than forward traffic
Future Work
Synthetic Workloads
• Investigate arbitration schemes under different traffic patterns
Application-Driven Workloads
• Investigate performance behavior of C64 Crossbar under different configuration constraints
  - Number of used thread units
  - Number of involved memory banks
• Investigate performance behavior of C64 Crossbar under different arbitration schemes
Summary of Performance Analyses
Documentation
Acknowledgments
Fei Chen, Yuhei, Dimitri, Joseph, Ted, Prof. Gao
All people in CAPSL group
Question?
Thanks!!!