Blue Gene / C

Blue Gene / C • Cellular architecture • 64-bit Cyclops64 chip: 500 MHz, 80 processors (each has 2 thread units and an FP unit) • Software: Cyclops64 exposes much of the underlying hardware to the programmer, allowing the programmer to write very high-performance, finely tuned software.


Transcript of Blue Gene / C

Page 1: Blue Gene / C

Blue Gene / C

• Cellular architecture
• 64-bit Cyclops64 chip:
  – 500 MHz
  – 80 processors (each has 2 thread units and an FP unit)
• Software
  – Cyclops64 exposes much of the underlying hardware to the programmer, allowing the programmer to write very high-performance, finely tuned software.

Page 2: Blue Gene / C

The C64 system is a supercomputer built on multi-core system-on-a-chip (SoC) technology and based on a cellular architecture; it is expected to achieve over one petaflop of peak performance. A maximum configuration of a C64 system consists of 13,824 C64 processing nodes (about 1 million processors) connected by a 3D-mesh network. Each node is composed of a C64 chip, external DRAMs and a small number of external modules. A C64 chip consists of up to 80 custom-designed 64-bit processors (each with two thread processing cores), 16 shared instruction caches (I-caches), 160 on-chip embedded SRAM memory banks and 80 floating point units (FP). It is interesting to note that there is no data cache on the chip. Instead, each SRAM bank on the chip can be configured into two levels: global interleaved memory banks (GM), which are uniformly addressable, and scratch pad memories (SP), which are local to individual processors.

The C64 chip configuration used in this study integrates 75 processors on a single chip. Each processor contains two thread units, one floating point unit and two 32 KB SRAM memory banks. Groups of five processors share one I-cache.
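The system-level figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on this slide, plus one assumption that the slide does not state: that each floating point unit completes one fused multiply-add (counted as 2 floating-point operations) per cycle. The constant and function names are illustrative.

```python
# Figures quoted on the slides.
NODES = 13_824                   # maximum configuration
PROCESSORS_PER_CHIP = 80
THREAD_UNITS_PER_PROCESSOR = 2
FP_UNITS_PER_CHIP = 80
CLOCK_HZ = 500e6                 # 500 MHz

# ASSUMPTION (not stated on the slides): one fused multiply-add per FP unit
# per cycle, counted as 2 floating-point operations.
FLOPS_PER_FP_UNIT_PER_CYCLE = 2


def system_totals(nodes=NODES):
    """Return processor, thread-unit and peak-FLOPS totals for `nodes` chips."""
    processors = nodes * PROCESSORS_PER_CHIP
    thread_units = processors * THREAD_UNITS_PER_PROCESSOR
    chip_peak = FP_UNITS_PER_CHIP * CLOCK_HZ * FLOPS_PER_FP_UNIT_PER_CYCLE
    return {
        "processors": processors,
        "thread_units": thread_units,
        "chip_peak_flops": chip_peak,
        "system_peak_flops": chip_peak * nodes,
    }


if __name__ == "__main__":
    for name, value in system_totals().items():
        print(f"{name}: {value:,.0f}")
```

Under that assumption the maximum configuration comes out at 1,105,920 processors and a peak slightly above 10^15 floating-point operations per second, which is consistent with the "1 million processors" and "over one petaflop" figures quoted above.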

Page 3: Blue Gene / C

IBM Cyclops Project

Page 4: Blue Gene / C
Page 5: Blue Gene / C

Interconnection Network

System = Processor Tiles + Channels + Routers

Page 6: Blue Gene / C

Router Architecture

• Input-queued
• Virtual channel
• Speculative pipeline

Page 7: Blue Gene / C

[Figure: router block diagram. Labelled blocks: input ports, input receiver, input buffer, crossbar, control (routing, scheduling), output buffer, transmitter, output ports.]

Page 8: Blue Gene / C

Switches

• Low-swing bit lines
• Operate at channel rate
• Reduces area and hence power
• Equalized drive
• Buffered crosspoints
• Integral allocation

Page 9: Blue Gene / C

Torus

Page 10: Blue Gene / C

Concentrated Mesh

Source: Balfour and Dally, ICS 06

Page 11: Blue Gene / C

Express Links

Source: Balfour and Dally, ICS 06

Page 12: Blue Gene / C

The most important quality measures of an interconnection network are its:

1. Degree: the maximum degree of all PUs.
2. Diameter: the maximum distance between any pair of PUs in the network.
3. Bisection width: the minimum number of connections that must be removed in order to decompose a processor network with n PUs into two networks with at most round_up(n / 2) PUs each.
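The three measures above can be computed directly on small example networks. The following is a minimal sketch, not taken from the slides: the graph builders (`ring`, `torus_2d`) and function names are illustrative, and the bisection width is found by brute force over all balanced splits, which is only practical for a handful of nodes.

```python
from collections import deque
from itertools import combinations


def ring(n):
    """Ring of n PUs as an adjacency dict (illustrative topology)."""
    return {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}


def torus_2d(rows, cols):
    """2D torus with wrap-around links in both dimensions."""
    return {
        (r, c): {((r + 1) % rows, c), ((r - 1) % rows, c),
                 (r, (c + 1) % cols), (r, (c - 1) % cols)}
        for r in range(rows) for c in range(cols)
    }


def degree(adj):
    """Maximum number of links at any PU."""
    return max(len(neighbours) for neighbours in adj.values())


def bfs_distances(adj, src):
    """Hop count from src to every reachable PU."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter(adj):
    """Largest shortest-path distance over all pairs of PUs."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)


def bisection_width(adj):
    """Brute force: fewest links cut by any split into floor(n/2) and
    ceil(n/2) PUs. Exponential in n, so use on small graphs only."""
    nodes = sorted(adj)
    best = None
    for part in combinations(nodes, len(nodes) // 2):
        side = set(part)
        cut = sum(1 for u in side for w in adj[u] if w not in side)
        if best is None or cut < best:
            best = cut
    return best


if __name__ == "__main__":
    for name, graph in (("ring(8)", ring(8)), ("torus 4x4", torus_2d(4, 4))):
        print(name, degree(graph), diameter(graph), bisection_width(graph))
```

For an 8-node ring this gives degree 2, diameter 4 and bisection width 2. For realistic network sizes the bisection width would need a closed-form argument or a heuristic partitioner rather than enumeration.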

Page 13: Blue Gene / C

Comparison of the diameter (D) and average diameter (Dm) of toruses, fat-trees and circulant graphs (Project SWISS)

1. Toruses always have the worst diameter.
2. Fat-trees appear to have the best diameter, but the difference from circulant graphs decreases with increasing degree.
3. The average diameter of a fat-tree is very close to its diameter; as a consequence, for degrees greater than 4 and sizes smaller than 1000, the average diameter of circulant graphs is smaller than that of fat-trees.
4. For up to 1000 PUs, the diameter of circulant graphs is smaller than or equal to that of fat-trees as soon as the degree is greater than 6.
5. Fat-trees always have the best bisectional width and toruses the worst, while the bisectional width of circulant graphs is very erratic.
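To make the torus versus circulant comparison concrete, the sketch below builds both topologies at one small size and measures diameter and average distance by breadth-first search. It is only an illustration under stated assumptions and does not reproduce the Project SWISS results: the size (64 PUs), the restriction to degree-4 circulant graphs of the form C_n(1, s), and all function names are choices made for this example.

```python
from collections import deque


def circulant(n, jumps):
    """Circulant graph C_n(jumps): PU v links to v + s and v - s (mod n)."""
    return {
        v: {(v + s) % n for s in jumps} | {(v - s) % n for s in jumps}
        for v in range(n)
    }


def torus_2d(side):
    """side x side torus with wrap-around links (degree 4 for side >= 3)."""
    return {
        (r, c): {((r + 1) % side, c), ((r - 1) % side, c),
                 (r, (c + 1) % side), (r, (c - 1) % side)}
        for r in range(side) for c in range(side)
    }


def distance_stats(adj):
    """Return (diameter, average distance over ordered pairs of distinct PUs)."""
    worst, total, pairs = 0, 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        worst = max(worst, max(dist.values()))
        total += sum(dist.values())
        pairs += len(dist) - 1
    return worst, total / pairs


def best_degree4_circulant(n):
    """Try every second jump s in C_n(1, s); keep the smallest (D, Dm)."""
    best = None
    for s in range(2, n // 2):      # s = n/2 would lower the degree to 3
        stats = distance_stats(circulant(n, (1, s)))
        if best is None or stats < best[1]:
            best = (s, stats)
    return best


if __name__ == "__main__":
    print("8x8 torus (D, Dm):", distance_stats(torus_2d(8)))
    s, stats = best_degree4_circulant(64)
    print(f"best C_64(1, {s}) (D, Dm):", stats)
```

The search is exhaustive over the second jump only, so it finds the best graph within that family rather than the best circulant graph overall.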

Page 14: Blue Gene / C
Page 15: Blue Gene / C
Page 16: Blue Gene / C
Page 17: Blue Gene / C
Page 18: Blue Gene / C

Comparison of the bisectional width of toruses, fat-trees and circulant graphs

Based on these results we can discard the toruses, which always have the worst diameter and bisectional width. Small-degree fat-trees seem to be the best choice, even if the difference from circulant graphs is not spectacular. Nevertheless, the drawback of fat-trees is that they are extremely rigid. We have the following properties:

• The number of fat-trees of a given degree d and of size N is equal to <N. For d = 8 and N = 1000 this number is equal to 3.
• Performant circulant graphs can be found for any number of PUs.

Page 19: Blue Gene / C

Comparison of the bisectional width of toruses, fat-trees and circulant graphs

Page 20: Blue Gene / C
Page 21: Blue Gene / C
Page 22: Blue Gene / C

Building up systems with several hundred blocks requires building a matrix of high-speed, high-fanout fat-tree switches to interconnect the processors. Courtesy Compaq Computer Corporation, Manchester, U.K.

Page 23: Blue Gene / C
Page 24: Blue Gene / C

To understand how technology changes affect the optimal network radix, consider the latency (T) of a packet traveling through a network. The header latency (T_h) is the time for the beginning of a packet to traverse the network and is equal to the number of hops a packet takes times a per-hop router delay (t_r). Since packets are generally wider than the network channels, the body of the packet must be squeezed across the channel, incurring an additional serialization delay (T_s). Thus, total delay can be written as

T = T_h + T_s = H t_r + L/b    (1)

where H is the number of hops a packet travels, L is the length of a packet, and b is the bandwidth of the channels. For an N-node network with radix-k routers (k input channels and k output channels per router), the number of hops must be at least 2 log_k N. Also, if the total bandwidth of a router is B, that bandwidth is divided among the 2k input and output channels, and b = B/2k.
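Equation (1) can be evaluated directly to see how its two terms move in opposite directions as the radix grows. The following is a minimal sketch: the function name and the parameter values in the example are hypothetical, and the hop count uses the lower bound 2 log_k N as if it were exact.

```python
import math


def packet_latency(k, N, B, L, t_r):
    """Return (header, serialization, total) latency for radix k.

    Uses H = 2 * log_k(N) (the lower bound from the text) and b = B / (2k).
    Units are up to the caller; here B is bits/s, L is bits, t_r is seconds.
    """
    hops = 2 * math.log(N, k)
    channel_bandwidth = B / (2 * k)
    header = hops * t_r
    serialization = L / channel_bandwidth
    return header, serialization, header + serialization


if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    N, B, L, t_r = 1024, 1e12, 128, 20e-9
    for k in (4, 16, 64, 256):
        header, serial, total = packet_latency(k, N, B, L, t_r)
        print(f"k={k:4d}  T_h={header:.3e}  T_s={serial:.3e}  T={total:.3e}")
```

Raising k shortens the path, so the header term falls, but it also narrows each channel, so the serialization term rises. That tension is what produces an optimal radix.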

Page 25: Blue Gene / C

Substituting this into the expression for latency from equation (1) gives

T = 2 t_r log_k N + 2kL/B    (2)

Then, setting dT/dk equal to zero and isolating k gives the optimal radix in terms of the network parameters:

k log^2 k = (B t_r log N) / L    (3)
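The step from equation (2) to equation (3) can be written out as follows. This sketch assumes natural logarithms and treats the router delay t_r as independent of k while differentiating; neither assumption is stated explicitly on the slide.

```latex
\begin{align*}
T(k) &= 2\,t_r\,\frac{\ln N}{\ln k} + \frac{2kL}{B} \\
\frac{dT}{dk} &= -\,\frac{2\,t_r \ln N}{k\,\ln^{2} k} + \frac{2L}{B} = 0 \\
\Longrightarrow\quad k\,\ln^{2} k &= \frac{B\,t_r \ln N}{L}
\end{align*}
```

The first term of the derivative comes from d/dk (1/ln k) = -1/(k ln^2 k). The header term therefore shrinks with k while the serialization term grows linearly, and the optimum is where the two rates of change balance.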

Router delay t_r can be expressed as the number of pipeline stages (P) times the cycle time (t_cy). As radix increases, t_cy remains constant and P increases logarithmically. The number of pipeline stages P can be further broken down into a component that is independent of the radix (X) and a component that depends on the radix (Y log_2 k).

Thus router delay (t_r) can be rewritten as

t_r = t_cy P = t_cy (X + Y log_2 k)    (4)
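Equation (3) has no closed-form solution for k, but it is easy to solve numerically. The sketch below finds the optimal radix by bisection for a fixed router delay and also encodes equation (4). It assumes natural logarithms in equation (3); the function names and the example parameter values are hypothetical.

```python
import math


def router_delay(k, t_cy, X, Y):
    """Equation (4): t_r = t_cy * (X + Y * log2(k))."""
    return t_cy * (X + Y * math.log2(k))


def optimal_radix(B, t_r, N, L):
    """Solve k * ln(k)^2 = B * t_r * ln(N) / L for k > 1 by bisection.

    Treats t_r as a constant, as in the derivation of equation (3).
    """
    target = B * t_r * math.log(N) / L

    def f(k):
        return k * math.log(k) ** 2 - target

    lo, hi = 1.0, 2.0
    while f(hi) < 0:            # grow the bracket until it contains the root
        hi *= 2
    for _ in range(100):        # k * ln(k)^2 is increasing for k > 1
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    # Hypothetical parameters: 1 Tb/s router, 20 ns router delay,
    # 1024 nodes, 128-bit packets.
    k_opt = optimal_radix(B=1e12, t_r=20e-9, N=1024, L=128)
    print(f"optimal radix ~ {k_opt:.1f}")
    print("t_r at k=64:", router_delay(64, t_cy=1e-9, X=4, Y=1))
```

Because equation (4) makes t_r grow with log_2 k, a fuller model would substitute `router_delay` into the latency expression and minimise over k directly; the fixed-t_r solution here is the simpler case that equation (3) describes.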

Page 26: Blue Gene / C

Radix Clos Rank 2 Network Latency

Latency = H t_r + L/b = 2 t_r log_k N + 2kL/B

where

k = radix
B = total router bandwidth
N = number of nodes
L = message size

Page 27: Blue Gene / C

Chip radix switch latency

Page 28: Blue Gene / C

Radix Clos Rank 2 Network