Cray Inc. Hot Interconnects 1 Bob Alverson, Duncan Roweth, Larry Kaplan Cray Inc.

Gemini System Interconnect

Cray Inc. Hot Interconnects 1

Bob Alverson, Duncan Roweth, Larry Kaplan

Cray Inc.

Agenda

- Overview
- Network Interface
- Router
- Reliability, Availability, and Serviceability Features
- Software Stack
- Performance

Cray XE6 With Gemini Network

- Integrated NIC and Router
- External HSS Monitoring
- Supports 2 Nodes per ASIC
- Advanced Resiliency Features
- Hardware Global Address Support
- Advanced NIC designed to efficiently support
  - MPI
  - One-sided MPI
  - Shmem
  - UPC, Coarray FORTRAN

Cray XE6 Chassis Topology

[Figure: chassis topology diagram with links labeled along the X, Y, and Z dimensions]

Gemini Network Interface

- Fast Memory Access (FMA) – fine grain remote PUT/GET
- Block Transfer Engine (BTE) – offload for long transfers
- Completion Queue (CQ) – client notification
- Atomic Memory Op (AMO) – fetch&add, etc.

[Figure: NIC block diagram. Labeled blocks: HT3 Cave, FMA, BTE, CQ, AMO, NPT, RMT, HARB, ORB, RAT, NAT, CLM, TARB, SSID, NL, LB Ring, LB, LM, and Router Tiles. The blocks are connected by HT request/response paths and network request/response (net req, net rsp) paths on virtual channels vc0 and vc1.]

Fast Memory Access

- Single-sided
- Processor stores become remote PUT or GET
- FMA descriptors hold state to help determine destination node and memory location
- FMA PUT for short messages
  - Uncached processor store to Gemini window translated directly to network packet
- FMA GET allows reverse direction data transfer of 1 to 64 bytes
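The FMA PUT path described on this slide can be pictured with a small software model. This is only an illustrative sketch, not Gemini's actual descriptor layout: `FmaDescriptor`, `PutPacket`, and `store_to_window` are hypothetical names, and the 64-byte packet limit is borrowed from the "Put request of 64 bytes" maximum on the packet format slide.

```python
from dataclasses import dataclass
from typing import List

MAX_PAYLOAD = 64  # bytes per Put request (maximum from the packet format slide)


@dataclass(frozen=True)
class FmaDescriptor:
    """Hypothetical descriptor state: where stores through the window land."""
    dest_node: int    # destination node identifier
    remote_base: int  # base address of the target region on that node


@dataclass(frozen=True)
class PutPacket:
    """Hypothetical network packet produced by a store to the window."""
    dest_node: int
    remote_addr: int
    payload: bytes


def store_to_window(desc: FmaDescriptor, window_offset: int, data: bytes) -> List[PutPacket]:
    """Translate a store at window_offset into one PUT packet per 64-byte chunk."""
    packets = []
    for start in range(0, len(data), MAX_PAYLOAD):
        chunk = data[start:start + MAX_PAYLOAD]
        remote_addr = desc.remote_base + window_offset + start
        packets.append(PutPacket(desc.dest_node, remote_addr, chunk))
    return packets
```

The point of the model is that the processor issues only an ordinary store; the destination node and remote address come from state already held in the descriptor.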

Block Transfer Engine

- Driver managed
- BTE PUT for long messages
  - DMA transfer to offload data movement from processor
- BTE SEND for IP traffic, etc.
  - Send message to remote node
  - Single receive queue for all sources
  - Upper level protocol covers lost messages
- BTE GET support for simplified data transfers
  - In lieu of involving remote side for PUT
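The slides pair FMA PUT with short messages and BTE PUT with long ones, which implies a size-based choice somewhere in the software. A minimal sketch of such a choice follows. The crossover value is purely an assumption for illustration; the slides do not state where the switch happens, and `choose_engine` is a hypothetical helper.

```python
# Hypothetical crossover size in bytes; the slides give no actual value.
ASSUMED_CROSSOVER_BYTES = 4096


def choose_engine(nbytes: int, crossover: int = ASSUMED_CROSSOVER_BYTES) -> str:
    """Pick FMA for short transfers and BTE for long ones."""
    if nbytes <= 0:
        raise ValueError("transfer size must be positive")
    # Short transfers: processor stores via FMA give the lowest overhead.
    # Long transfers: hand the copy to the BTE so the processor is free.
    return "FMA" if nbytes <= crossover else "BTE"
```

In practice the best crossover would be found by measurement, since it depends on how much the offload is worth relative to the cost of setting up a DMA transfer.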

Atomic Memory Operations (AMOs)

- Hardware remote atomic memory operations in the NIC
  - Add, Compare & Swap, Logical Operations
  - Executed at the node with the memory
  - AMO cache for hot locations
    - Up to 64 locations with AMOs in process
- Global operations support
  - Barriers
  - Counters
  - Collectives (reductions, global sum)
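To make the AMO semantics concrete, here is a small software model of fetch-and-add and compare-and-swap as they would appear to a caller, plus a counter-based barrier arrival built on top. It is a sketch of the general technique only: `AmoTarget` and `arrive` are hypothetical names, 64-bit operands are an assumption, and the AMO cache is not modeled.

```python
MASK64 = (1 << 64) - 1  # assume 64-bit operands


class AmoTarget:
    """Models the node that owns the memory, where the AMO is executed."""

    def __init__(self):
        self.memory = {}  # address -> value; missing addresses read as 0

    def fetch_add(self, addr: int, value: int) -> int:
        """Add value to memory[addr] and return the previous contents."""
        old = self.memory.get(addr, 0)
        self.memory[addr] = (old + value) & MASK64
        return old

    def compare_swap(self, addr: int, expected: int, new: int) -> int:
        """Store new only if memory[addr] == expected; return the previous contents."""
        old = self.memory.get(addr, 0)
        if old == expected:
            self.memory[addr] = new & MASK64
        return old


def arrive(target: AmoTarget, counter_addr: int, nprocs: int) -> bool:
    """Barrier arrival: True only for the process whose arrival is the last one."""
    return target.fetch_add(counter_addr, 1) == nprocs - 1
```

Because each operation runs at the node that holds the memory, concurrent callers never see a torn update; each gets a distinct previous value back.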

Router

- 6x8 tile matrix
- Input queue to one of 6 subswitches
- Route to one of 8 output buffers
- Hashed routing preserves order to cachelines
- Adaptive routing
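The order-preserving property of hashed routing follows from one idea: if the path is a pure function of the cacheline address, every packet for a given cacheline takes the same path and so cannot overtake an earlier one. The sketch below illustrates that idea with a deliberately simple hash. The modulo hash, the 64-byte cacheline size, and the name `select_path` are assumptions; the slides do not describe Gemini's actual hash function.

```python
CACHELINE_BYTES = 64  # assumed cacheline size


def select_path(addr: int, num_paths: int) -> int:
    """Pick a path index that depends only on the cacheline containing addr."""
    if num_paths <= 0:
        raise ValueError("need at least one path")
    cacheline = addr // CACHELINE_BYTES
    # Stand-in hash: any deterministic function of the cacheline would do.
    return cacheline % num_paths
```

Different cachelines still spread across the available paths, so load is distributed without giving up per-cacheline ordering.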

Adaptive routing

- Route around stalled or down links
  - If a link goes down, adaptive routing mask updated in hardware to exclude it
  - OS traffic uses adaptive routing only, recovers from finite loss of packets
  - Quiesce and re-route to repair deterministic routes
- Congestion feedback to allow routing around bottlenecks
  - Potential for improved performance on difficult traffic patterns such as transpose
- Packets reordered in receive buffer (DRAM)
  - Separate notification (completion event) when all stored
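The mask-based link exclusion and congestion feedback described above can be sketched in a few lines. This is a simplified model under stated assumptions: the mask is a plain bitmask with one bit per candidate link, congestion is represented by per-link queue depths, and `mark_link_down` and `choose_link` are hypothetical names rather than anything from the hardware.

```python
from typing import Sequence


def mark_link_down(mask: int, link: int) -> int:
    """Clear the link's bit so adaptive routing no longer considers it."""
    return mask & ~(1 << link)


def choose_link(mask: int, queue_depths: Sequence[int]) -> int:
    """Among links still enabled in the mask, pick the least congested one."""
    candidates = [i for i in range(len(queue_depths)) if mask & (1 << i)]
    if not candidates:
        raise RuntimeError("no usable link in the adaptive routing mask")
    return min(candidates, key=lambda i: queue_depths[i])
```

For example, with four links enabled and queue depths `[5, 1, 3, 0]`, link 3 is chosen; after `mark_link_down` removes link 3, the choice falls to link 1.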

Network Packet Format

- 24 bit flit
- Maximum size packet is 7+24+1=32 flit
  - Put request of 64 bytes
- Minimum is 2 flit
  - Put response

[Figure: three packet layout diagrams, each drawn 24 bits wide (bit 23 down to bit 0) with one row per phit.
General Network Packet Format: rows phit 0, phit 1, phit 2, and last phit. Labeled fields include h, a, r, v, vc, destination[15:0], payload, optional hash bits, CRC-16, and ok.
Network Request Packet Format: rows phit 0 through phit 6. Labeled fields include destination[15:0], vc, cmd[5:0], address[23:6], address[37:24], addr[39:38], addr[45:40], ptag[7:0], MDH[11:0], packetID[11:0], SSID[7:0], source[15:0], SrcID, DstID, BTE, size, mask[15:0], reserved bits, and several short flag fields.
Data Payload (up to 24 phits): rows phit n, phit n+1, and phit n+2, carrying data[63:42], data[41:20], and data[19:0].]
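The 7+24+1=32 figure can be checked against the payload field labels. Each 64-bit word appears to be split across three phits (data[63:42], data[41:20], data[19:0]), so a 64-byte Put carries 8 words in 24 phits, plus 7 header phits and 1 trailing CRC phit. The sketch below works through that arithmetic. It assumes that "flit" and "phit" refer to the same 24-bit unit here and that payloads are rounded up to whole 64-bit words; neither is stated explicitly on the slide.

```python
FLIT_BITS = 24        # from the slide
HEADER_FLITS = 7      # request header, from "7+24+1"
CRC_FLITS = 1         # trailing CRC flit, from "7+24+1"
FLITS_PER_WORD = 3    # data[63:42], data[41:20], data[19:0]
WORD_BYTES = 8


def payload_flits(nbytes: int) -> int:
    """Flits needed for the payload, rounding up to whole 64-bit words (assumption)."""
    words = -(-nbytes // WORD_BYTES)  # ceiling division
    return words * FLITS_PER_WORD


def put_request_flits(nbytes: int) -> int:
    """Total flits in a Put request carrying nbytes of data."""
    return HEADER_FLITS + payload_flits(nbytes) + CRC_FLITS


def wire_efficiency(nbytes: int) -> float:
    """Fraction of transmitted bits that are user data."""
    return (nbytes * 8) / (put_request_flits(nbytes) * FLIT_BITS)
```

Under these assumptions a 64-byte Put occupies 32 flits, or 768 bits on the wire for 512 bits of data, so about two thirds of the transmitted bits are user payload.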

RAS Features

- Automatic link-level retries
- HT3 support including automatic retries and improved CRC
- Most internal data structures are at least parity protected
  - The longer the occupancy of data at a location, the stronger the protection
- Errors reported as precisely as possible
  - Payload errors reported directly to user
  - Control errors often cannot be associated with a particular transaction
  - In all cases OS or HSS can be notified of the error
- Router errors included
  - Reported at the point of error
  - Endpoint(s) (user) see a timeout
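Link-level retry is a general technique: the sender appends a CRC, the receiver checks it, and a failed check triggers retransmission before any higher layer sees the error. The sketch below shows the idea in software. It is not Gemini's implementation: the CRC used is Python's `binascii.crc_hqx` (a CRC-CCITT variant) as a stand-in, since the slides name CRC-16 but not the polynomial, and `frame`, `check`, and `send_with_retry` are hypothetical helpers.

```python
import binascii
from typing import Callable, Tuple


def frame(payload: bytes) -> bytes:
    """Append a 16-bit CRC (stand-in polynomial) to the payload."""
    crc = binascii.crc_hqx(payload, 0xFFFF)
    return payload + crc.to_bytes(2, "big")


def check(framed: bytes) -> bool:
    """Recompute the CRC at the receiver and compare with the trailer."""
    payload, trailer = framed[:-2], framed[-2:]
    return binascii.crc_hqx(payload, 0xFFFF) == int.from_bytes(trailer, "big")


def send_with_retry(payload: bytes,
                    channel: Callable[[bytes], bytes],
                    max_retries: int = 3) -> Tuple[bytes, int]:
    """Resend until the receiver's CRC check passes; return (data, attempts)."""
    for attempt in range(1, max_retries + 1):
        received = channel(frame(payload))
        if check(received):
            return received[:-2], attempt
    raise RuntimeError("link retry budget exhausted")
```

Here `channel` stands for the physical link; a test channel that flips a bit on the first transmission would cause exactly one retry, invisible to the caller apart from the attempt count.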

Gemini Software Stack

[Figure: software stack diagram. Labeled components: MPICH, MPICH2, SHMEM, PGAS, DMAPP, User level Gemini Network Interface (uGNI), Kernel level GNI (kGNI), GNI Core, Gemini Hardware Abstraction Layer (GHAL), Lustre Network Driver (LND), IP over Gemini Fabric (IPoGIF), GART Resource Management (GRM), and Linux Core. Paths between layers are labeled "IOCTL or System Call" and "Direct Access". Additional labels: Cray COW solution, MRT-size page support, Registration Cache support.]

Performance

- Latency
- Bandwidth
- Atomic operations

Injection BW and Message Rate

- Gemini expanded to HT3 at up to 5.2 GT/s
  - Expect to sustain greater than 6 GB/s user data injection
- Network bandwidth is limited by XT packaging
  - Link speed from 3.125 to 6.25 Gbit/sec
  - In some cases, double wide X & Z links also offer increased bandwidth
- Gemini relies on user level threads
  - MPI processing limits to 2M messages/sec per thread
  - Scales beyond 10M msg/sec per NIC
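The two message-rate figures imply a minimum thread count: if each thread is limited to about 2M messages/sec by MPI processing, reaching 10M msg/sec at the NIC needs several threads injecting concurrently. A quick back-of-the-envelope sketch, assuming the per-thread rate adds up linearly across threads (the slide does not say this; it is a simplifying assumption):

```python
import math

PER_THREAD_MSGS_PER_SEC = 2_000_000  # MPI processing limit per thread (from the slide)
NIC_MSGS_PER_SEC = 10_000_000        # rate the NIC scales beyond (from the slide)


def threads_needed(target_rate: int, per_thread: int = PER_THREAD_MSGS_PER_SEC) -> int:
    """Minimum threads to reach target_rate if rates add linearly (assumption)."""
    return math.ceil(target_rate / per_thread)
```

With the slide's numbers, `threads_needed(NIC_MSGS_PER_SEC)` comes to 5, which is why the NIC's message rate is only reachable with multiple user level threads per node.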

Latency vs. Transfer Size

- One way PUT in 750 ns
- Waiting for Ack in only 1.1 us
- Remote GET increases to 1.4 us

[Figure: time in microseconds (0.0 to 2.5) versus transfer size in bytes (8 to 1024). Series: PUT, ping-pong; PUT, at source; GET.]

Bandwidth vs. Transfer Size

- Peak bandwidth reached with small transfers
- Multiple threads reach peak with still smaller transfers

[Figure: bandwidth in Mbytes/sec (0 to 7000) versus transfer size in bytes (8 to 64K). Series: PPN=1, PPN=2, PPN=4.]

Atomic Memory Operation Rate

- Hot location reaches 100 Mupdates/sec
- Random locations (GUPS) still over 45 Mupdates/sec

[Figure: AMO rate in millions (0 to 120) versus number of processes (0 to 1024). Series: 1 AMO; 8192 AMOs.]

Conclusion

- Gemini provides low latency and high performance for fine grain operations
- Gemini has features to scale in performance and reliability to large system size

Questions?