Transcript of Cray XE6 Performance Workshop
Modern HPC Architectures
David Henty
EPCC, University of Edinburgh
13/11/2012
Overview
• Components
• History
• Flynn’s Taxonomy
– SIMD
– MIMD
• Classification via Memory
– Distributed Memory
– Shared Memory
– Clusters
• Summary
20/11/2012
Building Blocks of Parallel Machines
• Processors
– to calculate
• Memory
– for temporary storage of data
• Interconnect
– so processors can talk to each other and the outside world
• Storage
– disks and tapes for long term archiving of data
• These are the basic components
– but how do we put them together?
Processors
• Most are RISC architecture
– Reduced Instruction Set Computer
– simplify instructions to maximise speed
• Calculations performed on values in registers
– separate integer and floating point
– loading and storing from memory must be done explicitly
• a = b + c is not an atomic operation
– involves 2 loads, an addition and a store
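As a loose illustration, Python's dis module exposes the same load/operate/store decomposition at the bytecode level. This is an analogy only: Python bytecode targets a stack machine, not hardware registers, but the single statement still splits into loads, an add, and a store.

```python
import dis

def add(b, c):
    a = b + c   # one source statement...
    return a

# ...but several instructions: loads of b and c, an add, a store into a
print([ins.opname for ins in dis.get_instructions(add)])
```

The exact opcode names vary between Python versions, but the load/operate/store shape is always visible.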
Clock Speed
• Rate at which instructions are issued
– modern chips are around 2-3 GHz
– integer and floating point calculations done in parallel
– can also have multiple issue, e.g. simultaneous add and multiply
• Whole series of hardware innovations
– pipelining
– out-of-order execution, speculative computation
– ...
• Details become important for top performance
– most features are fairly generic
Moore’s Law
• “CPU power doubles every 24 months”
– strictly speaking, applies to transistor density
• Held true for ~35 years
– now maybe self-fulfilling?
• People have predicted its demise many times
– but it hasn’t happened yet
• Increases in power are due to increases in parallelism as
well as in clock rate
– fine grain parallelism (pipelining)
– medium grain parallelism (hardware multithreading)
– coarse grain parallelism (multiple processors on a chip)
• First two seem to be (almost) exhausted: main trend is now
towards multicore
Memory
• Memory speed is often the limiting factor for HPC
applications
– keeping the CPU fed with data is the key to performance
• Memory is a substantial contributor to the cost of systems
– typical HPC systems have a few Gbytes of memory per processor
– technically possible to have much more than this, but it is too expensive and power-hungry
• Basic characteristics
– latency: how long you have to wait for data to arrive
– bandwidth: how fast it actually comes in
– ballpark figures: 100s of nanoseconds and a few Gbytes/s
Cache memory
• Memory latencies are very long
– 100s of processor cycles
– fetching data from main memory is 2 orders of magnitude slower than doing arithmetic
• Solution: introduce cache memory
– much faster than main memory
– ...but much smaller than main memory
– keeps copies of recently used data
• Modern systems use a hierarchy of two or three levels of
cache
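Caches reward programs that reuse recently fetched data, so the order in which an array is traversed matters. A minimal sketch of the two access patterns, in Python (illustrative only: pure-Python lists blunt the cache effect that dominates in C or Fortran, but the traversal orders are the ones that matter):

```python
def row_major_sum(matrix):
    # walk each row in order: consecutive elements, good spatial locality
    total = 0
    for row in matrix:
        for x in row:
            total += x
    return total

def column_major_sum(matrix):
    # stride down the columns: jumps between rows, poor locality
    total = 0
    for j in range(len(matrix[0])):
        for i in range(len(matrix)):
            total += matrix[i][j]
    return total

m = [[i * 4 + j for j in range(4)] for i in range(4)]
assert row_major_sum(m) == column_major_sum(m) == sum(range(16))
```

Both functions compute the same result; on large arrays in a compiled language the row-major version runs markedly faster because it streams through cache lines.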
Memory hierarchy
• Speed (and cost) increases, and capacity decreases, as you move up the hierarchy towards the CPU:
– Registers: ~1 KB, 1 cycle
– L1 cache: ~100 KB, 2-3 cycles
– L2 cache: ~1-10 MB, ~20 cycles
– L3 cache: ~10-50 MB, ~50 cycles
– Main memory: ~1 GB, ~300 cycles
Serial v Parallel Computers
• Serial computers are easier to program than parallel
computers
• …but there are limits on single processor performance
– physical: speed of light, uncertainty principle
– practical: design, manufacture
• Parallel computers dominate HPC because
– they allow highest performance
– they are more cost effective
• Achieving good performance requires
– high quality algorithms, decomposition and programming
Flynn's Taxonomy
• Classification of architectures by instruction stream and data
stream
• SISD: Single Instruction Single Data
– serial machines
• MISD: Multiple Instructions Single Data
– (probably) no real examples
• SIMD: Single Instruction Multiple Data
• MIMD: Multiple Instructions Multiple Data
SIMD Architecture
• Single Instruction Multiple Data
• Every processor synchronously executes same instructions
on different data
• Instructions issued by front-end
• Each processor has its own memory where it keeps its data
• Processors can communicate with each other
• Usually thousands of simple processors
• Examples:
– DAP, MasPar, CM200
SIMD Architecture
[Diagram: a front-end issues instructions to an array of processor-memory (P-M) pairs, which are joined by a network and connected to peripherals.]
MIMD Architecture
• Multiple Instructions Multiple Data
• Several independent processors capable of executing
separate programs
• Subdivision by relationship between processors and memory
Distributed Memory
• MIMD-DM: each processor has its own local memory
• Processors connected by some interconnect mechanism
• Processors communicate via explicit message passing
– effectively sending emails to each other
• Highly scalable architecture
– allows Massively Parallel Processing (MPP)
• Examples
– Cray XE, IBM BlueGene, workstation/PC clusters (Beowulf)
Distributed Memory
[Diagram: several processor-memory (P-M) pairs, each connected to a common interconnect.]
Distributed Memory
• Processors behave like distinct workstations
– each runs its own copy of the operating system
– no interaction except via the interconnect
• Pros
– adding processors increases memory bandwidth
– can grow to almost any size
• Cons
– scalability relies on good interconnect
– jobs are placed by user and remain on the same processors
– potential for high system management overhead
Shared Memory
• MIMD-SM
– each processor has access to a global memory store
• Communications via write/reads to memory
– caches are automatically kept up-to-date or coherent
• Simple to program (no explicit communications)
• Scaling is difficult because of memory access bottleneck
• Usually modest numbers of processors
Symmetric MultiProcessing
• Each processor in an SMP has equal access to all parts of memory
– same latency and bandwidth
• Examples
– IBM servers, Sun HPC Servers, multicore PCs
[Diagram: six processors connected by a single bus to a shared memory.]
Shared Memory
• Looks like a single machine to the user
– a single operating system covers all the processors
– the OS automatically moves jobs around the CPU cores
• Pros
– simple to use and maintain
– CC-NUMA architectures allow scaling to 100s of CPUs
• Cons
– potential problems with simultaneous access to memory
– sophisticated hardware required to maintain cache coherency
– scalability ultimately limited by this
Shared Memory Cluster
[Diagram: several SMP nodes, each with four processors sharing a local memory, connected by an interconnect.]
Shared Memory Clusters
• The technology pyramid…
• …encouraged clustering of SMP nodes
– i.e. the nodes of top-end systems are mid-range SMP servers
• Recent trend towards multicore processors
– low-end clusters and custom HPC systems now have SMP nodes
[Diagram: technology pyramid relating HPC systems, SMP servers and workstation clusters.]
Shared Memory Clusters
• Combine features of two architectures
– shared memory within a node
– distributed memory between nodes
• Pros
– constructed as a standard distributed-memory machine
– but with more powerful nodes
• Cons
– may be hard to take advantage of mixed architecture
– more complicated to understand performance
– combination of interconnect and memory system behaviour
• Examples
– clusters of Intel servers, Bull machines, …
– all modern PC clusters
HECToR: Cray XE6
• Built from 16-core AMD Interlagos CPUs
– each a mini 16-way SMP with internal bus
A bespoke Cray interconnect
• essentially a high-end SMP cluster
– network is a 3D torus, not a switch
[Diagram: a compute node with main memory attached at 12.8 GB/sec (direct-connect DDR 800), a 6.4 GB/sec direct-connect HyperTransport link, and the Cray SeaStar2+ interconnect.]
HECToR System Specifications cont.
• Cray XE6 parallel processors
• 2816 compute nodes, each containing two AMD 2.3 GHz 16-core Opteron processors => 90,112 cores
• Theoretical peak of 827 Tflops
• 32 GB main memory per node, shared between 32 cores => total memory of 90 TB
• 10 login nodes
• Gemini interconnect
• 12 IO nodes
Summary
• Flynn’s taxonomy looks somewhat dated
– SIMD likely to remain a niche market
• Wide variety of memory architectures for MIMD
– need to sub-classify by memory
• Many parallel systems based on commodity microprocessors
or clusters of SMPs
– … providing leverage with commercial products
• Parallel architectures appear to be the present and future of
HPC
Message Passing Model
• The message passing model is based on the notion of
processes
– can think of a process as an instance of a running program, together with the program’s data
• In the message passing model, parallelism is achieved by
having many processes co-operate on the same task
• Each process has access only to its own data
• Processes communicate with each other by sending and
receiving messages
Process Communication
[Diagram: process 1 holds a = 23 in its data and executes Send(2,a); process 2 executes Recv(1,b) followed by a = b + 1, leaving a = 24 in its own, separate data. Each process runs its own copy of the program and data.]
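This send/receive exchange can be sketched with Python's multiprocessing module standing in for a real message-passing library such as MPI. The Send/Recv operations here are Pipe calls, not MPI routines, and the function names are illustrative:

```python
from multiprocessing import Pipe, Process

def process_2(conn):
    b = conn.recv()       # Recv(1, b): block until process 1's message arrives
    a = b + 1             # a = b + 1; this 'a' is private to process 2
    conn.send(a)          # report the result back as another message

def run_exchange():
    here, there = Pipe()
    p2 = Process(target=process_2, args=(there,))
    p2.start()
    a = 23
    here.send(a)          # Send(2, a): ship the value 23 to process 2
    result = here.recv()  # process 2's private 'a' arrives as a message
    p2.join()
    return result

if __name__ == "__main__":
    print(run_exchange())  # 24
```

Note that each process only ever touches its own variables; the value 23 travels between address spaces solely as a message.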
Quantifying Performance
• Serial computing concerned with complexity
– how execution time varies with problem size N
– adding two arrays (or vectors) is O(N)
– matrix times vector is O(N²), matrix-matrix is O(N³)
• Look for clever algorithms
– naïve sort is O(N²)
– divide-and-conquer approaches are O(N log N)
• Parallel computing also concerned with scaling
– how time varies with number of processors P
– different algorithms can have different scaling behaviour
– but always remember that we are interested in minimum time!
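The complexity gap between a naïve and a divide-and-conquer sort can be seen directly by counting comparisons. A hedged sketch, using selection sort as the O(N²) example and merge sort as the O(N log N) one:

```python
def selection_sort(xs):
    # naive O(N^2) sort: always makes N(N-1)/2 comparisons
    xs, comps = list(xs), 0
    for i in range(len(xs)):
        m = i
        for j in range(i + 1, len(xs)):
            comps += 1
            if xs[j] < xs[m]:
                m = j
        xs[i], xs[m] = xs[m], xs[i]
    return xs, comps

def merge_sort(xs):
    # divide-and-conquer O(N log N) sort: at most ~N log2(N) comparisons
    if len(xs) <= 1:
        return list(xs), 0
    mid = len(xs) // 2
    left, cl = merge_sort(xs[:mid])
    right, cr = merge_sort(xs[mid:])
    merged, i, j, comps = [], 0, 0, cl + cr
    while i < len(left) and j < len(right):
        comps += 1
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged, comps

data = list(range(256, 0, -1))   # a reversed list of N = 256 elements
_, naive = selection_sort(data)
_, clever = merge_sort(data)
print(naive, clever)             # 32640 comparisons vs far fewer
```

For N = 256 the naïve sort always performs 256·255/2 = 32640 comparisons, while merge sort stays well under N log₂ N = 2048.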
Performance Measures
• T(N,P) is the time for size N on P processors
• Speedup: S(N,P) = T(N,1) / T(N,P)
– typically S(N,P) < P
• Parallel efficiency: E(N,P) = S(N,P) / P
– typically E(N,P) < 1
• Serial efficiency: E(N) = Tbest(N) / T(N,1)
– typically E(N) <= 1
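The standard definitions (speedup S(N,P) = T(N,1)/T(N,P), parallel efficiency E(N,P) = S(N,P)/P) translate directly into code; a minimal sketch, where the timings are made-up numbers rather than measurements:

```python
def speedup(t1, tp):
    # S(N, P) = T(N, 1) / T(N, P)
    return t1 / tp

def parallel_efficiency(t1, tp, p):
    # E(N, P) = S(N, P) / P
    return speedup(t1, tp) / p

# e.g. a job taking 100 s serially and 25 s on 8 processors:
print(speedup(100.0, 25.0))                 # 4.0
print(parallel_efficiency(100.0, 25.0, 8))  # 0.5
```

A speedup of 4 on 8 processors means half the machine's capacity is being wasted, which is why efficiency is the more honest measure.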
The Serial Component
• Amdahl’s law
“the performance improvement to be gained by parallelisation is limited by the proportion of the code which is serial”
Gene Amdahl, 1967
Amdahl’s law
• Assume a fraction α of the work is completely serial
– total time is the sum of the serial and the potentially parallel parts
• Parallel time, with the parallel part 100% efficient:
– T(N,P) = α T(N,1) + (1 - α) T(N,1) / P
• Parallel speedup
– S(N,P) = T(N,1) / T(N,P) = P / (α P + 1 - α)
– for α = 0, S = P as expected (i.e. E = 100%)
– otherwise, speedup is limited by 1/α for any P
– impossible to effectively utilise large parallel machines?
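A minimal numerical check of Amdahl's law (the serial fraction is written a in the slide, alpha below):

```python
def amdahl_speedup(alpha, p):
    # S(N, P) = 1 / (alpha + (1 - alpha) / P):
    # the serial fraction alpha is paid in full, the rest divides by P
    return 1.0 / (alpha + (1.0 - alpha) / p)

print(amdahl_speedup(0.0, 512))    # 512.0: perfect scaling when alpha = 0
print(amdahl_speedup(0.05, 512))   # ~19.3: close to the 1/0.05 = 20 limit
```

Even a 5% serial fraction caps the speedup at 20, no matter how many processors are thrown at the problem.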
Gustafson’s Law
• Need larger problems for larger numbers of CPUs
Utilising Large Parallel Machines
• Assume the parallel part is O(N) and the serial part is O(1)
– time: T(N,P) = α T(1,1) + (1 - α) T(1,1) N / P
– speedup: S(N,P) = (α + (1 - α) N) / (α + (1 - α) N / P)
• Scale the problem size with the number of CPUs, i.e. set N = P
– speedup: S(P,P) = α + (1 - α) P
– efficiency: E(P,P) = α / P + (1 - α)
• Maintain constant efficiency (1 - α) for large P
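A sketch of the scaled speedup under these assumptions (the serial fraction is written a in the slide, alpha below):

```python
def gustafson_speedup(alpha, p):
    # S(P, P) = alpha + (1 - alpha) * P, when problem size scales as N = P
    return alpha + (1.0 - alpha) * p

def gustafson_efficiency(alpha, p):
    # E(P, P) = alpha / P + (1 - alpha)  ->  (1 - alpha) for large P
    return gustafson_speedup(alpha, p) / p

print(gustafson_speedup(0.05, 512))     # ~486.5: grows with P, unlike Amdahl
print(gustafson_efficiency(0.05, 512))  # ~0.95: stays near 1 - alpha
```

The contrast with Amdahl's fixed-size picture is the point: by growing the problem with the machine, efficiency settles at 1 - α instead of collapsing.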
Scaling
• Real speed-up graphs
• Improving the load balance / algorithm increases the turn-over point to a higher number of processors
– better scaling = ability to utilise larger computers
[Graph: speed-up vs number of PEs (0-300), comparing the linear ideal against actual measured speed-up, which falls away from linear and turns over at larger PE counts.]
Summary
• Useful definitions
– Speed-up
– Efficiency
• Amdahl’s Law – “the performance improvement to be gained
by parallelisation is limited by the proportion of the code
which is serial”
• Gustafson’s Law – to maintain constant efficiency we need to
scale the problem size with the number of CPUs.