Parallel Programming Concepts, Winter Term 2013 / 2014
Shared-Memory Hardware
Dr. Peter Tröger, M.Sc. Frank Feinbube

Page 2: Shared-Memory Hardware

■  Hardware architecture: Processor(s), memory system(s), data path(s)
□  Each component may become the performance bottleneck
□  Each component can be replicated
□  Each parallelization target must be handled separately
■  Modern processor
□  Multiple instructions in the same cycle
□  Multiple concurrent instruction streams per functional unit
□  Multiple functional units (cores)
■  Combination of multiple processors with one shared memory
■  Logical hardware setup as seen by the software; the physical hardware organization typically differs

Page 3: Shared-Memory Hardware

■  Instruction processing cycle:
□  Fetch instruction, update program counter (PC)
□  Decode instruction
□  Execute instruction
□  Write back result
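These four stages can also be mirrored in software. The following is a minimal sketch of the cycle for a toy machine; the two opcodes, the program and the single register are invented purely for illustration:

#include <stdio.h>

/* Toy ISA, invented for illustration only: HALT stops,
   ADD adds the operand to the single register r0. */
enum { HALT = 0, ADD = 1 };
typedef struct { int opcode; int operand; } Instr;

int main(void) {
    Instr program[] = { {ADD, 5}, {ADD, 7}, {HALT, 0} };
    int pc = 0;   /* program counter */
    int r0 = 0;   /* single register */
    for (;;) {
        Instr ir = program[pc];   /* fetch instruction ... */
        pc++;                     /* ... and update the PC */
        switch (ir.opcode) {      /* decode */
        case ADD:
            r0 += ir.operand;     /* execute, write back result */
            break;
        case HALT:
            printf("r0 = %d\n", r0);
            return 0;
        }
    }
}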

Page 4: Shared-Memory Hardware

■  Central Processing Units (CPUs) + volatile memory + I/O devices

■  Fetch instruction and execute it - typically memory access, computation, and / or I/O

[Stallings]

■  I/O devices and memory controller may interrupt the instruction processing

■  Improve processor utilization by asynchronous operations

Page 5: RISC vs. CISC

■  RISC - Reduced Instruction Set Computer
□  MIPS, ARM, DEC Alpha, Sparc, IBM 801, Power, etc.
□  Small number of instructions
□  Few data types in hardware
□  Constant instruction size, few addressing modes
□  Relies on optimization in software
■  CISC - Complex Instruction Set Computer
□  VAX, Intel X86, IBM 360/370, etc.
□  Large number of complex instructions, may take multiple cycles
□  Variable-length instructions
□  Smaller code size
□  Focus on optimization in hardware
■  RISC designs lend themselves to the exploitation of instruction-level parallelism

Page 6: Shared-Memory Hardware

■  Major constraints of memory are amount, speed, and costs
□  Faster access time results in greater costs per bit
□  Greater capacity results in smaller costs per bit
□  Greater capacity results in slower access
■  Going down the memory hierarchy
□  Decreasing costs per bit
□  Increasing capacity for fixed costs
□  Increasing access time
■  I/O devices provide non-volatile memory on the lower levels, which is an additional advantage

[http://tjliu.myweb.hinet.net/]

Page 7: Shared-Memory Hardware

■  Principle of Locality
□  Memory referenced by a processor (program and data) tends to cluster (e.g. loops, subroutines)
□  Operations on tables and arrays involve access to clustered data sets
□  Temporal locality: If a memory location is referenced, it will tend to be referenced again soon
□  Spatial locality: If a memory location is referenced, locations whose addresses are close by will tend to be referenced soon
■  Data should be organized so that the percentage of accesses to lower levels is substantially less than to the level above
■  Typically implemented by the caching concept

[Stallings]
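As an illustration of spatial locality (my example, assuming a row-major C array), the traversal order alone decides whether consecutive accesses stay within a cache line:

#include <stddef.h>

#define N 1024

/* Row-major traversal: consecutive iterations touch adjacent addresses,
   so most accesses hit the cache line loaded by the first access. */
long sum_row_major(int a[N][N]) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same data: consecutive iterations jump
   N*sizeof(int) bytes, so nearly every access touches a new cache line. */
long sum_col_major(int a[N][N]) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}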

Page 8: Shared-Memory Hardware

■  Caching
□  Offer a portion of lower-level memory as a copy in the faster, smaller memory
□  Leverages the principle of locality
□  Processor caches work in hardware, but must be considered by an operating system

[Stallings]

Page 9: Shared-Memory Hardware

■  Conflicting caching design goals
□  Cache size per level
□  Number of cache levels
□  Block size exchanged with lower-level memory
□  Replacement algorithm
□  Mapping function
□  Write policy for modified cache lines
■  All decisions are made by the hardware vendor, but must be considered by software
■  Cache-optimized software is needed when parallelization improvements start to depend on memory bottlenecks (see the sketch below)

[Stallings]
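A typical example of such cache-optimized software is loop blocking (tiling); the following sketch is mine, with the block size B assumed to fit the cache, shown for a matrix transpose:

#define N 2048
#define B 64   /* block (tile) size, tuned to the cache size */

/* Naive transpose: the accesses to dst stride through memory and
   evict cache lines before they can be reused. */
void transpose_naive(double dst[N][N], const double src[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dst[j][i] = src[i][j];
}

/* Blocked transpose: each B x B tile of src and dst stays cache-resident
   while it is processed, which improves locality. */
void transpose_blocked(double dst[N][N], const double src[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}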

Page 10: Parallel Processing

■  Inside the processor
□  Instruction-level parallelism (ILP)
□  Multicore
□  Shared memory
■  With multiple processing elements in one machine
□  Multiprocessing
□  Shared memory
■  With multiple processing elements in many machines
□  Multicomputer
□  Shared nothing (in terms of a globally accessible memory)

Page 11: Instruction-Level Parallelism

■  Hardware optimizes sequential instruction stream execution
□  Pipelining architecture
◊  Sub-steps of sequential instructions are overlapped in their execution to increase throughput
◊  Traditional concept in processor hardware design
◊  Relies on mechanisms such as branch prediction or the out-of-order execution of instructions
□  Superscalar architecture
◊  Execution of multiple instructions in parallel, based on redundant functional units of the processor
◊  Very Long Instruction Word (VLIW)
◊  Explicitly Parallel Instruction Computing (EPIC)
◊  SIMD vectorization support with special instructions

Page 12: Pipelining

Page 13: Pipelining

■  Pipelining overlaps the various stages of instruction execution
■  Fetch, decode and execute happen in parallel
□  Increases instruction throughput at the same clock speed
■  Analogous to the assembly line concept
■  Pipelining hazards
□  Temporal dependencies between sub-steps
□  Influence on speedup
□  Structural hazard: Multiple instructions access the same resource
□  Data hazard: Instruction needs the result of the previous instruction
□  Control hazard: Instruction result changes the control flow (interrupt, branch)

Page 14: Multi-Cycle Pipelining

Page 15: Pipelining Conflicts

■  Conflict resolution strategies
□  Replicate commonly needed hardware units
□  Insert NOPs to change the timing
□  Reorder the instructions to change the timing
◊  Write-after-read effect: Instruction uses a register value that is overwritten by a subsequent instruction
◊  Write-after-write effect: Two subsequent instructions write into the same register
◊  Read-after-write effect: Reliance on the previous result
□  Stall the pipeline until the conflict is resolved
■  Control conflicts are targeted by branch prediction
□  Static branch prediction ("forward never, backward ever") vs. dynamic branch prediction (based on previous jumps)
■  These issues are typically targeted by the compiler and the processor hardware (see the sketch below)
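The data-hazard case is also visible at the source level; in this small C sketch (my example, not from the slides), the first function forms a read-after-write chain, while the second consists of independent operations that the compiler and an out-of-order pipeline can overlap:

/* Read-after-write dependency chain: each statement needs the result
   of the previous one, so the multiplications cannot overlap. */
double chain(double x) {
    double a = x * 1.1;
    double b = a * 1.2;   /* waits for a */
    double c = b * 1.3;   /* waits for b */
    return c;
}

/* Independent operations: the three multiplications have no data
   dependencies, so instruction scheduling can overlap them and only
   the final additions must wait. */
double independent(double x, double y, double z) {
    double a = x * 1.1;
    double b = y * 1.2;
    double c = z * 1.3;
    return a + b + c;
}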

Page 16: Superscalar Architectures

Page 17: Superscalar Architectures

■  Good for problems with a high degree of regularity, such as graphics / image processing
■  Typically exploit data parallelism
■  Today: GPGPU computing, Cell processor, SSE, AltiVec

[Figures: ILLIAC IV (1974), Cray Y-MP, Thinking Machines CM-2 (1985), Fermi GPU]

Page 18: Superscalar Architectures

■  Vector instructions for high-level operations on data sets
■  Became famous with the Cray architecture in the 70's
■  Today, vector instructions are part of the standard instruction set
□  AltiVec
□  Streaming SIMD Extensions (SSE)
■  Example: Vector addition

vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;

movaps xmm0, address-of-v1        (xmm0 = v1.w | v1.z | v1.y | v1.x)
addps  xmm0, address-of-v2        (xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x)
movaps address-of-vec_res, xmm0

Page 19: Streaming SIMD Extensions (SSE)

■  Introduced by Intel with the Pentium III (1999)
■  Specifically designed for floating-point and vector operations
■  New 128-bit registers can be packed with four 32-bit scalars
□  The operation is performed simultaneously on all of them
■  Typical operations
□  Move data between SSE registers and 32-bit registers / memory
□  Add, subtract, multiply, divide, square root, maximum, minimum, reciprocal, compare, bitwise AND / OR / XOR
■  Available as compiler intrinsics
□  Functions known by the compiler that map to assembler
□  Better performance than with a linked library
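For illustration, the vector addition from the previous page could be written with SSE compiler intrinsics; a minimal sketch (my example, not from the slides) using the xmmintrin.h functions, with unaligned loads so it works for arbitrary float arrays:

#include <xmmintrin.h>   /* SSE intrinsics */

/* Adds two float[4] vectors: one load per operand, one packed
   single-precision add, one store. */
void vec_add(float vec_res[4], const float v1[4], const float v2[4]) {
    __m128 a = _mm_loadu_ps(v1);   /* a = v1.w | v1.z | v1.y | v1.x */
    __m128 b = _mm_loadu_ps(v2);
    __m128 c = _mm_add_ps(a, b);   /* packed add of all four lanes */
    _mm_storeu_ps(vec_res, c);
}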


Page 20: Other Instruction Set Extensions

■  Fused Multiply-Add instructions (FMA)
□  Supported in different variations by all processors
□  Floating-point multiply-add operation performed in one step
□  Improves speed and accuracy of product accumulation
◊  Scalar product
◊  Matrix multiplication
◊  Efficient software implementation of square root and division
■  Intel Advanced Vector Extensions (AVX)
□  Extension of the SSE instruction set
□  Introduced with the Sandy Bridge architecture (2011)
□  Registers are now 256 bit wide
□  512-bit support announced for the 2015 version of Xeon Phi
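In C, the fused multiply-add operation is exposed through the standard fma() function from <math.h> (compilers may also emit FMA instructions automatically); a small sketch of product accumulation, my example rather than slide material:

#include <math.h>

/* Dot product accumulated with fused multiply-add: each fma() call
   computes a[i] * b[i] + sum with a single rounding step, which a
   hardware FMA unit can execute as one instruction. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum = fma(a[i], b[i], sum);
    return sum;
}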


Page 21: Very Long Instruction Word (VLIW)

■  Very Long Instruction Word (VLIW), Fisher et al., 1980's
□  Compiler identifies instructions to be executed in parallel
□  One VLIW instruction encodes several operations (at least one for each redundant execution unit)
□  Less hardware complexity, higher compiler complexity
□  VLIW processors typically designed with multiple RISC execution units
□  Very popular in the embedded market and in GPU hardware
■  Explicitly Parallel Instruction Computing (EPIC)
□  Coined by the HP-Intel alliance since 1997
□  Foundational concept for the Intel Itanium architecture
□  Extended version of the VLIW concept
□  Turned out to be extremely difficult for compilers


Page 22: EPIC

■  64-bit register-rich explicitly-parallel architecture
■  Implements predication, speculation, and branch prediction

□  Hardware register renaming for parameter passing

□  Parallel execution of loops

■  Speculation, prediction, and renaming controlled by compiler

□  Each 128-bit instruction word contains three instructions

□  Stop-bits control parallel execution

□  Processor can execute six instructions per clock cycle

□  Thirty execution units for subsets of instruction set in eleven groups

□  Each unit executes at a rate of one instruction per cycle (unless stall)

□  Common instructions can be executed in multiple units


Page 23: Itanium – 30 Functional Units

Page 24: Simultaneous Multi-Threading (SMT)

■  Reasons for bad performance in superscalar architectures depend on application

■  Dynamically schedule the functional unit usage

■  Support multiple instruction streams in one pipeline


[Tullsen et al., 1995]

Page 25: Hyperthreading

■  Intel's implementation of simultaneous multi-threading (SMT)
■  Allows an execution core to function as two logical processors
■  Main goal is to reduce the number of related instructions being in the pipeline at the same time
■  Works nicely on a cache miss, branch misprediction, or data dependencies in one of the threads
■  Most core hardware resources are shared
□  Caches, execution units, buses
■  Each logical processor has its own architectural state
□  The register bank is mirrored
■  Mainly enables a very fast thread context switch in pure hardware
■  More than two logical threads per core would saturate the memory connection and pollute the caches


Page 26: Hyperthreading

[Intel]

Page 27: Parallel Processing

■  Inside the processor
□  Instruction-level parallelism (ILP)
□  Multicore
□  Shared memory
■  With multiple processing elements in one machine
□  Multiprocessing
□  Shared memory
■  With multiple processing elements in many machines
□  Multicomputer
□  Shared nothing (in terms of a globally accessible memory)

Page 28: Chip Multi-Processing

■  One integrated circuit die (socket) contains multiple computational engines (cores)
□  Called many-core or multi-core architecture
□  Cores share some / all cache levels and the memory connection
□  All other parts are dedicated per core (pipeline, registers, ...)
■  Increase in core count leads to resource contention problems with caches and memory
■  Besides Intel / AMD, also available with ARM, MIPS, PPC
■  Multi-core vs. SMP
□  SMP demands more replicated hardware (fans, bus, ...)
□  SMP is a choice, multi-core is given by default
□  Cores typically have a lower clock frequency
■  Multi-core and SMP programming problems are very similar
■  Recent trend towards heterogeneous cores

Page 29: Many-Core / Multi-Core

[Figures: Intel Core i7, SPARC64 VIIIfx]

Page 30: Parallel Processing

■  Inside the processor
□  Instruction-level parallelism (ILP)
□  Multicore
□  Shared memory
■  With multiple processing elements in one machine
□  Multiprocessing
□  Shared memory
■  With multiple processing elements in many machines
□  Multicomputer
□  Shared nothing (in terms of a globally accessible memory)

Page 31: Multiprocessor: Flynn's Taxonomy (1966)

■  Classify multiprocessor architectures along the instruction and the data processing dimension

[Figure panels, (C) Blaise Barney: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); Multiple Instruction, Multiple Data (MIMD)]

Page 32: Multiprocessor Systems

■  Symmetric Multiprocessing (SMP)
□  Set of equal processors in one system (more SM-MIMD than SIMD)
□  Traditionally a memory bus, today an on-chip network
□  Demands synchronization and operating system support
■  Asymmetric Multiprocessing (ASMP)
□  Specialized processors for I/O, interrupt handling or the operating system (DEC VAX 11, OS-360, IBM Cell processor)
□  Typically a master processor with main memory access, and slaves

Page 33: Symmetric Multi-Processing

■  Two or more processors in one system, can perform the same operations (symmetric)

■  Processors share the same main memory and all devices

■  Increased performance and scalability for multi-tasking

■  No master, any processor can cause another to reschedule

■  Challenges for an SMP operating system:

□  Reentrant kernel, scheduling policies, synchronization, memory re-use, ...

[Stallings]

Page 34: Shared Memory

■  All processors act independently and use the same global address space; changes in one memory location are visible to all others
■  Uniform memory access (UMA) system
□  Equal load and store access for all processors to all memory
□  Default approach for SMP systems of the past
■  Non-uniform memory access (NUMA) system
□  Groups of physical processors (called "nodes") that have local memory, connected by some interconnect
□  Still an SMP system (e.g. any processor can access all of memory), but node-local memory is faster
□  The OS tries to schedule closely related activities on the same node
□  Became the default model in shared-memory architectures
□  Cache-coherent NUMA (CC-NUMA) in hardware
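On Linux, applications can also request node-local placement explicitly. A minimal sketch (my illustration, assuming the libnuma C API is available; see numa(3), link with -lnuma):

#include <numa.h>     /* Linux libnuma */
#include <stdio.h>
#include <string.h>

/* Allocate a buffer on one NUMA node and pin the calling thread to the
   CPUs of that node, so subsequent accesses stay node-local. */
int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    int node = 0;                                /* target NUMA node */
    size_t size = 64 * 1024 * 1024;
    char *buf = numa_alloc_onnode(size, node);   /* node-local memory */
    if (buf == NULL)
        return 1;
    numa_run_on_node(node);                      /* run on that node's CPUs */
    memset(buf, 0, size);                        /* touch pages: local accesses */
    numa_free(buf, size);
    return 0;
}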

Page 35: UMA Example

Shared-Memory with UMA

■  Two dual-core chips (2 cores/socket)
■  P = Processor core
■  L1D = Level 1 Cache – Data (fastest)
■  L2 = Level 2 Cache (fast)
■  Memory = main memory (slow)
■  Chipset = enforces cache coherence and mediates connections to memory
■  UMA systems use a 'flat memory model': latencies and bandwidth are the same for all processors and all memory locations
■  Also called Symmetric Multiprocessing (SMP)

[3] Introduction to High Performance Computing for Scientists and Engineers

Page 36: NUMA Example

Shared-Memory with ccNUMA

■  Eight cores (4 cores/socket); L3 = Level 3 Cache
■  Memory interface = establishes a coherent link to enable one 'logical' single address space of 'physically distributed memory'
■  ccNUMA systems logically share memory that is physically distributed (similar to distributed-memory systems)
■  Network logic makes the aggregated memory appear as one single address space

[3] Introduction to High Performance Computing for Scientists and Engineers

Page 37: NUMA Example: Intel Nehalem

[Figure: Intel Nehalem system with four processors of four cores each; every processor has its own L3 cache, memory controller and local memory, and is connected to the other processors and to the I/O hubs via QPI links]

Page 38: CC-NUMA

■  Central crossbar (the "uncore") for the interaction of cores, memory controller and other processors via QPI
■  Similar approach by other vendors
■  Extended versions of the MESI cache coherence protocol are used for L3 management

[Schöne et al.]

Page 39: CC-NUMA

■  Cache coherency in a multi-core multi-socket system
□  Extended version of the traditional cache coherency problem in multi-socket SMP systems
■  Application of the extended MESI cache coherence protocol in QPI
□  Each cache line has one state
◊  Modified – Written by the local core
◊  Exclusive – First read by the local core
◊  Shared – Read by two cores (cache hit)
●  A write attempt in this state leads to cache invalidation
●  The new state is Modified
◊  Invalid – Cache line contains no valid data (read miss)
◊  Forwarding (new) – Direct L3 exchange of data
□  Can be optimized by snooping into other caches
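Cache coherence also has a directly visible software cost: when two threads write to different variables that happen to share a cache line, the line bounces between the cores' caches ("false sharing"). A minimal sketch of the usual padding fix (my illustration, assuming a 64-byte cache line; compile with -pthread):

#include <pthread.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Without the padding, both counters would typically share one cache
   line, and every increment by one thread would invalidate the line
   in the other core's cache. Padding gives each counter its own line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[2];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < 100000000L; i++)
        counters[id].value++;        /* each thread touches only its own line */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    return 0;
}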

Page 40: Hypertransport

■  Specification of an I/O interconnect, originally developed by AMD, Alpha and API Networks in 2001
■  Point-to-point unidirectional links between components
□  At least one host device (typically a processor)
□  Bridge functionality to PCI, PCI-X, PCI Express, ...
□  Tunnel devices connect a link to other HyperTransport (HT) devices
■  Extremely low overhead, suitable for inter-processor communication in SMP hardware

[hypertransport.org]

Page 41: Hypertransport

[hypertransport.org]

Page 42: Quick Path Interconnect (QPI)

■  Competing technology from Intel, since 2008
■  Result of a continuous improvement in Intel processor interconnect technology

(Excerpt from "An Introduction to the Intel® QuickPath Interconnect" [intel.com]:)

Figure 3. Shared Front-side Bus, up until 2004

To further increase the bandwidth of the front-side bus based platforms, the single-shared bus approach evolved into dual independent buses (DIB), as depicted in Figure 4. DIB designs essentially doubled the available bandwidth. However, all snoop traffic had to be broadcast on both buses, and if left unchecked, would reduce effective bandwidth. To minimize this problem, snoop filters were employed in the chipset to cache snoop information, thereby significantly reducing bandwidth loading.

Figure 4. Dual Independent Buses, circa 2005

The DIB approach was extended to its logical conclusion with the introduction of dedicated high-speed interconnects (DHSI), as shown in Figure 5. DHSI-based platforms use four FSBs, one for each processor in the platform. Again, snoop filters were employed to achieve bandwidth scaling.

Figure 5. Dedicated High-speed Interconnects, 2007

[Figures on slide: Traditional Shared Frontside Bus (until 2004), up to 4.2 GB/s platform bandwidth; Dual Independent Buses (until 2005), up to 12.8 GB/s platform bandwidth with snoop filter. Each diagram shows four processors attached to a chipset with memory interface and I/O.]

Page 43: Quick Path Interconnect (QPI)

[Figures on slide: Dedicated Interconnects (until 2007), up to 34 GB/s platform bandwidth with snoop filter; Quick Path Interconnect with per-processor memory interfaces and point-to-point links.]

(Excerpt continued from "An Introduction to the Intel® QuickPath Interconnect" [intel.com]:)

With the production of processors based on next-generation, 45-nm Hi-k Intel® Core™ microarchitecture, the Intel® Xeon® processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using Intel® QuickPath Interconnects. This configuration is shown in Figure 6. With its narrow uni-directional links based on differential signaling, the Intel® QuickPath Interconnect is able to achieve substantially higher signaling rates, thereby delivering the processor interconnect bandwidth necessary to meet the demands of future processor generations.

Figure 6. Intel® QuickPath Interconnect

Interconnect Overview

The Intel® QuickPath Interconnect is a high-speed point-to-point interconnect. Though sometimes classified as a serial bus, it is more accurately considered a point-to-point link as data is sent in parallel across multiple lanes and packets are broken into multiple parallel transfers. It is a contemporary design that uses some techniques similar to other point-to-point interconnects, such as PCI Express* and Fully-Buffered DIMMs. There are, of course, some notable differences between these approaches, which reflect the fact that these interconnects were designed for different applications. Some of these similarities and differences will be explored later in this paper.

Figure 7 shows a schematic of a processor with external Intel® QuickPath Interconnects. The processor may have one or more cores. When multiple cores are present, they may share caches or have separate caches. The processor also typically has one or more integrated memory controllers. Based on the level of scalability supported in the processor, it may include an integrated crossbar router and more than one Intel® QuickPath Interconnect port (a port contains a pair of uni-directional links).

Figure 7. Block Diagram of Processor with Intel® QuickPath Interconnects

[Figure 7 shows: processor cores, integrated memory controller(s) with memory interface, a crossbar router / non-routing global links interface, and the Intel® QuickPath interconnects; legend: bi-directional bus vs. uni-directional link.]

Page 44: Scalable Coherent Interface

■  ANSI / IEEE standard for a NUMA interconnect, used in the HPC world
□  64-bit global address space, translation by the SCI bus adapter (I/O window)
■  Used as a 2D / 3D torus

[Figure: Nodes consisting of processors (A, B / C, D) with caches and local memory, each attached through an SCI cache and SCI bridge to the interconnect]

Page 45: Theoretical Models for Parallel Hardware

■  Better to use a simplified parallel machine model than a real hardware specification for parallelization optimization
□  Allows theoretical investigation of algorithms
□  Allows generic optimization, regardless of products
□  Should improve algorithm robustness by avoiding optimizations for hardware layout specialties (e.g. network topology)
□  Became popular in the 70's and 80's, due to the large diversity in parallel hardware design
■  The resulting computational model is independent from the programming model for the implementation
■  Vast body of theoretical research results
■  Typically, formal models adapt to hardware developments

Page 46: (Parallel) Random Access Machine

■  RAM assumptions: Constant memory access time, unlimited memory
■  PRAM assumptions: Non-conflicting shared bus, no assumption on synchronization support, unlimited number of processors
■  Alternative models: BSP, LogP

[Figures: RAM (one CPU connected to input, memory and output); PRAM (multiple CPUs attached via a shared bus to input, memory and output)]

Page 47: PRAM Extensions

■  Rules for memory interaction to classify hardware support of a PRAM algorithm
■  Memory access assumed to be in lockstep (synchronous PRAM)
■  Concurrent Read, Concurrent Write (CRCW)
□  Multiple tasks may read from / write to the same location at the same time, can be simulated with EREW
■  Concurrent Read, Exclusive Write (CREW)
□  One task may write to a given memory location at any time
■  Exclusive Read, Concurrent Write (ERCW)
□  One task may read from a given memory location at any time
■  Exclusive Read, Exclusive Write (EREW)
□  One task may read from / write to a memory location at any time, memory management must know the concurrency

Page 48: PRAM Extensions

■  The concurrent write scenario needs further specification by the algorithm
□  Ensure that the same value is written
□  Selection of an arbitrary value from the parallel write attempts
□  Priority of the written value derived from the processor ID
□  Store the result of a combining operation (e.g. sum) into memory
■  A PRAM algorithm can act as the starting point for a real implementation
□  Unlimited resource assumption
□  Allows mapping 'logical' PRAM processors to a restricted number of physical processors
□  Enables the design of scalable algorithms based on the unlimited memory assumption
□  Focus only on concurrency opportunities; synchronization and communication come later

Page 49: Example: Parallel Sum

■  The general parallel sum operation works with any associative and commutative combining operation
□  Multiplication, maximum, minimum, logical operations, …
■  PRAM solution
□  Build a binary tree, with the input data items as leaf nodes
□  Internal nodes hold the sum, the root node holds the global sum
□  Additions on one level are independent from each other
■  PRAM algorithm
◊  One processor per leaf node, in-place summation
◊  Computation in O(log2 n)
■  Sequential version for reference:

int sum = 0;
for (int i = 0; i < N; i++) {
    sum += A[i];
}
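On real shared-memory hardware this reduction is usually written with a work-sharing construct; a minimal OpenMP sketch (my example, not from the slides; compile with -fopenmp):

/* Shared-memory parallel sum: each thread accumulates a private partial
   sum, and the reduction clause combines the partial sums at the end. */
long parallel_sum(const int *A, int N) {
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += A[i];
    return sum;
}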

Page 50: Example: Parallel Sum

■  Example, n=8:
□  l=1: Partial sums in X[1], X[3], X[5], X[7]
□  l=2: Partial sums in X[3] and X[7]
□  l=3: Parallel sum result in X[7]
■  Correctness relies on the PRAM lockstep assumption (no synchronization)
■  PRAM pseudocode:

for all l levels (1..log2 n) {
    for all i items (0..n-1) {
        if (((i+1) mod 2^l) = 0) then
            X[i] := X[i - 2^(l-1)] + X[i]
    }
}
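A sequential C transcription of this PRAM algorithm (my sketch, assuming n is a power of two; in a real PRAM all updates of one level would run in lockstep on separate processors):

/* Tree-based parallel sum as above, simulated sequentially:
   after log2(n) levels, X[n-1] holds the total sum. */
long tree_sum(long *X, int n) {
    for (int stride = 1; stride < n; stride *= 2)       /* level l: stride = 2^(l-1) */
        for (int i = 2 * stride - 1; i < n; i += 2 * stride)
            X[i] = X[i - stride] + X[i];                /* independent within a level */
    return X[n - 1];
}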

Page 51: Bulk-Synchronous Parallel (BSP) Model

■  Leslie G. Valiant. A Bridging Model for Parallel Computation, 1990
■  Success of the von Neumann model
□  Bridge between hardware and software
□  High-level languages can be efficiently compiled on this model
□  Hardware designers can optimize the realization of this model
■  Similar model for parallel machines
□  Should be neutral about the number of processors
□  Program should be written for v virtual processors that are mapped to p physical ones
□  When v >> p, the compiler has options
■  A BSP computation consists of a series of supersteps (see the sketch below):
□  1.) Concurrent computation on all processors
□  2.) Exchange of data between all processes
□  3.) Barrier synchronization
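On shared-memory hardware the superstep structure can be caricatured with a barrier; the following is a minimal sketch (my illustration using POSIX threads rather than a BSP library; the function and variable names are invented):

#include <pthread.h>

#define P 4                     /* number of (virtual) processors */
#define SUPERSTEPS 10

static pthread_barrier_t barrier;
static double local_state[P];   /* per-process data */
static double inbox[P];         /* data delivered between supersteps */

static void *bsp_process(void *arg) {
    int pid = *(int *)arg;
    for (int step = 0; step < SUPERSTEPS; step++) {
        local_state[pid] += pid + step;          /* 1.) local computation */
        inbox[(pid + 1) % P] = local_state[pid]; /* 2.) post data for a neighbour */
        pthread_barrier_wait(&barrier);          /* 3.) barrier ends the superstep */
        local_state[pid] += inbox[pid];          /* inbox is now safe to read */
        pthread_barrier_wait(&barrier);          /* keep reads and next writes apart */
    }
    return NULL;
}

int main(void) {
    pthread_t threads[P];
    int ids[P];
    pthread_barrier_init(&barrier, NULL, P);
    for (int i = 0; i < P; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, bsp_process, &ids[i]);
    }
    for (int i = 0; i < P; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}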

Page 52: Bulk-Synchronous Parallel (BSP) Model

■  Costs of a superstep depend on
□  The costs for the slowest computation
□  The costs for communication between all processes
□  The costs for barrier synchronization
■  Algorithm costs relate to the sum of all superstep costs
■  Synchronization may only happen for some processes
□  Long-running serial tasks are not slowed down from the model perspective
■  Recent industrial uptake with Pregel and the ML language
■  The Apache Hama project implements BSP on top of Hadoop

Page 53: Bulk-Synchronous Parallel (BSP) Model

■  A bulk-synchronous parallel computer (BSPC) is defined by:
□  Components, each performing processing and / or memory functions
□  A router that delivers messages between pairs of components
□  Facilities to synchronize components at regular intervals L (periodicity)
■  A computation consists of a number of supersteps
□  Every L time units, a global check is made whether the superstep is completed
■  The router concept separates the computation and communication aspects, and models memory / storage access explicitly
■  L is controlled by the application, even at run-time

Page 54: LogP [Culler et al., 1993]

■  Criticism of the over-simplification in PRAM-based approaches, which encourages the exploitation of 'formal loopholes' (e.g. in communication)
■  Trend towards multicomputer systems with large local memories
■  Characterization of a parallel machine by:
□  P: Number of processors
□  g (gap): Minimum time between two consecutive transmissions
◊  Its reciprocal corresponds to the per-processor communication bandwidth
□  L (latency): Upper bound on messaging time
□  o (overhead): Exclusive processor time needed for a send / receive operation
■  L, o, g are given in multiples of processor cycles

Page 55: LogP Architecture Model

Page 56: LogP

■  Algorithm must produce correct results under all message interleavings, prove space and time demands of processors
■  Simplifications
□  With infrequent communication, bandwidth limits (g) are not relevant
□  With streaming communication, latency (L) may be disregarded
■  Convenient approximation: Increase overhead (o) to be as large as gap (g)
■  Encourages careful scheduling of computation, and overlapping of computation and communication
■  Can be mapped to shared-memory and shared-nothing machines
□  Reading a remote location requires 2L + 4o processor cycles
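As a worked illustration of the last point (my numbers, not from the slides), the remote read decomposes into send overhead, wire latency and receive overhead for the request and again for the reply:

\[
T_{\mathrm{remote\;read}} = \underbrace{(o + L + o)}_{\text{request}} + \underbrace{(o + L + o)}_{\text{reply}} = 2L + 4o
\]

For example, with L = 10 cycles and o = 2 cycles, this gives 2·10 + 4·2 = 28 processor cycles.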

Page 57: LogP

■  Matching the model to real machines
□  Saturation effects: Latency increases as a function of the network load, with a sharp increase at the saturation point; captured by a capacity constraint
□  The internal network structure is abstracted, so 'good' vs. 'bad' communication patterns are not distinguished; can be modeled by multiple g's
□  LogP does not model specialized hardware communication primitives; all are mapped to send / receive operations
□  Separate network processors can be explicitly modeled
■  The model defines a 4-dimensional parameter space of machines
□  A vendor product line can be identified by a curve in this space

Page 58: LogP – Optimal Broadcast Tree

Page 59: LogP – Optimal Summation