PARALLEL PROCESSOR ORGANIZATIONS


Page 1: PARALLEL PROCESSOR ORGANIZATIONS

PARALLEL PROCESSOR ORGANIZATIONS

Jehan-François Pâris
[email protected]

Page 2: PARALLEL PROCESSOR ORGANIZATIONS

Chapter Organization

• Overview
• Writing parallel programs
• Multiprocessor organizations
• Hardware multithreading
• Alphabet soup (SISD, SIMD, MIMD, …)
• Roofline performance model

Page 3: PARALLEL PROCESSOR ORGANIZATIONS

OVERVIEW

Page 4: PARALLEL PROCESSOR ORGANIZATIONS

The hardware side

• Many parallel processing solutions
  – Multiprocessor architectures
    • Two or more microprocessor chips
    • Multiple architectures
  – Multicore architectures
    • Several processors on a single chip

Page 5: PARALLEL PROCESSOR ORGANIZATIONS

The software side

• Two ways for software to exploit the parallel processing capabilities of hardware
  – Job-level parallelism
    • Several sequential processes run in parallel
    • Easy to implement (the OS does the job!)
  – Process-level parallelism
    • A single program runs on several processors at the same time
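As a rough illustration of process-level parallelism, here is a minimal sketch in C that splits a vector sum across several POSIX threads; the thread count, array size, and helper names are illustrative, not from the slides.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4                  /* illustrative thread count */

static double a[N], partial[NTHREADS];

/* Each thread sums its own slice of the array. */
static void *sum_slice(void *arg) {
    long id = (long) arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    double total = 0.0;
    for (long i = 0; i < N; i++) a[i] = 1.0;
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, sum_slice, (void *) id);
    for (long id = 0; id < NTHREADS; id++) {
        pthread_join(t[id], NULL);
        total += partial[id];
    }
    printf("sum = %f\n", total);    /* 1000000.0 once all threads finish */
    return 0;
}

Built with -pthread, the same program spreads its threads over as many processors as the OS makes available.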

Page 6: PARALLEL PROCESSOR ORGANIZATIONS

WRITING PARALLEL PROGRAMS

Page 7: PARALLEL PROCESSOR ORGANIZATIONS

Overview

• Some problems are embarrassingly parallel
  – Many computer graphics tasks
  – Brute-force searches in cryptography or password guessing
• Much more difficult for other applications
  – Communication overhead among sub-tasks
  – Amdahl's law
  – Balancing the load

Page 8: PARALLEL PROCESSOR ORGANIZATIONS

Amdahl's Law

• Assume a sequential process takes

– tp seconds to perform operations that could be performed in parallel

– ts seconds to perform purely sequential operations

• The maximum speedup will be

(tp + ts )/ts
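To make the formula concrete, here is a minimal sketch in C (the function name and the sample timings are illustrative):

#include <stdio.h>

/* Speedup with n processors under Amdahl's law:
   the runtime drops from (tp + ts) to (tp/n + ts). */
double amdahl_speedup(double tp, double ts, double n) {
    return (tp + ts) / (tp / n + ts);
}

int main(void) {
    double tp = 90.0, ts = 10.0;   /* illustrative timings, in seconds */
    printf("n = 16: %.2f\n", amdahl_speedup(tp, ts, 16.0));   /* about 6.4 */
    printf("limit:  %.2f\n", (tp + ts) / ts);                 /* 10.0      */
    return 0;
}

Even with an unlimited number of processors, the speedup never exceeds (tp + ts)/ts.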

Page 9: PARALLEL PROCESSOR ORGANIZATIONS

Balancing the load

• Must ensure that workload is equally divided among all the processors

• Worst case is when one of the processors does much more work than all others

Page 10: PARALLEL PROCESSOR ORGANIZATIONS

Example (I)

• Computation partitioned among n processors
• One of them does 1/m of the work, with m < n
  – That processor becomes a bottleneck
• Maximum expected speedup: n
• Actual maximum speedup: m
  – The run cannot finish before the bottleneck processor has done its 1/m share, so the parallel time is at least 1/m of the sequential time

Page 11: PARALLEL PROCESSOR ORGANIZATIONS

Example (II)

• Computation partitioned among 64 processors
• One of them does 1/8 of the work
• Maximum expected speedup: 64
• Actual maximum speedup: 8
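A minimal sketch in C that reproduces these numbers (the function name is illustrative; it assumes the remaining work is spread evenly over the other n - 1 processors):

#include <stdio.h>

/* Speedup when one of n processors does a fraction 1/m of the work
   and the rest is spread evenly over the other n - 1 processors.
   Parallel time = max(1/m, (1 - 1/m)/(n - 1)) of the sequential time. */
double imbalanced_speedup(double n, double m) {
    double bottleneck = 1.0 / m;
    double others = (1.0 - 1.0 / m) / (n - 1.0);
    double t = bottleneck > others ? bottleneck : others;
    return 1.0 / t;
}

int main(void) {
    printf("%.1f\n", imbalanced_speedup(64.0, 8.0));  /* 8.0: the bottleneck dominates */
    return 0;
}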

Page 12: PARALLEL PROCESSOR ORGANIZATIONS

A last issue

• Humans like to address issues one after the other
  – We have meeting agendas
  – We do not like to be interrupted
  – We write sequential programs

Page 13: PARALLEL PROCESSOR ORGANIZATIONS

René Descartes

• Seventeenth-century French philosopher
• Invented
  – Cartesian coordinates
  – Methodical doubt

• [To] never accept anything for true which I did not clearly know to be such

• Proposed a scientific method based on four precepts

Page 14: PARALLEL PROCESSOR ORGANIZATIONS

Method's third rule

• The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.

Page 15: PARALLEL PROCESSOR ORGANIZATIONS

MULTIPROCESSOR ORGANIZATIONS

Page 16: PARALLEL PROCESSOR ORGANIZATIONS

Shared memory multiprocessors

[Diagram: several processing units (PU), each with its own cache, connected through an interconnection network to a shared RAM and I/O system]

Page 17: PARALLEL PROCESSOR ORGANIZATIONS

Shared memory multiprocessor

• Can offer
  – Uniform memory access to all processors (UMA)
    • Easiest to program
  – Non-uniform memory access to all processors (NUMA)
    • Can scale up to larger sizes
    • Offers faster access to nearby memory

Page 18: PARALLEL PROCESSOR ORGANIZATIONS

Computer clusters

[Diagram: several nodes, each with its own PU, cache, and private RAM, connected through an interconnection network]

Page 19: PARALLEL PROCESSOR ORGANIZATIONS

Computer clusters

• Very easy to assemble
• Can take advantage of high-speed LANs
  – Gigabit Ethernet, Myrinet, …
• Data exchanges must be done through message passing

Page 20: PARALLEL PROCESSOR ORGANIZATIONS

Message passing (I)

• If processor P wants to access data in the main memory of processor Q, it must
  – Send a request to Q
  – Wait for a reply
• For this to work, processor Q must have a thread
  – Waiting for messages from other processors
  – Sending them replies
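A minimal sketch of this request/reply exchange using MPI (the ranks, tags, and message layout are illustrative, not from the slides):

#include <mpi.h>
#include <stdio.h>

/* Rank 0 plays processor P, rank 1 plays processor Q.
   P asks Q for one element of an array that lives in Q's memory. */
int main(int argc, char **argv) {
    int rank, index = 42, value, data[100];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                                           /* processor P */
        MPI_Send(&index, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* request     */
        MPI_Recv(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* reply       */
        printf("P received data[%d] = %d\n", index, value);
    } else if (rank == 1) {                                    /* processor Q */
        for (int i = 0; i < 100; i++) data[i] = i * i;
        MPI_Recv(&index, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                   /* wait for the request */
        MPI_Send(&data[index], 1, MPI_INT, 0, 1, MPI_COMM_WORLD);  /* reply   */
    }
    MPI_Finalize();
    return 0;
}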

Page 21: PARALLEL PROCESSOR ORGANIZATIONS

Message passing (II)

• In a shared memory architecture, each processor can directly access all data
• A proposed solution
  – Distributed shared memory offers the users of a cluster the illusion of a single address space for their shared data
  – Still has performance issues

Page 22: PARALLEL PROCESSOR ORGANIZATIONS

When things do not add up

• Memory capacity is very important for big computing applications
  – If the data can fit into main memory, the computation will run much faster
• A company replaced
  – A single shared-memory computer with 32 GB of RAM

Page 23: PARALLEL PROCESSOR ORGANIZATIONS

A problem

• A company replaced
  – A single shared-memory computer with 32 GB of RAM
  – with four “clustered” computers with 8 GB of RAM each
• More I/O than ever
• What happened?

Page 24: PARALLEL PROCESSOR ORGANIZATIONS

The explanation

• Assume the OS occupies one GB of RAM
  – The old shared-memory computer still had 31 GB of free RAM
  – Each of the clustered computers has only 7 GB of free RAM
• The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!

Page 25: PARALLEL PROCESSOR ORGANIZATIONS

Grid computing

• The computers are distributed over a very large network
  – Sometimes computer time is donated
    • Volunteer computing
    • SETI@home
  – Works well with embarrassingly parallel workloads
    • Searches in an n-dimensional space

Page 26: PARALLEL PROCESSOR ORGANIZATIONS

HARDWARE MULTITHREADING

Page 27: PARALLEL PROCESSOR ORGANIZATIONS

General idea

• Let the processor switch to another thread of computation while the current one is stalled
• Motivation:
  – Increased cost of cache misses

Page 28: PARALLEL PROCESSOR ORGANIZATIONS

Implementation

• Entirely controlled by the hardware
  – Unlike multiprogramming
• Requires a processor capable of
  – Keeping track of the state of each thread
    • One set of registers (including the PC) for each concurrent thread
  – Quickly switching among concurrent threads

Page 29: PARALLEL PROCESSOR ORGANIZATIONS

Approaches

• Fine-grained multithreading
  – Switches between threads at each instruction
  – Provides the highest throughput
  – Slows down the execution of individual threads

Page 30: PARALLEL PROCESSOR ORGANIZATIONS

Approaches

• Coarse-grained multithreading
  – Switches between threads whenever a long stall is detected
  – Easier to implement
  – Cannot eliminate all stalls

Page 31: PARALLEL PROCESSOR ORGANIZATIONS

Approaches

• Simultaneous multithreading
  – Takes advantage of the ability of modern hardware to execute instructions from different threads in parallel
  – Best solution

Page 32: PARALLEL PROCESSOR ORGANIZATIONS

ALPHABET SOUP

Page 33: PARALLEL PROCESSOR ORGANIZATIONS

Overview

• Used to describe processor organizations where
  – The same instructions can be applied to
  – Multiple data instances
• Encountered in
  – Vector processors in the past
  – Graphics processing units (GPUs)
  – x86 multimedia extensions

Page 34: PARALLEL PROCESSOR ORGANIZATIONS

Classification

• SISD:
  – Single instruction, single data
  – Conventional uniprocessor architecture
• MIMD:
  – Multiple instructions, multiple data
  – Conventional multiprocessor architecture

Page 35: PARALLEL PROCESSOR ORGANIZATIONS

Classification

• SIMD:
  – Single instruction, multiple data
  – Performs the same operations on a set of similar data
    • Think of adding two vectors

for (i = 0; i < VECSIZE; i++)
    sum[i] = a[i] + b[i];
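As a rough illustration of how such SIMD operations are exposed by the x86 multimedia extensions, here is a sketch using SSE intrinsics (the array size and the assumption that VECSIZE is a multiple of 4 are illustrative):

#include <xmmintrin.h>   /* SSE intrinsics */

#define VECSIZE 1024     /* illustrative size, assumed to be a multiple of 4 */

/* Adds four floats at a time: one SIMD add per iteration. */
void vec_add(float *sum, const float *a, const float *b) {
    for (int i = 0; i < VECSIZE; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);              /* load 4 elements of a */
        __m128 vb = _mm_loadu_ps(&b[i]);              /* load 4 elements of b */
        _mm_storeu_ps(&sum[i], _mm_add_ps(va, vb));   /* sum[i..i+3]          */
    }
}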

Page 36: PARALLEL PROCESSOR ORGANIZATIONS

Vector computing

• A kind of SIMD architecture
  – Used by Cray computers
• Pipelines multiple executions of a single instruction with different data (“vectors”) through the ALU
• Requires
  – Vector registers able to store multiple values
  – Special vector instructions: say, lv, addv, …

Page 37: PARALLEL PROCESSOR ORGANIZATIONS

Benchmarking

• Two factors to consider
  – Memory bandwidth
    • Depends on the interconnection network
  – Floating-point performance
• The best-known benchmark is LINPACK

Page 38: PARALLEL PROCESSOR ORGANIZATIONS

Roofline model

• Takes into account
  – Memory bandwidth
  – Floating-point performance
• Introduces arithmetic intensity
  – Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory
  – Measured in FLOPs/byte

Page 39: PARALLEL PROCESSOR ORGANIZATIONS

Roofline model

• Attainable GFLOP/s = min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)
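A minimal sketch of the model in C (the peak numbers and the vector-add intensity of about 1 FLOP per 12 bytes are illustrative, not measured):

#include <stdio.h>

/* Roofline: attainable GFLOP/s is capped either by the memory system
   (bandwidth x arithmetic intensity) or by the peak floating-point rate. */
double attainable_gflops(double peak_bw_gb_s, double intensity_flops_per_byte,
                         double peak_fp_gflops) {
    double memory_bound = peak_bw_gb_s * intensity_flops_per_byte;
    return memory_bound < peak_fp_gflops ? memory_bound : peak_fp_gflops;
}

int main(void) {
    double peak_bw = 25.0;    /* GB/s, illustrative    */
    double peak_fp = 100.0;   /* GFLOP/s, illustrative */
    /* Single-precision vector add: 1 FLOP per 12 bytes moved (two loads and
       one store of 4-byte floats), i.e. about 0.083 FLOPs/byte. */
    printf("vector add: %.1f GFLOP/s\n",
           attainable_gflops(peak_bw, 1.0 / 12.0, peak_fp));   /* about 2.1 */
    printf("compute-heavy kernel: %.1f GFLOP/s\n",
           attainable_gflops(peak_bw, 8.0, peak_fp));           /* 100.0    */
    return 0;
}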

Page 40: PARALLEL PROCESSOR ORGANIZATIONS

Roofline model

[Roofline plot: attainable performance vs. arithmetic intensity; performance rises with intensity along the memory-bandwidth slope until it flattens at the peak floating-point performance. In the rising region, floating-point performance is limited by memory bandwidth]