STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of...

9
STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: Instruction (single, multiple) per clock cycle Data used (single, multiple) per clock cycle Single Instruction Single Data: Serial computing Single Instruction Multiple Data: Multiple processors, GPU Multiple Instruction Single Data: Shared memory MIMD: Cluster computing, Multi-core CPU, Multi-threaded, Message-passing (IBM SP-x on hypercube, Intel single chip Xenon Phi: http://spectrum.ieee.org/semiconductors/processors/what-intels-xeon-phi- coprocessor-means-for-the-future-of-supercomputing )

Transcript of STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of...

Page 1: STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.)

• Parallelization

• Four types of computing:

– Instruction (single, multiple) per clock cycle

– Data used (single, multiple) per clock cycle

• Single Instruction Single Data: Serial computing

• Single Instruction Multiple Data: Multiple processors, GPU

• Multiple Instruction Single Data: Shared memory

• MIMD: Cluster computing, Multi-core CPU, Multi-threaded, Message-passing (IBM SP-x on hypercube, Intel single chip Xenon Phi: http://spectrum.ieee.org/semiconductors/processors/what-intels-xeon-phi-coprocessor-means-for-the-

future-of-supercomputing)

Page 2: STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

Grid Computing & Cloud

• Not necessarily parallel• Primary focus is the utilization of CPU-cycles across • Just networked CPU’s, but middle-layer software makes node

utilizations transparent• A major focus: avoid data transfer – run codes where data are• Another focus: load balancing• Message passing parallelization is possible: MPI, PVM, etc.• Community specific Grids: CERN, Bio-grid, Cardio-vascular grid,

etc.• Cloud: Data archiving focus, but really commercial versions of Grid,

CPU utilization is under-sold but coming up: expect service-oriented software business model to pick up

Page 3: STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

RAM Memory Utilization

• Two types feasible:

• Shared memory:

• Fast, possibly on-chip, no message passing time, no dependency on a ‘pipe’ and its possible failure

• But, consistency needs to be explicitly controlled, that may cause-deadlock, that needs deadlock checking-breaking mechanism adding overhead

• Distributed local memory:

• communication overhead

• ‘pipe’ failure possibility is a practical problem

• good model where threads are independent of each other

• most general model for parallelization

• easy to code, & well-established library (MPI)

• scaling up is easy – on-chip to over-the-globe

Page 4: STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

Threading Types

• Two types feasible:

• Static threading: OS controls, typically for single-core CPU’s (why would one do it? - OS),

but multi-core CPU’s use it if compiler guarantees safe execution

• Dynamic threading: Program controls explicitly, threads are created/destroyed as needed, parallel computing model

Page 5: STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

Multi-threaded Fibonacci Recursive

Fib (n)

1If n<=1 then return n;

else

2. x = Fib(n-1);

3. y = Fib(n-2);

4. return (x+y).

Complexity: O(Gn), where G is Golden ration ~1.6

Page 6: STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

Fibonacci Recursive

Fib (n)

1If n<=1 then return n;

else

2. x = Spawn Fib(n-1);

3. y = Fib(n-2);

4.Sync;

5. return (x+y).

Parallelization of threads is optional: scheduler decides (programmer, script translator, compiler, os)

Page 7: STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

• GPU-type parallelization’s ideal time ~critical path length• The more balanced the tree is the shorter the critical path

Spawn, or Data collection node is counted as time unit 1This is message passing

Note, GPU/SIMD uses different model:Each thread does same work (kernel), & Data goes to shared memory

Page 8: STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

Terminologies/Concepts

• For P available processor: Tinf , TP , T1 : no-limit to serial-processor

• Ideal parallelization: TP = T1 / P

• Real situation: TP >= T1 / P

• Tinf is theoretical minimum feasible, so, TP >= Tinf

• Speedup factor = T1 / P

• T1 / TP <= P

• Linear speedup: T1 / TP = O(P) [e.g. 3P +c]

• Perfect linear speedup: T1 / TP = P

• My preferred factor would be TP / T1 (inverse speedup: slowdown factor?)

– linear O(P); quadratic O(P2), …, exponential O(kP, k>1)

Page 9: STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

Terminologies/Concepts

• For P available processor: Tinf , TP , T1 : no limit to serial processor

• Parallelism factor: T1 / Tinf

– serial-time by ideal-parallelized-time

– note, this is about your algorithm,

• unoptimized over the actual configuration available to you

• T1 / Tinf < P implies NOT linear speedup

• T1 / Tinf << P implies processors are underutilized

• We want to be close to P: T1 / Tinf P, as in limit

• Slackness factor: (T1 / Tinf ) / P , or (T1 / Tinf P)

• We want slackness 1, minimum feasible

– i.e, we want no slack