Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture (LoPRAM)

Alejandro Salinger
Cheriton School of Computer Science

University of Waterloo

Joint work with Alejandro López-Ortiz and Reza Dorrigiv


Multicore Challenge

• The RAM model will no longer accurately reflect the architecture on which algorithms are executed.

• PRAM facilitates design and analysis; however:
  – Unrealistic.
  – Difficult to derive work-optimal algorithms for Θ(n) processors.
• 2, 4, or 8 cores per chip: low-degree parallelism.
• Thread-based parallelism.


Multicore Challenge

• Design a model such that it:
  – Reflects the available degree of parallelism.
  – Is multi-threaded.
  – Allows easy theoretical analysis.
  – Is easy to program.

“Programmability has now replaced power as the number one impediment to the continuation of Moore’s law” [Gartner]


The LoPRAM Model

• The number of cores is not a constant: it is modeled as O(log n).
• Similar to bit-level parallelism, where the word size is w = O(log n) bits.

LoPRAM:
• A PRAM with p = O(log n) processors running in MIMD mode.
• Concurrent Read, Exclusive Write (CREW).
• Simplest form: high-level thread-based parallelism.
• Semaphores and automatic serialization are available and transparent to the programmer.
• p = O(log n) but not p = Θ(log n): algorithms must achieve optimal speedup for every p = O(log n).


PAL-threads

void mergeSort(int numbers[], int temp[], int array_size)
{
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right)
{
    int mid = (right + left) / 2;
    if (right > left) {
        m_sort(numbers, temp, left, mid);
        m_sort(numbers, temp, mid + 1, right);
        merge(numbers, temp, left, mid + 1, right);
    }
}


PAL-threads

void mergeSort(int numbers[], int temp[], int array_size)
{
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right)
{
    int mid = (right + left) / 2;
    if (right > left) {
        palthreads {            // do in parallel if possible
            m_sort(numbers, temp, left, mid);
            m_sort(numbers, temp, mid + 1, right);
        }                       // implicit join
        merge(numbers, temp, left, mid + 1, right);
    }
}

[Figure: thread states: pending, active, waiting]


Work-Optimal Algorithms: Divide & Conquer

• Recursive divide-and-conquer algorithms with time given by:

      T(n) = a·T(n/b) + f(n)

• By the master theorem:

  1. If f(n) = O(n^(log_b a - ε)) for some ε > 0, then T(n) = Θ(n^(log_b a)).
  2. If f(n) = Θ(n^(log_b a)), then T(n) = Θ(n^(log_b a) · log n).
  3. If f(n) = Ω(n^(log_b a + ε)) for some ε > 0, and a·f(n/b) ≤ c·f(n) for some c < 1, then T(n) = Θ(f(n)).



Divide & Conquer

• Parallel master theorem on the LoPRAM:

  1. If f(n) = O(n^(log_b a - ε)) for some ε > 0, then Tp(n) = Θ(n^(log_b a)/p).
  2. If f(n) = Θ(n^(log_b a)), then Tp(n) = Θ((n^(log_b a) · log n)/p).
  3. If f(n) = Ω(n^(log_b a + ε)) for some ε > 0, then Tp(n) = Θ(f(n)) when the merge step is sequential.

If we assume parallel merging, the third case becomes Tp(n) = Θ(f(n)/p).

Optimal speedup [i.e., Tp(n) = Θ(T(n)/p)] so long as p = O(log n).


Matrix Multiplication

T(n) = 7·T(n/2) + O(n^2)
T(n) = O(n^2.8)

Tp(n) = O(n^2.8/p)
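These bounds are case 1 of the sequential and parallel master theorems applied to Strassen's recurrence; spelling out the instantiation:

```latex
a = 7, \quad b = 2, \quad f(n) = O(n^2), \quad \log_b a = \log_2 7 \approx 2.807
% f(n) = O(n^{\log_2 7 - \varepsilon}) holds with, e.g., \varepsilon = 0.8, so case 1 applies:
T(n) = \Theta\!\left(n^{\log_2 7}\right) = O(n^{2.81})
% and, by case 1 of the parallel master theorem, for p = O(\log n):
T_p(n) = \Theta\!\left(n^{\log_2 7} / p\right)
```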


Dynamic programming

• A generic parallel algorithm exploits the parallelism available in the dependency DAG of the dynamic program: entries whose dependencies are already computed can be filled in parallel.


Conclusions

• Computers have a small number of processors.

• The assumption that p = O(log n), or even O(log^2 n), will remain realistic for a while.

• Designing work-optimal algorithms for a small number of processors is easy.