Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture (LoPRAM)

Alejandro Salinger
Cheriton School of Computer Science

University of Waterloo

Joint work with Alejandro López-Ortiz and Reza Dorrigiv


Multicore Challenge

• The RAM model will no longer accurately reflect the architecture on which algorithms are executed.

• PRAM facilitates design and analysis; however:
  – Unrealistic.
  – Difficult to derive work-optimal algorithms for Θ(n) processors.
• 2, 4, or 8 cores per chip: low-degree parallelism.
• Thread-based parallelism.


Multicore Challenge

• Design a model such that it:
  – Reflects the available degree of parallelism.
  – Is multi-threaded.
  – Allows easy theoretical analysis.
  – Is easy to program.

“Programmability has now replaced power as the number one impediment to the continuation of Moore’s law” [Gartner]


The LoPRAM Model

• The number of cores is not a constant: it is modeled as O(log n).
• Similar to bit-level parallelism, where the word size is w = O(log n) bits.

LoPRAM:
• A PRAM with p = O(log n) processors running in MIMD mode.
• Concurrent Read, Exclusive Write (CREW).
• Simplest form: high-level thread-based parallelism.
• Semaphores and automatic serialization are available and transparent to the programmer.
• p = O(log n) but not p = Θ(log n): algorithms must achieve optimal speedup for every p = O(log n).


PAL-threads

void mergeSort(int numbers[], int temp[], int array_size)
{
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right)
{
    int mid = (right + left) / 2;
    if (right > left) {
        m_sort(numbers, temp, left, mid);
        m_sort(numbers, temp, mid + 1, right);
        merge(numbers, temp, left, mid + 1, right);
    }
}


PAL-threads

void mergeSort(int numbers[], int temp[], int array_size)
{
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right)
{
    int mid = (right + left) / 2;
    if (right > left) {
        palthreads {            // do in parallel if possible
            m_sort(numbers, temp, left, mid);
            m_sort(numbers, temp, mid + 1, right);
        }                       // implicit join
        merge(numbers, temp, left, mid + 1, right);
    }
}

[Figure: thread states: pending, active, waiting]


Work-Optimal Algorithms: Divide & Conquer

• Recursive divide-and-conquer algorithms with time given by:

      T(n) = a·T(n/b) + f(n)

• By the master theorem:

  1. If f(n) = O(n^(log_b a - ε)) for some ε > 0, then T(n) = Θ(n^(log_b a)).
  2. If f(n) = Θ(n^(log_b a)), then T(n) = Θ(n^(log_b a) · log n).
  3. If f(n) = Ω(n^(log_b a + ε)) for some ε > 0, and a·f(n/b) ≤ c·f(n) for some c < 1, then T(n) = Θ(f(n)).



Divide & Conquer

• Parallel master theorem on the LoPRAM:

  1. If f(n) = O(n^(log_b a - ε)) for some ε > 0, then Tp(n) = Θ(n^(log_b a)/p).
  2. If f(n) = Θ(n^(log_b a)), then Tp(n) = Θ((n^(log_b a) · log n)/p).
  3. If f(n) = Ω(n^(log_b a + ε)) for some ε > 0, then Tp(n) = Θ(f(n)) when the merge step is sequential.

If we assume parallel merging, the third case becomes Tp(n) = Θ(f(n)/p).

Optimal speedup [i.e., Tp(n) = Θ(T(n)/p)] so long as p = O(log n).


Matrix Multiplication

T(n) = 7·T(n/2) + O(n^2)
T(n) = O(n^2.8)

Tp(n) = O(n^2.8/p)
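These bounds are case 1 of the sequential and parallel master theorems applied to Strassen's recurrence; spelling out the instantiation:

```latex
a = 7, \quad b = 2, \quad f(n) = O(n^2), \quad \log_b a = \log_2 7 \approx 2.807
% f(n) = O(n^{\log_2 7 - \varepsilon}) holds with, e.g., \varepsilon = 0.8, so case 1 applies:
T(n) = \Theta\!\left(n^{\log_2 7}\right) = O(n^{2.81})
% and, by case 1 of the parallel master theorem, for p = O(\log n):
T_p(n) = \Theta\!\left(n^{\log_2 7} / p\right)
```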


Dynamic programming

• A generic parallel algorithm exploits the parallelism available in the dependency DAG of the dynamic program: entries whose dependencies are already computed can be filled in parallel.


Conclusions

• Computers have a small number of processors.

• The assumption that p = O(log n), or even O(log^2 n), will remain realistic for a while.

• Designing work-optimal algorithms for a small number of processors is easy.