Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture (LoPRAM) Alejandro Salinger...
Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture
(LoPRAM)
Alejandro Salinger
Cheriton School of Computer Science
University of Waterloo
Joint work with Alejandro López-Ortiz and Reza Dorrigiv
Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture (LoPRAM) - A. Salinger 2
Multicore Challenge
• The RAM model will no longer accurately reflect the architecture on which algorithms are executed.
• The PRAM facilitates design and analysis; however:
  – It is unrealistic.
  – It is difficult to derive work-optimal algorithms for Θ(n) processors.
• 2, 4, or 8 cores per chip: low-degree parallelism.
• Thread-based parallelism.
Multicore Challenge
• Design a model that:
  – Reflects the available degree of parallelism.
  – Is multi-threaded.
  – Allows easy theoretical analysis.
  – Is easy to program.

“Programmability has now replaced power as the number one impediment to the continuation of Moore’s law” [Gartner]
The LoPRAM Model
• The number of cores is not a constant: it is modeled as O(log n).
• Similar to bit-level parallelism with a w = O(log n)-bit word.

LoPRAM:
• A PRAM with p = O(log n) processors running in MIMD mode.
• Concurrent Read, Exclusive Write (CREW).
• Simplest form: high-level thread-based parallelism.
• Semaphores and automatic serialization are available and transparent to the programmer.
• p = O(log n) but not p = Θ(log n).
PAL-threads

void mergeSort(int numbers[], int temp[], int array_size)
{
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right)
{
    int mid = (right + left) / 2;
    if (right > left) {
        m_sort(numbers, temp, left, mid);
        m_sort(numbers, temp, mid + 1, right);
        merge(numbers, temp, left, mid + 1, right);
    }
}
PAL-threads

void mergeSort(int numbers[], int temp[], int array_size)
{
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right)
{
    int mid = (right + left) / 2;
    if (right > left) {
        palthreads {  // do in parallel if possible
            m_sort(numbers, temp, left, mid);
            m_sort(numbers, temp, mid + 1, right);
        }             // implicit join
        merge(numbers, temp, left, mid + 1, right);
    }
}
[Figure: thread states: pending, active, waiting]
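The palthreads construct is part of the LoPRAM proposal, not an existing library. As a rough illustration only, the fork/join behavior it describes can be approximated with POSIX threads: spawn a thread for one recursive call near the top of the recursion tree, sort the other half in the current thread, and join before merging. The depth cap and helper names below are my own assumptions, not part of the model.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

typedef struct { int *a, *tmp; int left, right, depth; } Task;

/* Merge a[left..mid-1] and a[mid..right] via tmp, then copy back. */
static void merge(int *a, int *tmp, int left, int mid, int right) {
    int i = left, j = mid, k = left;
    while (i < mid && j <= right) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid)   tmp[k++] = a[i++];
    while (j <= right) tmp[k++] = a[j++];
    memcpy(a + left, tmp + left, (size_t)(right - left + 1) * sizeof(int));
}

static void *m_sort(void *arg) {
    Task *t = arg;
    if (t->right > t->left) {
        int mid = (t->left + t->right) / 2;
        Task lo = { t->a, t->tmp, t->left,  mid,      t->depth + 1 };
        Task hi = { t->a, t->tmp, mid + 1,  t->right, t->depth + 1 };
        if (t->depth < 3) {                 /* spawn only near the root */
            pthread_t th;
            pthread_create(&th, NULL, m_sort, &lo);
            m_sort(&hi);                    /* current thread takes the other half */
            pthread_join(th, NULL);         /* the "implicit join" of palthreads */
        } else {                            /* deep levels run sequentially */
            m_sort(&lo);
            m_sort(&hi);
        }
        merge(t->a, t->tmp, t->left, mid + 1, t->right);
    }
    return NULL;
}

void mergeSort(int numbers[], int temp[], int array_size) {
    Task root = { numbers, temp, 0, array_size - 1, 0 };
    m_sort(&root);
}
```

Capping the spawn depth keeps the number of live threads small and roughly fixed, mirroring the model's assumption of p = O(log n) processors rather than Θ(n).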
Work-Optimal Algorithms: Divide & Conquer
• Recursive divide-and-conquer algorithms with time given by:
      T(n) = a T(n/b) + f(n)
• By the master theorem:
      T(n) = Θ(n^(log_b a))          if f(n) = O(n^(log_b a − ε))
      T(n) = Θ(n^(log_b a) · log n)  if f(n) = Θ(n^(log_b a))
      T(n) = Θ(f(n))                 if f(n) = Ω(n^(log_b a + ε))
Divide & Conquer
• Parallel master theorem in the LoPRAM:
      Tp(n) = Θ(T(n)/p)  in the first two cases
      Tp(n) = Θ(f(n))    in the third case
• If we assume parallel merging, the third case becomes Tp(n) = Θ(f(n)/p).
• Optimal speedup [i.e., Tp(n) = Θ(T(n)/p)] as long as p = O(log n).
Matrix Multiplication
      T(n) = 7T(n/2) + O(n²), so T(n) = O(n^2.8)
      Tp(n) = O(n^2.8 / p)
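This is Strassen's recurrence. As a quick sanity check of the master-theorem bound (the case where n^(log₂ 7) dominates the O(n²) combine cost), one can evaluate the recurrence numerically and confirm that doubling n multiplies T by roughly 2^(log₂ 7) = 7. The function below is an illustrative sketch of mine, not from the slides.

```c
#include <assert.h>

/* Evaluate T(n) = 7 T(n/2) + n^2 exactly for powers of two.
   The master theorem predicts T(n) = Theta(n^(log2 7)), with
   log2 7 ~ 2.81, so T(2n)/T(n) should approach 7. */
static double strassen_T(int n) {
    if (n <= 1) return 1.0;  /* base case: constant work */
    return 7.0 * strassen_T(n / 2) + (double)n * (double)n;
}
```

For n = 1024 the ratio strassen_T(1024) / strassen_T(512) is already close to 7, matching the predicted growth rate.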
Dynamic programming
• A generic parallel algorithm that exploits the parallelism of the execution DAG.
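As a concrete (hypothetical) instance of such a DAG, consider the edit-distance dynamic program: cell (i, j) depends only on (i-1, j), (i, j-1), and (i-1, j-1), so all cells on one anti-diagonal are independent and could be distributed among the p processors. The sketch below traverses the table in that DAG order sequentially, just to make the available parallelism explicit; it is my own example, not taken from the slides.

```c
#include <assert.h>
#include <string.h>

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Classic edit distance, computed anti-diagonal by anti-diagonal.
   All cells with the same d = i + j are independent of each other,
   so each inner-loop iteration could be a parallel task. */
int edit_distance(const char *s, const char *t) {
    int n = (int)strlen(s), m = (int)strlen(t);
    static int D[64][64];                 /* small fixed table for the sketch */
    for (int i = 0; i <= n; i++) D[i][0] = i;
    for (int j = 0; j <= m; j++) D[0][j] = j;
    for (int d = 2; d <= n + m; d++)      /* sweep anti-diagonals in DAG order */
        for (int i = 1; i <= n; i++) {    /* independent cells on diagonal d */
            int j = d - i;
            if (j < 1 || j > m) continue;
            int cost = (s[i-1] == t[j-1]) ? 0 : 1;
            D[i][j] = min3(D[i-1][j] + 1, /* deletion  */
                           D[i][j-1] + 1, /* insertion */
                           D[i-1][j-1] + cost); /* substitution/match */
        }
    return D[n][m];
}
```

Each anti-diagonal of an n × n table has up to n cells, so with p = O(log n) processors every diagonal parallelizes fully and the table still completes in Θ(n²/p) time.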
Conclusions
• Computers have a small number of processors.
• The assumption that p = O(log n), or even O(log² n), will remain realistic for a while.
• Designing work-optimal algorithms for a small number of processors is easy.