Introduction to Parallel Processing


Introduction to Parallel Processing

Dr. Guy Tel-Zur
Lecture 10

Agenda

• Administration
• Final presentations
• Demos
• Theory
• Next week plan
• Home assignment #4 (last)

Final Projects

• Next Sunday: Groups 1-16 will present
• Next Monday: Groups 17+ will present
• 10-minute presentation per group
• All group members should present
• Send your presentation to gtelzur@gmail.com by midnight of the previous day

Attendance is mandatory.

Final Presentations

• The assignment to groups is fixed.
• A group that does not present will lose 5 points from its grade.
• Rehearse and make sure you keep to the time limit.
• The presentation should include: the project's name, its goal, the challenge the problem poses for parallel computing, and the approaches to solving it.
• Presentations will not be accepted during class! Be sure to send them to the lecturer ahead of time.

The Course Roadmap

[Diagram: Introduction branches into HPC and HTC. HPC covers Message Passing (MPI) and Shared Memory (OpenMP, Cilk++); HTC covers Condor; then Grid Computing, Cloud Computing, and GPU Computing (new!). "Today" markers highlight the topics covered in this lecture.]

Advanced Parallel Computing and Distributed Computing course

• A new course at the department, Distributed Computing (course number 361-1-4691): Advanced Parallel Processing + Grid Computing + Cloud Computing
• If you are interested in this course, please send me an email

Today

• Algorithms – Numerical Algorithms (“slides11.ppt”)
• Introduction to Grid Computing
• Some demos
• Home assignment #4

Futuristic Asymmetric Multi-Core Chip

[Diagram: SACC – Sequential Accelerator]

Theory

• Numerical Algorithms – slides from:
University of North Carolina at Charlotte, Department of Computer Science,
ITCS 4145/5145 Parallel Programming, Spring 2009, Dr. Barry Wilkinson

• Topics: matrix multiplication, solving a system of linear equations, iterative methods

• URL is Here
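As a flavor of what those slides cover, here is a minimal OpenMP matrix-multiplication sketch (mine, not from Wilkinson's slides; the size N and the flat row-major layout are assumptions):

#include <omp.h>
#define N 512

/* C = A * B for N x N matrices stored row-major.
   Rows of C are independent, so the outer loop parallelizes cleanly. */
void matmul(const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i*N + k] * B[k*N + j];
            C[i*N + j] = s;
        }
}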

Demos

• Hybrid Parallel Programming – MPI + OpenMP
• Cloud Computing – setting up an HPC cluster, setting up a Condor machine (a separate presentation)
• StarHPC
• Cilk++
• GPU Computing (a separate presentation)
• Eclipse PTP
• Kepler workflow

Hybrid MPI + OpenMP Demo

Machine file:

hobbit1
hobbit2
hobbit3
hobbit4

Each hobbit has 8 cores.

mpicc -o mpi_out mpi_test.c -fopenmp

(mpicc drives the MPI side; the -fopenmp flag enables OpenMP.)

An idea for a final project!!!

cd ~/mpi
Program name: hybridpi.c

MPI is not installed yet on the hobbits; in the meantime use:

vdwarf5
vdwarf6
vdwarf7
vdwarf8

top -u tel-zur -H -d 0.05

(-H – show threads, -d – delay between refreshes, -u – show only this user)
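A typical launch of such a hybrid job might look like this (a sketch: the exact flags depend on the installed MPI distribution, and the machine-file name is an assumption):

export OMP_NUM_THREADS=8                        # one OpenMP thread per core
mpirun -np 4 -machinefile machines ./mpi_out    # one MPI rank per host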

Hybrid MPI+OpenMP continued

Hybrid Pi (MPI + OpenMP)

#include <stdio.h>
#include <mpi.h>
#include <omp.h>
#define NBIN 100000
#define MAX_THREADS 8

int main(int argc, char **argv) {
  int nbin, myid, nproc, nthreads, tid;
  double step, sum[MAX_THREADS]={0.0}, pi=0.0, pig;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  nbin = NBIN/nproc;            /* bins per MPI rank */
  step = 1.0/(nbin*nproc);

  #pragma omp parallel private(tid)
  {
    int i;
    double x;
    nthreads = omp_get_num_threads();
    tid = omp_get_thread_num();
    /* each thread integrates a strided subset of this rank's bins */
    for (i = nbin*myid + tid; i < nbin*(myid+1); i += nthreads) {
      x = (i + 0.5)*step;
      sum[tid] += 4.0/(1.0 + x*x);
    }
    printf("rank tid sum = %d %d %e\n", myid, tid, sum[tid]);
  }
  for (tid = 0; tid < nthreads; tid++)   /* combine per-thread partial sums */
    pi += sum[tid]*step;

  /* sum the per-rank results across all MPI processes */
  MPI_Allreduce(&pi, &pig, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  if (myid == 0) printf("PI = %f\n", pig);
  MPI_Finalize();
  return 0;
}

Cilk++

Simple, powerful expression of task parallelism:
• cilk_for – parallelize for loops
• cilk_spawn – specify the start of parallel execution
• cilk_sync – specify the end of parallel execution

http://software.intel.com/en-us/articles/intel-cilk-plus/
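As a minimal illustration of cilk_for (my sketch, not from the slides; the function and array names are made up):

#include <cilk.h>

/* Scale an array in place. The iterations are independent,
   so cilk_for may run them in parallel. */
void scale(double *a, int n, double c)
{
    cilk_for (int i = 0; i < n; ++i)
        a[i] *= c;
}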


Fibonacci Numbers – serial version

// 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
// Serial version
// Credit: http://myxman.org/dp/node/182

long fib_serial(long n) {
  if (n < 2) return n;
  return fib_serial(n-1) + fib_serial(n-2);
}

Cilk++ Fibonacci

#include <cilk.h>
#include <stdio.h>

long fib_parallel(long n)
{
  long x, y;
  if (n < 2) return n;
  x = cilk_spawn fib_parallel(n-1);  /* child may run in parallel */
  y = fib_parallel(n-2);             /* parent continues meanwhile */
  cilk_sync;                         /* wait for the spawned child */
  return (x+y);
}

int cilk_main()
{
  int N = 50;
  long result;
  result = fib_parallel(N);
  printf("fib of %d is %ld\n", N, result);  /* %ld for a long */
  return 0;
}

cilk_spawn

Add parallelism using cilk_spawn. We are now ready to introduce parallelism into our qsort program. The cilk_spawn keyword indicates that a function (the child) may be executed in parallel with the code that follows the cilk_spawn statement (the parent). Note that the keyword allows but does not require parallel operation: the Cilk++ scheduler dynamically determines what actually gets executed in parallel when multiple processors are available. The cilk_sync statement indicates that the function may not continue until all cilk_spawn requests in the same function have completed. cilk_sync does not affect parallel strands spawned in other functions.
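A minimal sketch of what that looks like (mine, not the slides' qsort code; the partition helper is hypothetical):

/* Parallel quicksort skeleton: spawn the left half, sort the
   right half in the parent, then sync before returning. */
void pqsort(int *begin, int *end)
{
    if (end - begin <= 1) return;
    int *mid = partition(begin, end);  /* hypothetical pivot/partition helper */
    cilk_spawn pqsort(begin, mid);     /* child strand */
    pqsort(mid, end);                  /* parent strand */
    cilk_sync;                         /* both halves done here */
}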

Cilkview Fn(30)

Strands and Knots

A Cilk++ program fragment:

...
do_stuff_1();          // execute strand 1
cilk_spawn func_3();   // spawn strand 3 at knot A
do_stuff_2();          // execute strand 2
cilk_sync;             // sync at knot B
do_stuff_4();          // execute strand 4
...

DAG with two spawns (labeled A and B) and one sync (labeled C)

Let's label each strand with the number of milliseconds it takes to execute.

A more complex Cilk++ program (DAG):

In ideal circumstances (e.g., no scheduling overhead), and if an unlimited number of processors are available, this program should run for 68 milliseconds.

Work and Span

Work: the total amount of processor time required to complete the program is the sum of all the numbers. We call this the work. In this DAG, the work is 181 milliseconds for the 25 strands shown; if the program is run on a single processor, it should run for 181 milliseconds.

Span: another useful concept is the span, sometimes called the critical path length. The span is the most expensive path that goes from the beginning to the end of the program. In this DAG, the span is 68 milliseconds, as shown below. The ratio work/span (here 181/68 ≈ 2.7) is the program's parallelism: an upper bound on the speedup achievable on any number of processors.

cilk_for uses a divide-and-conquer strategy. Shown here: 8 threads and 8 iterations.

By contrast, here is the DAG for a serial loop that spawns each iteration. In this case, the work is not well balanced, because each child does the work of only one iteration before incurring the scheduling overhead inherent in entering a sync. (A sketch of both loop styles follows below.)
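A minimal sketch of the two loop styles (mine, not from the slides; do_iteration stands in for the loop body):

/* (a) serial loop spawning each iteration: one shallow child per
   iteration, so the resulting DAG is unbalanced */
for (int i = 0; i < 8; ++i)
    cilk_spawn do_iteration(i);
cilk_sync;

/* (b) cilk_for: the runtime splits the iteration range recursively
   (divide-and-conquer), producing a balanced DAG */
cilk_for (int i = 0; i < 8; ++i)
    do_iteration(i);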

Race Conditions

Check the “qsort-race” program with cilkscreen:

StarHPC on the Cloud

Will be ready for PP201X?

Eclipse PTP – Parallel Tools Platform

http://www.eclipse.org/ptp/

Will be ready for PP201X?

Recursion in OpenMP

long fib_parallel(long n)
{
  long x, y;
  if (n < 2) return n;
  #pragma omp task default(none) shared(x,n)
  {
    x = fib_parallel(n-1);   /* child task */
  }
  y = fib_parallel(n-2);     /* current task continues */
  #pragma omp taskwait       /* wait for the child task */
  return (x+y);
}

Call it from a single thread inside a parallel region:

#pragma omp parallel
#pragma omp single
{
  r = fib_parallel(n);
}

Reference: http://myxman.org/dp/node/182

Use the taskwait pragma to wait for the completion of the child tasks generated by the current task.

The task pragma can be useful for parallelizing irregular algorithms, such as recursive algorithms, for which other OpenMP worksharing constructs are inadequate.
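To make the fragment above a complete program, a minimal driver might look like this (a sketch; the value of n and the file name are my assumptions):

#include <stdio.h>
#include <omp.h>

long fib_parallel(long n);   /* as defined above */

int main(void)
{
    long r, n = 30;
    /* one thread creates the root task; the team executes the tasks */
    #pragma omp parallel
    #pragma omp single
    {
        r = fib_parallel(n);
    }
    printf("fib(%ld) = %ld\n", n, r);
    return 0;
}

/* Compile with, e.g.: gcc -fopenmp fib_task.c -o fib_task */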

Intel® Parallel Studio

• Use Parallel Composer to create and compile a parallel application
• Use Parallel Inspector to improve reliability by finding memory and threading errors
• Use Parallel Amplifier to improve parallel performance by tuning threaded code

Intel® Parallel Studio

Parallel Studio adds new features to Visual Studio.

Intel’s Parallel Amplifier – Execution Bottlenecks

Intel’s Parallel Inspector – Threading Errors

Intel’s Parallel Inspector – Threading Errors

Error – Data Race

Intel Parallel Studio - Composer

The installation of this part failed for me, probably because I didn't install Intel's C++ compiler first. Sorry, I can't give a demo here…