
1

Introduction to Parallel Processing with Multi-core

Part I

Jie Liu, Ph.D.

Professor Department of Computer Science

Western Oregon University, USA

[email protected]

2

Now the question – Why parallel?

Three things are for sure:
• Tax, death, and parallelism

How long does it take a single person to build I-5?
• Answer

What we want to do is solve a very computationally intensive problem, such as modeling a protein interacting with the water surrounding it. The problem could take a long, long time.

• The protein simulation problem would take a Cray X/MP 31,688 years to simulate 1 second of interaction (as estimated in 1990). Even if today's supercomputers are 100 times faster than the Cray X/MP, we would still need more than 300 years!

• The only solution: parallel processing

3

Why parallel (2)

Moore's Law

• The logic density of silicon-based ICs (Integrated Circuits) closely followed the curve 2^(t − 1962), where t is the year; that is, it doubled every year (until 1970, then every 18 months)

Why is the density related to the processor's speed? Because, during the process of "computing," electrons need to carry signals from one end of a circuit to the other end.

For a 2 GHz computer, a signal can travel at most about 15 centimeters per clock cycle (0.5 nanoseconds)

That is, the speed of light places a physical limitation on how fast a single-processor computer can run
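A quick check of those numbers: a 2 GHz clock means one cycle lasts 1 / (2 × 10^9 Hz) = 0.5 ns, and in 0.5 ns light covers at most c × t = (3 × 10^8 m/s) × (0.5 × 10^-9 s) = 0.15 m, i.e. about 15 cm, so within one cycle a signal can barely cross a typical circuit board.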


4

Why parallel (3)

There are problems that require much more computational power than today's fastest single-CPU computers can provide.

The speed of light limits how fast a single-CPU computer can run.

If we want to solve some computationally intensive problems in a reasonable amount of time, we have to resort to parallel computers!

5

Some Definitions

Parallel processing

• Information processing that emphasizes the concurrent manipulation of data belonging to many processes solving a single problem

• Example: having 100 processors sort an array of 1,400,000,000 elements – this is parallel processing

• Example: printing homework while reading emails – this is concurrent, but not parallel processing, because the processes are not solving the same problem

A parallel computer is a multi-processor computer capable of parallel processing
• Computers with just co-processors for math and image processing are not considered parallel computers (some people disagree with this notion)

6

Two forms of parallelism

Control Parallelism
• Concurrency is achieved by applying different operations to different data elements of a single problem
• Pipeline is a special form of control parallelism
   An assembly line is an example of a pipeline

Data Parallelism
• Concurrency is achieved by applying the same operation to different data elements of a single problem
   Taking a class is an example of data parallelism (if we assume you all are learning at the same speed)
   The marching of an army brigade can be considered data parallelism
• Note the granularity of the above examples

7

Control vs. Data Parallelism

Consider the following statement

if a[i] > b[i]
    a[i] = a[i] * b[i]
else
    b[i] = a[i] - b[i]

In a control-parallelism fashion, some processors may execute the statement a[i] = a[i]*b[i] while others execute b[i] = a[i]-b[i] during the same clock cycle

In a data-parallelism fashion, especially on a SIMD machine, this if statement is executed in two clock cycles:

• During the first clock cycle, all the processors satisfying the condition a[i] > b[i] execute the statement a[i] = a[i]*b[i].

• During the second clock cycle, the processors not satisfying the condition a[i] > b[i] execute the statement b[i] = a[i]-b[i].
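A minimal C++ sketch of the two styles (the array contents, the mask vector, and the two-thread split are illustrative assumptions, not taken from the slides): the data-parallel version evaluates the condition once for every element and then makes two masked passes, mimicking the two SIMD clock cycles, while the control-parallel version lets different threads take different branches at the same time.

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Data-parallel style (SIMD-like): every processor sees the same instruction;
    // the condition is evaluated once, then the two branches run in two passes.
    void data_parallel(std::vector<int>& a, std::vector<int>& b) {
        std::vector<char> mask(a.size());
        for (std::size_t i = 0; i < a.size(); ++i) mask[i] = (a[i] > b[i]);
        for (std::size_t i = 0; i < a.size(); ++i)        // "clock cycle" 1
            if (mask[i]) a[i] = a[i] * b[i];
        for (std::size_t i = 0; i < a.size(); ++i)        // "clock cycle" 2
            if (!mask[i]) b[i] = a[i] - b[i];
    }

    // Control-parallel style (MIMD-like): each thread follows whichever branch
    // its own elements require, so the multiply and the subtract can run at the
    // same time on different cores.
    void control_parallel(std::vector<int>& a, std::vector<int>& b) {
        auto work = [&](std::size_t lo, std::size_t hi) {
            for (std::size_t i = lo; i < hi; ++i)
                if (a[i] > b[i]) a[i] = a[i] * b[i];
                else             b[i] = a[i] - b[i];
        };
        std::size_t mid = a.size() / 2;
        std::thread t(work, std::size_t{0}, mid);   // first half on another core
        work(mid, a.size());                        // second half on this core
        t.join();
    }

    int main() {
        std::vector<int> a{3, 2, -2, 5}, b{-2, 3, 3, 1};
        std::vector<int> a2 = a, b2 = b;
        data_parallel(a, b);
        control_parallel(a2, b2);   // both versions produce the same result
    }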

8

Speedup – Take I

Speedup is a measure of how well, or how effectively, a parallel algorithm performs.

It is defined as the ratio between the time needed by the most efficient sequential algorithm to perform a computation and the time needed to perform the same computation on a parallel computer with a parallel algorithm. That is,

Speedup = T(most efficient sequential algorithm) / T(parallel algorithm)

Example: we developed a parallel bubble sort that sorts n elements in O(log n) time using n processors. The speedup is O(n log n) / O(log n) = O(n), because the most efficient sequential sorting algorithms have a complexity of O(n log n).
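To make the ratio concrete, here is a hedged C++ sketch (not from the slides; the 50-million-element sum and the two-thread split are illustrative assumptions) that times a sequential sum and a simple parallel sum of the same array and reports T_sequential / T_parallel as the measured speedup.

    #include <chrono>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<long long> v(50'000'000, 1);

        // Sequential version: one core does all the work.
        auto t0 = std::chrono::steady_clock::now();
        long long seq_sum = std::accumulate(v.begin(), v.end(), 0LL);
        auto t1 = std::chrono::steady_clock::now();

        // Parallel version: two threads each sum half of the array.
        long long left = 0, right = 0;
        auto t2 = std::chrono::steady_clock::now();
        std::thread t([&] { left = std::accumulate(v.begin(), v.begin() + v.size() / 2, 0LL); });
        right = std::accumulate(v.begin() + v.size() / 2, v.end(), 0LL);
        t.join();
        long long par_sum = left + right;
        auto t3 = std::chrono::steady_clock::now();

        std::chrono::duration<double> t_seq = t1 - t0, t_par = t3 - t2;
        std::cout << "sums: " << seq_sum << " and " << par_sum << "\n";
        std::cout << "speedup = T_seq / T_par = " << t_seq.count() / t_par.count() << "\n";
    }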

9

Brain Exercise Six equally skilled students need to make 210

special cookies, each consists of the following tasks 1. Break dough into small pieces of equal size (1)2. Hand roll the small size dough pieces into balls (1)3. Press the balls flat for rolling (1)4. Roll the flat dough into wrappers (1)5. Place suitable amount of fillings onto the wrappers (1)6. Fold the wrappers to enclose the fillings completely to

finish making a cookie (1)• How to do this in a pipeline fashion?• How to do this in a control parallelism fashion, other

than pipeline?• How to do this in data parallel fashion?

10

Approach #1

D1 ~ D6 D7 ~ D12

  T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12

S1 1 2 3 4 5 6 1 2 3 4 5 6

S2 1 2 3 4 5 6 1 2 3 4 5 6

S3 1 2 3 4 5 6 1 2 3 4 5 6

S4 1 2 3 4 5 6 1 2 3 4 5 6

S5 1 2 3 4 5 6 1 2 3 4 5 6

S6 1 2 3 4 5 6 1 2 3 4 5 6

11

Approach #2

  T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12

S1 1 1 1 1 1 1 1 1 1 1 1 1

S2   2 2 2 2 2 2 2 2 2 2 2

S3     3 3 3 3 3 3 3 3 3 3

S4       4 4 4 4 4 4 4 4 4

S5         5 5 5 5 5 5 5 5

S6           6 6 6 6 6 6 6

D1 D2 D3 D4 D5 D6 D7

12

Analysis

Sequential cost: (1+1+1+1+1+1) * 210 = 1260 time units

Maximum speedup for Approach #1
• ?

Maximum speedup for Approach #2
• ?

Other questions to consider

• If I have 1260 students, can I get the task done in 1 time unit?

• What if step 3 takes 3 time units and step 6 takes 2 time units?

• What would be the effect of adding more "skilled" students to the different approaches?

13

Grand challenges

A list of problems that are very computationally intensive but can benefit human beings greatly; work on them is heavily funded by the US government

The following is just a list of the problem categories

14

Parallel Computers & Companies

15

One of the Fastest Computers
Per http://abcnews.go.com/Technology/WireStory?id=5028546&page=2

By: IBM and Los Alamos National Laboratory
Name: Roadrunner (named after New Mexico's state bird)
Twice as fast as IBM's Blue Gene, which is three times faster than the next fastest computer in the world
Cost: $100,000,000 – very cheap
Speed: 1,000,000,000,000,000 floating-point operations per second (one petaflop)
Usage: primarily nuclear weapons work, including simulating nuclear explosions
Related to gaming: in some ways, it's "a very souped-up Sony PlayStation 3"
Some facts: the interconnecting system occupies 6,000 square feet with 57 miles of fiber optics and weighs 500,000 pounds. Although made from commercial parts, the computer consists of 6,948 dual-core computer chips and 12,960 cell engines, and it has 80 terabytes of memory housed in 288 connected refrigerator-sized racks.
Two years ago, the fastest computer in the world could perform 100,000,000,000,000 floating-point operations per second – 100 teraflops

16

Parallel Computers and Programming – the trend

Hardware
• Super computers – multiprocessor/multicomputer – the fastest computers at the time
• Beowulf – a cluster of off-the-shelf computers linked by a switch
• Other distributed systems such as NOW
• Multi-core – many cores (each a CPU itself) within a CPU; soon will go over 60+ cores per CPU

Programming
• MPI for message-passing architectures
• Vendor-specific add-ons to well-known programming languages
• New languages such as Microsoft's F#
• Multi-core programming (add-ons to well-known programming languages)
   Intel's Threading Building Blocks (TBB)
   Microsoft's Task Parallel Library – supports Parallel.For, PLINQ, etc.; need to keep an eye on this one
   Third parties such as Jibu – may merge with MS

17

Multi-Core Programming

Sequential

Parallel
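The slide contrasts a sequential loop with its multi-core version. Below is a minimal sketch of that idea using standard C++ threads; the element-wise squaring operation and the function names are illustrative assumptions, not the code from the slide.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Sequential: one core visits every element.
    void square_sequential(std::vector<double>& data) {
        for (double& x : data) x = x * x;
    }

    // Parallel: split the index range into one chunk per hardware thread (core).
    void square_parallel(std::vector<double>& data) {
        unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
        std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < n_threads; ++t) {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            if (lo >= hi) break;
            workers.emplace_back([&data, lo, hi] {
                for (std::size_t i = lo; i < hi; ++i) data[i] = data[i] * data[i];
            });
        }
        for (auto& w : workers) w.join();
    }

    int main() {
        std::vector<double> data(1'000'000, 3.0);
        square_sequential(data);   // one core does all the work
        square_parallel(data);     // the same work spread over all available cores
    }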

18

Why Study Parallel Processing/Programming

Making your code run more efficiently
Utilize existing resources (other cores)
… …
Good coding class for CS students
• To learn something new
• To improve your skill sets
• To improve your problem-solving skills
• To exercise your brain
• To review many Computer Science subject areas
• To relax a constraint our professors embedded in our thinking process in our early years of studying (What is the PC in a CPU?)

19

PRAM (Parallel Random Access Machine)

A theoretical parallel computer

Consists of a control unit, global memory, and an unbounded set of processors, each with its own private memory

In addition,
• Each processor has its own unique id
• At each step, an active processor can read/write memory (global or private), perform the same instruction as all other active processors, idle, or activate another processor

How many steps does it take to activate n processors?
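The activation question is left for the class; as a hedged sketch of the usual reasoning (assuming, per the model above, that each active processor can activate one other processor per step), the active count doubles every step, so about log2(n) steps suffice:

    #include <cstdio>

    // Simulate PRAM processor activation: in each step every active processor
    // wakes up one additional processor, so the number of active processors doubles.
    int steps_to_activate(long long n) {
        long long active = 1;   // only processor P0 is running at the start
        int steps = 0;
        while (active < n) {
            active *= 2;        // each active processor activates one more
            ++steps;
        }
        return steps;           // equals ceil(log2(n)) for n >= 1
    }

    int main() {
        for (long long n : {2LL, 8LL, 1000LL, 1'000'000LL})
            std::printf("activating %lld processors takes %d steps\n", n, steps_to_activate(n));
    }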

20

PRAM

21

Important Terms

Computationally intensive problem
Moore's Law
Parallel processing
Parallel computer
Control parallelism
Data parallelism
Speedup
Grand challenges
Massively parallel computer
Roadrunner
Petaflop
Super computers
Beowulf
NOW
MPI
Multi-core
PRAM