1
Introduction to Parallel Processing with Multi-core
Part I
Jie Liu, Ph.D.
Professor, Department of Computer Science
Western Oregon University, USA
2
Now the question – why parallel?
Three things in life are certain: taxes, death, and parallelism.
How long does it take a single person to build I-5?
What we want to do is solve very computationally intensive problems, such as modeling a protein interacting with the water surrounding it. Such a problem can take a very, very long time.
• The protein simulation problem took a Cray X/MP 31,688 years to simulate 1 second of interaction (in 1990). Even if today's supercomputers are 100 times faster than the Cray X/MP, we would still need more than 300 years!
• The only solution: parallel processing
3
Why parallel (2): Moore's Law
• The logic density of silicon-based ICs (Integrated Circuits) closely followed the curve 2^(t-1962); that is, it doubled every year (until 1970, then every 18 months)
Why is density related to a processor's speed? Because, during the process of "computing," electrons need to carry signals from one end of a circuit to the other end.
For a 2 GHz computer, a signal can travel at most about 0.15 meters per clock cycle (0.5 nanoseconds)
That is, the speed of light places a physical limit on how fast a single-processor computer can run
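The distance figure above can be checked with quick back-of-the-envelope arithmetic; a minimal sketch, using the standard textbook value for the speed of light:

```python
# Back-of-the-envelope check of the speed-of-light limit.
# Constants are standard textbook values, not measurements.
c = 3.0e8                 # speed of light, meters per second
clock_hz = 2.0e9          # a 2 GHz clock
cycle_s = 1.0 / clock_hz  # 0.5 nanoseconds per cycle
distance_m = c * cycle_s  # farthest a signal can travel in one cycle
print(distance_m)         # ~0.15
```

Signals in real wires propagate somewhat slower than c, so the true per-cycle distance is even shorter.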
4
Why parallel (3)
There are problems that require much more computing power than today's fastest single-CPU computers can provide.
The speed of light limits how fast a single-CPU computer can run.
If we want to solve computationally intensive problems in a reasonable amount of time, we have to resort to parallel computers!
5
Some Definitions
Parallel processing
• Information processing that emphasizes the concurrent manipulation of data belonging to many processes solving a single problem
• Example: having 100 processors sort an array of 1,400,000,000 elements – this is parallel processing
• Example: printing homework while reading email – this is concurrent, but not parallel processing, because the processes are not solving the same problem
A parallel computer is a multi-processor computer capable of parallel processing
• Computers with just co-processors for math and image processing are not considered parallel computers (some people disagree with this notion)
6
Two forms of parallelism
Control parallelism
• Concurrency is achieved by applying different operations to different data elements of a single problem
• A pipeline is a special form of control parallelism; an assembly line is an example of a pipeline
Data parallelism
• Concurrency is achieved by applying the same operation to different data elements of a single problem
  Taking a class is an example of data parallelism (if we assume you are all learning at the same speed)
  The marching of an army brigade can be considered data parallelism
• Note the granularity of the above examples
7
Control vs. Data Parallelism
Look at the following statement:
1. if a[i] > b[i]
2.   a[i] = a[i]*b[i]
3. else
4.   b[i] = a[i]-b[i]
In a control-parallel fashion, some processors may execute statement a[i] = a[i]*b[i] while others execute b[i] = a[i]-b[i] during the same clock cycle.
In a data-parallel fashion, especially on a SIMD machine, this if statement is executed in two clock cycles:
• During the first clock cycle, all the processors satisfying the condition a[i] > b[i] execute statement a[i] = a[i]*b[i].
• During the second clock cycle, the processors not satisfying the condition a[i] > b[i] execute statement b[i] = a[i]-b[i].
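The two-phase SIMD execution above can be sketched in plain Python with an explicit mask; the values in arrays a and b below are made-up illustrative data:

```python
# Hypothetical data values for the arrays a and b.
a = [5, 2, 9, 1]
b = [3, 7, 4, 8]

# Phase 1: build the mask; every "lane" where a[i] > b[i]
# executes a[i] = a[i]*b[i] while the rest sit idle.
mask = [x > y for x, y in zip(a, b)]
for i in range(len(a)):
    if mask[i]:
        a[i] = a[i] * b[i]

# Phase 2: the remaining lanes execute b[i] = a[i]-b[i]
# (their a[i] values were untouched in phase 1).
for i in range(len(a)):
    if not mask[i]:
        b[i] = a[i] - b[i]

print(a)  # [15, 2, 36, 1]
print(b)  # [3, -5, 4, -7]
```

Note that each lane runs only one of the two branches, but the machine still spends two cycles because the branches are serialized.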
8
Speedup – Take I
Speedup is a measurement of how well, or how effectively, a parallel algorithm performs.
It is defined as the ratio between the time needed for the most efficient sequential algorithm to perform a computation and the time needed to perform the same computation on a parallel computer with a parallel algorithm. That is,

    Speedup = T(most efficient sequential algorithm) / T(parallel algorithm)

Example: suppose we developed a parallel bubble sort that sorts n elements in O(log n) time using n processors. The speedup is

    O(n*log n) / O(log n) = O(n)

because the most efficient sequential sorting algorithms have a complexity of O(n log n).
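As a quick sanity check of the ratio above, a small sketch that treats the O() costs as exact operation counts (an idealization, since big-O hides constants):

```python
import math

def speedup(n):
    # Idealized counts: the best sequential sort does n*log2(n)
    # steps; the hypothetical parallel bubble sort does log2(n)
    # steps on n processors. The ratio collapses to n.
    t_seq = n * math.log2(n)
    t_par = math.log2(n)
    return t_seq / t_par

print(speedup(1024))  # 1024.0
```

The speedup grows linearly with the problem size, matching the O(n) result.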
9
Brain Exercise
Six equally skilled students need to make 210 special cookies. Making each cookie consists of the following tasks (time units in parentheses):
1. Break the dough into small pieces of equal size (1)
2. Hand roll the small dough pieces into balls (1)
3. Press the balls flat for rolling (1)
4. Roll the flat dough into wrappers (1)
5. Place a suitable amount of filling onto each wrapper (1)
6. Fold the wrapper to enclose the filling completely, finishing the cookie (1)
• How would you do this in a pipeline fashion?
• How would you do this in a control-parallel fashion, other than a pipeline?
• How would you do this in a data-parallel fashion?
10
Approach #1
D1 ~ D6 D7 ~ D12
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12
S1 1 2 3 4 5 6 1 2 3 4 5 6
S2 1 2 3 4 5 6 1 2 3 4 5 6
S3 1 2 3 4 5 6 1 2 3 4 5 6
S4 1 2 3 4 5 6 1 2 3 4 5 6
S5 1 2 3 4 5 6 1 2 3 4 5 6
S6 1 2 3 4 5 6 1 2 3 4 5 6
11
Approach #2
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12
S1 1 1 1 1 1 1 1 1 1 1 1 1
S2 2 2 2 2 2 2 2 2 2 2 2
S3 3 3 3 3 3 3 3 3 3 3
S4 4 4 4 4 4 4 4 4 4
S5 5 5 5 5 5 5 5 5
S6 6 6 6 6 6 6 6
D1 D2 D3 D4 D5 D6 D7
12
Analysis
Sequential cost: (1+1+1+1+1+1)*210 = 1260 time units
Maximum speedup for Approach #1
• ?
Maximum speedup for Approach #2
• ?
Other questions to consider
• If I have 1260 students, can I get the task done in 1 time unit?
• What if step 3 takes 3 time units and step 6 takes 2 time units?
• What if I add more "skilled" students to the different approaches? What would be the effect?
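One possible way to work out the timings implied by the two tables, assuming Approach #1 splits the 210 cookies evenly among the 6 students and Approach #2 is a 6-stage pipeline (these modeling choices are my reading of the tables, not stated on the slide):

```python
def sequential_time(cookies=210, steps=6):
    # One worker performs all 6 unit-time steps for every cookie.
    return cookies * steps

def approach1_time(cookies=210, students=6, steps=6):
    # Data parallel: each student independently makes an equal
    # share of the cookies, performing all 6 steps per cookie.
    return (cookies // students) * steps

def approach2_time(cookies=210, steps=6):
    # Pipeline: the first cookie emerges after `steps` time units,
    # then one more cookie is finished every unit after that.
    return steps + (cookies - 1)

print(sequential_time())                     # 1260
print(sequential_time() / approach1_time())  # 6.0
print(sequential_time() / approach2_time())  # about 5.86
```

Under these assumptions the data-parallel approach reaches the ideal speedup of 6, while the pipeline loses a little to its fill time.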
13
Grand challenges
A list of problems that are very computationally intensive but that can benefit humanity greatly; work on them is heavily funded by the US government.
The following is just the list of problem categories.
15
One of the Fastest Computers
Per http://abcnews.go.com/Technology/WireStory?id=5028546&page=2
By: IBM and Los Alamos National Laboratory
Name: Roadrunner (named after New Mexico's state bird)
Twice as fast as IBM's Blue Gene, which is three times faster than the next fastest computer in the world
Cost: $100,000,000 – very cheap
Speed: 1,000,000,000,000,000 floating-point operations per second (1 petaflop)
Usage: primarily nuclear weapons work, including simulating nuclear explosions
Related to gaming: in some ways, it's "a very souped-up Sony PlayStation 3"
Some facts:
The interconnecting system occupies 6,000 square feet with 57 miles of fiber optics and weighs 500,000 pounds. Although made from commercial parts, the computer consists of 6,948 dual-core computer chips and 12,960 cell engines, and it has 80 terabytes of memory housed in 288 connected refrigerator-sized racks.
Two years ago, the fastest computer in the world could perform 100,000,000,000,000 floating-point operations per second (100 teraflops).
16
Parallel Computers and Programming – the trend
Hardware
• Supercomputers – multiprocessors/multicomputers – the fastest computers of their time
• Beowulf – a cluster of off-the-shelf computers linked by a switch
• Other distributed systems, such as NOW (Networks of Workstations)
• Multi-core – many cores (each a CPU in itself) within one CPU package; soon we will see 60+ cores per CPU
Programming
• MPI for message-passing architectures
• Vendor-specific add-ons to well-known programming languages
• New languages, such as Microsoft's F#
• Multi-core programming (add-ons to well-known programming languages)
  Intel's Threading Building Blocks (TBB)
  Microsoft's Task Parallel Library – supports Parallel For, PLINQ, etc.; one to keep an eye on
  Third-party libraries such as Jibu – may merge with MS
18
Why Study Parallel Processing/Programming
Making your code run more efficiently
Utilizing existing resources (the other cores)
A good coding class for CS students
• To learn something new
• To improve your skill set
• To improve your problem-solving skills
• To exercise your brain
• To review many Computer Science subject areas
• To relax a constraint our professors embedded in our thinking process in our early years of study (what is the PC in a CPU?)
19
PRAM (Parallel Random Access Machine)
A theoretical parallel computer
It consists of a control unit, global memory, and an unbounded set of processors, each with its own memory.
In addition,
• Each processor has its unique id
• At each step, an active processor can read/write memory (global or private), perform the same instruction as all other active processors, idle, or activate another processor
How many steps does it take to activate n processors?
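One common answer (not stated on the slide) uses doubling: in each step every already-active processor activates one more, so the number of active processors doubles and n processors are active after about ceil(log2 n) steps. A sketch:

```python
def activation_steps(n):
    # PRAM activation by doubling: each step, every active
    # processor activates one more, so the active count goes
    # 1, 2, 4, 8, ... until it reaches at least n.
    active = 1
    steps = 0
    while active < n:
        active *= 2
        steps += 1
    return steps

print(activation_steps(1024))  # 10
```

Since each step at most doubles the active set, no activation scheme can beat this logarithmic bound.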