Lecture3

CS-416 Parallel and Distributed Systems, Jawwad Shamsi, Lecture #3, 20th January 2010

Transcript of Lecture3

Page 1: Lecture3

CS-416 Parallel and Distributed Systems

Jawwad Shamsi

Lecture #3

20th January 2010

Page 2: Lecture3

Announcement

• Possible name change to
  – High Performance Computing

Page 3: Lecture3

Recap

• Pipelining
• Vector instructions
• Superscalar execution

Page 4: Lecture3

Super-Scalar Execution

Page 5: Lecture3

Dependencies

• Data dependency
• Resource dependency
• Branch dependency

Page 6: Lecture3

Dynamic Instruction Issue

• 3rd segment
  – Processor needs the capability of
    • Out-of-order sequencing

Page 7: Lecture3

Limitations of Memory Systems

• Latency
• Bandwidth

Page 8: Lecture3

Effect of Latency - Example

• 1 GHz processor (1 ns clock cycle)
  – 100 ns memory latency
  – Two multiply-add units
    • Four instructions in each 1 ns cycle
  – Peak rating
    • 4 GFLOPS
    • Memory latency is 100 cycles
    • Block size is one word
    • The processor must wait 100 cycles before it can process the data.
  – Peak speed: 1 floating-point operation per 100 ns
  – That is, 10 MFLOPS
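The arithmetic on this slide can be checked with a small sketch. The function names and parameters below are illustrative and not part of the lecture; the second function assumes, as the slide does, that every floating-point operation waits for one full memory fetch.

```c
#include <assert.h>

/* Peak rate in MFLOPS: clock rate (MHz) times floating-point
   operations issued per cycle. */
double peak_mflops(double clock_mhz, double flops_per_cycle) {
    return clock_mhz * flops_per_cycle;
}

/* Latency-bound rate in MFLOPS, assuming each fetch of latency_ns
   nanoseconds feeds flops_per_fetch floating-point operations.
   One operation per nanosecond equals 1000 MFLOPS. */
double latency_bound_mflops(double latency_ns, double flops_per_fetch) {
    assert(latency_ns > 0.0);
    return flops_per_fetch * 1000.0 / latency_ns;
}
```

With the slide's numbers, `peak_mflops(1000.0, 4.0)` gives 4000 MFLOPS (4 GFLOPS), while `latency_bound_mflops(100.0, 1.0)` gives 10 MFLOPS, a factor of 400 below peak.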

Page 9: Lecture3

Effect of Bandwidth

• Processor: 1 GHz
• DRAM with 100-cycle latency
• With a block size of one word, the processor takes 100 cycles to fetch each word.
• Therefore, the algorithm performs one FLOP every 100 cycles, for a peak speed of 10 MFLOPS.
• Increase the block size?
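One way to explore the block-size question is a rough model, sketched below. It rests on two assumptions that are not stated on the slide: each fetch still costs the full latency but delivers a whole block, and every word in the block is used for exactly one floating-point operation. The function name is hypothetical.

```c
#include <assert.h>

/* MFLOPS when one fetch of latency_cycles delivers words_per_block
   words and each word feeds one floating-point operation.
   Assumes every fetched word is actually used. */
double block_fetch_mflops(double clock_mhz, double latency_cycles,
                          int words_per_block) {
    assert(latency_cycles > 0.0 && words_per_block > 0);
    return clock_mhz * words_per_block / latency_cycles;
}
```

Under this model a one-word block reproduces the 10 MFLOPS figure, and a hypothetical four-word block would give 40 MFLOPS. The gain only materialises if the program touches consecutive words, which is why the access pattern of the loop matters.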

Page 10: Lecture3

for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
}

Page 11: Lecture3
Page 12: Lecture3

• Pre-fetching• Multi-Threading

Page 13: Lecture3

Impact of bandwidth on multithreaded programs

• Threads share memory
  – Cache
    • Cache size will be limited
    • Limited cache-hit ratio
  – Decrease in effective bandwidth

Page 14: Lecture3

Simple Execution

for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);

Page 15: Lecture3

Threaded Execution

for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);
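The `create_thread` call on the slide is pseudocode. A minimal sketch of the same idea with POSIX threads follows, assuming a row-major `n` by `n` matrix of doubles. The `row_task` struct and the function names are hypothetical, and the caller supplies the `threads` and `tasks` arrays (each of length `n`). Unlike the pseudocode, a real thread does not hand back its result on creation, so each thread stores its dot product in its task and the caller collects it after `pthread_join`.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical work item: one row of a, the vector b, and a result slot. */
struct row_task {
    const double *row;
    const double *b;
    size_t n;
    double result;
};

/* Thread body: dot product of one row with b. */
static void *dot_product_thread(void *arg) {
    struct row_task *t = arg;
    double sum = 0.0;
    for (size_t k = 0; k < t->n; k++)
        sum += t->row[k] * t->b[k];
    t->result = sum;
    return NULL;
}

/* Computes c = a * b using one thread per row.
   Returns 0 on success, or the pthread_create error code. */
int matvec_threaded(size_t n, const double *a, const double *b, double *c,
                    pthread_t *threads, struct row_task *tasks) {
    size_t started = 0;
    int rc = 0;
    for (size_t i = 0; i < n; i++) {
        tasks[i] = (struct row_task){ a + i * n, b, n, 0.0 };
        rc = pthread_create(&threads[i], NULL, dot_product_thread, &tasks[i]);
        if (rc != 0)
            break;
        started++;
    }
    /* Wait for every thread that was started, then copy out its result. */
    for (size_t i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        c[i] = tasks[i].result;
    }
    return rc;
}
```

One thread per row keeps the sketch close to the slide; for large `n` a fixed pool of threads handling several rows each would likely be more practical. On older toolchains the program may need to be compiled with `-pthread`.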

Page 16: Lecture3

for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];
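The two loop orders from the slides can be written as a pair of self-contained functions to confirm that they compute the same sums and differ only in memory access pattern. This is a sketch: the function names are made up here, and the matrix is assumed to be stored row-major in a flat array so that `b[j][i]` becomes `b[j * n + i]`.

```c
#include <stddef.h>

/* Column-at-a-time order: the inner loop walks down one column,
   so consecutive accesses are n doubles apart (strided). */
void column_sum_strided(size_t n, const double *b, double *column_sum) {
    for (size_t i = 0; i < n; i++) {
        column_sum[i] = 0.0;
        for (size_t j = 0; j < n; j++)
            column_sum[i] += b[j * n + i];
    }
}

/* Row-at-a-time order: the inner loop walks along one row,
   so consecutive accesses touch adjacent memory words. */
void column_sum_row_order(size_t n, const double *b, double *column_sum) {
    for (size_t i = 0; i < n; i++)
        column_sum[i] = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            column_sum[i] += b[j * n + i];
}
```

Both versions add the elements of each column in the same order, so their results are identical. The second version uses every word of each fetched block, which is what lets a larger block size pay off.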