
Thread-Level Parallelism — Parallel Programming

Hung-Wei

SuperScalar Processor w/ ROB

[Figure: block diagram of a superscalar, out-of-order core. Fetch/decode fills an instruction queue (holding unresolved branches); renaming logic uses a register mapping table to map architectural registers (X1, X2, X3, …) onto physical registers (P1, P2, P3, …, each with a valid bit and a value); instructions issue to an integer ALU, a floating-point adder, a floating-point mul/div unit, a branch unit, and address-resolution logic; load and store queues (tracking address, value, and destination register) connect the core to memory.]

Recap: What about “linked list”?

Static instructions:

    LOOP: ld   X10, 8(X10)
          addi X7, X7, 1
          bne  X10, X0, LOOP

Dynamic instructions:

    ① ld X10, 8(X10)   ② addi X7, X7, 1   ③ bne X10, X0, LOOP
    ④ ld X10, 8(X10)   ⑤ addi X7, X7, 1   ⑥ bne X10, X0, LOOP
    ⑦ ld X10, 8(X10)   ⑧ addi X7, X7, 1   ⑨ bne X10, X0, LOOP

[Figure: issue-slot diagram of the instruction queue. Each ld must complete before the next iteration's ld can issue, so most issue slots in each cycle are wasted slots. ILP is low because of data dependencies.]
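For reference, here is a minimal C sketch (my reconstruction, not code from the slides) of the pointer-chasing loop this assembly implements. Each load consumes the pointer produced by the previous load, which is exactly the serial dependence chain that keeps ILP low:

    #include <stddef.h>

    // Hypothetical node layout: the 'next' pointer sits at offset 8,
    // matching the ld X10, 8(X10) in the loop above.
    struct node { long payload; struct node *next; };

    long walk(struct node *p) {
        long n = 0;               // X7
        while (p != NULL) {       // bne X10, X0, LOOP
            p = p->next;          // ld X10, 8(X10): depends on the previous load
            n++;                  // addi X7, X7, 1
        }
        return n;
    }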

Simultaneous multithreading

Thread 1:

    ① ld X10, 8(X10)   ② addi X7, X7, 1     ③ bne X10, X0, LOOP
    ④ ld X10, 8(X10)   ⑤ addi X7, X7, 1     ⑥ bne X10, X0, LOOP
    ⑦ ld X10, 8(X10)   ⑧ addi X7, X7, 1     ⑨ bne X10, X0, LOOP

Thread 2:

    ① ld X1, 0(X10)    ② addi X10, X10, 8   ③ add X20, X20, X1   ④ bne X10, X2, LOOP
    ⑤ ld X1, 0(X10)    ⑥ addi X10, X10, 8   ⑦ add X20, X20, X1   ⑧ bne X10, X2, LOOP
    ⑨ ld X1, 0(X10)    ⑩ addi X10, X10, 8   ⑪ add X20, X20, X1   ⑫ bne X10, X2, LOOP

[Figure: issue-slot diagram. Instructions from the two independent threads are interleaved in the instruction queue, filling slots that either thread alone would have wasted.]

SMT SuperScalar Processor w/ ROB

[Figure: the same superscalar pipeline extended for SMT. Two PCs (PC #1, PC #2) and two register mapping tables (#1 and #2) let two threads share the fetch/decode logic, renaming logic, physical registers, instruction queue, functional units, and load/store queues.]

SMT SuperScalar Processor w/ ROB

[Same figure as the previous slide, annotated with its cost: the complexity of this scheduling hardware grows on the order of O(IW⁴) in the issue width IW.]

Concept of CMP

[Figure: a chip multiprocessor. One processor chip contains four cores, each with its own registers, L1-$, and L2-$, all sharing a last-level cache (LLC).]

What software thinks about “multiprogramming” hardware

[Figure: four threads, one per core, share a single virtual address space backed by shared memory; each core has its own registers, L1-$, and L2-$.]

Coherency & Consistency

• Coherency — guarantees that all processors in the system see the same value for a variable/memory address when they need the value at the same time
  • What value should be seen
• Consistency — all threads see changes to data in the same order
  • When the memory operation should be done

Snooping Protocol

[Figure: state diagram with states Invalid, Shared, and Exclusive. A processor read miss moves Invalid → Shared; a processor write miss moves Invalid → Exclusive; a processor write request upgrades Shared → Exclusive; a write miss seen on the bus invalidates Shared and Exclusive copies (Exclusive writes back its data first); a read miss seen on the bus downgrades Exclusive → Shared, also with a write-back; processor read misses/hits leave Shared unchanged, and read/write hits leave Exclusive unchanged.]
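To make the transitions concrete, here is a minimal C sketch of the state machine in the figure (my encoding of the diagram; the enum and event names are made up, and a real controller also performs the bus and write-back actions noted on the edges):

    typedef enum { INVALID, SHARED, EXCLUSIVE } block_state_t;
    typedef enum { PR_READ_MISS, PR_WRITE, BUS_READ_MISS, BUS_WRITE_MISS } event_t;

    block_state_t next_state(block_state_t s, event_t e) {
        switch (s) {
        case INVALID:
            if (e == PR_READ_MISS)   return SHARED;     // read miss (processor)
            if (e == PR_WRITE)       return EXCLUSIVE;  // write miss (processor)
            return INVALID;
        case SHARED:
            if (e == PR_WRITE)       return EXCLUSIVE;  // write request: upgrade
            if (e == BUS_WRITE_MISS) return INVALID;    // another core writes
            return SHARED;                              // read miss/hit stays Shared
        case EXCLUSIVE:
            if (e == BUS_READ_MISS)  return SHARED;     // write back data, downgrade
            if (e == BUS_WRITE_MISS) return INVALID;    // write back data, invalidate
            return EXCLUSIVE;                           // read/write hit stays Exclusive
        }
        return s;
    }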

What happens when we write in coherent caches?

Each of the four threads runs one slice of the sum (sum initialized to 0):

    for(i=0;        i<size/4;   i++) sum += a[i];
    for(i=size/4;   i<size/2;   i++) sum += a[i];
    for(i=size/2;   i<3*size/4; i++) sum += a[i];
    for(i=3*size/4; i<size;     i++) sum += a[i];

[Figure: every core starts with sum = 0 in its cache. When one core writes sum = 0xDEADBEEF, the protocol broadcasts a write miss/invalidate that destroys the other copies; when another core then reads sum, its read miss forces the owner to write back 0xDEADBEEF before the value can be supplied.]
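For contrast, a common way to write this reduction so the shared variable is touched only once per thread (a sketch under my own assumptions; the declarations and the mutex are mine, not from the slides):

    #include <pthread.h>

    #define NUM_OF_THREADS 4
    extern int a[];                 // assumed shared input array
    extern int size;                // assumed to be a multiple of NUM_OF_THREADS

    long sum = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *partial_sum(void *arg) {
        int tid = *(int *)arg;
        long local = 0;             // private accumulator: no coherence traffic
        for (int i = tid * (size / NUM_OF_THREADS);
             i < (tid + 1) * (size / NUM_OF_THREADS); i++)
            local += a[i];
        pthread_mutex_lock(&lock);  // shared 'sum' is written once per thread
        sum += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }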

Cache coherency

• Assume we are running the following code on a CMP with a cache coherency protocol (a is initialized to 0, and assume we will output more than 10 numbers). How many of the following outputs are possible?

    thread 1: while(1) printf("%d ", a);
    thread 2: while(1) a++;

    ① 0 1 2 3 4 5 6 7 8 9
    ② 1 2 5 9 3 6 8 10 12 13
    ③ 1 1 1 1 1 1 1 1 64 100
    ④ 1 1 1 1 1 1 1 1 1 100

A. 0   B. 1   C. 2   D. 3   E. 4

Outline

• Performance/correctness in multiprogramming
• Dark Silicon and modern architectures

Performance/correctness issues in multiprogramming environments

False Sharing

[Figure: four threads, one per core; every cache holds the block containing A[0] = A[1] = A[2] = A[3] = 0. When one core writes A[0] = 0xDEADBEEF, the write miss/invalidate removes the whole block from every other cache; when another core then accesses the same block, its miss forces the owner to write back the block, so the block bounces between caches even though the threads touch different elements.]

L vs. R

Version L:

    void *threaded_vadd(void *thread_id) {
        int tid = *(int *)thread_id;
        int i;
        for (i = tid; i < ARRAY_SIZE; i += NUM_OF_THREADS) {
            c[i] = a[i] + b[i];
        }
        return NULL;
    }

Version R:

    void *threaded_vadd(void *thread_id) {
        int tid = *(int *)thread_id;
        int i;
        for (i = tid * (ARRAY_SIZE/NUM_OF_THREADS);
             i < (tid+1) * (ARRAY_SIZE/NUM_OF_THREADS); i++) {
            c[i] = a[i] + b[i];
        }
        return NULL;
    }

4Cs of cache misses

• 3Cs: compulsory, conflict, capacity
• Coherency miss: a “block” invalidated because of sharing among processors

False sharing

• True sharing
  • Processor A modifies X; processor B also wants to access X.
• False sharing
  • Processor A modifies X; processor B wants to access Y. However, Y is invalidated because X and Y are in the same block! (See the sketch below.)
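A minimal C sketch of the standard fix (the names and the 64-byte block size are my assumptions, not from the slides): pad per-thread data out to a full cache block so writes from different threads never fall in the same block:

    #include <pthread.h>

    #define CACHE_BLOCK 64

    // False sharing: four longs share one 64-byte cache block.
    long counters[4];

    // Fix: each counter gets its own block.
    struct padded { long value; char pad[CACHE_BLOCK - sizeof(long)]; };
    struct padded padded_counters[4];

    void *worker(void *arg) {
        int tid = *(int *)arg;
        for (long i = 0; i < 100000000; i++)
            padded_counters[tid].value++;  // only this thread's block is written
        return NULL;
    }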

Performance comparison

• Comparing the two implementations of threaded_vadd from the previous slide, Version L and Version R (both launched by the main thread below), identify which one will perform better and why.

A. L is better, because the cache miss rate is lower
B. R is better, because the cache miss rate is lower
C. L is better, because the instruction count is lower
D. R is better, because the instruction count is lower
E. Both are about the same

Main thread:

    for (i = 0; i < NUM_OF_THREADS; i++) {
        tids[i] = i;
        pthread_create(&thread[i], NULL, threaded_vadd, &tids[i]);
    }
    for (i = 0; i < NUM_OF_THREADS; i++)
        pthread_join(thread[i], NULL);

Possible scenarios

Thread 1: a=1; x=b;    Thread 2: b=1; y=a;    (outcome reported as (x, y))

• (0,1): thread 1 finishes before thread 2 starts, so x reads b = 0 and then y reads a = 1
• (1,0): thread 2 finishes before thread 1 starts, so y reads a = 0 and then x reads b = 1
• (1,1): both writes (a=1; b=1;) execute before either read
• (0,0): x=b executes before a=1, and y=a before b=1. OoO scheduling!

Why (0,0)?

• Processor/compiler may reorder your memory operations/instructions
• Coherence protocol can only guarantee the ordering of updates to the same memory address
• The processor can serve memory requests that do not miss in the cache before earlier ones that do
• The compiler may keep values in registers and perform the memory operations later
• Each processor core may not run at the same speed (cache misses, branch mis-prediction, I/O, voltage scaling, etc.)
• Threads may not be executed/scheduled right after they are spawned

Again — how many values are possible?

• Consider the given program. You can safely assume the caches are coherent. How many of the following outputs will you see?

    ① (0, 0)   ② (0, 1)   ③ (1, 0)   ④ (1, 1)

A. 0   B. 1   C. 2   D. 3   E. 4

    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <unistd.h>

    volatile int a, b;
    volatile int x, y;
    volatile int f;

    void* modifya(void *z) {
        a = 1;
        x = b;
        return NULL;
    }

    void* modifyb(void *z) {
        b = 1;
        y = a;
        return NULL;
    }

    int main() {
        int i;
        pthread_t thread[2];
        pthread_create(&thread[0], NULL, modifya, NULL);
        pthread_create(&thread[1], NULL, modifyb, NULL);
        pthread_join(thread[0], NULL);
        pthread_join(thread[1], NULL);
        fprintf(stderr, "(%d, %d)\n", x, y);
        return 0;
    }

fence instructions

• x86 provides an “mfence” instruction to prevent reordering of memory operations across the fence instruction
• x86 only supports this kind of “relaxed consistency” model. You still have to be careful enough to make sure that your code behaves as you expect.

    thread 1          thread 2
    a = 1;            b = 1;
    mfence            mfence
    x = b;            y = a;

a=1 must occur/update before the mfence; b=1 must occur/update before the mfence.
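In portable C, a similar fence can be expressed with C11 atomics (a sketch mirroring thread 1 above; strictly speaking, the variables should be _Atomic rather than volatile for the C memory model to apply):

    #include <stdatomic.h>
    #include <stddef.h>

    volatile int a, b, x, y;

    void *modifya(void *z) {
        a = 1;
        atomic_thread_fence(memory_order_seq_cst); // typically an mfence on x86
        x = b;                                     // cannot move above the fence
        return NULL;
    }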

Take-aways of parallel programming

• Processor behaviors are non-deterministic
  • You cannot predict which processor is going faster
  • You cannot predict when the OS is going to schedule your thread
• Cache coherency only guarantees that everyone would eventually have a coherent view of data, but not when
• Cache consistency is hard to support


Power/Energy Consumption & Dark Silicon

Hung-Wei Tseng

Power vs. Energy

• Power is the direct contributor to “heat”
  • Packaging of the chip
  • Heat dissipation cost
• Energy = P × ET (power × execution time)
  • The electricity bill and battery life are related to energy!
  • Lower power does not necessarily mean better battery life if the processor slows the application down too much
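A quick worked example of that last point (my numbers, for illustration only): a 20 W processor that finishes a job in 8 s consumes 20 × 8 = 160 J, while a “lower-power” 10 W processor that needs 20 s consumes 10 × 20 = 200 J. Half the power, yet worse for the battery.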

Power & Energy

• Regarding power and energy, how many of the following statements are correct?

    ① Lowering the power consumption helps extend the battery life
    ② Lowering the power consumption helps reduce the heat generation
    ③ Lowering the energy consumption helps reduce the electricity bill
    ④ A CPU at 10% utilization can still consume 33% of the peak power

A. 0   B. 1   C. 2   D. 3   E. 4

Power

Dynamic/Active Power

• The power consumption due to the switching of transistor states
• Dynamic power (per-transistor switching power, summed over N transistors):

    P_dynamic ∼ α × C × V² × f × N

  • α: average switches per cycle
  • C: capacitance
  • V: voltage
  • f: frequency, usually linear in V
  • N: the number of transistors
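A back-of-the-envelope consequence (my arithmetic, not from the slides): since f is roughly linear in V, P_dynamic grows roughly as V³. Lowering voltage and frequency by 20% therefore scales dynamic power by

    0.8² × 0.8 = 0.8³ ≈ 0.51

about half the power in exchange for a 20% lower clock rate.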

Static/Leakage Power

• The power consumption due to leakage: transistors do not turn all the way off when not operating
• Becomes the dominant factor in the most advanced process technologies

    P_leakage ∼ N × V × e^(−Vt)

  • N: number of transistors
  • V: voltage
  • Vt: threshold voltage at which the transistor conducts (begins to switch)

Dennardian Scaling

• Given a scaling factor S:

    Parameter                      Relation            Classical Scaling
    Power Budget                                       1
    Chip Size                                          1
    Vdd (supply voltage)                               1/S
    Vt (threshold voltage)                             1/S
    tox (oxide thickness)                              1/S
    W, L (transistor dimensions)                       1/S
    Cgate (gate capacitance)       W·L/tox             1/S
    Isat (saturation current)      W·Vdd/tox           1/S
    F (device frequency)           Isat/(Cgate·Vdd)    S
    D (devices/area)               1/(W·L)             S²
    p (device power)               Isat·Vdd            1/S²
    P (chip power)                 D·p                 1
    U (utilization)                1/P                 1

Dennardian Broken

• Given a scaling factor S:

    Parameter                      Relation            Classical Scaling    Leakage Limited
    Power Budget                                       1                    1
    Chip Size                                          1                    1
    Vdd (supply voltage)                               1/S                  1
    Vt (threshold voltage)                             1/S                  1
    tox (oxide thickness)                              1/S                  1/S
    W, L (transistor dimensions)                       1/S                  1/S
    Cgate (gate capacitance)       W·L/tox             1/S                  1/S
    Isat (saturation current)      W·Vdd/tox           1/S                  1
    F (device frequency)           Isat/(Cgate·Vdd)    S                    S
    D (devices/area)               1/(W·L)             S²                   S²
    p (device power)               Isat·Vdd            1/S²                 1
    P (chip power)                 D·p                 1                    S²
    U (utilization)                1/P                 1                    1/S²
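To make the last three rows concrete (my arithmetic applied to the table): one process generation is roughly S ≈ 1.4, so device density D grows S² ≈ 2×. Classically, per-device power p fell by 1/S², so chip power P stayed within budget. In the leakage-limited regime, p stays at ~1, P would grow S² ≈ 2×, and utilization U = 1/S² ≈ 0.5 says only about half the chip can run at full speed within the old budget.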

Power consumption to light on all transistors

[Figure: under Dennardian scaling, a chip with 7×7 = 49 devices at 1 W each (49 W) shrinks into 10×10 = 100 devices at 0.5 W each (50 W); the same budget still powers the whole chip. With Dennardian scaling broken, 100 devices at 1 W each would need 100 W! With only ~50 W available, roughly half the devices must stay off (~0 W). Dark!]

Dark Silicon and the End of Multicore Scaling

H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. University of Washington, University of Wisconsin–Madison, University of Texas at Austin, Microsoft Research.

More cores per chip, slower per core

What happens if power doesn’t scale with process technologies?

• Suppose we can cram more transistors into the same chip area (Moore’s law continues), but the power consumption per transistor remains the same. How many of the following statements are true?

    ① The power consumption per chip will increase
    ② The power density of the chip will increase
    ③ Given the same power budget, we may not be able to power on the whole chip area if we maintain the same clock rate
    ④ Given the same power budget, we may have to lower the clock rate of circuits to power on the whole chip area

A. 0   B. 1   C. 2   D. 3   E. 4

Announcement

• Final review tonight — 7pm-8:20pm @ WCH 143
• Homework #4 due 12/4
• iEval submission — attach your “confirmation” screen and you get an extra/bonus homework
• Project due on 12/2
• Office hour for Hung-Wei this week — MWF 1p-2p