Lecture 6: GPU Programming

G63.2011.002/G22.2945.001 · October 12, 2010

GPU Architecture (recap) Programming GPUs


Admin bits

• Start thinking about final projects
• Find teams

• About HW4

• Legislative Day (Dec 14)

• HW3 posted


HW3


Outline

GPU Architecture (recap)

Programming GPUs


“CPU-style” Cores

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

A CPU-style core contains:

• Fetch/Decode
• ALU (Execute)
• Execution Context
• Out-of-order control logic
• Fancy branch predictor
• Memory pre-fetcher
• Data cache (a big one)

Credit: Kayvon Fatahalian (Stanford)

Slimming down

Idea #1:
Remove the components that help a single instruction stream run fast, keeping only:

• Fetch/Decode
• ALU (Execute)
• Execution Context

Credit: Kayvon Fatahalian (Stanford)

More Space: Double the Number of Cores

Two cores (two fragments in parallel): each core has its own Fetch/Decode, ALU (Execute), and Execution Context, and each runs the same shader on a different fragment:

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul  r3, v0, cb0[0]
    madd r3, v1, cb0[1], r3
    madd r3, v2, cb0[2], r3
    clmp r3, r3, l(0.0), l(1.0)
    mul  o0, r0, r3
    mul  o1, r1, r3
    mul  o2, r2, r3
    mov  o3, l(1.0)

(Fragment 1 runs on core 1, fragment 2 on core 2, both executing this same instruction stream.)

Credit: Kayvon Fatahalian (Stanford)

. . . again

Four cores (four fragments in parallel), each with its own Fetch/Decode, ALU (Execute), and Execution Context.

Credit: Kayvon Fatahalian (Stanford)

. . . and again

Sixteen cores (sixteen fragments in parallel).

16 cores = 16 simultaneous instruction streams
→ 16 independent instruction streams

Reality: the instruction streams are not actually very different/independent.

Credit: Kayvon Fatahalian (Stanford)

Saving Yet More Space

Recall: simple processing core

• Fetch/Decode
• ALU (Execute)
• Execution Context

Idea #2:
Amortize the cost/complexity of managing an instruction stream across many ALUs
→ SIMD

Credit: Kayvon Fatahalian (Stanford)

Saving Yet More Space

Add ALUs: a single Fetch/Decode now feeds ALU 1 … ALU 8 (SIMD processing). Each ALU has its own context (Ctx ×8), backed by a pool of Shared Ctx Data.

Idea #2:
Amortize the cost/complexity of managing an instruction stream across many ALUs
→ SIMD

Credit: Kayvon Fatahalian (Stanford)


Gratuitous Amounts of Parallelism!

128 fragments in parallel: 16 cores × 8 ALUs = 128 ALUs, driven by 16 simultaneous instruction streams.

Credit: Kayvon Fatahalian (Stanford)

Example: 128 instruction streams in parallel, organized as 16 independent groups of 8 synchronized streams.

Great if everybody in a group does the same thing. But what if not? What leads to divergent instruction streams?

Branches

But what about branches? Picture ALU 1 … ALU 8 stepping through the shader together over time (in clocks):

    <unconditional shader code>
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        refl = y + Ka;
    } else {
        x = 0;
        refl = Ka;
    }
    <resume unconditional shader code>

Suppose the condition comes out T T T F F F F F across the eight ALUs. The group executes both sides of the branch in sequence: first the “then” side (only the three T lanes do useful work), then the “else” side (only the five F lanes do).

Not all ALUs do useful work! Worst case: 1/8 performance.

Credit: Kayvon Fatahalian (Stanford)


Remaining Problem: Slow Memory

Problem: Memory still has very high latency . . .
. . . but we’ve removed most of the hardware that helps us deal with that.

We’ve removed

• caches
• branch prediction
• out-of-order execution

So what now?

Idea #3:
Even more parallelism + some extra memory = a solution!

Hiding Memory Latency

Hiding shader stalls (time in clocks): the core keeps the contexts of several fragment groups resident at once — Frag 1…8, Frag 9…16, Frag 17…24, Frag 25…32 — in the Ctx slots beside the ALUs (with Shared Ctx Data).

When group 1 stalls on a memory access, the core switches to group 2, then 3, then 4. Each group alternates between Stall and Runnable; by the time group 4 stalls, group 1’s data has arrived and it is runnable again, so the ALUs always have work.

Throughput! The run time of any one group increases, but in exchange the many groups together reach maximum throughput: each group starts and finishes later, yet the core is never idle.

Credit: Kayvon Fatahalian (Stanford)


GPU Architecture Summary

Core ideas:

1. Many slimmed-down cores → lots of parallelism
2. More ALUs, fewer control units
3. Avoid memory stalls by interleaving execution of SIMD groups

Credit: Kayvon Fatahalian (Stanford)


Nvidia GTX200

Each core on the chip contains:

• Fetch/Decode
• 8 × ALU
• DP ALU
• 32 kiB Ctx Private
• 16 kiB Ctx Shared

(This block repeats once for every core on the chip.)

Off-chip Memory: 150 GB/s


Outline

GPU Architecture (recap)

Programming GPUs
Intro to OpenCL: The five W’s


GPU Programming: Gains and Losses

Gains:
+ Memory bandwidth (140 GB/s vs. 12 GB/s)
+ Compute bandwidth (peak: 1 TF/s vs. 50 GF/s; real: 200 GF/s vs. 10 GF/s)
o Data-parallel programming
o Functional portability between devices (via OpenCL)

Losses:
- No performance portability
- Data size drives alg. design
- Cheap branches (i.e. ifs)
- Fine-grained malloc/free *)
- Recursion *)
- Function pointers *)
- IEEE 754 FP compliance *)

*) Less problematic with new hardware. (Nvidia “Fermi”)


What is OpenCL?

“OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors.” [OpenCL 1.1 spec]

• Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)
• Vendor-neutral
• Comes with RTCG (run-time code generation)

Defines:

• Host-side programming interface (library)
• Device-side programming language (!)


Who?

OpenCL Working Group

• Diverse industry participation
  - Processor vendors, system OEMs, middleware vendors, application developers
• Many industry-leading experts involved in OpenCL’s design
  - A healthy diversity of industry perspectives
• Apple made the initial proposal and is very active in the working group
  - Serving as specification editor

Credit: Khronos Group


When?

OpenCL Timeline

• Six months from proposal to released OpenCL 1.0 specification
  - Due to a strong initial proposal and a shared commercial incentive
• Multiple conformant implementations shipping
  - Apple’s Mac OS X Snow Leopard now ships with OpenCL
• 18-month cadence between OpenCL 1.0 and OpenCL 1.1
  - Backwards compatibility protects software investment

Timeline:
• Jun 08: Apple proposes OpenCL working group and contributes draft specification to Khronos
• Dec 08: Khronos publicly releases OpenCL 1.0 as a royalty-free specification
• May 09: Khronos releases OpenCL 1.0 conformance tests to ensure high-quality implementations
• 2H 09: Multiple conformant implementations ship across diverse OSes and platforms
• Jun 10: OpenCL 1.1 specification released and first implementations ship

Credit: Khronos Group


Why?

Processor Parallelism

• CPUs: multiple cores driving performance increases; multi-processor programming (e.g. OpenMP)
• GPUs: increasingly general-purpose data-parallel computing; graphics APIs and shading languages

The two are converging on an emerging intersection: heterogeneous computing. OpenCL is a programming framework for heterogeneous compute resources.

Credit: Khronos Group


OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

C “Runtime”

Device Language: ∼ C99

GPU Architecture (recap) Programming GPUs

Page 44: Lecture 6: GPU Programming · Data cache (A big one) 13 Credit: Kayvon Fatahalian (Stanford) GPU Architecture (recap)Programming GPUs. Slimming down ... GPU Architecture (recap)Programming

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

C “Runtime”

Device Language: ∼ C99

GPU Architecture (recap) Programming GPUs

Page 45: Lecture 6: GPU Programming · Data cache (A big one) 13 Credit: Kayvon Fatahalian (Stanford) GPU Architecture (recap)Programming GPUs. Slimming down ... GPU Architecture (recap)Programming

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

C “Runtime”

Device Language: ∼ C99

GPU Architecture (recap) Programming GPUs


Page 57:

OpenCL: Execution Model

© Copyright Khronos Group, 2010 - Page 11

An N-dimension domain of work-items

• Define the “best” N-dimensioned index space for your algorithm

- Global Dimensions: 1024 × 1024 (whole problem space)

- Local Dimensions: 128 × 128 (work group … executes together)

[Figure: a 1024 × 1024 index space divided into work groups.]

Synchronization between work-items is possible only within workgroups: barriers and memory fences. Cannot synchronize outside of a workgroup.

Credit: Khronos Group

Page 58:

OpenCL: Execution Model

[Figure: an nD grid of work groups, Group(0, 0) … Group(2, 1), with Work Group (1, 0) expanded into a 4 × 4 array of work items Item(0, 0) … Item(3, 3).]

• Two-tiered parallelism:
  • Grid = Nx × Ny × Nz work groups
  • Work group = Sx × Sy × Sz work items
  • Total: ∏_{i∈{x,y,z}} S_i N_i work items

• Abstraction of the core/SIMD-lane hardware concept

• Communication/synchronization only within a work group

• Grid/group ≈ outer loops in an algorithm

• Device language: get_{global,group,local}_{id,size}(axis)


Page 59:

Workgroups: Hardware to Software

[Figure: an n-dimensional work group (axes: workgroup x, workgroup y) laid out across SIMD lanes; successive SIMD groups 0, 1, 2, … cover successive lexicographic chunks of the work group.]

HW reality: SIMD lanes
SW abstraction: n-dim. work group

How do the two fit together?

→ Lexicographically!

Remember HW shenanigans:

• Quad-pumped Fetch/Decode

• Extra width for latency hiding



Page 64:

The OpenCL C Language

© Copyright Khronos Group, 2010 - Page 16

Programming kernels: OpenCL C Language

• A subset of ISO C99
  - but without some C99 features such as standard C99 headers, function pointers, recursion, variable-length arrays, and bit fields

• A superset of ISO C99 with additions for:
  - work-items and workgroups
  - vector types
  - synchronization
  - address space qualifiers

• Also includes a large set of built-in functions:
  - image manipulation
  - work-item manipulation
  - specialized math routines, etc.
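To make these additions concrete, here is a small illustrative kernel. This is OpenCL C device code, compiled by the OpenCL runtime rather than a host compiler, so it needs an OpenCL device to run; the kernel name and arguments are invented for this example:

```
// illustrative only: doubles float4 vectors via local memory
__kernel void scale4(__global const float4 *in,   // address space qualifiers
                     __global float4 *out,
                     __local float4 *tmp)         // per-workgroup scratch
{
    size_t gid = get_global_id(0);                // work-item built-ins
    size_t lid = get_local_id(0);

    tmp[lid] = in[gid] * 2.0f;                    // vector type: 4 floats at once
    barrier(CLK_LOCAL_MEM_FENCE);                 // sync within the workgroup only

    out[gid] = tmp[lid];
}
```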

Credit: Khronos Group


Page 65:

Dive into OpenCL: Preparation

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

#include "cl-helper.h"

int main()
{
  // init
  cl_context ctx;
  cl_command_queue queue;
  create_context_on("NVIDIA", NULL, 0, &ctx, &queue, 0);

  // allocate and initialize CPU memory
  const size_t sz = 10000;
  float a[sz];
  for (size_t i = 0; i < sz; ++i)
    a[i] = i;


Page 66:

Dive into OpenCL: Memory

  // allocate GPU memory, transfer to GPU

  cl_int status;
  cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
      sizeof(float) * sz, 0, &status);
  CHECK_CL_ERROR(status, "clCreateBuffer");

  CALL_CL_GUARDED(clEnqueueWriteBuffer, (
      queue, buf_a, /*blocking*/ CL_TRUE, /*offset*/ 0,
      sz * sizeof(float), a,
      0, NULL, NULL));


Page 67:

Dive into OpenCL: Running

  // load kernels
  char *knl_text = read_file("twice.cl");
  cl_kernel knl = kernel_from_string(ctx, knl_text, "twice", NULL);
  free(knl_text);

  // run code on GPU
  SET_1_KERNEL_ARG(knl, buf_a);
  size_t gdim[] = { sz };
  size_t ldim[] = { 1 };
  CALL_CL_GUARDED(clEnqueueNDRangeKernel,
      (queue, knl,
       /*dimensions*/ 1, NULL, gdim, ldim,
       0, NULL, NULL));

twice.cl:

__kernel void twice(__global float *a)
{ a[get_global_id(0)] *= 2; }


Page 68:

Dive into OpenCL: Clean-up

  // clean up ...
  CALL_CL_GUARDED(clReleaseMemObject, (buf_a));
  CALL_CL_GUARDED(clReleaseKernel, (knl));
  CALL_CL_GUARDED(clReleaseCommandQueue, (queue));
  CALL_CL_GUARDED(clReleaseContext, (ctx));
}

Why check for errors?

• GPUs have (some) memory protection

• Invalid sizes (block/grid/…)

• Out of memory, access restrictions, hardware limitations, etc.

Does this code use the hardware well?



Page 70:

Getting your feet wet

Thinking about GPU programming

How would we modify the program to…

• … print the contents of the result?

• … compute c_i = a_i b_i?

• … use groups of 256 work items each?

• … use groups of 16 × 16 work items?
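As a hedged sketch of possible answers: variable names follow the example code above, the `mult` kernel name is invented, and the fragments need an OpenCL runtime to actually run.

```
// 1. read back and print the result (host side, after the kernel runs)
CALL_CL_GUARDED(clEnqueueReadBuffer, (
    queue, buf_a, /*blocking*/ CL_TRUE, /*offset*/ 0,
    sz * sizeof(float), a,
    0, NULL, NULL));
for (size_t i = 0; i < sz; ++i)
    printf("%f\n", a[i]);

// 2. c_i = a_i * b_i: a second input buffer, a result buffer, and a new kernel
__kernel void mult(__global const float *a, __global const float *b,
                   __global float *c)
{ c[get_global_id(0)] = a[get_global_id(0)] * b[get_global_id(0)]; }

// 3. groups of 256: the global size must be divisible by the group size
//    (sz = 10000 is not -- pad sz or pick a divisible size)
size_t ldim[] = { 256 };

// 4. 2D groups: launch with /*dimensions*/ 2, e.g.
//    size_t gdim[] = { 1024, 1024 }; size_t ldim[] = { 16, 16 };
```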



Page 74:

Questions?

?
