[Harvard CS264] 04 - Intermediate-level CUDA Programming

Lecture #4: Intermediate-level CUDA | February 15th, 2011 Nicolas Pinto (MIT, Harvard) [email protected] Massively Parallel Computing CS 264 / CSCI E-292

http://cs264.org

Page 1: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Lecture #4: Intermediate-level CUDA | February 15th, 2011

Nicolas Pinto (MIT, Harvard) [email protected]

Massively Parallel Computing, CS 264 / CSCI E-292

Page 2: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Administrivia

• HW1: due Fri 2/18/11 (this week)

• Projects: think about it, consult the staff

• New guest lecturers!

• Max Lin (Google), Kurt Messersmith et al. (Amazon), David Rich et al. (Microsoft)

Page 3: [Harvard CS264] 04 - Intermediate-level CUDA Programming

During this course,

we’ll try to

and use existing material ;-)

“ ”

adapted for CS264

Page 4: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Todayyey!!

Page 5: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Outline

• CUDA Language & APIs (overview)

• Threading/Execution (cont’d)

• Memory/Communication (cont’d)

• Tools

• Libraries

Page 7: [Harvard CS264] 04 - Intermediate-level CUDA Programming

(Slide material: Spring 2008, Johns Hopkins University. Copyright © Matthew Bolitho 2008.)

• __host__
  • Declares that a function is compiled to and executes on the host
  • Callable only from the host
  • Functions without any CUDA declspec are host by default
  • Can use __host__ and __device__ together

• __global__
  • Declares that a function is compiled to and executes on the device
  • Callable from the host
  • Used as the entry point from host to device

• CUDA provides a set of built-in vector types:
  • char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4
  • short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4
  • int1, uint1, int2, uint2, int3, uint3, int4, uint4
  • long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4
  • float1, float2, float3, float4

• Can construct a vector type with a special function: make_<typename>(...)

• Can access elements of a vector type with .x, .y, .z, .w

• dim3 is a special vector type
  • Same as uint3, except it can be constructed from a scalar to form a vector: (scalar, 1, 1)

• CUDA provides four global, built-in variables
  • threadIdx, blockIdx, blockDim, gridDim
  • Accessible only from device code
  • Cannot take address
  • Cannot assign value
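The qualifiers, vector types and built-in variables listed above can be sketched in one small program. This is only an illustrative sketch: the names `scale` and `scaleAll` and the sizes are made up for this example and do not come from the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compiled for both host and device, so it is callable from either side.
__host__ __device__ float scale(float v, float s) { return v * s; }

// Kernel: the entry point from host to device.
__global__ void scaleAll(float4 *data, int n, float s)
{
    // threadIdx, blockIdx and blockDim are read-only built-ins,
    // visible only in device code.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = data[i];
        // Vector types are built with make_<typename> and read via .x .y .z .w
        data[i] = make_float4(scale(v.x, s), scale(v.y, s),
                              scale(v.z, s), scale(v.w, s));
    }
}

int main()
{
    const int n = 1024;
    float4 *d = nullptr;
    cudaMalloc((void **)&d, n * sizeof(float4));
    cudaMemset(d, 0, n * sizeof(float4));

    dim3 block(256);                          // dim3 from a scalar: (256, 1, 1)
    dim3 grid((n + block.x - 1) / block.x);   // enough blocks to cover n elements
    scaleAll<<<grid, block>>>(d, n, 2.0f);
    cudaDeviceSynchronize();

    // The same function also runs on the host.
    printf("host-side call: %f\n", scale(1.5f, 2.0f));

    cudaFree(d);
    return 0;
}
```

Error checking is left out here to keep the focus on the language extensions; a real program would check every runtime call.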

• ATI/AMD are trying to make a competing product: Close To the Metal
  • First release bombed
  • Second release re-written
  • Open Source
  • Not yet a viable solution

[Figure: a Device contains several Multi-Processors, each with Local Memory and Stream Processors (SP), plus Device Memory]

• The fundamental unit is the stream processor
  • A scalar, single-precision floating point ALU
  • One multiply-add per clock cycle
  • Not IEEE-754 compliant

• Stream processors are grouped into multi-processors
  • Each multi-processor runs its stream processors in SIMD lockstep
  • A number of multi-processors form a device

[Figure: a Multi-Processor with Registers, Local Memory, Stream Processors, Shared Memory, Global Memory, Constant Memory, Texture Memory]

Stream Processors have access to:

Memory Type      Access      Sharing
Registers        Read/Write  Private
Local Memory     Read/Write  Private
Shared Memory    Read/Write  Multi-Processor
Global Memory    Read/Write  Device
Constant Memory  Read        Device
Texture Memory   Read        Device
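As a rough illustration of the memory table, the sketch below touches registers, shared, constant and global memory in one kernel. The kernel name `blockSum`, the `coeff` array and the block size are assumptions made for this example. The reduction assumes the block size is a power of two and equal to `BLOCK`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 128

// Constant memory: read-only inside kernels, visible device-wide.
__constant__ float coeff[2];

// in/out point to global memory: read/write, visible device-wide.
__global__ void blockSum(const float *in, float *out, int n)
{
    // Shared memory: read/write, shared by the threads of one block.
    __shared__ float tile[BLOCK];

    // i normally lives in a register: private to this thread.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? coeff[0] * in[i] + coeff[1] : 0.0f;
    __syncthreads();

    // Tree reduction inside the block, using shared memory only.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}

int main()
{
    const int n = 1024, blocks = n / BLOCK;
    float h_in[n], h_out[blocks], h_coeff[2] = {2.0f, 1.0f};
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));

    blockSum<<<blocks, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    // Each element becomes 2 * 1 + 1 = 3, and a block sums 128 of them.
    printf("block 0 sum = %f\n", h_out[0]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Texture memory is omitted because it needs extra setup that would distract from the comparison.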

Language

Page 11: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© NVIDIA Corporation 2010

Runtime Math Library

There are two types of runtime math operations in single precision:

• __funcf(): direct mapping to hardware ISA
  • Fast but lower accuracy (see the programming guide for details)
  • Examples: __sinf(x), __expf(x), __powf(x,y)

• funcf(): compile to multiple instructions
  • Slower but higher accuracy (5 ulp or less)
  • Examples: sinf(x), expf(x), powf(x,y)

The -use_fast_math compiler option forces every funcf() to compile to __funcf()

Unit of Least Precision (ULP) is the gap between the floating-point numbers nearest a given real number
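To see the accuracy trade-off on a concrete input range, one could compare `sinf` and `__sinf` directly, as in the sketch below. The kernel name `compareSin` and the sample points are assumptions for this example. The size of the difference depends on the hardware and the argument range, so no particular value is claimed.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void compareSin(const float *x, float *accurate, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);     // compiles to multiple instructions
        fast[i]     = __sinf(x[i]);   // maps to the hardware intrinsic
    }
}

int main()
{
    const int n = 256;
    float h_x[n], h_acc[n], h_fast[n];
    for (int i = 0; i < n; ++i) h_x[i] = 0.05f * i;

    float *d_x = nullptr, *d_acc = nullptr, *d_fast = nullptr;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_acc, n * sizeof(float));
    cudaMalloc((void **)&d_fast, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    compareSin<<<1, n>>>(d_x, d_acc, d_fast, n);

    cudaMemcpy(h_acc, d_acc, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_fast, d_fast, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Largest absolute difference between the two versions.
    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
        maxDiff = fmaxf(maxDiff, fabsf(h_acc[i] - h_fast[i]));
    printf("max |sinf - __sinf| = %g\n", maxDiff);

    cudaFree(d_x);
    cudaFree(d_acc);
    cudaFree(d_fast);
    return 0;
}
```

If this file is compiled with -use_fast_math, both lines in the kernel compile to the intrinsic and the reported difference should be zero.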

Language

Page 12: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© NVIDIA Corporation 2010

CUDA APIs

• API allows the host to manage the devices
  • Allocate memory & transfer data
  • Launch kernels

• CUDA C Runtime API: high level of abstraction - start here!

• CUDA C Driver API (aka “Device” API): more control, more verbose
  • (OpenCL: similar to CUDA C Driver API)

API

Page 13: [Harvard CS264] 04 - Intermediate-level CUDA Programming

• The CUDA host API provides functions for:
  • Device management
  • Memory management
  • Stream management
  • Event management
  • Texture management
  • OpenGL/DirectX interoperability

• The host API is exposed as two different layers:
  • The low-level Device API (prefix: cu)
  • The high-level Runtime API (prefix: cuda)
  • Some things can be done through both APIs, others are specialized
  • Can be mixed together (with care)

• All GPU computing is performed on a device
  • To allocate memory, run a program, etc. on the hardware, we need a device context
  • Device contexts are bound 1:1 with host threads (just like OpenGL)
  • So, each host thread may have at most one device context
  • And each device context is accessible from only one host thread

• All Device API calls return an error/success code of type CUresult
• All Runtime API calls return an error/success code of type cudaError_t
  • An integer value, with zero = no error
  • cudaGetLastError, cudaGetErrorString

• Runtime API calls automatically initialize
• Device API calls must call cuInit first

• The first (optional) step is to enumerate the available devices:
  • cuDeviceGetCount
  • cuDeviceGet
  • cuDeviceGetName
  • cuDeviceTotalMem
  • cuDeviceGetAttribute
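The enumeration step can be sketched with the Device (driver) API as follows. The `CHECK_CU` macro is a hypothetical helper written for this example, not part of CUDA. The program is assumed to be linked against the driver library (for example with `-lcuda`).

```cuda
#include <cstdio>
#include <cuda.h>

// Hypothetical helper: stop on the first driver API call that fails.
#define CHECK_CU(call)                                                  \
    do {                                                                \
        CUresult r_ = (call);                                           \
        if (r_ != CUDA_SUCCESS) {                                       \
            fprintf(stderr, "%s failed (CUresult %d)\n", #call, (int)r_); \
            return 1;                                                   \
        }                                                               \
    } while (0)

int main()
{
    // Device API calls require explicit initialization.
    CHECK_CU(cuInit(0));

    int count = 0;
    CHECK_CU(cuDeviceGetCount(&count));

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        char name[256];
        size_t bytes = 0;
        int multiprocessors = 0;

        CHECK_CU(cuDeviceGet(&dev, i));
        CHECK_CU(cuDeviceGetName(name, (int)sizeof(name), dev));
        CHECK_CU(cuDeviceTotalMem(&bytes, dev));
        CHECK_CU(cuDeviceGetAttribute(&multiprocessors,
                                      CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT,
                                      dev));

        printf("device %d: %s, %zu MB, %d multiprocessors\n",
               i, name, bytes >> 20, multiprocessors);
    }
    return 0;
}
```

Every call returns a `CUresult`, and `CUDA_SUCCESS` is zero, which matches the "zero = no error" convention above.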

API

Page 17: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Runtime API (high-level)

Page 18: [Harvard CS264] 04 - Intermediate-level CUDA Programming

• Once we choose a device with cuDeviceGet, we get a device handle of type CUdevice
  • Can now create a context with cuCtxCreate

• The Runtime API provides a simplified interface for creating a context:
  • cudaGetDeviceCount
  • cudaSetDevice
  • And the useful: cudaChooseDevice

• Once we have a context (CUcontext), we can allocate memory, call a GPU function, etc.
  • Context is implicitly associated with the creating thread
  • To synchronize all threads (CPU host with GPU threads), call cuCtxSynchronize
    • Waits for all GPU tasks to finish

• Allocate/free memory: cuMemAlloc, cuMemFree
• Initialize memory: cuMemset (cuMemsetD8, cuMemsetD16, cuMemsetD32)
• Copy memory: cuMemcpyHtoD, cuMemcpyDtoH, cuMemcpyDtoD

• When allocating memory for the host, can use malloc, new, etc.
  • Or use cuMemAllocHost, cuMemFreeHost
  • These functions allocate host memory that is page-locked
  • Performance is improved for copies to/from page-locked host memory
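A minimal driver API round trip, covering context creation, page-locked host memory and the copy functions, might look like the sketch below. `CHECK_CU` is again a hypothetical helper. The three-argument `cuCtxCreate` call is the classic signature; this is an assumption about the installed toolkit, since newer releases may declare it differently.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda.h>

// Hypothetical helper: stop on the first driver API call that fails.
#define CHECK_CU(call)                                                  \
    do {                                                                \
        CUresult r_ = (call);                                           \
        if (r_ != CUDA_SUCCESS) {                                       \
            fprintf(stderr, "%s failed (CUresult %d)\n", #call, (int)r_); \
            return 1;                                                   \
        }                                                               \
    } while (0)

int main()
{
    CHECK_CU(cuInit(0));

    CUdevice dev;
    CHECK_CU(cuDeviceGet(&dev, 0));

    // Assumption: classic signature (context, flags, device).
    CUcontext ctx;
    CHECK_CU(cuCtxCreate(&ctx, 0, dev));

    const size_t n = 1 << 20, bytes = n * sizeof(float);

    // Page-locked host memory, allocated through the driver API.
    float *h = nullptr;
    CHECK_CU(cuMemAllocHost((void **)&h, bytes));
    for (size_t i = 0; i < n; ++i) h[i] = 1.0f;

    CUdeviceptr d;
    CHECK_CU(cuMemAlloc(&d, bytes));
    CHECK_CU(cuMemsetD8(d, 0, bytes));        // clear device memory
    CHECK_CU(cuMemcpyHtoD(d, h, bytes));      // host -> device

    memset(h, 0, bytes);                      // wipe the host copy
    CHECK_CU(cuMemcpyDtoH(h, d, bytes));      // device -> host
    CHECK_CU(cuCtxSynchronize());             // wait for all GPU work

    printf("h[0] after round trip = %f\n", h[0]);

    CHECK_CU(cuMemFree(d));
    CHECK_CU(cuMemFreeHost(h));
    CHECK_CU(cuCtxDestroy(ctx));
    return 0;
}
```

Compared with the runtime API, each step (initialization, device handle, context) is explicit, which is the extra control and verbosity the slides refer to.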

• Runtime API equivalents:
  • Allocate/free memory: cudaMalloc, cudaFree
  • Initialize memory: cudaMemset
  • Copy memory: cudaMemcpy
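The runtime API versions of the same steps are shorter, because the context is created implicitly. In this sketch `CHECK_CUDA` is a hypothetical helper macro and device 0 is an arbitrary choice.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the error string and stop on failure.
#define CHECK_CUDA(call)                                                \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(e_));                            \
            return 1;                                                   \
        }                                                               \
    } while (0)

int main()
{
    int count = 0;
    CHECK_CUDA(cudaGetDeviceCount(&count));
    if (count == 0) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    CHECK_CUDA(cudaSetDevice(0));   // context is created implicitly later

    const size_t n = 1 << 20, bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    if (!h) return 1;
    for (size_t i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = nullptr;
    CHECK_CUDA(cudaMalloc((void **)&d, bytes));
    CHECK_CUDA(cudaMemset(d, 0, bytes));                           // zero the buffer
    CHECK_CUDA(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost));   // copy zeros back

    printf("h[0] after copy back = %f\n", h[0]);

    CHECK_CUDA(cudaFree(d));
    free(h);
    return 0;
}
```

The direction of each copy is selected by the last argument of `cudaMemcpy` rather than by a separate function name as in the driver API.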

API

Page 20: [Harvard CS264] 04 - Intermediate-level CUDA Programming

• Grid size is set at the same time as the function invocation: cuLaunchGrid
  • Runtime API equivalent: the <<<...>>> kernel launch syntax
  • The compiler generates the calls to the device API that set up the execution environment

• A stream is a sequence of operations that occur in order, e.g.:
  1. Copy data from host to device
  2. Execute device function
  3. Copy data from device to host

• Different streams can be used to manage concurrency, e.g.:
  • Overlapping a memory copy from one stream with the function execution from another
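The copy, execute, copy-back sequence and the overlap between streams can be sketched with the runtime API as below. The kernel `addOne`, the number of streams and the chunk size are assumptions for this example. Whether copies and kernels actually overlap depends on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int nStreams = 2, chunk = 1 << 20;
    const size_t chunkBytes = chunk * sizeof(float);

    // Asynchronous copies need page-locked host memory to overlap with kernels.
    float *h = nullptr, *d = nullptr;
    cudaMallocHost((void **)&h, nStreams * chunkBytes);
    cudaMalloc((void **)&d, nStreams * chunkBytes);
    for (int i = 0; i < nStreams * chunk; ++i) h[i] = 0.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Within one stream the three operations run in order;
    // operations in different streams may overlap.
    for (int s = 0; s < nStreams; ++s) {
        float *hs = h + s * chunk;
        float *ds = d + s * chunk;
        cudaMemcpyAsync(ds, hs, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(ds, chunk);
        cudaMemcpyAsync(hs, ds, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);   // wait for this stream to drain
        cudaStreamDestroy(streams[s]);
    }

    printf("h[0] = %f\n", h[0]);

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

The fourth launch parameter selects the stream; the third (dynamic shared memory size) is set to zero here only so the stream argument can be passed.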

API

Page 21: [Harvard CS264] 04 - Intermediate-level CUDA Programming

“Device” Driver API (low-level)

Page 22: [Harvard CS264] 04 - Intermediate-level CUDA Programming

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(

=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

/D/#D/$$0

&

! !"#$%&'($)*+,$(-.$/0*123#+$4567,2*6+$4*08

! '#127#$9:6:;#9#6,

! <#9*0=$9:6:;#9#6,

! >,0#:9$9:6:;#9#6,

! ?1#6,$9:6:;#9#6,

! !#@,50#$9:6:;9#6,

! A/#6BCD'20#7,E$26,#0*/#0:F2G2,=

! !"#$)*+,$(-.$2+$#@/*+#3$:+$,H*$3244#0#6,$

!"#$%&

! !"#$G*H$G#1#G$'#127#$(-.$I/0#42@8$75J

! !"#$"2;"$G#1#G$K56,29#$(-.$I/0#42@8$753:J

! >*9#$,"26;+$7:6$F#$3*6#$,"0*5;"$F*,"$(-.+L$*,"#0+$:0#$+/#72:G2M#3

! %:6$F#$92@#3$,*;#,"#0$IH2,"$7:0#J

! (GG$B-&$7*9/5,26;$2+$/#04*09#3$*6$:$3#127#! !*$:GG*7:,#$9#9*0=L$056$:$/0*;0:9L$#,7$*6$,"#$":03H:0#L$H#$6##3$:$!"#$%"&%'()"*)

! '#127#$7*6,#@,+$:0#$F*563$N8N$H2,"$"*+,$,"0#:3+$IO5+,$G2P#$A/#6BCQJ! >*L$#:7"$"*+,$,"0#:3$9:=$":1#$:,$9*+,$*6#$3#127#$7*6,#@,

! (63L$#:7"$3#127#$7*6,#@,$2+$:77#++2FG#$40*9$*6G=$*6#$"*+,$,"0#:3

! (GG$3#127#$(-.$7:GG+$0#,506$:6$#00*0D+577#++$

7*3#$*4$,=/#8$+,-"./0)

! (GG$056,29#$(-.$7:GG+$0#,506$:6$#00*0D+577#++$

7*3#$*4$,=/#$%/!12--'-3)

! (6$26,#;#0$1:G5#$H2,"$M#0*$R$6*$#00*0

! %/!14")51.)2--'-L$%/!14")2--'-6)-$(7

! K56,29#$(-.$7:GG+$:5,*9:,27:GG=$262,2:G2M#

! '#127#$(-.$7:GG+$95+,$7:GG$%/8($)

! !"#$420+,$I*/,2*6:GSJ$+,#/$2+$,*$#659#0:,#$,"#$

:1:2G:FG#$3#127#+

! %/9"#$%"4")+'/()

! %/9"#$%"4")

! %/9"#$%"4"):1;"

! %/9"#$%"4")<')10=";'->

! %/9"#$%"4")?))-$@/)"

! !

• ATI/AMD are trying to make a competing product: Close To the Metal

• First release (02/07) bombed

• Second release in 12/07 re-written

• Open Source

• Not yet a viable solution

[Figure: a device containing several multi-processors, each with stream processors (SP) and local memory, all connected to device memory]

• The fundamental unit is the stream processor

• A scalar, single precision floating point ALU

• One multiply/add per clock cycle

• NOT IEEE-754 compliant

• Stream processors are grouped into multi-processors

• Multi-processors run in SIMD lockstep

• A number of multi-processors form a device

[Figure: a multi-processor whose stream processors have registers and local memory, plus shared, global, constant, and texture memory]

Stream processors have access to:

Memory Type        Access        Sharing

Registers          Read/Write    Private

Local Memory       Read/Write    Private

Shared Memory      Read/Write    Multi-Processor

Global Memory      Read/Write    Device

Constant Memory    Read          Device

Texture Memory     Read          Device

Page 23: [Harvard CS264] 04 - Intermediate-level CUDA Programming

• Once we choose a device with cuDeviceGet we get a device handle of type CUdevice

• Can now create a context with cuCtxCreate

• Runtime API provides a simplified interface for creating a context:

• cudaGetDeviceCount

• cudaSetDevice

• And the useful:

• cudaChooseDevice

• Once we have a context (CUcontext) we can allocate memory, call a GPU function, etc.

• Context is implicitly associated with the creating thread

• To synchronize all threads (CPU host with GPU threads) call cuCtxSynchronize

• Waits for all GPU tasks to finish

• Allocate/Free memory (Device API):

• cuMemAlloc, cuMemFree

• Initialize memory:

• cuMemset family (cuMemsetD8, cuMemsetD16, cuMemsetD32)

• Copy memory:

• cuMemcpyHtoD, cuMemcpyDtoH, cuMemcpyDtoD

• When allocating memory for the host, can use malloc / new / mmap

• Or use cuMemAllocHost, cuMemFreeHost

• These functions allocate host memory that is page-locked

• Performance improved for copy to/from page-locked host memory

• Allocate/Free memory (Runtime API):

• cudaMalloc, cudaFree

• Initialize memory:

• cudaMemset

• Copy memory:

• cudaMemcpy
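The allocate / initialize / copy sequence can be sketched with the Runtime API as follows. This is only an illustration: the function name `roundtrip` is invented here, and `cudaMallocHost` / `cudaFreeHost` are used as the runtime counterparts of the page-locked host allocators mentioned on the slide.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Round-trips n bytes host -> device -> host through page-locked host buffers.
// Returns cudaSuccess only if every call succeeded and the data came back intact.
cudaError_t roundtrip(size_t n) {
    unsigned char *h_src = nullptr, *h_dst = nullptr, *d_buf = nullptr;

    // Allocate: page-locked host memory and linear device memory.
    cudaError_t err = cudaMallocHost((void**)&h_src, n);
    if (err == cudaSuccess) err = cudaMallocHost((void**)&h_dst, n);
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_buf, n);

    // Initialize: fill the source pattern on the host, clear the device buffer.
    if (err == cudaSuccess) {
        std::memset(h_src, 0xAB, n);
        std::memset(h_dst, 0, n);
        err = cudaMemset(d_buf, 0, n);
    }

    // Copy: host to device, then device back to host.
    if (err == cudaSuccess) err = cudaMemcpy(d_buf, h_src, n, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(h_dst, d_buf, n, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess && std::memcmp(h_src, h_dst, n) != 0) err = cudaErrorUnknown;

    // Free whatever was allocated.
    if (d_buf) cudaFree(d_buf);
    if (h_dst) cudaFreeHost(h_dst);
    if (h_src) cudaFreeHost(h_src);
    return err;
}
```

Swapping `cudaMallocHost` for plain `malloc` should still work; the page-locked variant is there because the slide points out that copies to and from page-locked memory are faster.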


Page 26: [Harvard CS264] 04 - Intermediate-level CUDA Programming

• cuMemAlloc, etc. allocate linear memory

• Can also allocate array memory

• Arrays are created with a specific width and height and element type

• Memory layout is optimized (e.g. packing) by runtime

• cuArrayCreate, cuArrayDestroy, cuMemcpyDtoA, cuMemcpyHtoA, etc.

• A module is a blob of GPU code/data along with some type information

• .cubin files

• A module is created by loading a cubin with cuModuleLoad or cuModuleLoadData

• Module can be unloaded with cuModuleUnload

• Loading a module also copies it to the device

• Can then get the address of functions and global variables:

• cuModuleGetFunction

• cuModuleGetGlobal

• cuModuleGetTexRef

• Once a module is loaded, and we have a function pointer, we can call a function

• We must set up the execution environment first

• Execution environment includes:

• Thread Block Size

• Shared Memory Size

• Function Parameters

• Grid Size

• Thread Block Size: cuFuncSetBlockShape

• Shared Memory Size: cuFuncSetSharedSize

• Function Parameters: cuParamSetSize, cuParamSeti, cuParamSetf, cuParamSetv
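The module workflow can be sketched with the Device (driver) API roughly as below. Several things here are assumptions rather than part of the lecture: the function name `run_from_module`, the cubin path and kernel name passed in, and the kernel signature `(int* data, int n)` are all placeholders. The sketch also swaps in `cuLaunchKernel`, the newer single-call launch, in place of the cuFuncSetBlockShape / cuParamSet* sequence listed above, and uses the primary context instead of `cuCtxCreate`. Link against the driver library (typically `-lcuda`).

```cuda
#include <cuda.h>

// Loads the module at cubin_path, looks up kernel_name, and launches it on n ints.
// Returns CUDA_SUCCESS or the first error encountered.
CUresult run_from_module(const char* cubin_path, const char* kernel_name, int n) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    CUdeviceptr d_data;

    CUresult rc = cuInit(0);                    // device API must be initialized explicitly
    if (rc != CUDA_SUCCESS) return rc;
    rc = cuDeviceGet(&dev, 0);
    if (rc != CUDA_SUCCESS) return rc;
    rc = cuDevicePrimaryCtxRetain(&ctx, dev);   // a context is needed before module/memory calls
    if (rc != CUDA_SUCCESS) return rc;
    rc = cuCtxSetCurrent(ctx);

    if (rc == CUDA_SUCCESS) rc = cuModuleLoad(&mod, cubin_path);
    if (rc == CUDA_SUCCESS) {
        rc = cuModuleGetFunction(&fn, mod, kernel_name);
        if (rc == CUDA_SUCCESS) rc = cuMemAlloc(&d_data, n * sizeof(int));
        if (rc == CUDA_SUCCESS) {
            // Execution environment: grid size, block size, shared memory, parameters.
            void* args[] = { &d_data, &n };
            unsigned threads = 256;
            unsigned blocks = (n + threads - 1) / threads;
            rc = cuLaunchKernel(fn, blocks, 1, 1, threads, 1, 1, 0, nullptr, args, nullptr);
            if (rc == CUDA_SUCCESS) rc = cuCtxSynchronize();
            cuMemFree(d_data);
        }
        cuModuleUnload(mod);
    }
    cuDevicePrimaryCtxRelease(dev);
    return rc;
}
```

Every call returns a `CUresult`, so the function simply stops at the first non-zero code, which mirrors the error convention described for the Device API.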


Page 31: [Harvard CS264] 04 - Intermediate-level CUDA Programming

• Grid Size is set at the same time as the function invocation:

• cuLaunchGrid

• Recall: in .cu files, we can use the <<<...>>> function invocation moderator

• The compiler generates all the device API calls needed to set up the execution environment

• A stream is a sequence of operations that occur in order, e.g.

1. Copy data from host to device

2. Execute device function

3. Copy data from device to host

• Different streams can be used to manage concurrency, e.g.

• Overlapping memory copy from one stream with the function execution from another
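A minimal sketch of the copy / execute / copy sequence issued into two streams, using the Runtime API. The kernel `scale` and the wrapper `scale_in_two_streams` are invented for this illustration, and the choice of two streams is arbitrary. Whether the copy in one stream really overlaps the kernel in the other depends on the hardware, and the host buffer should be page-locked for the copies to be asynchronous.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= factor;
}

// Splits n floats into two halves and runs copy-in, kernel, copy-out for each
// half in its own stream. Operations within one stream stay in order.
cudaError_t scale_in_two_streams(float* h_data, int n, float factor) {
    const int kStreams = 2;
    cudaStream_t streams[kStreams];
    int created = 0;
    float* d_data = nullptr;

    cudaError_t err = cudaMalloc((void**)&d_data, n * sizeof(float));
    while (err == cudaSuccess && created < kStreams) {
        err = cudaStreamCreate(&streams[created]);
        if (err == cudaSuccess) ++created;
    }

    if (err == cudaSuccess) {
        int chunk = (n + kStreams - 1) / kStreams;
        for (int s = 0; s < kStreams; ++s) {
            int offset = s * chunk;
            int count = (n - offset < chunk) ? n - offset : chunk;
            if (count <= 0) break;
            size_t bytes = count * sizeof(float);
            // 1. copy host -> device, 2. execute, 3. copy device -> host
            cudaMemcpyAsync(d_data + offset, h_data + offset, bytes,
                            cudaMemcpyHostToDevice, streams[s]);
            scale<<<(count + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, count, factor);
            cudaMemcpyAsync(h_data + offset, d_data + offset, bytes,
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        err = cudaDeviceSynchronize();  // wait for both streams to drain
    }

    for (int s = 0; s < created; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    return err;
}
```

The fourth launch parameter inside `<<<...>>>` is the stream; leaving it out puts the kernel in the default stream.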


Page 32: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Outline

• CUDA Language & APIs (overview)

• Threading/Execution (cont’d)

• Memory/Communication (cont’d)

• Tools

• Libraries

Page 33: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© 2008 NVIDIA Corporation.

Execution Model

Software -> Hardware

Thread -> Thread Processor

• Threads are executed by thread processors

Thread Block -> Multiprocessor

• Thread blocks are executed on multiprocessors

• Thread blocks do not migrate

• Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)

Grid -> Device

• A kernel is launched as a grid of thread blocks

• Only one kernel can execute on a device at one time

Threading Hierarchy

Page 34: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© 2008 NVIDIA Corporation.

Thread Batching

Kernel launches a grid of thread blocks

Threads within a block cooperate via shared memory

Threads within a block can synchronize

Threads in different blocks cannot cooperate

Allows programs to transparently scale to different GPUs

[Figure: a grid of thread blocks, Thread Block 0 to Thread Block N-1, each with its own shared memory]
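To make the cooperation rules concrete, here is a small sketch of a per-block sum. Threads inside a block share a `__shared__` array and synchronize with `__syncthreads()`; blocks cannot cooperate, so each block writes one partial result and the host adds them up. The names `block_sum`, `sum_on_device`, and `THREADS_PER_BLOCK` are assumptions made for this example.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256  // assumed block size; the halving loop needs a power of two

// Each block sums its slice of `in` and writes one value to partial[blockIdx.x].
__global__ void block_sum(const int* in, int* partial, int n) {
    __shared__ int cache[THREADS_PER_BLOCK];        // visible to this block only
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();                                // all loads done before anyone reads

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();                            // finish this round before the next
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Host wrapper: stores the sum of h_in[0..n) in *result.
cudaError_t sum_on_device(const int* h_in, int n, long long* result) {
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    std::vector<int> h_partial(blocks);
    int *d_in = nullptr, *d_partial = nullptr;

    cudaError_t err = cudaMalloc((void**)&d_in, n * sizeof(int));
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_partial, blocks * sizeof(int));
    if (err == cudaSuccess) err = cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        block_sum<<<blocks, THREADS_PER_BLOCK>>>(d_in, d_partial, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    }
    if (err == cudaSuccess) {
        *result = 0;
        for (int b = 0; b < blocks; ++b) *result += h_partial[b];  // blocks are combined on the host
    }
    cudaFree(d_partial);
    cudaFree(d_in);
    return err;
}
```

Because no block depends on another, the hardware can run the blocks in any order and on any number of multiprocessors, which is what allows the same launch to scale across different GPUs.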

Page 35: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© 2008 NVIDIA Corporation.

Transparent Scalability

[Figure: one kernel grid (Block 0 to Block 7) run on two devices; one device executes the blocks two at a time, the other four at a time]

Hardware is free to schedule thread blocks on any processor

A kernel scales across parallel multiprocessors

Page 36: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Thread Arithmetic

Page 37: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Indexing Arrays: Example

In this example, the red entry would have an index of 21:

int index = threadIdx.x + blockIdx.x * M;

          = 5 + 2 * 8;

          = 21;

with threadIdx.x = 5, blockIdx.x = 2, and M = 8 threads/block

[Figure: array cells numbered from 0, with entry 21 (thread 5 of block 2) highlighted]

Page 38: [Harvard CS264] 04 - Intermediate-level CUDA Programming


Addition with Threads and Blocks

blockDim.x is a built-in variable that holds the number of threads per block:

int index = threadIdx.x + blockIdx.x * blockDim.x;

A combined version of our vector addition kernel to use blocks and threads:

__global__ void add( int *a, int *b, int *c ) {

int index = threadIdx.x + blockIdx.x * blockDim.x;

c[index] = a[index] + b[index];

}

So what changes in main() when we use both blocks and threads?
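One possible answer, sketched under assumptions: the host code launches `n / THREADS_PER_BLOCK` blocks of `THREADS_PER_BLOCK` threads instead of using only blocks or only threads. The wrapper name `add_on_device` and the constant `THREADS_PER_BLOCK` are made up for this example, and `n` is assumed to be a multiple of the block size because the kernel, as written on the slide, has no bounds check.

```cuda
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 8  // M from the indexing example; real code usually picks a larger block

__global__ void add(int* a, int* b, int* c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

// Adds two host arrays of n ints on the device.
// n is assumed to be a multiple of THREADS_PER_BLOCK.
cudaError_t add_on_device(const int* a, const int* b, int* c, int n) {
    size_t bytes = n * sizeof(int);
    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;

    cudaError_t err = cudaMalloc((void**)&d_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_c, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // The change: both launch parameters are used, blocks and threads per block.
        add<<<n / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_c);
    cudaFree(d_b);
    cudaFree(d_a);
    return err;
}
```

For sizes that are not a multiple of the block size, a common approach is to round the block count up and pass `n` to the kernel so that threads with `index >= n` do nothing.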

Page 39: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Control Flow

Page 40: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Control Flow Divergence

What happens if you have the following code?

if (foo(threadIdx.x)) {

    do_A();

} else {

    do_B();

}

Page 41: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Control Flow Divergence

[Figure: a branch followed by Path A and Path B]

Page 42: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Control Flow Divergence

Nested branches are handled as well

if (foo(threadIdx.x)) {

    if (bar(threadIdx.x))

        do_A();

    else

        do_B();

} else

    do_C();

Page 43: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Control Flow Divergence

[Figure: nested branches leading to Path A, Path B, and Path C]

Page 44: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Control Flow Divergence

You don't have to worry about divergence for correctness (*)

You might have to think about it for performance

Depends on your branch conditions

Page 45: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Control Flow Divergence

Performance drops off with the degree of divergence

switch (threadIdx.x % N) {

    case 0: ...

    case 1: ...

}
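The dependence on branch conditions can be illustrated with two kernels that differ only in how the condition is formed. This is a sketch with invented names (`branch_per_thread`, `branch_per_warp`, `run_branch`): in the first kernel, neighbouring threads of the same warp take different paths; in the second, the condition is uniform across each warp, so no warp has to execute both paths. Both are correct; only the first is expected to pay a divergence cost.

```cuda
#include <cuda_runtime.h>

// Even and odd threads of the same warp take different paths.
__global__ void branch_per_thread(int* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) out[i] = 1;
    else                      out[i] = 2;
}

// All threads of a warp evaluate the condition the same way.
__global__ void branch_per_warp(int* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) out[i] = 1;
    else                                   out[i] = 2;
}

// Runs one of the two kernels and copies the n results back to h_out.
cudaError_t run_branch(bool per_warp, int* h_out, int n) {
    int* d_out = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_out, n * sizeof(int));
    if (err != cudaSuccess) return err;

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    if (per_warp) branch_per_warp<<<blocks, threads>>>(d_out, n);
    else          branch_per_thread<<<blocks, threads>>>(d_out, n);

    err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err;
}
```

Timing the two variants (for example with CUDA events) on kernels that do real work in each path is one way to see how much the divergent version costs on a given device.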

Page 46: [Harvard CS264] 04 - Intermediate-level CUDA Programming

[Figure: Performance plotted against Divergence]

Page 47: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Occupancy

Page 48: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© NVIDIA Corporation 2010

Occupancy

Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy

Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently

Limited by resource usage:

• Registers

• Shared memory
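The occupancy ratio defined above can be estimated from the host. The sketch below assumes a reasonably recent CUDA toolkit, since it relies on `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which did not exist when these slides were written. The kernel `dummy_kernel` and the function `occupancy_for` are placeholders for this example.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; its register and shared memory usage determine the result.
__global__ void dummy_kernel(float* data) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    data[i] = 2.0f * data[i];
}

// Returns the occupancy of dummy_kernel at the given block size as a value
// in (0, 1], or -1 if the device cannot be queried.
float occupancy_for(int block_size) {
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1.0f;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1.0f;

    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, dummy_kernel,
                                                      block_size, 0) != cudaSuccess) {
        return -1.0f;
    }

    // warps running concurrently / maximum warps that can run concurrently
    int warps_per_block = (block_size + prop.warpSize - 1) / prop.warpSize;
    int active_warps = blocks_per_sm * warps_per_block;
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return (float)active_warps / (float)max_warps;
}
```

Trying a few block sizes (say 64, 128, 256, 512) shows how the ratio changes as register and shared memory limits start to bind.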

Page 49: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© NVIDIA Corporation 2010

Blocks per Grid Heuristics

# of blocks > # of multiprocessors

• So all multiprocessors have at least one block to execute

# of blocks / # of multiprocessors > 2

• Multiple blocks can run concurrently in a multiprocessor

• Blocks that aren't waiting at a __syncthreads() keep the hardware busy

• Subject to resource availability: registers, shared memory

# of blocks > 100 to scale to future devices

• Blocks executed in pipeline fashion

• 1000 blocks per grid will scale across multiple generations
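These three rules of thumb can be checked for a given launch configuration with a few lines of host code. This is only a sketch; the struct `GridCheck` and the function `check_grid` are invented for the example, and the thresholds are simply the ones quoted above.

```cuda
#include <cuda_runtime.h>

struct GridCheck {
    int blocks;           // blocks in the grid
    int multiprocessors;  // multiprocessors on the current device
    bool fills_device;    // # of blocks > # of multiprocessors
    bool hides_latency;   // # of blocks / # of multiprocessors > 2
    bool scales;          // # of blocks > 100
};

// Evaluates the heuristics for n elements processed by blocks of
// threads_per_block threads on the current device.
cudaError_t check_grid(int n, int threads_per_block, GridCheck* out) {
    int device = 0;
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDevice(&device);
    if (err == cudaSuccess) err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    out->blocks = (n + threads_per_block - 1) / threads_per_block;
    out->multiprocessors = prop.multiProcessorCount;
    out->fills_device = out->blocks > out->multiprocessors;
    out->hides_latency = out->blocks > 2 * out->multiprocessors;
    out->scales = out->blocks > 100;
    return cudaSuccess;
}
```

For example, one million elements with 256 threads per block gives 4096 blocks, which satisfies the "more than 100 blocks" rule regardless of the device.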

Page 50: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Outline

• CUDA Language & APIs (overview)

• Threading/Execution (cont’d)

• Memory/Communication (cont’d)

• Tools

• Libraries

Page 51: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© 2008 NVIDIA Corporation.

Kernel Memory Access

Per-thread

• Registers: on-chip

• Local Memory: off-chip, uncached

Per-block

• Shared Memory: on-chip, small, fast

Per-device

• Global Memory: off-chip, large, uncached

• Persistent across kernel launches (Kernel 0, Kernel 1, ... over time)

• Kernel I/O

Page 52: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Registers

• Large number of registers per stream processor (1024 in some hardware)

• Zero-clock cycle access

• Store either 32bit integer or 32bit float

Local Memory

• A small portion of global memory that is private to a stream processor

• Often used as overflow from registers

• Slow to access (same as global memory)

Shared Memory

• A block of memory that is shared by all stream processors in a multi-processor

• 16KB per block, stored in 16x1KB banks

• Very fast to access (i.e. as fast as registers!) without bank conflicts

Global Memory

• The large block of memory shared by all multi-processors on the compute device

• Size depends on device: 256MB to 1.5GB

• High bandwidth ~ 100GB/s

• Slow to access: several hundred clock cycle latency

• NOT cached

Constant Memory

• A block of read-only memory shared by all multi-processors (64KB)

• Cached via 8KB cache per multi-processor

• Slow to access: several hundred clock cycle latency on cache miss

Texture Memory

• A large block of read-only memory shared by all multi-processors

• Reads from texture memory can be interpolated (nearest or linear interpolation for free!)

• Cached via 8KB cache per multi-processor

• Slow to access: several hundred clock cycle latency on cache miss
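How several of these memory spaces show up in source code can be sketched with a small 3-point smoothing kernel. Everything named here (`smooth`, `smooth_on_device`, `weights`, `RADIUS`, `BLOCK`) is an assumption made for the illustration: the filter weights live in `__constant__` memory, each block stages its inputs in `__shared__` memory, loop variables and the accumulator are ordinary locals (normally held in registers), and the input and output arrays are in global memory.

```cuda
#include <cuda_runtime.h>

#define RADIUS 1
#define BLOCK 128

__constant__ float weights[2 * RADIUS + 1];        // constant memory: read-only, cached

__global__ void smooth(const float* in, float* out, int n) {   // in/out: global memory
    __shared__ float tile[BLOCK + 2 * RADIUS];     // shared memory: one tile per block
    int g = threadIdx.x + blockIdx.x * blockDim.x; // locals such as g, l, acc: registers
    int l = threadIdx.x + RADIUS;

    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {                    // first threads also load the halo cells
        int left = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left >= 0 && left < n) ? in[left] : 0.0f;
        tile[l + BLOCK] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (g < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k) acc += weights[k + RADIUS] * tile[l + k];
        out[g] = acc;
    }
}

// Host wrapper: h_weights must hold 2 * RADIUS + 1 floats.
cudaError_t smooth_on_device(const float* h_in, float* h_out, int n, const float* h_weights) {
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;

    cudaError_t err = cudaMemcpyToSymbol(weights, h_weights, (2 * RADIUS + 1) * sizeof(float));
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_in, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_out, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        smooth<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);  // block size must equal BLOCK
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaFree(d_in);
    return err;
}
```

Staging the inputs in shared memory means each global element is read roughly once per block instead of once per neighbouring thread, which is the usual motivation for using the fast per-block memory.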

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(

=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

/DE/D/$$0

#

! !"#$!%&'()*'+),-./'+0'1(2*'('3014*+-./'

4)0563+7''890:*'"0'+;*'%*+(9

! <-):+')*9*(:*'=>?$>@A'B01B*5

! C*30.5')*9*(:*'-.'D?$>@')*EF)-++*.

! G4*.'C06)3*

! H0+',*+'('I-(B9*':096+-0.

&*I-3*

%69+-EJ)03*::0) %69+-EJ)03*::0)

!

!"#$%&'()"*+ !"#$%&'()"*+

,(-.#(&'()"*+

/0 /0

/0 /0

/0 /0

! !

/0 /0

/0 /0

/0 /0

! !

%69+-EJ)03*::0)

!"#$%&'()"*+

/0 /0

/0 /0

/0 /0

! !

! ";*'K6.5(1*.+(9'6.-+'-:'+;*'!"#$%&'(#)*$!!)#

! !':3(9()L'1.23%(45*(#.1."2 K90(+-./'40-.+'!MN

! G.*'%69+-49,$!55'4*)'39032'3,39*

! 678 #OOOE@PQ'30149-(.+

! C+)*(1'4)03*::0):'()*'/)064*5'-.+0'&+,"-.

(#)*$!!)#!

! %69+-E4)03*::0):')6.'-.'/9',&%"#:1;(5

! !'.61B*)'0K'169+-E4)03*::0):'K0)1'('/$0-*$

%69+-EJ)03*::0)

<(3.1;(*1!"#$%&

'()"*+

!

/;*($)&0*"#(11"*

/=$*(>&'()"*+

?%"@$%&'()"*+

A"21;$2;&'()"*+

8(B;C*(&'()"*+

/;*($)&0*"#(11"*1&=$-(&$##(11&;"D

'()"*+&8+5( E##(11 /=$*.23

R*/-:+*): R*(5$S)-+* J)-I(+*

M03(9'%*10), R*(5$S)-+* J)-I(+*

C;()*5'%*10), R*(5$S)-+* %69+-EJ)03*::0)

T90B(9'%*10), R*(5$S)-+* &*I-3*

80.:+(.+'%*10), R*(5 &*I-3*

"*U+6)*'%*10), R*(5 &*I-3*

Page 53: [Harvard CS264] 04 - Intermediate-level CUDA Programming

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(

=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

/DE/D/$$0

'

!"#$%&"'%

! !"#$%&'()*%#&+,&#%$-./%#.&0%#&./#%")&

0#+1%..+#&23456&-'&.+)%&7"#89"#%:

! ;%#+<1=+1>&1?1=%&"11%..

! @/+#%&%-/7%#&A5*-/&-'/%$%#&+#&A5*-/&,=+"/

()*+,-."/)'0

! B&.)"==&0+#/-+'&+,&$=+*"=&)%)+#?&/7"/&-.&

0#-C"/%&/+&"&./#%")&0#+1%..+#

! D,/%'&(.%8&".&+C%#,=+9&,#+)&#%$-./%#.

! @=+9&/+&"11%..&2.")%&".&$=+*"=&)%)+#?:

12+'"3-."/)'0

! B&*=+1>&+,&)%)+#?&/7"/&-.&.7"#%8&*?&"==&

./#%")&0#+1%..+#.&-'&"&)(=/-<0#+1%..+#

! 3EFG&0%#&*=+1>H&./+#%8&-'&3EI3FG&*"'>.

! J%#?&,"./&/+&"11%..&2-K%K&".&,"./&".&#%$-./%#.L:&&9-/7+(/&!"#$%&'#()*&+,

4,)5+,-."/)'0

! M7%&="#$%&*=+1>&+,&)%)+#?&.7"#%8&*?&"==&

)(=/-<0#+1%..+#.&+'&/7%&1+)0(/%&8%C-1%

! @-N%&8%0%'8.&+'&8%C-1%&! 5OEPG&/+&3KOQG

! R-$7&*"'89-8/7&S&344QGT.

! @=+9&/+&"11%..&! .%C%#"=&7('8#%8&1=+1>&1?1=%&

="/%'1?K&

! 678-1"17%8

9):%&+:&-."/)'0

! B&*=+1>&+,&#%"8<+'=?&)%)+#?&.7"#%8&*?&"==&

)(=/-<0#+1%..+#.&2E6FG:

! U"17%8&C-"&VFG&1"17%&0%#&)(=/-<0#+1%..+#

! @=+9&/+&"11%..&! .%C%#"=&7('8#%8&1=+1>&1?1=%&="/%'1?&+'&1"17%&)-..

8";&<'"-."/)'0! B&="#$%&*=+1>&+,&#%"8<+'=?&)%)+#?&.7"#%8&*?&"==&)(=/-<0#+1%..+#.

! W%"8.&,#+)&/%I/(#%&)%)+#?&1"'&*%&"#$%&'()*#+,!X%"#%./&+#&=-'%"#&-'/%#0+="/-+'&,+#&,#%%L

! U"17%8&C-"&VFG&1"17%&0%#&)(=/-<0#+1%..+#! @=+9&/+&"11%..&! .%C%#"=&7('8#%8&1=+1>&1?1=%&="/%'1?&+'&1"17%&)-..

Page 55: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© 2008 NVIDIA Corporation.

Kernel Memory Access

Per-thread
• Registers: on-chip
• Local Memory: off-chip, uncached

Per-block
• Shared Memory: on-chip, small, fast

Per-device
• Global Memory: off-chip, large, uncached, persistent across kernel launches, kernel I/O

Global Memory

• Off-chip, large
• Uncached
• Persistent across kernel launches
• Kernel I/O
• Different types of “global memory”
  • Linear Memory
  • Texture Memory
  • Constant Memory
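The three levels above (per-thread registers, per-block shared memory, per-device global memory) can be seen together in one small kernel. The following is a minimal sketch, not code from the lecture: the names `blockSum`, `sumOfOnes` and `BLOCK` are made up for illustration, and error handling is reduced to a single check.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block (illustrative choice)

// Each block sums BLOCK consecutive elements of `in` and writes one partial sum to `out`.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK];                   // per-block, on-chip shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // scalars such as `i` normally live in registers
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // read from global memory
    __syncthreads();

    // Tree reduction carried out entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                  // write to global memory (kernel I/O)
}

// Hypothetical helper: sums n ones on the GPU and returns the total, or -1 on allocation failure.
float sumOfOnes(int n)
{
    int blocks = (n + BLOCK - 1) / BLOCK;
    std::vector<float> h_in(n, 1.0f), h_out(blocks, 0.0f);

    float *d_in = 0, *d_out = 0;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, blocks * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return -1.0f; }

    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<blocks, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += h_out[b];   // finish the sum on the host

    cudaFree(d_in);
    cudaFree(d_out);
    return total;
}
```

Calling `sumOfOnes(1000)` from `main` exercises all three memory spaces: global memory carries the kernel's input and output, and the reduction itself never leaves the chip.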

Page 56: [Harvard CS264] 04 - Intermediate-level CUDA Programming

• Note that local, global, constant and texture memory all come from the same physical memory pool
• They just differ in access patterns, caching, etc.

Page 58: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Constant Memory

Constants set by CPU, read by GPU

Each SM has 8kiB cache for constants

Optimized for broadcast

Accessing different elements forces serialisation

Can speed some calculations

Can relieve register pressure

Page 59: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Constant Memory

Declared at file scope:
    __constant__ float dc_myConst;

Set from the host via the cudaMemcpyToSymbol API call, which takes a pointer to the source value:
    float h_myConst = 3.14f;
    cudaMemcpyToSymbol( dc_myConst, &h_myConst, sizeof(float) );

Accessed by name in a kernel:
    __global__ void MyKernel( ... ) { .... float myVal = dc_myConst + 1; .... }
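A complete round trip through constant memory might look like the sketch below. It is an illustration under assumptions, not the lecture's code: `dc_scale`, `scaleKernel` and `scaleOnes` are hypothetical names, and return codes are ignored for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

__constant__ float dc_scale;             // file-scope constant, written by the host, read by the GPU

__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= dc_scale;             // all threads read the same address: the broadcast case
}

// Hypothetical helper: multiplies an array of n ones by `scale` on the GPU and returns element 0.
float scaleOnes(float scale, int n)
{
    std::vector<float> h(n, 1.0f);
    float *d = 0;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The symbol itself is passed, together with a pointer to the host value.
    cudaMemcpyToSymbol(dc_scale, &scale, sizeof(float));

    scaleKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h[0];
}
```

Because every thread reads the same constant, this access pattern is the one the constant cache is optimized for; indexing a `__constant__` array with a per-thread index would instead serialize the reads.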

Page 60: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Textures

Textures are essentially look up tables

Can only be written by the host

Cached on each multiprocessor (8kiB)

Optimised for 2D spatial locality

Hardware interpolation possible

Limited precision

Can clamp or wrap at boundaries

Page 61: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Textures

Declaration and setup rather involved

See programming guide

Accessed in kernels via texture fetches: tex1D, tex2D, tex3D, etc.

Co-ordinates at texel centres

Have to take care when accessing elements

Page 62: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Textures

Can improve load coalescing from global memory

If whole texture fits in 8kiB cache, has grid lifetime

Clamping/wrapping can aid edge case handling

Have to test to determine benefits

Page 63: [Harvard CS264] 04 - Intermediate-level CUDA Programming

General Principles

Memory access patterns are crucial

Even CPUs are typically memory bound

GPUs have 100x FP

Only 10x memory bandwidth

Have to keep the GPU busy

Page 64: [Harvard CS264] 04 - Intermediate-level CUDA Programming

PC Architecture
modified from Matthew Bolitho

[Figure: block diagram of a PC with CPU, Northbridge, DRAM, Southbridge, SATA, Ethernet and a graphics card running CUDA, joined by the Front Side Bus, Memory Bus, PCI Bus and PCI-Express Bus. Bandwidth labels: 3+ Gb/s, 8 GB/s, 25+ GB/s, and 160+ GB/s to VRAM.]

PCI-E or PCIe
• Replaced AGP
• P2P, full duplex serial, symmetric bus
• 250MB/s bandwidth in each direction
• E.g. PCI-E 16x = 16 lanes, 16 times the bandwidth

Page 65: [Harvard CS264] 04 - Intermediate-level CUDA Programming

PCIe Transfers (first thing to optimize?)

Page 66: [Harvard CS264] 04 - Intermediate-level CUDA Programming

[Figure: data path between CPU memory and GPU memory through the chipset. Labelled steps: copy data from CPU memory to GPU memory; copy data from GPU memory to CPU memory.]

*Averaged observed bandwidth

Page 67: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© NVIDIA Corporation 2010

Processing Flow

1. Copy input data from CPU memory to GPU memory

PCI Bus

Page 68: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Processing Flow

1. Copy input data from CPU memory to GPU memory

2. Load GPU program and execute, caching data on chip for performance

PCI Bus

Page 69: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Processing Flow

1. Copy input data from CPU memory to GPU memory

2. Load GPU program and execute, caching data on chip for performance

3. Copy results from GPU memory to CPU memory

PCI Bus

Page 70: [Harvard CS264] 04 - Intermediate-level CUDA Programming

PCIe Transfers

PCIe 2.0 x16 bus has

Latency of 10 µs (observed)

Bandwidth of 8GB/s (theory), 5 GB/s (observed)

A lot of calculations can happen in these times
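To get a feel for "these times", a rough cost model is time = latency + bytes / bandwidth. The sketch below plugs in the observed figures from this slide (10 µs, 5 GB/s) as defaults; the function names are hypothetical, and the model ignores effects such as pageable-memory staging.

```cuda
// Rough estimate of one PCIe transfer: fixed latency plus size divided by bandwidth.
// Defaults are the observed numbers quoted on the slide (10 us, 5 GB/s).
double transferSeconds(double bytes,
                       double latencySeconds = 10e-6,
                       double bytesPerSecond = 5e9)
{
    return latencySeconds + bytes / bytesPerSecond;
}

// Floating-point operations a device could have executed during that transfer,
// for an assumed sustained rate `flopsPerSecond` (a parameter, not a measured value).
double flopsForgone(double bytes, double flopsPerSecond)
{
    return transferSeconds(bytes) * flopsPerSecond;
}
```

Under these numbers a 4-byte copy costs about 10 µs, almost all of it latency, while a 100 MB copy takes about 20 ms. This is why many small transfers are much worse than one large one.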

Page 71: [Harvard CS264] 04 - Intermediate-level CUDA Programming

PCIe Transfers

PCIe transfers occur via DMA

GPU reads pages direct from CPU memory

Very bad if page gets moved mid-transfer

CUDA maintains internal pinned memory buffers

Used for cudaMemcpy calls

Data staged through these

Page 72: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Synchronous Functions

• Standard CUDA C functions are synchronous
• Kernel launches are:
  • Runtime API: asynchronous
  • Driver API: cuLaunchGrid or cuLaunchGridAsync
• Synchronous functions block on any prior asynchronous kernel launches

Page 73: [Harvard CS264] 04 - Intermediate-level CUDA Programming

    cudaMemcpy( ... );
    myKernel<<< ... >>>( ... );    // Returns immediately
    cudaMemcpy( ... );             // Waits for myKernel to complete, then blocks until copy is complete

• cudaSetDeviceFlags() sets how the host waits: a tradeoff between CPU cycles and response speed
  • cudaDeviceScheduleSpin
  • cudaDeviceScheduleYield
  • cudaDeviceBlockingSync

Page 74: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Asynchronous API

• All memory operations can also be asynchronous, and return immediately
• Memory must be allocated as pinned using:
  • cuMemHostAlloc()
  • cudaHostAlloc()
  • Older versions of these functions, cuMemAllocHost() and cudaMallocHost(), also work
• Pinned memory is locked to a physical address, which allows direct DMA transfers by the GPU

Page 75: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Asynchronous API (cont.)

• Copies & kernels are queued up in the GPU
• Any launch overhead is overlapped
• Synchronous calls should be done outside critical sections, since some of these are expensive!
  • Initialization
  • Memory allocations
  • Stream / Event creation
  • Interop resources
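Putting the asynchronous API together: pinned host memory, asynchronous copies, and a kernel launch queued in one stream. This is a minimal sketch with hypothetical names (`addOne`, `pinnedAsyncRoundTrip`); it uses `cudaStreamSynchronize`, since `cudaThreadSynchronize` from this era is deprecated in later toolkits.

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns the number of wrong elements (0 on success), or -1 on allocation failure.
int pinnedAsyncRoundTrip(int n)
{
    size_t bytes = n * sizeof(float);
    float *h = 0, *d = 0;

    // Pinned (page-locked) host memory is required for truly asynchronous copies.
    if (cudaHostAlloc((void**)&h, bytes, cudaHostAllocDefault) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&d, bytes) != cudaSuccess) { cudaFreeHost(h); return -1; }
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);  // returns immediately
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(d, n);             // returns immediately
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);  // returns immediately

    // ... the CPU is free to do unrelated work here ...

    cudaStreamSynchronize(stream);   // wait for everything queued in this stream

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != 1.0f) ++bad;

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return bad;
}
```

The expensive calls (allocation, stream creation) sit outside the copy/kernel/copy sequence, which matches the advice above to keep them out of critical sections.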

Page 76: [Harvard CS264] 04 - Intermediate-level CUDA Programming

    cudaMemcpyAsync( ... );        // Returns immediately
    myKernel<<< ... >>>( ... );    // Returns immediately
    cudaMemcpyAsync( ... );        // Returns immediately

    // CPU does other stuff here

    cudaThreadSynchronize();       // Waits for everything on the GPU to finish, then returns

More on streams soon, for now assume stream = 0

Page 77: [Harvard CS264] 04 - Intermediate-level CUDA Programming

• Events (cudaEvent_t, CUevent) can be used to monitor completion
• Created by cudaEventCreate() / cuEventCreate()

    cudaEvent_t HtoDdone;
    cudaEventCreate(&HtoDdone);
    cudaMemcpyAsync(dest, source, bytes, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(HtoDdone);
    myKernel<<< ... >>>( ... );
    cudaMemcpyAsync(dest, source, bytes, cudaMemcpyDeviceToHost, 0);

    // CPU can do stuff here

    cudaEventSynchronize(HtoDdone);   // Waits just for everything before cudaEventRecord(HtoDdone) to complete, then returns
    cudaThreadSynchronize();          // Waits for everything on the GPU to finish, then returns

When cudaEventSynchronize(HtoDdone) returns, the first memory copy is done, so the memory at source could be used again by the CPU.
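Events are also the usual way to time GPU work, because they are recorded in the GPU's own queue. The sketch below times a single host-to-device copy; `timeHostToDeviceMs` is a hypothetical helper name, and the measured value will depend on the machine.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: elapsed milliseconds for one pinned host-to-device copy, or -1 on failure.
float timeHostToDeviceMs(size_t bytes)
{
    void *h = 0, *d = 0;
    if (cudaHostAlloc(&h, bytes, cudaHostAllocDefault) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d, bytes) != cudaSuccess) { cudaFreeHost(h); return -1.0f; }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                                  // queued before the copy
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, 0);    // buffer contents do not matter here
    cudaEventRecord(stop, 0);                                   // queued after the copy

    cudaEventSynchronize(stop);      // waits only for work recorded before `stop`

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    cudaFreeHost(h);
    return ms;
}
```

Dividing the byte count by the returned time gives an observed bandwidth figure that can be compared against the theoretical PCIe number.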

Page 78: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Strategy: Overlap Acquisition With Transfer

• Allocate 2 pinned CPU buffers, ping-pong between them

[Code listing not recoverable from the source: a loop that calls cudaMemcpyAsync(), launches myKernel and calls cudaThreadSynchronize().]

Page 79: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDA Streams

• NVIDIA GPUs with compute capability >= 1.1 have a dedicated DMA engine
• DMA transfers over PCIe can be concurrent with CUDA kernel execution*
• Streams allow independent concurrent in-order queues of execution
  • cudaStream_t, CUstream
  • cudaStreamCreate(), cuStreamCreate()
• Multiple streams exist within a single context; they share memory and other resources

[Figure: a GPU containing a memory controller, GPU memory, a DMA engine and a compute engine.]

*1D copies only, cudaMemcpy2DAsync cannot overlap

Page 80: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Stream parameter

• All Async function varieties have a stream parameter
• Runtime kernel launch: <<< GridSize, BlockSize, SMEM Size, Stream >>>
• Driver API: cuLaunchGridAsync(function, width, height, stream)
• Copies & kernel launches with the same stream parameter execute in-order
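As an illustration of the stream parameter, the sketch below splits an array in two and gives each half its own stream, so the copy of one half may overlap the kernel of the other on hardware that supports it. The names `doubleIt` and `twoStreamDouble` are hypothetical, and `n` is assumed to be even.

```cuda
#include <cuda_runtime.h>

__global__ void doubleIt(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: doubles n floats using two streams.
// Returns the number of wrong elements (0 on success), or -1 on allocation failure.
int twoStreamDouble(int n)
{
    const int half = n / 2;                       // n is assumed even
    const size_t halfBytes = half * sizeof(float);
    float *h = 0, *d = 0;
    if (cudaHostAlloc((void**)&h, 2 * halfBytes, cudaHostAllocDefault) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&d, 2 * halfBytes) != cudaSuccess) { cudaFreeHost(h); return -1; }
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

    for (int k = 0; k < 2; ++k) {
        float *hp = h + k * half;
        float *dp = d + k * half;
        // Work queued in the same stream runs in order; the two streams are independent.
        cudaMemcpyAsync(dp, hp, halfBytes, cudaMemcpyHostToDevice, s[k]);
        doubleIt<<<(half + 255) / 256, 256, 0, s[k]>>>(dp, half);
        cudaMemcpyAsync(hp, dp, halfBytes, cudaMemcpyDeviceToHost, s[k]);
    }

    for (int k = 0; k < 2; ++k) cudaStreamSynchronize(s[k]);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != 2.0f * (float)i) ++bad;

    for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
    cudaFree(d);
    cudaFreeHost(h);
    return bad;
}
```

The result is correct regardless of whether any overlap actually happens; how much overlap is achieved depends on the device and on the order in which the work is issued.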

Page 81: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Independent Tasks

[Figure: CUDA streams for two independent tasks. Task A: Kernel A1, Kernel A2, Kernel A3, Copy A1, Copy A2. Task B: Kernel B1, Copy B1, Copy B2, Copy B3, Copy B4.]

Scheduling on GPU

[Figure: the same kernels and copies placed in the Copy Engine and Compute Engine queues.]

Page 82: [Harvard CS264] 04 - Intermediate-level CUDA Programming

• Engine queues are filled in the order code is executed

Wrong Way

[Figure: Stream A and Stream B with their kernels (A1-A3, B1) and copies (A1, A2, B1-B4), and the resulting Copy Engine and Compute Engine queues.]

Page 83: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Correct Way

[Figure: Stream A and Stream B with their kernels (A1-A3, B1) and copies (A1, A2, B1-B4), and the resulting Copy Engine and Compute Engine queues.]

Page 84: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Host Memory Mapping, a.k.a. Zero-Copy

• The easy way to achieve copy/compute overlap!
• Unified Memory Architecture (UMA) GPUs: Zero-copy eliminates data transfer altogether!

1. Enable Host Mapping*
   • Runtime: cudaSetDeviceFlags() with cudaDeviceMapHost flag
   • Driver: cuCtxCreate() with CU_CTX_MAP_HOST
2. Allocate pinned CPU memory
   • Runtime: cudaHostAlloc(), use cudaHostAllocMapped flag
   • Driver: cuMemHostAlloc(), use CU_MEMHOSTALLOC_DEVICEMAP
3. Get a CUDA device pointer to this memory
   • Runtime: cudaHostGetDevicePointer()
   • Driver: cuMemHostGetDevicePointer()
4. Just use that pointer in your kernels!

*Check the canMapHostMemory / CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY device property flag to see if Zero-Copy is available.
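The four runtime-API steps above can be sketched as follows. This is an illustration rather than the lecture's code: `increment` and `zeroCopyIncrement` are hypothetical names, device 0 is assumed, and `cudaDeviceSynchronize` is used where a 2011 program would have called `cudaThreadSynchronize`.

```cuda
#include <cuda_runtime.h>

__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;          // reads and writes go over PCIe to host memory
}

// Hypothetical helper: returns the number of wrong elements (0 on success),
// or -1 if the device cannot map host memory or allocation fails.
int zeroCopyIncrement(int n)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory)
        return -1;                                    // Zero-Copy not available

    // Step 1: enable host mapping. The return code is ignored here because the call
    // can fail harmlessly if a context already exists.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Step 2: allocate pinned, mapped host memory.
    int *h = 0;
    if (cudaHostAlloc((void**)&h, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return -1;
    for (int i = 0; i < n; ++i) h[i] = i;

    // Step 3: get the device-side pointer that aliases the host buffer.
    int *d = 0;
    if (cudaHostGetDevicePointer((void**)&d, h, 0) != cudaSuccess) { cudaFreeHost(h); return -1; }

    // Step 4: just use that pointer in the kernel; there is no cudaMemcpy anywhere.
    increment<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();                          // make the kernel's writes visible to the host

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != i + 1) ++bad;

    cudaFreeHost(h);
    return bad;
}
```

Each element is touched exactly once, which is the "read/written once" case where zero-copy tends to pay off; data reused many times is usually better copied to device memory.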

Page 85: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Zero-Copy Guidelines

• Data is transferred over the PCIe bus automatically
• Use when data is only read/written once
• Use for very small amounts of data (new variables, CPU/GPU communication)
• Use when compute/memory ratio is very high and occupancy is high, so latency over PCIe is hidden
• Coalescing is critically important!

Page 86: [Harvard CS264] 04 - Intermediate-level CUDA Programming

PCIe Transfers Optimization

PCIe bus is slow

Try to minimize transfers

Use pinned memory on host whenever possible

Try to perform copies asynchronously

Page 87: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Outline

• CUDA Language & APIs (overview)

• Threading/Execution (cont’d)

• Memory/Communication (cont’d)

• Tools

• Libraries

Page 88: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDA-GDB

Extended version of GDB with support for C for CUDA

Supported on Linux 32bit/64bit systems

Seamlessly debug both the host|CPU and device|GPU code
Set breakpoints on any source line or symbol name
Single step executes only one warp except on sync threads
Access and print all CUDA memory allocations, local, global, constant and shared vars.

Walkthrough example with source code: CUDA-GDB manual

Page 89: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Linux GDB Integration with EMACS

Page 90: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Linux GDB Integration with DDD

Page 91: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDA-MemCheck

Detects/tracks memory errors
    Out of bounds accesses
    Misaligned accesses (types must be aligned on their size)

Integrated into CUDA-GDB
Linux and WinXP
Win7 and Vista support coming

Page 92: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDA Driver Low-level Profiling support

1. Set up environment variables

    export CUDA_PROFILE=1
    export CUDA_PROFILE_CSV=1
    export CUDA_PROFILE_CONFIG=config.txt
    export CUDA_PROFILE_LOG=profile.csv

2. Set up configuration file

FILE "config.txt":

    gpustarttimestamp
    instructions

3. Run application

    matrixMul

4. View profiler output

FILE "profile.csv":

    # CUDA_PROFILE_LOG_VERSION ...
    # CUDA_DEVICE 0 GeForce ... GT
    # CUDA_PROFILE_CSV 1
    # TIMESTAMPFACTOR ...
    gpustarttimestamp,method,gputime,cputime,occupancy,instructions
    ...,memcpyHtoD,...
    ...,memcpyHtoD,...
    ...,memcpyHtoD,...
    ...,_Z10dmatrixMulPfiiS_iiS_,...
    ...,memcpyDtoH,...

[The numeric fields of the log are illegible in the source.]

Page 93: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDA Visual Profiler - Overview

Performance analysis tool to fine tune CUDA applications

Supported on Linux/Windows/Mac platforms

Functionality:

Execute a CUDA application and collect profiling data

Multiple application runs to collect data for all hardware performance counters

Profiling data for all kernels and memory transfers

Analyze profiling data

Page 94: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDA Visual Profiler data for kernels

Page 95: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDA Visual Profiler computed data for kernels

Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate

Global memory read throughput (Gigabytes/second)

Global memory write throughput (Gigabytes/second)

Overall global memory access throughput (Gigabytes/second)

Global memory load efficiency

Global memory store efficiency

Page 96: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDA Visual Profiler data for memory transfers

Memory transfer type and direction (D=Device, H=Host, A=cuArray)

e.g. H to D: Host to Device

Synchronous / Asynchronous

Memory transfer size, in bytes

Stream ID

Page 97: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDA Visual Profiler data analysis views

Views:
    Summary table
    Kernel table
    Memcopy table
    Summary plot
    GPU Time Height plot
    GPU Time Width plot
    Profiler counter plot
    Profiler table column plot
    Multi-device plot
    Multi-stream plot

Analyze profiler counters

Analyze kernel occupancy

Page 98: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDA Visual Profiler Misc.

Multiple sessions

Compare views for different sessions

Comparison Summary plot

Profiler projects save & load

Import/Export profiler data (.CSV format)

Page 99: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Outline

• CUDA Language & APIs (overview)

• Threading/Execution (cont’d)

• Memory/Communication (cont’d)

• Tools

• Libraries

Page 100: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUBLAS

Page 101: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© NVIDIA Corporation 2009

CUBLAS

CUDA accelerated BLAS (Basic Linear Algebra Subprograms)

Create matrix and vector objects in GPU memory space
Fill objects with data
Call sequence of CUBLAS functions
Retrieve data from GPU (optionally)

    while( i++ < max_iter && deltanew > stop_tol ) {
        cublasSgemv( ... );
        float alpha = deltanew / cublasSdot(N, d_d, 1, d_y, 1);
        cublasSaxpy(N, alpha, d_d, 1, d_x, 1);

        // every 50 iterations, restart residual
        if( i % 50 == 0 ) {
            cublasSgemv( ... );
            cublasScopy(N, d_b, 1, d_r, 1);
            cublasSaxpy(N, -1.0, d_y, 1, d_r, 1);
        }
        else
            cublasSaxpy(N, -alpha, d_y, 1, d_r, 1);
        ...
    }

Page 102: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUBLAS Features

Single precision data:
    Level 1 (vector-vector O(N))
    Level 2 (matrix-vector O(N^2))
    Level 3 (matrix-matrix O(N^3))

Complex single precision data:
    Level 1
    CGEMM

Double precision data:
    Level 1: DASUM, DAXPY, DCOPY, DDOT, DNRM2, DROT, DROTM, DSCAL, DSWAP, ISAMAX, IDAMIN
    Level 2: DGEMV, DGER, DSYR, DTRSV
    Level 3: ZGEMM, DGEMM, DTRSM, DTRMM, DSYMM, DSYRK, DSYR2K

Page 103: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUBLAS Performance: CPU vs GPU

CUBLAS: CUDA 2.3, Tesla C1060 MKL 10.0.3: Intel Core2 Extreme, 3.00GHz

Page 104: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUBLAS Performance

[Figure: chart of GEMM performance relative to MKL (axis 0x to 12x) against matrix dimensions (NxN) from 1024 to 7168; series: MKL, v3.1, v3.2.]

Up to 2x average speedup over CUBLAS 3.1

Less variation in performance for different dimensions vs. 3.1

Average speedup of {S/D/C/Z}GEMM x {NN,NT,TN,TT}

Versions 3.2 & 3.1 on NVIDIA Tesla C2050 GPU; MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)

Page 105: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CULA

Page 106: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CULA (LAPACK for heterogeneous systems)

GPU Accelerated Linear Algebra
• Dense linear algebra
• C/C++ & FORTRAN
• 150+ Routines

MATLAB Interface
• 15+ functions
• Up to 10x speedup

Supercomputer Speeds

Performance 7x of

Partnership

Developed in partnership with NVIDIA

Page 107: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CULA - Performance

Supercomputing Speeds

This graph shows the relative speed of many CULA functions when compared to

(Fermi) and an Intel Core i7 860. More at www.culatools.com

Page 108: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUSPARSE

Page 109: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Sparse Matrix Performance: CPU vs. GPU

[Figure: chart with a 0x to 35x axis, "Multiplication of a sparse matrix by multiple vectors"; labels: "Non-transposed", "Transposed", MKL 10.2.]

Average speedup across S,D,C,Z

CUSPARSE 3.2 on NVIDIA Tesla C2050 GPU; MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)

Page 110: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUFFT

Page 111: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUFFT

CUFFT is the CUDA FFT library
Computes parallel FFT on an NVIDIA GPU

Plan contains information about optimal configuration for a given transform.
Plans can be persisted to prevent recalculation.
Good fit for CUFFT because different kinds of FFTs require different thread/block/grid configurations.

Page 112: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUFFT Features

1D, 2D and 3D transforms of complex and real-valued data
Batched execution for doing multiple 1D transforms in parallel
1D transform size up to 8M elements
2D and 3D transform sizes in the range [2,16384]
In-place and out-of-place transforms for real and complex data.

Page 113: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUFFT Example

Complex 2D transform

    #define NX 256
    #define NY 128

    cufftHandle plan;
    cufftComplex *idata, *odata;
    cudaMalloc((void**)&idata, sizeof(cufftComplex)*NX*NY);
    cudaMalloc((void**)&odata, sizeof(cufftComplex)*NX*NY);

    /* Create a 2D FFT plan. */
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);

    /* Use the CUFFT plan to transform the signal out of place. */
    cufftExecC2C(plan, idata, odata, CUFFT_FORWARD);

    /* Inverse transform the signal in place. */
    cufftExecC2C(plan, odata, odata, CUFFT_INVERSE);

    /* Destroy the CUFFT plan. */
    cufftDestroy(plan);

    cudaFree(idata); cudaFree(odata);

Page 114: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUFFT Performance: CPU vs GPU

cuFFT 2.3: NVIDIA Tesla C1060 GPU; MKL 10.1r1: Quad-Core Intel Core i7 (Nehalem) 3.2GHz

Page 115: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUFFT 3.2: Improved Radix-3, -5, -7

[Figure: two charts with x axis 1 to 15, "Radix-3 (SP, ECC off)" (y axis 0 to 250) and "Radix-3 (DP, ECC off)" (y axis 0 to 70); series: C2070, 3.2; C2070, 3.1; MKL.]

CUFFT 3.2 & 3.1 on NVIDIA Tesla C2070 GPU; MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)

Radix-5, -7 and mixed radix improvements not shown

Page 116: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDPP

Page 117: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDPP

C Library of high-performance parallel primitives for CUDA

M. Harris (NVIDIA), J. Owens (UCD), S. Sengupta (UCD), A. Davidson (UCD), S. Tzeng (UCD), Y. Zhang (UCD)

Algorithms

cudppScan, cudppSegmentedScan, cudppReduce

cudppSort, cudppRand, cudppSparseMatrixVectorMultiply

Additional algorithms in progress

Graphs, more sorting, trees, hashing, autotuning

Page 118: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CUDPP Example

    CUDPPConfiguration config = { CUDPP_SCAN,
                                  CUDPP_ADD, CUDPP_FLOAT, CUDPP_OPTION_FORWARD };

    CUDPPHandle plan;

    CUDPPResult result = cudppPlan(&plan,
                                   config, numElements,
                                   1, 0);

    cudppScan(plan, d_odata, d_idata, numElements);

Page 119: [Harvard CS264] 04 - Intermediate-level CUDA Programming

More?

Page 120: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Thrust

Page 121: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© 2008 NVIDIA Corporation

Objectives

• Programmer productivity
  • Rapidly develop complex applications
  • Leverage parallel primitives
• Encourage generic programming
  • Don’t reinvent the wheel
  • E.g. one reduction to rule them all
• High performance
  • With minimal programmer effort
• Interoperability
  • Integrates with CUDA C/C++ code

Page 122: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Thrust

• C++ template library for CUDA
  • Mimics Standard Template Library (STL)
• Containers
  • thrust::host_vector<T>
  • thrust::device_vector<T>
• Algorithms
  • thrust::sort()
  • thrust::reduce()
  • thrust::inclusive_scan()
  • Etc.

Page 123: [Harvard CS264] 04 - Intermediate-level CUDA Programming

    #include <cstdlib>
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>

    int main(void)
    {
        // generate 16M random numbers on the host
        thrust::host_vector<int> h_vec(1 << 24);
        thrust::generate(h_vec.begin(), h_vec.end(), rand);

        // transfer data to the device
        thrust::device_vector<int> d_vec = h_vec;

        // sort data on the device
        thrust::sort(d_vec.begin(), d_vec.end());

        // transfer data back to host
        thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

        return 0;
    }

Page 124: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Thrust on Google Code

• Quick Start Guide
• Examples
• Documentation
• Mailing List (thrust-users)

Page 125: [Harvard CS264] 04 - Intermediate-level CUDA Programming

!"#$%$&'($)*+&',-

!"#$%&'($)*+#$&#$,-../+0%'#+#$&#&

12$%+34056$7$58'&&'($+9'6$%&$+:;2<6=$+(>?

• Active community: 200+ members on cusp-users

Page 126: [Harvard CS264] 04 - Intermediate-level CUDA Programming

More?

Page 127: [Harvard CS264] 04 - Intermediate-level CUDA Programming

PyCUDA

Page 128: [Harvard CS264] 04 - Intermediate-level CUDA Programming

© NVIDIA Corporation 2009

PyCUDA

• 3rd party open source, written by Andreas Klöckner

• Exposes all of CUDA via Python bindings

• Compiles CUDA on the fly: presents CUDA as an interpreted language

• Integration with numpy

• Handles memory management, resource allocation

• CUDA programs are Python strings

• Metaprogramming: modify source code on-the-fly, like a really complex pre-processor

• http://mathema.tician.de/software/pycuda

Page 129: [Harvard CS264] 04 - Intermediate-level CUDA Programming


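PyCUDA Example

import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context on import
import numpy
# SourceModule is provided by pycuda.compiler in current PyCUDA releases
from pycuda.compiler import SourceModule

# 4x4 array of single-precision random numbers on the host
a = numpy.random.randn(4,4).astype(numpy.float32)
# allocate device memory (size in bytes) and copy the array to the GPU
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)

# the CUDA kernel is a Python string, compiled at run time
mod = SourceModule("""
__global__ void doublify(float *a)
{
  int idx = threadIdx.x + threadIdx.y*4;
  a[idx] *= 2.0f;
}
""")

func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))  # one block of 4x4 threads

# copy the result back to the host
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)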
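
The kernel's index arithmetic can be checked on the CPU. The sketch below is plain Python that emulates the single 4x4 thread block launched above: each (x, y) pair plays the role of threadIdx, and idx = x + y*4 gives every thread its own element of the flattened 16-element array. The function name and the list standing in for device memory are hypothetical, not PyCUDA API.

```python
def emulate_doublify(a_flat, block=(4, 4, 1)):
    """Sequentially emulate one thread block running the doublify kernel."""
    bx, by, _ = block
    out = list(a_flat)
    touched = []
    for y in range(by):          # plays the role of threadIdx.y
        for x in range(bx):      # plays the role of threadIdx.x
            idx = x + y * 4      # same flattening as in the kernel
            out[idx] *= 2.0
            touched.append(idx)
    return out, touched


a = [float(i) for i in range(16)]
doubled, touched = emulate_doublify(a)
print(doubled)
print(sorted(touched) == list(range(16)))  # every element visited exactly once
```

Because a 4x4 numpy array is stored row by row by default, threadIdx.y selects the row and threadIdx.x the column in this mapping.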

Page 130: [Harvard CS264] 04 - Intermediate-level CUDA Programming

More?

Page 131: [Harvard CS264] 04 - Intermediate-level CUDA Programming

CURAND

Page 132: [Harvard CS264] 04 - Intermediate-level CUDA Programming

RNG Performance: CPU vs. GPU

[Bar chart: "Generating 100K Sobol' Samples". Vertical axis from 0x to 25x; bar groups Uniform (SP, DP) and Normal (SP, DP); legend: CURAND 3.2, MKL 10.2]

CURAND 3.2 on NVIDIA Tesla C2050 GPU; MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)

Page 133: [Harvard CS264] 04 - Intermediate-level CUDA Programming

OpenVIDIA

Page 134: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Open source, supported by NVIDIA

Computer Vision Workbench (CVWB)

http://openvidia.sourceforge.net

GPU imaging & computer vision

Demonstrates most commonly used image processing primitives on CUDA

Demos, code & tutorials/information

OpenVIDIA

Page 135: [Harvard CS264] 04 - Intermediate-level CUDA Programming

and many more...

Page 136: [Harvard CS264] 04 - Intermediate-level CUDA Programming

References

• CUDA C Programming Guide 

• CUDA C Best Practices Guide 

• CUDA Reference Manual 

• API Reference, PTX ISA 2.2 

• CUDA-GDB User Manual 

• Visual Profiler Manual  

• User Guides: CUBLAS, CUFFT, CUSPARSE, CURAND

http://developer.nvidia.com/object/gpucomputing.html

Page 137: [Harvard CS264] 04 - Intermediate-level CUDA Programming

iPhD: one more thing, or two...

Page 138: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Life/Code Hacking #2.x: Speed {listen,read,writ}ing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Page 139: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Life/Code Hacking #2.1: Speed listening


Page 140: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Life/Code Hacking #2.1: Speed listening


• Step 1: Collect

• online videos, tutorials, podcasts, etc.

• audiobooks

• youtube-dl, get_flash_videos, jDownloader, ffmpeg, mplayer, etc.

• etc.

Page 141: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Life/Code Hacking #2.1: Speed listening


• Step 2: Accelerate (time-stretch)

• VLC (Playback > Faster)

• sox $f{,.1.8X.mp3} tempo 1.8 50

• iPod ? mp3splt -t 5.00 -o small-@n large.mp3
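
The two commands above can be wrapped in a short script. This Python sketch only builds the argument lists and does not run anything. The helper names are made up for illustration, and the sox and mp3splt arguments are copied from the slide rather than checked against a particular version of either tool. Note that the shell brace expansion `$f{,.1.8X.mp3}` yields two words, `$f` and `$f.1.8X.mp3`: the input file followed by the output file.

```python
def sox_tempo_command(path, factor=1.8, segment=50):
    """Build the sox call from the slide: sox IN OUT tempo FACTOR SEGMENT."""
    out_path = "{}.{}X.mp3".format(path, factor)  # e.g. talk.mp3.1.8X.mp3
    return ["sox", path, out_path, "tempo", str(factor), str(segment)]


def mp3splt_command(path, length="5.00", pattern="small-@n"):
    """Build the mp3splt call from the slide: split into fixed-length pieces."""
    return ["mp3splt", "-t", length, "-o", pattern, path]


def listening_minutes(original_minutes, factor=1.8):
    """Playback time after speeding up by the given tempo factor."""
    return original_minutes / factor


print(sox_tempo_command("talk.mp3"))
print(mp3splt_command("large.mp3"))
print(round(listening_minutes(90), 1))  # a 90-minute lecture at 1.8x
```

The lists can be handed to `subprocess.run` once the tools are installed. Keeping the command construction separate makes it easy to try other tempo factors.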

Page 142: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Life/Code Hacking #2.1: Speed listening


• Step 3: chill or do more ;-)

Page 143: [Harvard CS264] 04 - Intermediate-level CUDA Programming

Demo

Page 144: [Harvard CS264] 04 - Intermediate-level CUDA Programming

COME