Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer...

28
Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu
  • date post

    20-Jan-2016
  • Category

    Documents

  • view

    215
  • download

    0

Transcript of Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer...

Page 1: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

Name: Kaiyong ZhaoSupervisor: Dr. X. -W Chu

Page 2: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 3: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 4: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 5: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

0

20

40

60

80

100

120

2003 2004 2005 2006 2007M

em

ory

ba

nd

wid

th (

GB

/s)

GPU

CPUG80 Ultra

G80

G71

NV40

NV30 Hapertown

W oodcrestPrescott EENorthwood

0

20

40

60

80

100

120

2003 2004 2005 2006 2007M

em

ory

ba

nd

wid

th (

GB

/s)

GPU

CPUG80 Ultra

G80

G71

NV40

NV30 Hapertown

W oodcrestPrescott EENorthwood

•Computing Capability

•Memory Bandwidth

Page 6: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

L2

FB

SP SP

L1

TF

Th

rea

d P

roc

es

so

r

Vtx Thread Issue

Setup / Rstr / ZCull

Geom Thread Issue Pixel Thread Issue

Input Assembler

Host

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

L2

FB

L2

FB

L2

FB

L2

FB

L2

FB

Streaming Multiprocessor (SM)

Streaming Processor (SP)

Page 7: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

CUDA: CPU + GPU CParallel Computing modal

Single instruction Multiple Thread (SIMT)All threads run the same function(1000s threads on the fly)Each core deal with different data

Hidden the IO by multiple-threads(more than 1000s threads)Speed up Computing / IO Translation Coalesce the IO one time When half warp thread access neighboring data1 cycle@GPU vs. ~1000 cycles@CPU

Page 8: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 9: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 10: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 11: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 12: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 13: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 14: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

C = vectorA * Matrix B % prime

Page 15: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

There is no cache for global memory on G80/G200

Constant memory & texture memory have little cache

IO latency400-600 clock cycles

This is the bottle neckKey to Optimization!

(Device) Grid

ConstantMemory

TextureMemory

GlobalMemory

Block (0, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Block (1, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Host

(Device) Grid

ConstantMemory

TextureMemory

GlobalMemory

Block (0, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Block (1, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Host

Page 16: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 17: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 18: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 19: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

Global memory access by threads in a half-warp can be coalesced

When the words accessed by all threads lie in the same segment of size equal to:

32 bytes if all threads access 8-bit words64 bytes if all threads access 16-bit words 128 bytes if all threads access 32-bit or 64-bit words

Any pattern of addresses requested by the half-warp

Including patterns where multiple threads access the same address

Page 20: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

Address 0

Thread 0

Address 4

Address …

Address 116

Address 120

Address 124

Address 128

Address …

Address 172

Address 176

Address 180

Address 184

Address 188

Address 252

Thread 1

Thread 2

Thread 3

Thread …

Thread 14

Thread 15

Segment 0 (128B) Segment 1 (128B)

Reduced to 32B Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data.

Page 21: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

C = vectorA * Matrix B % prime

Page 22: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 23: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.
Page 24: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread)GPU: XFX GTX280, 1.24 GHz

Page 25: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

C = vectorA * Matrix B % prime

Page 26: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread)GPU: XFX GTX280, 1.24 GHz

Page 27: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

Summary

Page 28: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.