CS6068 Parallel Computing - Fall 2015 Lecture Week 5 - Sept 28 Topics: Latency, Layout Design,...

CS6068 Parallel Computing - Fall 2015

Lecture Week 5 - Sept 28

Topics:Latency, Layout Design, OccupancyChallenging Parallel Design Patterns

Sparse Matrix OperationsSieves, N-body problems

Breadth-First Graph Traversals

Latency versus Bandwidth- Latency is time to complete task – eg load

memory and do FLOP- Bandwidth is throughput or number of tasks

per unit time – eg peak global to shared memory load rate, or graphic frames per sec

- GPUs optimize for Bandwidth – employ techniques to hide latency

- Reasons Latency lags Bandwidth over time-Moore’s Law favors Bandwidth over Latency-Smaller, faster transistors but communicate over longer wires, which limits Latency-Increasing Bandwidth with queues hurts Latency

2

Techniques for Performance Hiding Latency and Little’s Law-Number useful Bytes delivered = Average Latency x Bandwidth-Tiling for coalesced reads and utilization of shared memory-Synchthreads barriers and block density-Occupancy – latency hiding achieved by maximize the number of active blocks per SM

3

4

Considerations For Sizing Tiles:

Need to make sure that:

1)Tiles are large enough to get speedups by lowering average latency2)Tiles are small enough so we can fit into the shared/register memory limits and 3)Use many tiles so that groups of thread blocks running on SM fill memory bus

Sizing TilesHow Big Should We Make a Tile?1. If we choose a larger tile, does that mean more or

less total main memory bandwidth?2. If choose larger tile, does that mean easier or harder

to make use of shared memory? 3. If choose larger tile, does that mean we wait longer

or shorter to synchronize?4. If choose larger tile, does that mean we can more

easily hide latency better?

5

Warps and StallsA grid is composed of blocks which are

completely independent.A block is composed of threads which can

communicate within their own blockInstructions are issued within an SM on a per

warp basis -- and generally 32 threads form a warp

If an operand is not ready the warp will stallContext switch between warps when stalled,

and for performance context switch must be very fast

6

Fast Context SwitchingRegisters and shared memory are allocated

for an entire block, as long as that block is active

Once a block is active it will stay active until all threads in that block have completed

Context switching is very fast because registers and shared memory do not need to be saved and restored

Goal: Have enough transactions in flight to saturate the memory bus

Latency is better hidden when there are more transactions in flight. 7

Occupancy Occupancy is an easy, if somewhat imprecise, measure of how well we have saturated the memory pipeline for latency hiding.

In a Cuda Program we measure occupancy as follows: Ratio of #Active Warps to #Maximum Active Warps

Resources are allocated for the entire block and are thus potential occupancy limiters: Register usage, Shared memory usage, Block size

8

Occupancy ExamplesFermi Streaming Multiprocessor SM specs:1536 threads (Max Threads) = 48 active warps of 32 threads per warp

Example 1: Suppose SM has 48K shared memoryProgram Kernel uses 32 bytes of shared memory per thread48K/32 = 1536, so we can actively schedule 1536 threads or 48

warpsMax Occupancy limit = 1

Example 2: Suppose SM has 16K shared memoryProgram Kernel uses 32 bytes of shared memory per thread16K/32 = 512, so we can actively schedule only 512 threads per

block or 16 warpsMax occupancy limit =.3333 Cuda Toolkit includes:

occupancy_calculator.xls 9

Bandwidth versus Latency Dense versus Space

Matrices

10

A Classic Problem: Dense N-Body Compute an NxN matrix of Force

vectorsN objects – each with own parameters, 3D-location, mass, charge, etc.

Goal: Compute N^2 Source/Dest ForcesSolution 1: partition threads using PxP tiles

Q: how to minimize average latency ratio of global versus shared memory fetches

Solution 2: partition space using P threads using spacial locality

Privatization principle: avoid write conflictsImpact of Partition on Bandwidth, Shared

Memory, and Load Imbalance ??? 11

Sparse Matrix Dense Vector Products

• A format for sparse matrices using 3 dense vectors:– Compressed Sparse Row CSR format

• Value = nonzero data one array• Column= identifies column for each

value• Rowptr= pointers to location of each

1st data in each row

Segmented Scan• Sparse Matrix multiplication requires

modified scan, called a segmented scan in which there is special symbol indicating segments within array, and scan are done per segment

• Apply scan operation to segments independently

• Reduces to ordinary scan where we treat segment symbol as an annihilation value

• Work an example using both inclusive and exclusive scans.

Dealing with Sparse Matrices

•Apply dense format CSR•Apply segmented Scan Operations•Granularity considerations

– Thread per Row vs Thread per Element

•Load imbalance•Hybrid Approach?

16

17

Sieve of Eratosthenes: Sequential Algorithm

2 3 4 5 6 7 8 9 10 11 12 13 14 15 1617 18 19 20 21 22 23 24 25 26 27 28 29 30 3132 33 34 35 36 37 38 39 40 41 42 43 44 45 4647 48 49 50 51 52 53 54 55 56 57 58 59 60 61

2 4 6 8 10 12 14 1618 20 22 24 26 28 30

32 34 36 38 40 42 44 4648 50 52 54 56 58 60

3 9 1521 27

33 39 4551 57

525

3555

7

49

A Classic Iterative Refinement:Sieve of Eratosthenes

18

Sieve of E - Python codedef findSmallest(primes, p): #helper function for i in range(p, len(primes)): #finds smallest prime after p if primes[i] == 1: return i

def sieve(n): #return num primes less than n primes = numpy.ones((n+1,), dtype=numpy.int) primes[0]=primes[1]=0 k=2 while (k*k <= n): mult= k*k while mult <= n: primes[mult] = 0 mult = mult + k k = findSmallest(primes,k+1) return sum(primes)

>>>import time, numpyfor x in range(15,27): st=time.clock(); s=sieve(2**x); fin=time.clock() print x, s, fin-st

x sum time15 3512 0.0416 6542 0.117 12251 0.1718 23000 0.3519 43390 0.7120 82025 1.4421 155611 2.9422 295947 5.8923 564163 11.9324 1077871 24.6725 2063689 48.7326 3957809 100.04

20

Work and Step Complexity: W(n)= O(n log log n)S(n) = O(#primes less than √n) = O(√n/log n)

Proof sketch: • Allocate array O(n)• Sieving with each prime p takes work = n/p• To find next smallest prime p takes work O(p)

< n/p• Prime Number Theorem says if a random

integer is selected in range(n), the probability that the selected integer is prime is ~ 1 / log n

• Thus xth smallest prime is ~ x log x. So total work is bounded by O(n) times sum of the reciprocals of first √n/logn primes.

• So W(n) is O(n log log n)Since from Calculus I:

Complexity of Sieve of Eratosthenes

http://en.wikipedia.org/wiki/Randomness

http://en.wikipedia.org/wiki/Probability

21

Designing an Optimized Parallel Solution for

SieveImportant Issues to consider:-What should a thread do? Focus on divisor or quotient?-Is tiling possible? Where should a thread start? What are dependencies?-Can thread blocks make use of coalesced DRAM memory access and shared memory?-Is compaction possible?

Structure of Data Impacts Design

Parallel Graph Traversal Examples: WWW, Facebook, TorApplication: Visit every node once

Breadth-First Traversal:Visit nodes level-by-level, synchronously

Variety of Graphs : Small Depth, Large Depth,

Small-world Depth22

Design of BFSGoal: Compute hop distance of every

node from given source node.1st try:Thread per Edge MethodWork Complexity =Step Complexity =Control IterationsRace ConditionsFinishing ConditionsNext Time = Make this more efficient

23

CS6068 Parallel Computing - Fall 2015 Lecture Week 5 - Sept 28 Topics: Latency, Layout Design,...

Documents

Transcript of CS6068 Parallel Computing - Fall 2015 Lecture Week 5 - Sept 28 Topics: Latency, Layout Design,...