Thread Block Compaction for Efficient SIMT Control Flow
Wilson W. L. Fung, Tor M. Aamodt
University of British Columbia, HPCA-17, Feb 14, 2011
Wilson Fung, Tor Aamodt Thread Block Compaction 2
Graphics Processor Unit (GPU)
- Commodity many-core accelerator; SIMD HW: compute BW + efficiency
- Non-graphics APIs: CUDA, OpenCL, DirectCompute
- Programming model: hierarchy of scalar threads (grid, thread blocks, threads); SIMD-ness not exposed at the ISA
- Scalar threads grouped into warps, run in lockstep (Single-Instruction, Multiple-Thread)
SIMT Execution Model

A: K = A[tid.x];
B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;

Branch divergence is handled with a per-warp reconvergence stack (RPC = reconvergence PC):

PC  RPC  Active Mask
E   --   1111
D   E    0011
C   E    1100

Execution over time for one warp (threads 1-4): A 1 2 3 4; B 1 2 3 4; C 1 2 -- --; D -- -- 3 4; E 1 2 3 4. With A[] = {4,8,12,16}, that is 50% SIMD efficiency through the divergent region! In some cases SIMD efficiency drops to 20%.
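The stack behavior above can be sketched in software. This is an illustrative model invented here for clarity, not the hardware: `run_warp`, the list-based stack, and the trace format are all assumptions.

```python
# Illustrative software model (not the hardware) of a per-warp
# reconvergence stack for the running example. On divergence, the warp
# pushes a reconvergence entry (PC = E, full mask) and one entry per
# branch side; popping entries serializes C and D, then restores the
# full mask at E.

def run_warp(K):
    """K: per-thread operand values (one per SIMD lane)."""
    n = len(K)
    full = [True] * n
    trace = [("A", full), ("B", full)]           # (label, active mask)
    taken = [k > 10 for k in K]                  # outcome of B's branch
    if all(taken) or not any(taken):
        trace.append(("C" if all(taken) else "D", full))  # no divergence
    else:
        stack = [("E", full),                    # reconvergence point
                 ("D", [not t for t in taken]),  # else path
                 ("C", taken)]                   # taken path
        while stack:                             # execute top-of-stack
            trace.append(stack.pop())
    return trace

trace = run_warp([4, 8, 12, 16])
# During C and D only 2 of 4 lanes are active: 50% SIMD efficiency.
```

Running it with the slide's A[] values reproduces the A, B, C, D, E sequence with half-empty masks at C and D.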
Dynamic Warp Formation (W. Fung et al., MICRO 2007)

Three static warps run A and B, then diverge:
Warp 0: A 1 2 3 4; B 1 2 3 4; C 1 2 -- --; D -- -- 3 4; E 1 2 3 4
Warp 1: A 5 6 7 8; B 5 6 7 8; C 5 -- 7 8; D -- 6 -- --; E 5 6 7 8
Warp 2: A 9 10 11 12; B 9 10 11 12; C -- -- 11 12; D 9 10 -- --; E 9 10 11 12

During reissue/memory latency, DWF packs threads at the same PC into dynamic warps, e.g. C 1 2 7 8 and C 5 -- 11 12, raising SIMD efficiency to 88%.
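DWF's regrouping can be sketched as follows. `pack` is a hypothetical helper, not the MICRO 2007 hardware, but it captures the key lane constraint: a thread's registers physically live in its home lane's register-file bank, so packing may only merge threads into free slots of the same lane.

```python
# Sketch (assumed helper, not the MICRO 2007 hardware) of DWF's
# lane-aware packing: threads waiting at the same PC merge into dynamic
# warps, but each thread must keep its home SIMD lane.

def pack(tids, width=4):
    """tids: 1-based thread IDs waiting at one PC; lane = (tid-1) % width.
    Returns dynamic warps as per-lane slot lists (None = idle lane)."""
    warps = []
    for tid in tids:
        lane = (tid - 1) % width
        for w in warps:                 # first warp with that lane free
            if w[lane] is None:
                w[lane] = tid
                break
        else:                           # no free slot: open a new warp
            w = [None] * width
            w[lane] = tid
            warps.append(w)
    return warps

# Threads at C in the slide's example (from warps 0, 1, and 2):
dynamic = pack([1, 2, 5, 7, 8, 11, 12])   # C 1 2 7 8 and C 5 -- 11 12
```

Because threads 1 and 5 both sit in lane 0, they cannot share a dynamic warp, which is why the second packed warp keeps an idle slot.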
This Paper
- Identified DWF pathologies (extreme case: 5X slowdown):
  - Greedy warp scheduling causes starvation: lower SIMD efficiency
  - Breaks up coalesced memory accesses: lower memory efficiency
  - Some CUDA apps require lockstep execution of static warps; DWF breaks them!
- Additional HW to fix these issues? Simpler solution: Thread Block Compaction
Thread Block Compaction
- Block-wide reconvergence stack: regroup threads within a block
- Better reconvergence stack: likely convergence (converge before the immediate post-dominator)
- Robust: avg. 22% speedup on divergent CUDA apps, no penalty on others

Per-warp stacks (PC, RPC, active mask):
Warp 0: E -- 1111; D E 0011; C E 1100
Warp 1: E -- 1111; D E 0100; C E 1011
Warp 2: E -- 1111; D E 1100; C E 0011

Merged block-wide stack (Thread Block 0):
PC  RPC  Active Mask
E   --   1111 1111 1111
D   E    0011 0100 1100
C   E    1100 1011 0011

Each entry's block-wide mask compacts into dynamic warps: C into Warps X and Y, D into Warps U and T, E back into Warps 0, 1, 2.
Outline
- Introduction
- GPU Microarchitecture
- DWF Pathologies
- Thread Block Compaction
- Likely-Convergence
- Experimental Results
- Conclusion
GPU Microarchitecture
- SIMT cores connect through an interconnection network to several memory partitions; each partition holds a last-level cache bank and an off-chip DRAM channel
- Each SIMT core: a SIMT front end (fetch, decode, schedule, branch; warps signal "done" by warp ID) driving a SIMD datapath
- Per-core memory subsystem: shared memory, L1 D$, texture $, constant $, interconnect port
- More details in paper
DWF Pathologies: Starvation

Majority scheduling (best performing in prior work): prioritize the largest group of threads with the same PC.

B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;

The majority C threads run ahead (C 1 2 7 8; C 5 -- 11 12) and continue into E (E 1 2 7 8; E 5 -- 11 12), while the minority D threads (D 9 6 3 4; D -- 10 -- --) starve for 1000s of cycles and then trickle through E alone (E 9 6 3 4; E -- 10 -- --): LOWER SIMD efficiency!

Other warp scheduler? Tricky: variable memory latency.
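The starvation effect can be shown with a toy scheduler. `run` and the PC successor map below are invented for illustration; real schedulers also contend with variable memory latency, which is why the slide calls alternatives tricky.

```python
# Toy model (invented for illustration) of majority scheduling: always
# issue the group of threads at the most popular PC. The minority D
# threads are not issued until the majority has drained through C and E,
# and E then executes twice at partial occupancy instead of once full.

def run(groups, succ):
    """groups: PC -> waiting thread count; succ: PC -> next PC (None = exit).
    Returns the issue order as (PC, thread count) pairs."""
    order = []
    while groups:
        pc = max(groups, key=lambda p: groups[p])   # majority first
        count = groups.pop(pc)
        order.append((pc, count))
        if succ[pc] is not None:                    # threads move on
            groups[succ[pc]] = groups.get(succ[pc], 0) + count
    return order

# 8 threads took path C, 4 took path D; both rejoin at E.
order = run({"C": 8, "D": 4}, {"C": "E", "D": "E", "E": None})
```

The issue order comes out C(8), E(8), D(4), E(4): the D group waits behind everything the majority does.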
DWF Pathologies: Extra Uncoalesced Accesses

Coalesced memory access = memory SIMD; a 1st-order CUDA programmer optimization. Not preserved by DWF.

E: B = C[tid.x] + K;

No DWF: static warps E 1 2 3 4, E 5 6 7 8, E 9 10 11 12 each touch one memory segment (0x100, 0x140, 0x180): #Acc = 3.
With DWF: dynamic warps E 1 2 7 12, E 9 6 3 8, E 5 10 11 4 each straddle all three segments: #Acc = 9.
The L1 cache absorbs the redundant memory traffic, but at the cost of L1$ port conflicts.
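The transaction counts can be reproduced with a small model. The element size (16 B), segment size (64 B), and base address 0x100 are assumptions chosen to match the slide's picture, not real hardware parameters.

```python
# Counting memory transactions per warp: one transaction per distinct
# memory segment touched. Element size, segment size, and base address
# are assumptions picked to reproduce the slide's figure.

SEG = 64          # assumed segment size (coalescing granularity)

def transactions(warps, elem=16, base=0x100):
    total = 0
    for warp in warps:
        segments = {(base + elem * (tid - 1)) // SEG for tid in warp}
        total += len(segments)        # one access per distinct segment
    return total

static_warps  = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
dynamic_warps = [[1, 2, 7, 12], [9, 6, 3, 8], [5, 10, 11, 4]]   # after DWF

acc_static = transactions(static_warps)    # 3: one segment per warp
acc_dwf    = transactions(dynamic_warps)   # 9: three segments per warp
```

The same access stream triples its transaction count once threads are shuffled out of their static warps.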
DWF Pathologies: Implicit Warp Sync.

Some CUDA applications depend on the lockstep execution of "static warps", e.g. a task queue in ray tracing. Threads 0...31, 32...63, 64...95 form static warps 0, 1, 2:

int wid = tid.x / 32;
if (tid.x % 32 == 0) {
    sharedTaskID[wid] = atomicAdd(g_TaskID, 32);
}
my_TaskID = sharedTaskID[wid] + tid.x % 32;
ProcessTask(my_TaskID);

Correctness relies on implicit warp sync: every thread of a static warp reads sharedTaskID[wid] only after lane 0 has written it. DWF's dynamic warps break this assumption.
Observation

Compute kernels usually contain divergent and non-divergent (coherent) code segments. Coalesced memory accesses usually sit in the coherent code segments, so DWF provides no benefit there.

Execution alternates: coherent segment (static warps, coalesced LD/ST); divergence (dynamic warps); reconvergence point (reset to static warps); coherent segment (coalesced LD/ST again).
Thread Block Compaction

Run a thread block like a warp:
- The whole block moves between coherent and divergent code
- A block-wide stack tracks execution paths and reconvergence

Barrier at each branch/reconvergence point: all available threads arrive at the branch, making compaction insensitive to warp scheduling (fixes starvation).

Warp compaction: regroup with all available threads; if there is no divergence, this reproduces the static warp arrangement (fixes implicit warp sync and extra uncoalesced memory accesses).
Thread Block Compaction: Example

A: K = A[tid.x];
B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;

Block-wide stack:
PC  RPC  Active Threads
A   --   1 2 3 4 5 6 7 8 9 10 11 12
D   E    -- -- 3 4 -- 6 -- -- 9 10 -- --
C   E    1 2 -- -- 5 -- 7 8 -- -- 11 12
E   --   1 2 3 4 5 6 7 8 9 10 11 12

Execution over time: static warps run A (A 1 2 3 4; A 5 6 7 8; A 9 10 11 12). Without compaction, the divergent paths would issue half-empty warps (C 1 2 -- --; C 5 -- 7 8; C -- -- 11 12; D -- -- 3 4; D -- 6 -- --; D 9 10 -- --). With TBC, each path's block-wide mask is compacted: C 1 2 7 8; C 5 -- 11 12; D 9 6 3 4; D -- 10 -- --. At E the block reconverges into full warps (E 1 2 3 4; E 5 6 7 8; E 9 10 11 12).
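Putting the pieces together, the block-wide stack plus compaction can be modeled in a few lines. `lane_pack` and the choice of which threads take each path are taken from the slide's masks; the code itself is an invented illustration, and its point is that compaction now sees all 12 threads' masks at once.

```python
# Illustrative model (not the RTL) of TBC on the running example: one
# block-wide stack entry per path, and lane-aware packing applied to the
# whole block's active mask rather than to one warp at a time.

def lane_pack(tids, width=4):
    """Pack 1-based thread IDs into dynamic warps; each thread keeps
    its home lane (tid - 1) % width."""
    warps = []
    for tid in tids:
        lane = (tid - 1) % width
        for w in warps:                 # first warp with that lane free
            if w[lane] is None:
                w[lane] = tid
                break
        else:
            w = [None] * width
            w[lane] = tid
            warps.append(w)
    return warps

c_set = {1, 2, 5, 7, 8, 11, 12}            # threads with K > 10 (slide)
block = list(range(1, 13))
c_tids = [t for t in block if t in c_set]
d_tids = [t for t in block if t not in c_set]

# Block-wide stack after the branch (top of stack last):
stack = [("E", block), ("D", d_tids), ("C", c_tids)]

c_warps = lane_pack(c_tids)   # C 1 2 7 8 and C 5 -- 11 12
d_warps = lane_pack(d_tids)   # D 9 6 3 4 and D -- 10 -- --
e_warps = lane_pack(block)    # full static warps at reconvergence
```

Because the full block's mask is packed at once, the E entry comes out as the original static warps, which is exactly the "reset warps" behavior the coherent code needs.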
Thread Block Compaction
- Barrier every basic block?! (Idle pipeline)
- Solution: switch to warps from other thread blocks
  - Multiple thread blocks run on a core
  - Already done in most CUDA applications
Over time, blocks 0, 1, and 2 interleave execution, so each block's branch/warp-compaction stalls are hidden by the other blocks' execution.
Microarchitecture Modification
- Per-warp stack becomes a block-wide stack
- I-buffer + TIDs become a warp buffer: stores the dynamic warps
- New unit: thread compactor, which translates active masks into compact dynamic warps
- More detail in paper

Pipeline: Fetch, I-Cache, Decode, Warp Buffer (+ scoreboard), Issue, RegFile, ALUs / MEM. The block-wide stack (valid[1:N], branch target PC, active mask, predicate) feeds the thread compactor, which fills the warp buffer; "Done (WID)" signals update the stack.
Likely-Convergence

Immediate post-dominator reconvergence is conservative: all paths from a divergent branch must merge there, but convergence can happen earlier, when any two of the paths merge. We extend the reconvergence stack to exploit this; with TBC it gives a 30% speedup for ray tracing.

   while (i < K) {
       X = data[i];
A:     if ( X == 0 )
B:         result[i] = Y;
C:     else if ( X == 1 )
D:         break;            // rarely taken
E:     i++;
   }
F: return result[i];

Because of the rarely-taken break at D, the iPDom of A is F; but the B and C paths already merge at E. Details in paper.
Evaluation
- Simulation: GPGPU-Sim (2.2.1b), modeling ~Quadro FX5800 + L1 & L2 caches
- 21 benchmarks:
  - All of GPGPU-Sim's original benchmarks
  - Rodinia benchmarks
  - Other important applications: Face Detection from Visbench (UIUC), DNA sequencing (MUMMER-GPU++), molecular dynamics simulation (NAMD), ray tracing from NVIDIA Research
Experimental Results

Two benchmark groups: COHE = non-divergent CUDA applications; DIVG = divergent CUDA applications.

[Chart: IPC relative to the per-warp-stack baseline (0.6-1.3) for TBC and DWF on COHE and DIVG]
- DWF: serious slowdown from the pathologies
- TBC: no penalty on COHE, 22% speedup on DIVG
Conclusion
- Thread Block Compaction addresses some key challenges of DWF; one significant step closer to reality
- Benefits from advancements on the reconvergence stack: likely-convergence points
- Extensible: integrate other stack-based proposals
Thank You!
Thread Compactor

Converts an active mask from the block-wide stack into thread IDs in the warp buffer, using an array of priority encoders (one P-Enc per lane).

Stack entry: C  E  1 2 -- -- | 5 -- 7 8 | -- -- 11 12
Per-lane candidate columns feeding the P-Encs:
1   2   --  --
5   --  7   8
--  --  11  12
Compacted dynamic warps written to the warp buffer: C 1 2 7 8; C 5 -- 11 12
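The priority-encoder array can be modeled behaviorally. `compact` is an invented name for this sketch; each lane's encoder picks at most one candidate thread ID per cycle, so a stack entry drains in as many cycles as its most contended lane.

```python
# Behavioral sketch (not gate-level) of the thread compactor: one
# priority encoder per lane selects at most one thread ID per cycle
# from that lane's column of the block-wide active mask, emitting one
# compacted dynamic warp per cycle into the warp buffer.

def compact(active, width=4):
    """active: per-warp lane slots from a stack entry, e.g.
    [[1, 2, None, None], ...]. Returns the emitted dynamic warps."""
    # Each lane's column of candidate thread IDs across the block.
    cols = [[w[lane] for w in active if w[lane] is not None]
            for lane in range(width)]
    out = []
    while any(cols):                       # one emitted warp per cycle
        out.append([col.pop(0) if col else None for col in cols])
    return out

# Stack entry C E with masks 1 2 -- -- | 5 -- 7 8 | -- -- 11 12:
mask_C = [[1, 2, None, None], [5, None, 7, 8], [None, None, 11, 12]]
warp_buffer = compact(mask_C)   # C 1 2 7 8, then C 5 -- 11 12
```

Lane 0 has two candidates (threads 1 and 5), so this entry takes two cycles and emits the two compacted warps from the slide.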
Effect on Memory Traffic

TBC does still generate some extra uncoalesced memory accesses: the compacted warps C 1 2 7 8 and C 5 -- 11 12 touch segments 0x100, 0x140, 0x180 with #Acc = 4, but the 2nd access to a segment hits in the L1 cache, so there is no change to overall memory traffic in/out of a core.
[Chart: memory stalls normalized to baseline (0%-300%) for TBC, DWF, and baseline]
[Chart: memory traffic normalized to baseline (0.6-1.2) for TBC-AGE and TBC-RRB, across DIVG benchmarks (BFS2, FCDT, HOTSP, LPS, MUM, MUMpp, NAMD, NVRT) and COHE benchmarks (AES, BACKP, CP, DG, HRTWL, LIB, LKYT, MGST, NNC, RAY, STMCL, STO, WP); worst case 2.67x]
Likely-Convergence (2)
- NVIDIA uses a break instruction for loop exits; that handles the last example
- Our solution: likely-convergence points
- This paper: only used to capture loop breaks

Stack extended with a likely-convergence PC (LPC) and the stack position of its entry (LPos):
PC  RPC  LPC  LPos  Active Threads
F   --   --   --    1 2 3 4
E   F    --   --    --
B   F    E    1     1
C   F    E    1     2 3 4
As execution proceeds, entries reaching the likely-convergence point E merge into the entry at LPos (e.g. E F E 1: thread 2; D F E 1: threads 3 4; then E F -- --: threads 1 2): Convergence!
Likely-Convergence (3)
- Applies to both the per-warp stack (PDOM) and thread block compaction (TBC)
- Enables more thread grouping for TBC
- Side effect: reduces stack usage in some cases