Exploiting Intra Warp Parallelism for GPGPU


Exploiting Control Divergence to Improve Parallelism in GPGPU

Department of Electronics Engineering, IIT (BHU) Varanasi

1. WHY GPU FOR GENERAL PURPOSE COMPUTING

1.1 INTRODUCTION

GPGPU – General Purpose Computation on Graphics Processing Unit

In GPGPU computing we use the GPU for something it was not originally made for. The GPU was built for graphics processing, but here it performs general purpose computations that were traditionally handled by the central processing unit. Let us see why the GPU has entered the general computational field.

How to dig holes faster?

1. Dig at a faster rate: two shovels per second instead of one. There is always a limit. This is equivalent to having a faster clock, i.e., increasing the clock frequency so each computation takes a shorter amount of time. The ultimate limit on the clock is the speed of light, and clock periods are already in the nanosecond range, so there is very little left to gain here.

2. Have a more productive shovel: more blades per shovel. But this produces diminishing returns as the number of blades increases. This corresponds to exploiting the parallelism (only the ILP) already present in the instruction stream, i.e., doing more work per clock cycle.

3. Hire more diggers => parallel computing. Instead of having one fast digger with an awesome shovel, what if we have many diggers with many shovels, i.e., many smaller and simpler processors? This is exactly the case for the GPU.

1.2 CPU SPEED REMAINING FLAT

The feature size is decreasing every year (Moore's law) => transistors become smaller, faster, consume less power, and more fit on a chip. As feature sizes diminished, designers ran processors faster and faster by increasing the clock speed. For many years the clock speed went up; however, over the last decade the clock speed has remained constant. Why are we not able to increase the clock speed further at the previous rate? Transistor sizes keep decreasing, but we are unable to increase the speed: running a billion transistors generates a lot of heat, and we cannot keep the processors cool. What matters today is power.


Let me put forward an example, or a historical background, for this: the Intel Pentium 4 generation was designed for a high operating clock frequency of around 10 GHz, but researchers could operate it at such frequencies only for about 30 seconds to a minute because of the vast power dissipation, so it was restricted to around 3 to 4 GHz.

1.3 CONSTRAINTS WITH ILP

There is always a limit to the number of blades in the shovel (the pipeline stages here), because of the limited amount of parallelism that can be extracted by exploiting ILP. Take the case of a pipeline with 10 stages. With a conditional-branch frequency of about 16% [1], there are on average 10 × 0.16 ≈ 1.6 branch instructions in the pipeline at any time, i.e., usually more than a single branch in flight, which always limits the work that can be performed.

Coming to the historical background, we can put forth the example of the Intel Pentium 4 vs. the Itanium. The Pentium 4 was the most aggressively pipelined processor ever built by Intel. It used a pipeline with over 20 stages, had seven functional units, and cached micro-ops rather than x86 instructions. Its relatively inferior performance, given the aggressive implementation, was a clear indication that the attempt to exploit more ILP (there could easily be 50 instructions in flight) had failed. The Pentium 4's power consumption was similar to that of the i7, although its transistor count was lower, as its primary caches were half as large as the i7's, and it included only a 2 MB secondary cache with no tertiary cache.

Figure 1.1: Technology Scaling vs. Feature Size [10]
Figure 1.2: Clock Frequency vs. Year [10]


The Intel Itanium is a VLIW-style architecture which, despite the potential decrease in complexity compared to dynamically scheduled superscalars, never attained clock rates competitive with the mainline x86 processors (although it appears to achieve an overall CPI similar to that of the i7).

Figure 1.3: Three different Intel processors vary widely. Although the Itanium processor has two cores and the i7 four, only one core is used in the benchmarks. [1]

The main limitation in exploiting ILP often turned out to be the memory system. The result was

that these designs never came close to achieving the peak instruction throughput despite the large

transistor counts and extremely sophisticated and clever techniques.

1.4 HOW COMPUTATIONAL POWER CAN BE IMPROVED

We can solve large problems by breaking them into small pieces, expressing those pieces as kernels, and launching them at the same time on many simple processors.

Modern GPUs have:
Thousands of ALUs
Hundreds of processors
Tens of thousands of concurrent threads

"If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?"
– Seymour Cray, father of the supercomputer

What kind of processors will we build? Or, why are CPUs not energy efficient?

CPU: Complex Control Hardware

↑Flexibility and Performance

↓Expensive in terms of Power

GPU: Simpler Control Hardware

↑More Hardware for computation

↑ Potentially more Power efficient (Ops/Watt) as shown in figure 1.4

↓ More restrictive programming model


Build a (power efficient) high performance processor by optimizing for one of two goals:
– Minimizing latency (time per task, measured in seconds): the CPU approach
– Maximizing throughput (work per unit time, e.g. jobs/hour or pixels/sec): the GPU approach

Figure 1.4: Intel Core i7-960, NVIDIA GTX 280, and GTX 480 specification [1]

In Figure 1.4, the rightmost columns show the ratios of the GTX 280 and GTX 480 to the Core i7. For SP SIMD FLOPS on the GTX 280, the higher figure (933) comes from a very rare case of dual issuing of fused multiply-add and multiply; a more reasonable figure is 622 for single fused multiply-adds.

1.5 MODERN COMPUTING ERA

While the two previous techniques have reached their limits, this is not yet the case with the graphics processing unit. Moore's law states that the density of transistors at a given die size doubles every 12 months; this has since slowed to every 18 months or, loosely translated, the performance of processors doubles every 18 months. The growth in the number of transistors that can be accumulated on a single chip has now almost stalled, as die sizes have reached their limits and the power dissipation of transistors also constrains how many can be fabricated on a single die.


Within the last 8 to 10 years, GPU manufacturers have found that they have far exceeded the pace of Moore's law, doubling the transistor count on a given die size every 6 months. Essentially that is performance on the scale of Moore's law cubed. With this rate of increase and performance potential, one can see why a more in-depth look at the GPU for uses other than graphics is well warranted [9]. This is also reflected in the following figure, which shows that we are still in the early stages of GPU development, with drastic improvements possible in the forthcoming years.

Figure 1.5: New era of processor performance [12]

1.6 A BRIEF HISTORY OF GPU COMPUTING

The graphics processing unit (GPU), first invented by NVIDIA in 1999, is the most

pervasive parallel processor to date. Fueled by the insatiable desire for life-like real-time

graphics, the GPU has evolved into a processor with unprecedented floating-point performance

and programmability; today's GPUs greatly outpace CPUs in arithmetic throughput and memory

bandwidth, making them the ideal processor to accelerate a variety of data parallel applications.

Efforts to exploit the GPU for non-graphical applications have been underway since

2003. By using graphics APIs such as DirectX and OpenGL and high-level shading languages such as Cg, various data

parallel algorithms have been ported to the GPU. Problems such as protein folding, stock options

pricing, SQL queries, and MRI reconstruction achieved remarkable performance speedups on the

GPU. These early efforts that used graphics APIs for general purpose computing were known as

GPGPU programs.

While the GPGPU model demonstrated great speedups, it faced several drawbacks. First,

it required the programmer to possess intimate knowledge of graphics APIs and GPU

architecture. Second, problems had to be expressed in terms of vertex coordinates, textures and


shader programs, greatly increasing program complexity. Third, basic programming features

such as random reads and writes to memory were not supported, greatly restricting the

programming model. Lastly, the lack of double precision support (until recently) meant some

scientific applications could not be run on the GPU.

To address these problems, NVIDIA introduced two key technologies: the G80 unified graphics and compute architecture (first introduced in the GeForce 8800, Quadro FX 5600, and Tesla C870 GPUs), and CUDA, a software and hardware architecture that enables the GPU to be programmed with a variety of high-level programming languages. Together, these two technologies represented a new way of using the GPU. Instead of programming dedicated graphics units with graphics APIs, the programmer could now write C programs with CUDA extensions and target a general purpose, massively parallel processor. This new way of GPU programming was called "GPU Computing"; it signified broader application support, wider programming language support, and a clear separation from the early "GPGPU" model of programming.


2. GPU ARCHITECTURE

2.1 INTRODUCTION

Before looking at how the GPU architecture is implemented, we first need to know what forms of parallelism are present in the GPU. GPUs have virtually every type of parallelism that can be captured by the programming environment: multithreading, MIMD, SIMD, and even instruction-level parallelism.

2.2 MULTITHREADING

Multithreading allows multiple threads to share the functional units of a single processor

in an overlapping fashion. A general method to exploit thread-level parallelism (TLP) is with a

multiprocessor that has multiple independent threads operating at once and in parallel.

Multithreading, however, does not duplicate the entire processor as a multiprocessor does.

Instead, multithreading shares most of the processor core among a set of threads, duplicating

only private state, such as the registers and program counter. There are three main hardware

approaches to multithreading.

2.2.1 FINE-GRAINED MULTITHREADING

Fine-grained multithreading switches between threads on each clock, causing the

execution of instructions from multiple threads to be interleaved. This interleaving is often done

in a round-robin fashion, skipping any threads that are stalled at that time. One key advantage of

fine-grained multithreading is that it can hide the throughput losses that arise from both short and

long stalls, since instructions from other threads can be executed when one thread stalls, even if

the stall is only for a few cycles. The primary disadvantage of fine-grained multithreading is that

it slows down the execution of an individual thread, since a thread that is ready to execute

without stalls will be delayed by instructions from other threads. It trades an increase in

multithreaded throughput for a loss in the performance (as measured by latency) of a single

thread.

2.2.2 COARSE-GRAINED MULTITHREADING

Coarse-grained multithreading switches threads only on costly stalls, such as level two

or three cache misses. This change relieves the need to have thread-switching be essentially free

and is much less likely to slow down the execution of any one thread, since instructions from

other threads will only be issued when a thread encounters a costly stall. It is limited in its ability

to overcome throughput losses, especially from shorter stalls. This limitation arises from the

pipeline start-up costs of coarse-grained multithreading. Because a CPU with coarse-grained

multithreading issues instructions from a single thread, when a stall occurs the pipeline will see a

bubble before the new thread begins executing. Because of this start-up overhead, coarse-grained


multithreading is much more useful for reducing the penalty of very high-cost stalls, where

pipeline refill is negligible compared to the stall time.

2.2.3 SIMULTANEOUS MULTITHREADING

Simultaneous multithreading is a variation on fine-grained multithreading that arises

naturally when fine-grained multithreading is implemented on top of a multiple-issue,

dynamically scheduled processor. SMT uses thread-level parallelism to hide long-latency events

in a processor, thereby increasing the usage of the functional units. The key insight in SMT is

that register renaming and dynamic scheduling allow multiple instructions from independent

threads to be executed without regard to the dependences among them; the resolution of the

dependences can be handled by the dynamic scheduling capability.

Figure 2.1: How four different approaches use the functional unit execution slots of a superscalar

processor. The horizontal dimension represents the instruction execution capability in each clock cycle.

The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the

corresponding execution slot is unused in that clock cycle. The shades of gray and black correspond to

four different threads in the multithreading processors. Black is also used to indicate the occupied issue

slots in the case of the superscalar without multithreading support. The Sun T1 and T2 (aka Niagara) processors are fine-grained multithreaded processors, while the Intel Core i7 and IBM Power7 processors use SMT. The T2 has eight threads, the Power7 has four, and the Intel i7 has two. In all existing SMTs,

instructions issue from only one thread at a time. The difference in SMT is that the subsequent decision to

execute an instruction is decoupled and could execute the operations coming from several different

instructions in the same clock cycle [1]

2.3 SINGLE INSTRUCTION MULTIPLE DATA

In SIMD a single instruction can launch many data operations, hence SIMD is potentially

more energy efficient than multiple instruction multiple data (MIMD), which needs to fetch and


execute one instruction per data operation. Data-level parallelism (DLP) is not only present in

the matrix-oriented computations of scientific computing, but also the media oriented image and

sound processing. The biggest advantage of SIMD versus MIMD is that the programmer

continues to think sequentially yet achieves parallel speedup by having parallel data operations.

SIMD comes from the GPU community, offering higher potential performance than is found in

traditional Multicore computers today. While GPUs share features with vector architectures, they

have their own distinguishing characteristics, in part due to the ecosystem in which they evolved.

This environment has a system processor and system memory in addition to the GPU and its

graphics memory. In fact, to recognize those distinctions, the GPU community refers to this type

of architecture as heterogeneous.

For problems with lots of data parallelism, all three SIMD variations share the advantage of being easier for the programmer than classic parallel MIMD programming. To put the importance of SIMD versus MIMD into perspective, Figure 2.2 plots the number of cores for MIMD versus the number of 32-bit and 64-bit operations per clock cycle in SIMD mode for x86 computers over time. For x86 computers, we expect to see two additional cores per chip every two years and the SIMD width to double every four years. Given these assumptions, over the next decade the potential speedup from SIMD parallelism is twice that of MIMD parallelism. Hence, it is at least as important to understand SIMD parallelism as MIMD parallelism, although the latter has received much more fanfare recently. For applications with both data-level parallelism and thread-level parallelism, the potential speedup in 2020 will be an order of magnitude higher than today.

Figure 2.2: Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for x86 computers. This figure assumes that two cores per chip for MIMD will be added every two years and the number of operations for SIMD will double every four years. [1]

2.4 MULTIPLE INSTRUCTION MULTIPLE DATA STREAMS

MIMD is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD architectures may be


used in a number of application areas such as computer-aided design/computer-aided

manufacturing, simulation, modeling, and as communication switches. MIMD machines can be

of either shared memory or distributed memory categories. These classifications are based on

how MIMD processors access memory. Shared memory machines may be of the bus-based,

extended, or hierarchical type. Distributed memory machines may have hypercube or mesh

interconnection schemes.

Figure 2.3: The overview of a MIMD Processor

2.5 FERMI GPU ARCHITECTURE OVERVIEW

The Fermi architecture is the most significant leap forward in GPU architecture since the original G80. G80 was NVIDIA's initial vision of what a unified graphics and computing parallel processor should look like. GT200 extended the performance and functionality of G80. With Fermi, NVIDIA took everything learned from the two prior processors and all the applications written for them, and employed a completely new design approach to create the world's first computational GPU.

The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512

CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a

thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit

memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5

DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The

GigaThread global scheduler distributes thread blocks to SM thread schedulers.


Figure 2.4: Fermi's 16 SMs are positioned around a common L2 cache.

2.5.1 STREAMING MULTIPROCESSOR

Each SM features 32 CUDA processors. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.

Figure 2.5: Fermi Streaming Multiprocessor (SM)

In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.

NVIDIA Fermi supports 48 active warps, for a total of 1536 threads per SM. To accommodate this large set of threads, GPUs provide large on-chip register files. Fermi has a per-SM register file size of 128 KB, i.e., about 21 32-bit registers per thread at full occupancy. Each thread uses dedicated registers to enable fast switching. With 16 SMs of 32 cores each, the GPU can perform 16 SMs * 32 cores/SM = 512 floating point operations per cycle.
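As a quick check of the registers-per-thread figure quoted above (a worked calculation added here, using 48 warps × 32 threads = 1536 threads per SM):

\[
\frac{128\,\mathrm{KB}}{4\,\mathrm{B\ per\ register}} = 32768\ \text{registers per SM},
\qquad
\frac{32768\ \text{registers}}{1536\ \text{threads}} \approx 21\ \text{registers per thread}.
\]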

16 Load/Store Units: Each SM has 16 load/store units, allowing source and destination

addresses to be calculated for sixteen threads per clock. Supporting units load and store the data

at each address to cache or DRAM.

Four Special Function Units: Special Function Units (SFUs) execute transcendental

instructions such as sin, cosine, reciprocal, and square root. Each SFU executes one instruction

per thread, per clock; a warp executes over eight clocks. The SFU pipeline is decoupled from the

dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is

occupied.

Designed for Double Precision: Double precision arithmetic is at the heart of HPC applications

such as linear algebra, numerical simulation, and quantum chemistry. The Fermi architecture has

been specifically designed to offer unprecedented performance in double precision; up to 16

double precision fused multiply-add operations can be performed per SM, per clock, a dramatic

improvement over the GT200 architecture.

In the Fermi architecture, each SM has 64 KB of on-chip memory that can be configured as 48

KB of Shared memory with 16 KB of L1 cache or as 16 KB of Shared memory with 48 KB of

L1 cache. For existing applications that make extensive use of Shared memory, tripling the

amount of Shared memory yields significant performance improvements, especially for problems

that are bandwidth constrained. For existing applications that use Shared memory as software

managed cache, code can be streamlined to take advantage of the hardware caching system,

while still having access to at least 16 KB of shared memory for explicit thread cooperation. Best

of all, applications that do not use Shared memory automatically benefit from the L1 cache,

allowing high performance CUDA programs to be built with minimum time and effort.
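As an illustration of this configurable split, the following hedged sketch uses the CUDA runtime call cudaFuncSetCacheConfig to request the 48 KB shared / 16 KB L1 configuration for a kernel that relies on shared memory (the kernel and sizes are illustrative, not from the original text):

    #include <cuda_runtime.h>
    #include <cstdio>

    // Illustrative kernel that stages data in shared memory.
    __global__ void sharedHeavyKernel(const float *in, float *out, int n)
    {
        __shared__ float tile[256];              // one element per thread in a 256-thread block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[threadIdx.x] = in[i];
            __syncthreads();                     // make the tile visible to the whole block
            out[i] = tile[threadIdx.x] * 2.0f;   // trivial use of the staged data
        }
    }

    int main()
    {
        // Ask the runtime to prefer 48 KB shared memory / 16 KB L1 for this kernel
        // (on Fermi-class parts; the runtime treats this as a hint).
        cudaFuncSetCacheConfig(sharedHeavyKernel, cudaFuncCachePreferShared);
        // ... allocate buffers and launch sharedHeavyKernel<<<blocks, 256>>>(...) as usual ...
        printf("cache configuration requested\n");
        return 0;
    }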

2.5.2 DUAL WARP SCHEDULER

The SM schedules threads in groups of 32 parallel threads called warps. Each SM features two

warp schedulers and two instruction dispatch units, allowing two warps to be issued and

executed concurrently. Fermi's dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Because warps execute independently, Fermi's scheduler does not need to check for dependencies from


within the instruction stream. Using this elegant model of dual-issue, Fermi achieves near peak

hardware performance.

Most instructions can be dual issued; two integer instructions, two floating point instructions, or a mix

of integer, floating point, load, store, and SFU instructions can be issued concurrently. Double

precision instructions do not support dual dispatch with any other operation.

Figure 2.6: Dual Warp Scheduler

2.5.3 GIGATHREAD THREAD SCHEDULER

One of the most important technologies of the Fermi architecture is its two-level, distributed

thread scheduler. At the chip level, a global work distribution engine schedules thread blocks to

various SMs, while at the SM level, each warp scheduler distributes warps of 32 threads to its

execution units. The first generation GigaThread engine introduced in G80 managed up to

12,288 threads in real time. The Fermi architecture improves on this foundation by providing not

only greater thread throughput, but dramatically faster context switching, concurrent kernel

execution, and improved thread block scheduling.

2.6 COMPUTE UNIFIED DEVICE ARCHITECTURE

CUDA is the hardware and software architecture that enables NVIDIA GPUs to execute

programs written with C, C++, FORTRAN, OpenCL, DirectCompute, and other languages. A

CUDA program calls parallel kernels. A kernel executes in parallel across a set of parallel

threads. The programmer or compiler organizes these threads in thread blocks and grids of thread

blocks. The GPU instantiates a kernel program on a grid of parallel thread blocks. Each thread

within a thread block executes an instance of the kernel, and has a thread ID within its thread

block, program counter, registers, per-thread private memory, inputs, and output results. A thread

block is a set of concurrently executing threads that can cooperate among themselves through


barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid

is an array of thread blocks that execute the same kernel, read inputs from global memory, write

results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel

programming model, each thread has a per-thread private memory space used for register spills,

function calls, and C automatic array variables. Each thread block has a per-Block shared

memory space used for inter-thread communication, data sharing, and result sharing in parallel

algorithms. Grids of thread blocks share results in Global Memory space after kernel-wide global

synchronization.
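To make the hierarchy concrete, here is a minimal, illustrative CUDA sketch (names and sizes are not from the original text): each thread has a unique thread and block ID, uses private registers, shares a per-block array, and reads and writes global memory.

    #include <cuda_runtime.h>

    // Minimal sketch of the CUDA thread hierarchy described above.
    __global__ void scaleKernel(const float *in, float *out, float a, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID within the grid
        float x;                                          // per-thread private state (register)
        __shared__ float s[128];                          // per-block shared memory (128-thread blocks assumed)
        if (tid < n) {
            x = in[tid];                                  // read from global memory
            s[threadIdx.x] = x;
            __syncthreads();                              // barrier: cooperation within the block
            out[tid] = a * s[threadIdx.x];                // write result back to global memory
        }
    }

    // Launch: enough 128-thread blocks to cover n elements.
    // scaleKernel<<<(n + 127) / 128, 128>>>(d_in, d_out, 2.0f, n);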

2.7 PROGRAM EXECUTION IN GPU

When the CPU (host) wants to execute some code on the GPU (device), the CPU first copies the required data into the GPU's global memory. The program then launches the kernel on the device, which executes the code and leaves its results in GPU global memory. The CPU then copies the output data from GPU global memory back to host memory and makes use of it. GPU memory is dynamically allocated and freed by the CPU, but the CPU cannot access the shared memory of the GPU or of any SM.
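A minimal host-side sketch of this flow (reusing the illustrative scaleKernel declared in the previous section; error handling omitted):

    #include <cuda_runtime.h>
    #include <vector>

    // Defined in the previous sketch; declared here for completeness.
    __global__ void scaleKernel(const float *in, float *out, float a, int n);

    void runOnDevice(const std::vector<float> &h_in, std::vector<float> &h_out, float a)
    {
        int n = static_cast<int>(h_in.size());
        size_t bytes = n * sizeof(float);

        float *d_in = nullptr, *d_out = nullptr;
        cudaMalloc(&d_in, bytes);                                        // CPU allocates GPU global memory
        cudaMalloc(&d_out, bytes);

        cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);    // copy inputs to the device

        scaleKernel<<<(n + 127) / 128, 128>>>(d_in, d_out, a, n);        // launch the kernel

        cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // copy results back to the host

        cudaFree(d_in);                                                  // CPU frees GPU memory
        cudaFree(d_out);
    }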

Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads). The thread is the unit of parallelism in CUDA.

Warp: a group of threads executed physically in parallel. (From the Oxford Dictionary: in the textile industry, the term "warp" refers to the threads stretched lengthwise in a loom to be crossed by the weft.)

Block: a group of threads that are executed together and form the unit of resource assignment.

Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect.

Figure 2.7: CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.


3. CONTROL FLOW DIVERGENCE

3.1 INTRODUCTION

Current graphics processing unit (GPU) architectures balance the high efficiency of

single instruction multiple data (SIMD) execution with programmability and dynamic control.

GPUs group multiple logical threads together (32 or 64 threads in current designs [7]) that are

then executed on the SIMD hardware. While SIMD execution is used, each logical thread can

follow an independent flow of control, in the single instruction multiple thread execution style.

When control flow causes threads within the same group to diverge and take different control

flow paths, hardware is responsible for maintaining correct execution. This is currently done by

utilizing a reconvergence predicate stack, which partially serializes execution. The stack

mechanism partitions the thread group into subgroups that share the same control flow. A single

subgroup executes at a time, while the execution of those threads that follow a different flow is

masked. While this mechanism is effective and efficient, it does not maximize available

parallelism because it serializes the execution of different subgroups, which degrades

performance in some cases.

We present a mechanism that requires only small changes to the current reconvergence

stack structure, maintains the same SIMD execution efficiency, and yet is able to increase

available parallelism and performance. Unlike previously proposed solutions to the serialized

parallelism problem [15], our technique requires no heuristics, no compiler support, and is

robust to changes in architectural parameters. In our technique hardware does not serialize all

control flows, and instead is able to interleave execution of the taken and not-taken flows.

We propose the first design that extends the reconvergence stack model, which is the

dominant model for handling branch divergence in GPUs. We maintain the simplicity,

robustness, and high SIMD utilization advantages of the stack approach, yet we are able to

exploit more parallelism by interleaving the execution of diverged control-flow paths. We

describe and explain the microarchitecture in depth and show how it can integrate with current

GPUs.

3.2 BACKGROUND

The typical current GPU consists of multiple processing cores ("streaming multiprocessors" (SMs) and "SIMD units" in NVIDIA and AMD terminology, respectively), where each core consists of a set of parallel execution lanes ("CUDA cores" and "SIMD cores" in NVIDIA and AMD terms). In the GPU execution model, each core executes a large number of logically independent threads, which are all executing the same code (referred to as a kernel).

logically independent threads, which are all executing the same code (referred to as a kernel).

These parallel lanes operate in a SIMD/vector fashion where a single instruction is issued to all

the lanes for execution. Because each thread can follow its own flow of control while executing

on SIMD units, the name used for this hybrid execution model is single instruction multiple

thread (SIMT). In NVIDIA GPUs, each processing core schedules a SIMD instruction from a


warp of 32 threads while AMD GPUs currently schedule instructions from a wavefront of 64

work items. In the rest of this report we will use the terms defined in the CUDA language [8].

This execution model enables very efficient hardware, but requires a mechanism to allow each

thread to follow its own thread of control, even though only a single uniform operation can be

issued across all threads in the same warp.

In order to allow independent branching, hardware must provide two mechanisms. The

first mechanism determines which single path, of potentially many control paths, is active and is

executing the current SIMD instruction. The technique for choosing the active path used by

current GPUs is stack-based reconvergence, which we explain in the next subsection. The second

mechanism ensures that only threads that are on the active path, and therefore share the same

program counter (PC) can execute and commit results. This can be done by associating an active

mask with each SIMD instruction that executes. Threads that are in the executing SIMD

instruction but not on the active control path are masked and do not commit results. The mask

may either be computed dynamically by comparing the explicit PC of each thread with the PC

determined for the active path, or alternatively, the mask may be explicitly stored along with

information about the control paths. The GPU in Intel's Sandy Bridge [17] stores an explicit PC

for each thread while GPUs from AMD and NVIDIA currently associate a mask with each path.

3.3 STACK-BASED RECONVERGENCE

A significant challenge with the SIMT model is maintaining high utilization of the SIMD

resources when the control flow of different threads within a single warp diverges. There are two

main reasons why SIMD utilization decreases with divergence. The first is that masked

operations needlessly consume resources. This problem has been the focus of a number of recent

research projects, with the main idea being that threads from multiple warps can be combined to

reduce the fraction of masked operations [3, 16]. The second reason is that execution of

concurrent control paths is serialized with every divergence potentially decreasing parallelism.

Therefore, care must be taken to allow them to reconverge with one another. In current GPUs, all

threads that reach a specific diverged branch reconverge at the immediate post-dominator

instruction of that branch [3]. The post-dominator (PDOM) instruction is the first instruction in

the static control flow that is guaranteed to be on both diverged paths [3]. For example, in Figure 3.1, the PDOM of the divergent branch at the end of basic block A (BR B-C) is the instruction that starts basic block G. Similarly, the PDOM of BR D-E at the end of basic block C is the instruction starting basic block F.
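For reference, the control-flow graph of Figure 3.1 corresponds to a nested if/else in the kernel code, roughly like the following illustrative CUDA sketch (block labels and example masks shown as comments; the predicates p1 and p2 are hypothetical inputs):

    __global__ void divergentKernel(const int *p1, const int *p2, float *data)
    {
        int tid = threadIdx.x;          // 8 threads per warp in the example of Figure 3.1
        // Block A: executed by all threads (active mask 11111111)
        float x = data[tid];
        if (p1[tid]) {                  // BR B-C: the warp diverges here
            // Block B: threads whose p1 is true (e.g. mask 11000000)
            x += 1.0f;
        } else {
            // Block C: the remaining threads (e.g. mask 00111111)
            if (p2[tid]) {              // BR D-E: nested divergence within path C
                // Block D (e.g. mask 00110000)
                x *= 2.0f;
            } else {
                // Block E (e.g. mask 00001111)
                x *= 3.0f;
            }
            // Block F: reconvergence point (PDOM) of BR D-E (mask 00111111)
            x -= 0.5f;
        }
        // Block G: reconvergence point (PDOM) of BR B-C (mask 11111111)
        data[tid] = x;
    }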

An elegant way to implement PDOM reconvergence is to treat control flow execution as

a serial stack. Each time control diverges, both the taken and not taken paths are pushed onto a

stack (in arbitrary order) and the path at the new top of stack is executed. When the control path

reaches its reconvergence point, the entry is popped off of the stack and execution now follows

the alternate direction of the diverging branch. This amounts to a serial depth-first traversal of

the control-flow graph. Note that only a single path is executed at any given time, which is the


path that is logically at the top of the stack. There are multiple ways to implement such a stack model, including both explicit hardware structures and implicit traversal with software directives [3]. We describe our mechanisms in the context of an explicit hardware approach, which we explain below. According to prior publications, this hardware approach is used in NVIDIA GPUs. We discuss the application of our technique to GPUs with software control in the next subsections.

Figure 3.1: Example control-flow graph. Each warp consists of 8 threads, and the 1's and 0's designate the active and inactive threads on each path:

A: 11111111
B: 11000000    C: 00111111
D: 00110000    E: 00001111
F: 00111111
G: 11111111


The hardware reconvergence stack tracks the program counter (PC) associated with each

control flow path, which threads are active at each path (the active mask of the path), and at what

PC a path should reconverge (the RPC) with its predecessor in the control-flow graph [12]. The

stack contains the information on the control flow of all threads within a warp. Figure 3.2 depicts

the reconvergence stack and its operation on the example control flow shown in Figure 3.1. We

describe this example in detail below.

When a warp first starts executing, the stack is initialized with a single entry: the PC

points to the first instructions of the kernel (first instruction of block A), the active mask is full,

and the RPC (reconvergence PC) is set to the end of the kernel. When a warp executes a

conditional branch, the predicate values for both the taken and non-taken paths (left and right

paths) are computed. If control diverges with some threads following the taken path and others

the nontaken path, the stack is updated to include the newly formed paths (Figure 3.2(b)). First,

the PC field of the current top of the stack (TOS) is modified to the PC value of the

reconvergence point, because when execution returns to this path, it would be at the point where

the execution reconverges (start of block G in the example). The RPC value is explicitly

communicated from software and is computed with a straightforward compiler analysis [3].

Second, the PC of the right path (block C), the corresponding active mask, and the RPC (block

G) is pushed onto the stack. Third, the information on the left path (block B) is similarly pushed

onto the stack. Finally, execution moves to the left path, which is now at the TOS. Note that only

a single path per warp, the one at the TOS, can be scheduled for execution. For this reason we

refer to this baseline architecture as the single-path execution (SPE) model.

When the current PC of a warp matches the RPC field value at that warp's TOS, the entry

at the TOS is popped off (Figure 3.2(c)). At this point, the new TOS corresponds to the right path

of the branch and the warp starts executing block C. As the warp encounters another divergent

branch at the end of block C, the stack is once again updated with the left and right paths of

blocks D and E (Figure 3.2(d)). Note how the stack elegantly handles the nested branch and how

the active masks for the paths through blocks D and E are each a subset of the active mask of

block C. When both left and right paths of block D and E finish execution and corresponding

stack entries are popped out, the TOS points to block F and control flow is reconverged back to

the path that started at block C (Figure 3.2(e)) – the active mask is set correctly now that the

nested branch reconverged. Similarly, when block F finishes execution and the PC equals the

reconvergence PC (block G), the stack is again popped and execution continues along a single

path with a full active mask (Figure 3.2(f)).
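To make the stack operations above concrete, here is a small host-side C++ sketch (an illustrative model under the assumptions of this section, not actual GPU hardware or GPGPU-Sim code) of the single-path reconvergence stack: on divergence the current entry's PC is set to the reconvergence point and both paths are pushed; an entry is popped when its PC reaches its RPC.

    #include <cstdint>
    #include <vector>

    // One entry of the single-path (SPE) reconvergence stack.
    struct StackEntry {
        uint32_t pc;          // next PC to execute on this path
        uint32_t activeMask;  // one bit per thread in the warp
        uint32_t rpc;         // reconvergence PC shared with the sibling path
    };

    struct SimtStack {
        std::vector<StackEntry> entries;

        // Called when the path at the TOS executes a divergent branch.
        void diverge(uint32_t takenPC, uint32_t takenMask,
                     uint32_t notTakenPC, uint32_t notTakenMask,
                     uint32_t reconvergencePC)
        {
            entries.back().pc = reconvergencePC;                             // current entry resumes at the PDOM
            entries.push_back({notTakenPC, notTakenMask, reconvergencePC});  // right path
            entries.push_back({takenPC, takenMask, reconvergencePC});        // left path becomes the new TOS
        }

        // Called after each instruction: pop the TOS if it reached its RPC.
        void maybeReconverge(uint32_t currentPC)
        {
            if (!entries.empty() && currentPC == entries.back().rpc)
                entries.pop_back();
        }

        // Only the TOS path is schedulable in the SPE model.
        const StackEntry &schedulable() const { return entries.back(); }
    };

Only the entry at the top of the stack is ever schedulable, which is exactly the serialization that the dual-path proposal of Section 3.7 relaxes.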


Figure 3.2: High-level operation of SIMT stack-based reconvergence (the single-path stack) when executing the control-flow graph in Figure 3.1. The ones/zeros inside the active mask field designate the active threads in that block; bubbles in (g) represent idle execution resources (masked lanes or zero ready warps available for scheduling in the SM). (a) Initial status of the stack: the TOS designates that basic block A is being executed. (b) Two entries for blocks B and C are pushed onto the stack when BR B-C is executed; the RPC is updated to block G. (c) The stack entry corresponding to B at the TOS is popped when its PC matches the RPC of G. (d) Two more entries for blocks D and E are pushed onto the stack when the warp executes BR D-E. (e) Threads reconverge at block F when both entries for blocks D and E have been popped. (f) All eight threads become active again when the stack entry for block F is popped.


This example also points out the two main deficiencies of the SPE model. First, SIMD

utilization decreases every time control flow diverges. SIMD utilization has been the focus of

active research (e.g., [3, 16]) and we do not discuss it further in this paper. Second, execution is

serialized such that only a single path is followed until it completes and reconverges (Figure

3.2(g)). Such SPE model works well for most applications because of the abundant parallelism

exposed through multiple warps within cooperative thread arrays. However, for some

applications, the restriction of following only a single path does degrade performance. Meng et

al. proposed dynamic warp subdivision (DWS) [15], which selectively deviates from the

reconvergence stack execution model, to overcome the serialization issue.

3.4 LIMITATION OF PREVIOUS MODEL

As discussed in the previous subsection, SPE is able to address only one aspect of the control

divergence issue while overlooking the other. SPE uses simple hardware and an elegant

execution model to maximize SIMD utilization with structured control flow, but always

serializes execution with only a single path schedulable at any given time. DWS [15] can

interleave the scheduling of multiple paths and increase TLP, but it sacrifices SIMD lane

utilization. Our proposed model, on the other hand, always matches the utilization and SIMD

efficiency of the baseline SPE while still enhancing TLP in some cases. Our approach keeps the

elegant reconvergence stack model and the hardware requires only small modifications to utilize

up to two interleaved paths. Our technique requires only a small number of components within

the GPU microarchitecture and requires no support from software. Specifically, the stack itself is

enhanced to provide up to two concurrent paths for execution, the scoreboard is modified to track

dependencies of two concurrent paths and to correctly handle divergence and reconvergence, and

the warp scheduler is extended to handle up to two schedulable objects per warp.

3.5 WARP SIZE IMPACT

Small warps, i.e., warps as wide as SIMD width, reduce the likelihood of branch divergence

occurrence. Reducing the branch divergence improves SIMD efficiency by increasing the

number of active lanes. At the same time, a small size warp reduces memory coalescing,

effectively increasing memory stalls. This can lead to redundant memory accesses and

increase pressure on the memory subsystem. Large warps, on the other hand, exploit

potentially existing memory access localities among neighbor threads and coalesce them to

a few off-core requests. On the negative side, bigger warp size can increase serialization and

the branch divergence impact.



Insensitive workloads: Warp size affects performance in SIMT cores only for workloads

suffering from branch/memory divergence or showing potential benefits from memory

access coalescing. Therefore, benchmarks lacking either of these characteristics are insensitive

to warp size.

Ideal coalescing and write accesses: The coalescing rate of the small-warp machine is far higher than that of the other machines because of its ideal coalescing hardware, which merges many read accesses among warps. However, ideal coalescing can only capture read accesses and does not compensate for uncoalesced accesses; therefore, the small-warp machine may still suffer from uncoalesced write accesses, which degrade its overall performance.

Practical issues with small warps: The pipeline front-end includes the warp scheduler, fetch engine, instruction decode, and register read stages. Using fewer threads per warp affects the pipeline front-end, as it requires a faster clock rate to deliver the needed workload during the

same time period. An increase in the clock rate can increase power dissipation in the

front-end and impose bandwidth limitation issues on the fetch stage. Moreover, using short

warps can impose extra area overhead as the warp scheduler has to select from a larger number

of warps. In this study we focus on how warp size impacts performance, leaving the area and

power evaluations to future works.

Register file: Warp size affects register file design and allocation. GPUs allocate all warp

registers in a single row. Such an allocation allows the read stage to read one operand for

all threads of a warp by accessing a single register file row. For different warp sizes, the

number of registers in a row (row size) varies according to the warp size to preserve

accessibility. Row size should be wider for large warps to read the operands of all threads in a

single row access and narrower for small warps to prevent unnecessary reading.

Figure 3.3 reports average performance for GPUs using different warp sizes and SIMD widths. For any specific SIMD width, configuring the warp size to be 1-2X larger than the SIMD width provides the best average performance; widening the warp size beyond 2X degrades performance [14].

Figure 3.3: Warp size impact on performance for different SIMD widths, normalized to 8-wide SIMD and 4x warp size. [14]


3.6 LIMITATIONS OF MULTITHREADING DEPTH

As in other throughput-oriented organizations that try to maximize thread concurrency and hence do not waste area on discovering instruction-level parallelism, SMs typically employ in-order pipelines that have limited ability to execute past L1 cache misses or other long latency events. To hide memory latencies, SMs instead time-multiplex among multiple concurrent warps, each with their own PCs and registers. However, the multi-threading depth (i.e., number

of warps) is limited because adding more warps multiplies the area overhead in register files, and

may increase cache contention as well. As a result of this limited multi-threading depth, the

WPU may run out of work. This can occur even when there are runnable threads that are stalled

only due to SIMD lockstep restrictions. For example, some threads in a warp may be ready while

others are still stalled on a cache miss.

Figure 3.4(c) shows that adding more warps eventually exacerbates L1 contention. This is a capacity limitation, not just an associativity problem, as shown in Figure 3.4(b), where the time spent waiting on memory is still significant even with full associativity. These results are averages across the benchmarks and configuration described in [15]. Intra-warp latency tolerance hides latencies without requiring extra threads. However, intra-warp latency tolerance is only beneficial when threads within the same warp exhibit divergent behavior, and many benchmarks exhibit frequent memory divergence [15]. A further advantage of intra-warp latency tolerance is that the same mechanisms also improve throughput in the presence of branch divergence.

As SIMD width increases, the likelihood that at least one thread will stall the entire warp increases. However, inter-warp latency tolerance (i.e., deeper multi-threading via more warps) is not sufficient to hide these latencies, since the number of threads whose state fits in an L1 cache is limited. That is why intra-warp latency tolerance is needed. Intra-warp latency tolerance also provides opportunities for memory-level parallelism (MLP) that conventional SIMD organizations do not.

Figure 3.4: (a) A wider SIMD organization does not always improve performance due to increased time spent waiting for memory; the number of warps is fixed at four. (b) 16-wide WPUs with 4 warps suffer even with highly associative D-caches. (c) A few 8-wide warps can beneficially hide latency, but too many warps eventually exacerbate cache contention and increase the time spent waiting for memory.


3.7 ENHANCING THE INTRA WARP PARALLELISM

We extend the hardware stack used in many current GPUs to support two concurrent paths of

execution. The idea is that instead of pushing the taken and fall-through paths onto the stack one

after the other, in effect serializing their execution, the two paths are maintained in parallel. A

stack entry of the dual-path stack architecture thus consists of three data elements: a) PC and

active mask value of the left path (PathL), b) PC and active mask value of the right path (PathR),

and c) the RPC (reconvergence PC) of the two paths. We use the generic names left and right

because there is no reason to restrict the mapping of taken and non-taken paths to the fields of

the stack. Note that there is no need to duplicate the RPC field within an entry because PathL and

PathR of a divergent branch have a common reconvergence point. Besides the data fields that

constitute a stack entry, the other components of the control flow hardware, such as the logic for

computing active masks and managing the stack, are virtually identical to those used in the

baseline stack architecture. We expose the two paths for execution on a divergent branch and can

improve performance when this added parallelism is necessary, as shown in Figure 3.6, and

described below for the cases of divergence and reconvergence.

Handling Divergence: A warp starts executing on one of the paths, for example the left path,

with a full active mask. The PC is set to the first instruction in the kernel and the RPC set to the

last instruction (PathL in Figure 3.6(a)). The warp then executes in identical way to the baseline

single-path stack until a divergent branch executes. When the warp executes a divergent branch,

we push a single entry onto the stack, which represents both sides of the branch, rather than

pushing two distinct entries as done with the baseline SPE. The PC field of the block that

diverged is set to the RPC of both the left and right paths (block G in Figure 3.6(b)), because this

is the instruction that should execute when control returns to this path. Then, the active mask and

PC of PathL, as well as the same information for PathR are pushed onto the stack, along with

their common RPC and updating the TOS (Figure 3.6(b)). Because it contains the information

for both paths, the single TOS entry enables the warp scheduler to interleave the scheduling of

active threads at both paths as depicted in Figure 3.6(g). If both paths are active at the time of

divergence, the one to diverge (block C in Figure 3.6(b)) first pushes an entry onto the stack, and

in effect, suspends the other path (block B in Figure 3.6(c)) until control returns to this stack

entry (Figure 3.6(e)). Note that the runtime information required to update the stack entries is

exactly the same as in the baseline single-stack model.
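Extending the single-path sketch from Section 3.3, the dual-path stack entry and the divergence/reconvergence handling described in this section can be modelled roughly as follows (again an illustrative host-side sketch, not the actual hardware):

    #include <cstdint>
    #include <vector>

    // One entry of the dual-path stack: two concurrent paths share a single RPC.
    struct DualPathEntry {
        uint32_t pcL, maskL;   // PC and active mask of the left path (invalid when maskL == 0)
        uint32_t pcR, maskR;   // PC and active mask of the right path (invalid when maskR == 0)
        uint32_t rpc;          // common reconvergence PC of the two paths
    };

    struct DualPathStack {
        std::vector<DualPathEntry> entries;

        // A divergent branch on one path of the TOS pushes a single new entry holding
        // both sides of the branch; the diverging path's PC is set to the branch's RPC.
        void diverge(bool leftDiverged, uint32_t takenPC, uint32_t takenMask,
                     uint32_t notTakenPC, uint32_t notTakenMask, uint32_t reconvergencePC)
        {
            DualPathEntry &top = entries.back();
            if (leftDiverged) top.pcL = reconvergencePC; else top.pcR = reconvergencePC;
            entries.push_back({takenPC, takenMask, notTakenPC, notTakenMask, reconvergencePC});
        }

        // A path that reaches the RPC is invalidated; the entry is popped only when
        // both paths have arrived, at which point the warp reconverges.
        void arriveAtRPC(bool leftPath)
        {
            DualPathEntry &top = entries.back();
            if (leftPath) top.maskL = 0; else top.maskR = 0;
            if (top.maskL == 0 && top.maskR == 0)
                entries.pop_back();
        }
    };

Unlike the single-path stack, both valid paths of the TOS entry are exposed to the warp scheduler, which is what allows the taken and not-taken flows to be interleaved.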


Figure 3.5: Exploiting the parallelism with our model. We assume the same control-flow graph as in Figure 3.1.

Figure 3.6: Operation of the dual-path stack when executing the control-flow graph of Figure 3.1. Each entry holds the PC and active mask of the left path (PathL), the PC and active mask of the right path (PathR), and their common RPC. (a) Initial status of the stack; PathR at the TOS is left blank as there is only one schedulable path at the beginning. (b) When BR B-C is executed, both the taken (PathL) and non-taken (PathR) path information is pushed as a single operation; note that only a single RPC field is needed per entry. (c) Branching at BR D-E at the end of block C pushes another entry for blocks D and E; the PC of PathR at the TOS is updated to block F, and the TOS is incremented afterwards. (d) When all the instructions in block D have been consumed and PathL's PC value matches the RPC (block F), the corresponding path is invalidated. (e) When the threads in block E eventually arrive at the end of their basic block, PathL and PathR are both invalidated; the entry associated with blocks D and E is popped, the TOS is decremented, and blocks B and F resume execution. (f) When the threads in blocks B and F arrive at the RPC point (block G), the entry is popped again, the TOS is decremented, and all eight threads reconverge at block G.


Handling Reconvergence: When either one of the basic blocks at the TOS arrives at the reconvergence point and its PC matches the RPC, that block is invalidated (block D in Figure 3.6(d)). Because the right path is still active, though, the entry is not yet popped off the stack. Once both paths arrive at the RPC, the stack is popped and control returns to the next stack entry (Figure 3.6(e–f)).
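Continuing the dual-path sketch above (same assumed DualPathEntry structure), the helper below illustrates this invalidate-then-pop behaviour; it is a simplification, not the simulator's actual code.

    // Called after a path at the TOS issues an instruction; returns true when
    // the whole entry has been popped and control falls back to the entry below.
    bool handle_reconvergence(std::vector<DualPathEntry>& stack)
    {
        DualPathEntry& top = stack.back();

        // A path whose PC has reached the RPC is invalidated (mask cleared).
        if (top.maskL != 0 && top.pcL == top.rpc) top.maskL = 0;
        if (top.maskR != 0 && top.pcR == top.rpc) top.maskR = 0;

        // Only when BOTH paths have reached the RPC is the entry popped and
        // the TOS decremented, resuming the suspended entry beneath it.
        if (top.maskL == 0 && top.maskR == 0) {
            stack.pop_back();
            return true;
        }
        return false;
    }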


4. SIMULATOR AND BENCHMARKS

We model the microarchitectural components for execution using GPGPU-Sim (version 3.2.1) [4, 12], a detailed cycle-level performance simulator of a general-purpose GPU architecture that supports CUDA version 4.0 and its PTX ISA. The simulator is configured to be similar to NVIDIA's Fermi architecture using the configuration file provided with GPGPU-Sim [1].

We use benchmarks selected from Rodinia [5], Parboil [6], the CUDA SDK [8], the benchmarks provided with GPGPU-Sim [4], and from [15], restricted to those that we found to work with GPGPU-Sim. The benchmarks studied are ones whose kernels can execute to completion on GPGPU-Sim.

4.1 Top-Level Organization of GPGPU-Sim [12]

The GPU modeled by GPGPU-Sim is composed of Single-Instruction, Multiple-Thread (SIMT) cores connected via an on-chip interconnection network to memory partitions that interface to graphics GDDR DRAM. An SIMT core models a highly multithreaded, pipelined SIMD processor roughly equivalent to what NVIDIA calls a Streaming Multiprocessor (SM) or what AMD calls a Compute Unit (CU). The organization of an SIMT core is shown in Figure 4.1 below.

Figure 4.1: Detailed Microarchitecture Model of SIMT Core [12]


4.2 SIMT Core Microarchitecture [12]

Figure 4.2 below illustrates the SIMT core microarchitecture simulated by GPGPU-Sim 3.x. A Stream Processor (SP) or CUDA core corresponds to a lane within an ALU pipeline of the SIMT core. This microarchitecture model contains many details not found in earlier versions of GPGPU-Sim. The main differences include:

• A new front end that models instruction caches and separates the warp scheduling (issue) stage from the fetch and decode stages.
• Scoreboard logic enabling multiple instructions from a single warp to be in the pipeline at once.
• A detailed model of an operand collector that schedules operand accesses to single-ported register file banks (used to reduce the area and power of the register file).
• A flexible model that supports multiple SIMD functional units, which allows memory instructions and ALU instructions to operate in different pipelines.

The following subsections describe the details in Figure 4.2 by going through each stage of the pipeline.

Figure 4.2: Overall GPU Architecture Modeled by GPGPU-Sim [12]

The major stages in the front end include the instruction cache access and instruction buffering logic, the scoreboard and scheduling logic, and the SIMT stack.


4.2.1 Fetch and Decode [12]

The instruction buffer (I-Buffer) block in Figure 4.2 buffers instructions after they are fetched from the instruction cache. It is statically partitioned so that every warp running on the SIMT core has dedicated storage for its instructions. In the current model, each warp has two I-Buffer entries, and each entry has a valid bit, a ready bit, and a single decoded instruction for that warp. The valid bit indicates that the entry holds a decoded instruction that has not yet been issued, while the ready bit indicates that the decoded instructions of this warp are ready to be issued to the execution pipeline. Conceptually, the ready bit is set in the schedule-and-issue stage using the scoreboard logic and the availability of hardware resources (in the simulator software, rather than actually setting a ready bit, a readiness check is performed). The I-Buffer is initially empty, with all valid bits and ready bits deactivated.
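A minimal sketch of the per-warp I-Buffer state described above, assuming two entries per warp as in the current model; the structure and names (IBufferEntry, WarpIBuffer) are illustrative, not GPGPU-Sim's actual classes.

    #include <array>
    #include <cstdint>

    // One I-Buffer slot: a decoded instruction plus its valid and ready bits.
    struct IBufferEntry {
        bool     valid = false;   // a decoded, not-yet-issued instruction is present
        bool     ready = false;   // passed scoreboard / resource checks, may issue
        uint64_t decodedInst = 0; // placeholder for the decoded instruction
    };

    // Each warp owns two entries, statically partitioned per the current model.
    struct WarpIBuffer {
        std::array<IBufferEntry, 2> entries;

        // A warp is eligible for fetch only when it holds no valid instructions.
        bool eligibleForFetch() const {
            return !entries[0].valid && !entries[1].valid;
        }
    };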

A warp is eligible for instruction fetch if it does not have any valid instructions within the I-Buffer. Eligible warps are scheduled to access the instruction cache in round-robin order. Once selected, a read request is sent to the instruction cache with the address of the next instruction in the currently scheduled warp. By default, two consecutive instructions are fetched. Once a warp is scheduled for an instruction fetch, its valid bit in the I-Buffer is activated until all the fetched instructions of this warp are issued to the execution pipeline.

The instruction cache is a read-only, non-blocking, set-associative cache that can model both FIFO and LRU replacement policies with on-miss and on-fill allocation policies. A request to the instruction cache results in a hit, a miss, or a reservation failure. A reservation failure occurs if either the Miss Status Holding Register (MSHR) is full or there is no replaceable block in the cache set because all blocks are reserved by prior pending requests. In both the hit and the miss case, the round-robin fetch scheduler moves on to the next warp. On a hit, the fetched instructions are sent to the decode stage; on a miss, a request is generated by the instruction cache. When the miss response is received, the block is filled into the instruction cache and the warp again needs to access the instruction cache; while the miss is pending, the warp does not access the instruction cache.

A warp finishes execution and is no longer considered by the fetch scheduler once all its threads have finished execution without any outstanding stores or pending writes to local registers. A thread block is considered done once all warps within it have finished and have no pending operations, and a kernel is considered done once all thread blocks dispatched at its launch have finished (warp -> thread block -> kernel).

At the decode stage, the recently fetched instructions are decoded and stored in their corresponding entries in the I-Buffer, waiting to be issued.


4.2.2 Instruction Issue [12]

A second round-robin arbiter chooses a warp to issue from the I-Buffer to the rest of the pipeline. This arbiter is decoupled from the round-robin arbiter used to schedule instruction cache accesses. The issue scheduler can be configured to issue multiple instructions from the same warp per cycle. Each valid instruction (i.e., decoded and not yet issued) in the currently checked warp is eligible for issue if all of the following hold (a sketch of this check appears at the end of this subsection):

(1) its warp is not waiting at a barrier,

(2) it has valid instructions in its I-Buffer entries (the valid bit is set),

(3) the scoreboard check passes (see the Scoreboard subsection for more details), and

(4) the operand access stage of the instruction pipeline is not stalled.

Memory instructions (loads, stores, or memory barriers) are issued to the memory pipeline. For other instructions, the scheduler always prefers the SP pipeline for operations that can use both the SP and SFU pipelines. If a control hazard is detected, the instructions in the I-Buffer corresponding to this warp are flushed, and the warp's next PC is updated to point to the next instruction (assuming all branches are not taken). For more information about handling control flow, refer to the SIMT Stack subsection.
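A compact sketch of the four eligibility conditions above, reusing the IBufferEntry structure from the Fetch and Decode sketch; the boolean inputs (atBarrier, scoreboardOK, operandStageStalled) are hypothetical stand-ins for the corresponding simulator checks.

    // Returns true when a decoded instruction of a warp may be issued this cycle.
    // Each term mirrors one of the four rules listed above.
    bool canIssue(const IBufferEntry& e,
                  bool atBarrier, bool scoreboardOK, bool operandStageStalled)
    {
        return !atBarrier               // (1) warp is not waiting at a barrier
            &&  e.valid                 // (2) a valid decoded instruction exists
            &&  scoreboardOK            // (3) no WAW/RAW hazard per the scoreboard
            && !operandStageStalled;    // (4) operand access stage can accept it
    }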

4.2.3 SIMT Stack [12]

A per-warp SIMT stack is used to handle branch divergence on single-instruction, multiple-thread (SIMT) architectures. Since divergence reduces the efficiency of these architectures, different techniques can be adopted to reduce this effect. One of the simplest is the post-dominator stack-based reconvergence mechanism, which synchronizes the divergent branches at the earliest guaranteed reconvergence point in order to increase the efficiency of the SIMT architecture.

Each entry of the SIMT stack represents a different divergence nesting level. At each divergent branch, a new entry is pushed onto the top of the stack, and the top-of-stack entry is popped when the warp reaches its reconvergence point. Each entry stores the target PC of the new branch, the immediate post-dominator reconvergence PC, and the active mask of the threads that diverge to this branch. In this model, the SIMT stack of each warp is updated after each instruction issue of that warp. In the absence of divergence, the target PC is simply updated to the next PC. In case of divergence, new entries are pushed onto the stack with the new target PC, the active mask corresponding to the threads that diverge to this PC, and their immediate reconvergence PC. Hence, a control hazard is detected if the next PC in the top entry of the SIMT stack does not equal the PC of the instruction currently under check.
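For contrast with the dual-path sketch in Section 3, below is a minimal sketch of a baseline single-path stack entry and the control-hazard test just described; the field names are ours, not GPGPU-Sim's.

    #include <cstdint>
    #include <vector>

    // Baseline (single-path) SIMT stack entry: one path per entry.
    struct SimtEntry {
        uint32_t targetPC;    // PC the active threads of this entry execute next
        uint32_t rpc;         // immediate post-dominator reconvergence PC
        uint32_t activeMask;  // threads diverging to this path
    };

    // A control hazard exists when the instruction being checked does not match
    // the next PC recorded at the top of the warp's SIMT stack.
    bool controlHazard(const std::vector<SimtEntry>& stack, uint32_t instPC)
    {
        return !stack.empty() && stack.back().targetPC != instPC;
    }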


4.2.4 Scoreboard [12]

The scoreboard checks for WAW and RAW dependency hazards. As explained above, the registers written to by a warp are reserved at the issue stage. The scoreboard is indexed by warp ID; it stores the required register numbers in the entry that corresponds to the warp ID. The reserved registers are released at the write-back stage.

As mentioned above, the decoded instruction of a warp is not scheduled for issue until the scoreboard indicates that no WAW or RAW hazards exist. The scoreboard detects WAW and RAW hazards by tracking which registers will be written to by an instruction that has been issued but has not yet written its results back to the register file.
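A minimal sketch of such a scoreboard, indexed by warp ID, with reserve/check/release operations; this is a simplification of the behaviour described above, not the simulator's implementation.

    #include <cstdint>
    #include <set>
    #include <vector>

    // Per-warp set of register numbers reserved by issued instructions that
    // have not yet written back.
    class Scoreboard {
        std::vector<std::set<uint32_t>> pendingWrites;  // one set per warp
    public:
        explicit Scoreboard(size_t numWarps) : pendingWrites(numWarps) {}

        // Reserve the destination registers of an instruction at issue.
        void reserve(uint32_t warp, const std::vector<uint32_t>& dstRegs) {
            pendingWrites[warp].insert(dstRegs.begin(), dstRegs.end());
        }

        // RAW/WAW check: the instruction may issue only if none of its source
        // or destination registers is still pending a write.
        bool hazardFree(uint32_t warp, const std::vector<uint32_t>& srcAndDst) const {
            for (uint32_t r : srcAndDst)
                if (pendingWrites[warp].count(r)) return false;
            return true;
        }

        // Release the reserved registers at write-back.
        void release(uint32_t warp, const std::vector<uint32_t>& dstRegs) {
            for (uint32_t r : dstRegs) pendingWrites[warp].erase(r);
        }
    };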

4.3 Benchmarks

4.3.1 Histogram (Histo) [18]

The Histogram Parboil benchmark is a straightforward histogramming operation that accumulates the number of occurrences of each output value in the input data set. The output histogram is a two-dimensional matrix of char-type bins that saturate at 255. The Parboil input sets, exemplary of a particular application setting in silicon-wafer verification, define the optimizations appropriate for the benchmark. The dimensions of the histogram (256 W x 8192 H) are very large, yet the input set follows a roughly Gaussian distribution centered in the output histogram. Recognizing this high concentration of contributions to the histogram's central region (referred to as the "eye"), the benchmark optimizations mainly focus on improving the throughput of contributions to this area. Prior to performing the histogramming, the optimized implementations for scratchpad memory run a kernel that determines the size of the eye by sampling the input data. Architectures with an implicit cache can forego such analysis, since the hardware cache will automatically prioritize the heavily accessed region wherever it may be.

Overall, the histogram benchmark demonstrates the high cost of random atomic updates to a large data set: the global atomic update penalty can sometimes outweigh a fixed-factor cost of redundantly reading input data.
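For illustration, a minimal CUDA histogramming kernel with saturating 8-bit bins is sketched below; it shows the per-element global atomic update pattern the benchmark stresses and is not the Parboil implementation (which privatizes the "eye" region in scratchpad memory).

    #include <cuda_runtime.h>
    #include <stdint.h>

    // Each thread processes one input element and atomically bumps a 32-bit
    // shadow counter for its bin.
    __global__ void histo_naive(const uint32_t* in, size_t n,
                                uint32_t* bins32, size_t numBins)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) {
            uint32_t bin = in[i] % numBins;   // map value to a bin
            atomicAdd(&bins32[bin], 1u);      // random global atomic update
        }
    }

    // Clamp the 32-bit counters to 255 to model the saturating char-type bins.
    __global__ void saturate_to_u8(const uint32_t* bins32, uint8_t* bins8,
                                   size_t numBins)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < numBins)
            bins8[i] = bins32[i] > 255u ? 255u : (uint8_t)bins32[i];
    }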

4.3.2 Stencil [18]

The importance of solving partial differential equations (PDEs) numerically, as well as the computationally intensive nature of this class of application, have made PDE solvers an interesting candidate for accelerators. The benchmark includes a stencil code representing an iterative Jacobi solver of the heat equation on a 3-D structured grid, which can also be used as a building block for more advanced multi-grid PDE solvers.


The GPU-optimized version draws from several published works on the topic, combining 2D blocking in the X-Y plane with register tiling (coarsening) along the Z direction, similar to the scheme developed by Datta et al. Even with these optimizations, the performance limitation is global memory bandwidth on the current GPU architectures we have tested.
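A minimal, unoptimized 7-point Jacobi sweep for the 3-D heat equation is sketched below to make the access pattern concrete; the blocking and register-tiling optimizations mentioned above are omitted, and the coefficients c0 and c1 are illustrative.

    #include <cuda_runtime.h>

    // One Jacobi sweep of a 7-point stencil on an nx x ny x nz grid (interior
    // points only). 'out' and 'in' are separate buffers, as Jacobi requires.
    __global__ void stencil7(const float* in, float* out,
                             int nx, int ny, int nz, float c0, float c1)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z * blockDim.z + threadIdx.z;
        if (x <= 0 || y <= 0 || z <= 0 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
            return;                                   // skip boundary points

        size_t plane = (size_t)nx * ny;
        size_t idx = (size_t)z * plane + (size_t)y * nx + x;  // linearized index
        out[idx] = c0 * in[idx]
                 + c1 * (in[idx - 1]     + in[idx + 1]        // +/- x neighbours
                       + in[idx - nx]    + in[idx + nx]       // +/- y neighbours
                       + in[idx - plane] + in[idx + plane]);  // +/- z neighbours
    }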

4.3.3 3D Laplace Solver (LPS) [19]

Laplace is a highly parallel finance application. As well as using shared memory, care was

taken by the application developer to ensure coalesced global memory accesses. We observe that

this benchmark suffers some performance loss due to branch divergence. We run one iteration on

a 100x100x100 grid.

4.3.4 Lattice-Boltzmann Method simulation (LBM) [19]

The Lattice-Boltzmann Method (LBM) is a method for solving the systems of partial differential equations governing fluid dynamics. Its implementations typically represent a cell of the lattice with 20 words of data: 18 represent fluid flows through the 6 faces and 12 edges of the lattice cell, one represents the density of fluid within the cell, and one represents the cell type or other properties, e.g., to differentiate obstacles from fluid. In each timestep, every cell uses the input flows to compute the resulting output flows from that cell and an updated local fluid density.

Although some similarities are apparent, the major difference between LBM and a

stencil application is that no input data is shared between cells; the fluid flowing into a cell is not

read by any other cell. Therefore, the application has been memory-bandwidth bound in current

studies, and optimization efforts have focused on improving achieved memory bandwidth.
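To make the 20-word cell layout concrete, one possible data structure is sketched below; the field names and the array-of-structures layout are illustrative only (optimized GPU implementations typically prefer a structure-of-arrays layout for memory-bandwidth reasons).

    #include <stdint.h>

    // Illustrative layout of one lattice cell: 18 directional flow values
    // (6 faces + 12 edges), one density word, and one type/flags word = 20 words.
    struct LBMCell {
        float flows[18];   // distribution values through the 6 faces and 12 edges
        float density;     // local fluid density within the cell
        uint32_t type;     // cell type / flags, e.g. obstacle vs. fluid
    };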

4.3.5 Ray Tracing (RAY) [19]

Ray-tracing is a method of rendering graphics with near photo-realism. In this

implementation, each pixel rendered corresponds to a scalar thread in CUDA. Up to 5 levels of

reflections and shadows are taken into account, so thread behavior depends on what object the

ray hits (if it hits any at all), making the kernel susceptible to branch divergence. We simulate

rendering of a 256x256 image.
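The divergence pattern described above can be illustrated with a minimal per-pixel kernel skeleton (not the benchmark's actual code): threads in the same warp take different branches depending on what their rays hit and loop for a data-dependent number of bounce levels. The trace() and shade() stubs are placeholders for the benchmark's intersection and shading routines.

    #include <cuda_runtime.h>

    #define MAX_LEVELS 5  // mirrors the 5 levels of reflections and shadows

    // Placeholder intersection test: returns a hit-object id, or -1 on a miss.
    __device__ int trace(int x, int y, int level) {
        return ((x ^ y) + level) % 3 - 1;   // arbitrary data-dependent result
    }

    // Placeholder shading: stands in for reflection/shadow color accumulation.
    __device__ float3 shade(float3 c, int hit, int level) {
        c.x += 0.1f * hit; c.y += 0.05f * level; c.z += 0.01f;
        return c;
    }

    __global__ void ray_kernel(float3* image, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float3 color = make_float3(0.f, 0.f, 0.f);
        for (int level = 0; level < MAX_LEVELS; ++level) {
            int hit = trace(x, y, level);       // which object, if any, the ray hits
            if (hit < 0) break;                 // some lanes miss and exit early,
            color = shade(color, hit, level);   // others keep bouncing: divergence
        }
        image[y * width + x] = color;
    }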

4.3.6 Heart Wall [5]

The Heart Wall application tracks the movement of a mouse heart over a sequence of 104 ultrasound images of 609x590 pixels to record the response to a stimulus. In its initial stage, the program performs image-processing operations on the first image to detect the initial, partial shapes of the inner and outer heart walls. These operations include edge detection, SRAD despeckling (also part of the Rodinia suite), morphological transformation, and dilation. In order to reconstruct approximated full shapes of the heart walls, the program generates ellipses that are superimposed over the image and sampled to mark points on the heart walls (Hough search).


In its final stage (heart wall tracking, presented here), the program tracks the movement of surfaces by detecting the movement of image areas under the sample points as the shapes of the heart walls change throughout the sequence of images. Only two stages of the application, SRAD and Tracking, contain enough data/task parallelism to be interesting for the Rodinia suite, and they are therefore presented separately. The separation of these two parts of the application allows easy analysis of the distinct types of workloads; the Rodinia developers plan to convert the entire application to OpenMP and CUDA and make all of its parts available together as well as separately in the Rodinia suite.

Tracking is the final stage of the Heart Wall application. It takes the positions of the heart walls from the first ultrasound image in the sequence, as determined by the initial detection stage of the application. Partitioning of the working set between caches and avoiding cache thrashing contribute to the performance. The CUDA implementation of this code is a classic example of the exploitation of braided parallelism.

4.3.7 LUD [5]

LU decomposition (lud_cuda) is an algorithm used to calculate the solutions of a set of linear equations. The LUD kernel decomposes a matrix as the product of a lower triangular matrix and an upper triangular matrix.

Figure 4.3: GPGPU-Sim configuration

Number of SMs: 15
Threads per SM: 1536
Threads per warp: 32
Registers per SM: 32768
Shared memory per SM: 48 KB
Number of warp schedulers: 2
Warp scheduling policy: Round-robin
L1 data cache (size/associativity/line size): 16 KB / 4-way / 128 B
L2 cache (size/associativity/line size): 768 KB / 8-way / 256 B
Number of memory channels: 6
Clock (Core : Interconnect : L2 : DRAM): 700 : 1400 : 700 : 924 MHz
Memory controller: Out-of-order (FR-FCFS)
DRAM latency: 100 cycles


Figure 4.4: Description of the benchmarks used

Benchmark    Description / Area                             Kernels   Instructions   Memory Instructions (%)
LUD          LU Decomposition                               46        40M            7.1665
LBM          Lattice-Boltzmann Method simulation            100       55936M         0.772
LPS          3D Laplace Solver                              1         72M            7.065
HEARTWALL    Shapes of heart walls over ultrasound images   5         35236M         2.619
HISTO        Histogram operation                            80        2348M          16.79
RAY          Ray tracing                                    1         62M            21.2
STENCIL      PDE solver (stencil)                           100       2775M          8.954



5. RESULTS AND CONCLUSIONS

Figure 5.1: Comparison based on instructions per cycle (IPC)

Benchmark     IPC (Baseline Architecture)   IPC (Enhanced Reconvergence Stack Arch.)   Improvement (%)
LUD           20.59805                      24.4674745                                 18.78539
LBM           54.52345                      66.28455                                   21.57072
LPS           222.43355                     252.359                                    13.45366
HEARTWALL     209.9862                      227.40495                                  8.295188
HISTO         116.18875                     125.523765                                 8.034354
RAY           217.10065                     231.50103                                  6.633043
STENCIL       264.92875                     279.685                                    5.569894
Average percentage improvement: 11.7631

Figure 5.2: Normalized graph depicting the increase in IPC [bar chart of IPC for the baseline and the enhanced reconvergence stack architectures, normalized to the baseline].

The average improvement in IPC is approximately 11.8%. The maximum improvement is seen for the Lattice-Boltzmann Method simulation (LBM) benchmark, at around 21.6%.
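For clarity, the per-benchmark figures in the last column follow directly from the two IPC columns:

Improvement (%) = (IPC_enhanced - IPC_baseline) / IPC_baseline x 100

For example, for LBM: (66.28455 - 54.52345) / 54.52345 x 100 = 21.57%, matching the table. The reported average (11.7631%) is the arithmetic mean of the per-benchmark improvements.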


Figure 5.3: Comparison based on the total number of stalls

Benchmark     Total Stalls (Baseline Architecture)   Total Stalls (Enhanced Reconvergence Stack Arch.)   Decrease in Stalls (%)
LUD           9934860                                8845544                                             10.965
LBM           3256573992                             2355437042                                          27.671
LPS           1466837                                1133817                                             22.703
HEARTWALL     882978728                              764754980                                           13.389
HISTO         219281118                              191740591                                           12.559
RAY           2070150                                1772489                                             14.379
STENCIL       56491686                               52808554                                            6.5198
Average percentage decrease in stalls: 15.455

Figure 5.4: Normalized graph depicting the decrease in the number of stalls [bar chart of total stalls for the baseline and the enhanced reconvergence stack architectures, normalized to the baseline].

The average decrease in the total number of stalls is approximately 15.5%. The maximum improvement is seen for the Lattice-Boltzmann Method simulation (LBM) benchmark, at around 27.7%.

Future Scope: We can employ the dynamic warp formation method [2] to further improve SIMD utilization, which we have not focused on in this work.


6. REFERENCES

1. John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, 5th Edition.
2. W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In 40th International Symposium on Microarchitecture (MICRO-40), December 2007.
3. Minsoo Rhu and Mattan Erez. The Dual-Path Execution Model for Efficient GPU Control Flow. In 19th IEEE International Symposium on High-Performance Computer Architecture (HPCA-19), Shenzhen, China, February 2013.
4. A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2009), April 2009.
5. S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In IEEE International Symposium on Workload Characterization (IISWC-2009), October 2009.
6. IMPACT Research Group. The Parboil Benchmark Suite. http://www.crhc.uiuc.edu/IMPACT/parboil.php
7. NVIDIA Corporation. NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009.
8. NVIDIA Corporation. CUDA Toolkit, C/C++ SDK Code Samples.
9. Thomas Scott Crow. Evolution of the Graphical Processing Unit.
10. http://cpudb.stanford.edu/ (a database of processors built by Stanford University's VLSI Research group).
11. https://www.udacity.com/course/cs344 (a lecture series by David Luebke of NVIDIA Research and John Owens of the University of California, Davis).
12. GPGPU-Sim. http://www.gpgpu-sim.org
13. Jonathan Palacios and Josh Triska. A Comparison of Modern GPU and CPU Architectures: And the Common Convergence of Both.
14. Ahmad Lashgar, Amirali Baniasadi, and Ahmad Khonsari. Warp Size Impact in GPUs: Large or Small?
15. J. Meng, D. Tarjan, and K. Skadron. Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. In 37th International Symposium on Computer Architecture (ISCA-37), 2010.
16. V. Narasiman, C. Lee, M. Shebanow, R. Miftakhutdinov, O. Mutlu, and Y. Patt. Improving GPU Performance via Large Warps and Two-Level Warp Scheduling. In 44th International Symposium on Microarchitecture (MICRO-44), December 2011.
17. Intel Corporation. Intel HD Graphics Open Source Programmer Reference Manual, June 2011.


18. John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. IMPACT Technical Report IMPACT-12-01.
19. Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2009), April 2009.