
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Bank Stealing for a Compact and Efficient Register File Architecture in GPGPU

Naifeng Jing, Shunning Jiang, Shuang Chen, Jingjie Zhang, Li Jiang, Chao Li, Member, IEEE, and Xiaoyao Liang

Abstract— Modern general-purpose graphic processing units (GPGPUs) have emerged as pervasive alternatives for parallel high-performance computing. The extreme multithreading in modern GPGPUs demands a large register file (RF), which is typically organized into multiple banks to support the massive parallelism. Although a heavily banked structure benefits RF throughput, its associated area and energy costs with diminishing performance gains greatly limit future RF scaling. In this paper, we propose an improved RF design with bank stealing techniques, which enables a high RF throughput with a compact area. By deeply investigating the GPGPU microarchitecture, we find that the state-of-the-art RF design is far from optimal due to its deficiency in bank utilization, which is the intrinsic limitation to a high RF throughput and a compact RF area. We investigate the causes for bank conflicts and identify that most conflicts can be eliminated by leveraging the fact that the highly banked RF oftentimes experiences underutilization. This is especially true in GPGPUs, where multiple ready warps are available at the scheduling stage with their operands to be wisely coordinated. In this paper, we propose two lightweight bank stealing techniques that can opportunistically fill the idle banks and register entries for better operand service. Using the proposed architecture, the average GPGPU performance can be improved under a smaller energy budget with significant area saving, which makes it promising for sustainable RF scaling.

Index Terms— Area efficiency, bank conflict, bank stealing, general purpose graphic processing unit (GPGPU), register file (RF), register liveness.

Manuscript received December 16, 2015; revised March 9, 2016 and May 4, 2016; accepted June 4, 2016. This work was supported in part by the National Natural Science Foundation of China under Grant 61402285, Grant 61202026, and Grant 61332001, in part by the China Postdoctoral Science Foundation under Grant 2014T70418, and in part by the Program of the China National 1000 Young Talent Plan and Shanghai Science and Technology Committee under Grant 15YF1406000. (Corresponding author: Xiaoyao Liang.)

N. Jing, L. Jiang, C. Li, and X. Liang are with Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; [email protected]; [email protected]).

S. Jiang and S. Chen were with Shanghai Jiao Tong University, Shanghai 200240, China. They are now with Cornell University, Ithaca, NY 14850 USA (e-mail: [email protected]; [email protected]).

J. Zhang is with Fudan University, Shanghai 200433, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2016.2584623

I. INTRODUCTION

MODERN general-purpose graphic processing units (GPGPUs) have emerged as pervasive alternatives for high-performance computing. GPGPU exploits the inherent massive parallelism by utilizing a large number of concurrent threads executed in a single instruction multiple threads (SIMT) manner. To hide the long memory access latency during thread execution, SIMT execution features a fast thread switching mechanism, termed zero-cost context switching, supported by a large-sized register file (RF), where the active contexts of the concurrent threads can be held in each streaming multiprocessor (SM). For instance, the classic Nvidia Fermi architecture includes 15 SMs, where each SM is equipped with 32 768 32-bit registers to support 1536 concurrent threads, resulting in nearly 2 MB of on-chip RF storage in a single GPU [1]. The capacity keeps increasing with each product generation, e.g., doubling in the Kepler architecture [2], in pursuit of even higher thread-level parallelism (TLP).
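The quoted capacity figures can be sanity-checked with a line of arithmetic (values from [1]):

```python
# Fermi RF capacity check: 32 768 registers per SM, 4 bytes each, 15 SMs.
regs_per_sm, bytes_per_reg, num_sms = 32_768, 4, 15
kb_per_sm = regs_per_sm * bytes_per_reg / 1024   # 128.0 KB per SM
total_mb = kb_per_sm * num_sms / 1024            # 1.875 MB, i.e., "nearly 2 MB"
assert kb_per_sm == 128.0 and total_mb == 1.875
```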

To manage such a large RF, a multibanked structure is widely used [3]–[5]. A multibanked RF sustains the multiported RF throughput by serving concurrent operand accesses as long as there are no bank conflicts. If a conflict occurs, the requests have to be serialized, and the RF throughput is reduced, leading to performance loss. As a result, in the face of RF capacity scaling in future GPGPUs, adding more banks is a natural way to accommodate the increasing number of registers without worsening the bank conflict problem. Otherwise, the performance gained from higher TLP could be compromised, because more threads from schedulers will be competing for an insufficient number of RF banks, which aggravates the conflicts and reduces the throughput.

However, banks are not free. Adding banks requires extra column circuitry, such as sense amplifiers, prechargers, bitline drivers, and a bundle of wires [6]. Meanwhile, the crossbar between banks and operand collectors grows significantly with the increasing number of banks. All these issues affect the RF area, timing, and power [7]. As modern GPGPUs tend to become area-limited, and they are already power-limited as pointed out in [5], it is important to incorporate a suitable number of banks while seeking more economical ways to mitigate the conflicts for the ever-increasing RF capacity.

In this paper, we explicitly address the issue of an efficient RF design to sustain the capacity scaling in future GPGPUs. We first identify the causes for the bank conflicts. Then, in the presence of the many bubbles observed in bank utilization and RF entries, we propose to apply fine-grained architectural control by stealing the idle banks to serve concurrent operand accesses, which effectively recovers the throughput loss due to bank conflicts. By leveraging the multiwarp and multibank characteristics in GPGPUs, we

1063-8210 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Fig. 1. Block diagram of a typical GPGPU pipeline, highlighting the banked RF with operand collectors and a crossbar unit.

make lightweight modifications on the compiler as well as the pipeline for a more balanced utilization of the existing RF resources instead of simply adding more banks. Experimental results demonstrate that our proposed RF with bank stealing can deliver better performance, less energy, and significant area saving.

This paper differs from the previous proposals in many ways. For example, we leverage single-ported static random access memory (SRAM) to build the RF. Although it may cause more conflicts, as demonstrated in Section III-B, it is more area- and power-efficient than the multiported SRAM or pseudodual-port RF solutions for the conflict problem, which add at least 30% area, as reported in [8]. At the same time, we stick to the conventional monolithic banking structure, which has lower design complexity and a more compact area in the datapath compared with multilevel RF designs, such as the RF cache [4]. We also step forward from our prior work [9] by identifying all the possible bank conflicts and proposing a combined solution. Using the proposed bank stealing techniques and combined structure, the performance of our single-ported and monolithic RF architecture is comparable to or even better than other types of RF designs, while the much reduced silicon area makes it a promising alternative for RF capacity scaling in future GPGPUs.

The remainder of this paper is organized as follows. Section II reviews the background and preliminaries of modern GPGPU architecture. Section III introduces our motivation. Section IV proposes our techniques with detailed architectural support. Sections V and VI present the evaluation methodology and experimental results, respectively. Finally, Section VII describes related work, and Section VIII concludes this paper.

II. BACKGROUND AND PRELIMINARIES

A. High-Throughput GPGPU Architecture

A modern GPGPU consists of a scalable array of multithreaded SMs, which execute thousands of concurrent threads to enable the massive TLP for extraordinary computational throughput. Inside each SM, threads are often issued in a warp. At any given clock cycle, a ready warp is selected and issued to the data processing lanes by the scheduler. For example, in the classic Nvidia Fermi GPGPU architecture, each SM typically has 32 data processing lanes, two schedulers, and a large number of execution units.

To keep the execution units busy even if some warps are stalled for long memory operations, a large number of warps can be switched with zero-cycle penalty, which requires that every thread in an SM is allocated some dedicated physical registers in the RF. As a result, the RF size is generally very large. More importantly, the RF capacity keeps doubling for every product generation in pursuit of even larger parallelism.

B. Register File

The requirement for simultaneous multiple read/write operations poses challenges to the RF design. For example, in the Nvidia PTX standard [10], each instruction can read up to three registers and write one register. To guarantee the throughput, the simplest solution is to use a multiported SRAM, which requires two more transistors per bit for an additional port. Obviously, constructing such a heavily ported (six read and two write ports for dual-issue) large RF (hundreds of kilobytes) is typically not feasible or economical.

To solve this problem, the banked RF structure [3]–[5] has been proposed to avoid the unaffordable area and power overhead of multiported designs. While there are various possible RF organizations, one of the most commonly used is to combine multiple single-ported banks into a monolithic RF, which reduces the power and design complexity. Each bank has its own decoder, sense amplifiers, and so on to operate independently. Registers associated with each warp are mapped to the banks according to the register/warp identifiers and the maximum number of registers a thread can use [3], such that different registers in a warp and the same registers in different warps can be evenly distributed across the multiple physical banks. In this way, the mapping offers fair balancing and a higher possibility for the multiple operands in an instruction to be fetched across multiple banks in a single cycle with no conflicts. Otherwise, they have to be serialized, degrading the RF throughput.
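As a concrete (and simplified) illustration of this identifier-based mapping, the sketch below hashes a warp/register pair onto a bank. The modulo scheme and the names `warp_id`, `reg_id`, and `num_banks` are our assumptions for illustration, not the exact hardware hash:

```python
def bank_of(warp_id: int, reg_id: int, num_banks: int = 16) -> int:
    """Map a warped register to a physical bank (illustrative modulo hash;
    the paper describes mapping by register/warp identifiers [3])."""
    return (warp_id + reg_id) % num_banks

# Different registers of one warp spread across banks:
assert {bank_of(0, r) for r in range(4)} == {0, 1, 2, 3}
# The same register in different warps also lands in different banks:
assert bank_of(0, 5) != bank_of(1, 5)
```

Under such a mapping, an instruction's operands usually land in distinct banks and can be fetched in one cycle; only colliding requests must be serialized.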

A typical banked RF structure is shown in Fig. 1. For a scheduled instruction to be dispatched into execution, it first reserves a free operand collector. The collectors can receive multiple operands via the crossbar concurrently. If running out of collectors, the instruction waits until a collector becomes available. The arbiter receives RF requests from the collectors, and then tries to distribute them to different banks, but only allows a single access to the same bank at a time. Once bank conflicts occur, the requests have to be serialized. For example, write operations usually take higher priority over reads on


Fig. 2. Performance with different numbers of banks.

accessing the same bank [7], [11], [12], anticipating an early clearance of the dependent instructions on the scoreboard. After all the operands for an instruction are collected, it is dispatched to a dedicated execution unit, e.g., integer, special function, or ld/st units, and the hosting collector is released. Note that the bank conflicts directly affect the operand fetching latency, and then saturate the limited number of collectors dedicated to the execution units, eventually causing issue stalls at the beginning of the execution pipeline. With fewer conflicts, collectors can turn around faster and better serve the execution units.
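The arbitration policy just described can be sketched in a few lines. The request tuples and the `arbitrate` helper are illustrative, but the grant rule follows the text: at most one access per bank per cycle, with writes considered before reads [7], [11], [12]:

```python
def arbitrate(requests):
    """requests: list of (op, bank) with op in {'R', 'W'}.
    Grant one access per bank per cycle, writes first; every
    losing request is deferred to a later cycle (illustrative model)."""
    granted, deferred, busy = [], [], set()
    # Writes take priority, so sort them ahead of reads (stable sort).
    for req in sorted(requests, key=lambda r: r[0] != 'W'):
        if req[1] in busy:
            deferred.append(req)   # bank already granted this cycle
        else:
            busy.add(req[1])
            granted.append(req)
    return granted, deferred

g, d = arbitrate([('R', 3), ('W', 3), ('R', 5)])
# The write to bank 3 wins; the read of bank 3 is deferred a cycle.
assert g == [('W', 3), ('R', 5)] and d == [('R', 3)]
```

Every deferred request is exactly the serialization penalty discussed above: the losing operand waits, its collector turns around more slowly, and issue stalls follow.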

III. MOTIVATION

In pursuit of larger TLP, the RF capacity keeps doubling for every product generation. Conventional designs simply apply coarse-grained structural expansion by adding more banks to accommodate more registers. However, this solution is not scalable due to the significant area and power overheads with marginal performance gains.

A. Design Space Exploration

Conflicts have been acknowledged as the major reason for reduced RF throughput. Adding banks can improve the throughput, because the possibility of conflicts is reduced by distributing registers into more independent banks. In this section, we conduct a design space exploration to experimentally study how banking impacts the performance, area, delay, and power of a pilot RF design described in Section V.

1) Performance: We vary the number of banks with the total RF capacity fixed to 128 KB per SM, which is typical in Fermi-like GPGPUs [1]. The performance results are averaged across all the benchmarks and shown in Fig. 2, normalized to the 4-bank RF. Fig. 2 shows that adding more banks benefits the performance significantly when a small number of banks is used. With increasing banks, the benefit diminishes, as the 16-bank RF delivers almost the same performance as the 32-bank case due to the much reduced conflicts. The exploration reveals that 16 banks can be optimal for the 128-KB RF.
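The diminishing returns from adding banks can also be seen with a simple idealized model: if an instruction window's operand requests landed on uniformly random banks (an assumption for illustration, not the paper's measured distribution), the probability that they are conflict-free grows with bank count but saturates:

```python
def no_conflict_prob(requests: int, banks: int) -> float:
    """Probability that `requests` uniformly random bank accesses are
    pairwise conflict-free (idealized birthday-style model)."""
    p = 1.0
    for i in range(requests):
        p *= (banks - i) / banks
    return p

# 4 concurrent requests over 8 / 16 / 32 banks:
# ~0.41 / ~0.67 / ~0.82 conflict-free, i.e., large gain from 8 to 16
# banks, smaller gain from 16 to 32 -- consistent with the trend in Fig. 2.
```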

We also conduct the exploration with 2× TLP on a 256-KB RF and 4× TLP on a 512-KB RF per SM to study the features in newer GPGPU architectures, such as Kepler. Detailed results are given in Section VI-C. Briefly, it is observed that the optimal design points shift toward more banks with a larger RF for higher throughput. When more warp instructions are competing for RF accesses, the number of banks needs to be increased to reduce the chance of conflicts. Otherwise, the performance gains from larger TLP can be compromised.

2) RF Area: We use the 128-KB RF as a case study. Fig. 3(a) shows the breakdown of the banking area, such as the array cells, peripheral circuitry, and crossbar. It clearly shows that more banks cause significant area overhead. The area of the array cells stays constant across different schemes due to the fixed RF capacity. In contrast, the area of the peripheral circuitry, such as decoders, prechargers, equalizers, bitline multiplexers, sense amplifiers, and output drivers, expands notably with increased banks. This is because the GPGPU registers are 128 bytes wide,1 much wider than those in a CPU. This significantly increases the number of bitlines and widens the data buses. Each individual bank duplicates the peripheral circuitry for independent array control, resulting in even larger area expansion compared with the conventional CPUs [13].

Meanwhile, the dimension of the crossbar between RF banks and collectors grows significantly with the number of banks. As a result, nearly 50% additional area is observed from a 16-bank RF to a 32-bank RF, although the two designs deliver quite similar performance, as shown in Fig. 2.

3) RF Timing: Because we fix the RF capacity, a larger bank count naturally results in shorter access time inside the memory array, as shown in Fig. 3(b). However, array access time is just a portion of the total RF delay, and the advantage is compromised by the wire transfer time via a larger crossbar. With an increasing number of banks, the crossbar delay tends to dominate. However, we find that even the worst case scheme studied in this paper can still meet the sub-1-GHz GPGPU frequency, so we do not consider their timing differences.

4) RF Power: Fig. 3(c) shows the dynamic power of the banks. The cell array power slightly decreases with more banks, because the bitlines are shorter under fixed RF capacity. In contrast, the dynamic power spent on data transfers via the crossbar increases considerably due to the larger RF area. Meanwhile, the leakage power increases significantly, as shown in Fig. 3(d), because the banks duplicate the peripheral circuitry.

5) Summary: The design space exploration reveals that fewer banks benefit the RF physical design in area, timing, and power. However, simply reducing the number of banks is not acceptable from the performance point of view. For example, there is around 5% performance loss from a 16-bank design to an 8-bank design, as shown in Fig. 2. Nevertheless, this paper shows that there is still room for performance improvement even above the conventional 16-bank RF if the conflicts can be effectively eliminated, as we will demonstrate with our proposed 8-bank RF design.

B. Causes for Bank Conflicts

Without conflicts, the operands should pass through the RF with one cycle latency, offering maximum throughput. In reality, the operands have to go through many steps until they can be dispatched to the execution units. An operand starts its journey by first going through the arbiter for granted

1There are 32 threads per warp, each with a register of 32 bits. These threads access the RF simultaneously for the 32 registers, resulting in a combination of 128 bytes for each warped register access.


Fig. 3. Design space exploration with a varying number of banks. (a) Normalized area. (b) Normalized delay. (c) Normalized dynamic power. (d) Normalized leakage power.

Fig. 4. Latency penalty due to bank conflicts (16-bank RF).

access to a bank, then passing through the crossbar to temporarily reside in a collector, and finally being dispatched. In the case of a conflict, only one operand can proceed, causing a latency penalty to the other competing operands. We identify two major types of conflicts that lead to the latency penalty.

1) Interread Conflict: This occurs between different warp instructions. For example, reading register ra may conflict with a read to the same bank from another warp instruction.

2) Interwrite Conflict: For example, reading ra may conflict with a write to the same bank from another warp instruction. When the write is prioritized, ra has to be deferred.

We conduct architectural simulation to quantify the conflict impact in terms of the penalty on average RF access latency. A higher penalty means longer operand waiting, which results in reduced RF throughput and more issue stalls. From Fig. 4, we see that the interread and interwrite conflicts are quite common, resulting in an extra 0.1 to 0.6 cycle latency that directly causes performance loss. Another type of conflict (i.e., the intrainstruction conflict) comes from competing operands in the same instruction, which rarely happens and can be effectively eliminated by compiler renaming.
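A minimal classifier for the two conflict types above, assuming each request is reduced to an (operation, bank) pair; the tuple shape and helper name are ours, chosen purely for illustration:

```python
def conflict_type(req_a, req_b):
    """Classify a pairwise bank conflict between two warp instructions.
    req = (op, bank) with op in {'R', 'W'}; returns None when the
    banks differ, i.e., no conflict (illustrative model)."""
    op_a, bank_a = req_a
    op_b, bank_b = req_b
    if bank_a != bank_b:
        return None
    if op_a == 'R' and op_b == 'R':
        return 'interread'
    # A read deferred behind a prioritized write (or write vs. write).
    return 'interwrite'

assert conflict_type(('R', 2), ('R', 2)) == 'interread'
assert conflict_type(('R', 2), ('W', 2)) == 'interwrite'
assert conflict_type(('R', 2), ('R', 3)) is None
```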

C. RF Utilization Rate

Even for the optimal design point with 16 banks in Fig. 2, we still observe underutilization of the RF resources. Fig. 5 shows the average utilization rates over all the banks in the RF for different benchmarks. On average, the banks are accessed only 25% of the total execution time.

This phenomenon of low utilization can be explained by runtime factors, such as data and control hazards, and misses on memory units that might incur waiting cycles in the RF. Unbalanced operand loading of PTX instructions is another factor. Not all instructions fully use the three source operands in the PTX format, leading to unbalanced stress on the

Fig. 5. Utilization rates for different benchmarks (16-bank RF).

RF resources. In addition, the ideal case of accessing all the banks in the same cycle can hardly be sustained in a real program. At runtime, operands may not be evenly distributed across banks, leaving some banks free of accesses.

Note that a high utilization does not necessarily mean no bank conflicts. For example, the rate can be increased by using fewer banks. However, if the bank conflicts are not adequately handled, a noticeable performance loss can be observed, as shown in Fig. 2. Instead, the low utilization observed in the current RF reveals the necessity and possibility of an intelligent operand coordination, which can eliminate the conflicts and accelerate the operand fetching throughout the program execution.

IV. PROPOSED TECHNIQUES

A. Overview

In this paper, to reduce the conflicting accesses for faster operand fetching, we propose bank stealing techniques to opportunistically steal unused banks for better operand coordination. For an interread conflict, we steal a free bank at the current cycle for an operand that is supposed to be fetched in the next cycle, avoiding the potential upcoming conflict. For an interwrite conflict, we steal an idle bank with a free register entry to temporarily store the write-back value, avoiding the conflict with a read. The bank stealing techniques effectively leverage the underutilized banks to maximally serve the pending RF accesses.

In fact, the stealing techniques adequately exploit the multiwarp, multibank feature of GPGPUs. Unlike a CPU, the GPGPU pipeline is likely to have multiple ready warps waiting at the issue stage (14 out of 48 warps as observed in our simulation). This makes wise operand coordination possible, especially with the underutilized banking resources.


Fig. 6. Operand read with bank stealing for interread conflict removal.

In addition, as banked RF structures are very common and essential for SIMT, the proposed schemes can be broadly applicable to a wide range of GPGPU architectures.

B. Bank Stealing for Interread Conflicts

1) Basic Idea: In the conventional RF design, whenever a conflict occurs, one of the register reads has to be deferred, causing a pipeline stall. However, if one of the conflicting reads can be moved to the previous cycle and steal the bank when it is vacant, the conflict and pipeline stall can be eliminated.

As shown in Fig. 6, at time T, warps Wa and Wb are issued at the pipeline issue stage. Here, we assume that both instructions have only one operand to read for simplicity. Their operands ra and rb are granted to read the RF by passing the conflict checking at the arbitration stage at T + 1. In addition, at T + 1, warps Wc and Wd are issued, and their operands rc and rd are found to conflict in the arbitration stage at T + 2. Conventionally, either rc or rd has to be delayed, resulting in serialized reads at T + 3 and T + 4. In contrast, in our bank stealing scheme, the read of rd can be moved one cycle earlier if it does not conflict with ra and rb, by passing conflict checking at T + 1. In that case, Wc and Wd can finish operand reading at T + 2 and T + 3 without a pipeline stall, saving one cycle of penalty. However, if rd conflicts with ra or rb at T + 1, the stealing is aborted and the pipeline behaves as before. In fact, bank stealing increases the opportunity for read success by stealing the unused RF bandwidth in the previous cycle.
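The stealing decision in this example reduces to a simple check, sketched below; `current_banks` (the banks already granted to ra and rb) and the free-collector flag are our assumed inputs, not the paper's exact signals:

```python
def can_steal_read(current_banks, next_bank, free_collector=True):
    """Interread bank stealing (a sketch of the Fig. 6 scenario): a
    next-cycle operand may read one cycle early iff its bank is idle
    under the current cycle's grants and a collector is free to hold
    the stolen value."""
    return free_collector and next_bank not in current_banks

# ra, rb occupy banks {1, 4}; rd targets bank 7 -> stealing succeeds.
assert can_steal_read({1, 4}, 7)
# rd would collide with ra in bank 1 -> stealing is aborted.
assert not can_steal_read({1, 4}, 1)
# No free collector -> stealing is aborted even with an idle bank.
assert not can_steal_read({1, 4}, 7, free_collector=False)
```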

2) Warp Scheduling for Read Stealing: To enable read stealing, we need to identify the warps (Wc and Wd) to be issued in the next cycle (T + 1) at the arbitration stage of the current warps (Wa and Wb), and check for the bank conflicts of operands (ra, rb, and rd) between the current and next warp instructions. The question is how to identify the candidate warps to be issued one cycle ahead for bank stealing with the current warps, so that we can put their operands into the bank conflict checking in the arbitration stage of the current warp instructions.

In fact, this is feasible due to the warp organization featured in GPGPUs. It is very common that multiple warps are ready and waiting for issue in the instruction pool. The selection of the candidate warp can be done with the aid of the default warp scheduler. As shown in Fig. 7, a scheduler typically uses a hierarchical selection tree after the ready checking to select and wake up a warp for issue, as indicated in [14]. The ready checking considers only the warps that are eligible for

Fig. 7. Warp selection for read stealing aided by the default scheduler.

execution without hazards, and the hierarchical tree selects the warp with the highest priority at the bottom level. Note that at one stage above the bottom level of the tree, there are two warps. In our scheme, we always treat the other warp that is not selected for issue at the current cycle as the candidate warp for stealing. The operands of the current and candidate warps will be latched into the arbitration stage to check for conflicts. Upon conflicts, the operands can be read ahead if two conditions are met: 1) there must be a free collector to hold the stolen operands and 2) the stealing should be able to resolve the conflicts in the next cycle and not cause a conflict in the current cycle. Note that the scheduler should be overridden in the next cycle to prioritize the candidate warp for issue if its operands have been read ahead.

3) Design Cost and Scheduler Impact: As discussed earlier, the candidate warp is selected by reusing the default scheduling and then latched into the next arbitration stage. To enable read stealing, the default scheduler should be overridden in the next cycle when a stealing has been performed. As shown in Fig. 7, this logic is simple, involving only three multiplexors. Multiplexor m1 selects the current warp from the selection tree if the read stealing has failed. Otherwise, it overrides the scheduler by selecting the candidate warp latched in the last cycle. On the other hand, to update the candidate warp accordingly, m2 still selects the other warp from one level above the bottom as the candidate warp if the read stealing has failed. Otherwise, it selects the current warp as the next candidate warp for stealing, because the warp stolen for in the last cycle is being issued now. m3 gives the selecting signal depending on the final scheduling decision left_sel.
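The m1/m2 override can be summarized functionally as below. The function and argument names are our own shorthand for the Fig. 7 signals; the sketch assumes a single success flag fed back from arbitration.

```python
def schedule_next(tree_winner, tree_sibling, latched_candidate, steal_ok):
    """Returns (warp to issue next cycle, next stealing candidate).
    If the candidate's operands were read ahead (steal_ok), m1 issues the
    latched candidate and m2 makes the displaced winner the new candidate;
    otherwise both follow the selection tree as usual."""
    if steal_ok:
        return latched_candidate, tree_winner
    return tree_winner, tree_sibling
```

For example, if the tree picks Wa (sibling Wb) but Wd's operands were read ahead last cycle, Wd is issued and Wa becomes the new candidate.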

Regarding the timing impact, both m1 and m2 depend on the left_sel signal from the selection tree and on the read-stealing-succeeded signal fed back from the arbitration stage. To minimize the impact, the feedback signal from the arbitration stage should be available as early as possible. Fortunately, it is off the critical path in the arbitration stage and can be asserted immediately once we find that there are no requests of higher priority. Therefore, m3 does not add extra delay, because it aligns with the multiplexor in the last level of the selection tree. m2 aligns with m1 to give the current warp and the candidate warp. Then, the two paths can have the same gate-level depth by adding a 2-to-1 multiplexor to the baseline, whose delay is estimated as 1.2× of an FO4 inverter delay.


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

6 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Fig. 8. Write stealing for interwrite conflict removal.

With stealing enabled, the instruction flow might deviate from the original scheduler, but the impact on performance turns out to be small, as observed in our experiments. That is because the candidate warps are chosen from the second level to the bottom of the selection tree, and are likely to be scheduled in the next cycle by the default scheduler anyway, albeit not always with the next highest priority. It is also possible to select other ready warps as the candidate warp for stealing, but the performance is then subject to more variation. We select the candidate warp as proposed in Fig. 7, which is compatible with the conventional scheduler, so similar performance can be expected. For example, when read stealing is built upon the greedy-then-oldest (GTO) scheduler, the original scheduling sequence Wa, Wa, Wb, Wb, . . . may be changed to Wa, Wb, Wa, Wb, . . . , depending on the actual conflicts. In this way, the scheduler with stealing is not designed for conflicts alone. It can maximally inherit prior wisdom on scheduler designs, such as GTO, round-robin, and other schedulers [14], [15] essentially built upon them, while taking bank conflicts into account.

The essence of bank stealing lies in the fact that the RF banks experience underutilization, as shown in Fig. 5. By coordinating operands across multiple warps, it leverages the banks that are idle in the current cycle and opportunistically fills them with predicted future reads. Note that read stealing does not reduce the total number of reads; it just makes a better and more balanced utilization of the available RF bandwidth.

C. Bank Stealing for Interwrite Conflicts

1) Basic Idea: In this section, we propose another type of bank stealing to eliminate interwrite conflicts. In the default scheme for write conflicts, read accesses to the same bank have to yield due to the higher priority of register writes [7], [11], [12], anticipating an earlier release of the dependent instructions from the scoreboard. Note that the compiler is not able to avoid this conflict beforehand, because the write occasion is unknown for load operations.

Due to the multiwarp nature of GPGPUs, the written value is oftentimes not immediately used, because the dependent instructions are held in the issue pool. In that case, deferring the write will allow the read to proceed, successfully reducing pipeline stalls and overlapping the latency among warps. The write stealing is shown in Fig. 8: a register read of r0 arrives at the same cycle as a write to r8 in the arbitration stage at time T, causing a bank conflict.

Fig. 9. Book-keeping the stolen register Rfree with its actual destination register Rdest on each bank. Note that the register organization with eight banks is just for illustration. Any other form is applicable as long as the compiler knows it for a bank-aware liveness analysis.

Conventionally, the arbiter will grant the write access to r8, which blocks the read of r0. Instead, in our scheme, r8 can be deferred. We also clear the dependent instructions from the scoreboard and resort to a dependence checking to enforce the dependence. Actually, since there are many pending warps, the dependent instructions are likely to be trapped in the issue pool, which offers even more opportunities for the write back of r8. Unlike in CPUs, this turns out to be a common case in our simulation, as we find that there is a 60% chance that the first dependent read arrives more than three cycles later. This implies the potential of write stealing, and our experiment shows that the majority of stolen writes can find a write-back slot within three cycles.
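The conflict decision just described can be sketched as a small Python function. The function and the three outcome labels are our own naming; the baseline branch follows the write-priority rule cited above.

```python
def arbitrate_write_conflict(read_bank, write_bank, free_entry_available):
    """Decide what happens when a read and a write arrive in the same cycle.
    With write stealing, the write defers into a free register entry so the
    read can proceed; otherwise the baseline write-priority rule applies."""
    if read_bank != write_bank:
        return "both_proceed"                 # no bank conflict at all
    if free_entry_available:
        return "read_proceeds_write_stolen"   # defer write (write stealing)
    return "write_proceeds_read_stalls"       # baseline: write has priority
```

In the Fig. 8 example, the read of r0 proceeds and the write to r8 is parked, provided an inactive register entry is available.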

Note that there can be other alternatives for interwrite conflict removal. For instance, a multiport or pseudo dual-port RF helps, but at the cost of significant area overhead for infrequent usage. Another solution is a fully associative write buffer as widely used in CPUs [16], or dedicated buffers distributed on each bank. However, this turns out to be costly, because in GPGPUs, each warp register contains at least 32 threads, each with 32 bits. This means that we need to add 1 kb to buffer just one register entry. Wiring the results to the buffers is also costly. For example, the crossbar has to be widened to connect each individual bank, and multiple write ports are required for concurrent write backs. These modifications further increase the area and power, given the super-wide datapath in GPGPUs. Since the RF itself experiences underutilization most of the time, it is meaningful to use it for the buffering to maximize resource reuse, which is one of the primary goals of this paper.

2) Temporary Storage for Write Stealing: The prerequisite of write stealing is to find a place to temporarily hold the write-back value. We propose to use the off-the-shelf inactive register entries in the RF with lightweight compiler support. This is reasonable in the GPGPU architecture given two observations: 1) the entries and banks are oftentimes underutilized by GPGPU applications and 2) the banks are already connected to the result bus, which offers a direct datapath for the operand to be written into the stolen bank.

To find the inactive register entries, we associate a small status table holding the free register ID for each bank, i.e., Rfree with a valid bit f, as shown in Fig. 9. It is analyzed

JING et al.: BANK STEALING FOR A COMPACT AND EFFICIENT REGISTER FILE ARCHITECTURE IN GPGPU 7

with compile-time liveness information that can be embedded into the instruction encoding, and then assigned to hardware by passing along the pipeline, as similarly done in [17] and [18]. For instance, Rfree is initialized with an inactive register, e.g., ra, in the first instruction encoding. When ra is used by a following instruction, its encoding will embed another register, e.g., rb if inactive, which also carries an update of Rfree to rb at runtime. Note that any warp running on the SM is able to update Rfree. Therefore, it is advisable to assign Rfree to the least used register if the program is limited by registers, or to a register that is never used if the program is not register-limited. We have found that a single entry for each bank is sufficient for write stealing, and therefore, a single Rfree is enough to identify the current inactive register.

Once an inactive register Rfree is found upon an interwrite conflict, a link between the temporary Rfree and the actual destination register Rdest is established and recorded in the table with a valid bit m and the destination register/bank ID, as shown in Fig. 9. This information is used by the arbiter to determine when the value can be written back to the destination register and removed from the temporary register. Typically, the arbiter will issue a write back when: 1) the result bus is free and 2) the bank holding the temporary register is free. The write-back value is first read out from the temporary register and put onto the result bus. In the next cycle, the value is written back to the destination register. Nevertheless, it is still possible that the write back to the destination register conflicts with a read to that bank. If that happens, we do not steal again but complete the write by stalling the read.
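A per-bank entry of the Fig. 9 status table can be modeled as below. The class and method names are our own; the fields mirror the f/Rfree and m/Rdest pairs from the figure, and `try_writeback` encodes the two drain conditions just listed.

```python
class BankStatus:
    """One status-table entry (per bank): f/rfree mark a currently-dead
    register; m/rdest record a pending stolen write and its destination."""

    def __init__(self):
        self.f, self.rfree = False, None    # valid bit + free register ID
        self.m, self.rdest = False, None    # valid bit + destination reg/bank ID

    def steal_write(self, dest):
        # Park a conflicting write in the free entry and link its destination.
        if self.f and not self.m:
            self.m, self.rdest = True, dest
            return True
        return False                         # no free entry: cannot steal

    def try_writeback(self, result_bus_free, dest_bank_free):
        # Drain the temporary entry once the bus and the bank are both idle.
        if self.m and result_bus_free and dest_bank_free:
            self.m, self.rdest = False, None
            return True
        return False
```

A stolen write thus waits in the table until an idle cycle appears, rather than stalling a read immediately.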

Write stealing offers a second chance to write back the value, hoping to avoid the bank conflict encountered in the first round. In case the temporary register does not manage to write back before an incoming dependent access from another instruction (e.g., read-after-write (RAW) or write-after-write (WAW)), a forced write back is required, resulting in a pipeline stall to ensure data integrity. This is achieved when the dependent instruction arrives at the arbitration stage: the arbiter prioritizes the write back over the dependent instruction to avoid any hazard. However, because registers are not always immediately used in multiwarp GPGPUs, most of the time the temporary register can manage the write back without pipeline stalls. Finally, because the free registers are dead registers, very rarely do they become alive during program execution such that the temporarily stored value needs an immediate write back, which otherwise might cause a pipeline stall.

Compared with the traditional scheme, write stealing adds extra read and write operations, which is negative for energy. However, the added energy can be small due to the limited number of write stealing operations observed in our experiments. Given that the write energy accounts for only a portion of the dynamic RF energy (around 30%–40%), and that leakage tends to dominate with shrinking technology nodes (up to 70%) as exposed in [19], the overall RF energy can be reduced.

The required datapath changes are minimal. The result bus already exists and serves as the regular channel for writing back results to the RF banks. The status table contains 19 bits per bank (two valid bits, a 7-bit free register ID, and 10-bit destination register/bank IDs), altogether 152 bits for eight banks. Compared with the traditional buffering schemes incurring at least thousands of bits of overhead, the hardware cost is negligible. In addition, given the highly parallel structure of GPGPUs, our proposal applies distributed buffering in each bank by using inactive register entries. This eliminates the wiring and multiport problems that encumber the centralized write buffer solution.
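The bit budget quoted above can be checked quickly; the field widths are taken from the text, and the variable names are ours.

```python
# Status-table cost: per-bank fields as given in the text.
valid_bits, free_id_bits, dest_id_bits = 2, 7, 10
per_bank = valid_bits + free_id_bits + dest_id_bits   # 19 bits per bank
total = per_bank * 8                                  # eight banks
assert (per_bank, total) == (19, 152)
```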

D. Arbitration for Bank Stealing

Finally, the arbiter is modified to enable bank stealing. For each bank in any clock cycle, there can be three more types of incoming accesses besides the normal writes and reads, i.e., stealing reads, stealing writes, and forced writes. A stealing read means that the read tries to steal a bank for interread conflict removal. A stealing write means that the write tries to use a free bank with a free register to temporarily hold the write-back value. A forced write means that the write-back value will be forced to write back to the destination register.

1) Arbitration Changes: Fig. 10(a) shows the logic that arbitrates the read/write requests to a bank. At first, all operand requests are decoded into bank requests as in ①. Then, the requests are passed through a set of cascaded selections in ②, where the bank requesting signal is generated for each type of access, such as normal read or stealing write, as shown in Fig. 10(b). Finally, the priority logic in ④ grants the ultimate access according to the priority order. Note that all these steps are required in the conventional design.

To support bank stealing, we add logic to check the data dependence in ②. In general, the forced write takes the highest priority, because it either comes from a previously stolen write that cannot be deferred a second time, or it is a first write back that cannot borrow a vacant place for a stealing write. In either case, all other conflicting accesses have to yield. If there are multiple forced writes to the same bank, they are executed in consecutive cycles.

When a normal write conflicts with other types of accesses, it becomes a stealing write if a free register in a free bank can serve as temporary storage. This stealing check is performed in ③, where the normal writes are arbitrated against the bank vacancy status, as shown in Fig. 10(c). We use a round-robin policy to arbitrate among the banks qualified for stealing.
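The round-robin pick among vacant banks can be sketched as follows; the function name and the list-of-flags interface are our own illustration of the policy, not the paper's circuit.

```python
def round_robin_pick(vacant, last):
    """Return the first vacant bank after `last`, wrapping around,
    or None if no bank currently qualifies for a stealing write."""
    n = len(vacant)
    for i in range(1, n + 1):
        b = (last + i) % n          # scan banks in round-robin order
        if vacant[b]:
            return b
    return None
```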

In summary, the priorities of all these accesses in ④ are ordered as: forced write > normal read > stealing read > normal write > stealing write. In each cycle, only the one access with the highest priority is granted to the bank, while all other accesses have to stall.
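The priority grant in ④ reduces to a fixed-order scan; a minimal sketch follows, with the access-type labels and function name chosen by us.

```python
# Priority order exactly as stated in the text, highest first.
PRIORITY = ["forced_write", "normal_read", "stealing_read",
            "normal_write", "stealing_write"]

def grant(requests):
    """Grant the single highest-priority access to a bank this cycle;
    all other requests in the set must stall."""
    for kind in PRIORITY:
        if kind in requests:
            return kind
    return None   # bank idle this cycle
```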

2) Design Cost and Analysis: As shown in Fig. 10, the arbitration changes mainly lie in the cascaded selection, the write stealing check, and the prioritization. The added logic is symmetric, modular, and scalable, and therefore, we believe that the design complexity is manageable. We design the proposed circuitry and find it to be less than 1.2% of the RF area. Because the logic mainly involves 1-bit signals, the area overhead is negligible when compared with the extra buffers of 1024-bit width, which would require similar arbitration logic changes as well.


Fig. 10. Hardware support in the arbitrator for bank stealing. (a) Overview. (b) Cascaded selection. (c) Write stealing check on banks. Note that the arbitration on normal reads forms the critical path.

In terms of timing, the shaded line in Fig. 10 shows that the normal read is the original critical path in the arbiter. This is because there can be many more normal reads to be arbitrated, and the cascaded selection on them is generally long enough to overlap the delay of all the other types of accesses that run in parallel. For example, the decoding of stolen reads in ① can be done in parallel with the decoding of the normal reads. The checking and selection of stolen writes in ② and ③ can be done in parallel with the selection of the normal reads. The only logic added is in the prioritization logic ④ on two paths. For the read enable signal path ren, a two-input OR gate converges the normal read and the introduced stealing read if no forced write blocks them. For the write enable signal path wen, a four-input AND gate converges the normal write and the stealing write if no read request blocks them. Because the normal read is assumed to be the critical signal, the AND gate only switches after the normal read settles. Note that the normal read also forms the original critical path, which requires a two-input AND gate as well to assert the read request when there is no blocking normal write. Therefore, the two paths align approximately in timing by adding one gate level, e.g., a two-input AND and a two-input OR gate, whose delay can be estimated as 0.73× of an FO4 inverter delay. We simulate and verify that slightly tighter delay constraints can absorb this delay at the sub-1-GHz nominal frequency.

V. EVALUATION METHODOLOGY

As discussed earlier, the design overhead has been analyzed from a hardware perspective. In our experiments, we turn to an architectural perspective by evaluating RF designs with different numbers of banks, regarding their performance, area, and energy efficiency with our proposed bank stealing.

A. Design of a Pilot RF

The behavior of a multibanked RF has been introduced in Section II-B, and we use the same register mapping and organization for the baseline RF design. To learn the physical parameters of such a multibanked RF, we synthesize a pilot RF with multiple single-ported banks by using the SMIC commercial memory compiler tool [20] with an industrial cell library. It gives standard SRAM designs with four metal layers for over-the-cell routing. We build the RF by varying the number of banks, with the area, power, and delay numbers reported by the memory compiler.

We also evaluate the crossbar, which connects the banks and the operand collectors with a 128-byte bus for each operand. We use the bit-sliced structure [21], [22] due to its higher density for such a wide crossbar. The 128-byte slices are organized into a 32×32 square array. This structure is regular and easy to scale to various numbers of inputs and outputs, which well fits the needs of our design space exploration. After implementing the layout, it turns out that the area is severely dominated by wiring, even using the minimum pitch size. This is due to the large operand bitwidth, which requires the wires to be routed into the innermost slice in each direction. Therefore, we use two metal layers to split the horizontal and vertical wires to the four edges of the square to further reduce the wiring size. Finally, the area, power, and delay of the crossbar structure are calculated based on HSPICE simulation of a single-bit slice due to its regularity.

B. Experiment Setting

For the performance measurement, we use the cycle-level architectural simulator GPGPU-Sim v3.2 [23]. If not otherwise specified, each SM has 32 execution lanes and a 128-KB RF serving at most two warp instructions at a time. Other key configuration parameters for the simulator are shown in Table I, which generally conform to the Fermi architecture that is faithfully modeled by the simulator. We also apply our techniques to 256-KB and 512-KB RF designs featured in newer GPGPU architectures for a scalability study.


Fig. 11. Design space study using different banks. Proposed RFs are marked with ∗. (a) Normalized performance. (b) Normalized area. (c) Normalized areaefficiency. (d) Normalized energy efficiency.

TABLE I

KEY PARAMETER SETTINGS IN THE GPGPU-Sim SIMULATOR

Table II shows the benchmarks in our evaluation. We investigate 14 benchmarks from the CUDA SDK [24] and the Parboil [25] suite, which are widely used in GPGPU studies. We use CUDA SDK 4.2 for compilation, and the simulator is configured to use the PTXPlus syntax for simulation. This is a hardware instruction set extended by GPGPU-Sim with register allocation information to enable more accurate simulation results, as reported in [26]. Due to the lack of an open-source compiler alternative to the commercial nvcc to generate code with register liveness information, we intercept the compiled kernel before running the simulation. This interception binds our liveness analysis to each register. The simulator pipeline is also modified accordingly to accept the liveness information and identify the RF holes at runtime.

In our delay and power evaluation, we use the primitives from the pilot RF design under different banking schemes. The detailed primitives are shown in Table III for a 32-nm technology; they are obtained from the SMIC memory compiler tool and our custom layout. We collect RF statistics from GPGPU-Sim and use these primitives to compute the RF power, including the normal bank accesses, crossbar transmission, arbiter, and bank stealing if enabled. We also account for the extra read and write operations on each stolen write. For all other non-RF components, we use GPUWattch [27] to calculate the power.

VI. EXPERIMENTAL RESULTS

We first show the overall performance improvement of our proposed techniques with different numbers of banks. In all the results, the straight RF stands for the traditional RF without the proposed techniques, and the proposed RF stands for the RF equipped with the proposed techniques. As a case study for a better understanding of our proposed techniques, we choose representative design points for detailed analysis. We also sweep different RF sizes to demonstrate the scalability.

A. Design Space Study

For a thorough study, we apply our techniques to RF designs with a fixed capacity of 128 KB but different numbers of banks (4, 8, 16, and 32). The results given in Fig. 11 are averaged over all the benchmarks, including performance, area, area efficiency, and energy efficiency. The area efficiency is calculated as normalized performance over area, while the energy efficiency is calculated as normalized performance over energy.

From Fig. 11(a) and (b), we see that RF designs with our bank stealing techniques always achieve better performance than the straight RFs with the same configuration. The RFs with the same configurations have almost the same area due to the negligible hardware cost of bank stealing. The overall performance improvement ranges from 8% to 13%. Given that we only change a single GPGPU unit, this improvement is significant.

The most important observation is that the proposed RF can often outperform the straight RFs equipped with more banks. For instance, the proposed 8-bank RF outperforms the straight RFs with 16 banks or 32 banks. This exhibits the effectiveness of our proposed techniques, which adequately eliminate the potential bank conflicts and are more efficient than purely adding more banks. As a result, our proposed techniques are much more area-efficient, achieving nearly 25% and 50% area savings over the 16- and 32-bank straight RFs. At the same time, the performance is also superior, with 7.2% and 5.7% improvement over the 16- and 32-bank straight RFs.

We also compare the area efficiency and energy efficiency between the proposed RFs and the straight RFs in Fig. 11(c) and (d), respectively. All the results demonstrate that our proposed RF is superior in terms of performance, area, and energy efficiency. Because the RF size keeps increasing in modern GPGPUs, the much improved area efficiency with better RF performance will be beneficial for future scaling.

B. Case Study

As a case study, we choose the RF design that exhibits the highest area efficiency among the proposed RFs (eight banks).


TABLE II

BENCHMARK CHARACTERISTICS

TABLE III

AREA, TIMING, AND POWER PRIMITIVES FOR 128-KB RF WITH 16 BANKS

Fig. 12. Performance comparison on the straight RF designs without and with our proposed schemes.

Accordingly, we choose the eight-bank and 16-bank straight RF designs for comparison. The 16-bank straight RF acts as the baseline, and all the results are normalized to it.

Fig. 12 shows the performance comparison in detail. From the bottom bar in Fig. 12, we see that the straight 8-bank RF degrades the performance by nearly 5% compared with the 16-bank straight RF design. This is mainly due to the increased bank conflicts.

1) Evaluation on Read Stealing: In Fig. 12, we see that the average performance improves by around 7% over the straight 8-bank RF with read stealing. That means an average performance 2% better than the straight 16-bank RF. Read stealing alone provides better performance than the baseline straight 16-bank RF design for all of the 14 benchmarks except M.Car. The experimental results confirm that read stealing can effectively reduce the potential conflicts and improve the overall GPGPU performance.

It is worth mentioning that we have run the same experiments using different types of warp schedulers, including round-robin, two-level, and so on. The conclusions stay the same as for our results using GTO. Although read stealing may occasionally modify the instruction flow, the experiments verify that this scheme is generally applicable and not sensitive to the host scheduling policy.

Fig. 13. Latency penalty comparison (for each benchmark, the left bar is the straight 8-bank RF and the right bar is our proposed 8-bank RF).

2) Evaluation on Write Stealing: We further apply write stealing to the eight-bank straight RF. The results are stacked on top of read stealing, manifesting the total performance gain of the combined schemes shown in Fig. 12.

We find that the two schemes are complementary to each other in performance. They remove the conflicts and reduce the latency, which is essential to improving the RF throughput. When the two bank stealing techniques are applied together, the enhanced 8-bank RF can outperform the straight 16-bank RF by around 7%. Specifically, we see that it outperforms the straight 16-bank design for all the benchmarks.

3) RF Access Latency: We investigate the RF access latency by applying our proposed schemes to the straight RF design. We take the latency as the indicator of the impact of conflicts, as it is a more direct measurement of RF throughput.

Fig. 13 shows the latency penalty measured in pipeline cycles when accessing the RF. Compared with Fig. 4(a) on the 16-bank straight RF, we find that the straight 8-bank RF nearly doubles the latency penalty due to aggravated conflicts. In contrast, our proposed 8-bank RF effectively shortens the operand latency by reducing the interread and interwrite penalties by around 58% and 88%, respectively. The write stealing is more effective at removing the interwrite conflicts, while the read stealing largely depends on the prediction of the potential interread conflicts in the next cycle.

From this plot, we see that the fetching latency can be treated as a performance indicator, albeit not a linear one. For applications with significantly reduced operand latency, the proposed RF usually achieves better performance than the straight 16-bank RF (Q.Gen, S.qrng, and Tpacf). In short, our proposal is effective at reducing the operand latency penalty with fine-grained control of the precious RF resources.

Fig. 14. Write-back breakdown with write stealing.

Fig. 15. Power and energy efficiency.

We also notice that the utilization increases to 54% for the proposed 8-bank RF with bank stealing, against 23% for the straight 16-bank RF. This is mainly because of the greatly reduced number of banks and the more balanced usage of the RF banks, which accelerate operand fetching for better performance.

4) More Evaluation on Write Stealing: In general, write stealing can eliminate the interwrite conflicts. This is under the assumption that the temporary register can find a slot to write back to the destination register before the first dependent access. However, this is not always true. Whenever there is a pending RAW or WAW hazard, the value residing in the temporary register has to be written back, incurring a pipeline stall. In Fig. 14, we find that the majority of registers manage to write back without any intervention in the execution pipeline. First, 83% of the registers are normally written without stealing; therefore, the power overhead introduced by write stealing is limited. For the registers that do require stealing, over 25% are written back within one cycle after stealing. Another 50% are written back successfully after waiting several more cycles, but still before the first dependent access. Compared with the baseline, where 17% of the registers are written back conflicting with reads, only 2.5% of the writes are left that negatively stall the pipeline.

5) Evaluation on Power and Energy Efficiency: We show the power numbers and use the metric perf/power to quantify the power efficiency and perf/energy for the energy efficiency. The power results reported in this section are chip-wide, counting all the GPGPU core units.

Fig. 15 shows the power results. From Fig. 15, we confirm that RF designs using fewer banks consume less power. For instance, the straight 8-bank RF consumes 6% less power than the straight 16-bank RF. This is because fewer banks slightly increase the dynamic banking power but reduce the crossbar power and the leakage, as explored in Fig. 3(b). However, using traditional designs, fewer banks are inferior in terms of energy efficiency when performance is accounted for.

Fig. 16. Scaling of area and energy efficiency with and without our schemes.

In contrast, our proposed RF is more energy efficient. Although it introduces additional power due to write stealing, with its additional bank accesses and the added arbitration logic, the overhead is relatively small, as shown in Fig. 15. Taking all the power overhead of our scheme into account, the proposed 8-bank RF increases the total power by around 3% over the straight 8-bank RF, but still consumes less power than the straight 16-bank RF. The energy efficiency of our proposed design beats the straight 16-bank RF by 15%, owing to the superior performance gains.

C. Scalability Study

Finally, we demonstrate the scalability of our techniques on RF designs in newer GPGPU architectures. For example, Kepler doubles the RF capacity, allows 255 registers per thread with ISA support, and simplifies the register scoreboarding hardware with compiler information, as documented in [2]. Regarding the VLSI design, we believe that there are not too many changes, as the basic building block, i.e., the SRAM array, is hard to change. Therefore, our scalability study focuses on the bank organization and capacity increase by applying our techniques to RFs with different capacities (128, 256, and 512 KB per SM). We also increase the number of thread blocks and schedulers accordingly to represent the higher pressure from increasing TLP. For each RF size, we select a pair of optimal RF designs with and without our schemes by considering a balanced performance and area. Then, we compare their area and energy efficiency in Fig. 16.

We find that the most area- and energy-efficient configuration is generally not the design with the largest number of banks, due to the marginal performance gains at the cost of excessive power and area overhead. The optimal point varies with the RF capacity, meaning that the traditional RF design has to leverage expensive area expansion to accommodate the increasing TLP. In contrast, the most area-efficient RFs with our schemes all use eight banks, yet deliver comparable performance to the traditional RFs with a larger number of banks. This promises more silicon area savings with technology scaling. Our proposed bank stealing can effectively reduce the conflicts for varying RF sizes. We find a steady improvement of 40% to 80% in the area efficiency and significantly better energy efficiency, by 15%–28%, over the traditional design. The results demonstrate the superiority of our schemes and well justify them for better scalability in future GPGPUs.


VII. RELATED WORK

There has been an ample amount of work studying efficient RF structures in different microprocessors, including conventional CPUs, very long instruction word processors, and streaming processors such as GPGPUs.

For example, in general microprocessors, an SRAM–dynamic random access memory (DRAM) hybrid macrocell [28] and 3T1D embedded DRAM [29] have been applied to L1D caches. In the GPGPU domain, a variety of new RF implementations have been proposed, such as the hybrid SRAM–DRAM RF [7], the spin-transfer torque magnetic random access memory (STT-MRAM) RF [30], and embedded DRAM (eDRAM)-based RFs [17], [31], [32]. As these new types of memory generally compromise performance, these works focus on smart strategies to maximally recover the performance loss.

Techniques exploring architectural modifications have been intensively studied to improve performance or power. One study related to this paper in the CPU domain [16] proposes an area-efficient CPU RF by reducing ports and applies prefetching and delayed write back to sustain performance. They leverage queues with dozens of entries and heavily augment the CPU pipeline. However, in GPGPUs, it is impractical to apply their methods, because there is no register renaming stage and the operand width is too wide for queuing. Prior work also studies multicore caches [33], [34] or the cache-conflict problem [35], [36] with prediction mechanisms. Other researchers are interested in hierarchical RF structures in the CPU domain, such as [37] and [38] for less complexity, and RF caching [30], [39], [40] for performance enhancement.

In the GPGPU domain, the work in [4] aims to reduce the main RF access energy by introducing a small register cache for frequently accessed operands. To prevent the register cache from thrashing, they further propose a two-level thread scheduler to maintain a small set of active threads on deck. According to their configuration, they require an additional 6-KB fully associative cache per SM to keep the performance without loss, which might aggravate the already congested RF layout and degrade its timing parameters. In contrast, our previous work [9] reduces the RF area and improves the performance by identifying the interread conflicts. This work steps forward by identifying the interwrite conflicts and proposing a combined structure to combat them, enabling further performance improvement. Another work [3] proposes to reduce the dynamic and leakage power by identifying the inactive or unallocated registers, which is complementary to our study for further reduction of RF power. For better performance, caches in GPGPUs have also been intensively studied recently, such as in [41]–[43]. Another work proposes a unified on-chip memory design [5], which is able to reconfigure the capacities of the RF, shared memory, and cache to accommodate diverse applications. However, the unified structure merges all the on-chip storage, which leads to longer interconnecting wires and longer data access latency.

VIII. CONCLUSION

In this paper, we propose an efficient RF design with explicit conflict-eliminating techniques to sustain a high RF throughput facing the ever-increasing TLP. We first reveal how different design alternatives affect the overall performance. For the bank conflict problem, we identify the causes and propose two effective techniques accordingly. The read stealing scheme explicitly exploits the available bandwidth in the highly banked RF structure to eliminate conflicts and leverages the massive parallelism in GPGPUs to access the RF more intelligently. As a complement, the write stealing scheme leverages the idle banks to further hide the latency penalty of unnecessary operand waiting. Their combination improves the performance using half or even a quarter of the number of banks, with a smaller area and energy budget, which implies that our proposed RF design with bank stealing can sustain the fast-increasing RF capacity in future GPGPUs.
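The per-cycle behavior described above can be illustrated with a minimal Python sketch. This is an illustration only, not the paper's hardware arbiter: the function name, the request representation as (warp_id, bank_id) pairs, and the 8-bank default are assumptions made for the example. It shows the core idea that reads from other ready warps may claim banks left idle by the scheduled warp's conflicting accesses, instead of waiting for a conflict-free cycle.

```python
def serve_cycle(scheduled_reads, other_ready_reads, num_banks=8):
    """Sketch of one arbitration cycle with bank stealing.

    Each read is a (warp_id, bank_id) pair; each bank serves at most
    one access per cycle. Returns the list of reads served this cycle.
    """
    busy = [False] * num_banks
    served = []
    # Baseline behavior: serve the scheduled warp's operand reads;
    # a second read to an already-busy bank is a conflict and must wait.
    for warp, bank in scheduled_reads:
        if not busy[bank]:
            busy[bank] = True
            served.append((warp, bank))
    # Bank stealing: reads from other ready warps opportunistically
    # fill banks that would otherwise sit idle this cycle.
    for warp, bank in other_ready_reads:
        if not busy[bank]:
            busy[bank] = True
            served.append((warp, bank))
    return served
```

For example, `serve_cycle([(0, 2), (0, 2)], [(1, 5)])` serves warp 0's first read of bank 2, defers its conflicting second read, and lets warp 1 steal idle bank 5 in the same cycle, returning `[(0, 2), (1, 5)]`.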

The importance of the RF to performance has been widely acknowledged. Always keeping low overhead in mind, our proposal explicitly leverages fine-grained control of the existing resources without adding costly buffers or other hierarchical memory designs. Our proposed bank stealing techniques are generally applicable to other banked RF structures. The microarchitectural investigation enables a low-cost and more efficient architecture with acceptable complexity.

REFERENCES

[1] NVIDIA Corporation, “NVIDIA’s next generation CUDA compute architecture: Fermi,” NVIDIA, Santa Clara, CA, USA, White Paper, 2009.

[2] NVIDIA Corporation, “NVIDIA’s next generation CUDA compute architecture: Kepler GK110,” NVIDIA, Santa Clara, CA, USA, White Paper, 2012.

[3] M. Abdel-Majeed and M. Annavaram, “Warped register file: A power efficient register file for GPGPUs,” in Proc. IEEE 19th Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2013, pp. 412–423.

[4] M. Gebhart et al., “Energy-efficient mechanisms for managing thread context in throughput processors,” in Proc. 38th Annu. Int. Symp. Comput. Archit., 2011, pp. 235–246.

[5] M. Gebhart, S. W. Keckler, B. Khailany, R. Krashinsky, and W. J. Dally, “Unifying primary cache, scratch, and register file memories in a throughput processor,” in Proc. 45th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2012, pp. 96–106.

[6] X. Liang, K. Turgay, and D. Brooks, “Architectural power models for SRAM and CAM structures based on hybrid analytical/empirical techniques,” in Proc. Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2007, pp. 824–830.

[7] W.-K. S. Yu, R. Huang, S. Q. Xu, S.-E. Wang, E. Kan, and G. E. Suh, “SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading,” in Proc. 38th Annu. Int. Symp. Comput. Archit., Jun. 2011, pp. 247–258.

[8] C. H. Jung, “Pseudo-dual port memory where ratio of first to second memory access is clock duty cycle independent,” U.S. Patent 20070109909, May 17, 2007.

[9] N. Jing, S. Chen, S. Jiang, L. Jiang, C. Li, and X. Liang, “Bank stealing for conflict mitigation in GPGPU register file,” in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design, Jul. 2015, pp. 55–60.

[10] NVIDIA Corporation, “Parallel thread execution ISA version 3.0,” NVIDIA, Santa Clara, CA, USA, White Paper, 2012.

[11] J. H. Tseng and K. Asanovic, “A speculative control scheme for an energy-efficient banked register file,” IEEE Trans. Comput., vol. 54, no. 6, pp. 741–751, Jun. 2005.

[12] C. Gou and G. N. Gaydadjiev, “Elastic pipeline: Addressing GPU on-chip shared memory bank conflicts,” in Proc. 8th ACM Int. Conf. Comput. Frontiers (CF), 2011, Art. no. 3.

[13] J. H. Tseng and K. Asanovic, “Banked multiported register files for high-frequency superscalar microprocessors,” in Proc. 30th Annu. Int. Symp. Comput. Archit., 2003, pp. 62–71.

[14] T. G. Rogers, M. O’Connor, and T. M. Aamodt, “Cache-conscious wavefront scheduling,” in Proc. 45th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2012, pp. 72–83.


[15] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, “Improving GPU performance via large warps and two-level warp scheduling,” in Proc. 44th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2011, pp. 308–317.

[16] N. S. Kim and T. Mudge, “Reducing register ports using delayed write-back queues and operand pre-fetch,” in Proc. 17th Annu. Int. Conf. Supercomput. (ICS), 2003, pp. 172–182.

[17] N. Jing, H. Liu, Y. Lu, and X. Liang, “Compiler assisted dynamic register file in GPGPU,” in Proc. Int. Symp. Low Power Electron. Design (ISLPED), 2013, pp. 3–8.

[18] S. Lee, K. Kim, G. Koo, H. Jeon, W. W. Ro, and M. Annavaram, “Warped-compression: Enabling power efficient GPUs through register compression,” in Proc. 42nd Annu. Int. Symp. Comput. Archit., 2015, pp. 502–514.

[19] M.-T. Chang, P. Rosenfeld, S.-L. Lu, and B. Jacob, “Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM,” in Proc. 19th IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2013, pp. 143–154.

[20] Single-Port Register-File User Guide, Semicond. Manuf. Int. Corp., Shanghai, China, Jul. 2012.

[21] G. Passas, M. Katevenis, and D. Pnevmatikatos, “A 128×128×24 Gb/s crossbar interconnecting 128 tiles in a single hop and occupying 6% of their area,” in Proc. 4th ACM/IEEE Int. Symp. Netw.-Chip, May 2010, pp. 87–95.

[22] C.-H. O. Chen, S. Park, T. Krishna, and L.-S. Peh, “A low-swing crossbar and link generator for low-power networks-on-chip,” in Proc. ACM/IEEE Int. Conf. Comput.-Aided Design, Nov. 2011, pp. 779–786.

[23] A. Bakhoda, L. G. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., Apr. 2009, pp. 163–174.

[24] NVIDIA. NVIDIA CUDA Toolkit, accessed on 2013. [Online]. Available: https://developer.nvidia.com/cuda-toolkit

[25] J. A. Stratton et al., “Parboil: A revised benchmark suite for scientific and commercial throughput computing,” Center Rel. High-Perform. Comput., Univ. Illinois Urbana–Champaign, Champaign, IL, USA, Tech. Rep., Mar. 2012.

[26] T. M. Aamodt, W. W. L. Fung, and T. H. Hetherington. GPGPU-Sim 3.x Simulator, accessed on 2014. [Online]. Available: http://gpgpu-sim.org/manual

[27] J. Leng et al., “GPUWattch: Enabling energy optimizations in GPGPUs,” in Proc. 40th Annu. Int. Symp. Comput. Archit., 2013, pp. 487–498.

[28] A. Valero et al., “A hybrid eDRAM/SRAM macrocell to implement first-level data caches,” in Proc. 42nd Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2009, pp. 213–221.

[29] X. Liang, R. Canal, G.-Y. Wei, and D. Brooks, “Process variation tolerant 3T1D-based cache architectures,” in Proc. 40th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2007, pp. 15–26.

[30] N. Goswami, B. Cao, and T. Li, “Power-performance co-optimization of throughput core architecture using resistive memory,” in Proc. IEEE 19th Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2013, pp. 342–353.

[31] N. Jing et al., “An energy-efficient and scalable eDRAM-based register file architecture for GPGPU,” in Proc. 40th Annu. Int. Symp. Comput. Archit. (ISCA), 2013, pp. 344–355.

[32] N. Jing, L. Jiang, T. Zhang, C. Li, F. Fan, and X. Liang, “Energy-efficient eDRAM-based on-chip storage architecture for GPGPUs,” IEEE Trans. Comput., vol. 65, no. 1, pp. 122–135, Jan. 2016.

[33] S. Sardashti and D. A. Wood, “Decoupled compressed cache: Exploiting spatial locality for energy-optimized compressed caching,” in Proc. 46th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2013, pp. 62–73.

[34] H. Kasture and D. Sanchez, “Ubik: Efficient cache sharing with strict QoS for latency-critical workloads,” in Proc. 19th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2014, pp. 729–742.

[35] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan, “Speculation techniques for improving load related instruction scheduling,” in Proc. 26th Int. Symp. Comput. Archit., 1999, pp. 42–53.

[36] M.-J. Wu, M. Zhao, and D. Yeung, “Studying multicore processor scaling via reuse distance analysis,” in Proc. 40th Annu. Int. Symp. Comput. Archit., 2013, pp. 499–510.

[37] I. Park, M. D. Powell, and T. N. Vijaykumar, “Reducing register ports for higher speed and lower energy,” in Proc. 35th Annu. IEEE/ACM Int. Symp. Microarchitecture, Nov. 2002, pp. 171–182.

[38] R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi, “Reducing the complexity of the register file in dynamic superscalar processors,” in Proc. 34th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2001, pp. 237–248.

[39] H. Zeng and K. Ghose, “Register file caching for energy efficiency,” in Proc. Int. Symp. Low Power Electron. Design, Oct. 2006, pp. 244–249.

[40] T. M. Jones, M. F. P. O’Boyle, J. Abella, A. González, and O. Ergin, “Energy-efficient register caching with compiler assistance,” ACM Trans. Archit. Code Optim., vol. 6, no. 4, Oct. 2009, Art. no. 13.

[41] W. Jia, K. A. Shaw, and M. Martonosi, “MRPB: Memory request prioritization for massively parallel processors,” in Proc. 20th Int. Symp. High Perform. Comput. Archit., Feb. 2014, pp. 272–283.

[42] X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W.-M. Hwu, “Adaptive cache management for energy-efficient GPU computing,” in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2014, pp. 343–355.

[43] I. Singh, A. Shriraman, W. W. L. Fung, M. O’Connor, and T. M. Aamodt, “Cache coherence for GPU architectures,” in Proc. IEEE 19th Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2013, pp. 578–590.

Naifeng Jing received the B.E., M.E., and Ph.D. degrees from Shanghai Jiao Tong University, Shanghai, China, in 2005, 2008, and 2012, respectively.

He is currently an Assistant Professor with the Department of Micro-Nano Electronics, Shanghai Jiao Tong University. His current research interests include high-performance and high-reliability computing architectures and systems, and computer-aided VLSI design.

Shunning Jiang received the B.S. degree in computer science from Zhiyuan College, Shanghai Jiao Tong University, Shanghai, China, in 2015. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA.

He has been with the Advanced Computer Architecture Laboratory, Shanghai Jiao Tong University, since 2013, during his undergraduate studies. He visited the Xtra Computing Group at Nanyang Technological University, Singapore, as a Research Assistant from 2014 to 2015. His current research interests include computer architecture and computer systems, especially parallel architectures, approximate computing, and memory systems.

Shuang Chen received the B.S. degree in computer science and technology from Shanghai Jiao Tong University (SJTU), Shanghai, China, in 2015. She is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA.

She was an Undergraduate Research Assistant with the Advanced Computer Architecture Laboratory, SJTU, from 2013 to 2015, where she was involved in research on general-purpose graphic processing units. In 2014, she was a Research Assistant with Nanyang Technological University, Singapore, where she was involved in research on both computer architecture and databases. She is currently a member of the Computer Systems Laboratory at Cornell University. Her current research interests include chip multiprocessors with an emphasis on the memory system.


Jingjie Zhang is currently pursuing the bachelor’s degree with the Department of Electrical Engineering, Fudan University, Shanghai, China.

He has been an Undergraduate Research Assistant with the Advanced Computer Architecture Laboratory, Shanghai Jiao Tong University, Shanghai, since 2015. His current research interests include signal processing, computer-aided design, and computer architecture.

Li Jiang received the B.S. degree in computer science and technology from Shanghai Jiao Tong University, Shanghai, China, in 2007, and the Ph.D. degree from the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, in 2013.

He has been an Assistant Professor with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, since 2014. His current research interests include computer architecture, computer-aided design, VLSI testing, and fault tolerance, with an emphasis on emerging technologies and applications.

Dr. Jiang was a recipient of the Best Doctoral Thesis Award at the Asian Test Symposium in 2014.

Chao Li (M’10) received the Ph.D. degree from the University of Florida, Gainesville, FL, USA, in 2014.

He is currently an Assistant Professor with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. His current research interests include computer architecture, energy-efficient systems, and green computing.

Dr. Li has received Best Paper Awards from the IEEE HPCA and the IEEE COMPUTER ARCHITECTURE LETTERS.

Xiaoyao Liang received the Ph.D. degree from Harvard University, Cambridge, MA, USA.

He is currently a Professor and the Associate Dean of the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. He has been a Senior Architect or an IC Designer with NVIDIA, Intel, and IBM, USA. His current research interests include computer processor and system architectures, energy-efficient and resilient microprocessor design, high-throughput and parallel computing, general-purpose graphic processing units, hardware/software co-design for cloud and mobile computing, IC design, VLSI methodology, FPGA innovations, and compiler technology.