[ACM Press the ACM SIGGRAPH/EUROGRAPHICS conference - Los Angeles, California...

Graphics Hardware 2005, 30–31 July 2005, Los Angeles CA

© 2005 ACM 1-59593-086-8/05/0007 $5.00

Graphics Hardware (2005)
M. Meissner, B.-O. Schneider (Editors)

A Fast, Energy-Efficient Z-Comparator

Justin Hensley, Montek Singh and Anselmo Lastra

University of North Carolina, Chapel Hill, NC, USA — {hensley, montek, lastra}@cs.unc.edu

Abstract

We present a fast and energy-efficient z-comparator that takes advantage of the fact that the result of most depth comparisons can be determined by examining just a few bits. This feature is made possible by the use of asynchronous logic, which enables the comparator to rapidly compare bits until the result is clear and then stop. Using depth data from well-known computer games, SPICE simulations indicate that our comparator consumes only 25% of the energy and operates 1.67 times faster, on average, compared to an equivalent synchronous design. The comparator design is used to illustrate a more general design principle, “compute on demand,” which can potentially enable graphics hardware to be faster and more energy-efficient.

Categories and Subject Descriptors (according to ACM CCS): I.3.1 [Computer Graphics]: Hardware Architecture

1. Introduction

This paper introduces the notion of “compute on demand” as a design principle for fast and energy-efficient graphics hardware. The key idea is to exploit the data-dependent nature of computation, and to obtain speed and energy improvements by optimizing the design for the common case, instead of assuming worst-case operation. An asynchronous or “clockless” circuit style is used to facilitate this paradigm. In particular, only those portions of compute blocks are activated that are actually required for a particular operation, thereby saving energy. In addition, asynchronous components are typically capable of providing data-dependent completion times, thereby potentially obtaining speed improvements.

The design of a z-comparator is presented to illustrate the general compute-on-demand principle. By experimentation we have determined that, on average, a typical depth comparison requires examination of many fewer bits than the typical 24 to 32 bits of the z value. For example, visibility for the complex frame in Figure 1 is determined by only comparing an average of 7.3 bits. Since a typical depth comparator compares all of the bits of z, it performs many unnecessary computations. That wastes energy, and potentially also costs extra time. In contrast, our novel asynchronous comparator limits energy dissipation by only performing computation as required. To render the frame shown in Figure 1, our asynchronous comparator would dissipate 1/4th the energy of an equivalently sized synchronous comparator, while operating 1.67 times faster.

Arguably, only making the z-comparator fast and energy-efficient is not likely to result in any significant improvement in the speed or energy consumption of an entire graphics chip. However, our approach to designing the comparator holds promise for other parts of the chip as well. For instance, certain arithmetic units can be constructed to take advantage of the fact that the entire precision of a number is not always needed [ECKM05]. Further, several asynchronous arithmetic blocks have been designed so as to obtain average-case cycle times and latencies, as opposed to the worst-case operation typical of synchronous components [NYB97, RSG∗99]. Benefits of asynchrony have also been demonstrated in mixed synchronous-asynchronous pipelines [STR∗02]. In sum, the comparator design is offered as an example to make the case that asynchronous circuits and the compute-on-demand paradigm are promising for next-generation graphics hardware.

Figure 1: A frame from Unreal Tournament 2004. The frame requires 6,768,766 comparisons of incoming fragments with the depth buffer. On average, only the 7.3 most significant bits are actually needed to resolve each comparison.

The remainder of this paper is organized as follows. Section 2 provides background and a short overview of previous work. Section 3 presents the new asynchronous comparator design. Simulation results for energy and performance are presented in Section 4, and conclusions appear in Section 5.


2. Background and Previous Work

This section provides background on the circuit style chosen for the comparator design, and then discusses previous work. Two key features of our design, asynchronous control logic and dynamic logic for the datapath, together facilitate “computation on demand.”

2.1. Background: Asynchronous Design

Current microelectronic trends are posing several significant challenges to the existing paradigm of globally clocked design: (i) distribution of a multi-gigahertz clock, (ii) handling of multiple timing domains, (iii) overcoming worst-case performance, (iv) limiting wasteful clock power dissipation, and (v) interfacing with arbitrary environments.

As a result, an alternative paradigm, asynchronous or “clockless” design, is becoming an increasingly attractive approach [BJN99]. An asynchronous system uses handshaking between interacting components to achieve local synchronization, instead of using global clocking.

Asynchronous design has potentially significant energy and performance benefits. Energy consumption is lower because wasteful clock power is eliminated and because switching activity is limited to when and where it is needed [BJN99]. Performance benefits result because typical asynchronous components are capable of exploiting the data-dependency of completion times [NYB97, RSG∗99].

Our comparator design uses asynchronous handshaking to ensure that only those components are activated that are required for computing the result of the comparison. In particular, a bitslice is activated only if all the more significant bits of the two operands are equal.

2.2. Background: Dynamic Logic

Traditional datapaths were designed using static CMOS gates composed of complementary pull-up and pull-down networks [WE93]. The need for two complementary networks of transistors implies greater area, slower switching speed and greater power consumption. An alternative style, called dynamic logic [WE93], eliminates the need for bulky pull-up networks, replacing them with a single transistor driven by external control. As a result, there is much less loading on the logic inputs, allowing them to switch faster. Due to its high-performance potential, dynamic logic is increasingly used in speed-critical portions of modern chips, e.g., ALUs of recent processors, including the Pentium 4.

Figure 2 shows the structure of a general dynamic gate. It has a pull-down network made of nMOS transistors, but the pull-up network is replaced by a single pMOS transistor (“precharge device”). Typically, there is also an additional nMOS transistor, in series with the pull-down network (“evaluation device” or “foot”). Both the precharge and evaluation devices are controlled by an external input, called PC (“precharge control”). There are also two inverters near the output of the gate, one for generating the correct output polarity (“output inverter”), and a weaker one that provides feedback to stabilize the output (“keeper”).

Figure 2: A dynamic logic gate. [Schematic labels: logic inputs, pull-down network, PC, precharge device, evaluation device or “foot”, dynamic node, “keeper”, output inverter, logic output.]

A dynamic gate alternates between two phases of operation, precharge (or reset) and evaluation, controlled by the PC input. Precharge occurs when PC is driven low: the “dynamic node” goes high, thereby resetting the gate output to low. Evaluation occurs when PC is driven high: the pull-down network is enabled to process its inputs; the value produced is inverted to form the gate output. Thus, upon evaluation, either the gate output stays low, or makes a monotonic transition from low to high.
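The two-phase behavior described above can be illustrated with a small behavioral model. This is only a sketch under simplifying assumptions: it ignores timing, charge leakage and transistor sizing, and the class name `DynamicGate` and the helper `or_network` are hypothetical names chosen for illustration, not part of the paper's design.

```python
from typing import Callable, Sequence


class DynamicGate:
    """Behavioral model of a footed dynamic gate (illustrative, not a netlist)."""

    def __init__(self, pull_down: Callable[[Sequence[int]], bool]):
        # pull_down(inputs) returns True when the nMOS network conducts.
        self.pull_down = pull_down
        self.dynamic_node = 1  # starts precharged (high)
        self.output = 0        # output inverter therefore drives low

    def step(self, pc: int, inputs: Sequence[int]) -> int:
        if pc == 0:
            # Precharge: the pMOS device pulls the dynamic node high.
            self.dynamic_node = 1
        elif self.dynamic_node == 1 and self.pull_down(inputs):
            # Evaluate: the foot is on, so a conducting network discharges the node.
            self.dynamic_node = 0
        # In every other case the keeper holds the node at its current value.
        self.output = 1 - self.dynamic_node
        return self.output


def or_network(inputs: Sequence[int]) -> bool:
    """Pull-down network of a dynamic OR gate: conducts if any input is high."""
    return any(inputs)


gate = DynamicGate(or_network)
stimulus = [(0, (0, 0)), (1, (0, 0)), (1, (0, 1)), (1, (0, 0)), (0, (0, 1))]
trace = [gate.step(pc, inputs) for pc, inputs in stimulus]
print(trace)  # [0, 0, 1, 1, 0]
```

In the trace, the output rises once during evaluation and stays high even after the input falls, which is the monotonic low-to-high behavior noted above; only the next precharge resets it.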

The use of dynamic logic is key to the speed and energy efficiency of our comparator. In particular, the PC control input of the dynamic gates provides us the ability to control when and where computation is triggered.

2.3. Previous Work

The most relevant prior work is by Knittel et al. [KS95], which introduces two comparator designs for use in a novel approach that folds z-comparisons into z-buffer storage itself.

Their first design is the most similar to ours: the comparison proceeds from the MSB towards the LSB and, in certain cases, their design has data-dependent completion times. However, there are two key distinctions as well. Their design has data-dependent completion only when the result of the z-comparison is “true” (i.e., the new z-value is less than the old z-value); for the other cases, a “false” result is inferred after a worst-case delay. In contrast, our comparator is able to exploit data-dependence in all cases, thereby providing a potential speed advantage. The second difference is in the energy consumption. In particular, their design has a global enable signal which must be broadcast to all bitslices, whereas our design asserts the enable for each bitslice only as needed, thereby conserving energy. Moreover, their design uses alternating stages that are dominated by nMOS and pMOS transistors; the p-type stages can represent significant capacitive loading. In contrast, our design uses domino logic, which is dominated by n-type devices only, thereby providing a further energy benefit.

Their second design is a modification of the first one to increase concurrency: a 32-bit comparator is decomposed into four 8-bit comparisons whose results are combined using appropriate priority. As a result, speed is improved approximately four-fold, at the cost of higher energy consumption.



Our comparator design could be similarly decomposed to achieve higher speed at the cost of energy. However, the relative merits highlighted above are likely to remain the same.
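The decomposition idea, four 8-bit comparisons combined by priority, can be sketched functionally as follows. This is an assumption-laden illustration of the principle only; it does not model the circuit of [KS95] or any decomposed version of our comparator, and the function names are hypothetical.

```python
def compare_group(a: int, b: int) -> str:
    """Compare two small unsigned fields; returns 'lt', 'gt' or 'eq'."""
    if a < b:
        return "lt"
    if a > b:
        return "gt"
    return "eq"


def compare_decomposed(a: int, b: int, width: int = 32, group: int = 8) -> str:
    """Compare width-bit unsigned operands as independent group-bit fields.

    All fields can be compared concurrently; the most significant field
    that is not 'eq' takes priority and decides the overall result.
    """
    mask = (1 << group) - 1
    # Shifts run from the most significant field down to the least significant.
    partial = [
        compare_group((a >> shift) & mask, (b >> shift) & mask)
        for shift in range(width - group, -1, -group)
    ]
    for result in partial:  # priority: first non-equal field wins
        if result != "eq":
            return result
    return "eq"


print(compare_decomposed(0x01000000, 0x00FFFFFF))  # gt
print(compare_decomposed(0x12345678, 0x12345679))  # lt
```

Every field is evaluated regardless of whether a more significant field has already decided the result, which is where the extra energy of a decomposed design comes from.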

Another relevant approach is the energy-efficient comparator of Ponomarev et al. [PKEG04], proposed for use in superscalar CPUs. Somewhat analogous to our “compute-on-demand” functionality, their design has a feature called “dissipate-on-match”: their circuit consumes more energy when the operands match, and less for a mismatch. However, while our design examines only those bits that are necessary, their design still examines all bits in parallel. Moreover, their design is dominated by pMOS pass transistors, which imply increased loading, thereby wasting energy.

Most importantly, though, the comparator of [PKEG04] has a significant limitation: it is useful only for testing equality of two operands, not for less-than or greater-than operations. Similarly, the comparator designs of [WLWW03] only check for equality. As a result, these designs are not suitable for use as a z-comparator. In contrast, our design provides all three comparisons (equal, greater-than or less-than).

3. Asynchronous Comparator

This section introduces our novel comparator, which generates “less-than,” “equal-to,” or “greater-than” for a pair of operands.

3.1. Comparator Architecture

Figure 3 shows the overall comparator architecture. The entire computation is bitsliced, with a partial result at each bit position evaluated by a function block implemented using dynamic logic. The precharge/evaluate control of each dynamic function block, labeled “eval” in Figure 3 (“PC” in Figure 2), is generated by the bitslice to its left.

Computation proceeds from left to right, most significant bit to least significant bit. The key idea is to have evaluation triggered in a bitslice if and only if all the bits to its left, i.e., the more significant bits, have been inspected and found to be identical in the two operands. Thus, this design inspects only the shortest leftmost prefix needed to evaluate the comparison.

An enabled bitslice compares the bits of the two operands at that position, and if they are not equal, it generates the “greater-than” (gt) or “less-than” (lt) output. The “greater-than” and “less-than” outputs of all the bitslices are OR’ed together using a tree of dynamic OR gates, to provide the appropriate result. These trees are in practice quite efficient because dynamic OR gates can typically have fan-ins as high as 6. If the comparison at a bitslice is “equal” (eq), an evaluation request is sent to the next bitslice in the chain, which then similarly evaluates the comparison at the next bit. If the rightmost, least-significant bit evaluates as “equal,” then the result of the comparison is reported to be “equal.”
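The bitslice chain described above can be mimicked in software to make the control flow concrete. The following is a functional sketch only, assuming unsigned operands; it models which slices are activated but not handshaking, timing or energy, and the names `bitslice` and `compare_on_demand` are hypothetical.

```python
from typing import Tuple


def bitslice(eval_in: bool, bit_a: int, bit_b: int) -> Tuple[bool, bool, bool]:
    """One bitslice: returns (lt, gt, eq). All outputs stay low unless enabled."""
    if not eval_in:
        return False, False, False
    return bit_a < bit_b, bit_a > bit_b, bit_a == bit_b


def compare_on_demand(a: int, b: int, width: int = 24) -> Tuple[str, int]:
    """Return the comparison result and the number of bitslices activated."""
    eval_signal = True  # external EVAL request into the leftmost (MSB) slice
    lt_outputs, gt_outputs = [], []
    activated = 0
    for pos in range(width - 1, -1, -1):  # MSB first
        if eval_signal:
            activated += 1
        lt, gt, eq = bitslice(eval_signal, (a >> pos) & 1, (b >> pos) & 1)
        lt_outputs.append(lt)
        gt_outputs.append(gt)
        eval_signal = eq  # an "equal" slice requests evaluation of the next one
    if any(lt_outputs):   # stands in for the OR tree of lt outputs
        return "lt", activated
    if any(gt_outputs):   # stands in for the OR tree of gt outputs
        return "gt", activated
    return "eq", activated  # the LSB slice itself evaluated as equal


print(compare_on_demand(0b0101, 0b0110, width=4))  # ('lt', 3)
print(compare_on_demand(0x800000, 0x000000))       # ('gt', 1)
```

In the first call the two most significant bits match, so the third slice is the first to see a difference and no slice to its right is ever enabled.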

3.2. Comparator Operation: Compute on Demand

The asynchronous comparator takes advantage of the fact that the entire width of the operands is not always needed to perform the computation. We have termed this feature “compute on demand,” since each bitslice only computes if its result is required. By preventing unnecessary partial-result evaluations, the asynchronous comparator limits its energy dissipation to a minimum. In addition, the latency of the comparator is data-dependent: easier comparisons are faster. If the remainder of the graphics pipeline can exploit the shorter average comparator latency (compared with the worst-case latency of synchronous comparators), then our design also has speed benefits.

Figure 3: Our novel “compute-on-demand” comparator. [Diagram labels: a chain of bitslices, each with an eval input and lt, gt and eq outputs; the external EVAL input; two OR trees; outputs Equal, LessThan and GreaterThan.]

If the operands are completely random, then on average only three bits need to be compared to resolve a comparison, regardless of input width [YBV∗97]. This is because, as the evaluation proceeds from left to right, the probability that another bit must be inspected progressively falls by half.

In practice, however, the average number of bits inspected will be greater than three for operands that are not random. In particular, when the comparator is used in the z-compare unit, the operands will be incoming fragments whose depth values can exhibit some clustering.

Our experiments on a variety of test scenes show that only the 6–8 most significant bits are needed, on average, to perform the z-comparison for 24–32 bit depth values. Figure 1 shows a single random frame, rendered at a resolution of 1024×768, from the game Unreal Tournament 2004. A trace of all depth comparisons was generated using a modified version of the Mesa [Mes] graphics library. For the frame shown, 6,768,766 comparisons were performed, and on average only the 7.3 most significant bits were needed to evaluate the z-comparisons.
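Statistics of this kind can be computed from a trace of depth pairs with a few lines of code. The sketch below is not the tooling used for the experiments: it assumes the trace is available as pairs of unsigned integers (incoming depth, stored depth), the four pairs in `toy_trace` are made-up values, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def bits_examined(a: int, b: int, width: int = 24) -> int:
    """Number of bits, counted from the MSB, needed to resolve a vs. b."""
    diff = a ^ b
    if diff == 0:
        return width  # equal operands: every bit must be inspected
    # The highest set bit of the XOR marks the first differing position.
    return width - diff.bit_length() + 1


def average_bits(pairs: Iterable[Tuple[int, int]], width: int = 24) -> float:
    """Mean number of bits examined over all comparisons in the trace."""
    counts = [bits_examined(a, b, width) for a, b in pairs]
    return sum(counts) / len(counts)


def chain_length_histogram(pairs: Iterable[Tuple[int, int]], width: int = 24) -> Counter:
    """Histogram of compute chain length, i.e. bits examined after the MSB."""
    return Counter(bits_examined(a, b, width) - 1 for a, b in pairs)


# Made-up example pairs of 24-bit depth values (not measured data).
toy_trace = [
    (0x7FFFFF, 0x800000),  # differ in the MSB
    (0x123456, 0x123457),  # differ only in the LSB
    (0x400000, 0x410000),  # first difference at the eighth bit
    (0xABCDEF, 0xABCDEF),  # equal
]
print(average_bits(toy_trace))            # 14.25
print(chain_length_histogram(toy_trace))
```

The histogram uses the convention of Table 1, where the compute chain length excludes the MSB, so a comparison resolved by the MSB alone falls in bin 0.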

4. Experimental Results

This section presents the results of electrical simulations of our new asynchronous comparator. To serve as the base case for comparison, a similar comparator was also designed using a clocked approach. Both were designed using the Cadence tool suite, and simulated using Spectre in a 0.18 µm TSMC CMOS process, at 300 K and 1.8 V power supply.

Simulation Results. Each comparator was simulated with several different input values, and the computational latency and energy consumption were measured. Table 1 summarizes the results. The first column lists the number of bits after the MSB that had to be examined in order to generate the result, i.e., the length of the shortest leftmost prefix evaluated, excluding the MSB. The remaining columns provide the latency (τ) and energy consumed (E) for each design, along with the ratio of asynchronous to synchronous energy.



Figure 4: Distribution of z-comparison compute chain length for the frame shown in Figure 1. [Histogram: x-axis, compute chain length (0 to 23); y-axis, number of comparisons (0 to 2 × 10^6).]

The results clearly demonstrate the data-dependent nature of the comparison completion times. The shortest comparisons are roughly similar for the two implementations: 590 ps for synchronous and 550 ps for asynchronous. The longest comparisons take 4.16 ns and 7.24 ns, respectively. The asynchronous comparator is slower in the worst case, since each successive bitslice is enabled only once it is determined that further computation is required.

The advantage of the asynchronous implementation is quite clear: it truly exhibits variable computation delays. In contrast, the designer of the synchronous implementation will be forced to choose a clock period that is long enough to accommodate the worst-case delay, 4.16 ns.

Depending upon the operand distribution, the asynchronous implementation can be significantly faster than the synchronous one. In particular, several real-world example scenes were analyzed, and we determined that the average compute chain lengths were in the range of 6–8 bits. Figure 4 shows the distribution of the compute chain length for the z-comparisons from the frame shown in Figure 1. For this frame, the eight most significant bits provide enough information to capture over 85% of the z-comparisons. Only rarely does our comparator need to look beyond 10 bits. Assuming this distribution, our asynchronous comparator would be able to evaluate the over 6 million comparisons 1.67 times faster (2.49 ns average latency) than the synchronous comparator (4.16 ns latency). Of course, the rest of the synchronous graphics pipeline must be capable of exploiting this latency improvement.

The energy advantage of the asynchronous comparator is even more significant. Assuming the distribution of data in Figure 4, our asynchronous comparator, on average, dissipates only 1/4th the energy of the synchronous comparator. Interestingly, for the shortest compute chain, the asynchronous version is over 12 times more energy-efficient than the synchronous one. In the worst case, for the longest chain, our asynchronous design still dissipates 41% less energy.

Compute   Synchronous         Asynchronous        Async E /
chain     τ (ns)   E (pJ)     τ (ns)   E (pJ)     Sync E

 0        0.59     17.36      0.55      1.40      0.080
 4        1.23     18.16      1.69      3.23      0.178
 8        1.85     18.95      2.87      5.36      0.283
12        2.46     19.74      4.04      7.39      0.374
16        3.08     20.53      5.20      9.37      0.456
20        3.69     21.32      6.37     11.40      0.535
23        4.16     22.00      7.24     12.89      0.586

Table 1: 24-bit Comparator results
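Average figures such as the 1.67 times speedup are obtained by weighting the per-chain-length values of Table 1 by a chain-length distribution. The sketch below shows one way to do that arithmetic. It makes two assumptions that are not from the paper: values for chain lengths missing from Table 1 are linearly interpolated, and `HYPOTHETICAL_HISTOGRAM` is an invented distribution (the actual counts behind Figure 4 are not reproduced here), so its output should not be expected to match the reported averages.

```python
# Table 1: chain length -> (sync delay ns, sync energy pJ,
#                           async delay ns, async energy pJ)
TABLE_1 = {
    0: (0.59, 17.36, 0.55, 1.40),
    4: (1.23, 18.16, 1.69, 3.23),
    8: (1.85, 18.95, 2.87, 5.36),
    12: (2.46, 19.74, 4.04, 7.39),
    16: (3.08, 20.53, 5.20, 9.37),
    20: (3.69, 21.32, 6.37, 11.40),
    23: (4.16, 22.00, 7.24, 12.89),
}
SYNC_DELAY, SYNC_ENERGY, ASYNC_DELAY, ASYNC_ENERGY = range(4)
SYNC_CLOCK_NS = 4.16  # the synchronous design must always allow the worst case


def interpolate(chain: int, column: int) -> float:
    """Table value for a chain length in 0..23, linearly interpolated (assumption)."""
    if chain in TABLE_1:
        return TABLE_1[chain][column]
    lower = max(k for k in TABLE_1 if k < chain)
    upper = min(k for k in TABLE_1 if k > chain)
    t = (chain - lower) / (upper - lower)
    return TABLE_1[lower][column] + t * (TABLE_1[upper][column] - TABLE_1[lower][column])


def weighted_average(histogram: dict, column: int) -> float:
    """Average of one table column, weighted by comparison counts."""
    total = sum(histogram.values())
    return sum(n * interpolate(c, column) for c, n in histogram.items()) / total


def summarize(histogram: dict):
    """Return (speedup over the synchronous clock, async/sync energy ratio)."""
    async_delay = weighted_average(histogram, ASYNC_DELAY)
    energy_ratio = (weighted_average(histogram, ASYNC_ENERGY)
                    / weighted_average(histogram, SYNC_ENERGY))
    return SYNC_CLOCK_NS / async_delay, energy_ratio


# Invented counts per chain length, for illustration only.
HYPOTHETICAL_HISTOGRAM = {0: 5, 2: 10, 4: 20, 6: 30, 8: 20, 10: 10, 14: 4, 23: 1}
speedup, energy_ratio = summarize(HYPOTHETICAL_HISTOGRAM)
print(f"speedup {speedup:.2f}x, energy ratio {energy_ratio:.2f}")
```

As a sanity check on the reported figures, the stated average asynchronous latency of 2.49 ns against the 4.16 ns synchronous period gives 4.16 / 2.49 ≈ 1.67.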

5. Conclusion

We have presented a fast, energy-efficient z-comparator that is able to take advantage of the fact that typically only a small number of bits need to be examined to make a depth comparison. Key to the low power and high performance is the use of asynchronous logic to facilitate a “compute-on-demand” paradigm, which we believe could also be of benefit for other arithmetic circuits in the graphics pipeline.

Acknowledgments

Financial support was provided by an ATI Research fellowship, an IBM Faculty Development Award, and National Science Foundation grants CCF-0306478 and CCF-0205425. Equipment was provided by NSF grant CNS-0303590.

References

[BJN99] BERKEL C. H. K. V., JOSEPHS M. B., NOWICK S. M.: Scanning the technology: Applications of asynchronous circuits. Proc. of the IEEE 87, 2 (Feb. 1999), 223–233.

[ECKM05] EKANAYAKE V. N., CLINTON KELLY I., MANOHAR R.: Bitsnap: Dynamic significance compression for a low-energy sensor network asynchronous processor. In ASYNC (2005), IEEE Computer Society, pp. 144–154.

[KS95] KNITTEL G., SCHILLING A.: Eliminating the z-buffer bottleneck. In EDTC ’95: Proceedings of the 1995 European conference on Design and Test (Washington, DC, USA, 1995), IEEE Computer Society, p. 12.

[Mes] Mesa3d graphics library. http://mesa3d.org.

[NYB97] NOWICK S. M., YUN K. Y., BEEREL P. A.: Speculative completion for the design of high-performance asynchronous dynamic adders. In ASYNC (Apr. 1997), pp. 210–223.

[PKEG04] PONOMAREV D., KUCUK G., ERGIN O., GHOSE K.: Energy efficient comparators for superscalar datapaths. IEEE Transactions on Computers 53, 7 (July 2004), 892–904.

[RSG∗99] ROTEM S., STEVENS K., GINOSAR R., BEEREL P., MYERS C., YUN K., KOL R., DIKE C., RONCKEN M., AGAPIEV B.: RAPPID: An asynchronous instruction length decoder. In ASYNC (Apr. 1999), pp. 60–70.

[STR∗02] SINGH M., TIERNO J. A., RYLYAKOV A., RYLOV S., NOWICK S. M.: An adaptively-pipelined mixed synchronous-asynchronous digital FIR filter chip operating at 1.3 GigaHertz. In ASYNC (Manchester, UK, Apr. 2002).

[WE93] WESTE N., ESHRAGHIAN K.: Principles of CMOS VLSI Design, a Systems Perspective, second ed. Addison-Wesley Publishing Co., 1993.

[WLWW03] WANG C.-C., LEE P.-M., WU C.-F., WU H.-L.: High fan-in dynamic CMOS comparators with low transistor count. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications (2003).

[YBV∗97] YUN K. Y., BEEREL P. A., VAKILOTOJAR V., DOOPLY A. E., ARCEO J.: The design and verification of a high-performance low-control-overhead asynchronous differential equation solver. In ASYNC (Apr. 1997), pp. 140–153.


Copyright © 2005 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail [email protected]. © 2005 ACM 1-59593-086-8/05/0007 $5.00
