
Impact of Coarse-Grained Power Gate Placement on a Fine-Grained System Design

Shylesh Umapathy and Aaron Stillmaker
Electrical and Computer Engineering Department
California State University, Fresno
Fresno, CA, USA

shylesh [email protected], [email protected]

Abstract—With the 54th anniversary of Moore's law, the intense development of VLSI technology has permitted more and more IP components to be integrated on a single chip. However, factors such as power consumption have been limiting this growth rate. Low-power techniques such as clock gating, power gating, dynamic voltage and frequency scaling, body biasing, and many more have emerged as potential solutions. This paper explores the power gating technique and presents the design trade-offs between the ring and grid styles of power gate placement in a fine-grained system design. The study used 24 physical designs of 12 different-sized MAC units, ranging from 44 to 320 input bits, and extracted various parameters. The results show that the ring style of placement gives an average increase in IR drop of 9.59% compared to the grid style for MAC units with 128 to 320 input bits. The grid style incurs an additional average congestion of 1.66% compared to the ring style for MAC units with 192 to 320 input bits.

Index Terms—power gating, fine-grained, IR drop, congestion

I. INTRODUCTION

Device size for mainstream chip fabrication will soon reach 7 nm [1], allowing for multiple billions of transistors on a chip and ever-increasing performance. With the current interest in big data applications, the power demand for big data centers has increased dramatically, leading to system-level power saving techniques [2]. It is estimated that in 2020, data centers in the United States will consume over 73 TWh [3]. This, as well as the current trend to design large processing arrays [4], has created a demand for low-power techniques at a fine-grained processing level [5].

In deep sub-micron transistors, static power consumption has become a significant factor in total chip power consumption [6], [7]. Clock gating, power gating, dynamic body biasing, and supply voltage ramping all provide methods for reducing static power [8]. This paper explores the design trade-offs of using the grid and ring methods of power gate placement in a fine-grained digital hardware design.

The organization of the paper is as follows. Section II discusses the background and previous works related to power gates. Section III provides details on the implementation using Synopsys EDA tools and the NanGate FreePDK45 open cell library [9]. The results and analysis from 24 separate physical designs for different multiply-accumulate (MAC) unit sizes and power gate placements are presented in Section IV.

Fig. 1. Plot showing how power gates are controlled with sleep signals and are able to control the current into the core's local power grid. [Figure: header power gate feeding the core, with panels for Sleep = 0 (I > 0) and Sleep = 1 (I = 0).]

II. BACKGROUND AND PREVIOUS WORKS

A power gate is a device that removes the majority of static power in a system by cutting off the power supply (Vdd) or ground (Vss) of the design while it is idle; the power gate is termed a header or a footer, respectively [8]. Fig. 1 represents the PMOS header power gate configuration, which is used to control the power supply to the core. The power gate is controlled by the sleep signal.

Power gates can be placed around the perimeter of the design, in a ring pattern, or in rows and columns inside the design, in a grid pattern. Figs. 2 & 3 represent the placement of power gates in the grid and ring manner, respectively, with the power gates highlighted as white squares. There are many factors to be considered while designing and placing these power gates in a design/module. When the module needs to come out of idle, the entire module should not be turned on at once, as this would lead to a significant rush current. The number of power gates in a module is based on a desired maximum IR drop or voltage droop and ramp-up time for the design.
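As a rough illustration of how a gate count can follow from an IR-drop budget, the sketch below treats every header as an identical on-resistance in parallel, so the drop across the gate network is approximately I * R_on / N. The function name, the peak current, and the per-gate on-resistance are hypothetical values chosen only for illustration; they are not taken from this work, and a real flow would also account for ramp-up time and power grid resistance.

```python
import math


def min_power_gates(peak_current_a, r_on_ohm, max_drop_v):
    """Smallest number of identical parallel header gates that keeps
    I * (R_on / N) at or below the allowed drop (first-order model)."""
    if peak_current_a <= 0 or r_on_ohm <= 0 or max_drop_v <= 0:
        raise ValueError("all arguments must be positive")
    return max(1, math.ceil(peak_current_a * r_on_ohm / max_drop_v))


# Hypothetical numbers: 20 mA peak current, 500 ohm per gate, 150 mV budget.
example_gate_count = min_power_gates(0.020, 500.0, 0.150)
print(example_gate_count)
```

Under this model, halving the allowed drop roughly doubles the number of gates, which is one reason the IR-drop target and the area overhead have to be traded against each other.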

Power gates and other low-power design methods are commonly researched and used in academia and industry. AMD commonly employs clock gating, multiple threshold voltage cells, and power gating in their modern processors to reduce power consumption [10]. Research has been performed on floorplanning using Sequence Pair and B*-tree engines [11], where area is reserved for power gates, resulting in a better floorplan and less dead space. Lin et al. [12] explored the advantages of using dual gating compared to the usage of either a header or a footer gate and saw that the leakage power and short-circuit power were reduced by 86% and 99%, respectively. DRPA [13] explored power gating techniques and the benefits of power gating individual functional units. In Usami et al. [14], the power gate is controlled based on energy characterization and code profiling using leakage monitors and the OS. This approach resulted in a reduction of energy dissipation by 15% compared to the conventional approach. In [15], the worst-case noise is estimated considering all the current sources, inductance, and decap effects, and an incremental scheduling procedure is used to wake up the power-gated blocks; this technique reduced the power/ground noise by 50%. A Fully Depleted Silicon-On-Insulator (FDSOI) device has been used [16], in which the threshold voltage was dynamically controlled through back gate bias, allowing the device to perform the function of both a logic device and a sleep transistor, hence eliminating the need for separate sleep transistors. This feature removes the additional area, delay, power, and other overheads compared to conventional designs. Nanoelectromechanical systems (NEMS) can be used to create power gates that consume 30% less power and reduce area overhead by 36-83% when compared to CMOS [17]. The wide band-gap Graphene Nano-Ribbon Field-Effect Transistor (GNRFET) used as a footer sleep transistor reduced the leakage power, delay time, and wake-up time by 88%, 44%, and 24%, respectively, compared to a MOSFET transistor [18].

Fig. 2. Plot of a placed and routed design from Synopsys IC Compiler of a 128-bit MAC unit with power gates placed in a grid pattern. The power gates are shown as white boxes.

There is currently no work in the literature that explores the impact of the placement of power gates in a fine-grained functional unit module. There are two popular placement techniques, ring and grid, but no exploration of the design trade-offs between the two methods, which this paper provides.

Fig. 3. Plot of a placed and routed design from Synopsys IC Compiler of a 128-bit MAC unit with power gates placed in a ring pattern. The power gates are shown as white boxes.

III. IMPLEMENTATION

A MAC unit was chosen as a computational macro block because it is comparable in size to a fine-grained processor core on a many-core array and is easy to vary in size. Power gates were placed in ring and grid styles and studied. We set out to determine the design trade-offs of the two power gate placement methods for different-sized functional units, so a range of MAC units, from 44 input bits to 320 input bits, were completely designed using a full Synopsys RTL-to-GDS flow, for 24 test designs in total. A design flow based on the Unified Power Format (UPF) was used for creating a single power domain and inserting the power gates in the ring and grid manner. To make sure that the IR drop of just the power gates was captured, a robust power grid was designed for each design.

The RTL-to-GDS flow was performed by first writing a behavioral RTL description in Verilog. Synopsys Design Compiler (DC) was used to perform standard cell synthesis using the NanGate FreePDK45 45 nm open cell library [9], iteratively targeting minimized clock periods, with characteristic timing constraints and our defined UPF specifications. The generated gate-level netlist and SDC constraint file from DC, along with the physical NanGate libraries, were input into Synopsys Integrated Circuit (IC) Compiler, and a complete physical flow was performed to iteratively create a floorplan, place power rails and power gates, place the standard cells, route, perform CTS, and run optimizations before streaming out a final GDS file. Each iteration was carried out for a different-sized MAC unit, using a PMOS header cell, as shown in Fig. 4, to switch the power supply to the design. In order to keep relative consistency, the floorplanned chip area was increased at a cubic rate based on the number of input bits for each design.

Fig. 4. Transistor-level diagram showing the internals of a common header standard cell, such as the one used.

The supply voltage (Vdd) was set to 1.25 V, and the number of power gates for the design was calculated to ensure that the voltage supplied through the power gates, the virtual Vdd (Vvdd), was not below 1.1 V. The sleep signal was sent to the first power gate, and the resulting sleepout signal was then fed to the subsequent power gates in a daisy-chain fashion to minimize the rush current and voltage droop. The area and target utilization of the MAC unit were constant for a given number of input bits across the ring and grid placement designs, and relatively constant for different-sized designs. Metal layers 1 through 8 were used for routing.
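The daisy-chained wake-up described above can be pictured with a small timing sketch. It assumes, purely for illustration, that every gate adds the same fixed delay before passing the sleep edge to the next gate; the 100 ps stage delay and the function names are hypothetical, while the count of 54 gates is the value reported for the 128-bit design in Table I. Times are kept as integer picoseconds to avoid floating-point rounding in the division.

```python
def gates_on(t_ps, n_gates, stage_delay_ps):
    """Number of gates that have received the sleep edge by time t_ps,
    when gate k (0-based) receives it at k * stage_delay_ps."""
    if t_ps < 0:
        return 0
    return min(n_gates, t_ps // stage_delay_ps + 1)


def wake_latency_ps(n_gates, stage_delay_ps):
    """Time at which the last gate in the chain receives the sleep edge."""
    return (n_gates - 1) * stage_delay_ps


# 54 gates (128-bit design, Table I) with an assumed 100 ps per stage.
for t in (0, 250, 1000, 6000):
    print(t, gates_on(t, 54, 100))
print(wake_latency_ps(54, 100))
```

The staggering limits how many gates start conducting at once, at the price of a wake-up latency that grows linearly with the length of the chain.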

IV. RESULTS AND ANALYSIS

Results from different-sized MAC units extracted from IC Compiler are shown in Table I, where each test was run on a single MAC ASIC, simply with varying data widths. The area of the MAC unit grows cubically with the input bits, and the number of power gates used for a design ensured the voltage supplied to the design was above 1.1 V. Maximum clock frequency and power consumption were negligibly affected by power gate placement. IR drop and routing congestion were impacted by the placement of the power gates, as shown in Figs. 5 & 6.

As seen in Fig. 5 and Table I, the worst-case IR drop encountered in the ring and grid styles of power gate placement was the same for MAC units with up to 46 input bits. From 48 bits to 96 bits, the ring style incurred 1.1% more IR drop than grid, and from 128 bits onwards, the average IR drop in the ring style increased by 9.59% as the area increased. In ring-style power gate placement, shown in Fig. 3, as the size of the MAC unit increases, the wire distance between the center of the design and the power gates increases. This increase in wire distance leads to an increased IR drop in the ring style. In the grid style, shown in Fig. 2, the power gates are placed along with the design cells, hence the worst-case IR drop is comparatively lower. The increased IR drop in the ring style can be reduced by adding more parallel power gates to the design, but this approach would add area overhead when compared to the grid style of placement.
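The averages quoted above can be approximately recomputed from the IR drop columns of Table I. The text does not state how the percentage is normalized; the sketch below assumes the difference is taken relative to the ring value, which gives results close to the reported figures. The function name is hypothetical.

```python
# IR drop values (mV) from Table I, indexed by number of input bits.
bits = [44, 46, 48, 64, 96, 128, 160, 192, 224, 256, 288, 320]
ring_ir = [14.75, 14.93, 15.20, 19.05, 29.14, 38.42,
           49.47, 62.55, 75.30, 91.40, 98.90, 110.82]
grid_ir = [14.75, 14.93, 15.19, 19.00, 28.27, 35.04,
           44.54, 56.71, 67.82, 82.40, 90.02, 99.30]


def mean_ring_excess_pct(lo_bits, hi_bits):
    """Mean of (ring - grid) / ring, in percent, over sizes in
    [lo_bits, hi_bits]. Normalizing by the ring value is an assumption."""
    vals = [100.0 * (r - g) / r
            for b, r, g in zip(bits, ring_ir, grid_ir)
            if lo_bits <= b <= hi_bits]
    return sum(vals) / len(vals)


small_range = mean_ring_excess_pct(48, 96)    # compare with the 1.1% above
large_range = mean_ring_excess_pct(128, 320)  # compare with the 9.59% above
print(round(small_range, 2), round(large_range, 2))
```

Small differences from the reported averages are expected, since the table values are rounded to two decimal places.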

Fig. 5. Plot of IR drop for ring and grid styles of power gate placement for MAC units of varied size. [Plot: x-axis, size of MAC unit (number of input bits); y-axis, IR drop (mV); series, ring and grid. The plotted values are the IR drop columns of Table I.]

Fig. 6. Plot of congestion for ring and grid styles of power gate placement for MAC units of varied size. [Plot: x-axis, size of MAC unit (input bits); y-axis, congestion (%); series, ring and grid. The plotted values are the congestion columns of Table I.]

TABLE I
RESULTS FOR MAC UNITS OF DIFFERENT SIZES UTILIZING RING AND GRID STYLE OF POWER GATES PLACEMENT.

| Number of Input Bits | Area (µm²) | Number of Power Gates | IR Drop (mV), Ring | IR Drop (mV), Grid | Congestion (%), Ring | Congestion (%), Grid | Cost (Eq. 1), Ring | Cost (Eq. 1), Grid |
|---|---|---|---|---|---|---|---|---|
| 44 | 8,658.299 | 8 | 14.75 | 14.75 | 1.01 | 1.01 | 0.0122 | 0.0122 |
| 46 | 9,498.859 | 9 | 14.93 | 14.93 | 1.10 | 1.12 | 0.0134 | 0.0137 |
| 48 | 10,435.445 | 10 | 15.20 | 15.19 | 1.15 | 1.19 | 0.0143 | 0.0148 |
| 64 | 16,628.724 | 15 | 19.05 | 19.00 | 1.29 | 1.35 | 0.0201 | 0.0210 |
| 96 | 36,492.539 | 32 | 29.14 | 28.27 | 2.60 | 2.92 | 0.0619 | 0.0675 |
| 128 | 62,257.300 | 54 | 38.42 | 35.04 | 3.40 | 3.78 | 0.1068 | 0.1083 |
| 160 | 94,127.824 | 82 | 49.47 | 44.54 | 4.04 | 4.82 | 0.1634 | 0.1755 |
| 192 | 137,059.426 | 119 | 62.55 | 56.71 | 4.69 | 5.90 | 0.2398 | 0.2735 |
| 224 | 193,158.560 | 168 | 75.30 | 67.82 | 5.70 | 7.02 | 0.3508 | 0.3891 |
| 256 | 248,548.804 | 215 | 91.40 | 82.40 | 6.98 | 8.62 | 0.5215 | 0.5806 |
| 288 | 314,784.932 | 274 | 98.90 | 90.02 | 7.87 | 9.38 | 0.6362 | 0.6902 |
| 320 | 363,136.551 | 316 | 110.82 | 99.30 | 8.42 | 11.04 | 0.7627 | 0.8960 |

Fig. 7. Plot of the normalized cost function (1) for ring and grid styles of power gate placement for MAC units of varied size. [Plot: x-axis, size of MAC unit (input bits); y-axis, cost function; series, ring and grid.]

From Fig. 6, the routing congestion encountered in the grid and ring styles is the same for the 44-bit input MAC unit, but from 46 bits onward the difference increases and congestion gets worse for the grid style of power gate placement. The reason is that the power gates are placed inside the design/module, as seen in Fig. 2. The power gates are mixed with the design cells, which impacts placement and routing, leading to an increase in routing congestion. As the size of the MAC unit increases, the number of power gates needed also increases, and hence the congestion gets worse in grid when compared to ring. In ring placement, power gates are placed at the perimeter of the design with minimal disturbance to the placement and routing of the design cells. The increased congestion can be reduced in the grid style by spacing out the cells, but this would lead to an increase in the die size and also reduce the core utilization when compared to the ring style. In general, routing congestion can be reduced by using higher metal layers, but that would lead to an increase in the IR drop.
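The congestion trend discussed above can be checked against the congestion columns of Table I. The sketch below finds the smallest design at which the two styles diverge and averages the grid-minus-ring gap, in percentage points, over a size range; treating the reported additional congestion as a simple difference in percentage points is an assumption, and the helper names are hypothetical.

```python
# Congestion values (%) from Table I, indexed by number of input bits.
bits = [44, 46, 48, 64, 96, 128, 160, 192, 224, 256, 288, 320]
ring_cong = [1.01, 1.10, 1.15, 1.29, 2.60, 3.40,
             4.04, 4.69, 5.70, 6.98, 7.87, 8.42]
grid_cong = [1.01, 1.12, 1.19, 1.35, 2.92, 3.78,
             4.82, 5.90, 7.02, 8.62, 9.38, 11.04]


def first_divergence():
    """Smallest input width at which grid and ring congestion differ."""
    for b, r, g in zip(bits, ring_cong, grid_cong):
        if g != r:
            return b
    return None


def mean_grid_excess_points(lo_bits, hi_bits):
    """Mean of (grid - ring), in percentage points, over sizes in
    [lo_bits, hi_bits]."""
    gaps = [g - r for b, r, g in zip(bits, ring_cong, grid_cong)
            if lo_bits <= b <= hi_bits]
    return sum(gaps) / len(gaps)


diverges_at = first_divergence()
large_gap = mean_grid_excess_points(192, 320)
print(diverges_at, round(large_gap, 2))
```

With the tabulated values, the gap widens steadily with design size, which matches the explanation that more embedded power gates disturb placement and routing more.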

IR drop and congestion percentage were normalized to create a cost function, used in Table I and Fig. 7, as detailed in (1), where each of the grid and ring placements was normalized to the maximum between the two. The plot indicates that, when normalized, the cost incurred by the ring and grid styles is the same at 44 bits, but from 46 bits onward the grid style of power gate placement starts to worsen. As indicated previously, ring placement results in an increased IR drop, and grid placement impacts congestion. The relative difference in congestion percentage impacts the cost function results at larger sizes, which shows the grid style has a higher normalized relative cost than ring.

$$\text{cost} = \frac{\text{IRDrop}}{\max(\text{IRDrop})} \times \frac{\text{Congestion}}{\max(\text{Congestion})} \tag{1}$$
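As a worked instance of (1), the sketch below recomputes the cost columns of Table I from the IR drop and congestion columns, taking each maximum over both placement styles together, as the text describes. The function and variable names are illustrative only.

```python
# IR drop (mV) and congestion (%) values from Table I.
bits = [44, 46, 48, 64, 96, 128, 160, 192, 224, 256, 288, 320]
ring_ir = [14.75, 14.93, 15.20, 19.05, 29.14, 38.42,
           49.47, 62.55, 75.30, 91.40, 98.90, 110.82]
grid_ir = [14.75, 14.93, 15.19, 19.00, 28.27, 35.04,
           44.54, 56.71, 67.82, 82.40, 90.02, 99.30]
ring_cong = [1.01, 1.10, 1.15, 1.29, 2.60, 3.40,
             4.04, 4.69, 5.70, 6.98, 7.87, 8.42]
grid_cong = [1.01, 1.12, 1.19, 1.35, 2.92, 3.78,
             4.82, 5.90, 7.02, 8.62, 9.38, 11.04]

# Maxima are taken across ring and grid together.
ir_max = max(ring_ir + grid_ir)
cong_max = max(ring_cong + grid_cong)


def normalized_cost(ir, cong):
    """Equation (1): (IR / max IR) * (congestion / max congestion)."""
    return [(i / ir_max) * (c / cong_max) for i, c in zip(ir, cong)]


ring_cost = normalized_cost(ring_ir, ring_cong)
grid_cost = normalized_cost(grid_ir, grid_cong)
for b, rc, gc in zip(bits, ring_cost, grid_cost):
    print(b, round(rc, 4), round(gc, 4))
```

Because the largest IR drop comes from the ring style and the largest congestion from the grid style, neither style reaches a cost of 1 at 320 bits.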

V. CONCLUSION

This work shows the trade-offs between using ring and grid power gate placement for different-sized system designs. The power gate placement style had negligible impact on the area, power, and clock period, but showed a significant change in IR drop and congestion. The ring style of power gate placement resulted in worse IR drop when compared to grid due to the long wire distance between the power gates and the center of the design. The grid style of power gates posed placement and routing congestion that was worse than the ring style, as the power gates were mixed with the regular design cells. When both the IR drop and congestion were normalized, a cost function indicates that the choice of ring or grid style makes no difference for the 44-bit MAC unit, but the cost incurred by the grid style of placement starts to get worse from 46 bits onward.

ACKNOWLEDGMENT

This work was supported by the California State University, Fresno Graduate Student Research and Creative Activities Support Award and the California State University, Fresno Foundation.

REFERENCES

[1] D. Ha, C. Yang, J. Lee, S. Lee, S. H. Lee, K. . Seo, H. S. Oh, E. C. Hwang, S. W. Do, S. C. Park, M. . Sun, D. H. Kim, J. H. Lee, M. I. Kang, S. . Ha, D. Y. Choi, H. Jun, H. J. Shin, Y. J. Kim, J. Lee, C. W. Moon, Y. W. Cho, S. H. Park, Y. Son, J. Y. Park, B. C. Lee, C. Kim, Y. M. Oh, J. S. Park, S. S. Kim, M. C. Kim, K. H. Hwang, S. W. Nam, S. Maeda, D. . Kim, J. . Lee, M. S. Liang, and E. S. Jung, “Highly manufacturable 7nm FinFET technology featuring EUV lithography for low power and high performance applications,” in 2017 Symposium on VLSI Technology, June 2017, pp. T68–T69.

[2] H. Liu, B. Liu, L. T. Yang, M. Lin, Y. Deng, K. Bilal, and S. U. Khan, “Thermal-aware and DVFS-enabled big data task scheduling for data centers,” IEEE Transactions on Big Data, vol. 4, no. 2, pp. 177–190, June 2018.

[3] A. Shehabi, S. Smith, D. Sartor, R. Brown, M. Herrlin, J. Koomey, E. Masanet, N. Horner, I. Azevedo, and W. Lintner, “United States data center energy usage report,” Lawrence Berkeley National Laboratory, Berkeley, CA, Tech. Rep. LBNL-1005775, June 2016.

[4] B. Bohnenstiehl, A. Stillmaker, J. Pimentel, T. Andreas, B. Liu, A. Tran, E. Adeagbo, and B. Baas, “KiloCore: A fine-grained 1,000-processor array for task-parallel applications,” IEEE Micro, vol. 37, no. 2, pp. 63–69, Mar 2017.

[5] W. H. Cheng and B. M. Baas, “Dynamic voltage and frequency scaling circuits with two supply voltages,” in IEEE Intl. Symp. on Circuits and Systems, May 2008, pp. 1236–1239.

[6] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed. USA: Addison-Wesley Publishing Company, 2010.

[7] A. Stillmaker and B. Baas, “Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm,” Integration, the VLSI Journal, vol. 58, pp. 74–81, 2017.

[8] J. Rabaey, Low Power Design Essentials, 1st ed. Springer Publishing Company, Incorporated, 2009.

[9] (2008) NanGate 45nm open cell library. [Online]. Available: http://www.nangate.com/?page id=2325

[10] R. Jotwani, S. Sundaram, S. Kosonocky, A. Schaefer, V. F. Andrade, A. Novak, and S. Naffziger, “An x86-64 core in 32 nm SOI CMOS,” IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 162–172, Jan 2011.

[11] C.-Y. Yeh, H.-M. Chen, L.-D. Huang, W.-T. Wei, C.-H. Lu, and C.-N. Liu, “Using power gating techniques in area-array SoC floorplan design,” in 2007 IEEE International SOC Conference, Sept 2007, pp. 233–236.

[12] T. Lin, K. Chong, B. Gwee, and J. S. Chang, “Fine-grained power gating for leakage and short-circuit power reduction by using asynchronous-logic,” in 2009 IEEE International Symposium on Circuits and Systems, May 2009, pp. 3162–3165.

[13] Y. Saito, T. Shirai, T. Nakamura, T. Nishimura, Y. Hasegawa, S. Tsutsumi, T. Kashima, M. Nakata, S. Takeda, K. Usami, and H. Amano, “Leakage power reduction for coarse grained dynamically reconfigurable processor arrays with fine grained power gating technique,” in 2008 International Conference on Field-Programmable Technology, Dec 2008, pp. 329–332.

[14] K. Usami, M. Kudo, K. Matsunaga, T. Kosaka, Y. Tsurui, W. Wang, H. Amano, H. Kobayashi, R. Sakamoto, M. Namiki, M. Kondo, and H. Nakamura, “Design and control methodology for fine grain power gating based on energy characterization and code profiling of microprocessors,” in 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 2014, pp. 843–848.

[15] H. Jiang and M. Marek-Sadowska, “Power gating scheduling for power/ground noise reduction,” in 2008 45th ACM/IEEE Design Automation Conference, June 2008, pp. 980–985.

[16] E. Ashenafi and M. H. Chowdhury, “A new power gating circuit design approach using double-gate FDSOI,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 8, pp. 1074–1078, Aug 2018.

[17] M. B. Henry, “Emerging power-gating techniques for low power digital circuits,” Ph.D. dissertation, Virginia Tech, Nov 2011.

[18] H. E. El-hmaily, R. Ezz-Eldin, A. I. A. Galal, and F. A. Hesham Hamed, “GNRFET/MOSFET conjunction power gating structures,” in 2018 IEEE 61st International Midwest Symposium on Circuits and Systems (MWSCAS), Aug 2018, pp. 21–24.
