Post on 25-Nov-2021
Low power memory controller subsystem IP exploration using RTL power flow
An End-to-end power analysis and reduction methodology
NEERAJNAYAN BALACHANDRAN
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND LEVEL
STOCKHOLM, SWEDEN 2019
KTH ROYAL INSTITUTE OF TECHNOLOGY
ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
2020-06-25
Master’s Thesis
Examiner
Prof. Ahmed Hemani
Academic adviser
Dimitrios Stathis
Industrial adviser
Ioannis Savvidis
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science (EECS)
Department of Electrical Engineering
SE-100 44 Stockholm, Sweden
Abstract
FinFET-based Application Specific Integrated Circuit (ASIC) designs deliver on the promises of
scalability, performance, and power, but the road ahead is bumpy with technical challenges in building
efficient ASICs. Designers can no longer rely on the ‘auto-scaling’ power reduction that follows
technology node scaling, now that 7nm presents itself as a ‘long-lived’ node. This creates the need for
early power analysis and reduction flows that are incorporated into the ASIC Intellectual Property (IP)
design flow, and for a focus on designs that are power-efficient in addition to being functionally
efficient. Power-inefficiency hotspots are a leading cause of chip re-spins, and a guideline
methodology for designing blocks in a power-efficient manner leads to power-efficient Integrated
Circuits (ICs). This eases cooling requirements and reduces cost. The Common Memory Controller is
one of the leading consumers of power in the ASIC designs at Ericsson. This Thesis focuses on
developing a power analysis and reduction flow for the Common Memory Controller by connecting the
verification environment of the block to low-level power analysis tools and using motivated test cases
to collect power metrics. This serves the two main goals of the Thesis: characterization and
optimization of the block for power. The work also includes an energy efficiency perspective through
the Differential Energy Analysis technique, initiated by Qualcomm and Ansys, which improves the flow
by improving the test cases that help uncover power inefficiencies/bugs and thereby optimize the
block. The flow developed in the Thesis fulfills the goals of characterizing and optimizing the block.
The characterization data is presented to give an idea of the type of data that can be collected and
that is useful to SoC architects and designers in planning future designs. The
characterization/profiling data collected from the blocks collectively contributes to the Electronic
System-level power analysis that helps correlate the ASIC power estimate to silicon. The work also
validates the flow by working on a specific sub-block: identifying possible power bugs, modifying the
design, and validating the improved performance.
Keywords
Power analysis, Block characterization, Optimization, Differential Energy analysis, Dynamic Power,
Clock gating.
Sammanfattning
With FinFET-based Application Specific Integrated Circuit (ASIC) designs promising scalability,
performance, and power, the road ahead is uneven with technical challenges in building efficient
ASICs. Designers can no longer rely on the "auto-scaling" power reduction that follows technology
node scaling, at a time when 7nm presents itself as a "long-lived" node. This leads to the need for
early power analysis and reduction flows that are integrated into the ASIC Intellectual Property (IP)
design flow, and to a focus on energy-efficient design in addition to functional efficiency.
Power-efficiency-related hotspots are the leading causes of chip re-spins, and a guideline
methodology for designing blocks in an energy-efficient way leads to energy-efficient design of
Integrated Circuits (ICs). This eases the intensity of the cooling requirements and the cost. The
Common Memory controller is one of the leading energy consumers in the ASIC designs at Ericsson.
This thesis focuses on developing a power analysis and reduction flow for the common memory
controller by connecting the verification environment of the block to low-level power analysis tools,
using motivated test cases to collect power metrics, which leads to the two main goals of the thesis:
characterization and optimization of the block for power. This work also includes an energy
efficiency perspective through the Differential Energy Analysis technique, initiated by Qualcomm and
Ansys, to improve the flow by improving the test cases that help uncover power inefficiencies/bugs
and thereby optimize the block. The flow developed in the thesis fulfills the goals of characterizing
and optimizing the block. Characterization data is presented to give an idea of the type of data that
can be collected and be useful to SoC architects and designers in planning future designs. The
characterization/profiling data collected from the blocks collectively contributes to the electronic
system-level power analysis that helps correlate the ASIC power estimate to silicon. The work also
validates the flow by working on a specific sub-block, identifying possible power bugs, modifying the
design, and validating improved performance, thereby validating the flow.
Nyckelord
Power analysis, Block characterization, Optimization, Differential Energy analysis, Dynamic Power,
Clock gating
Acknowledgments
I would like to thank my supervisor at Ericsson, Ioannis Savvidis, for his continuous and motivating
support and for providing the right preliminary platform to get straight into the thesis. All my
technical discussions with him have contributed significantly to my knowledge and understanding of
the thesis and the area of work. I would like to thank Prof. Ahmed Hemani for his valuable insights
during the different stages and corresponding presentations of the Thesis progress. Every checkpoint
with Prof. Hemani has helped guide the eventual progress of the thesis. I would like to thank Pierre
Rohdin G, manager at Ericsson, for his continuous support in setting up the whole environment for
the thesis and pointing me to the right resources to get the work going. Thanks also to the
Infrastructure team at Ericsson, Mikael Carlsson, Jonas Nyman, Haoge Liu, and Suleiman
Abukharmeh, for their timely input and feedback that helped drive different parts of the thesis.
Thanks to my supervisor at KTH, Dimitrios Stathis, for his continuous support through review
meetings, feedback, and pointers for improvement.
Stockholm, May 2020
Neerajnayan Balachandran
Table of contents
Abstract
  Keywords
Sammanfattning
  Nyckelord
Acknowledgments
Table of contents
List of Figures
List of Tables
List of acronyms and abbreviations
1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Goals
  1.5 Research Methodology
  1.6 Delimitations
  1.7 Structure of the Thesis
2 Background
  2.1 Power consumption in ASICs
    2.1.1 Dynamic Power
    2.1.2 Static Power
    2.1.3 Power Bugs
  2.2 Software tools used – Introduction and relevance
    2.2.1 ActivityExplorer (VCD2RPT++ and VCD2TB)
    2.2.2 Spyglass Power
    2.2.3 PrimeTimePX
  2.3 Dynamic Power reduction and Clock gating
    2.3.1 Clock Gating
    2.3.2 Clock gating performance metrics
  2.4 Framework development environment
    2.4.1 Common Memory Controller
    2.4.2 Performance Analysis framework
    2.4.3 IP Design flow – Introducing a Power analysis flow
3 Power analysis and reduction framework development methodology
  3.1 Power Test cases
    3.1.1 Power test case/Stimuli knobs
    3.1.2 Activity Analysis – Test case characterization and understanding
    3.1.3 Differential Energy Analysis – Test case tuning for optimization
  3.2 RTL based early Power Analysis and Optimization
    3.2.1 Characterization of the block
    3.2.2 Analysis and Optimization flow
  3.3 Netlist based power analysis
    3.3.1 VCD2TB – Netlist simulation using Dump, Convert, Replay
    3.3.2 PrimeTimePX – Gate level sign-off power estimation
  3.4 Cache block analysis – A sample analysis & optimization
    3.4.1 Problem
    3.4.2 Solution and validation
  3.5 Summary – An end-to-end power analysis and reduction flow
4 Results and analysis
  4.1 Characterization results for Common Memory Controller (CMC)
  4.2 Power analysis results for optimization
5 Conclusion and Future
References
List of Figures
Figure 1.1: Synopsys Global User survey results presented at DAC 2019
Figure 2.1: Understanding switching power in CMOS switch
Figure 2.2: ActivityExplorer GUI
Figure 2.3: Simplified Activity Analysis flow
Figure 2.4: Gate level replay simulation and activity analysis flow
Figure 2.5: Spyglass Power analysis flow - goals and steps
Figure 2.6: PrimeTimePX Analysis Flow
Figure 2.7: Widely used techniques for Power reduction and control [14]
Figure 2.8: Power savings and accuracy attainable at different levels of abstraction [14]
Figure 2.9: Possible synthesis of clock gating
Figure 2.10: Understanding SCGE
Figure 2.11: Understanding DCGE
Figure 2.12: Understanding ROADF
Figure 2.13: Understanding ROADE
Figure 2.14: A generic ASIC IP Design methodology [16]
Figure 2.15: Early Power perspective to IP design flow
Figure 3.1: Activity Profile of a typical use-case for the block
Figure 3.2: Visualizing port exclusivity knob using VCD2RPT++
Figure 3.3: Simplified view of test case tuning
Figure 3.4: Understanding Differential Energy Analysis - uncovering inefficiencies, Image from [17]
Figure 3.5: Inferences based on the scenarios of energy difference between the two test cases [17]
Figure 3.6: Differential Energy Analysis Implementation flow
Figure 3.7: Differential Energy Analysis - Flow sequence for redundancy localization
Figure 3.8: Analysis and problem localization using Differential Energy Analysis
Figure 3.9: Purpose-to-use case/operating point correlation to utilize the flow for ASIC IP blocks
Figure 3.10: A representation of power-based operating points of an ASIC IP Block
Figure 3.11: Dynamic power optimization implementation methodology
Figure 3.12: Using VCD2TB for gate-level simulation
Figure 3.13: Methodology of identifying power inefficiency in CMC - VCD2RPT++ screenshot
Figure 3.14: FIFO RTL without Clock gating for comparison
Figure 3.15: FIFO RTL with Clock Gating
Figure 3.16: Validation and signoff analysis of RTL improvements
Figure 3.17: Power analysis and Reduction Flow Summary
Figure 4.1: Correlation between activity and loading condition of the block
Figure 4.2: Variation of the average clock gating efficiency with increased loading of the block
Figure 4.3: Power split-up relationship with respect to incremental loading of block
Figure 4.4: Power bug location in a block using Power analysis Flow
List of Tables
Table 1: Operating point vs profiling metrics data for CMC
Table 2: Metric comparison between the modified and original RTL at Cache level
Table 3: Metric comparison of metrics between modified and original RTL for ACK_FIFOs
List of acronyms and abbreviations
FinFET Fin Field Effect Transistor
ASIC Application Specific Integrated Circuit
RTL Register Transfer Level
IP Intellectual Property
DAC Design Automation Conference
CMC Common Memory Controller
CSV Comma Separated Values
CMOS Complementary Metal Oxide Semiconductor
VCD Value Change Dump
ASCII American Standard Code for Information Interchange
GUI Graphical User Interface
FSDB Fast Signal Database
SPEF Standard Parasitic Exchange Format
SDC Synopsys Design Constraints
I/O Input/output
SAIF Switching Activity Interchange Format
FSM Finite State Machine
VHDL Very High-Speed Integrated Circuit Hardware Description Language
ICGS Integrated Clock Gating Cell
SCGE Static Clock Gating Efficiency
DCGE Dynamic Clock Gating Efficiency
CG Clock Gating
ROADF Register Output Activity Density for Flops
ROADE Register Output Activity Density for Enable
UVM Universal Verification Methodology
DSP Digital Signal Processor
STA Static Timing Analysis
DC Design Compiler – Synthesis tool from Synopsys
CT Clock Tree
SOT Start of Transaction
SoC System on Chip
IC Integrated Circuit
ESL Electronic System-level
DUT Design under test
CGE Clock Gating Efficiency
ACK Acknowledgment
FIFO First-In-First-Out
1 Introduction
As technology and design sophistication in the field of IC design increase with time, several
buzzwords of the recent past have become norms that critically enable the way forward for today’s
electronic design industry. Power and energy efficiency are two such buzzwords: they have gained
significant focus and are increasingly mandated in ASIC IP design flows. Semiconductor process
technology limitations and ubiquitous, battery-powered applications such as IoT and edge computing
hardware necessitate systematic power analysis and optimization in design cycles.
This Thesis introduces, implements, and validates a systematic power analysis and reduction flow
that can be introduced early in the ASIC IP design flow. The work is performed at the ASIC
department at Ericsson, Kista, and uses an internal ASIC IP block that is a significant consumer of
power for the analysis. It aims to serve as a precursor to treating power as a major consideration
alongside performance, area, and cost in the ASIC design flow, which is the way forward to designing
improved future ASICs.
1.1 Background
Designing ASICs with a focus on power efficiency is considered a critical bottleneck in today’s
designs, with few tools and methodologies that merge seamlessly into the ASIC design flow [1, 2].
Power efficiency has gained focus in mainstream and consumer electronics design as the way forward.
Chipmakers could long rely on each new technology node scaling the transistor specifications, an
advantage that almost negated the need for power optimization flows in ASIC design. This
auto-scaling benefit of moving to improved technology nodes no longer applies today [3]. The cost and
complexity of designs skyrocket with scaling technology nodes. Also, 7nm is anticipated to be a
long-standing node in ASIC designs [4]. Hence, today’s designs and all designs going forward will
need a strong focus on power efficiency throughout the ASIC design flow, starting as early as the
architecture choices. The existing trend of a myopic focus on performance, cost, and area as the only
design concerns is no longer feasible.
Power dissipation in a digital circuit can be categorized into two major types: static and dynamic
power [5]. Current designs scaling below 28nm are enabled by cutting-edge FinFET technology.
FinFET technology has numerous advantages over conventional planar CMOS technology. FinFETs have
very low leakage power (which contributes to a low static power component) and a higher drive current
per transistor footprint, and hence higher speed [6]. With static power dissipation low, the dynamic
power of the design becomes the natural target for improvement. The dynamic power depends on the
stimulus to, or the usage scenarios of, the design [7]. Improving it therefore requires a
utilization-based power analysis and optimization flow that can characterize a design block for power
in a specific usage scenario and then evaluate the scope for power optimization.
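The stimulus dependence can be made concrete with a small sketch. The following Python snippet is an illustration only and is not part of the ActivityExplorer flow: it counts value changes on one hypothetical net, sampled once per clock cycle, under two made-up usage scenarios. The function name `toggle_rate` and both waveforms are assumptions introduced for this example.

```python
def toggle_rate(samples):
    """Fraction of cycle boundaries at which the sampled signal changes value."""
    if len(samples) < 2:
        return 0.0
    toggles = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    return toggles / (len(samples) - 1)


# Hypothetical per-cycle values of the same net under two usage scenarios.
idle_stimulus = [0, 0, 0, 0, 1, 1, 1, 1, 1]
busy_stimulus = [0, 1, 0, 1, 1, 0, 1, 0, 1]

idle_rate = toggle_rate(idle_stimulus)
busy_rate = toggle_rate(busy_stimulus)
print(f"idle: {idle_rate:.3f} toggles/cycle, busy: {busy_rate:.3f} toggles/cycle")
```

The same net toggles far more often under the busy stimulus, which is why a power figure is only meaningful together with the usage scenario that produced it.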
The power analysis and reduction flow is developed and validated on the Common Memory Controller
(CMC) IP block at Ericsson. The CMC enables the sharing of memory resources between several units
such as accelerators, DSPs, and interface blocks. It is one of the prime consumers of power in the
ASIC designs, which makes it a good candidate for analysis, characterization, and optimization. The
flow uses power analysis tools that support the development of a framework for bringing the power
perspective into efficient IP design. This Thesis work uses Ericsson-internal tools such as
ActivityExplorer for RTL and netlist analysis, and commercial tools from Synopsys: Spyglass Power [8]
and PrimeTimePX, the signoff power analysis tool [9].
This Thesis deals with early power estimation and the implementation of low power techniques for a
design block. According to Synopsys, as presented at DAC 2019, RTL power estimation and the
implementation of low power techniques are two difficult challenges that design teams face in their
design flows. This Thesis also focuses on clock gating as the low power technique for improving
dynamic power performance. According to the Synopsys Global User Survey 2016, presented by Synopsys
at DAC 2019, clock gating is the most used power reduction technique among design teams (almost 80%
of designers use it). It can thus be substantiated that this Thesis addresses the most pressing
challenges using the most prevalent low power technique relevant to the requirements of the
industry.
1.2 Problem
As ASIC designers can no longer rely on the power reduction that comes with auto-scaling of the
technology nodes, power analysis and optimization flows need to be developed as part of the ASIC IP
design flow. Power-inefficient designs can lead to unreliable systems that dissipate too much heat,
which reduces the lifetime of chips [10]. This becomes a concern given the high density of
transistors in designs and the high computational demands placed on them. Power-efficient designs
also benefit the environment through a lower energy footprint. This Thesis tries to clarify how a
power-dissipation perspective can be included during IP development and helps optimize designs. The
flow helps design teams understand their designs better by characterizing and optimizing them, and
helps foresee and improve upcoming designs for power efficiency early in the design process.
Figure 1.1: Synopsys Global User survey results presented at DAC 2019
The Common Memory Controller (CMC), a key IP block in both the baseband and radio ASICs at
Ericsson, is one of the largest consumers of power in the designs. This motivates the need to
implement a characterization and optimization flow for the CMC hierarchical top and sub-blocks.
This Thesis addresses the need for a power analysis and reduction flow for the CMC that connects
the existing block verification environment to the power analysis tools and lays the foundation for
extending similar flows to other ASIC sub-blocks, leading to power-efficient designs.
1.3 Purpose
This Thesis aims to develop, implement, and validate a power analysis and reduction flow for the
Common Memory Controller (CMC) ASIC IP block. The flow must be a stimulus-based power analysis
that utilizes the power analysis tools. Use-case based analysis provides more accurate estimates and
better pointers to improvement. The flow aims to provide a systematic method to gather power-related
metrics of the design, analyze the data, and optimize the design for power efficiency. Such a flow on
the CMC shall provide the baseline for further analyses and optimization of ASIC design blocks at
Ericsson.
This Thesis will lead to a flow that helps produce better designs in terms of power. This leads to
energy efficiency, lower heat dissipation, a lower carbon and energy footprint, longer component
lifetimes, and hence lower utilization of resources. From a sustainability perspective, every small
power saving achieved by this flow accumulates over the hundreds of instances where these designs
are used and adds up, in small steps, to significant energy savings.
1.4 Goals
This Thesis work aims to develop a methodology to study the power performance of an ASIC design
starting from RTL, by understanding and characterizing the Common Memory Controller (CMC) and
eventually probing for optimization opportunities in power performance. The whole methodology needs
to be systematized using in-house and commercial tools to form a baseline procedure that can be
extended to analyze other design blocks at Ericsson. This goal can be broken down sequentially as
follows.
1. Implement a power analysis and reduction flow for the CMC block that connects the existing
verification environment to the power analysis tools.
2. Extract power metrics transparently using in-house and commercial front-end power tools,
facilitating quick power exploration and profiling. Characterize the block on the basis of the
extracted metrics to enable performance interpolation for future design decisions in IP teams.
3. Profile the subsystem to pinpoint potential power improvements in different workload scenarios.
Trim the results to a shortlist of prime candidate modules, realize the RTL changes, and
demonstrate the achievable power savings and area/timing tradeoffs.
This work will lead to a power analysis and reduction flow that utilizes standard power analysis
tools, connects to the existing verification environments of design blocks, works on test cases of
various load scenarios for accurate characterization, and utilizes further test cases to simulate and
uncover power bugs in the design, leading to optimization of the design.
This flow implemented on CMC will lead to the characterization of the block in terms of power
metrics. It also provides a hierarchical breakdown of possible improvement areas in the design. The
work also validates the flow based on these findings by analyzing the power of the block after
improving the design in RTL.
1.5 Research Methodology
The implementation of this Thesis can be split into several subsections that are methodically executed
in overlapping timelines to achieve the goals of the project within the overall duration.
The execution of this Thesis project begins with obtaining a clear understanding of the design
environments, the power analysis tools, and the design block in the flow. A detailed understanding
specific to the designs at Ericsson is essential in enabling the work towards the goals of the
Thesis.
Understanding the tools and the design runs in parallel with understanding the current verification
environment of the design block. The verification framework in place to analyze the performance of
the block needs to be adapted and utilized to generate test cases for power analysis. This requires
a good understanding of the set of test cases that will be needed and of the knobs that can tweak the
test cases into suitable power analysis tests. This is motivated by the fact that the dynamic power
of the block is a function of the use-case of the design.
The ability to handle the environment and tools easily, together with an understanding of the
verification framework, lays the foundation for obtaining good test cases for power analysis and
power bug detection. Once these are in place, the design is estimated for power under specific
stimuli and analyzed. An energy-based analysis is formulated to improve the identification of test
cases for uncovering power bugs (explained in detail in Section 3.1.3). This is followed by several
runs of power analysis, with data collected through exported CSV files and analyzed in Excel.
Optimization opportunities found in the analysis are realized through design modifications in the
environment, followed by another power analysis using the tools.
The details of the above methodology are explained in Chapter 3.
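The CSV-based data collection step can also be scripted. The sketch below is a hedged illustration, not the procedure used in the Thesis (which relies on exported CSVs analyzed in Excel): it ranks modules by dynamic power and computes each module's share of the total. The column names (`module`, `dynamic_mw`, `static_mw`), the module names, and all numbers are placeholders invented for this example; the actual tool exports may be laid out differently.

```python
import csv
import io

# Placeholder export: one row per module, average power in milliwatts.
# These values are invented for illustration and are not measured results.
EXAMPLE_CSV = """module,dynamic_mw,static_mw
block_a,4.0,0.5
block_b,1.5,0.2
block_c,2.5,0.3
"""


def rank_by_dynamic_power(csv_text):
    """Return (module, dynamic_mw, share_of_total) tuples, highest power first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = sum(float(row["dynamic_mw"]) for row in rows)
    ranked = sorted(rows, key=lambda row: float(row["dynamic_mw"]), reverse=True)
    return [
        (row["module"], float(row["dynamic_mw"]), float(row["dynamic_mw"]) / total)
        for row in ranked
    ]


ranking = rank_by_dynamic_power(EXAMPLE_CSV)
for module, power_mw, share in ranking:
    print(f"{module:10s} {power_mw:5.1f} mW  {share:6.1%}")
```

A ranking of this kind is one simple way to trim a large hierarchy down to a shortlist of candidate modules before looking at each one in detail.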
1.6 Delimitations
A novel approach to power in designs, as developed in this Thesis, leads to several pointers towards
using and improving the approach. This Thesis focuses on developing, implementing, and validating
this flow for one design block. It is performed on the memory controller block at Ericsson and is not
attempted on any other blocks. Certain parts of the analysis using the power analysis tools are
manual. Automating these tasks would involve working with the tool vendors and is not considered in
the Thesis. As part of the validation, the Thesis works on one sub-block and provides the design team
with an extensive worksheet with pointers to improvements; it does not validate all these
improvements within the time frame of the Thesis.
1.7 Structure of the Thesis
Chapter 2 provides a detailed background for the Thesis work and covers the prerequisites for the
methodology in Chapter 3. The results of the methodology are presented and analyzed in Chapter 4.
Chapter 5 concludes the report and provides inferences and pointers to the future of the
methodology.
2 Background
This chapter provides the background information needed to follow the methodology of the Thesis.
The following subsections introduce the concepts of power consumption in digital circuits, the power
analysis tools used and their relevance to the Thesis, the power reduction techniques and their
relevance, the metrics used for power analysis, the significance of the Common Memory Controller,
and the performance analysis framework and its relevance. The chapter also discusses the IP design
flow in ASICs, which shall eventually incorporate the methodology investigated in this Thesis to
produce improved IP designs from a power efficiency perspective.
2.1 Power consumption in ASICs
The terminology associated with power is inconsistent among the available academic sources and
power tools. However, the physics behind these different terms is the same. This Thesis work uses
the power terminology that is common at Ericsson and relevant to the tools used.
The power components in an ASIC can be broadly classified into two components, namely,
Dynamic power component and Static power component. Dynamic power is the power component
associated with switching of the transistors in a design, which is contributed to by the stimuli at the
input of the design. Static power can be simply defined as the default power consumption of a design
when it is powered on and is idle or inputs are inactive. Understanding these components is critical
in taking steps to reduce these power components, thereby reducing the average power consumption
of the design. These power components in CMOS circuits are discussed in detail in the following sub-
sections.
2.1.1 Dynamic Power
Dynamic power is the component of power that arises out of the switching activity in the transistors
[11]. It corresponds to the power consumed by the device when the signals at the input are changing.
Dynamic power consists of two components.
Switching power – As in Figure 2.1, in any design the switching CMOS circuits have an associated
capacitive load C_L. Switching power is the power spent in charging and discharging the capacitance
of the output net during a logic transition. With a VIN switching at frequency fSW over a time-period
zero to T seconds, the dynamic power is the power consumed in the output capacitor, assuming
voltage VDD across the capacitor and current iDD (assuming ideal components in the figure for
simplicity in formulation).
Figure 2.1: Understanding switching power in CMOS switch
The nodes can switch at a factor of the clock frequency fCLK. Therefore, as a means to realize the
transition rate we introduce the activity factor α, which lies between 0 and 1. This can be used as a
statistical measure of activity across a section of the design. Consequently the switching power can
be formulated as follows[12].
\[ P_{\mathrm{switching}} = C_L \cdot V_{DD}^{2} \cdot f_{SW} = \alpha \cdot C_L \cdot V_{DD}^{2} \cdot f_{CLK} \]
Switching power can be reduced by reducing the overall activity factor of a design. Switching
power is one of the major contributors to total power in designs.
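The dependence of switching power on the activity factor can be illustrated with a minimal sketch of the formulation above. The function name and the numerical values (load capacitance, supply voltage, clock frequency) are assumptions chosen only for illustration and are not taken from the design analyzed in this Thesis.

```python
def switching_power(alpha: float, c_load: float, vdd: float, f_clk: float) -> float:
    """Switching power in watts, following alpha * C_L * VDD^2 * f_CLK."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("activity factor must lie between 0 and 1")
    return alpha * c_load * vdd ** 2 * f_clk


# Hypothetical values: 10 fF load, 0.8 V supply, 1 GHz clock.
baseline = switching_power(alpha=0.20, c_load=10e-15, vdd=0.8, f_clk=1e9)
reduced = switching_power(alpha=0.10, c_load=10e-15, vdd=0.8, f_clk=1e9)

print(f"baseline: {baseline * 1e6:.2f} uW, reduced: {reduced * 1e6:.2f} uW")
```

Under this formulation, halving the activity factor halves the switching power, while the supply voltage enters quadratically.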
Internal power (Short circuit power/Crowbar power) – When the transistors in a CMOS circuit
switch, the imperfections in switching durations and the rise and fall durations of the switching inputs
can cause momentary direct current paths between the supply rails. This momentary short circuit
happens on both edge transitions (rising/falling) on the inputs [13]. The crowbar currents are a
function of the relationship between the rise/fall times at the input and output. The power component
is minimized when the two are comparable. A faster transition at the output compared to the input
transition results in higher crowbar current, therefore higher internal power consumption. Internal
power is not a significant concern for well-designed circuits at scaled technology nodes in recent use
because of lower supply rails and threshold voltages.
2.1.2 Static Power
Static power is the power consumption in a circuit's idle state, when there are no signal transitions.
There are several contributing factors to static power, and these are generally modeled into the target
technology library in ASIC designs. Also referred to as leakage power, it is caused by the leakage
currents in CMOS circuits. These leakage currents exist in the powered-on devices even if there is no
switching activity.
Although non-critical for the abstraction level this Thesis operates on, the leakage currents
can be further categorized into its components such as the reverse-biased p-n junction leakage
current, gate induced drain leakage current, gate direct tunneling leakage current, punch-through
leakage current and subthreshold leakage current [14]. There could be slight variations in these
parameters based on the states of the circuits and these can be modeled into the technology library
details.
2.1.3 Power Bugs
One of the major goals of the power analysis and reduction flow developed in this thesis is to uncover
‘Power bugs’ in a design. When a functionally correct design shows switching activity when it is not
supposed to toggle, the design is identified to contain power bugs. These can be avoided by disabling
redundant switching in inactive parts of a functionally accurate design. There are several techniques
to solve power bugs that are discussed later in the thesis. Early analysis and detection of power bugs
leads to significant savings in the time and cost of the redesign needed to fix them.
2.2 Software tools used – Introduction and relevance
The power analysis flow developed in the Thesis relies on the power and activity analysis tools for
reporting and investigations for optimization. The tools introduced in this section are used across
different stages of the Thesis and serve purposes which become clearer in section 0. This section
underlines the concepts and features of the tool that are relevant to the Thesis, thereby enabling the
understanding of the reader as the chapters progress.
2.2.1 ActivityExplorer (VCD2RPT++ and VCD2TB)
ActivityExplorer is an in-house tool at Ericsson that is used to perform activity analysis on designs.
The tool analyzes VCD files produced from RTL level and gate-level simulations. It reports the
average, time-based switching activity and clock gating efficiency. These can be visualized at different
hierarchical levels of detail, color-coded based on the metrics and area coded based on size.
The primary step to using the ActivityExplorer in the Thesis is to identify suitable test cases for
power analysis. Simulating these test cases generates a VCD (value change dump) file of the interface inputs. It is a
standardized ASCII based dump file that captures value changes on variables in a simulation. The
ActivityExplorer tool takes in the VCD file generated and produces a GUI based visualization as shown
in Figure 2.2. It also provides the activity profile (activity vs time) for the design for a selected
hierarchical instance. A simplified flow of the usage of RTL based ActivityExplorer (VCD2RPT++) is
as shown in Figure 2.3.
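Since the VCD file is a plain ASCII record of value changes, the kind of information an activity analysis extracts from it can be sketched in a few lines. The parser below is only an illustration and is not the ActivityExplorer implementation: it assumes that each `$var` declaration sits on a single line, it ignores the scope hierarchy, and the sample dump and signal names are hypothetical.

```python
def count_toggles(vcd_text: str) -> dict:
    """Count value changes per signal in a (simplified) VCD dump."""
    names = {}    # VCD identifier code -> signal name
    last = {}     # VCD identifier code -> last value seen
    toggles = {}  # signal name -> number of value changes
    in_header = True
    for line in vcd_text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if in_header:
            # Declaration form: $var <type> <width> <id> <name> ... $end
            if tok[0] == "$var" and len(tok) >= 5:
                names[tok[3]] = tok[4]
                toggles[tok[4]] = 0
            elif tok[0] == "$enddefinitions":
                in_header = False
            continue
        first = tok[0]
        if first[0] in "#$":
            continue  # timestamps and keywords such as $dumpvars / $end
        if first[0] in "bBrR":
            if len(tok) < 2:
                continue
            value, code = first[1:], tok[1]    # vector change: b1010 <id>
        else:
            value, code = first[0], first[1:]  # scalar change: 1<id>
        if code not in names:
            continue
        # The first value seen is the initial state, not a toggle.
        if code in last and last[code] != value:
            toggles[names[code]] += 1
        last[code] = value
    return toggles


# Hypothetical two-signal dump used only to exercise the sketch.
SAMPLE = """$timescale 1ns $end
$scope module top $end
$var wire 1 ! clk $end
$var wire 1 % en $end
$upscope $end
$enddefinitions $end
#0
$dumpvars
0!
0%
$end
#5
1!
#10
0!
1%
#15
1!
"""

print(count_toggles(SAMPLE))
```

Dividing such toggle counts by the number of clock cycles in the analyzed window would give an average activity per signal, which is the kind of number the activity maps visualize.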
Figure 2.2: ActivityExplorer GUI
RTL simulation-based VCDs are used to create a power testbench using the VCD2TB tool, which
replays the test stimulus of the RTL simulation in a netlist simulation. The VCD generated from this process is the final
VCD that provides the gate level activity, which can be visualized to obtain gate-level activity and clock
gating efficiency plot data, as the maps in the GUI and as time profiles. This ability to visualize the
activity and clock gating efficiency simultaneously for the time slices in the test case is advantageous.
It helps identify primary pointers to inefficiencies when it directly points to bad clock gating
efficiencies for low activity regions of the design for a given time instant in the test case. This pointer
is picked up for one of the cases/designs at Ericsson and the validity of that inference is evaluated
using more focused power analysis. The flow for netlist level simulation and activity analysis using
VCD2TB and VCD2RPT++ is shown in Figure 2.4.
These tools form the primary steps for test-case choice for power analysis. It is a faster and
preliminary pointer towards dynamic power and activity of a block during a particular test case. This
helps identify the right test cases for power. Identifying the right test case is of prime importance in
the power analysis flow. The typical goal is to identify low activity test cases to identify redundant
activity and hence dynamic power bugs. These flows can also be used to tweak test cases for specific
characterization purposes of the block based on load. Activity analysis provides the right pointers
towards these tasks in a faster and less resource hogging manner. They form the preliminary step
prior to a resource and time-intensive power analysis using specialized tools.
The test cases and the optimization pointers gathered using the activity analysis tools at RTL and
Netlist level are analyzed and validated further using focused power analysis. The VCD2TB and the
VCD2RPT++ tools at Ericsson form the basis for the first level of power analysis flow for the designs
at Ericsson and serve the purpose of fine-tuning and improving the later stages of the analysis taken
up in the Thesis.
Figure 2.3: Simplified Activity Analysis flow
2.2.2 Spyglass Power
Spyglass Power is a power analysis tool from Synopsys. It facilitates early power estimation through
RTL based power and activity analysis for blocks and power exploration. The tool takes in the initial
RTL, a reference netlist, the target technology library, and the activity file (like an FSDB) as the input
for power analysis and exploration. It provides a relatively accurate RTL power estimation and
actionable profiling metrics such as clock gating efficiencies. It helps prepare the RTL for a better
inferred clock gating. It helps provide early and fast power numbers, component-wise split-up
visibility, and helps perform a metrics-driven power analysis.
Spyglass Power has certain pre-requisites for power analysis. The design under analysis must be
lint clean with the Spyglass toolset. An accurate analysis necessitates a calibration netlist
reference that enables a good power correlation. The tool also takes the target library data and
associated parameters as input. The analysis needs defined power test cases that have been simulated
with FSDB files dumped for them. The FSDB dumping tool must be compatible with the version of the
Spyglass Power analysis tool.
Figure 2.4: Gate level replay simulation and activity analysis flow
Once the pre-requisites for the analysis are in place, the initial steps preparing for the analysis
need attention. The power test cases defined are simulated and FSDBs with correct versions are
dumped. The design of the block under analysis needs to be specified. The analysis is pointed to the
right FSDB, with the time window for which the analysis is expected. Spyglass Power analysis happens
as a sequence of goals. These goals need to be set as a preparatory step for the analysis (Figure 2.5).
The analysis also needs to be pointed to the calibration netlist and clock gating thresholds can be set
to enable different levels of clock gating.
Once the preparatory steps are done, the power estimation as a set of goals is represented in
Figure 2.5. The flow takes in a reference SPEF (Standard Parasitic Exchange format) file, a reference
netlist, the library files, switching activity file obtained from simulations (FSDB files), and certain
power parameters for accurate estimation. The first goal is the design read, which reads the design
block under analysis that was specified beforehand. Then the power audit goal is executed where an audit
is performed to check the design, simulation data, and technology library for consistency and lists the
key parameters in the power estimation. Then the vector analysis goal is performed where the activity
is analyzed for a simulation testbench and an activity profile analyzed over time is generated. Then
finally, the power estimation and profiling goal is run where the estimated power, activity, and
efficiency information for clock, registers, and memories are computed for the time intervals of
interest. It also points to inefficient clock gating and opportunities to uncover power bugs.
Figure 2.5: Spyglass Power analysis flow - goals and steps
Spyglass Power analysis results are reported in the following components, which further detail the
categories explained in section 2.1 (dynamic and static power):
1. Combinational power - It is the power consumed by a combinational cell and a net driven by
a combinational cell. It is a dominant component in datapath-intensive designs and grows with
the data toggle rate and the amount of combinational logic. There are various techniques
to control this power component if issues are identified, such as reducing combinational
depths, registering inputs to combinational logic, and data gating.
2. Sequential power - It is the power consumed by the sequential cells in a design (registers and
latches) and the output nets of sequential logic.
3. Clock power - It is the power consumed by the clock network. It is one of the major consumers
of dynamic power.
4. Memory power - It is the power consumed by the memory (based on the library file) and the
output nets of the memory cell. Memory power is proportional to the number of read/write
operations on it. The optimization techniques to control this are the ones to minimize
redundant read/write operations. Memory leakage (default power consumed during power-
on) is a significant contributor to this component and necessitates various operational modes
like sleep, deep sleep to manage the leakage power.
The other power components that Spyglass reports are IO power (power consumed by the IO pads
based on technology library), mega cell, and black box powers (special blocks specified in the
configuration or SGDC constraint files). But these are not of focus for this Thesis.
The tool uses a reference netlist and technology library to pseudo synthesize the design RTL and
estimate power. This estimation relies on two statistical parameters, activity and probability.
According to the Spyglass manual [19], activity is defined as the number of toggles per clock cycle on
the signal, averaged across many clock cycles. Probability is the percent of time that a signal is high.
These statistical parameters are used by the tool as the basis for power estimation. This provides a
reliable starting point for power estimation. Spyglass introduces virtual cells for clock tree modeling.
The pseudo netlist with simulation and parasitic data allows spyglass to calculate the contribution of
static and dynamic power to the total power.
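The two statistical parameters can be made concrete with a small sketch. It assumes a simplified view in which a signal is sampled once per clock cycle, so glitches within a cycle are not captured; the function name and the sample waveform are hypothetical and do not represent how Spyglass computes these values internally.

```python
def activity_and_probability(samples):
    """Return (activity, probability) for a signal sampled once per clock cycle.

    samples[i] is the value (0 or 1) held during clock cycle i.
    Activity    = toggles per clock cycle, averaged over all cycles.
    Probability = fraction of cycles in which the signal is high.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    # A toggle is a cycle whose value differs from the previous cycle.
    toggles = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    activity = toggles / len(samples)
    probability = sum(samples) / len(samples)
    return activity, probability


# Hypothetical 8-cycle waveform.
activity, probability = activity_and_probability([0, 1, 1, 0, 0, 0, 1, 1])
print(f"activity = {activity}, probability = {probability}")
```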
The categorization of power discussed in section 2.1 can be defined in the context of the Spyglass
Power tool as follows.
1. Leakage power – The leakage power of any cell is specified in the technology library file
corresponding to it. In the Spyglass context, total leakage power is the sum of the leakage of
all the cells present in the design. In the spyglass generated reports, the leakage is broken
down into its contributors such as combinational, sequential, or memories. The leakage
power calculations in Spyglass depend on the type of cell instantiated in the reference
netlist, the activity data if the library has state-dependent leakage values, and the declaration of any
power domains.
2. Internal power - It corresponds to the power dissipated within the boundary of a cell when a
state transition occurs. The internal power calculation depends on the library file which
annotates energy data for transitions based on slew rate and output load. Spyglass utilizes the
activity information from the simulation data to estimate how often the cells toggle and uses
the technology library data to derive the power numbers. Spyglass estimations of internal
power depend on the types of cell instantiated in the reference netlist, the activity data (FSDB
files from simulation), and the wire parasitic and slew values.
3. Switching power – As discussed in section 2.1.1, switching power can be expressed as a
function of the operating voltage, output capacitance, and the switching frequency (or a factor
of clock frequency). Spyglass computes the capacitance from the contribution of the cell pin
(using the library file), the contribution from the wire capacitance model (using the SPEF file,
or the wire load model). Switching power estimation uses these data together with the
toggling activity of the net (which is derived from the simulation data). Therefore, the
switching power depends on the type of cell instantiated in the reference netlist, the activity data,
and the wire parasitics.
2.2.3 PrimeTimePX
PrimeTimePX is a sign-off power analysis tool from Synopsys used later in the IP design flow for
accurate power estimation closest to silicon. It builds a detailed power profile of the design based on
the circuit connectivity, the switching activity, net capacitance, and the power behavior data from the
technology library. It calculates the power behavior for a circuit at the cell level and reports the power
consumption at the chip, block, or cell levels [20].
Power analysis using PrimeTimePX is implemented using a Tcl script that is specified in the
work directory. This Tcl file consists of a sequence of steps as described in Figure 2.6. The first step
is to specify analysis mode, which could be averaged or time-based power analysis mode.
The input files needed for such an analysis are discussed in detail as follows.
• Gate-level Netlist – PrimeTimePX takes in a gate-level pre-layout netlist (generally a Verilog
file). The netlist contains leaf-level cells instantiated from the library cells. A flat or
hierarchical netlist can be used for such an analysis.
• Technology Library – The technology library file consists of the library cell, with each cell
consisting of timing, power, and characterization information, such as power numbers per
cell.
• SDC file – This file specifies the design constraints. It specifies the constraints on all ports and
pins, I/O paths of a design, and clock.
Figure 2.6: PrimeTimePX Analysis Flow
• Parasitic File – This file contains the capacitance of the nets. It is one of the factors in
determining dynamic power.
• Switching Activity – In the averaged power analysis, a SAIF or VCD file is used to read the
switching activity. These are created from RTL or Gate level simulation.
These files are specified in the power analysis and Tcl files prior to the analysis. The analysis
produces a set of reports that contain all the relevant power numbers and metrics for the design
being analyzed. The reports are generated in the work directory. Note that the power analysis done
with these tools in this Thesis does not directly correlate to silicon. The metrics from this analysis
are used for the improvement of the block and for a system analysis in Excel that is used to correlate
to silicon and to ESL (Electronic System Level) power analysis, which is a closer estimate to silicon.
2.3 Dynamic Power reduction and Clock gating
The push towards low power consumption in digital ASIC designs has led to several techniques used
to reduce power in designs. There are several of these techniques focusing on static and dynamic
components of power. The major generic factors that motivate this push towards low power are
battery life elongation, carbon footprint reduction, hot-spot avoidance in devices/reduced cooling
facilities, and longer component life. This Thesis focusses on dynamic power reduction and this
section details the methods, techniques, and metrics used.
Clock power is one of the major contributors to the overall dynamic power in ASIC designs. With
advancing technology nodes and a shrinking share of the static power component, clock power and
the associated switching power become points of significant concern. The improvement of static
power components with innovative transistor designs using FinFETs provides the opportunity to
focus primarily on the Dynamic power consumption of a design. Dynamic power component is very
susceptible to power bugs based on usage scenarios. This is due to redundant switching activity that
can arise in certain usage scenarios due to non-visibility of these scenarios in the functional
verification of these design blocks. This provides the scope to introduce a power analysis and
reduction flow in the IP design flow that can improve the design, in a manner similar to functional
verification and code coverage. The development of such a power reduction flow would involve using
power reduction techniques on the design.
Figure 2.7: Widely used techniques for Power reduction and control [14]
Several power-reduction techniques have been devised over the years for the reduction of static
and dynamic power components. Figure 2.7 shows a listing of the techniques as discussed in detail in
the reference [14].
There is a crucial correlation between how early an effort is made for power reduction, the power
savings achieved, and the accuracy error as depicted in Figure 2.8 referencing [14]. This motivates
early power analysis and reduction.
The methodology introduced in this Thesis focuses on the introduction of early power analysis
and reduction in the IP design flow. Analysis at the RTL design stage is a good place to start in terms
of the level of impact and accuracy. Power estimation at RTL driven by the performance analysis
framework helps pinpoint test cases that help identify power bugs and therefore in power reduction.
2.3.1 Clock Gating
Once the power bugs are detected, the power reduction technique focused on in the Thesis for
optimization is Clock gating. Clock gating is one of the critical techniques to address the need for a
reduction in dynamic power. It is one of the most widely taken approaches when trying to address power
reduction problems.
The simplistic idea behind clock gating is that, in a register, when there is no activity recorded in
the data input, there is no need to clock the registers during that period. This provides the opportunity
to switch off/disable clock transitions during this scenario. It is common in designs to have several /
a bank of registers driven by a single clock line. In these cases, an enable signal is introduced to
gate/disable the clocking of registers. This signal can be labeled as the clock gating enable.
The pseudo-code snippet below represents the implementation of clock gating.
    wait until clk'event and clk = '1';
    if en = '1' then q <= d; end if;
The above RTL specifies a clock gating enable ‘en’. The synthesis tool interprets this enable as
one of the two implementations as shown in Figure 2.9.
Figure 2.8: Power savings and accuracy attainable at different levels of abstraction [14]
The first implementation is a “re-circulating register” implementation where the enable is used
to select between new data or re-circulating the previous data value. The second implementation
involves gating the clock where, when the enable is off, the clock is disabled [15]. The two
implementations are functionally equivalent but differ in timing and power behavior. An integrated
clock gating cell implementation, in the second implementation, helps better in power saving by using
the clock shutdown mechanism.
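The functional equivalence of the two implementations, and the difference in clock activity at the register, can be illustrated with a cycle-based behavioral sketch. It is a simplification under stated assumptions: one entry per clock cycle, no timing, and no model of the latch inside an integrated clock gating cell. The function names and the stimulus are hypothetical.

```python
def recirculating_register(d_values, en_values, q0=0):
    """Enable selects between new data and the previous value; every clock edge reaches the flop."""
    q, trace = q0, []
    for d, en in zip(d_values, en_values):
        q = d if en else q  # multiplexer in front of the D input
        trace.append(q)
    return trace, len(trace)  # the flop is clocked in every cycle


def gated_clock_register(d_values, en_values, q0=0):
    """Enable gates the clock; the flop only sees clock edges while the enable is high."""
    q, trace, clock_edges = q0, [], 0
    for d, en in zip(d_values, en_values):
        if en:
            clock_edges += 1
            q = d
        trace.append(q)
    return trace, clock_edges


# Hypothetical 8-cycle stimulus.
d = [1, 1, 0, 0, 1, 0, 1, 1]
en = [1, 0, 0, 1, 0, 0, 1, 0]

q_mux, edges_mux = recirculating_register(d, en)
q_gated, edges_gated = gated_clock_register(d, en)
print(q_mux == q_gated, edges_mux, edges_gated)
```

Both models produce the same output sequence, but the gated version delivers clock edges to the register only in the enabled cycles, which is where the dynamic power saving comes from.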
Therefore, improvement of clock gating metrics by identifying new clock gating opportunities is
a way to improve the power performance of the design. The goal is to introduce gating cells and
formulate good gating enables for these cells. It is critical to ensure that the enables are designed in
such a way that the clock gating opportunities utilized save power rather than increase it. For example,
in a case where the clock is always enabled, the insertion of clock gates leads to additional enable logic
and will consume more power. A clock gate added to the design also introduces delay in the clock tree and makes
clock tree synthesis tougher. This necessitates a differential power computation to ensure
that gating does introduce power savings. Clock gating implementations need to be verified for
impacts on the testability of designs. With designs involving state machines, idle states can be
identified for clock gating of certain sections of the IP adaptively. In IPs designed with multiple clock
domains, idle clock domains can be identified and gated.
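The idea of a differential power computation can be sketched with a first-order model. This is an assumption made for illustration and not the computation performed by the tools: it supposes that the clock power of a register group scales linearly with the fraction of clock cycles that are not suppressed, and that the gating cell and its enable logic add a fixed overhead. All names and numbers are hypothetical.

```python
def net_gating_saving(clock_power_ungated: float,
                      gated_fraction: float,
                      gate_overhead: float) -> float:
    """First-order estimate of the net power saved by inserting a clock gate.

    clock_power_ungated: clock power of the register group without gating
    gated_fraction:      fraction of clock cycles the gate suppresses (0..1)
    gate_overhead:       power of the gating cell and its enable logic
    All powers use the same unit; a negative result means gating costs power.
    """
    if not 0.0 <= gated_fraction <= 1.0:
        raise ValueError("gated_fraction must lie between 0 and 1")
    return clock_power_ungated * gated_fraction - gate_overhead


# Hypothetical numbers in microwatts.
always_enabled = net_gating_saving(10.0, 0.0, 0.5)   # clock is never suppressed
mostly_idle = net_gating_saving(10.0, 0.625, 0.5)    # clock suppressed 62.5% of the time
print(always_enabled, mostly_idle)
```

In the always-enabled case the result is negative, which matches the observation above that a clock gate whose enable never de-asserts only adds power.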
2.3.2 Clock gating performance metrics
A power analysis framework implemented in the Thesis that connects to a performance analysis
framework and points to scopes for improvement of dynamic power needs a few well-defined dynamic
performance metrics to work on. While the relevance of these metrics to the complete flow will be
discussed in Section 3, this section introduces the clock gating metrics. These are ratios that provide
an indication of how effective clock gating is in the design being analyzed. These metrics are derived
to be used with the Spyglass Power analysis tool used in this Thesis. These are practical metrics that
are available in the tools to evaluate. The power analysis flow developed in conjunction with the tool
and the metrics leads to early pointers for power bug detection. These metrics are accurate early on,
from the analysis of RTL designs. This enables early power analysis and power bug detection.
Figure 2.9: Possible synthesis of clock gating
2.3.2.1 Static clock gating efficiency
Static clock gating efficiency (SCGE), also known as the clock gating ratio is a structural metric.
It can be defined as the percentage ratio of clock gated registers with respect to the total number of
registers. The ratio can be represented as follows.
\[ \mathrm{SCGE} = \frac{\text{Number of clock gated registers}}{\text{Total number of registers}} \]
This metric can be understood as the percentage of registers in the analyzed design that are
enabled with clock gating. This provides an idea of what percentage of registers in a design hierarchy
can be clock gated for performance improvement. Although low SCGE is a direct indicator of low
scope for clock gating, this metric is best used in combination with the other clock gating metrics that
shall be discussed.
Figure 2.10 shows a sample register tree for which the Static clock gating efficiency is computed.
In the example (Figure 2.10), three of the five registers have an inferred clock gate. This leads to
a static clock gating efficiency of 3/5, that is, 60%.
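The SCGE computation for this example can be written as a short sketch. The register names and the mapping from register to "has an inferred clock gate" are hypothetical and only reproduce the three-of-five situation described above.

```python
def static_clock_gating_efficiency(gated_registers: int, total_registers: int) -> float:
    """SCGE in percent: clock gated registers over the total number of registers."""
    if total_registers <= 0:
        raise ValueError("total_registers must be positive")
    if not 0 <= gated_registers <= total_registers:
        raise ValueError("gated_registers must lie between 0 and total_registers")
    return 100.0 * gated_registers / total_registers


# Hypothetical register list: True means the register has an inferred clock gate.
registers = {"r1": True, "r2": True, "r3": True, "r4": False, "r5": False}
scge = static_clock_gating_efficiency(sum(registers.values()), len(registers))
print(f"SCGE = {scge:.1f}%")
```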
Figure 2.10: Understanding SCGE
2.3.2.2 Dynamic clock gating efficiency
Dynamic clock gating efficiency (DCGE) is an activity-based gating performance metric. Also
referred to as simply the Clock gating efficiency, because of its significance as the efficiency metric,
DCGE can be simply defined as the percentage of time a clock is gated. This can be represented as
follows.
\[ \mathrm{DCGE} = \frac{\text{Gated or suppressed clock toggles}}{\text{All toggles on the clock pin}} = 1 - \frac{\text{Active clock toggles after clock gate}}{\text{All toggles on the clock pin}} \]
DCGE is a measure of how effectively the instantiated clock gates suppress the clock going to the
registers. This can be understood using the example that follows.
In Figure 2.11, the average dynamic clock gating efficiency can be calculated as the mean of the
DCGE on the three clock gated registers. Hence the average DCGE is calculated as,
\[ \mathrm{Avg.\ DCGE} = \frac{1}{3}\left(\frac{4}{8} + \frac{5}{8} + \frac{6}{8}\right) = \frac{5}{8} = 62.5\% \]
This leads to an average DCGE of 62.5% for the sample design.
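The averaging step can be reproduced with a minimal sketch. The function names are hypothetical; the pairs of suppressed and total clock toggles are the ones used in the calculation above.

```python
def dcge(suppressed_toggles: int, total_toggles: int) -> float:
    """DCGE as a fraction: suppressed clock toggles over all toggles on the clock pin."""
    if total_toggles <= 0:
        raise ValueError("total_toggles must be positive")
    return suppressed_toggles / total_toggles


def average_dcge(per_register):
    """Mean DCGE over a list of (suppressed_toggles, total_toggles) pairs."""
    values = [dcge(s, t) for s, t in per_register]
    return sum(values) / len(values)


# The three clock gated registers of the sample design: 4/8, 5/8 and 6/8.
avg = average_dcge([(4, 8), (5, 8), (6, 8)])
print(f"Avg. DCGE = {avg * 100:.1f}%")
```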
Figure 2.11: Understanding DCGE
Since DCGE is a measure of how effectively the instantiated clock gates suppress the clock, a higher
DCGE for a design is good. However, although a higher DCGE means more efficient clock gating,
the metric does not take into consideration the data line transitions that need to be clocked when DCGE
numbers are low. Thus, a lower DCGE does not necessarily mean bad clock gating, as it is
possible that the majority of clock cycles cannot be suppressed because of transitions on the data
line that need to be clocked in those cycles. This calls for other metrics that also take the data
transitions in a design into consideration, which leads to the efficiency metrics discussed in the
following sections.
2.3.2.3 Register Output Activity Density for Flops
Register output activity density for flops (ROADF) is a register level metric that depends on the
level of activity on the data line. It is a measure of how effectively or adequately data transitions are
clocked in a register that has a gated clock. It looks for redundant clock transitions on the clock pin
when there are no data transitions on the data line. This metric can be formulated as follows.
\[ \mathrm{ROADF} = \frac{\text{Number of toggles on the Data, Q pin}}{\text{Number of active transitions on the clock pin}} \]
This metric can be better understood from the example below.
In the Figure 2.12, using the transitions waveforms of the clock, the clock gating enable and data
lines, we can arrive at the ROADF and DCGE using the above definition as follows,
\[ \mathrm{ROADF} = \frac{2}{3} \approx 66.7\% \]
\[ \mathrm{DCGE} = \frac{5}{8} = 62.5\% \]
But from the waveforms we see that the clock gating enable allows more clock cycles to be un-
gated than necessary for clocking the 2 data transitions during the period. This provides the scope for
improvement of the ROADF metric to 100% for optimal clock gating while clocking the data
appropriately. This leads us to the improved gating enable waveform which leads to a ROADF as
follows.
\[ \mathrm{ROADF} = \frac{2}{2} = 100\% \]
Figure 2.12: Understanding ROADF
A 100% ROADF is an indication of optimal clock gating for that register. If the clock was gated
using the improved enable, it would lead to a reduction in the extra enabled clock cycle leading to a
dynamic clock gating efficiency computed as follows.
\[ \mathrm{DCGE} = \frac{6}{8} = 75\% \]
This turns out to be the highest achievable DCGE for the given number of transitions in the data
line, which is the optimal clock gating. This clarifies the ROADF metric’s importance as a validation
towards the highest achievable DCGE. The SCGE and DCGE metrics do not provide a clear picture of
the clock gating performance when looked at in isolation. These metrics in combination with the
ROADF metric lead to a better analysis, as developed in this Thesis.
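The relation between ROADF and DCGE can be reproduced with a cycle-based sketch. The waveforms below are hypothetical: they are not read from Figure 2.12 but are chosen so that the original enable passes three clock cycles for two data transitions, while the improved enable passes only the two cycles that are needed. The function name is also an assumption.

```python
def gating_metrics(d_values, en_values, q0=0):
    """Return (ROADF, DCGE) as fractions for one clock gated register.

    One list entry per clock cycle; en = 1 means the clock reaches the register.
    ROADF = toggles on Q / active clock cycles at the register
    DCGE  = 1 - active clock cycles / all clock cycles
    """
    q, q_toggles, active = q0, 0, 0
    for d, en in zip(d_values, en_values):
        if en:
            active += 1
            if d != q:
                q_toggles += 1  # this clock edge captured a data change
            q = d
    total = len(en_values)
    roadf = q_toggles / active if active else float("nan")
    dcge = 1.0 - active / total
    return roadf, dcge


# Hypothetical 8-cycle waveforms with two data transitions on Q.
d = [1, 1, 1, 1, 0, 0, 0, 0]
en_original = [1, 0, 0, 1, 1, 0, 0, 0]  # one enabled cycle captures no change
en_improved = [1, 0, 0, 0, 1, 0, 0, 0]  # only the two required cycles are enabled

roadf_orig, dcge_orig = gating_metrics(d, en_original)
roadf_impr, dcge_impr = gating_metrics(d, en_improved)
print(roadf_orig, dcge_orig, roadf_impr, dcge_impr)
```

With the improved enable, the sketch gives a ROADF of 100% and a DCGE of 75%, in line with the values derived above for the given number of data transitions.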
2.3.2.4 Register Output Activity Density for Enables
Register Output Activity Density for Enables (ROADE) is an extension of the ROADF metric
discussed in the previous section, onto a data path of registers. When a bank of registers, suitably of
a similar functionality or data path is enabled by a common enable for clock gating, ROADE is the
metric that provides a measure of how efficient the clock gating is, considering the data transitions
on all the data lines in the data path. As an extension of the ROADF metric for a bank of multiple
registers, ROADE can be formulated as follows.
\[ \mathrm{ROADE} = \frac{\text{Number of toggles on all data lines of the register bank}}{\text{Number of active toggles on the clock pin to the register bank}} \]
It helps identify inefficient clock gating enables that clock the registers even when there are no
data transitions across the data path of the register banks. Figure 2.13 helps clarify the idea with
examples. The figure shows a bank of registers connected with a common clock and a common clock
gating enable. ROADE is a metric that is relevant to multi-register data paths for which clock gating
is enabled.
From this example we see that R1 and R2, driven by the common clock, also utilize the common
clock gating enable signal. So, as we saw in the previous sub-section, the ROADF is 66.66%. According to
the definition of ROADE, we see that,
\[ \mathrm{ROADE} = \frac{3}{3} = 100\% \]
Although at register level we see that the clock gating is not optimal, at the register
data path level we see full utilization of the clock transitions in clocking the data transitions across
all data paths. This leads to ROADE being a more reliable metric in cases of multi-register data paths.
Since the enable is common to these registers, it becomes irrelevant to consider the ROADF metric
for this bank of registers. A high ROADE value is necessary for the clock gating performance to be
optimal. It becomes clear that the highest achievable DCGE is reached when ROADE tends to be
100%.
ROADE, DCGE, and SCGE are the major metrics carried forward and used as the goal metrics in the
power analysis framework developed and implemented in the Thesis. A combined view of these
indicates the current quality of clock gating and points the way ahead for improvement.
2.4 Framework development environment
This section introduces the tools and methodologies that currently assist ASIC design and
motivates how the power analysis framework developed in the Thesis fits into the existing way of
working. The section is split into sub-sections that introduce the Common Memory Controller (CMC),
the IP design flow, and the existing verification environment for the CMC, to which the power
analysis framework connects to obtain appropriate stimulus for power analysis. Understanding
these dependencies of the framework development provides an overview of all the considerations
needed to make it effective.
Figure 2.13: Understanding ROADE
2.4.1 Common Memory Controller
The Thesis focuses on developing a power analysis flow for the Common Memory Controller (CMC)
subsystem IP, at the ASIC team in Ericsson. It is a persistent part of the ASIC designs and one of the
major consumers of power in the designs. This motivates the development of a power analysis flow
focusing on CMC, although the goal is a framework that can eventually be adapted to other design
blocks as well.
CMC is a block that enables the sharing of memory resources between other blocks such as DSPs,
accelerators, or other interface blocks. The design is hierarchical and enables a top-down analysis
approach when analyzing power and related metrics. The CMC consists of several sub-blocks, and the
goal of the Thesis is to characterize the block using power numbers and other metrics and to
look for potential power bugs. This approach needs a sufficient understanding of the sub-blocks and
the right stimulus to generate the varying levels of block utilization for power analysis. The
awareness of the sub-block functionality also helps confirm improvement areas and the feasibility of
actually making improvements.
Simplistically, for this report focusing on power analysis, the CMC can be viewed as a subsystem
that provides DSPs and other client-like systems read/write accesses to memory areas.
Characterization of the block involves varying the intensity, length, and payload of the read/write
operations in order to measure, monitor, tabulate, and plot the critical parameters that help
understand the design operation and help extrapolate these for future designs. The scope for dynamic
power improvement is identified by finding redundant switching in the design, especially
during low-activity operation. This Thesis utilizes the performance analysis framework
to create stimuli suitable for power analysis, focusing on characterizing and optimizing the
block.
2.4.2 Performance Analysis framework
The development of a power analysis flow uses the performance verification framework developed for
Common Memory Controller (CMC). This is a UVM based verification environment, where different
test cases are simulated with varying parameters to characterize the performance of CMC for different
software loads. It also has options to stochastically load the block using variation of parameters. Some
of these parameters are connected to the power analysis flow developed. These parameters are
tweaked to generate the right test cases that shall be utilized for the two purposes of the power analysis
flow, namely, characterization and optimization of CMC. The exact knobs that are used for this and
the motivation towards generating these test cases are discussed in section 3.1.
It is critical to perform power analysis with the right stimulus to the design under analysis. This
provides the right scenarios to create a load-based power profile and also to uncover power bugs.
Characterizing a block means analyzing the various loading scenarios that the
design is subjected to and analyzing the block for power by generating the stimulus that pushes the
block into these respective operating scenarios. Power bug detection is ideally done by looking
at low load scenarios. Low load scenarios lead to lower switching activity and thus help uncover
redundant switching activity in design areas that, according to the stimuli, were supposed to be
inactive. Such control over the stimulus is achievable by connecting the power analysis flow to the
performance verification framework for its stimuli. The Thesis motivates this logical connection
between the two and modifies the verification framework to use it as the enabler
for the development and implementation of the power analysis flow.
2.4.3 IP Design flow – Introducing a Power analysis flow
The development of an ASIC chip involves several steps that have been modeled over the years as a
design flow. This process starts with a requirement or concept.
Based on this requirement, architectural specifications are framed. Then multiple iterations of
RTL coding are done followed by multiple iterations of RTL simulation and verification. The RTL
simulation is followed by logic synthesis and optimization. One of the other important steps during
synthesis is the static timing analysis (STA). STA is performed to check and ensure that all the
functional requirements of the design are achieved with timing closure. In the cases of failure to
achieve timing closure (insufficient slack) the optimization and logic synthesis might need revisiting
for improvement of the design timing performance. Once the pre-layout static timing analysis is
passed the designs proceed closer towards silicon through floor planning, placement, and Clock tree
insertion. This is validated with another static timing analysis before finalizing the routing. Finally,
the routed design is taped out. This simplified IP design flow is visualized using the flowchart in Figure
2.14. [16]
Figure 2.14: A generic ASIC IP Design methodology [16]
Now when we move this procedural flow on to implementation, we focus on the checkpoints of
the design flow. Figure 2.15 introduces the idea of where power analysis fits in the design flow. The
power analysis flow as discussed in the next section can be started when the RTL code is written, and
a verification environment is set up for the functional verification of RTL. This can enable RTL power
analysis and gate-level power analysis.
The parallelization is shown during the simulation verification and synthesis phases, when the
RTL is available for the design. The yellow boxes connect the power-based RTL iteration to the RTL
design stage. The dark green boxes connect the power analysis outcome to the flow as a checkpoint.
The sub-flow in the light green box represents a power analysis-based optimization that can be
considered similar to code coverage analysis. A power analysis flow that fits into the IP design
flow in this way creates a cohesive verification environment for functional and power optimization
of the design. Such a framework, which adds the power perspective that is missing in the standard IP
design flow, is what this Thesis sets out to develop.
Figure 2.15: Early Power perspective to IP design flow
3 Power analysis and reduction framework development methodology
The power analysis and reduction flow, whose pre-requisites were laid out in the previous sections,
was implemented in several steps that spanned the duration of the Thesis. This section explains the
methodology.
Although the flow implementation is associated very closely to the Common memory controller
(CMC), the verification framework for CMC, and specific power analysis tools, the idea of the
methodology presented is to provide a core concept that can be generalized and used for any ASIC
design blocks for their characterization and optimization (activity-based redundant dynamic power
reduction), using any environment or tools. Such a flow is feasible for any design that
has the RTL code written and a verification environment set up. Based on the use-cases of the block,
the performance validation framework can be tweaked to generate stimulus (focused test-cases) for
power analysis and power reduction. This connection to the verification framework of an ASIC IP
Block, as a source for power analysis stimulus, forms the first part of the power analysis framework
developed in this Thesis. This is followed by usage of the stimulus and design for characterizing the
design and pointing to bugs in the design. Based on this step we optimize the design with RTL
modification and repeat the power analysis steps to validate the flow.
The subsequent sub-sections detail the steps of the methodology followed to develop
the power analysis and reduction framework.
3.1 Power Test cases
Power test cases must be clarified before power analysis. Time and effort need to be spent
on specifying test cases for which the power analyses make sense. Power analysis cannot be
generalized for an ASIC IP Block. Power analysis and its results are relevant only when associated
with specific use-scenarios or test cases on the block. For example, consider a microprocessor IC
design for which the power analysis is done by booting Linux, while the actual use-case of the IC is a
certain end application running on Linux. The loading situations of these two scenarios differ, and
it does not make sense to use a Linux boot power test case for characterizing the design operation in
a different use-case.
The performance analysis framework (described in section 2.4.2) in place for CMC functional
verification is used to create the power test cases. As discussed in section 2.4.1, for the sake of the
development of this power-analysis flow, CMC can be viewed as a subsystem that provides DSPs and
other client-like systems read/write accesses to memory areas. This leads us to the need
for specifying the parameters or knobs that help characterize test cases. The first step to identifying
knobs is to understand the block's typical use-case and its corresponding activity profile. An
understanding of the test case that can simulate this use-case helps arrive at the knobs to create the
power test cases. Figure 3.1 is an indication of the types of activity curves we achieve by simulation of
the power test cases. The use-case in this profile translates to a peak for buffer
initialization at the beginning of the test case, followed by a series of read/write operations of fixed
length and intensity, which causes the roughly static part of the curve. This is followed by the end of
the test case.
The subsequent sections clarify these knobs, the motivation to use those, the methods to choose
test cases for the two purposes of the power analysis flow, and a summary of the power analysis test
cases that shall be used henceforth in the methodology descriptions.
3.1.1 Power test case/Stimuli knobs
Power test cases for the Common Memory Controller (CMC) are developed from the verification
framework using a set of parameter knobs which are varied to create different usage scenarios for
power. As an initial step to understanding these knobs and the test cases, the CMC is simulated while
varying these parameters across all values within the functional boundaries. Simulations
across the intervals of interest of these parameters help create a map of the test cases that comes in
handy when clustering them as test cases for power characterization and power optimization. This
section introduces these parameter knobs and motivates their role in the power
analysis framework.
3.1.1.1 Number of buffers allocated in test case
This knob controls the number of buffers allocated to enable the memory read/writes from the
Common Memory Controller (CMC). The buffers in the test case help execute and randomize the
accesses. A higher number of buffers increases the buffer initialization period, which is the initial spike
seen in the sample activity profile in Figure 3.1. Although this knob has an impact on the overall
energy of the test case, it does not affect the instantaneous value of power during the required
read/write operations. This knob was varied and tested across the possible intervals. This variation
was analyzed using the VCD2RPT++ ActivityExplorer tool and activity profiles were generated as a
map for the knob values. The number of buffers allocated can also be used to minimize
randomization and increase the predictability of memory access for known access sizes. This
feature of the knob became important when computing the cost of access, where the predictability of
access had to be maximized while analyzing the power cost per access.
3.1.1.2 Number of accesses per test
The number of accesses knob fixes the number of accesses that happen in a test case for a client.
The higher the number of accesses, the longer the read/write access durations, and hence the longer
the test duration. This is another knob that impacts the energy of the test case but has no effect on the
instantaneous power of the test. In the activity profile in Figure 3.1 the knob affects the roughly stable
duration of activity where the read/write access takes place after the buffer initialization. This test
knob's direct relation to test duration is used to create longer or shorter tests based on
requirements. Otherwise it is kept constant for tests that vary other knobs that impact the
instantaneous power. The number of accesses per test was varied across the possible values to study
the effect of this knob. The VCD2RPT++ ActivityExplorer tool was used to analyze the activity.
Figure 3.1: Activity Profile of a typical use-case for the block
3.1.1.3 Number of used ports/clients
The number of ports is an important knob that directly relates to power. It can be understood
as the number of clients (like DSPs or accelerators) that perform read/write transactions through the
Common Memory Controller (CMC). A higher number of ports corresponds to higher activity of
the block and therefore higher power. The number of ports is varied across the simulations and
activity analysis is performed using VCD2RPT++. This knob also helps check that the power
consumed by the block scales with the loading of the block. If that is not the case, it indicates that the
system consumes power due to power bugs caused by redundant activity that is independent of the
load. The knob is also used to calculate the cost of access as a function of the number of client
accesses. These are valuable analyses for the characterization and optimization of CMC.
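The scaling check described above can be sketched as a straight-line fit of power against the number
of active ports. The following Python example is only a minimal illustration: the power values are
invented, the units are arbitrary, and the helper name `fit_line` is hypothetical. The slope can be read
as an incremental cost per additional port, while the intercept is the part of the power that does not
depend on the load. A non-zero intercept is not by itself a power bug, since some logic is legitimately
always active, but an unexpectedly large one is a pointer worth investigating.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: number of active ports vs. average dynamic power (arbitrary units).
ports = [1, 2, 4, 8]
power = [14.0, 18.0, 26.0, 42.0]

slope, intercept = fit_line(ports, power)
print(f"incremental power per port : {slope:.2f}")
print(f"load-independent power     : {intercept:.2f}")
```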
3.1.1.4 Intensity of transactions
The roughly stable part of the activity profile in Figure 3.1 translates to the read/write transactions
happening in the test case. The intensity of the transactions is controlled through the delay between
the execution of two transactions, defined in the test as the delay between the start-of-transaction
(SOT) of two read/write accesses. The higher the intensity of transactions (that is, the shorter this
delay), the higher the activity and therefore the higher the power consumption. If we consider the
complete test case execution as a job on the Common Memory Controller (CMC), intensity as a knob
helps control the intra-job delay between the transactions that make up the job. To understand and
characterize the knob, the intensity was varied across the functionally viable values, and the activity
of CMC was analyzed across the test case using the VCD2RPT++ activity analysis tool. This knob can
be used to load the design to different levels to characterize the block and also to introduce intra-job
delays as a means to uncover power bugs, as discussed in section 3.1.3.
3.1.1.5 Port access exclusivity
Common Memory Controller (CMC) consists of multiple interconnect blocks through which the
clients perform memory access. They are used to enable options to logically or functionally discretize
and categorize memory accesses. The port access exclusivity knob was introduced to enable
selective and exclusive access to ports and to the interconnect block that the ports are a part of. This
knob enables characterization by allowing individual accesses for power cost calculation of incremental
read/write accesses. It enables optimization by enabling interconnect blocks selectively and
identifying redundant switching in unused interconnect blocks. This knob was obtained by
modifying the verification framework, after opportunities for characterization and
optimization were identified through brainstorming and discussions. The port exclusivity knob was
tested with all combinations of interconnect ports on CMC and analyzed through visualization on the
VCD2RPT++ activity analysis tool. Figure 3.2 is a screenshot capture depicting exclusive access on
one of the interconnect blocks, from the activity analysis tool. In this case the green blocks depict the
interconnect blocks through which memory access is enabled. The gray block is not used for any client
access. The associated activity plots corresponding to the interconnect blocks show the activity during
the test case. This example enables identification of redundant switching activity in the interconnect
sub-block that is supposed to be idle.
3.1.2 Activity Analysis – Test case characterization and understanding
Once the test knobs are decided and enabled in the test framework, an important step towards power
analysis is to exercise them across the feasible range and study their impact on the test
case. The knowledge of the effect of these knobs on the block and the test leads to decisions regarding
the tests and the knob settings that will be used in the power analysis flow. This
decision regarding test cases based on the knobs is an important initial checkpoint in the power
analysis flow.
To understand these knobs, the Common Memory Controller (CMC) design is simulated with
different values of these knobs. The selected knobs are exposed as input
arguments in the simulation commands that run the test case. There are also some less obvious knobs,
such as port exclusivity, which are enabled by modifying the test case in its SystemVerilog based
implementation. All the knob combinations obtained are simulated across values and the results are
captured as activity profiles. The VCD2RPT++ ActivityExplorer, an in-house tool at Ericsson, is used
to get these visualizations. The test cases are simulated to generate VCD (Value Change Dump) files.
The activity analysis tool takes in the VCD to analyze the switching activity across the design and helps
with visualizing this activity of different hierarchical elements over time. The visualization also
highlights which parts of the design are active during which period of the test case and is suited very
well to understand the implications of the test case and the knob variations on the block design. Figure
3.3 is a representation simplifying the process of test characterization before proceeding to power
analysis.
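A knob sweep of the kind described above can be sketched as the Cartesian product of the knob
values of interest. The Python example below is only illustrative: the class `PowerTestKnobs`, its
field names, the interconnect name `"ic0"`, and all values are hypothetical and do not reflect the
actual arguments of the CMC verification framework.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional


@dataclass(frozen=True)
class PowerTestKnobs:
    """One power test case, described by the five knobs of section 3.1.1 (hypothetical names)."""
    num_buffers: int                 # buffers allocated in the test case
    num_accesses: int                # accesses per test for a client
    num_ports: int                   # used ports/clients
    sot_delay: int                   # SOT-to-SOT delay; a larger delay means lower intensity
    exclusive_block: Optional[str]   # interconnect block accessed exclusively, or None


def sweep(buffers, accesses, ports, delays, blocks) -> List[PowerTestKnobs]:
    """Enumerate every combination of the given knob values."""
    return [PowerTestKnobs(*combo)
            for combo in product(buffers, accesses, ports, delays, blocks)]


# Hypothetical sweep: buffers and accesses are held constant, the rest are varied.
cases = sweep(buffers=[8], accesses=[100], ports=[1, 2, 4],
              delays=[2, 8], blocks=[None, "ic0"])

print(len(cases), "test cases, for example:", cases[0])
```

Holding the number of buffers and accesses constant while varying the other knobs mirrors the
observation in section 3.1.1 that those two knobs influence the energy of a test but not its
instantaneous power.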
Activity analysis is a good preliminary step for power analysis because power generally follows the
trends of activity very closely. Activity analysis also takes less computational effort and less time
and, in the case of VCD2RPT++, needs no external license for internal use. It therefore gives quick
pointers towards understanding the test case behavior with varying parameter knobs. The extended
functionality achieved through VCD2TB makes it possible to replay the stimulus used for the RTL simulation
Figure 3.2: Visualizing port exclusivity knob using VCD2RPT++
to generate test benches that can be used to simulate with the netlist, if that is available. This provides
a more accurate activity analysis and also gives an idea of the clock gating efficiency. These
tool-based procedures are a good preliminary analysis leading up to the actual power analysis, as
they are relatively less time consuming and help fine-tune the test cases that go on into a more
exhaustive, resource- and time-intensive power analysis using dedicated licensed tools.
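To make the idea of VCD-based activity analysis concrete, the following Python sketch counts value
changes per signal in fixed time windows. It is not the VCD2RPT++ tool and makes no claim about how
that tool works; it is a minimal illustration that handles only single-bit signals in a small,
hand-written VCD snippet. The signal names and the window size are hypothetical.

```python
from collections import defaultdict

SAMPLE_VCD = """\
$timescale 1ns $end
$scope module top $end
$var wire 1 ! clk $end
$var wire 1 % busy $end
$upscope $end
$enddefinitions $end
#0
0!
0%
#5
1!
#10
0!
1%
#15
1!
#20
0!
"""


def count_toggles(vcd_text, window):
    """Return {signal_name: {window_index: toggle_count}} for single-bit signals."""
    names = {}        # VCD identifier code -> signal name
    last_value = {}   # VCD identifier code -> last value seen
    toggles = defaultdict(lambda: defaultdict(int))
    now = 0
    for raw in vcd_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("$var"):
            # Declaration layout: $var <type> <width> <id> <name> $end
            _, _, _, ident, name = line.split()[:5]
            names[ident] = name
        elif line.startswith("#"):
            now = int(line[1:])   # timestamp line
        elif line[0] in "01xzXZ" and line[1:] in names:
            value, ident = line[0], line[1:]
            # The first assignment only initialises the signal; it is not counted.
            if ident in last_value and last_value[ident] != value:
                toggles[names[ident]][now // window] += 1
            last_value[ident] = value
    return {sig: dict(bins) for sig, bins in toggles.items()}


activity = count_toggles(SAMPLE_VCD, window=10)
for signal, bins in activity.items():
    print(signal, bins)
```

Plotting such per-window counts for each hierarchical element over time gives an activity profile of
the kind discussed in this section.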
3.1.3 Differential Energy Analysis – Test case tuning for optimization
Once the verification framework is understood and the test knobs are framed for power analysis, the
focus is on finding the right test cases for the characterization and optimization requirements of the
power analysis flow. The test cases for characterization span all possible loading conditions of
the block, because characterization requires understanding the behavior
and performance metrics in all the scenarios the block operates in. The test cases for dynamic power
optimization, in contrast, focus on uncovering redundant switching activity across the design. This
involves identifying design areas that are active but are not functionally necessary for that test case.
There are different approaches to identifying such cases, and Differential Energy analysis is one of
the techniques that help arrive at test cases indicating redundant activity and scope for optimization.
It is a technique jointly developed and published by Qualcomm and Ansys. [17]
Identifying test cases well suited to uncover power bugs is critical. This is where Differential
Energy analysis comes into play. The technique involves several steps that are not supported
by the power analysis tools used in this Thesis. It was therefore implemented manually, in steps, using
Excel and open-source tools together with the power analysis tools, based on the underlying
concept explained in the subsequent section.
3.1.3.1 Concept
Looking through simulation for all redundant toggles in a design is not the ideal way to identify
inefficiency. Such a search is exhaustive and time-intensive, out of proportion to the
Figure 3.3: Simplified view of test case tuning
impact on power savings. This motivates the Differential Energy Analysis which points early to scope
for dynamic power optimization at RTL. The core of the idea is to focus on the energy of
jobs/operational scenarios of a design rather than power. Energy can be understood as power
integrated or summed up over time. It can be formulated as below.
$$\mathrm{Energy} = \int_{0}^{t} \mathrm{Power}\; dt = \sum \left( \text{Piecewise power} \times \Delta\,\text{time} \right)$$
This leads to a rather novel technique where, instead of looking directly at redundant switching,
the energy consumed by the design for a test case is compared with the energy consumption of a
slowed-down version of the same test case, achieved by mimicking stalls, starvations, or
additional latencies in the test case. The stalls or intra-job latencies do not affect the original
workload. This means that for a given workload, irrespective of the stalls, the energy consumed by
the typical and the stalled test case needs to be the same. This is expected because, as these latencies
increase the duration of the test, the average power has to decrease proportionally. Tests where the
energy is not similar in the two cases are a point of concern. They point to redundant switching in the
design, because the same workload does not consume the same energy. In other words, although the
same energy was expected for the same workload, the design spends extra energy due to inefficiencies.
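The argument can be written out as a short worked derivation. The symbols below are introduced
here for illustration and are assumptions of this sketch, not notation from [17]: $E_{typ}$ and
$E_{stall}$ are the energies of the typical and the stalled test case, $T_{stall}$ is the total inserted
stall time, and $\bar{P}_{idle}$ is the average dynamic power that the design dissipates during those
stalls. It is assumed that the work-related energy is unchanged by the stalls.

$$E_{typ} = \int_{0}^{T} P_{typ}(t)\, dt$$

$$E_{stall} = E_{typ} + \bar{P}_{idle} \cdot T_{stall}$$

$$\Delta E = E_{stall} - E_{typ} = \bar{P}_{idle} \cdot T_{stall}$$

For a design without redundant switching, $\bar{P}_{idle} = 0$ and therefore $\Delta E = 0$, however long
the stalls are. A $\Delta E$ that grows with $T_{stall}$ indicates activity during the stalls that does not
contribute to the workload.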
This idea is demonstrated in Figure 3.4, where the plot on the left shows the power profile of the
two variations of the job/test as the ideal example of a design optimized for dynamic power efficiency.
The blue area refers to the typical job’s power versus time plot, with the area under it being the energy.
The yellow area refers to the same job with intra-job latencies leading to a longer job. But since both
the jobs carry out the same workload, the energy, that is, the area under the graph, is the same.
Designs in general, however, have different levels of inefficiency and do not always behave ideally.
The plot on the right side shows the realistic scenario where the energy of the job with intra-job
delays exceeds the energy of a typical job execution. This points to redundant toggles occurring in the
design during the idle durations or stalls. So according to the plot, rather than being dependent only
on the workload, the energy consumed becomes a function of the run-time of the test too. This is
undesirable and points to inefficiencies. It helps in identifying the right test cases for power analyses
of subsystems when the focus is on uncovering power bugs and optimizing.
Figure 3.4: Understanding Differential Energy Analysis - uncovering inefficiencies, Image from [17]
3.1.3.2 Implementation
Differential Energy analysis, despite being an interesting approach towards identifying tailored
test cases for power optimization, cannot be accomplished end-to-end using the
power analysis tools used. The methodology therefore has to be implemented manually using
other tools. This involves a sequence of steps that should precede the exploration of power reduction.
The first step in deciding the suitability of a test case for optimization is to simulate the test case
and dump its VCD. Next we need a version of the same test that performs the same
workload, but with bubbles/stalls/intra-job latencies introduced in the test framework. In the case of
the Common Memory Controller (CMC) and its verification framework, the intensity of read/write
transactions (discussed in section 3.1.1.4) is decreased. This can also be understood as an increase in
the time delay between the start of transaction (SOT) of one access and the SOT of the next. When
the two test cases are ready and have been simulated to dump their respective VCDs, the next step is
to obtain their power profiles (power vs. time). A Spyglass Power estimation analysis on the test case
generates this power vs. time plot. Once we have the power profile plots corresponding to the two
cases, the next step is to calculate the energy content of the two curves for the differential energy
analysis. Since this is not a feature that is readily available in the power analysis tools, the curves
need to be converted into data points to calculate the energy content manually. The open-source tool
WebPlotDigitizer Version 4.1, distributed under the GNU Affero General Public License Version 3 [18],
is used to convert the plots obtained from the power analysis of the two cases into Comma Separated
Values (CSV) for further computations. It takes the power profile plot images as inputs and, with some
manual scoping, converts each plot into a CSV consisting of the time and corresponding power values.
Once the CSV is obtained from the plot digitizer tool, the energy content of the curves is calculated in
Microsoft Excel using the equation for energy defined in section 3.1.3.1. Since we are looking at a
digitized plot, we use the summation of areas method, splitting the area under the curve into
tiny rectangles to get the closest estimate of the integral. Once we have the energy of the two plots in
Excel, they are subtracted to obtain the excess energy in the delayed test case. The higher the energy
difference between the two cases, the greater the scope for identifying redundant toggling activity in
the design. The whole procedure implemented using these tools is represented in Figure 3.6.
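The energy computation on the digitized curves can also be scripted. The Python sketch below is a
minimal illustration of the same rectangle summation, not the spreadsheet used in the Thesis. It
assumes a CSV layout of one `time,power` pair per row without a header, and the data values are
invented. With power in mW and time in µs, the resulting energy would be in nJ.

```python
import csv
import io

# Hypothetical digitized power profiles: time, power (one sample per row).
TYPICAL_CSV = """\
0,8
1,8
2,8
3,8
4,0
"""

STALLED_CSV = """\
0,4.5
2,4.5
4,4.5
6,4.5
8,0
"""


def read_profile(csv_text):
    """Parse 'time,power' rows into a time-sorted list of (time, power) floats."""
    points = [(float(row[0]), float(row[1]))
              for row in csv.reader(io.StringIO(csv_text)) if row]
    return sorted(points)


def energy(points):
    """Rectangle-rule estimate: each sample's power is held until the next sample."""
    return sum(p0 * (t1 - t0) for (t0, p0), (t1, _) in zip(points, points[1:]))


e_typical = energy(read_profile(TYPICAL_CSV))
e_stalled = energy(read_profile(STALLED_CSV))

print(f"typical energy : {e_typical}")
print(f"stalled energy : {e_stalled}")
print(f"excess energy  : {e_stalled - e_typical}")
```

In this invented example the stalled run takes twice as long, but its power does not drop to half, so
the stalled run shows excess energy. The finer the digitized sampling, the closer the rectangle sum
comes to the integral.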
To get more insight out of the procedure described, we start with the total power curves
for the differential energy analysis, i.e., the total power profiles with and without the bubbles are
compared. If we see a significant difference in energy between the two cases, we perform a similar
procedure on the internal power and switching power plots. Once we choose the candidate with the
higher energy difference, we go further down the tree and find the energy difference in the memory,
sequential, and combinational components of both internal and switching power. This helps us
narrow down the components of the design that mainly contribute to the redundant power
dissipation. This flow-tree of going down the analysis can be visualized as in Figure 3.7. The
inferences made from the internal and switching energies can be analyzed as in [17].
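The drill-down can be sketched as a comparison of per-component energies between the two runs. In
the Python example below, the energy values, the threshold, and the function names are all
hypothetical; only the component breakdown (internal and switching power, each split into memory,
sequential, and combinational) follows the text.

```python
# Hypothetical energies (arbitrary units) keyed by (power type, component group).
typical = {
    ("internal", "memory"): 10.0,
    ("internal", "sequential"): 12.0,
    ("internal", "combinational"): 5.0,
    ("switching", "memory"): 3.0,
    ("switching", "sequential"): 6.0,
    ("switching", "combinational"): 4.0,
}
stalled = {
    ("internal", "memory"): 10.5,
    ("internal", "sequential"): 18.0,
    ("internal", "combinational"): 5.2,
    ("switching", "memory"): 3.0,
    ("switching", "sequential"): 6.1,
    ("switching", "combinational"): 4.0,
}


def energy_deltas(typical, stalled):
    """Excess energy of the stalled run over the typical run, per component."""
    return {key: stalled[key] - typical[key] for key in typical}


def rank_suspects(typical, stalled, threshold):
    """Components whose excess energy exceeds the threshold, largest first."""
    deltas = energy_deltas(typical, stalled)
    flagged = [(key, delta) for key, delta in deltas.items() if delta > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


suspects = rank_suspects(typical, stalled, threshold=0.4)
for (power_type, group), delta in suspects:
    print(f"{power_type:9s} {group:14s} excess energy = {delta:.2f}")
```

With these invented numbers the internal energy of the sequential components stands out, which
would direct the register-level investigation towards the clocking of the registers.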
Figure 3.5: Inferences based on the scenarios of energy difference between the two test cases [17]
Figure 3.6: Differential Energy Analysis Implementation flow
In Figure 3.5 the red arrows denote an increase in the energy of the stalled test case compared
to a typical job execution. The green arrows represent no noticeable change in the energy calculated.
Inferences can be made on the sources of inefficiencies based on the components that show a
significant energy difference, as shown in the table. A very common scenario is represented in point 2,
where redundant clock toggles continue to occur even when data on the D/Q pins does not toggle.
These are pointers to implement the widely used clock gating in the design, a proven technique
that improves the dynamic power performance of designs.
Once the major problem areas are identified by going down this analysis tree, we can
look into the specific areas for localized redundancy reduction using the register level metrics
computed in the power analysis in Spyglass. Once the root causes of these problems are identified,
they can be fixed at RTL and the design re-simulated for power analysis. This enables early power bug
detection and therefore improved power performance of the designs. The flow of the complete
analysis can be visualized as in Figure 3.8.
The flow starts with the power test cases identified in the previous sections using the functional
verification framework. These tests are simulated with and without stalls, controlled from the
verification environment, to obtain switching activity details. A power analysis run over the two cases
leads to the power profiles for the different power components discussed in previous sections. These
power profiles are used to calculate energy consumption and are categorized based on the power
components for memory, sequential, and combinational circuits. This helps localize problem areas,
and once these are narrowed down, specific pointers for RTL based optimization are taken up.
Figure 3.7: Differential Energy Analysis - Flow sequence for redundancy localization
3.2 RTL based early Power Analysis and Optimization
The flow proceeds to power analysis once three things are in place: the power analysis flow can
obtain stimulus from the performance verification framework, the knobs that control the test
parameters have been specified, created, and tested, and the optimization test-cases have been
filtered out and tailored for the flow (using techniques like the differential
energy analysis). For this Thesis, Spyglass Power (from
Synopsys) is used as the main analysis tool in the RTL based early power analysis flow. This choice of
early analysis is motivated by the advantages of early problem identification in the IP design
flow. The power analysis flow serves a two-fold purpose for any design block under analysis:
characterization and optimization of the block.
Figure 3.8: Analysis and problem localization using Differential Energy Analysis
Figure 3.9: Purpose-to-use case/operating point correlation to utilize the flow for ASIC IP blocks
The goals of the power analysis flow developed in the Thesis are connected to the use-cases of the
ASIC IP block as shown in Figure 3.9. The operating points in the figure originate at the system level.
These are propagated down to the block and subsystem level. The operating points are categorized
based on the various utilization scenarios. The idle operating point involves low utilization, such as
sleep or coming up from reset. The typical operating point refers to the typical usage scenario of the
ASIC IP block. The high operating point refers to use-cases that are generally short-lived, high
utilization scenarios of the block. The thermal operating point is derived to consider overutilization
when planning thermal and cooling requirements. Special cases refer to specific use-cases of the block
developed to uncover power inefficiencies and optimize the block. The operating points propagated
down from system level are all used to collect characterization data for the block. This data forms a
database that, combined with the data of all other sub-blocks in the ASIC, helps estimate the power
with good correlation using a system-level power analysis. The only operating points of interest for
uncovering power bugs and for optimization are the low (idle) and typical operating points, as these
are the points that can help uncover redundant dynamic activity. Special cases such as the
Differential Energy Analysis test cases are used only to provide pointers to optimization.
3.2.1 Characterization of the block
Power characterization is an important requirement for ASIC IP Blocks. All ASIC designs have
system-level power and thermal design requirements, corresponding to different operating points.
These system-level operating points are propagated down to the subsystem and block levels. Hence it
is important to characterize the power of a hierarchical IP block such as Common Memory Controller
(CMC). Therefore, we use the performance verification framework and the power knobs developed to
incrementally load the CMC to track the scaling of activity and power profile with the load capacity.
We monitor the power split-up, activity, and gating efficiency numbers under varying
load, and document and plot them. The goal is to have the power and activity numbers baselined for
a design. These serve as good starting points when optimizing the blocks or when scaling
the block for future designs. This helps the SoC architects predict the power and thermal performance
in advance with reasonable accuracy, which leads to better planning and a reduced risk of
uninformed decisions. Figure 3.10 represents a sample of the system-level operating point
characterization, which can be propagated down to sub-block levels.
Figure 3.10: A representation of power-based operating points of an ASIC IP Block
In the case of CMC, the block was loaded incrementally in terms of the number of ports/clients
accessing memory areas. This was done by simulating the power test cases used from the verification
framework, with different arguments provided as input for the number of port accesses knob. The
argument was varied across all logical values relating to the range of available ports on CMC. Each of
these simulations produced an FSDB file which was provided as the input for the corresponding
Spyglass Power analysis for that loading condition. The inputs and other fundamentals for the
Spyglass analysis are as discussed in section 2.2.2. An activity analysis is performed first for the
different operational points followed by Spyglass Power analyses for these test cases. The power
analyses focus on collecting, analyzing, and visualizing the power metrics for the varied loading
conditions. The metrics tabulated and plotted are the activity in the design, the power consumption and
its split-up, and the clock gating metrics discussed in this Thesis. The trends in these metrics with
loading are studied and visualized. The technology-independent metrics collected in this process are
helpful for the SoC architects to foresee design requirements such as a power delivery system, a heat
exchange mechanism, and other aspects of product design.
The power cost of a memory access by one client is another characterization metric calculated for
the CMC. The power cost is then studied as the number of clients performing unique memory accesses
increases. This yields a relationship between the load on the CMC and the power, and shows the trend
of the cost per access as the number of clients increases. This is performed by isolating test cases
for the cost analysis using the knobs and incrementing the clients over a few cases such as 1 client,
5 clients, and 10 clients. This variation is used to track the cost as the clients increase and to look
for a relationship that holds for higher client numbers. The memory access location is fixed, and a
default of 100 accesses is captured to average the cost over. Then activity analysis is performed over
the block by using the VCD2RPT++ tool. The active hierarchical elements during the access are
monitored and noted for all the test cases. Once the active blocks are identified using this tool,
power analysis is done using Spyglass Power, and the power numbers corresponding to the individual
access are gathered. This leads to the estimation of power for individual memory accesses for clients.
The overhead that multiple clients add to the cost is also captured by repeating this analysis with 5
and 10 clients accessing memory.
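The cost bookkeeping in this paragraph can be sketched as below. It is a simplified illustration under stated assumptions: the measured value for each case is taken to be the average power of the active hierarchy over the captured window of accesses, the per-client cost is that value divided by the number of clients, and the overhead is the per-client cost relative to the single-client case. The function names and the input layout are hypothetical.

```python
def per_client_cost(measured_power_mw):
    """Per-client access cost for each test case.

    measured_power_mw maps the number of clients to the average power (mW)
    measured over the captured window of accesses (an assumed input format).
    """
    return {n: power / n for n, power in measured_power_mw.items()}


def multi_client_overhead(cost_by_clients):
    """Extra per-client cost relative to the single-client case."""
    single = cost_by_clients[1]
    return {n: cost - single for n, cost in cost_by_clients.items()}
```

A flat overhead across the 1, 5, and 10 client cases would suggest that the cost scales linearly with load; a growing overhead would point to shared logic such as arbitration becoming more active.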
A study on the CMC with default values of all the power test knobs except the number of port
accesses, with the incremental analysis on it, lays the basis for characterization of the blocks. The
number of client accesses is one of the major knobs in controlling the loading of the block although
similar analysis can also be done on other knobs such as intensity in characterizing the block. From
the Spyglass analysis all possible power numbers and power efficiency metrics are captured,
tabulated, and visualized. The activity parameters captured are the average activity, the average
register activity, the average register D pin activity, and the average combinational net activity. The
power-related metrics captured are the average total power, internal power, switching power, total
dynamic power, and static leakage power. The power performance metrics captured are the clock
gating efficiency, the average Register Output Activity Density for Flop (ROADF), and the average
Register Output Activity Density for Enables (ROADE). This forms a baseline analysis that characterizes CMC.
The characterization of CMC leads to data points corresponding to the technology-independent
metrics such as activity and clock gating efficiency, and other technology-dependent numbers such as
power split-up. The technology-independent metrics, when characterized and well documented for a
design, form good supporting data for Electronic System-Level (ESL) power analysis, which
is a more holistic IC-level analysis. ESL power analysis combines physical layout and fabrication
vendor’s inputs with the technology-independent metrics characterized by this flow to extrapolate and
predict the effect of future design decisions. It provides the much-needed starting
point to predict, with sufficient accuracy, the effect of modifications on baselined blocks in use.
A series of relationships is derived from the metrics and power numbers obtained as data
points and presented as graphs. These details are shared with the SoC architects and other
stakeholders of the block’s future designs to help make extrapolations on power estimates and foresee
future requirements such as power management systems and heat management systems.
This form of characterization at the block level helps form a database of power behavior of each
block under different loads. The database is formed out of the metrics such as register and
combinational switching activity and the memory utilization. When such ‘power behavior databases’
are accumulated for all the blocks, system-level analysis of power is enabled and forms the basis of
analyzing the system power and foreseeing improvements as a system.
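One way to picture such a ‘power behavior database’ is as a list of per-block records keyed by operating point, which a system-level estimate then sums over. The record layout, the field names, and the plain summation below are assumptions made for illustration; they are not the database format used at Ericsson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterizationPoint:
    """One characterization record for a block at one operating point (hypothetical layout)."""
    block: str
    operating_point: str          # e.g. "idle", "typical", "high", "thermal"
    avg_activity: float           # technology-independent
    clock_gating_efficiency: float  # technology-independent, percent
    total_power_mw: float         # technology-dependent


def system_power_mw(points, operating_point):
    """First-order system estimate: sum of block powers at one operating point."""
    return sum(p.total_power_mw for p in points
               if p.operating_point == operating_point)
```

Keeping the technology-independent fields next to the power numbers is what lets the same record be reused when a block is scaled or moved to another technology node.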
3.2.2 Analysis and Optimization flow
Apart from characterizing the design block, one of the major motivations for early power analysis is
the optimization of the block. The Thesis focuses on dynamic power optimization by eliminating
redundant switching activities in the design. This is achieved through a sequence of steps starting
with power analysis using the Spyglass Power tool. Introducing early identification of power bugs and
optimization into the IP design flow improves the whole flow and reduces the chance of finding costly
bugs later in the design flow. The goal of this power analysis flow implementation is to develop a flow
that can be incorporated into the design cycle, similar to a code coverage analysis, where realistic
goals are set for designs and iterative analyses are performed to reach them.
An analysis and optimization flow is feasible through the use of metrics that quantify the current
performance. In the case of power optimization, the metrics associated with power need to be
understood and quantified clearly before they can be optimized for. The metrics choice and definitions
are explained in section 2.3.2. The aim is to set metric goals, as part of the power analysis flow,
against which the design is optimized. These goals need to ensure good dynamic power performance
of the design and yet be realistic for the design. A suggested approach for a new implementation of
the power analysis goals is to start with qualitative soft goals and fine-tune them over time with
incremental knowledge of the block’s power performance in the use-cases it is exposed to. The
dynamic power performance metrics mainly used in the flow are the dynamic and static clock gating
efficiency (DCGE and SCGE) and Register Output Activity Density for Enables (ROADE).
So, this flow starts with a power analysis on the design using Spyglass Power. The power analysis
is performed in a sequence of steps as described in section 2.2.2. This provides a hierarchical split-up
of power numbers and the metrics discussed. Once the analysis is set up and all the power and metric
results of the analysis are available for an optimization test case (generated from the performance
framework), we perform the analysis following the flow described in Figure 3.11. This is performed on
the test cases that were developed specifically for optimization using techniques such as the
Differential Energy Analysis described earlier. The differential energy analysis
used to arrive at such test cases also provides pointers to the specific areas of the design that are the
main contributors to dynamic power inefficiency.
The flow in Figure 3.11 describes the iterative approach to power optimization. It is a
metric-based optimization flow, where the metrics are the outcome of a Spyglass Power analysis run
on the design. A good dynamic clock gating efficiency rules out the need for any further
optimization. Hence that is the initial hurdle that the design has to pass. For Common Memory
Controller (CMC), a good soft goal of 80% is set for DCGE for this flow development. Designers
associate good estimates for the metric goals and fine-tune them based on the knowledge of the
design’s power performance gained over time. If the DCGE is insufficient, the flow proceeds to analyze
the hierarchical level further. There could be two reasons for low DCGE. The first reason is a low
percentage of instantiated clock gates in the design. This is quantified by the static clock gating
efficiency (SCGE). A low SCGE means insufficient clock gates in the design. This is mitigated by the
instantiation of more clock gates in the design. This can be done by going down the hierarchy to the
register level and identifying the register banks without the provision for clock gating and then
introducing gating conditions in the corresponding RTL. We set a soft goal of 90% on SCGE. This
should necessarily improve the SCGE of the design on the next iteration of power analysis. But that
might not be the case with DCGE. So, in the subsequent iteration the DCGE is rechecked. If the SCGE
improves and the DCGE does not improve, the flow proceeds to check the next metric, Register
Output Activity Density for Enables (ROADE). ROADE is a metric that reflects the quality of the clock
gating enables. If the ROADE is not high enough the gating conditions are not good, and this has led
to inefficient clock gating in the design. This motivates going down hierarchies where ROADE is low
and identifying register banks with poor enable conditions and logically improving them. This should
improve the ROADE. A soft goal of 95% is set on ROADE of the design. A hierarchy that meets the
ROADE goal is accepted even if it fails the DCGE goal. This follows from the fact that the DCGE
reaches its peak value for ROADE greater than 95%: the DCGE cannot go any higher for that
design because the remaining clock edges are needed to capture data transitions. A
failure in the ROADE goal should lead to modification of the clock gating enable conditions for
registers with bad ROADE. Improvement in ROADE will necessarily lead to increased DCGE. Thus,
an RTL change is made for this modification and another power analysis iteration is performed. Once all
the metrics in the sequence are passed, the flow proceeds to the next hierarchical entity. This flow is
continued across all hierarchical elements until all elements are passed or waived for special reasons.
Such an analysis can be compared to the code coverage analysis.
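The decision sequence described above for one hierarchical element can be sketched as a small function. The soft goals (80% DCGE, 90% SCGE, 95% ROADE) are the ones stated in this section; the type names, the action labels, and the idea of taking the metric values as plain percentages read from a power analysis report are assumptions made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum

# Soft goals stated in the text, in percent.
DCGE_GOAL = 80.0
SCGE_GOAL = 90.0
ROADE_GOAL = 95.0


class Action(Enum):
    PASS = "metrics met: move to the next hierarchical element"
    ADD_CLOCK_GATES = "low SCGE: add clock gating to ungated register banks"
    IMPROVE_ENABLES = "low ROADE: improve clock gating enable conditions"
    WAIVE_AT_PEAK = "ROADE goal met: DCGE treated as being at its peak"


@dataclass
class Metrics:
    dcge: float
    scge: float
    roade: float


def next_action(m):
    """Pick the next step for one hierarchical element in an iteration."""
    if m.dcge >= DCGE_GOAL:
        return Action.PASS            # first hurdle passed, nothing to do
    if m.scge < SCGE_GOAL:
        return Action.ADD_CLOCK_GATES  # too few instantiated clock gates
    if m.roade < ROADE_GOAL:
        return Action.IMPROVE_ENABLES  # gates exist but enables are poor
    return Action.WAIVE_AT_PEAK        # DCGE cannot improve further here
```

After each RTL change the power analysis is rerun and the function is applied again to the updated metrics, which mirrors the iterative loop of the flow.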
This flow was developed and run on the CMC. As a result, a document was created that
lists register-level pointers for improving the metrics, along with the scope for
improvement. This was shared with the design team at Ericsson and discussed as a basis for
changes that could be made to improve the design. An optimization effort based on the
analysis of one of the sub-blocks was also made by RTL modification.
Figure 3.11: Dynamic power optimization implementation methodology
3.3 Netlist based power analysis
Early power analysis is the first step in the Thesis because the earlier changes are
made to a design from the power perspective, the higher the impact on the power savings. However,
the accuracy of power estimation is higher the further down the design stage the ASIC IP
Block is. Thus, the power analysis flow developed in the Thesis needs to incorporate pre-layout
netlist-based activity and power analysis, which this section deals with. Power analysis towards the
sign-off stages gives relatively more accurate results due to higher
switching-activity-annotation accuracy.
Functional verification environments such as UVM and TLM are built around RTL model
simulations. This presents challenges to plug-and-play netlist DUTs in such an environment. So,
when accuracy motivates Gate level power analysis, Gate level simulation is necessitated. Developing
a verification framework around netlist DUTs is a time and resource-intensive task that can inhibit
gate-level power analysis [21]. This leads us to the first steps to gate-level power analysis using the
in-house tool at Ericsson. This tool enables gate-level simulation by leveraging the RTL
verification environment with a netlist DUT for simulation. The subsequent step uses PrimeTimePX
from Synopsys to perform netlist-level power estimation for accurate sign-off power estimation. This
can be used as the characterization tool after the iterative power analysis flow finalizes the design.
In this Thesis, this tool is used to validate the power savings for a sub-block of the Common Memory
Controller (CMC) after the block is modified to implement clock gating where scope for reducing
dynamic power inefficiency is identified. The subsequent section proceeds to the first steps of
netlist-based simulation and power analysis.
3.3.1 VCD2TB – Netlist simulation using Dump, Convert, Replay
VCD2TB is an in-house tool designed at Ericsson that helps gate-level simulation for power analysis
using the RTL verification stimuli used for simulation [21]. It enables quicker Gate level simulation
by generating stimulus from the RTL simulation environment that can be replayed on the netlist. It
can be understood from the sequence of steps as depicted in Figure 3.12.
The procedure towards gate-level power analysis starts with an RTL simulation. The RTL design
is simulated together with the UVM based verification environment. As a result of the simulation,
an input stimuli VCD is generated. The VCD2TB tool uses the input stimuli VCD to create a
testbench that, once compiled, can be used to simulate a netlist. As a result of the gate-level
simulation, a gate-level VCD file is generated for activity and power analysis. This VCD can be used
to study the activity and clock gating efficiency using the VCD2RPT++ tool. The gate-level VCD can
also be used as input for a power analysis tool such as Spyglass Power, along with the netlist design,
to estimate power and related metrics with higher accuracy. This technique saves significant time and
resources by eliminating the need to build a verification framework for netlist-based
simulation. It follows the sequence of dumping a VCD from the RTL simulation, converting the input
stimuli into a testbench, and replaying the testbench on the netlist to generate gate-level
VCDs, which can be used for a more accurate analysis of power.
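To make the "convert" step more concrete, the sketch below extracts time-stamped value changes for a chosen set of input signals from VCD text, which is the kind of information a replay testbench needs. It is not the VCD2TB tool and does not reflect its interface; the function name is hypothetical, and the parser assumes a simple VCD in which every `$var` declaration sits on one line and only scalar and binary vector changes matter.

```python
def vcd_to_stimulus(vcd_text, input_names):
    """Return [(time, signal_name, value), ...] for the selected input signals."""
    id_to_name = {}   # VCD identifier code -> signal name
    events = []
    time = 0
    in_header = True
    for line in vcd_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if in_header:
            # Declaration form: $var <type> <width> <id> <name> ... $end
            if tokens[0] == "$var" and len(tokens) >= 5:
                if tokens[4] in input_names:
                    id_to_name[tokens[3]] = tokens[4]
            elif tokens[0] == "$enddefinitions":
                in_header = False
            continue
        head = tokens[0]
        if head.startswith("#"):
            time = int(head[1:])                      # timestamp line
        elif head[0] in "bB" and len(tokens) == 2:
            if tokens[1] in id_to_name:               # vector: b<bits> <id>
                events.append((time, id_to_name[tokens[1]], head[1:]))
        elif head[0] in "01xXzZ":
            if head[1:] in id_to_name:                # scalar: <value><id>
                events.append((time, id_to_name[head[1:]], head[0]))
    return events
```

A replay testbench would then drive each listed signal with its value at the given time on the netlist inputs; generating that testbench is outside the scope of this sketch.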
This approach leads to a quick netlist-based gate-level simulation and VCD dump, which
enables power analysis. Such a tool is useful in an iterative analysis flow such
as the one developed in this Thesis: it avoids the need for a separate gate-level verification
environment and helps obtain and visualize activity profiles much faster than a netlist-based power
analysis tool such as PrimeTimePX. PrimeTimePX is preferable for final validation steps and for a
specific narrow time window in a test case, where the narrow simulation is not as resource-intensive.
3.3.2 PrimeTimePX – Gate level sign-off power estimation
PrimeTimePX, introduced in section 2.2.3, is a power analysis tool from Synopsys that is used in
the Thesis for a more accurate gate-level sign-off power estimation. This forms one of the later steps
in the analysis flow. One of the major scenarios where the Thesis flow uses PrimeTimePX is to
validate an improvement identified using the methodologies discussed.
PrimeTimePX is used in the power analysis flow when narrowing down on problem areas
and validating improvement through modification, during the later stages of the analysis. One
such problematic area was identified in the Common Memory Controller (CMC). Problem areas were
identified early in the flow and validated for modification in the later stages of the analysis flow. Once
the problem areas are identified, the subsequent steps are to identify clock gating conditions and
implement them in RTL as discussed in section 2.3.1. After modification of the RTL,
a netlist-level power estimation is performed to validate the optimization and finalize the changes.
This involves synthesizing the modified RTL to generate a new pre-layout netlist and then using this
for the subsequent PrimeTimePX power analysis. A comparative analysis is performed between the two
versions of the design, one with manually instantiated clock gates in RTL and the other without them.
The results of the two are compared and analyzed. This is used as a sign-off validation step in the
analysis flow developed in the Thesis. This example analysis performed in CMC is discussed in the
subsequent section.
Figure 3.12: Using VCD2TB for gate-level simulation
3.4 Cache block analysis – A sample analysis & optimization
The cache block in the Common Memory Controller (CMC) is used as an example of the analysis and
optimization performed using the flow developed in this Thesis. This section deals with analyzing
the cache block for issues using the power metrics visualization, hypothesizing the cause of the
problem from the analysis, identifying the corresponding root cause in RTL, modifying the RTL for
improved performance, and finally validating the modification using the sign-off power analysis
tool, PrimeTimePX.
3.4.1 Problem
The power analysis flow developed in the Thesis begins at a higher level in the
hierarchy and drills down to specific problem areas that can be improved. The CMC contains a cache
subsystem whose early analysis points to issues related to poor clock gating. The first step of the
analysis was to simulate the block in a test case that expected low activity on the cache block. Then, the input
VCDs generated from the simulation are used as inputs to the VCD2TB tool as discussed in section
3.3.1. This generates the testbench for the Gate level simulation. The gate-level VCD generated from
the simulation is used to analyze activity and clock gating efficiency using the in-house tool
VCD2RPT++. This helps visualize the activity and the Clock gating efficiency of each of the
hierarchical elements in the CMC across the duration of the test case.
Figure 3.13: Methodology of identifying power inefficiency in CMC - VCD2RPT++ screenshot
The cache block consists of associated cache memory, a controller and its interface block. It
handles cache access requests from clients and provides access to the content of the memory if it is a
‘cache hit’; otherwise it signals a ‘cache miss’ and proceeds to move the data into a free cache
slot. Figure 3.13 shows the screen captures from the VCD2RPT++ tool’s user interface, focusing on
the cache sub-block, for a test case that expects low or almost no activity on the block at a time instant
in the test. It shows the sub-blocks within the cache block in CMC, visualized as rectangles, where
each rectangle represents a hierarchical element inside the cache. The activity view on the left shows
the activity across the sub-blocks (color-scaled as in the scale on top of the screenshot) and the CGE
view (color scaled according to the scale on top of the screenshot) on the right represents the
visualization of clock gating efficiency across the sub-blocks. The highlighted sub-block with the dark-
blue outline represents an acknowledgment (ACK) first-in-first-out queue (FIFO) sub-block in the
cache that handles acknowledgment requests for cache accesses. There are multiple instances of such
ACK FIFO, and this analysis narrows down to this level for analysis and optimization.
Figure 3.13 shows that the ACK FIFO highlighted in the two views, at a time instant of the test
case, shows low activity (from the color code), which is expected from a test case that does not
create high activity in the cache block. The CGE view shows that the clock gating efficiency is low
(from the color code) for the same FIFO sub-block. Low activity combined with low clock gating
efficiency points to redundant clock switching that is unnecessary for the functionality of the design,
since the clock toggles when neither data nor functionality needs the hierarchical element to be active.
This analysis supports the hypothesis that some form of redundancy causes poor clock gating
efficiency in low-activity scenarios. This can be seen across all six instances of the ACK FIFOs in
Figure 3.13. This indicates scope for improvement in the clock gating of the FIFO blocks
by modifying the RTL. This visualization-based technique is one of the techniques the flow uses to
pinpoint coarse-grained redundancies early in the design.
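The visual check described above, low activity together with low clock gating efficiency, can also be written as a simple filter over per-block numbers. The thresholds, the block names, and the input layout below are hypothetical; in the Thesis the check is done by inspecting the color-coded views.

```python
def flag_redundant_clocking(blocks, max_activity=0.05, min_cge_pct=50.0):
    """Return blocks that are nearly idle yet poorly clock gated.

    blocks maps a hierarchical name to (activity, clock gating efficiency in
    percent) at one time window. Both thresholds are illustrative assumptions.
    """
    flagged = []
    for name, (activity, cge_pct) in blocks.items():
        if activity <= max_activity and cge_pct < min_cge_pct:
            flagged.append(name)
    return sorted(flagged)
```

A block that is busy and poorly gated is not flagged, because its clock edges may be needed; only the combination of low activity and low gating efficiency indicates redundant clocking.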
3.4.2 Solution and validation
As the analysis above identified the possibility to improve clock gating conditions in the RTL of the
ACK FIFOs, the subsequent step is to study all the possibilities for clock gating enable conditions in
the RTL corresponding to these FIFOs. The goal of this step is to identify redundant read/write
operations where the logic does not gate/disable redundant clocking where there is no update in the
data. Such a situation is identified for the read operation of the RTL, the code snippet of which is
represented in Figure 3.14 and Figure 3.15.
By focusing on the problematic area based on the metrics evaluated in the power analysis flow,
we identify the modules whose RTL can be modified to improve the power performance by
introducing clock gating. This is the case with the ACK FIFOs: when the analysis discussed above
pointed to poor clock gating efficiency, the RTL was analyzed for better gating conditions for the
registers in the FIFO. As shown in the code snippets compared in Figure 3.14 and Figure 3.15, it can
be seen from the RTL that, in the original design without clock gating, the read operation is
redundant in cases where there are no data updates. This leads to a data update checking flag
being used as the enable for the clock gating, as highlighted in the green box in the modified snippet
with CG insertion. The if condition wrapping the complete array read operation disables redundant
reads when there is no new data to be updated. This modification aims to improve the switching
performance of the ACK FIFO in the cache interface block.
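The effect of the enable can be illustrated with a small behavioral model, written in Python rather than RTL. It is not the FIFO code from Figure 3.14 or Figure 3.15: the `update_flags` trace and the function names are hypothetical, and clock gating efficiency is simplified here to the percentage of cycles in which the register bank receives no clock edge.

```python
def clocked_cycles(update_flags, gated):
    """Count cycles in which the read register bank receives a clock edge.

    update_flags holds one boolean per cycle: True when new data must be
    captured. Without gating, the bank is clocked on every cycle.
    """
    if not gated:
        return len(update_flags)
    return sum(1 for flag in update_flags if flag)


def clock_gating_efficiency_pct(update_flags, gated):
    """Percentage of cycles with the clock gated off (simplified definition)."""
    total = len(update_flags)
    return 100.0 * (total - clocked_cycles(update_flags, gated)) / total
```

For a low-activity trace with one update in four cycles, the ungated model is clocked on all four cycles while the gated model is clocked once, which is the kind of difference the RTL change aims for.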
The cache interface block consists of multiple such ACK FIFOs. A similar
modification for all these instances of the FIFO can translate to an activity and power performance
improvement accumulated over the number of instances. The Common Memory Controller (CMC)
consists of multiple interconnect blocks, each containing its own cache block. This leads to a multiplied
improvement of power performance across the chip from identifying a single such scope for
improvement. A PrimeTimePX analysis is performed to validate the significance of this modification
on power and related metrics. These results are analyzed and discussed in the later chapters.
Figure 3.14: FIFO RTL without Clock gating for comparison
Figure 3.15: FIFO RTL with Clock Gating
The procedure involved in analyzing the cache block in CMC using PrimeTimePX (PtPX, in flow
diagram) is visualized in Figure 3.16. The results are collected from the PrimeTimePX reports, and the
effect of the RTL changes on the netlist is also considered in the decision-making process. The
validation step using a sign-off power analysis tool such as PrimeTimePX forms the critical,
penultimate step of the power analysis flow before deciding on the validity of the RTL change
iteration.
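The comparison between the two netlist versions reduces to relative savings per power component. The sketch below assumes the component values have already been read from the two reports into dictionaries; the function names and the component keys are hypothetical, and no report parsing is attempted.

```python
def relative_saving_pct(baseline, modified):
    """Relative saving of the modified design over the baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - modified) / baseline


def compare_reports(baseline, modified):
    """Per-component relative savings between two power reports (dict inputs)."""
    return {key: relative_saving_pct(baseline[key], modified[key])
            for key in baseline}
```

A negative value for a component would indicate that the RTL change made that component worse, which is one of the cases the decision step has to weigh.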
Figure 3.16: Validation and signoff analysis of RTL improvements
Figure 3.17: Power analysis and Reduction Flow Summary
3.5 Summary – An end-to-end power analysis and reduction flow
The procedures discussed so far in this chapter form the steps of a comprehensive power analysis and
optimization flow developed in this Thesis. The flow starts with a focus on its two purposes –
characterization and optimization of the ASIC IP Block. Beginning with identifying and creating test
cases for the two purposes by connecting to the existing performance analysis framework, the flow
proceeds to different approaches to analysis, characterization, and optimization.
Figure 3.17 provides a summary of the tools and flow of methodology used in the development of
the Thesis. The flow representation shows the two front-end design stages of RTL and the Netlist of
the design and shows the succeeding steps in each case for the implementation of the discussed power
analysis and optimization flow. The flow covers the design tools, the simulation and verification tools
and the analysis tools involved in the flow developed. An ideal early power analysis starts from the
top-left corner of the flow and finishes at the bottom-right corner of the flow. All the tools represented
in the flowchart and explained in the previous sections provide different dimensions and
opportunities to analyze the block for power and improve it. Each tool offers its own
power perspective, while the complete sequence of the flow in the chart provides a more elaborate
power perspective of the designs under analysis. As shown in the figure, this power analysis flow
addresses the major goal of connecting the power analysis flow to the block’s verification framework
to generate and drive its stimuli. Since power analysis is primarily a stimuli-based analysis, this
connection to the verification framework is a critical step of the power analysis flow implementation.
It enables the creation of power test cases that form the basis for the successive steps of the flow.
The RTL design stage-based analysis shown in the flow brings the advantages of early power analysis.
Its earliness leads to two advantages. First, it has a larger impact
on power savings, as discussed in the previous sections. Second, it allows faster, iterative analyses
that fit in well with the other iterative verification
steps in the IP design flow. This flow representation in Figure 3.17 is validated using the power tools
for the Common Memory Controller (CMC) block at Ericsson. The goal of this thesis is to present this
flow as a methodical approach that can be implemented for any ASIC IP Block using power analysis
tools that can perform similar functions. This flow diagram can be understood to be representative of
a more generalized flow that can be developed independent of the ASIC IP Block on which the analysis
is done, or the tools used to perform these.
For developing power-efficient ASIC designs, a flow such as the one discussed here needs to be
implemented on all the hierarchical elements developed at the respective levels. The results of the
power analysis on CMC using the tools described in the flow are collected, visualized, and analyzed
for the two purposes for which the flow has been developed, namely the characterization and
optimization of CMC. The results include the data necessary to characterize a block design release as
part of the design flow. They also include data from the optimization and validation steps of the
flow, where an optimization made at RTL is verified for power savings using the netlist-based signoff
power analysis tool PrimeTimePX. The metrics collected for the validation are analyzed and a decision
is taken on whether the RTL modification is valid for power. These results, consolidated from
different parts of the flow, are discussed in the next section.
4 Results and analysis
The main goal of this Thesis has been to arrive at a power analysis and reduction flow for the Common
Memory Controller (CMC) block at Ericsson, using the power tools and the existing verification
framework, and then to validate the flow against its characterization and optimization goals.
The flow development and implementation discussed in the previous sections are the first result of
the Thesis. The subsequent results collected for the characterization and optimization of CMC are an
outcome of putting this flow to use and a means to validate it. This report presents the results by
focusing on the outcome and applicability of the power analysis flow rather than on detailed
information about the design blocks at Ericsson. The results are therefore given as relative metrics
that validate the power analysis and reduction flow, not as the absolute performance metric values
of CMC.
4.1 Characterization results for Common Memory Controller (CMC)
The characterization of the CMC is the first outcome of the power analysis flow developed in this
Thesis. This goal of the flow focuses on profiling blocks on the basis of their power metrics. The
characterization results are based on the methodology discussed in section 3.2.1. The CMC block is
simulated over a range of numbers of clients accessing memory, and power metrics are collected for
each setting of the test case knobs, which are discussed in section 3.1.1.
The characterization data correlates the number of clients, or load scenario (one of the power knobs
in the performance analysis framework), with the other power metrics discussed in earlier sections.
The data is collected from power analysis runs for different test scenarios, which simulate different
operational scenarios, and is visualized for comprehension. The first block-level visualization shows
how the average activity of the design varies with the number of clients per interconnect block. The
activity is split into average combinational activity, average register activity, and average
activity.
The CMC block is incrementally loaded to simulate the different operating points discussed in section
3.2.1, and technology-independent metrics are collected at each operating point. The system-level
operating points are reached using the knobs of the flow (discussed in section 3.1.1) in the
performance analysis framework of CMC. The metrics collected in this way contribute to a
‘power profile database’ that can be built up for all sub-blocks of the ASIC and thereby provide a
complete picture of the power performance of the ASIC. Table 1 presents the data that classifies the
different operating points for CMC and the corresponding metrics. Table 1 also records the
qualitative states of the test knobs to which the metric measurements correspond. Data such as that
in Table 1 captures the technology-independent metrics of the block, which, once collected, are easy
to port across the projects and products that use the block or an extended or reduced version of it.
Knob states: Stochastic access w/ Average access size = Medium size (~100B); Access intensity = Medium intensity (~70 clock cycles) between Start-of-transactions; Unrestricted Interconnect block access (uniform access); No. of accesses per port = Medium (~100)
S.no  Operating profile  Total ports  Avg. reg. activity (%)  Avg. activity (%)  Avg. combinational activity (%)  Avg. DCGE (%)
1     Low                5            0,09463                 0,2649             0,6759                           73,863
2     Low                10           0,1266                  0,3026             0,7634                           73,758
3     Typical            25           0,2229                  0,43               1,1                              73,448
4     Typical            50           0,3893                  0,6347             1,5                              72,926
5     High               100          0,7276                  1                  2,4                              71,913
6     High               150          1,1                     1,4                3,3                              70,92
7     Thermal            200          1,4                     1,7                4                                69,972
8     Thermal            250          1,7                     2,1                4,8                              69,019
Table 1: Operating point vs profiling metrics data for CMC
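As an illustration of what one entry set of the ‘power profile database’ could look like, the rows of Table 1 can be held in a small lookup structure. The Python sketch below is only an assumption about a possible layout: the class, field, and function names are hypothetical and not part of the flow at Ericsson. The values are those of Table 1, written with decimal points, and each operating profile is assumed to cover two consecutive rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """One row of Table 1 (technology-independent metrics, in percent)."""
    profile: str          # qualitative operating profile
    total_ports: int      # total number of ports accessing the block
    reg_activity: float   # average register activity
    avg_activity: float   # average activity over all nets
    comb_activity: float  # average combinational activity
    dcge: float           # average dynamic clock gating efficiency


# Values copied from Table 1; the container name is hypothetical.
CMC_PROFILE = [
    OperatingPoint("Low", 5, 0.09463, 0.2649, 0.6759, 73.863),
    OperatingPoint("Low", 10, 0.1266, 0.3026, 0.7634, 73.758),
    OperatingPoint("Typical", 25, 0.2229, 0.43, 1.1, 73.448),
    OperatingPoint("Typical", 50, 0.3893, 0.6347, 1.5, 72.926),
    OperatingPoint("High", 100, 0.7276, 1.0, 2.4, 71.913),
    OperatingPoint("High", 150, 1.1, 1.4, 3.3, 70.92),
    OperatingPoint("Thermal", 200, 1.4, 1.7, 4.0, 69.972),
    OperatingPoint("Thermal", 250, 1.7, 2.1, 4.8, 69.019),
]


def lookup(points, total_ports):
    """Return the operating point recorded for a given total port count."""
    for point in points:
        if point.total_ports == total_ports:
            return point
    raise KeyError(total_ports)


def by_profile(points, profile):
    """Return all operating points that belong to one operating profile."""
    return [point for point in points if point.profile == profile]
```

A consumer of such a database, for example an ESL spreadsheet export, would call `lookup(CMC_PROFILE, 100)` to fetch the metrics of one operating point or `by_profile(CMC_PROFILE, "Typical")` to fetch a whole profile.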
Figure 4.1: Correlation between activity and loading condition of the block
[Plot: Average Activity (Percentage) vs Number of clients access. X-axis: No. of ports access per Interconnect block (1 to 50). Y-axis: Avg. Activity (Percentage), 0 to 6. Series: Combinational Activity, Average Activity, Average Register Activity.]
Figure 4.1 shows all the activity metrics in one plot. The x-axis is the number of port/client
accesses driven from the verification framework. The plot shows activity increasing with the number
of port accesses. This trend is expected, as a higher number of client/port accesses means higher
utilization and therefore higher activity.
Activity is defined as the average percentage of nets that transition at a given clock edge, taken
over combinational nets (combinational activity), register clock and/or data nets (register
activity), or all nets (average activity). By this definition, the results indicate that a higher
average percentage of combinational nets is active at clock transitions than of the other nets in
the CMC design. The increasing trend is common to all three activity components (combinational nets,
registers, and overall).
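The definition of activity used above can be written down as a short calculation. The Python sketch below is a minimal illustration under stated assumptions: the function name and the toy toggle counts are hypothetical, and it is not the implementation used by the in-house tools.

```python
def average_activity(toggles_per_edge, net_count):
    """Average percentage of nets in a group that transition per clock edge.

    toggles_per_edge: for each clock edge, the number of nets in the group
                      (combinational, register, or all) that transitioned.
    net_count:        total number of nets in that group.
    """
    if net_count <= 0 or not toggles_per_edge:
        raise ValueError("need at least one net and one clock edge")
    fractions = [toggles / net_count for toggles in toggles_per_edge]
    return 100.0 * sum(fractions) / len(fractions)


# Made-up toggle counts over four clock edges, NOT CMC measurements.
comb_activity = average_activity([3, 0, 1, 0], net_count=200)
reg_activity = average_activity([1, 0, 0, 1], net_count=400)
overall_activity = average_activity([4, 0, 1, 1], net_count=600)
```

With these toy numbers the combinational group has the highest activity, which mirrors the ordering of the three curves in Figure 4.1.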
The metric that follows activity analysis is the clock gating efficiency of a design. The variation of
clock gating efficiency with the number of ports/clients is plotted as in Figure 4.2. The plot shows the
trend of decreasing average dynamic clock gating efficiency of CMC as the number of port/client
accesses increases. The values of the performance framework knobs are as specified in Table 1. The
knobs try to qualitatively capture the typical operating scenario of the ASIC.
Figure 4.2: Variation of the average clock gating efficiency with increased loading of the block
[Plot: Avg. Dynamic CGE. X-axis: Number of ports (5 to 250). Y-axis: DCGE % (66 to 75). Series: Avg. DCGE.]
The decreasing trend of dynamic clock gating efficiency with the increasing number of clients is
anticipated, as higher activity leaves fewer clock transitions that can be disabled. This follows from
the correlation that higher load leads to higher activity, so fewer registers can be gated while
still meeting the functionality at that activity level. The interesting point is that the average
dynamic clock gating efficiency at low port utilization is lower than would be expected if the
relationship between block loading and clock gating efficiency were linear. For an ideal design, the
curve can be extrapolated backward linearly. A larger deviation of the DCGE curve from this linear
extrapolation in low-activity scenarios therefore points to larger inefficiencies in the design.
These inefficiencies show up in low-activity scenarios, where switching is not expected but redundant
switching continues to exist and clock gating is not as efficient as it could be. Activity and clock
gating efficiency are technology-independent metrics, so their variation with the number of clients
helps to extrapolate design performance once technology details are included, using an
Excel-database-based electronic system-level (ESL) power analysis. ESL power analysis is performed on
a database of such metrics collected across the blocks in an ASIC. It forms the basis for a more
realistic ASIC-level power analysis that is relevant to specific projects and can be scaled to design
needs and to technology-specific details from vendors.
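One way to make the comparison against a linear trend concrete is to fit a straight line through the more heavily loaded operating points and extrapolate it back to the lightly loaded ones. The Python sketch below does this with the DCGE values of Table 1. It is only one possible reading of the linear reference: the choice to fit on the points with 100 ports or more, and all function names, are assumptions made here for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# (total ports, average DCGE in %) taken from Table 1.
DCGE = [(5, 73.863), (10, 73.758), (25, 73.448), (50, 72.926),
        (100, 71.913), (150, 70.92), (200, 69.972), (250, 69.019)]


def low_load_deviation(points, fit_from_ports=100):
    """Fit on the loaded points and return measured minus extrapolated DCGE
    (in percentage points) for every lighter operating point.

    fit_from_ports is an assumed threshold, not a value from the thesis.
    """
    high = [(x, y) for x, y in points if x >= fit_from_ports]
    slope, intercept = fit_line([x for x, _ in high], [y for _, y in high])
    return {x: y - (slope * x + intercept)
            for x, y in points if x < fit_from_ports}


deviation = low_load_deviation(DCGE)
```

Each entry of `deviation` is the gap between the measured low-load DCGE and the straight-line extrapolation; a negative value means the measured efficiency lies below the line at that load. A different reference line would give different gaps.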
Once the technology-independent metrics, such as the activity of the design and the clock gating
efficiency, are visualized and baselined for a design, the next important characterization
requirement is the block-level power consumption. Power characterization studies the variation of
switching power, internal power, total dynamic power, leakage power, and total power with the
incremental loading conditions in the typical working scenarios of the block. Figure 4.3 plots the
variation of average total power, average switching power, average internal power, and total dynamic
power. These power metrics are represented as a factor of the switching power corresponding to one
port access. This relates the values to each other so that the trend can be understood without going
into the numerical details of the power consumption of the block. Such a relationship helps SoC
architects predict the performance of the block under the various loading scenarios, and thereby
supports thermal planning and other aspects of ASIC IP design and scaling for future products.
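Expressing every power component as a factor of one reference value can be sketched in a few lines of Python. The example assumes the usual decomposition in which dynamic power is the sum of switching and internal power, and total power adds leakage. The function names and all numbers are hypothetical and are not CMC measurements.

```python
def with_totals(components):
    """Add derived dynamic and total power to one operating point.

    components holds "switching", "internal" and "leakage" in a common unit.
    """
    dynamic = components["switching"] + components["internal"]
    return dict(components, dynamic=dynamic,
                total=dynamic + components["leakage"])


def power_factors(raw, ref_ports, ref_key="switching"):
    """Divide every component at every operating point by one reference
    value, so that only relative trends remain."""
    reference = with_totals(raw[ref_ports])[ref_key]
    return {ports: {name: value / reference
                    for name, value in with_totals(components).items()}
            for ports, components in raw.items()}


# Made-up numbers in arbitrary units, NOT CMC measurements.
RAW = {
    5: {"switching": 2.0, "internal": 3.0, "leakage": 1.0},
    50: {"switching": 4.0, "internal": 5.0, "leakage": 1.0},
}
factors = power_factors(RAW, ref_ports=5)
```

Here `factors[5]["switching"]` is 1 by construction, and every other entry states how many times larger that component is than the reference. Passing a different `ref_key`, such as `"dynamic"`, changes the reference without changing the shape of the curves.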
Figure 4.3: Power split-up relationship with respect to incremental loading of block
[Plot: Power as a factor of 5 port dynamic power vs Number of clients access. X-axis: No. of ports access (5 to 250). Y-axis: Power as a factor of Dynamic Power for 5 port access (0 to 5). Series: Switching Power, Internal Power, Total Dynamic Power, Leakage Power, Average Total Power.]
The above characterization plots provide an overview of the power, activity, and clock gating
efficiency numbers of the block, which together determine the power performance of the design. Every
design needs to have such an analysis performed and the data baselined and stored in a database to be
used together with other block details. Such a database supports electronic system-level power
analysis, which in turn supports chip-level decisions and predictions. This information helps with
the peripheral, thermal-design-related decisions for future designs of the block, which are very
important. It also lays down the scope for maintenance, optimization, and enhancement of the design
block and acts as a starting point for these activities. These are some of the results that the power
analysis flow aims to have collected and visualized for every ASIC IP block under analysis. The goal
of the flow is to make block characterization an integral outcome of the IP design flow.
4.2 Power analysis results for optimization
The second objective of the power analysis, the improvement of the design for power, is
addressed using the power tools, analysis metrics, and the power analysis flow developed in this
Thesis. The results are based on the Spyglass analysis discussed in section 3.2.2.
As a summary of section 3.2.2, Figure 3.11 depicts the optimization flow developed in the thesis
using Spyglass. It is a methodology of iterative analysis towards achieving certain power metric
goals. Modifications to the RTL design are made based on the problems identified against the metric
goals described in the flow. An analysis of this type generates a detailed worksheet with issues
identified down to the register level.
One of the results of this analysis is a ‘Power optimization worksheet’ that addresses each
hierarchical block in the Common Memory Controller (CMC), looks at its metrics, namely static clock
gating efficiency (SCGE), dynamic clock gating efficiency (DCGE), and Register output density for
Enables (ROADE), and points to two major kinds of dynamic power improvement.
• Insufficient clock gating in design – Points to scope for having more Integrated Clock
Gating Cells instantiated in the design.
• Inefficient clock gating in design – Points to scope for improving the clock gating (CG)
enable where activity can feasibly be reduced further by disabling design regions
for more clock cycles, thereby reaching the optimum switching for the designed
functionality.
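The way such a worksheet could flag blocks against metric goals can be sketched as follows. This Python example is a simplified assumption rather than the Spyglass-based worksheet itself: it maps a low SCGE to the first pointer and a low DCGE to the second, leaves ROADE out, and uses hypothetical goal values and block names.

```python
from dataclasses import dataclass


@dataclass
class BlockMetrics:
    name: str
    scge: float  # static clock gating efficiency, in percent
    dcge: float  # dynamic clock gating efficiency, in percent


# Hypothetical goals; real targets are project-specific.
SCGE_GOAL = 90.0
DCGE_GOAL = 70.0


def pointers(block, scge_goal=SCGE_GOAL, dcge_goal=DCGE_GOAL):
    """Return the improvement pointers that apply to one hierarchical block."""
    found = []
    if block.scge < scge_goal:
        # Too few registers sit behind a clock gate.
        found.append("insufficient clock gating")
    if block.dcge < dcge_goal:
        # Gated clocks are not switched off for enough cycles.
        found.append("inefficient clock gating")
    return found


def worksheet(blocks):
    """Map each block that misses a goal to its list of pointers."""
    rows = {}
    for block in blocks:
        found = pointers(block)
        if found:
            rows[block.name] = found
    return rows
```

In practice the same check would be repeated down the hierarchy until the registers responsible for a missed goal are identified.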
The CMC was studied based on these metrics, and a power optimization worksheet was developed that
points designers to the possible focus areas for power savings. It lists the hierarchy-wise metrics
and pointers to the two optimization problems above. Each problematic hierarchical element is
examined in detail until register-level problems are identified. The design team then evaluates
these findings and decides whether to take up or waive each identified issue based on the criticality
and feasibility of the optimization. Such a discussion was held with the CMC design team at Ericsson.
Once optimization pointers are collected, the resulting modifications to the design are validated
using the signoff power analysis tool, PrimeTimePX. Section 3.4 discusses how one such opportunity to
optimize dynamic power was identified in the ACK FIFOs in the cache interface blocks. Power
inefficiency pointers were identified for the cache block in CMC, and the RTL design of the cache
block was modified to introduce better clock gating enable conditions, as discussed there. Power
analysis is then performed as a validation step. In this section, the netlist-level power
measurement of the modified RTL is compared with that of the original RTL for the cache block in CMC.
A gist of the results is presented in Table 2. The analysis focuses on the Cache interface block,
which contains the ACK FIFOs that are modified for this validation effort (discussed in section 3.4).
Comparison metric        Metric value after RTL change w.r.t. original RTL
Average CGE (%)          Same
Number of clock gates    29% lower
Net switching power      13% lower
Cell internal power      8% lower
Total dynamic power      10% lower
Total power              8% lower
Active area (um2)        2% lower
Table 2: Metric comparison between the modified and original RTL at Cache level
Table 2 lists the power and clock gating metrics used to compare the design modified for power with
the original design, as a validation step. Each entry is the percentage difference of the modified
design from the original design. The metrics are captured at the cache level of the design, where
the cache block contains multiple instances of the ACK FIFO module on which the modification was
performed.
Although the major goal of the modification was to enhance the clock gating efficiency of the Cache
interface block by improving the ACK FIFO designs, the clock gating efficiency is similar before and
after the optimization effort. This can be interpreted as follows. The clock gating introduced at RTL
achieves no improvement in efficiency over the netlist synthesized without an explicit clock gating
enable at RTL. After some exploration, it was understood that this is because the synthesis tool
infers possible gating scenarios even though clock gating is not explicitly defined in the initial
RTL, and implements clock gating with a self-generated enable condition. Although the activity
analysis tool VCD2RPT++ identifies the problem in RTL and visualizes it as in Figure 3.13, the
synthesis tool instantiates the clock gates by itself at the synthesis stage. This explains the
similar clock gating efficiency in both versions of the RTL (before and after the optimization
effort).
Table 3 presents the metrics at the ACK_FIFO level, comparing the modified RTL with the original.
The comparison serves as a validation step for the RTL modification, and the results show a clear
improvement in the power metrics presented.
Comparison metric                Metric value after RTL change w.r.t. original RTL
Internal power                   46% lower
Total power                      36% lower
Number of clock gates per FIFO   62% lower
Table 3: Metric comparison between the modified and original RTL for ACK_FIFOs
Table 2 and Table 3 show that, although the clock gating efficiency is unchanged, the modified
design uses around 29% fewer clock gates at the cache level. The RTL modification reduces the net
switching power of the cache by around 13%, the cell internal power by about 8%, the total dynamic
power by around 10%, and the total power by around 8%. The area of the cache block estimated by the
synthesis tool for the netlist created with the RTL change is 2% lower. These are the effects on the
cache subsystem of the RTL change to the ACK FIFOs inside the cache.
At the ACK FIFO level, where the modification was made, the internal power consumption of the ACK
FIFOs is reduced by around 46%, which is a significant improvement. The total power consumed by the
ACK FIFOs is reduced by around 36%, which validates the optimization effort. The number of clock
gates in the ACK FIFO has decreased by around 62%, even though the clock gating efficiency remains
the same. This indicates that the explicit clock gating modification makes the clock gating logic
significantly more efficient and lets the synthesis tool avoid placing unnecessary clock gates across
the design. These improvements are significant because they accumulate over the multiple
instantiations of the FIFOs used in the cache block.
To summarize, although the clock gating efficiency is similar in the two versions, the
implementation of the gating cells and enable conditions differs between the two approaches. The
analysis shows that the modified implementation needs fewer clock gates, consumes lower dynamic and
total power, and occupies less area. This indicates that the clock gates introduced manually through
the modified RTL are better for implementation efficiency, power, and area. The modified RTL approach
was therefore validated and chosen, even though the clock gating efficiency did not show the large
improvement that was initially expected.
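The validation decision summarized above can be expressed as a small check on relative metrics. The Python sketch below uses the cache-level percentages of Table 2. The acceptance rule, the tolerance on clock gating efficiency, and all names are assumptions made for illustration, not the criteria applied by the design team.

```python
def percent_change(original, modified):
    """Signed change of the modified value relative to the original, in %."""
    if original == 0:
        raise ValueError("original value must be non-zero")
    return 100.0 * (modified - original) / original


# Cache-level changes from Table 2 (negative means lower after the RTL change).
TABLE_2 = {
    "avg_cge": 0.0,
    "clock_gates": -29.0,
    "net_switching_power": -13.0,
    "cell_internal_power": -8.0,
    "total_dynamic_power": -10.0,
    "total_power": -8.0,
    "active_area": -2.0,
}


def accept_rtl_change(changes, cge_tolerance=0.5):
    """Hypothetical rule: accept when dynamic and total power both drop,
    area does not grow, and CGE does not degrade beyond the tolerance."""
    power_ok = (changes["total_power"] < 0
                and changes["total_dynamic_power"] < 0)
    area_ok = changes["active_area"] <= 0
    cge_ok = changes["avg_cge"] >= -cge_tolerance
    return power_ok and area_ok and cge_ok
```

With the Table 2 values the rule accepts the change, which matches the decision described above even though the clock gating efficiency itself did not improve. `percent_change` shows how each table entry would be derived from a pair of absolute measurements.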
Looking at the whole analysis of the cache block in CMC using the power analysis flow in the Thesis,
we can draw some conclusions about improving the design for power performance. The modification
notably improves several aspects of the design and reduces power consumption by significant
percentages. Applying the power analysis flow and the resulting optimizations to other ASIC IP Blocks
can likewise lead to significant power savings.
Figure 4.4: Power bug location in a block using Power analysis Flow
The Common Memory Controller (CMC) is an ASIC IP block made up of millions of lines of code.
The methodologies and techniques that constitute the power analysis flow formulated in this Thesis
help identify the parts or lines of the code that cause power bugs and localize problem zones, as
depicted in Figure 4.4. This ability of the flow to “find the needle in a haystack” is highly
advantageous. The cache subsystem example in section 3.4 identified an issue in CMC and localized the
problem to the cache subsystem. This narrowed scope improves the designer’s ability to identify and
solve the issue. It was demonstrated that manually instantiating a clock gate in the RTL of the ACK
FIFO in the cache improves the dynamic power of the cache subsystem by around 10%. The flow helped
narrow the search down to this level to achieve the savings.
The design modifications are validated using netlist power analysis, which provides the savings
numbers. This is an example of the type of analysis that is important in the later stages of the
power analysis flow developed in the Thesis. This validation of the optimizations helps finalize the
design changes before they are baselined for release.
5 Conclusion and Future
FinFETs, today’s leading-edge transistors, are so far delivering on the promise of scalability,
performance, and power, but the road ahead is bumpy and filled with a slew of technical and cost
challenges. The free power reduction that came with each new technology node is no longer available,
and 7 nm is a relatively long-lived node. Designers can no longer rely on this automatic power
scaling. This necessitates a power analysis flow such as the one developed in this Thesis, which
helps characterize an ASIC IP Block for its power consumption and efficiency metrics and helps with
dynamic power optimization, the major focus area of power improvement in FinFET-based designs.
This Thesis fulfills the goal of realizing a custom-tailored power analysis and reduction flow for a
memory controller subsystem at Ericsson that connects the existing verification environment to the
low-level power analysis tools. This is enabled by extracting power metrics with in-house and
commercial front-end power tools, leading to a top-level and sub-block-level analysis that
facilitates quick power exploration and profiling. The eventual goal of the Thesis is optimization by
pinpointing the largest and the redundant power consumers in the subsystem under the right workload
scenarios. These pointers are shortlisted into a list of prime candidates and communicated to the
design team, and one case is taken up to realize RTL changes and demonstrate the achievable power and
area savings.
The Thesis starts with understanding the concepts of power dissipation in ASICs and formalizing them
using metrics. Then the Common Memory Controller (CMC), the ASIC IP Block under analysis, is studied
and its usage scenarios are characterized. The verification framework is then understood well enough
to generate knobs and tweak them to create power test cases. With the understanding that power
analysis is only relevant for an ASIC IP Block in relation to its use case, power analysis is
performed by generating power-related use cases for two purposes: characterizing the block and
uncovering power bugs in the block. The understanding of the CMC, the test cases, and the power
metrics is then used to methodically develop a power analysis and reduction framework at different
stages of the design using different tools. The Thesis focuses on two stages of power analysis. The
first is the early analysis at the RTL level of the design, which maximizes power savings and impact
on the design by lowering the chances of bugs being found at later stages, where they are costlier
to fix. The second takes place at the netlist level, where the focus is on the accuracy of the
measurement relative to real consumption and on the validation of design modifications. At every
stage of the Thesis, the architects and designers have been kept in the loop on the findings to
ensure that the Thesis improves the knowledge and quality of the design in terms of power. The whole
analysis is performed using the designs, verification environments, and tool features and licenses
at Ericsson. All parts of the Thesis involved learning these environments and facilities and using
them correctly.
Although the power analysis and reduction flow developed in the Thesis focuses on a specific design
block at Ericsson and a specific set of tools, it is intended to be generalizable to other ASIC IP
Blocks and other tools within or outside the organization. The focus is on the implementation
methodology, the metrics that support the flow, and the physics behind the reduction of dynamic
power. Although the results, in terms of metrics and power numbers specific to the block, form an
important part of validating the flow, the goal is to present the power analysis and reduction flow
as a plug-in iterative analysis step that fits into the IP design flow at ASIC design centers in
companies and universities. Such a flow yields power and area savings that accumulate into large
cost and environmental benefits as ASICs, SoCs, and other forms of application-specific hardware
become ubiquitous with the emergence of the Internet of Things, edge computing, and other innovative
ideas. This thesis aims to motivate readers to custom-tailor a similar flow for any ASIC IP Block
using any available toolkit.
A comprehensive power analysis and reduction methodology, such as the one implemented in this
thesis, is a helpful tool for understanding and improving the power behavior of an IP block from the
early steps of the design. As future work, it would be highly beneficial to be able to predict power
and energy, and to optimize them, at the algorithmic level. This is where the adoption of a
structured design flow, as advocated by Silago [22], becomes an interesting approach. With it,
discretely built-up power and energy prediction models become more reliable and accurate. This opens
interesting avenues towards predictable IP designs with high tape-out power estimation accuracy from
an early algorithmic level.
References
[1] J. D. Meindl, ‘A history of low power electronics: how it began and where it’s headed’, in Proceedings of 1997 International Symposium on Low Power Electronics and Design, 1997, pp. 149–151. DOI: 10.1109/LPE.1997.621267
[2] J. D. Meindl, ‘Low power microelectronics: retrospect and prospect’, Proc. IEEE, vol. 83, no. 4, pp. 619–635, Apr. 1995. DOI: 10.1109/5.371970
[3] Michael Keating, Ed., Low power methodology manual: for system-on-chip design. New York, NY: Springer, 2007, ISBN: 978-0-387-71818-7.
[4] ‘Semiconductor Engineering - The 7nm Pileup’. [Online]. Available: https://semiengineering.com/the-7nm-pileup/. [Accessed: 08-Sep-2019]
[5] Anantha P Chandrakasan and Robert W Brodersen, Low Power Digital CMOS Design. Boston, MA: Springer US, 1995, ISBN: 978-1-4615-2325-3 [Online]. Available: https://public.ebookcentral.proquest.com/choice/publicfullrecord.aspx?p=3080665. [Accessed: 08-Sep-2019]
[6] ‘A Review Paper on CMOS, SOI and FinFET Technology’. [Online]. Available: https://www.design-reuse.com/articles/41330/cmos-soi-finfet-technology-review-paper.html. [Accessed: 08-Sep-2019]
[7] ‘Semiconductor Engineering - Designing For Ultra-Low-Power IoT Devices’. [Online]. Available: https://semiengineering.com/designing-for-ultra-low-power-iot-devices/. [Accessed: 08-Sep-2019]
[8] ‘Spyglass Power: Complete Solution for Power Optimizations At RTL | Register Form’. [Online]. Available: https://www.synopsys.com/cgi-bin/verification/dsdla/pdfr1.cgi?file=spyglass-power-ds.pdf. [Accessed: 08-Sep-2019]
[9] ‘Static Timing Analysis - PrimeTime’. [Online]. Available: https://www.synopsys.com/implementation-and-signoff/signoff/primetime.html. [Accessed: 08-Sep-2019]
[10] ‘Semiconductor Engineering - Why Chips Die’. [Online]. Available: https://semiengineering.com/why-chips-die/. [Accessed: 08-Sep-2019]
[11] Sanjay Churiwala and Sapan Garga, Principles of VLSI RTL design: a practical guide. New York: Springer, 2011, ISBN: 978-1-4419-9295-6.
[12] Amara Amara, Frédéric Amiel, and Thomas Ea, ‘FPGA vs. ASIC for low power applications’, Microelectron. J., vol. 37, no. 8, pp. 669–677, Aug. 2006. DOI: 10.1016/j.mejo.2005.11.003
[13] H. J. M. Veendrick, ‘Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits’, IEEE J. Solid-State Circuits, vol. 19, no. 4, pp. 468–473, Aug. 1984. DOI: 10.1109/JSSC.1984.1052168
[14] Ioannis Savvidis, Power Savings in MPSoC. 2009 [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-147365. [Accessed: 14-Oct-2019]
[15] ‘Power analysis of clock gating at RTL’, Design And Reuse. [Online]. Available: https://www.design-reuse.com/articles/23701/power-analysis-clock-gating-rtl.html. [Accessed: 20-Sep-2019]
[16] Himanshu Bhatnagar, Ed., ‘Asic Design Methodology’, in Advanced ASIC Chip Synthesis Using Synopsys® Design CompilerTM Physical CompilerTM and PrimeTime®, Boston, MA: Springer US, 2002, pp. 1–17 [Online]. DOI: 10.1007/0-306-47507-3_1
[17] ‘Differential Energy Analysis to Optimize Mobile GPU Power’, p. 6.
[18] ‘WebPlotDigitizer - Extract data from plots, images, and maps’. [Online]. Available: https://automeris.io/WebPlotDigitizer/index.html. [Accessed: 25-Sep-2019]
[19] Spyglass Power User guide from Synopsys, Internal document.
[20] PrimeTimePX user guide from Synopsys, Internal document.
[21] Dump, convert and replay: A Targeted Methodology to mitigating gate-level power simulations effort, Ioannis Savvidis, Ericsson AB, Presented at DAC 2019. Internal Document.
[22] S. M. A. H. Jafri, N. Farahini, and A. Hemani, ‘Silago-cog: Coarse-grained grid-based design for near tape-out power estimation accuracy at high level’, in 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2017, pp. 25–31.
TRITA-EECS-EX-2020:458
www.kth.se