Identifying Program Power Phase Behavior Using Power Vectors

Identifying Program Power Phase Behavior Using Power VectorsCanturk Isci & Margaret MartonosiWWC-610.27.2003Austin, TX

Power Phase BehaviorExistence of distinguishable intervals during an applications execution lifetime such that:They share significantly higher resemblance within themselves in terms of power behavior the application exhibits on a given processorThis similarity is carried out by not only the total processor power, but also the distribution of power into processor sub-units

Our Power Phase AnalysisGoal: Identify phases in program power behavior

Determine execution points that correspond to these phases

Define small set of power signatures that represent overall power behavior

Our ApproachOur Approach Outline:

Collect samples of estimated power values for processor sub-units at application runtime

Define a power vector similarity metric

Group sampled program execution into phases

Determine execution points and representative signature vectors for each phase group

Analyze the accuracy of our approximation

MotivationCharacterizing power behavior: Future power-aware architectures and applications Dynamic power/thermal management Architecture research Utilizing power vectors: Direct relation to actual processor power consumption Acquired at runtime Identify program phases with no knowledge of application

Generating Power VectorsPOWERCLIENTPOWERSERVERVoltage readings via RS232 to logging machine Convert voltage to measured power Convert access rates to component powers1mV/Adc conversionCounter based access rates over ethernet22 Entries of each power vector sample

Power Vector Similarity MetricHow to quantify the power behavior dissimilarity between two execution points?Consider solely total power difference Consider manhattan distance between the corresponding 2 vectors Consider manhattan distance between the corresponding 2 vectors normalized Consider a combination of (2) & (3) Construct a similarity matrix to represent similarity among all pairs of execution pointsEach entry in the similarity matrix:

Gcc & Gzip Matrix Plots(Similar to SimPoints work of Sherwood et al.)

Gcc & Gzip Matrix Plots

Grouping Execution PointsThresholding Algorithm:Define a threshold of similarity < % of max dissimilarity> Start from first execution point (0,0) and identify ones in the fwd execution path that lie within threshold for both normalized and absolute metrics Tag the corresponding execution points (j,j) as the same group

Find next untagged execution point (r,r) and do the same along forward path Rule: A tagged execution point cannot add new elements to its group!We demonstrate the outcome of thresholding with Grouping Matrices

Representative Vectors & Execution PointsWe have each execution point assigned to a groupFor Each Group:

For Each Execution Point:

We can represent whole execution with as many power vectors as the number of generated groups

Define a representative vector as the average of all instances of that group Select the execution point that started the group (The earliest point in each group)

Assign the corresponding groups representative vector as that points power vector Assign the power vector of the selected execution point for that group as that points power vector

Reconstructing Power Trace with Representative Vectors:

Reconstructing Power Trace with Selected Execution Points:

Approximation ErrorDue to thresholding algorithm Errors for selected exec. points are bounded with the thresholdMax Error: 4.71W & RMS Error: 3.08WAs representative vectors are group centroids Cumulative errors for repr. vectors are lowerMax Error: 7.10W & RMS Error: 2.31WError in total power< (Component errors)

Conclusion

Presented a power oriented methodology to identify program phases that uses power vectors generated during program runtimeProvided a similarity metric to quantify power behavior similarity of different execution samplesDemonstrated our representative sampling technique to characterize program power behavior

Can be useful for power & characterization research:Power Phase identification/predictionReduced power simulation Dynamic power/thermal management

EXTRAS IF TIME PERMITS:

Related WorkDhodapkar and Smith [ISCA02] Working set signatures to detect phase changesSherwood et. al. [PACT01,ASPLOS02,ISCA30]Similarity analysis based on program basic block profiles to identify phasesTodi [WWC01]Clustering based on counter information to identify similar behavior

Our work in comparisonPower orientedPower behavior similarity metricRuntimeNo information about the application is requiredBounded approximation error with thresholding

Gzip Grouping MatricesGzip has 974 power vectors Cluster vectors based on similarity using thresholdingMax Gzip power dissimilarity: 47.35W

Generated Group Distributions

Component Power Characterizations

EXTRA SLIDESPower Vector ComponentsMore detail on Power Vectors Different Similarity MetricsSimilarity matrices and equations for all discussed techniques Similarity & Grouping MatricesExemplified description of the two matrices and plots Current & Future ResearchDiscussion of the ongoing and future researchIncludes also some new ideas and some things to do to make our current analysis solid Questions & RebuttalsStarts with the discussion of presented workDiscusses some shortcomings, things that need to be done to improve and to verify that it is unique and solid(Some parts of current work also discusses these issues)Also provides some answers to reviewers questionsIncludes some possible new ideas

Defining Components

P4 Architecture vs LayoutComponents to Model:Bus ControlL2 Cache2nd Level BPUITLB & IfetchL1 CacheMOBMem ControlDTLBInt EXEFP EXEInt RFFP RFDecodeTrace $1st Level BPUMicrocode ROMAllocationRenameInst-n QsScheduleInst-n QsRetirement Back

Defining Events Access RatesWe determined 24 events to approximate access rates for 22 componentsUsed Several Heuristics to represent each access rate

Examples:

Need to rotate counters 4 times to collect all event dataUsed 15 counters & 4 rotations to collect all event data

Access Rates Component PowersPerformance Counter based Access Rate estimations are used as proxy for max component power weighting together with microarchitectural details in order to estimate processor sub-unit powers

EX: Trace cache delivers 3 uops/cycle in deliver mode and 1 uop/cycle in build mode:Power(TC)=[Access-Rate(TC)/3 + Access-Rate(ID)] x MaxPower(TC) + Non-gated TC CLK powerTotal power is computed as the sum of all 22 component powers + measured idle power (8W):

Counter Access Heuristics1) BUS CONTROL:No 3rd Level cache BSQ allocations ~ IOQ allocationsMetric1: Bus accesses from all agentsEvent: IOQ_allocationCounts various types of bus transactionsShould account for BSQ as wellaccess based rather than durationMASK:Default req. type, all read (128B) and write (64B) types, include OWN,OTHER and PREFETCHMetric2: Bus Utilization(The % of time Bus is utilized)Event: FSB_data_activityCounts DataReaDY and DataBuSY events on BusMask: Count when processor or other agents drive/read/reserve the busExpression: FSB_data_activity x BusRatio / Clocks ElapsedTo account for clock ratios

Counter Access Heuristics2) L2 Cache:Metric: 2nd Level cache referencesEvent: BSQ_cache_referenceCounts cache ref-s as seen by bus unitMASK:All MESI read misses (LD & RFO)2nd level WR misses3) 2nd Level BPU:Metric 1: Instructions fetched from L2 (predict)Event: ITLB_ReferenceCounts ITLB translationsMask:All hits, misses & UC hitsMetric 2: Branches retired (history update)Event: branch_retiredCounts branches retiredMask:Count all Taken/NT/Predicted/MissP

Counter Access Heuristics4) ITLB & I-Fetch:etc10) FP Execution:Metric: FP instructions executedevent1: packed_SP_uopcounts packed single precision uopsevent2: packed_DP_uopcounts packed single precision uopsevent3: scalar_SP_uopcounts scalar double precision uopsevent4: scalar_DP_uopcounts scalar double precision uopsevent5: 64bit_MMX_uopcounts MMX uops with 64bit SIMD operandsevent6: 128bit_MMX_uopcounts integer SSE2 uops with 128bit SIMD operandsevent7: x87_FP_UOPcounts x87 FP uopsevent8: x87_SIMD_moves_uopcounts x87, FP, MMX, SSE, SSE2 ld/st/mov uops

Back

Similarity Based on Total Power

Similarity Based on Absolute Power Vectors

Similarity Based on Normalized Power Vectors

Similarity Based on Both Absolute and Normalized Power Vectors Back

Similarity Matrix ExampleConsider 4 vectors, each with 4 dimensions:

Log all distances in the similarity matrixColor-scale from black to white (only for upper diagonal)

0637603633077670

0637603633077670

Interpreting Similarity Matrix Plot Back

Grouping Matrix ExampleConsider same 4 vectors:Mark execution pairs with distance Threshold Back

0637603633077670

0637603633077670

0637603633077670

0637603633077670

Current & Future ResearchFOLLOWING SLIDES DISCUSS ONGOING RESEARCH RELATED TO POWER PHASES. PLANS FOR FUTURE RESEARCH ARE ALSO DISCUSSED

THE BIG PICTURETo Estimate component power & temperature breakdowns for P4 at runtimeTo analyze how power phase behavior relates to program structure Bottom line

Phase Branch

Power Phase BehaviorSimilarity Based on Power VectorsIdentifying similar program regions

Profiling Execution FlowSampling process executionPCsampler LKM

Program StructureExecution vs. Code spacePower Phases Exec. PhasesNOT YET?

POWER PHASE BEHAVIOR



Program StructureExecution vs. Code spacePower Phases Exec. PhasesNOT YET

Identifying Power PhasesMost of the methodology is readyComplete Gzip case in WWCExtensibility to other benchmarksGenerated similarity metrics for severalPerforming phase identification with thresholdingRepeatibilty of the experimentSeveral other possible ideas such as:Thresholding + k-means clusteringTwo-pass thresholdingPCA for dimension reduction (or SVD?)Manhattan = L1 normEuclidian (L2) not interestingChebyschev (Linf) - ??

DiscussionFrom here onGenerality of the techniqueAll Spec benchmarks show distinct phase behavior:Repeatability of the experimentNeed to be able to arrive at similar phase behavior in order to characterize an applicationCorrelation between vector componentsInherent redundancy in power vectorsCould be removed with PCAAlternative norms for similarityApplicability of selected execution points

Program Execution Profile



Program StructureExecution vs. Code spacePower Phases Exec. PhasesNOT YET

Program Execution ProfileSample program flow simultaneously with powerOur LKM implementation: PCsamplerNot FinishedGenerate code space similarity in parallel with power space similarityRelative comparisons of methods for:ComplexityAccuracyApplicability, etc.

CURRENT STATESample PC Binding to functionsReacquire PIDThose SPECs, Runspec: always in fixed address at ELF_program interpreterBenches: change pid between datasetsVerify PC with objdumpSo we can make sure it is the PC were sampling

Initial Data: PC?? Trace For gzip-sourceCorrespond to: functions Back

DiscussionFrom here onGenerality of the techniqueAll Spec benchmarks show distinct phase behavior:Repeatability of the experimentNeed to be able to arrive at similar phase behavior in order to characterize an applicationCorrelation between vector componentsInherent redundancy in power vectorsCould be removed with PCAAlternative norms for similarityApplicability of selected execution points

Similarity Analysis for Other Applications?SPECs show similar applicable behaviorNot always phase-like, i.e. twolf has more like a power gradientResults for other benchmarks: Gcc & Twolf:# of groups w.r.t. thresholdsErrors plots for reconstructed & selected vectorsApply to other applications:Desktop applicationsWill follow the bursty behavior, maybe determine action signatures??Saving, computation, streaming, etc.Ghostscript might be interestingCorrelation between phases vs. locations of images

Dependence of Results to the Applied Power Model?The generic technique requires only sufficiently detailed power breakdowns that add up to total powerDoesnt matter how you acquire the power vectors otherwiseIf you use some other characterization data other than power vectorsCan still perform phase analysis, but cannot provide a direct estimation for reconstructed power behaviorMaybe use some kind of mapping?Log measured power data as well as characterization metrics?Would still be unable to predict component-wise

Possibility for Other Processors?Most recent processors are keen on power managementThere will be enough power variability to exploit for power phase analysisPorting the power estimation to other architecturesRequires significant effort to Define power related metricsImplement counter reader and power estimation user and kernel SWPorting to same architecture, different implementationMore straightforward Reevaluate max/idle/gated power estimatesExperiences with other architectures:Castle project for Pentium Pro (P6) Few watts of variationLow dimensionalityIBM Power3 IIVery low measured power variation

Other Statistical Techniques?Alternative measures of distance:Different norms Canberra Distance Squared Chi-squared distance, etc.Other similarity metrics:Pearsons correlation coefficientCosine similarity

SPEClite vs. SimPoints vs. Power VectorsApproaches of 3 methods:SPEClite: Performance counts (runtime) (Perf. Oriented) (k-means partitioning for vectors normalized to 0 mean and unit variance) (PCA to create reduced vectors) (selects vectors closest to centroids)SimPoints: Basic block accesses (Simulation) (Perf. Oriented) (k-means partitioning for normalized basic block vectors) (selects vectors closest to centroids except for Early Simpoints)Power Vectors:Component power consumptions (Runtime) (Power Oriented) (threshold based partitioning for normalized + absolute vectors) (Selects earliest vector of each group)

Why Power Vectors w.r.t. Others?Provides a direct interpretation for power consumptionCould be used to identify specific power behavior for dynamic power/thermal managementPower phases might be not a perfect translation of performance phases

I.e. same basic block accesses during different architectural states Generated at runtime Easy repeatability, etc.Thresholding provides upper bound estimate for the power approximation with selected execution points

Modified StuffFOLLOWING SLIDES I MODIFIED FROM THE ORIGINAL, BUT STILL KEEP EM

Our Power Phase AnalysisGoal: Identify phases in program power behaviorDetermine execution points that correspond to these phasesDefine small set of power signatures that represent overall power behaviorOur Approach: Collect samples of estimated power values for processor sub-units at application runtimeDefine a similarity metric regarding these power vectorsProcess the outcome of this similarity metric to group sampled execution points (/Power Vectors) into phasesDetermine execution points and representative signature vectors for each phase groupQuantify the closeness of our approximation based on these vectors to original power behavior

MotivationCharacterizing power behavior:Future power-aware architectures and applicationsDynamic power/thermal managementMulticonfigurable hardwareThread scheduling, DVS, DFSRecurring phase predictionArchitecture researchRepresentative reduced simulation pointsUtilizing power vectors:Direct relation to actual processor power consumptionAcquired at runtime Similarity relations generated quicklyEasy repeatability for different datasets/compilationsIdentify (recurring) phases over large scales of executionIdentify program phases with no knowledge of application (i.e. no basic block profile, PC sampling, code space info, etc.)

Power Vector Similarity MetricHow to quantify the power behavior dissimilarity between two execution points?Consider solely total power difference Conceals significant phase informationConsider manhattan distance between the corresponding 2 vectors Vectors with small magnitudes are inherently closerConsider manhattan distance between the corresponding 2 vectors normalized Indifferent to magnitude of power consumptionConsider a combination of (2) & (3) Restricts us to both absolute and ratio-wise similarities Construct a similarity matrix to represent similarity among all pairs of execution pointsEach entry in the similarity matrix:

Similarity for Simplicity?So, we can identify similar power phases:I.e. informally: if similarity matrix(r,c) is DARK Execution points r & c have similar power behavior2 Questions:1) How do we group the execution points (power vectors) based on their similarity?2) Could we represent power behavior with reasonable accuracy, with a small number of signature vectors?Our answer to Q.1: Thresholding Algorithm:Define a threshold of similarity < % of max dissimilarity>Start from first execution point (0,0) and identify ones in the fwd execution path that lie within threshold for both normalized and absolute metricsTag the corresponding execution points (j,j) as the same groupFind next untagged execution point (r,r,) and do the same along fwd pathRule: A tagged execution point cannot add new elements to its group!We demonstrate the outcome of thresholding with Grouping Matrices

Gzip Grouping MatricesGzip has 974 power vectors Cluster vectors based on similarity using thresholdingMax Gzip power dissimilarity: 47.35W

Conclusion

Presented a power oriented methodology to identify program phases that uses power vectors generated during program runtimeProvided a similarity metric to quantify power behavior similarity of different execution samplesDemonstrated our representative sampling technique to characterize program power behaviorRepresentative vectors for program power signaturesExecution points for representative simulationCan be useful for power & characterization research:Power Phase identification/predictionReduced power simulation Dynamic power/thermal management

The title is our goalBefore describing the essence of our work, What do we mean by power phase behaviorBasically, what well talk about today:Collect samples of estimated power values for processor sub-units at application runtimeDefine a similarity metric regarding these power vectorsProcess the outcome of this similarity metric to group sampled execution points (/Power Vectors) into phasesDetermine execution points and representative signature vectors for each phase groupQuantify the closeness of our approximation based on these vectors to original power behavior

Dynamic power/thermal managementMulticonfigurable hardwareThread scheduling, DVS, DFSRecurring phase predictionArchitecture researchRepresentative reduced simulation pointsAcquired at runtime Similarity relations generated quicklyEasy repeatability for different datasets/compilationsIdentify (recurring) phases over large scales of executionIdentify program phases with no knowledge of application (i.e. no basic block profile, PC sampling, code space info, etc.)

Consider solely total power difference Conceals significant phase informationConsider manhattan distance between the corresponding 2 vectors Vectors with small magnitudes are inherently closerConsider manhattan distance between the corresponding 2 vectors normalized Indifferent to magnitude of power consumptionConsider a combination of (2) & (3) Restricts us to both absolute and ratio-wise similarities

So, we can identify similar power phases:I.e. informally: if similarity matrix(r,c) is DARK Execution points r & c have similar power behavior

2 Questions:1) How do we group the execution points (power vectors) based on their similarity?2) Could we represent power behavior with reasonable accuracy, with a small number of signature vectors?

Back to the question:Could we represent power behavior with reasonable accuracy, with a small number of signature vectors? And also select execution points?Fig-s small, but show the dist-nExpoints more evenly distributedDemonstrated our representative sampling technique to characterize program power behaviorRepresentative vectors for program power signaturesExecution points for representative simulation

Some here, rest in paper

Similarity based on both distribution and magnitude of powerOur metric is more appropriate for powerAlso shows for 10%, 100-150 & 200-280s regions are discriminatedSome here, rest in paper

Similarity based on both distribution and magnitude of powerOur metric is more appropriate for powerI am not gonna talk about P4 architecture becoz of timeinstead Ill use the die layout as a proxymemory sub: L2 ve BUS to chipset main memFront end: fetch, decode, ITLB ref etc.O-O-O: does buffr alloc., reg rename, sched. and dispatch, and retire reorders instr-sEX: Rapid exec. + L1 stuff,

Actual layout components important to use as thermal model based on those physical components22 components hereFirst lets have a look at the March & layout to see what we are dealing with2nd step in the flowchart3rd bullet is all about pwr conversion:we weight the comp. max powers with access rates and consider max. number of accesses that would mean max power, but at the end, they all end up as a single weighting factor merged togetherprocessor core puts accesses to BSQ, those that go to front side bus will be put in FSQ_IOQ (in our case all, as no L3)processor core puts accesses to BSQ, those that go to front side bus will be put in FSQ_IOQ (in our case all, as no L3)processor core puts accesses to BSQ, those that go to front side bus will be put in FSQ_IOQ (in our case all, as no L3)now here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesSome questions we need to answer for our method first 2 are important last 2 are not so muchnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesSome questions we need to answer for our method first 2 are important last 2 are not so muchnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goesnow here is how the rest goes2) Memory references? FWDing, speculation? Or like blocked process still holding bus powerBasically, what well talk about todayConsider solely total power difference Conceals significant phase informationConsider manhattan distance between the corresponding 2 vectors Vectors with small magnitudes are inherently closerConsider manhattan distance between the corresponding 2 vectors normalized Indifferent to magnitude of power consumptionConsider a combination of (2) & (3) Restricts us to both absolute and ratio-wise similarities

Also shows for 10%, 100-150 & 200-280s regions are discriminatedSome questions we need to answer for our method first 2 are important last 2 are not so much

Identifying Program Power Phase Behavior Using Power Vectors

Documents

Transcript of Identifying Program Power Phase Behavior Using Power Vectors