Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization Ajay Nair, Roman Lysecky Department of Electrical and Computer Engineering University of Arizona Tucson, AZ USA {ajaynair, rlysecky}@ece.arizona.edu


Page 1: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Ajay Nair, Roman Lysecky

Department of Electrical and Computer Engineering

University of Arizona, Tucson, AZ USA

{ajaynair, rlysecky}@ece.arizona.edu

Page 2: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

University of Arizona 2

Introduction: Application Profiling

Application profiling is useful for many purposes.

It is often used to identify frequently executed code regions, allowing a designer to focus on optimizing those regions.

It can map frequently executed code and data regions to non-interfering cache regions.

It is used within binary translation approaches to store translation results (x86, Transmeta Crusoe).

It can be used to create optimized SW or HW implementations selected at runtime.

And many others.

Page 3: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Introduction: Application Profiling – HW/SW Partitioning

Profiling is a critical step within hardware/software partitioning.

It is often used to determine the critical software regions: frequently executed loops or functions.

Critical kernels can be re-implemented in hardware, with speedups of 2X to 10X (speedups of 1000X are possible) and energy reductions of 25% to 95%.

[Figure: HW/SW partitioning flow. A software application (C/C++) is profiled to identify critical kernels, which are then partitioned between SW running on the µP (with I$ and D$) and a HW coprocessor (ASIC/FPGA).]

Page 4: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Introduction: Application Profiling – Warp Processing Overview

[Figure: Warp processor architecture with a µP (I$ and D$), Profiler, On-chip CAD, and W-FPGA.]

1. The application initially executes on the microprocessor.

2. The profiler dynamically detects the application's kernels.

3. On-chip CAD maps the kernels onto the FPGA.

4. The FPGA is configured and the application binary is updated.

5. Warped execution is 2-100X faster, or consumes 75% less power.

Page 5: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Introduction: Application Profiling – Warp Processing

Warp processing is dynamic hardware/software partitioning: it dynamically re-implements critical kernels as HW within the W-FPGA, and requires non-intrusive profiling to determine critical kernels at runtime.

Warp processing incorporated the Frequent Loop Detection Profiler [Gordon-Ross, Vahid – TC 2005], which monitors short backwards branches and maintains a small list of branch execution frequencies.

This may lead to sub-optimal partitioning, as it does not provide detailed loop execution statistics.

Page 6: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Introduction: Application Profiling – HW/SW Partitioning

Loop iteration count alone may not provide sufficient information for accurate performance estimation.

Example: assume we want to partition only one of the following two loops to HW.

Kernel | Total Iterations | % Exec Time
A | 10,000 | 33%
B | 12,000 | 45%

With profile data from the Frequent Loop Detection Profiler, kernel B appears to be the better candidate.

Page 7: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Introduction: Application Profiling – Warp Processing

However, communication requirements can significantly impact overall performance

Kernel A may in fact be the better choice

$$S_{SW/HW} = \frac{T_{SW}}{T_{SW/HW}}$$

$$T_{SW/HW} = (T_{SW} - T_{loopSW}) + (T_{loopHW} + T_{comm})$$

$$T_{comm} = Execs \times (T_{Init} + T_{sync})$$

Kernel | Total Iterations | % Exec Time | Avg Iters/Exec | Execs
A | 10,000 | 33% | 5000 | 2
B | 12,000 | 45% | 2 | 6000
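The comparison between kernels A and B can be sketched numerically. This is a minimal illustration, assuming a simple model in which every hardware execution of a kernel pays a fixed communication cost (initialization plus synchronization). Only the Execs and % Exec Time figures come from the table above; the 1 s total software time, the 10X hardware loop speedup, and the 50 µs per-execution cost are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    execs: int        # number of loop executions (from the table)
    pct_time: float   # fraction of total software execution time


def partitioned_speedup(kernel, t_sw, loop_speedup, t_per_exec_comm):
    """Overall speedup if only `kernel` is moved to hardware.

    Assumed model: T_partitioned = T_SW - T_loopSW + T_loopHW + T_comm,
    with T_comm = Execs * (per-execution communication cost).
    """
    t_loop_sw = kernel.pct_time * t_sw
    t_loop_hw = t_loop_sw / loop_speedup
    t_comm = kernel.execs * t_per_exec_comm
    return t_sw / (t_sw - t_loop_sw + t_loop_hw + t_comm)


# Hypothetical parameters: 1 s of software time, 10X faster loop in HW,
# 50 microseconds of initialization + synchronization per execution.
T_SW, LOOP_SPEEDUP, COMM = 1.0, 10.0, 50e-6

kernel_a = Kernel("A", execs=2, pct_time=0.33)
kernel_b = Kernel("B", execs=6000, pct_time=0.45)

speedup_a = partitioned_speedup(kernel_a, T_SW, LOOP_SPEEDUP, COMM)
speedup_b = partitioned_speedup(kernel_b, T_SW, LOOP_SPEEDUP, COMM)
print(f"A: {speedup_a:.2f}x, B: {speedup_b:.2f}x")
```

Under these assumed numbers, kernel B's 6000 executions accumulate enough communication time to outweigh its larger share of execution time, so kernel A gives the higher overall speedup.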

Page 8: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Introduction: Application Profiling – Goal: Non-Intrusive Profiling

Goal: profile an application at runtime to determine detailed loop execution statistics with no impact on application execution.

Runtime overhead cannot be tolerated by many applications, e.g., real-time and embedded systems, where it may lead to missed deadlines and potentially system failure.

Page 9: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Introduction: Application Profiling – Existing Profiling Methods

Software-based profiling:

Instrumentation: insert code directly within the software, e.g., to monitor branches, basic blocks, functions, etc. Intrusive: increases code size and introduces runtime overhead.

Statistical sampling: periodically interrupt the processor, or execute an additional software task, to monitor the program counter, and statistically determine the application profile. Very good accuracy with reduced overhead compared to instrumentation. Intrusive: introduces runtime overhead.
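Statistical sampling can be illustrated with a small simulation. This is a sketch only: the program-counter trace is synthetic, and the address map in `region_of` and the sampling period are hypothetical. It shows that inspecting every 100th program-counter value is enough to estimate how execution time is split between code regions.

```python
from collections import Counter
import random


def sample_profile(pc_trace, period, region_of):
    """Estimate each region's share of execution by sampling every `period`-th PC."""
    samples = Counter(region_of(pc) for pc in pc_trace[::period])
    total = sum(samples.values())
    return {region: count / total for region, count in samples.items()}


def region_of(pc):
    # Hypothetical address map: a hot loop at 0x100-0x13F, everything else is "other".
    return "hot_loop" if 0x100 <= pc < 0x140 else "other"


# Synthetic trace (illustrative only): about 80% of instructions come from the hot loop.
rng = random.Random(0)
trace = [
    rng.randrange(0x100, 0x140) if rng.random() < 0.8 else rng.randrange(0x200, 0x400)
    for _ in range(100_000)
]

estimate = sample_profile(trace, period=100, region_of=region_of)
print(estimate)
```

On a real system each sample costs an interrupt or a task switch, which is the runtime overhead that makes this approach intrusive.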

Page 10: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Introduction: Application Profiling – Existing Profiling Methods

Hardware-based profiling:

Processor support (event counters): many processors include event counters that can be used to profile an application. Intrusive: requires additional software support to process the event counters.

JTAG (Joint Test Action Group): standard interface for reading registers within hardware devices. Intrusive: requires the processor to be halted to read the values.

Page 11: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling

DAProf non-intrusively monitors both loop executions and iterations.

It monitors the processor's instruction bus and branch execution behavior to build the application profile.

It requires a short backwards branch (sbb) signal from the microprocessor.

[Figure: DAProf, implemented in FPGA/ASIC, attached to the µP (with I$ and D$) through the iAddr and sbb signals.]

[Figure: DAProf block diagram. The iAddr, iOffset, and sbb inputs feed the Profiler FIFO, which feeds the Profiler Controller and the Profile Cache. Profile cache fields (width in bits): Tag (30), Offset (8), CurrIter (10), AvgIter (13), Execs (16), InLoop (1), Freshness (3). Profile cache outputs: found, foundIndex, replaceIndex.]

Page 12: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling – Profiler FIFO

Small FIFO that stores the instruction address (iAddr) and instruction offset (iOffset) of all executed sbb's.

Synchronizes between the processor execution frequency and the slower internal profiler frequency.

Page 13: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling – Profile Cache

Tag: address of the short backwards branch.

Offset: negative branch offset. Corresponds to the size of the loop; currently supports loops with fewer than 256 instructions.

Page 14: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling – Profile Cache

CurrIter: number of iterations for the current loop execution.

AvgIter: average iterations per execution of the loop, stored in a 13-bit fixed-point representation with 10 integer bits and 3 fractional bits.
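The AvgIter format can be illustrated with a short encode/decode sketch. The 13-bit width and the 10/3 split come from the slide; the helper names and the saturating behavior on overflow are assumptions made for this example, not details confirmed by the source.

```python
FRAC_BITS = 3                        # 3 fractional bits: resolution of 1/8
TOTAL_BITS = 13                      # 10 integer bits + 3 fractional bits
MAX_RAW = (1 << TOTAL_BITS) - 1      # largest raw value the field can hold


def to_fixed(value):
    """Encode a non-negative number as a 10.3 fixed-point integer (saturating, by assumption)."""
    raw = int(round(value * (1 << FRAC_BITS)))
    return min(max(raw, 0), MAX_RAW)


def to_float(raw):
    """Decode a 10.3 fixed-point integer back to a Python float."""
    return raw / (1 << FRAC_BITS)


print(to_float(to_fixed(5.3)))       # stored as the nearest multiple of 0.125
print(to_float(MAX_RAW))             # largest representable average
```

With 3 fractional bits, averages are stored in steps of 0.125, and the largest representable average is 1023.875 iterations per execution.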

Page 15: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling – Profile Cache

InLoop: flag indicating that the loop is currently executing. Used to distinguish between loop iterations and loop executions.

Freshness: indicates how recently a loop has been executed. Used to ensure newly identified loops are not immediately replaced from the profile cache.

Page 16: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling – Profile Cache Outputs

found: indicates whether the current loop (identified by iAddr) is found within the profile cache.

foundIndex: location of the loop within the profile cache, if found.

replaceIndex: the loop that will be replaced upon a new loop execution, chosen as the loop not identified as fresh with the fewest total iterations.
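The replaceIndex selection can be sketched as follows. The slide only states the policy (the non-fresh loop with the fewest total iterations); this example assumes that a loop counts as fresh while its Freshness value is non-zero, that total iterations are estimated as AvgIter × Execs, and that all entries are considered when every entry is still fresh. The `Entry` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    tag: int
    avg_iter: float
    execs: int
    fresh: int


def total_iterations(entry):
    # Assumption: total iterations are estimated as AvgIter * Execs.
    return entry.avg_iter * entry.execs


def replace_index(cache):
    """Index of the non-fresh entry with the fewest estimated total iterations."""
    stale = [i for i, e in enumerate(cache) if e.fresh == 0]
    # Assumption: if every entry is still fresh, fall back to considering all of them.
    candidates = stale or list(range(len(cache)))
    return min(candidates, key=lambda i: total_iterations(cache[i]))


cache = [
    Entry(tag=0x1000, avg_iter=5000.0, execs=2, fresh=0),
    Entry(tag=0x2000, avg_iter=2.0, execs=6000, fresh=0),
    Entry(tag=0x3000, avg_iter=1.0, execs=1, fresh=7),   # just inserted, still fresh
]
print(replace_index(cache))
```

The newly inserted loop has by far the fewest iterations, yet it is protected by its freshness, so the victim is chosen among the two older entries.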

Page 17: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling – Profiler Controller

If the loop is found within the cache:

If the InLoop flag is set, this is a new iteration: increment the current iterations.

Otherwise, this is a new execution: increment the executions, set the current iterations to 1, set the InLoop flag, and update Freshness.

DAProf (iAddr, iOffset, found, foundIndex, replaceIndex):
  if ( found ) {
    if ( InLoop[foundIndex] )
      CurrIter[foundIndex] += 1                  // new iteration of the current execution
    else {                                       // new execution of a known loop
      for all i, Fresh[i] = Fresh[i] - 1
      Execs[foundIndex] = Execs[foundIndex] + 1
      CurrIter[foundIndex] = 1
      InLoop[foundIndex] = 1
      Fresh[foundIndex] = MaxFresh
      if ( Execs[foundIndex] == MaxExecs )       // halve all counts to avoid overflow
        for all i, Execs[i] = Execs[i] >> 1
    }
  } else {                                       // loop not in cache: replace an entry
    for all i, Fresh[i] = Fresh[i] - 1
    Tag[replaceIndex] = iAddr
    Offset[replaceIndex] = iOffset
    CurrIter[replaceIndex] = 1
    AvgIter[replaceIndex] = 0
    Execs[replaceIndex] = 1
    InLoop[replaceIndex] = 1
    Fresh[replaceIndex] = MaxFresh
  }
  // an sbb outside a loop whose InLoop flag is set means that loop has exited
  for all i,
    if ( InLoop[i] && !( iAddr <= Tag[i] && iAddr >= Tag[i] - Offset[i] ) ) {
      InLoop[i] = 0
      AvgIter[i] = (AvgIter[i]*7 + CurrIter[i]) / 8
    }

Page 18: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling – Profiler Controller

If the loop is not found within the cache:

Replace a profile cache entry.

Initialize the executions and current iterations to 1.

Set the InLoop flag and update Freshness.


Page 19: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling – Profiler Controller

If the current sbb (iAddr) is detected outside a loop within the profile cache, and that loop's InLoop flag is set: reset the InLoop flag and update the average iterations.

Ratio-based average iteration calculation: simple hardware requirements, and good accuracy for the applications considered.


$$AvgIter_i = \frac{AvgIter_i \times 7 + CurrIter_i}{8}$$
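The controller's behavior can be explored with a small software model. This is a sketch, not the authors' hardware: it follows the controller pseudocode and the slide descriptions, but stores AvgIter as a float rather than 10.3 fixed point, uses a hypothetical 4-entry cache, and assumes a loop is fresh while its Freshness value is non-zero. The class name, method names, and trace addresses are invented for illustration.

```python
class DAProfModel:
    """Software model of the profiler controller; sizes and limits are illustrative."""

    MAX_FRESH = 7          # 3-bit Freshness field
    MAX_EXECS = 0xFFFF     # 16-bit Execs field

    def __init__(self, num_entries=4):
        self.entries = []              # each entry is a dict of profile cache fields
        self.num_entries = num_entries

    def find(self, i_addr):
        return next((e for e in self.entries if e["tag"] == i_addr), None)

    def _age(self):
        for e in self.entries:
            e["fresh"] = max(e["fresh"] - 1, 0)

    def _replace_index(self):
        # Assumed policy: non-fresh entry with the fewest AvgIter * Execs.
        stale = [i for i, e in enumerate(self.entries) if e["fresh"] == 0]
        candidates = stale or range(len(self.entries))
        return min(candidates,
                   key=lambda i: self.entries[i]["avg_iter"] * self.entries[i]["execs"])

    def _close_exited_loops(self, i_addr):
        # A loop has exited when an sbb is seen outside its address range.
        for e in self.entries:
            inside = e["tag"] - e["offset"] <= i_addr <= e["tag"]
            if e["in_loop"] and not inside:
                e["in_loop"] = False
                e["avg_iter"] = (e["avg_iter"] * 7 + e["curr_iter"]) / 8

    def on_sbb(self, i_addr, i_offset):
        """Process one executed short backwards branch."""
        entry = self.find(i_addr)
        if entry is not None and entry["in_loop"]:
            entry["curr_iter"] += 1                      # same execution, next iteration
        elif entry is not None:                          # known loop, new execution
            self._age()
            entry.update(execs=entry["execs"] + 1, curr_iter=1,
                         in_loop=True, fresh=self.MAX_FRESH)
            if entry["execs"] == self.MAX_EXECS:         # rescale to avoid overflow
                for e in self.entries:
                    e["execs"] >>= 1
        else:                                            # unknown loop: allocate an entry
            self._age()
            new = dict(tag=i_addr, offset=i_offset, curr_iter=1, avg_iter=0.0,
                       execs=1, in_loop=True, fresh=self.MAX_FRESH)
            if len(self.entries) < self.num_entries:
                self.entries.append(new)
            else:
                self.entries[self._replace_index()] = new
        self._close_exited_loops(i_addr)


# Hypothetical trace: a loop at 0x120 takes its sbb 4 times per execution,
# then a different loop at 0x200 takes its sbb once; repeated 3 times.
model = DAProfModel()
for _ in range(3):
    for _ in range(4):
        model.on_sbb(0x120, 0x10)
    model.on_sbb(0x200, 0x08)

loop_a = model.find(0x120)
print(loop_a["execs"], loop_a["curr_iter"], loop_a["avg_iter"])
```

Because AvgIter starts at 0 and each execution contributes only one eighth of the update, the average approaches the true per-execution count gradually over many executions rather than immediately.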

Page 20: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Hardware Implementation

Implemented fully associative, 16-way associative, and 8-way associative profiler designs in Verilog.

Synthesized using Synopsys Design Compiler targeted at UMC 0.18 µm.

Design | Area (mm²) | Gates | % of ARM9 | Maximum Frequency
Fully Associative | 1.75 | 107,477 | 20.00% | 415 MHz
16-way Associative | 1.22 | 74,744 | 14.00% | 438 MHz
8-way Associative | 0.96 | 59,036 | 11.00% | 495 MHz

Page 21: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Profiling Accuracy

Compared the profiling accuracy for the top ten loops of several MiBench applications against detailed simulation-based profiling.

Results are presented for the 8-way DAProf design; all three associativities performed similarly well.

[Figure: % difference (0% to 40%) per benchmark (lame, mad, cjpeg, djpeg, tiffmedian, tiffdither, tiff2rgba, tiff2bw, Average) for three metrics: Average Iterations, Executions, and % Execution Time.]

90% accuracy for average iterations.

97% accuracy for executions.

95% accuracy for % execution time.

Page 22: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Dynamic Application Profiler (DAProf): Profiling Accuracy – Function Call Interference

Some applications are affected by function call interference.

Loop execution within functions called from within a loop may lead to the InLoop flag being incorrectly reset for the calling loop, so the average iterations will be incorrectly updated.

[Figure: accuracy chart from the previous slide (% difference per benchmark for Average Iterations, Executions, and % Execution Time), annotated to mark the results affected by function call interference.]

Page 23: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Current Work – Dynamic Application Profiler: Function Call Support

Extended the DAProf profiler with function call support.

Monitors function calls and returns to avoid function call interference.

InFunc: flag within the profile cache that indicates whether a loop has called a function. The average iterations are not updated until the function call returns.

[Figure: DAProf block diagram before and after the extension. The extended design adds func and ret input signals alongside iAddr, iOffset, and sbb, and an InFunc (1) field in the profile cache; the extended diagram also labels Offset as 30 bits rather than 8.]
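The effect of the InFunc flag can be sketched in a few lines. This is an illustrative model only: the function names are hypothetical, and it assumes that a call marks every loop that is currently executing and that a single return clears the flag, ignoring nested calls.

```python
def new_entry(tag, offset):
    """Profile cache entry reduced to the fields needed for this example."""
    return {"tag": tag, "offset": offset, "in_loop": True, "in_func": False}


def on_call(entries):
    """A call executed: mark every loop that is currently executing."""
    for entry in entries:
        if entry["in_loop"]:
            entry["in_func"] = True


def on_return(entries):
    """The call returned: exit checks for the calling loops resume."""
    for entry in entries:
        entry["in_func"] = False


def loop_exited(entry, i_addr):
    """True if an sbb at i_addr should close this loop's current execution."""
    inside = entry["tag"] - entry["offset"] <= i_addr <= entry["tag"]
    return entry["in_loop"] and not entry["in_func"] and not inside


caller = new_entry(tag=0x120, offset=0x10)
on_call([caller])
print(loop_exited(caller, 0x400))   # sbb inside the callee: calling loop stays open
on_return([caller])
print(loop_exited(caller, 0x400))   # a later sbb outside the loop closes it
```

While the flag is set, short backwards branches executed inside the callee no longer reset the calling loop's InLoop flag, so its average iterations are left untouched until the call returns.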

Page 24: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Current Work – Dynamic Application Profiler: Profiling Accuracy with Function Call Support

Compared the profiling accuracy for the top ten loops of several MiBench applications against detailed simulation-based profiling.

Results are presented for the 8-way DAProf design; all three associativities performed similarly well.

[Figure: % difference (0% to 40%) per benchmark for three metrics: Average Iterations, Executions, and % Execution Time.]

95% accurate for average iterations, executions, and % execution time.

Page 25: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Conclusions

Developed a non-intrusive dynamic application profiler (DAProf) that profiles an application at runtime, providing detailed loop execution characteristics.

Developed efficient methods for distinguishing loop executions from loop iterations.

Developed a Freshness-based replacement policy to ensure newly executed loops are not immediately replaced.

Developed an efficient method for monitoring function call executions.

Achieves excellent profiling accuracy: on average, better than 95% accurate for average iterations per execution, loop executions, and estimated percentage of total application execution time.

Efficient hardware implementation: area requirement as little as 11% of an ARM9 processor, with a maximum operating frequency of 495 MHz.

Page 26: Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization

Current/Future Work

The current DAProf performs well when profiling single-threaded software applications.

However, multitasked/multithreaded applications may lead to context switch interference, with implications similar to those of function call interference.

This creates a need for task/thread-aware profiling.