Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization
Ajay Nair, Roman Lysecky
Department of Electrical and Computer Engineering
University of Arizona
Tucson, AZ USA
{ajaynair, rlysecky}@ece.arizona.edu
Introduction: Application Profiling
- Application profiling is useful for many purposes
  - Often used to identify frequently executed code regions, allowing a designer to focus on optimizing those regions
  - Map frequently executed code and data regions to non-interfering cache regions
  - Used within binary translation approaches to store translation results (e.g., x86, Transmeta Crusoe)
  - Can be used to create optimized SW or HW implementations selected at runtime
  - And many others
Introduction: Application Profiling – HW/SW Partitioning
- Hardware/software partitioning
  - Profiling is a critical step within hardware/software partitioning
  - Often utilized to determine critical software regions
    - Frequently executed loops or functions
  - Critical kernels can be re-implemented in hardware
    - Speedup of 2X to 10X (speedup of 1000X possible)
    - Energy reduction of 25% to 95%
[Figure: partitioning flow from software application (C/C++) through application profiling and critical kernels to partitioning into HW and SW; target platform with µP, I$, D$, and HW coprocessor (ASIC/FPGA)]
Introduction: Application Profiling – Warp Processing Overview
[Figure: warp processor consisting of µP, I$, D$, profiler, on-chip CAD, and W-FPGA]
1. Application initially executes on microprocessor
2. Profiler dynamically detects application’s kernels
3. On-chip CAD maps kernels onto FPGA
4. Configure FPGA and update application binary
5. Warped execution is 2-100X faster, or consumes 75% less power
Introduction: Application Profiling – Warp Processing
- Warp processing: dynamic hardware/software partitioning
  - Dynamically re-implements critical kernels as HW within W-FPGA
  - Requires non-intrusive profiling to determine critical kernels at runtime
- Incorporated Frequent Loop Detection Profiler [Gordon-Ross, Vahid – TC 2005]
  - Monitors short backwards branches
  - Maintains a small list of branch execution frequencies
  - May lead to sub-optimal partitioning as it does not provide detailed loop execution statistics
Introduction: Application Profiling – HW/SW Partitioning
- Loop iteration count alone may not provide sufficient information for accurate performance estimation
- Example: assume we want to partition only one of the following two loops to HW:

Kernel  Total Iterations  % Exec Time
A       10,000            33%
B       12,000            45%

- With profile data from the Frequent Loop Detection Profiler, kernel B appears to be the better candidate
Introduction: Application Profiling – Warp Processing
- However, communication requirements can significantly impact overall performance
- Kernel A may in fact be the better choice

\[ S_{SW/HW} = \frac{T_{SW}}{T_{SW/HW}} \]
\[ T_{SW/HW} = T_{SW} - T_{loopSW} + T_{loopHW} + T_{comm} \]
\[ T_{comm} = Execs \cdot (T_{Init} + T_{sync}) \]

Kernel  Total Iterations  % Exec Time  Avg Iters/Exec  Execs
A       10,000            33%          5000            2
B       12,000            45%          2               6000
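The effect of per-execution communication cost on this comparison can be illustrated with a short sketch. It assumes a model in which the partitioned execution time is the software time minus the software loop time, plus the hardware loop time, plus a communication cost paid once per loop execution. The total software time, the hardware loop speedup, and the initialization and synchronization costs are hypothetical values chosen for illustration; only the execution counts and % execution times come from the kernel table.

```python
def partitioned_speedup(t_sw, loop_fraction, hw_loop_speedup, execs, t_init, t_sync):
    """Estimate overall speedup when one loop is moved to hardware.

    Assumed model (not taken verbatim from the slides):
      T_SW/HW = T_SW - T_loopSW + T_loopHW + T_comm
      T_comm  = Execs * (T_Init + T_sync)
    """
    t_loop_sw = t_sw * loop_fraction           # time the loop takes in software
    t_loop_hw = t_loop_sw / hw_loop_speedup    # time the loop takes in hardware
    t_comm = execs * (t_init + t_sync)         # cost paid once per loop execution
    t_sw_hw = t_sw - t_loop_sw + t_loop_hw + t_comm
    return t_sw / t_sw_hw


# Hypothetical numbers: 1,000,000 cycles in software, a 10x faster loop in
# hardware, and 20 + 10 cycles of communication per loop execution.
T_SW = 1_000_000
speedup_a = partitioned_speedup(T_SW, 0.33, 10.0, execs=2, t_init=20, t_sync=10)
speedup_b = partitioned_speedup(T_SW, 0.45, 10.0, execs=6000, t_init=20, t_sync=10)
print(f"kernel A: {speedup_a:.2f}x, kernel B: {speedup_b:.2f}x")
```

Under these assumed costs kernel A yields roughly 1.42x and kernel B roughly 1.29x, even though B accounts for more of the execution time, because B pays the communication cost 6000 times.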
Introduction: Application Profiling – Goal: Non-Intrusive Profiling
- Non-intrusive application profiling
  - Goal: profile the application at runtime to determine detailed loop execution statistics with no impact on application execution
  - Many applications cannot tolerate runtime overhead
    - E.g., real-time and embedded systems
    - May lead to missed deadlines and potentially system failure
Introduction: Application Profiling – Existing Profiling Methods
- Software-based profiling
  - Instrumenting: insert code directly within software
    - E.g., monitor branches, basic blocks, functions, etc.
    - Intrusive: increases code size and introduces runtime overhead
  - Statistical sampling
    - Periodically interrupt processor, or execute additional software task, to monitor program counter
    - Statistically determine the application profile
    - Very good accuracy with reduced overhead compared to instrumentation
    - Intrusive: introduces runtime overhead
Introduction: Application Profiling – Existing Profiling Methods
- Hardware-based profiling
  - Processor support: event counters
    - Many processors include event counters that can be used to profile an application
    - Intrusive: requires additional software support to process event counters to profile application
  - JTAG (Joint Test Action Group)
    - Standard interface for reading registers within hardware devices
    - Intrusive: requires the processor to be halted to read the values
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
- Dynamic Application Profiler (DAProf)
  - Non-intrusively monitors both loop executions and iterations
  - Monitors processor’s instruction bus and branch execution behavior to build application profile
  - Requires a short backwards branch (sbb) signal from microprocessor
[Figure: µP with I$ and D$ connected to DAProf (FPGA/ASIC) through the iAddr and sbb signals. DAProf consists of a profiler FIFO (inputs iAddr, sbb; outputs iAddr, iOffset), a profiler controller, and a profile cache with fields Tag (30), Offset (8), CurrIter (10), AvgIter (13), Execs (16), InLoop (1), Freshness (3) and outputs found, foundIndex, replaceIndex]
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
- Profiler FIFO
  - Small FIFO that stores the instruction address (iAddr) and instruction offset (iOffset) of all executed sbb’s
  - Synchronizes between processor execution frequency and slower internal profiler frequency
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
- Profile cache
  - Tag: address of the short backwards branch
  - Offset: negative branch offset
    - Corresponds to the size of the loop
    - Currently supports loops with fewer than 256 instructions
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
- Profile cache
  - CurrIter: number of iterations for the current loop execution
  - AvgIter: average iterations per execution of the loop
    - 13-bit fixed-point representation with 10 integer bits and 3 fractional bits
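A minimal sketch of how a 13-bit AvgIter value with 3 fractional bits might be stored and updated is given here. The 10.3 format and the weighted update (7/8 old average, 1/8 current iterations) come from the slides; the scaling of CurrIter, truncation on the shift, and saturation at the field maximum are assumptions made for illustration, and the helper names are hypothetical.

```python
FRAC_BITS = 3                      # 3 fractional bits (from the slides)
AVG_BITS = 13                      # 10 integer bits + 3 fractional bits
AVG_MAX = (1 << AVG_BITS) - 1      # largest value the field can hold


def to_fixed(iterations: int) -> int:
    """Convert an integer iteration count to 10.3 fixed point (saturating, assumed)."""
    return min(iterations << FRAC_BITS, AVG_MAX)


def to_float(avg_fixed: int) -> float:
    """Interpret a 10.3 fixed-point value as a real number."""
    return avg_fixed / (1 << FRAC_BITS)


def update_avg(avg_fixed: int, curr_iter: int) -> int:
    """AvgIter = (AvgIter * 7 + CurrIter) / 8, computed in 10.3 fixed point.

    The divide by 8 is a right shift by 3, which truncates (assumed behavior).
    """
    return min((avg_fixed * 7 + to_fixed(curr_iter)) >> 3, AVG_MAX)


avg = 0
for _ in range(3):                 # three executions of 8 iterations each
    avg = update_avg(avg, 8)
print(to_float(avg))
```

Because the divide by 8 is a 3-bit shift, the update needs only shifts and adds; with truncation the stored average can settle slightly below the true value.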
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
- Profile cache
  - InLoop: flag indicating the loop is currently executing
    - Utilized to distinguish between loop iterations and loop executions
  - Freshness: indicates how recently a loop has been executed
    - Utilized to ensure newly identified loops are not immediately replaced from the profile cache
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
- Profile cache outputs
  - found: indicates if the current loop (identified by iAddr) is found within the profile cache
  - foundIndex: location of the loop within the profile cache, if found
  - replaceIndex: loop that will be replaced upon a new loop execution
    - The loop with the least total iterations among those not identified as fresh
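One way the replaceIndex selection could work is sketched here. The slides only state that the victim is the loop with the least total iterations that is not identified as fresh; treating a Freshness of 0 as "not fresh", estimating total iterations as Execs times AvgIter, and falling back to all entries when every entry is fresh are assumptions, and the names `CacheEntry` and `replace_index` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    execs: int        # number of loop executions
    avg_iter: float   # average iterations per execution
    fresh: int        # freshness counter (0 is treated as "not fresh", assumed)


def replace_index(entries):
    """Return the index of the entry to replace.

    Among entries that are not fresh, pick the one with the fewest
    estimated total iterations (execs * avg_iter, an assumed estimate).
    If every entry is still fresh, consider all entries (assumed fallback).
    """
    stale = [i for i, e in enumerate(entries) if e.fresh == 0]
    candidates = stale if stale else list(range(len(entries)))
    return min(candidates, key=lambda i: entries[i].execs * entries[i].avg_iter)


cache = [
    CacheEntry(execs=2, avg_iter=5000.0, fresh=0),   # 10,000 iterations, stale
    CacheEntry(execs=6000, avg_iter=2.0, fresh=0),   # 12,000 iterations, stale
    CacheEntry(execs=1, avg_iter=1.0, fresh=7),      # new loop, protected by freshness
]
print(replace_index(cache))
```

In this example the newly identified loop has by far the fewest iterations but is protected by its freshness, so the stale entry with 10,000 iterations (index 0) is selected instead.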
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
- Profiler controller
  - If loop is found within cache
    - If InLoop flag is set: new iteration
      - Increment current iterations
    - Otherwise: new execution
      - Increment executions
      - Set current iterations to 1
      - Set InLoop flag
      - Update Freshness

    DAProf(iAddr, iOffset, found, foundIndex, replaceIndex):
      if ( found ) {
        if ( InLoop[foundIndex] ) {
          CurrIter[foundIndex] += 1                    // new iteration
        } else {                                       // new execution
          for all i, Fresh[i] = Fresh[i] - 1
          Execs[foundIndex] = Execs[foundIndex] + 1
          CurrIter[foundIndex] = 1
          InLoop[foundIndex] = 1
          Fresh[foundIndex] = MaxFresh
          if ( Execs[foundIndex] == MaxExecs )         // halve all counts on saturation
            for all i, Execs[i] = Execs[i] >> 1
        }
      } else {                                         // loop not in profile cache
        for all i, Fresh[i] = Fresh[i] - 1
        Tag[replaceIndex] = iAddr
        Offset[replaceIndex] = iOffset
        CurrIter[replaceIndex] = 1
        AvgIter[replaceIndex] = 0
        Execs[replaceIndex] = 1
        InLoop[replaceIndex] = 1
        Fresh[replaceIndex] = MaxFresh
      }
      // sbb outside a loop whose InLoop flag is set: that loop has finished executing
      for all i,
        if ( InLoop[i] && !( iAddr <= Tag[i] && iAddr >= Tag[i] - Offset[i] ) ) {
          InLoop[i] = 0
          AvgIter[i] = (AvgIter[i]*7 + CurrIter[i])/8
        }
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
- Profiler controller
  - If loop is not found within cache
    - Replace profile cache entry
    - Initialize executions and current iterations to 1
    - Set InLoop flag
    - Update Freshness
Dynamic Application Profiler (DAProf): Non-intrusive Dynamic Application Profiling
- Profiler controller
  - If current sbb (iAddr) is detected outside a loop within the profile cache, AND the loop’s InLoop flag is set
    - Reset InLoop flag
    - Update average iterations
  - Ratio-based average iteration calculation
    - Simple hardware requirements
    - Good accuracy for applications considered
\[ AvgIter_i = \frac{AvgIter_i \cdot 7 + CurrIter_i}{8} \]
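The controller behavior described on these slides can be exercised with a small software model. This is a sketch under stated assumptions, not the hardware design: the class and field names are hypothetical, AvgIter is kept as a float instead of fixed point, freshness is floored at 0, the victim choice (least fresh, then fewest estimated iterations) is assumed, and the loop-exit test follows the slide text (InLoop set and the sbb address outside the loop's range).

```python
from dataclasses import dataclass

MAX_FRESH = 7        # largest value of the 3-bit Freshness field
MAX_EXECS = 0xFFFF   # largest value of the 16-bit Execs field


@dataclass
class Entry:
    tag: int                 # address of the short backwards branch
    offset: int              # loop size (magnitude of the branch offset)
    curr_iter: int = 1
    avg_iter: float = 0.0
    execs: int = 1
    in_loop: bool = True
    fresh: int = MAX_FRESH


class DAProfModel:
    def __init__(self, size=8):
        self.size = size
        self.entries = []

    def _age(self):
        # Freshness decays whenever a new loop execution starts (floor at 0 assumed).
        for e in self.entries:
            e.fresh = max(e.fresh - 1, 0)

    def on_sbb(self, iaddr, ioffset):
        hit = next((e for e in self.entries if e.tag == iaddr), None)
        if hit is not None and hit.in_loop:
            hit.curr_iter += 1                     # another iteration
        elif hit is not None:                      # new execution of a known loop
            self._age()
            hit.execs += 1
            hit.curr_iter, hit.in_loop, hit.fresh = 1, True, MAX_FRESH
            if hit.execs == MAX_EXECS:             # halve all counts on saturation
                for e in self.entries:
                    e.execs >>= 1
        else:                                      # loop not in the profile cache
            self._age()
            new = Entry(tag=iaddr, offset=ioffset)
            if len(self.entries) < self.size:
                self.entries.append(new)
            else:
                # Assumed victim choice: least fresh, then fewest estimated iterations.
                victim = min(range(self.size), key=lambda i: (
                    self.entries[i].fresh,
                    self.entries[i].execs * self.entries[i].avg_iter))
                self.entries[victim] = new
        # Close every loop that was executing but does not contain this branch.
        for e in self.entries:
            if e.in_loop and not (e.tag - e.offset <= iaddr <= e.tag):
                e.in_loop = False
                e.avg_iter = (e.avg_iter * 7 + e.curr_iter) / 8


# Hypothetical trace of (iAddr, iOffset) pairs: loop A, then loop B, then loop A again.
prof = DAProfModel()
trace = [(0x100, 16)] * 4 + [(0x200, 8)] * 2 + [(0x100, 16)] * 4
for iaddr, ioffset in trace:
    prof.on_sbb(iaddr, ioffset)
loop_a, loop_b = prof.entries
print(loop_a.execs, loop_a.avg_iter, loop_b.execs, loop_b.avg_iter)
```

After this trace, loop A has two executions and is still marked as executing, while loop B has one completed execution. Because AvgIter starts at 0 and each update weights the old average by 7/8, the average takes several executions to approach the true iteration count.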
Dynamic Application Profiler (DAProf): Hardware Implementation
- DAProf hardware
  - Implemented fully associative, 16-way associative, and 8-way associative profiler designs in Verilog
  - Synthesized using Synopsys Design Compiler targeted at UMC 0.18 µm

Design              Area (mm2)  Gates    % of ARM9  Maximum Frequency
Fully Associative   1.75        107,477  20.00%     415 MHz
16-way Associative  1.22        74,744   14.00%     438 MHz
8-way Associative   0.96        59,036   11.00%     495 MHz
Dynamic Application Profiler (DAProf): Profiling Accuracy
- DAProf profiling accuracy
  - Compared profiling accuracy for the top ten loops of several MiBench applications against detailed simulation-based profiling
  - Results presented for the 8-way DAProf design
    - All three associativities performed similarly well
[Figure: % difference per benchmark (lame, mad, cjpeg, djpeg, tiffmedian, tiffdither, tiff2rgba, tiff2bw, Average) for average iterations, executions, and % execution time; y-axis from 0% to 40%]
- 90% accuracy for average iterations
- 97% accuracy for executions
- 95% accuracy for % execution time
Dynamic Application Profiler (DAProf): Profiling Accuracy – Function Call Interference
- DAProf profiling accuracy
  - Some applications are affected by function call interference
    - Loop execution within functions called from within a loop may lead to the InLoop flag being incorrectly reset for the calling loop
    - Average iterations will be incorrectly updated
[Figure: % difference per benchmark (lame, mad, cjpeg, djpeg, tiffmedian, tiffdither, tiff2rgba, tiff2bw, Average) for average iterations, executions, and % execution time; y-axis from 0% to 40%; function call interference is annotated]
Current Work – Dynamic Application Profiler: Function Call Support
- Extended DAProf profiler with function call support
  - Monitors function calls and returns to avoid function call interference
  - InFunc: flag within the profile cache to determine if a loop has called a function
    - Will not update average iterations until the function call returns
[Figure: original DAProf next to the extended DAProf. The extended design adds func and ret inputs alongside iAddr and sbb, and an InFunc (1) field in the profile cache; its fields are labeled Tag (30), Offset (30), CurrIter (10), AvgIter (13), Execs (16), InLoop (1), Freshness (3), InFunc (1)]
Current Work – Dynamic Application Profiler: Profiling Accuracy with Function Call Support
- DAProf profiling accuracy with function support
  - Compared profiling accuracy for the top ten loops of several MiBench applications against detailed simulation-based profiling
  - Results presented for the 8-way DAProf design
    - All three associativities performed similarly well
[Figure: % difference per benchmark for average iterations, executions, and % execution time; y-axis from 0% to 40%]
- 95% accurate for average iterations, executions, and % execution time
Conclusions
- Developed a non-intrusive dynamic application profiler (DAProf)
  - Profiles an application at runtime, providing detailed loop execution characteristics
  - Developed efficient methods for distinguishing loop executions from loop iterations
  - Developed a Freshness-based replacement policy to ensure newly executed loops are not immediately replaced
  - Developed an efficient method for monitoring function call executions
- Achieves excellent profiling accuracy
  - On average, better than 95% accurate for average iterations per execution, loop executions, and estimated percentage of total application execution time
- Efficient hardware implementation
  - Area requirement as little as 11% of an ARM9 processor
  - Maximum operating frequency of 495 MHz
Current/Future Work
- The current DAProf performs very well when profiling single-threaded software applications
- However, multitasked/multithreaded applications may lead to context switch interference
  - Similar implications to those of function call interference
  - Need for task/thread-aware profiling