
Hardware and Software Tracing

David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
kaeli@ece.neu.edu

Trace Collection Methodologies

• Hardware
  – Monitors and instrumentation
  – Microcode
• Software
  – Trap-based system
  – Emulators
  – Code annotation (source, object, executable)
  – Direct execution

Metrics for Evaluating Trace Collection Methodologies

• Speed – trace capture rate
• Memory – extra memory used
• Accuracy – address perturbation
• Intrusiveness – tracing overhead
• Completeness – OS, interrupts, libraries
• Granularity – smallest traceable unit
• Flexibility – ease of use
• Portability – platform dependence
• Capacity – trace storage space
• Cost – $$, time

Hardware Monitors

• Capture trace at peak execution rates
• Challenge – match storage media speed to tracing needs utilizing interleaving and multiplexing
• Pros:
  – Non-intrusive
  – Accurate
  – Complete
• Cons:
  – Expensive
  – Limited probeability
  – Limited trace length

Examples of Hardware Monitors

• Monster (U. of Michigan 1992) – R2000 traces using a DAS9200
• BACH (BYU 1992) – i486, Pentium, SPARC, 68K – developed a customized pod – being used by Intel today
• Real-time Tracer (IBM 1992) – Customized SRAM array
• National Instruments (2006) – provides a family of programmable instrumentation monitors

Microcode-based Tracing

• Places hooks in microcode to capture machine state
• Pros:
  – Complete (OS, application)
  – Minimal slowdown (2-10x)
• Cons:
  – Microcode is dated technology
  – Nonportable

Example Microcode-based Tracing

• ATUM (Stanford 1986) – VAX traces
• PatchWrx (DEC WRL 1995, NU 1996) – Complete OS-rich traces on Alpha running NT

Instrumenting NT-based Workloads

Participants

• Chakib Ouarraoui – EMC
• Jason Casmira – Intel
• John Fraser – US Air Force
• David Hunter – VMWare
• Sharon Smith – HP
• Richard Sites – Adobe Systems

Tracing tools that capture OS activity

| Name | Avg. Slowdown | Addr. Perturb | OS Activity | Platform(s) |
|---|---|---|---|---|
| Pixie | 10X - 100X | Y | N | MIPS |
| EEL | 10X | Y | Y | SPARC Solaris |
| QPT | 2X - 6X | Y | N | SPARC Solaris |
| Shade | 6X | N | N | SPARC V8, V9 |
| ATUM | 20X | N | Y | DEC VAX |
| ATOM | 10X - 100X | N | Y | DEC UNIX |
| SimOS | 10X - 50,000X | N | Y | DEC UNIX, SGI IRIX, SPARC Solaris |
| Etch | 35X | Y | N | ix86 Windows NT 4.0 |
| NT-Atom | 10X - 100X | N | N | Alpha Windows NT 4.0 |
| PatchWrx | 4X | N | Y | Alpha Windows NT 4.0 |

OS Rich and NT-based Instrumentation Tools

• SimOS
  – UNIX-based platforms (basis for VMWare)
  – OS, memory, I/O activity
  – High overhead (10X - 50,000X)
• Etch
  – Intel x86-based platform
  – No OS activity
  – 35X slowdown

PatchWrx Overview

• Dynamic execution tracing tool suite
• Captures full system workloads
• Traces branches executed by the processor
• Reconstructs full instruction stream
• DEC Alpha 21064 Windows NT 4.0 platforms
• Low overhead with minimum slowdown
  – 2X while running
  – 4X while tracing

PatchWrx Components

• PALcode – Alpha Privileged Architecture Library
  – Reserves trace buffer upon boot
  – Captures trace info
  – Facilitates long branches
• Patch – instrument all NT images
• Trace – collect runtime information
• Reconstruct – reconstitute the information


Patching an Image

• Instrument all WinNT binary image types– COM, EXE, DLL, SYS, DRV

• Replace branch-type instructions with branches to PatchWrx PAL calls

• Log trace entry of branch type into buffer

• Branch to original target

Patching an Image

[Figure: ORIGINAL IMAGE (blocks A, B) beside PATCHED IMAGE (blocks A', B) with a PATCH SECTION holding the PWX PAL call and a BR; control-flow steps are numbered 1 to 4.]

Patching Large Images

• Normal Alpha ISA branch instruction
  – (PC+4) + SEXT(disp21) * 4
• New PatchWrx long branches
  – LBR: (PC+4) + SEXT(disp25) * 4
  – LBSR: (PC+4) + ZEXT(disp20) * 32
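The reach of each branch form follows from the displacement formulas on this slide. The sketch below only mirrors that arithmetic; the function names are invented for illustration and are not taken from PatchWrx or from the Alpha toolchain.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def zext(value, bits):
    """Zero-extend: keep only the low `bits` bits of value."""
    return value & ((1 << bits) - 1)


def br_target(pc, disp21):
    # Normal Alpha branch: (PC+4) + SEXT(disp21) * 4
    return (pc + 4) + sext(disp21, 21) * 4


def lbr_target(pc, disp25):
    # PatchWrx long branch as given on the slide: (PC+4) + SEXT(disp25) * 4
    return (pc + 4) + sext(disp25, 25) * 4


def lbsr_target(pc, disp20):
    # PatchWrx LBSR as given on the slide: (PC+4) + ZEXT(disp20) * 32
    return (pc + 4) + zext(disp20, 20) * 32


# Maximum forward reach in bytes, measured from PC+4.
br_reach = br_target(0, 0x0FFFFF) - 4
lbr_reach = lbr_target(0, 0x0FFFFFF) - 4
lbsr_reach = lbsr_target(0, 0xFFFFF) - 4
```

Under these formulas the normal branch reaches just under 4 MB forward, LBR just under 64 MB, and LBSR just under 32 MB (forward only, in 32-byte steps), which is why large images need the long forms.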

Patching Large Images

[Figure: PATCHED IMAGE (blocks A', B) and a distant PATCH SECTION holding the PWX PAL call and a BR, reached through LONG and CAPTURE stubs; control-flow steps are numbered 1 to 6.]


Tracing with PatchWrx

• Trace

• User controlled start/stop/dump

• Dumps captured trace to binary file

• Captures VA mapping snapshot of active processes during trace capture

Reconstructing Execution

[Figure: the RECONSTRUCT TOOL combines the RAW TRACE, the VA MAP, IMAGE 0 through IMAGE n, and SYMBOL TABLE 0 through SYMBOL TABLE n to produce an I-STREAM and/or D-STREAM.]
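As a rough illustration of reconstruction, assume (hypothetically) that the raw trace holds one (source PC, target PC) record per taken branch and that every instruction is 4 bytes long. The full instruction stream can then be rebuilt by walking sequentially between records. The real PatchWrx record format, the VA-map lookup, and the symbol tables are not modeled here.

```python
def reconstruct(start_pc, end_pc, taken_branches, step=4, limit=1_000_000):
    """Rebuild the executed PC sequence from a list of taken-branch records.

    taken_branches is an ordered list of (source_pc, target_pc) pairs; this
    record layout is an assumption made for the example.
    """
    stream = []
    pending = iter(taken_branches)
    record = next(pending, None)
    pc = start_pc
    while len(stream) < limit:
        stream.append(pc)
        if record is not None and pc == record[0]:
            # The next logged branch was taken here: follow it.
            pc = record[1]
            record = next(pending, None)
        elif pc == end_pc:
            return stream
        else:
            # No branch logged at this PC: fall through to the next instruction.
            pc += step
    raise ValueError("trace did not reach end_pc within the limit")


# A three-instruction loop body at 0x100..0x108 that loops back once.
stream = reconstruct(0x100, 0x110, [(0x108, 0x100)])
```

Only the branches are logged, yet every executed instruction address is recovered, which is what keeps the tracing overhead low.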

OS-Rich Workload Characterization

• Execution domain analysis
• Hot EXEs / DLLs (system resources)
• Instruction mix
  – Application-only
  – Full system
• Branching behavior
  – Branch frequency (average basic block size)
  – Branch prediction in presence of OS

Workloads Investigated

| Workload | Description |
|---|---|
| fourier | BYTEmark; numerical analysis routine for calculating series approximations of waveforms |
| li | SPEC95 Xlisp interpreter benchmark |
| go | SPEC95 Go! game benchmark |
| ie | Microsoft Internet Explorer V2.0 following a series of web page links |
| vc50 | Microsoft Visual C++ 5.0 compiling a 3000 line C program |
| fx32 | FX!32 V1.1 interpreting/translating included openGL sample Intel x86 application |
| word | Microsoft Word97 V7.0, spell-checking a 15 page document |

Five most frequently used images in each benchmark or application

| Workload | 1st | 2nd | 3rd | 4th | 5th | Other |
|---|---|---|---|---|---|---|
| fourier | bytecpu.exe (99.5%) | winsrv.dll (0.2%) | win32k.sys (0.1%) | ntoskrnl.exe (0.1%) | user32.dll (0.02%) | (0.8%) |
| li | li.exe (97.7%) | win32k.sys (1.0%) | ntoskrnl.exe (0.6%) | user32.dll (0.1%) | qv.dll (0.1%) | (0.5%) |
| go | go.exe (95.5%) | win32k.sys (2.0%) | ntoskrnl.exe (1.0%) | hal.dll (0.4%) | gv.dll (0.1%) | (1.0%) |
| ie | iexplore.exe (37.2%) | win32k.sys (19.3%) | ntoskrnl.exe (17.5%) | fastfat.sys (6.1%) | ntdll.dll (6.0%) | (13.9%) |
| vc50 | c1.exe (83.1%) | ntoskrnl.exe (10.5%) | msvcrt.dll (2.8%) | nsfs.sys (1.2%) | win32k.sys (1.1%) | (1.3%) |
| word | mssp232.dll (36.4%) | msgren32.dll (34.0%) | ntoskrnl.exe (10.2%) | win32k.sys (7.7%) | hal.dll (4.0%) | (7.7%) |
| fx!32 | hal.dll (42.5%) | s3.dll (24.6%) | opengl32.dll (12.2%) | msvcrt.dll (11.7%) | glu32.dll (2.7%) | (6.3%) |

Average basic block lengths

[Figure: bar chart of average basic block length (y-axis: instruction count, 0 to 14) for the workloads Fourier, Go, vc50, and word, with one bar per execution domain; legend text as extracted: "AllOSDLLAPPDLLAPP".]

Conditional Branch Prediction

2-level BTB, 12-bit PHR, 4096 entries, gshare

[Figure: bar chart of misprediction ratio (y-axis, 0 to 35) for the workloads Fourier, Go, vc50, and word, with one bar per execution domain; legend text as extracted: "AllOSDLLAPPDLLAPP".]
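For readers unfamiliar with gshare, the following is a minimal sketch of the scheme as it is commonly described: the branch PC is XORed with a global history register to index a table of 2-bit saturating counters. The 12-bit history and 4096 entries come from the slide; the initial counter state, the PC shift, and all names are assumptions of this example, not details of the simulator used in the study.

```python
class Gshare:
    def __init__(self, history_bits=12):
        self.mask = (1 << history_bits) - 1
        # 2**12 = 4096 two-bit counters, started at "weakly not taken" (assumed).
        self.table = [1] * (1 << history_bits)
        self.history = 0

    def _index(self, pc):
        # Drop the two low PC bits (4-byte instructions) and XOR with history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def misprediction_ratio(trace, predictor):
    """trace is a list of (pc, taken) pairs for conditional branches."""
    wrong = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong / len(trace)


# One always-taken branch executed 1000 times.
ratio = misprediction_ratio([(0x4000, True)] * 1000, Gshare())
```

Feeding an application-only trace and a full-system trace through the same predictor is one way to reproduce the kind of comparison shown in the chart.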

Summary of Results

• Benchmarks execute almost entirely within the application domain
  – Desktop applications execute across many images and interact with the kernel and system DLLs
• Branch prediction accuracy can change drastically (sometimes it can even improve) when the operating system interaction is considered
• The instruction mix in desktop applications changes significantly in the presence of OS
  – Increased number of indirect branches and privileged instructions (e.g., PALcalls)


For Further Information

1. “Tracing and Characterization of Windows NT-based System Workloads,” J.P. Casmira, D.P. Hunter and D.R. Kaeli, Digital Technical Journal, Vol. 10, No. 1, 1998, pp. 6-21 (www.digital.com/info/DTJ01/DTJ01HM.HTM).

2. “Operating System Impact on Trace-Driven Simulation,” J.P. Casmira, J. Fraser and D.R. Kaeli, Proceedings of the 31st Simulation Symposium, Boston, MA, April 1998, pp. 76-82.

3. “A Code Annotation Tool for Capturing Operating System Execution,” J. Fraser, Northeastern University Technical Report, NUCAR_6-97-1, June 1997 (on the NUCAR website).

http://www.ece.neu.edu/groups/nucar


And now back to tracing……..

Trap Based

• Interrupt the application at selected points in order to save trace records
• Pros:
  – Available on many CPUs
  – Portable
  – Inexpensive
• Cons:
  – Considerable slowdown (1000x)
  – Intrusive (ISR), especially when considering real-time events
  – How do we decide where to interrupt the processor and still maintain a representative trace?
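A loose software analogy, not a hardware trap mechanism: Python's `sys.settrace` lets the interpreter call a handler before each source line of a chosen function, much as a trap handler is entered at selected points to save a trace record. The helper name `collect_trace` and the record layout are made up for this sketch.

```python
import sys


def collect_trace(func, *args):
    """Run func(*args) and record (function name, line number) per line event."""
    records = []
    target = func.__code__

    def tracer(frame, event, arg):
        # Only "trap" inside the selected function; ignore every other frame.
        if frame.f_code is not target:
            return None
        if event == "line":
            records.append((frame.f_code.co_name, frame.f_lineno))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Always restore whatever tracer was installed before.
        sys.settrace(previous)
    return result, records


def sum_to(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, records = collect_trace(sum_to, 3)
```

The handler runs on every traced line, so the slowdown grows with trace density, which is the same trade-off the slide lists under cons.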

Example Trap Based Systems

• VAX-Tracer – Clark & Emer study on VAX
• OS2-Tracer – Intel 386
• Wisconsin Wind Tunnel – ECC error trapping – CM5 (SPARC)
• Tapeworm II system – ECC error trapping – OS trap handler

Emulators

• Simulate each target ISA instruction using one or more machine instructions on the host ISA
• Pros:
  – Minimal slowdown (10-100x)
  – Opportunity for JIT compilation
  – Portable
  – Flexible – software controlled
• Cons:
  – Serious programming effort needed
  – Extra memory needed
  – Typically single process tracing

Emulators

• Shade (UW 1994) – dynamic translation
  – Compiles emulated instructions to native instructions (many elements of Shade have shown up in Transmeta products)
  – Host – SPARC-V8
  – Targets – SPARC-V8, SPARC-V9, MIPS
• Spa (Sun 1993) – iterative interpretation
  – Reinterprets instructions on each occurrence
  – Host – MIPS-1
  – Targets – MIPS-1, MIPS-2
• SPIM (U of Wisc 1991) – predecoded interpretation
  – Provides pointers to instruction handler and operands to speed decoding
  – Hosts – SPARC, 680x0, MIPS, HP-PA
  – Target – MIPS-1
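The difference between iterative and predecoded interpretation can be sketched with a toy two-instruction ISA. Everything here (the text encoding, the handler names, the register list) is invented for the example and does not reflect how Spa or SPIM are implemented.

```python
def op_addi(regs, rd, imm):
    regs[rd] += imm


def op_add(regs, rd, rs):
    regs[rd] += regs[rs]


HANDLERS = {"addi": op_addi, "add": op_add}


def decode(text):
    """Turn 'addi 0 1' into (handler function, operand tuple)."""
    name, *operands = text.split()
    return HANDLERS[name], tuple(int(x) for x in operands)


def run_iterative(program, regs, repeat):
    """Re-decode every instruction each time it executes."""
    decodes = 0
    for _ in range(repeat):
        for text in program:
            handler, operands = decode(text)
            decodes += 1
            handler(regs, *operands)
    return decodes


def run_predecoded(program, regs, repeat):
    """Decode once, then dispatch through stored handler pointers."""
    decoded = [decode(text) for text in program]
    for _ in range(repeat):
        for handler, operands in decoded:
            handler(regs, *operands)
    return len(decoded)


program = ["addi 0 1", "add 1 0"]
regs_a, regs_b = [0, 0], [0, 0]
iterative_decodes = run_iterative(program, regs_a, repeat=3)
predecoded_decodes = run_predecoded(program, regs_b, repeat=3)
```

Both runs leave the registers in the same state, but the predecoded run decodes each instruction once rather than once per execution, which is where the speedup comes from.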

More Recent Emulators

• VisualDSP (Analog Devices 1995-present)
  – Simulator for SHARC and BlackFin DSPs that runs on WinTel and Linux-x86
  – Provides C/C++ compilation environment
  – Statistical profiling
  – Cycle-accurate simulator
  – Provides a full visualization environment for machine performance
• AMD Opteron X86-64 (2003)
  – Simulator for the new 64-bit X86 from AMD
  – Runs on 32-bit Linux-x86
  – Comes complete with an X86-64 version of gcc
  – http://www.x86-64.org/

MP Emulators

• MINT (University of Rochester 1994)
  – Predecoded interpretation – memory references
  – Host – R3000 (SGI, DECstations)
  – Target – R3000 (an Alpha-based derivative was developed called AINT)
• RSim (Rice Univ 1997)
  – Simulator for high-ILP multiprocessors
  – Detailed cycle-based emulation
  – Host – SPARC, SGI PowerChallenge
  – Target – MIPS R10K

Machine Emulators

• Simics (1996-present) Virtutech
  – Developed out of research work at SICS
  – Provides a large number of CPU targets
    • Alpha, ARM, Itanium, MIPS, Pentium, PowerPC, SPARC, X86-64
  – Provides both detailed simulation/emulation and high throughput
  – http://www.simics.com/
• SimOS (1997) Stanford University
  – Originally designed to run on an SGI platform
  – Actually boots a full operating system (SGI IRIX and DEC UNIX)
  – Implementations on Alpha and MIPS platforms
  – Designed around the operating system, emulating IO and other system-related events
  – Provided the base technology for VMWare products

Code Annotation

• Instrumented program produces trace while the application is run
• Three levels of annotation
  – Source code modification
  – Object code modification
  – Binary code modification
• Pros:
  – Ease of implementation
  – Small slowdown (10x)
  – Inexpensive
• Cons:
  – Limited completeness (OS, multiprocessing)
  – May not capture DLLs
  – Memory dilation

Source Code Annotation

• TRAPEDS (Univ. of Illinois 1989)
  – Adds a call upon exit from a basic block
• MPTrace (Univ. of Washington 1990)
  – i386, instruments only MP-relevant events
• Tangolite (Stanford 1993)
  – Annotates all memory events in an MP environment

Object Code Annotation

• Epoxie (DEC WRL 1989) – Titan MP
• Epoxie2 (DEC WRL 1993) – R3000
• ATOM (DEC WRL 1994) – Alpha
• Alto (Univ. of Arizona 1996) – Alpha
• PLTO (Univ. of Arizona 2001) – IA32

Binary Code Annotation

• Pixie (DEC 1991) – MIPS
• Goblin (IBM/CMU 1991) – RS/6000
• IDtrace (Univ. of Mich.) – i486
• QPT (Univ. of Wisc.) – MIPS, SPARC
• EEL (Univ. of Wisc.) – MIPS, SPARC
• DSPTune (NEU) – ADI SHARC DSP
• Pin (Intel 2005) – X86, XScale, Itanium


Embedded Systems Profiling Tools

• Enhance current embedded system compilation environments, providing profile-driven analysis and feedback capabilities

• DSPTune - instrumentation and analysis package for the SHARC family of DSPs

• Allows for full instrumentation of C and C++ codes at the source, assembly and ELF binary levels

• Supported by Analog Devices and the NSF

The DSPTune Toolset

• A set of library routines that enable the user to instrument C and assembly programs
• Function calls can be inserted at various locations in the application code, enabling execution driven simulation
• The user provides:
  – instrumentation routines, which specify the selected instrumentation events (e.g., loads, branches, traps)
  – analysis routines, which carry out the desired simulation (e.g., caches, stacks, branch predictors)
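The split between instrumentation routines (which events to observe) and analysis routines (what to simulate with them) can be sketched as follows. This is a generic illustration in Python; none of the names correspond to the actual DSPTune API, and the event stream is a hand-written stand-in for what instrumented code would emit.

```python
class DirectMappedCache:
    """Analysis routine state: a tiny direct-mapped cache model."""

    def __init__(self, num_lines=4, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = tag


def instrument(events, selected_kinds, analysis):
    """Instrumentation routine: forward only the selected event kinds."""
    for kind, address in events:
        if kind in selected_kinds:
            analysis(address)


# Hypothetical event stream: (event kind, address).
events = [
    ("load", 0), ("store", 64), ("load", 4),
    ("branch", 100), ("load", 64), ("load", 0),
]
cache = DirectMappedCache()
instrument(events, {"load"}, cache.access)
```

Swapping `cache.access` for a stack or branch-predictor model changes the analysis without touching the instrumentation, which is the point of keeping the two kinds of routine separate.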

[Figure: DSPTune tool flow. Boxes, in extraction order: User application code, Parser, Intermediate Representation, Instrumenting Tool, Instrumented IR, Code Generator, Instrumented application code, Assembler, Linker, User instrumentation code, User analysis code, Instrumented application executable. The stages are labeled Step I through Step IV.]

BDSPTune

• Provides capabilities similar to DSPTune

• Allows ELF binaries to be instrumented

• Enables instrumentation and profiling to include library routines

Summary of Tracing Methodologies

| Method | Slowdown | OS coverage | Sample size | Cost |
|---|---|---|---|---|
| Source Code | 10X | NO | >GB | LOW |
| Object Code | 10X | SOME | >GB | LOW |
| Binary Code | 10X | NO | >GB | LOW |
| Microcode | 10X | YES | >GB | MEDIUM |
| I-Stepping | 1000X | YES | unlimited | MEDIUM |
| Emulation | 10-100X | YES | unlimited | MEDIUM |
| Real-time | 1X | YES | <GB | HIGH |

Counter-based Profiling and Instrumentation

David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
kaeli@ece.neu.edu

Counters are used to:

• Identify Performance Bottlenecks
  – especially unpredictable dynamic stalls, e.g. cache misses, branch mispredicts, TLB misses, etc.
  – complex out-of-order processors make this difficult
• Guide Optimizations
  – help programmers understand and improve code
  – automatic, profile-driven optimizations
• Profile Production Workloads
  – low overhead
  – transparent
  – profile whole system


Performance Counters

• Interfaced through a device driver and supporting GUI (e.g., VTune)

• Counters increment based on a set of events of interest (e.g., cache misses, pipeline stalls)

• An interrupt signals that the counter has overflowed

• An interrupt service routine reads the counter information and tags it to a program counter (PC) value

• Information is then available for offline analysis
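The overflow-driven sampling loop described above can be modeled in a few lines. This is a software simulation under simplifying assumptions (one counter, one event type, the sample is tagged with the exact PC that caused the overflow); the event names and function name are invented for the example and do not correspond to any driver or VTune interface.

```python
from collections import Counter


def sample_profile(events, event_kind, period):
    """Simulate a counter that interrupts every `period` events of one kind.

    events is a list of (pc, kind) pairs. Returns samples per PC.
    """
    counter = 0
    samples = Counter()
    for pc, kind in events:
        if kind != event_kind:
            continue
        counter += 1
        if counter == period:
            # Overflow: the "interrupt service routine" tags the sample with a PC.
            samples[pc] += 1
            counter = 0
    return samples


# A loop body with one miss at 0x10 and two misses at 0x14 per iteration.
loop_body = [
    (0x10, "dcache_miss"),
    (0x14, "dcache_miss"),
    (0x14, "dcache_miss"),
    (0x18, "branch_mispredict"),
]
samples = sample_profile(loop_body * 100, "dcache_miss", period=2)
estimated = {pc: n * 2 for pc, n in samples.items()}
```

Multiplying each sample count by the period gives the estimated event count per PC for offline analysis. A period that divides the loop's event pattern evenly (3 in this example) would sample the same PC every time, which is one reason real tools vary the sampling period.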

Performance Counters

• Low overhead method for obtaining performance and profiling information
  – Typically less than 5% slowdown
• Requires no modification of the binary
• May require root level access to system
• Lacks precision in cause/effect analysis
• Come for free on most ISAs
• Commonly used today to measure performance and estimate power usage

Counter Library

• A number of counter libraries are available to provide an API to program and access the counters on common architectures
  – Rabbit
    • for Intel/AMD Processors and Linux
    • URL: www.scl.ameslab.gov/Projects/Rabbit/
  – PAPI
    • Linux IA32, IA64
    • Allows counters to be captured on a per thread basis
    • URL: icl.cs.utk.edu/projects/papi/

Counters available on different ISAs

| Category | Pentium II | 21064 | 21164 | IBM 604e | R10K | Ultra2 |
|---|---|---|---|---|---|---|
| # counters | 2 | 2 | 3 | 4 | 2 | 2 |
| Counter Range (bits) | 40 | 8, 12, 16 | 14, 16 | 32 | 32 | 32 |
| Variable Range | No | Yes | No | No | No | No |
| Sampling Freq | Variable | Fixed | Fixed | Variable | Variable | Fixed |
| R/W Access | Yes | No | Yes | Yes | Yes | Yes |
| Duration Counting | Yes | No | No | No | No | No |
| Counting Modes | Different Privilege Levels | Selected Processes | User, Kernel, PALmode | User, Kernel, Processes | User, Kernel | User, Kernel |

Events countable on different ISAs

| Event | Pentium II | 21164 | IBM 604e | R10K | Ultra2 |
|---|---|---|---|---|---|
| L1 data cache read | Y | N | Y | N | Y |
| L1 data cache write | Y | N | N | N | Y |
| L1 data cache r/w | N | Y | N | N | N |
| L1 data cache miss | Y | Y | Y | Y | Y |
| L1 inst cache read | Y | N | N | N | N |
| L1 inst cache r/w | N | Y | N | N | Y |
| L1 inst cache hit | N | Y | N | N | Y |
| L1 inst cache miss | Y | Y | Y | Y | Y |

Events countable on different ISAs

| Event | Pentium II | 21164 | IBM 604e | R10K | Ultra2 |
|---|---|---|---|---|---|
| TLB miss | N | N | Y | Y | N |
| Data TLB miss | N | Y | Y | N | N |
| Inst TLB miss | Y | Y | Y | N | N |
| Retired Branches | Y | N | N | Y | N |
| Mispredicted Branches | Y | Y | N | Y | N |
| Taken Branches | Y | N | N | N | N |
| Mispredicted Retired B | Y | N | N | N | N |

Events countable on different ISAs

| Event | Pentium II | 21164 | IBM 604e | R10K | Ultra2 |
|---|---|---|---|---|---|
| Retired Instructions | Y | Y | Y | Y | Y |
| Issued Instructions | Y | N | Y | Y | N |
| Integer Inst Executed | N | Y | Y | N | N |
| FP Inst Executed | Y | Y | Y | Y | N |
| Load Inst Executed | N | Y | Y | Y | N |
| Store Inst Executed | N | Y | N | Y | N |
| Branch Inst Executed | Y | N | Y | N | N |

Events countable on different ISAs

| Event | Pentium II | 21164 | IBM 604e | R10K | Ultra2 |
|---|---|---|---|---|---|
| Total cycles | Y | Y | Y | Y | Y |
| Cycles BPU is idle | N | N | Y | N | N |
| Cycles IU is idle | N | N | Y | N | N |
| Cycles LSU is idle | N | N | Y | N | N |
| Cycles LSU stalls | N | N | Y | N | N |
| Cycles FPU stalls | Y | N | Y | N | N |
| Cycles BPU stalls | N | N | Y | N | N |

Existing Instruction-Level Sampling

• Use Hardware Event Counters
  – small set of software-loadable counters
  – each counts a single event at a time, e.g. dcache miss
  – counter overflow generates interrupt
• Advantages
  – low overhead vs. simulation and instrumentation
  – transparent vs. instrumentation
  – complete coverage, e.g. kernel, shared libs, etc.
• Effective on In-Order Processors
  – analysis computes execution frequency
  – heuristics identify possible reasons for stalls
  – example: DIGITAL’s Continuous Profiling Infrastructure

Problems with Event-Based Counters

• Cannot simultaneously monitor all events
• Limited information about events
  – “event has occurred”, but no additional context, e.g. cache miss latencies, recent execution path, ...
• Blind spots in non-interruptible code
• Key problem: imprecise attribution
  – interrupt delivers restart PC, not the PC that caused event
  – problem worse on out-of-order processors

Problem: Imprecise Attribution

Example: finding the single instruction that causes a long-latency event (e.g., cache miss, TLB miss, branch mispredict)

• Most counter-based schemes provide the PC at the point a counter overflowed
• In-order processors (Alpha 21164)
  – Imprecise exceptions/interrupts hinder our ability to quickly identify the cause of latencies during execution
  – It is possible to post-analyze the problem to attempt to identify the responsible instruction
• Out-of-order processors (Alpha 21264, Pentium 4)
  – Due to the lack of sequentiality in the execution, the distance between the responsible instruction and the current PC can be large
  – It is nearly impossible to identify the cause of the latency
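A toy model of this attribution error: if the interrupt is delivered a fixed number of instructions after the one that triggered the event, every sample is blamed on the wrong PC. The fixed `skid` value and the function name are assumptions made for this sketch; on real out-of-order hardware the distance varies from sample to sample, which is what makes post-analysis so hard.

```python
from collections import Counter


def attribute_with_skid(pcs, event_indices, skid):
    """Blame each event on the PC executing `skid` instructions later."""
    blamed = Counter()
    for i in event_indices:
        # The interrupt lands `skid` instructions after the event (clamped).
        j = min(i + skid, len(pcs) - 1)
        blamed[pcs[j]] += 1
    return blamed


# Five iterations of a 4-instruction loop; the load at 0x100 always misses.
pcs = [0x100, 0x104, 0x108, 0x10C] * 5
miss_indices = [i for i, pc in enumerate(pcs) if pc == 0x100]

precise = attribute_with_skid(pcs, miss_indices, skid=0)
skidded = attribute_with_skid(pcs, miss_indices, skid=2)
```

With no skid all five misses land on 0x100; with a skid of two instructions they all land on 0x108, an instruction that never missed.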

ProfileMe Profiling Strategy (DEC 1998)

• PC + Retire Status → execution frequency
• PC + Cache Miss Flag → cache miss rates
• PC + Branch Mispredict → mispredict rates
• PC + Event Flag → event rates
• PC + Branch Direction → edge frequencies
• PC + Branch History → path execution rates
• PC + Latency → instruction stalls

Identifying True Bottlenecks

• ProfileMe: Detailed Data for Single Instruction
• In-Order Processors
  – ProfileMe PC + latency data identifies stalls
  – stalled instructions back up pipeline
• Out-of-Order Processors
  – explicitly designed to mask stall latency, e.g. dynamic reordering, speculative execution
  – stall does not necessarily imply bottleneck
• Example: Does This Stall Matter?
    load r1, …
    add …,r1,…        average latency: 35.0 cycles
    … other instructions …

Example: Retire Count Convergence

[Figure: ratio of estimated to actual retire count (y-axis "Estimate / Actual", 0 to 2) versus number of retired samples N (x-axis, 0 to 500), with an accuracy envelope labeled "Accuracy 1/N" in the extracted text.]
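Standard sampling statistics suggest that the relative error of a count estimated from N samples shrinks roughly as 1/sqrt(N); the "1/N" label in the extracted slide text may have lost a radical sign, but that is a guess. The sketch below states the rule of thumb and checks it with a small seeded simulation. The function names and the uniform-sampling model are assumptions of this example, not part of ProfileMe.

```python
import math
import random


def expected_relative_error(num_samples):
    """Rule of thumb: relative error of a sampled count is about 1/sqrt(N)."""
    return 1.0 / math.sqrt(num_samples)


def estimate_over_actual(true_fraction, num_draws, rng):
    """Sample num_draws instructions uniformly and return estimate / actual
    for one instruction that makes up true_fraction of all retires."""
    hits = sum(1 for _ in range(num_draws) if rng.random() < true_fraction)
    return (hits / num_draws) / true_fraction


rng = random.Random(0)  # fixed seed so the run is repeatable
ratio_small = estimate_over_actual(0.3, 30, rng)
ratio_large = estimate_over_actual(0.3, 30000, rng)
```

With tens of thousands of draws the ratio settles close to 1, while a few dozen draws can leave it far off, which is the convergence behavior the chart illustrates.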

How to handle concurrency and OOO?

Appropriate concurrency metrics
  – retired instructions per cycle
  – issue slots wasted while an instruction is in flight
  – pipeline stage utilization

How to measure concurrency?
• Special-purpose hardware
  – some metrics difficult to measure, e.g. need retire/abort status
• Sample potentially-concurrent instructions
  – aggregate info from pairs of samples
  – statistically estimate metrics

How to handle concurrency and OOO?

• Sample Two Instructions
  – sample instructions, not events
  – may be in-flight simultaneously
  – replicate ProfileMe hardware, add intra-pair distance
• Nested Sampling
  – sample window around first profiled instruction
  – randomly select second profiled instruction
  – statistically estimate frequency for F(first, second)

[Figure: timeline (axis "time") showing a sampling window from -W to +W around the first profiled instruction, with candidate second instructions marked "overlap" or "no overlap".]
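The nested-sampling idea can be sketched as two small steps: pick a second instruction at a random offset within ±W of the first, then estimate how often sampled pairs were in flight at the same time. The (start, end) interval representation and all names here are assumptions of this example, not the ProfileMe hardware interface.

```python
import random


def pick_second(first_index, window, rng):
    """Choose the second profiled instruction within +/- window of the first."""
    offsets = [o for o in range(-window, window + 1) if o != 0]
    return first_index + rng.choice(offsets)


def overlaps(first, second):
    """Each sample is a (start_cycle, end_cycle) in-flight interval, end exclusive."""
    return first[0] < second[1] and second[0] < first[1]


def overlap_fraction(pairs):
    """Estimate how often paired samples were in flight simultaneously."""
    return sum(1 for a, b in pairs if overlaps(a, b)) / len(pairs)


# Hypothetical in-flight intervals for four sampled pairs.
pairs = [
    ((0, 10), (5, 7)),    # second runs entirely inside the first: overlap
    ((0, 10), (12, 15)),  # second starts after the first retires: no overlap
    ((3, 4), (0, 20)),    # first runs inside the second: overlap
    ((0, 5), (5, 8)),     # back to back: no overlap
]
fraction = overlap_fraction(pairs)
second = pick_second(100, 8, random.Random(1))
```

Aggregating such pair statistics per PC is what allows concurrency metrics, such as wasted issue slots while an instruction is in flight, to be estimated statistically rather than measured directly.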

Other Uses of Paired Sampling

• Path Profiling
  – two PCs close in time can identify execution path
  – identify control flow, e.g. indirect branches, calls, traps
• Direct Latency Measurements
  – data load-to-use
  – loop iteration cost

VTune: IA32 Instrumentation and Profiling

• Supports all versions of IA32 Intel processors
• Provides a rich GUI to ease programming and reading of hardware counters
• Features include:
  – Time and event-based sampling
  – Call graph profiling
  – Provides source-level tuning advice
  – Allows for integrated visualization of source and counter information
  – Supports C/C++, Fortran, Java and IA32 assembly


VTune Time Sample


VTune Call Graph


VTune Hot Spot Analyzer


VTune Tuning Assistant

Using Performance Counters for Power Profiling/Estimation

• Profile power-consuming events
  – Cache misses
  – TLB misses
  – Pipeline stalls
  Opportunities to wait slower!
• How can we tie high counts to when to adjust voltage/frequency? (more on this later in the class….)

Summary

• Tracing/Instrumentation is still used today by industry and academia
  – The field has evolved significantly
  – Industry uses software-based tools for performance and hardware-based tools for power/energy
  – Most performance studies today use some form of emulation or virtualized execution to obtain trace data
• Counters can be used effectively to capture performance data
  – The entry cost for using counters is low
  – Out-of-order (OOO) microarchitectures inhibit the use of counters
  – Paired sampling can be an effective technique for handling imprecision
• A number of high-quality free and commercial tools are available (and we are going to use at least one of them)