Department of Computer Science Mining Performance Data from Sampled Event Traces Bret Olszewski IBM...

Department of Computer Science

Mining Performance Data from Sampled Event Traces

Bret Olszewski, IBM Corporation – Austin, TX

Ricardo Portillo, Diana Villa, Patricia J. Teller, The University of Texas at El Paso, Department of Computer Science


Outline

• Motivation
• Data Collection Environment
  • Workload & Platform
  • Monitored Events
• Data Analysis & Results
• Conclusions and Future Work


Motivation

Capturing Event Traces
• System Simulation: Overhead penalty is too high
• Real-time Metrics: Capture every event during actual execution

Problem
• Growing size of full event traces is becoming unmanageable

Goal
• Use sampled event traces to analyze execution behavior


Data Collection Environment

Workload
• TPC-C benchmark
  • Commercial OLTP

Platform
• IBM eServer pSeries 690 architecture (p690)
  • 8- and 32-processor configurations


Platform: 8-processor p690 configuration

[Figure: block diagram of two multi-chip modules (MCM 0 and MCM 1), showing processors (P, with some positions marked X), L2 caches, and one L3 per MCM.]

Platform: 32-processor p690 configuration

[Figure: block diagram of four multi-chip modules (MCM 0 to MCM 3), each with eight processors (P), four L2 caches, and one L3.]


Monitored Events

L2-cache data-load misses
• L2.5
• L2.75
• L3
• L3.5
• MEM


Where is L2 Miss Resolved?

[Figure: 8-processor p690 diagram (MCM 0 and MCM 1, with processors, L2 caches, and L3 caches) showing where an L2 miss can be resolved.]


[Figure: p690 MCM diagram (MCM 0 and MCM 1) highlighting the L2.5 event.]


[Figure: p690 MCM diagram (MCM 0 and MCM 1) highlighting the L2.5 and L2.75 events.]


[Figure: p690 MCM diagram (MCM 0 and MCM 1) highlighting the L2.5, L2.75, and L3 events.]


[Figure: p690 MCM diagram (MCM 0 and MCM 1) highlighting the L2.5, L2.75, L3, and L3.5 events.]


Data Collection

Performance Monitoring Unit (PMU)
• Special-purpose registers
• Programming interface
  • Kernel extension

eprof
• PMU configuration
• Event-based sampling


Sampled Event Trace

10-minute observation interval
• Record periodic occurrences of an event
• 100 events/sec/CPU

Event record

PID     TID     Timestamp    Effective Instruction Address  Effective Data Address
372872  184469  0.328104637  000000000000A8C4               0000000000218880

Average number of samples collected/event
• 238,448 for 8-processor data
• 212,396 for 32-processor data
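The event record above can be read into a structured form before any analysis. The following is a minimal sketch, assuming the five fields are whitespace-separated, that PID and TID are decimal, that the timestamp is in seconds, and that both addresses are hexadecimal. The `Sample` class and `parse_record` function are hypothetical names introduced for illustration; they are not part of eprof.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    pid: int          # process ID (assumed decimal)
    tid: int          # thread ID (assumed decimal)
    timestamp: float  # assumed to be seconds
    inst_addr: int    # effective instruction address
    data_addr: int    # effective data address


def parse_record(line: str) -> Sample:
    """Split one trace line into its five fields and convert each one."""
    pid, tid, ts, inst_addr, data_addr = line.split()
    return Sample(int(pid), int(tid), float(ts),
                  int(inst_addr, 16), int(data_addr, 16))


# The example record shown on the slide.
record = "372872 184469 0.328104637 000000000000A8C4 0000000000218880"
sample = parse_record(record)
print(sample)
```

Keeping the addresses as integers makes the later steps (mapping an address to a region, or to a cache line) simple integer arithmetic.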


Analysis

• Memory Hotspots

• Individual Address Region

• Process Migration


Memory Hotspots

• L3 and Memory are the most active memory levels

• Counted total number of L3 hits

• Counted number of L3 hits per address region

• Counted number of unique cache lines referenced per region
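The counting steps above can be sketched as follows. This is an illustration only: the region names and address boundaries in `REGIONS` are hypothetical placeholders (the real layout comes from the monitored system), the 128-byte cache line size is an assumption, and `region_of` and `hotspot_summary` are names invented for this sketch.

```python
from collections import defaultdict

CACHE_LINE_BYTES = 128  # assumed cache line size

# Hypothetical region table: (name, start, end), end exclusive.
REGIONS = [
    ("Text", 0x0000_1000, 0x0010_0000),
    ("BufferPool", 0x0020_0000, 0x0040_0000),
]


def region_of(addr):
    """Return the name of the region containing addr, or 'Other'."""
    for name, start, end in REGIONS:
        if start <= addr < end:
            return name
    return "Other"


def hotspot_summary(data_addrs):
    """Map region -> (fraction of all sampled hits, unique cache lines)."""
    hits = defaultdict(int)
    lines = defaultdict(set)
    for addr in data_addrs:
        region = region_of(addr)
        hits[region] += 1
        lines[region].add(addr // CACHE_LINE_BYTES)
    total = len(data_addrs)
    return {r: (hits[r] / total, len(lines[r])) for r in hits}


# Effective data addresses taken from a (made-up) sampled L3-hit trace.
l3_hit_addrs = [0x218880, 0x218888, 0x218900, 0x00A8C4]
summary = hotspot_summary(l3_hit_addrs)
for region, (fraction, unique_lines) in summary.items():
    print(region, fraction, unique_lines)
```

A region with a large share of hits but few unique cache lines points to a small set of heavily referenced addresses.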


Memory Hotspots

[Figure: Distribution of L3 Data Load Hits. Bar chart; one axis: address region (Kernel, Text, Data/BSS/Heap, BufferPool, Stack, Ublock & KernelStack, M_BUF, KERN_HEAP); other axis: fraction of data loads (0 to 0.5); series: Unique cache line, Hit %.]


Individual Address Region

• We can look at an address region in more detail

• Looked at Buffer Pool region

• Counted number of references per memory level

• Counted number of unique cache lines referenced per memory level
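The per-level counts for one region can be sketched in the same way. The sketch assumes one sampled trace per monitored event, given here as a dictionary from event name to a list of effective data addresses. The region bounds passed in, the 128-byte cache line size, and the function name `per_level_profile` are all assumptions made for illustration.

```python
CACHE_LINE_BYTES = 128  # assumed cache line size

# Monitored events, in the order listed earlier in the talk.
LEVELS = ["L2.5", "L2.75", "L3", "L3.5", "MEM"]


def per_level_profile(traces, start, end):
    """Map level -> (references, unique cache lines) inside [start, end)."""
    profile = {}
    for level in LEVELS:
        in_region = [a for a in traces.get(level, []) if start <= a < end]
        unique_lines = {a // CACHE_LINE_BYTES for a in in_region}
        profile[level] = (len(in_region), len(unique_lines))
    return profile


# Made-up data addresses; the region bounds below are hypothetical.
traces = {
    "L3": [0x218880, 0x218880, 0x218900, 0x500000],
    "MEM": [0x300000],
}
profile = per_level_profile(traces, 0x200000, 0x400000)
for level in LEVELS:
    print(level, profile[level])
```

Comparing references against unique cache lines at each level shows whether a level's traffic comes from many distinct lines or from a few lines referenced repeatedly.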


Individual Address Region

[Figure: Distribution of Data Load Hits: BUFFER_POOL. Bar chart; x-axis: event name (L2, L2.5 MOD, L2.75 MOD, L3, L3.5, MEM); y-axis: count (0 to 120,000); series: DataLoadHits, UniqueCacheLines.]


Process Migration

• Process migration from one chip to another can degrade performance when all or part of the process' working set must follow, via L2-cache misses

• Looked at 885 threads

• Counted number of migrations per thread

• Counted number of L2.5 hits per thread
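The per-thread migration count can be sketched as below. The event record shown earlier has no CPU field, so this sketch assumes each sample can be tagged with the CPU it was recorded on. It further assumes CPUs are numbered consecutively with two cores per chip and eight per MCM. These mappings, and the function name `intra_mcm_migrations`, are assumptions for illustration and not the method used in the study.

```python
from collections import defaultdict

CORES_PER_CHIP = 2  # assumed CPU numbering: chip = cpu // 2
CORES_PER_MCM = 8   # assumed CPU numbering: mcm = cpu // 8


def intra_mcm_migrations(samples):
    """Count, per thread, moves to a different chip on the same MCM.

    samples: iterable of (tid, timestamp, cpu) tuples.
    Only moves visible between consecutive samples are counted.
    """
    last_cpu = {}
    counts = defaultdict(int)
    for tid, _ts, cpu in sorted(samples, key=lambda s: s[1]):
        prev = last_cpu.get(tid)
        if prev is not None:
            same_mcm = prev // CORES_PER_MCM == cpu // CORES_PER_MCM
            new_chip = prev // CORES_PER_CHIP != cpu // CORES_PER_CHIP
            if same_mcm and new_chip:
                counts[tid] += 1
        last_cpu[tid] = cpu
    return dict(counts)


# Made-up samples for two threads.
samples = [
    (184469, 0.10, 0),
    (184469, 0.20, 1),  # same chip: not a migration
    (184469, 0.30, 2),  # different chip, same MCM: counted
    (184469, 0.40, 9),  # different MCM: not an intra-MCM migration
    (184470, 0.15, 4),
    (184470, 0.25, 6),  # different chip, same MCM: counted
]
migrations = intra_mcm_migrations(samples)
print(migrations)
```

Because the trace is sampled, a thread that moves away and back between two samples is not seen, so these counts are a lower bound. The per-thread counts can then be set against per-thread L2.5 hit counts to look for a relationship.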


Process Migration

[Figure: 32-Way L2.5 Hits vs. Intra-MCM Migrations. Plot; x-axis: intra-MCM migrations (0 to 6,000); y-axis: L2.5 modified hits (0 to 25,000).]


Conclusions

• Only a few addresses in the Buffer Pool region are causing most of its L3 hits

• For Buffer Pool, heavily referenced shared data is constantly resolved outside an MCM

• Process migration is not a source of performance degradation


Future Work

• Quantify representativeness of sampled event traces

• Suggest more ways to improve p690 application performance

• Study sampled event traces for other workloads

• In-depth study of process characterization


Thank You!
