Performance Predictability Carole’s Group Talk on 5-13-2009.
Performance Predictability
Carole’s Group Talk on 5-13-2009
What if hmmer is the high-priority application?
[Figure: normalized IPC over the application running alone for 9 heterogeneous applications (bzip2, gcc, gobmk, hmmer, lbm, mcf, milc, perlbench, sjeng), un-managed vs. alone; y-axis from 0.5 to 1.]
Can we predict performance trends?
[Figure: performance vs. number of instructions, showing the performance upper bound of the high-priority (HP) application and the performance of the HP application while running with other non-HP applications.]
Can we predict performance trends?
[Figure: performance vs. number of instructions, showing the performance upper bound of the HP application and the performance of the HP application running with other non-HP applications, annotated with the performance improvement due to the new dynamic resource allocation setting.]
If we can accurately predict the performance trend of hmmer by determining the degree of inter-application interference at run time, we can act early!
Studies have shown that there exists a high correlation between application performance and the number of long-latency cache misses.
We can therefore use cache usage (e.g., in the last-level cache) to identify the degree of inter-application interference at run time, and from it predict the performance of applications of interest.
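As one hedged illustration of how that correlation could be exploited (the fitting approach and every number below are my own, not from the talk): fit measured IPC against long-latency miss counts over profiled intervals, then use the fit as a run-time predictor.

```python
# Hypothetical sketch (not the talk's mechanism): exploit the correlation
# between IPC and long-latency LLC misses with a least-squares fit.
# All profiled numbers below are invented for illustration.

def fit_linear(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# (LLC misses per interval, measured IPC) pairs from a profiling phase.
samples = [(1000, 0.95), (5000, 0.80), (9000, 0.62), (13000, 0.47)]
a, b = fit_linear([m for m, _ in samples], [i for _, i in samples])

def predict_ipc(llc_misses):
    """Predict IPC for an interval from its observed LLC miss count."""
    return a * llc_misses + b
```

With the invented samples the slope comes out negative, matching the intuition that more long-latency misses mean lower IPC.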
Outline
• Motivation
• Observation Cache & Observation Sets
• Experimental Setup
• Results
• Conclusion
How do we identify inter-application conflict misses?
[Figure: a cache set holds non-HP lines E, F, G, H and HP lines A, B, C, D (each half ordered LRU to MRU), alongside an observation cache; incoming memory reference: I. A, B, C, D, I are HP references; E, F, G, H are non-HP references.]
How do we identify inter-application conflict misses?
[Figure: the set now holds F, G, H, I and A, B, C, D; incoming memory reference: A. A, B, C, D, I are HP references; E, F, G, H are non-HP references.]
Is A an inter-application conflict miss? No!
How do we identify inter-application conflict misses?
[Figure: the set holds F, G, H, I and A, B, C, D, with a per-set counter of non-HP cache lines (here 3), indexed 0 to set_assoc-1, alongside the observation cache. A, B, C, D, I are HP references; E, F, G, H are non-HP references.]

if ((set_assoc - lru_hit_cnt) <= num_non_HP_lines) {
    inter_app_miss++;
    de_allocate(address);
} else {
    non_inter_app_miss++;
}
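The classification idea on these slides can be sketched in software. The following is a minimal simulation under my own simplifying assumptions (4-way sets for brevity, and the rule reduced to: an HP reference that misses in the shared set but hits in an HP-only observation set counts as an inter-application conflict miss, while a reference that misses in both, e.g. a cold miss, does not):

```python
# Simplified model (my own reading of the slides, not the exact hardware):
# the shared set sees all references, while the observation set sees only
# high-priority (HP) references.

class LRUSet:
    """One cache set modelled as an LRU stack (index 0 = MRU)."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.stack = []

    def access(self, tag):
        """Touch `tag`; return True on hit, False on miss (with fill)."""
        hit = tag in self.stack
        if hit:
            self.stack.remove(tag)
        elif len(self.stack) == self.assoc:
            self.stack.pop()          # evict the true LRU line
        self.stack.insert(0, tag)
        return hit

shared = LRUSet(4)       # sees both HP and non-HP references
observation = LRUSet(4)  # sees HP references only

inter_app_miss = non_inter_app_miss = 0

# A stream in the spirit of the slides: HP lines A-D are pushed out of
# the shared set by non-HP lines E-H, then the HP application re-touches A.
trace = [("A", True), ("B", True), ("C", True), ("D", True),
         ("E", False), ("F", False), ("G", False), ("H", False),
         ("A", True)]

for tag, is_hp in trace:
    if not is_hp:
        shared.access(tag)
        continue
    hit_shared = shared.access(tag)
    hit_observation = observation.access(tag)
    if not hit_shared:
        if hit_observation:
            inter_app_miss += 1       # HP alone would have hit
        else:
            non_inter_app_miss += 1   # misses even with the set to itself
# After the trace: inter_app_miss == 1 (the re-touch of A) and
# non_inter_app_miss == 4 (the four cold misses).
```

The re-touch of A is classified as an inter-application conflict miss precisely because A is still resident in the HP-only observation set.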
Do we really need all this crazy hardware?
• 4MB observation cache + 4096 32-bit counters (one per cache set)
Hopefully not!!
Approach 1: Dynamic profiling within way-partitioning infrastructure
Given a way-partitioned 16-way set-associative cache in which n ways (n < 8) are dedicated to the HP application, use the n ways under the HP application's exclusive access as the observation cache to measure the degree of inter-application interference, and the remaining (16 - n) ways as the normal shared cache.
Approach 1: Dynamic profiling within way-partitioning infrastructure
[Figure: the shared ways hold non-HP lines E, F, G, I while the HP-exclusive ways, acting as the observation cache, hold C, B, A (MRU to LRU); incoming memory reference: H.]
Approach 2: Set sampling
There are 4096 sets in a 16-way set-associative, 4MB cache with 64-byte lines. Spare a small part of the cache as "leader sets" and use the rest as "follower sets" [Utility-Based Cache Partitioning, Qureshi and Patt, MICRO '06].
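The sampling arithmetic can be sketched as follows (a minimal model, not the talk's hardware; the per-set miss counts are invented): monitor misses only in the leader sets and scale up by the sampling factor.

```python
# Minimal model of set sampling. Cache geometry follows the talk's
# 4MB / 16-way / 64B configuration; the SAMPLE_EVERY choice and the
# per-set miss counts are illustrative assumptions.

NUM_SETS = 4096
SAMPLE_EVERY = 32                                 # one leader set per 32
leader_sets = range(0, NUM_SETS, SAMPLE_EVERY)    # 128 leader sets

def estimate_total_misses(per_set_misses):
    """Extrapolate whole-cache misses from leader-set misses alone."""
    leader_misses = sum(per_set_misses[s] for s in leader_sets)
    return leader_misses * SAMPLE_EVERY

# Illustrative trace in which every set happened to see 10 misses:
per_set = [10] * NUM_SETS
estimate = estimate_total_misses(per_set)
# Here the estimate (128 * 10 * 32 = 40960) matches the true total
# exactly; for real, non-uniform traces it is an approximation.
```

The appeal is that only the 128 leader sets need miss counters, instead of one counter per set.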
Approach 2: Set sampling, version 1
[Figure: two copies of the same leader set (MRU to LRU) observe the memory reference stream E9, F4, G3, I2, A0, B5, H8, D9: a shared copy (E1, E2, G8, P1, E4, F6, G1, P5) with miss counter m1', and an HP-exclusive copy (P1, P5, P7, P9, P2, P3, P8, P0) with miss counter m1. Pi are HP references; Ei, Gi are non-HP references.]
The number of inter-application conflict misses for the high-priority application is (m1' - m1)!
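The m1/m1' bookkeeping can be sketched as follows, assuming plain LRU sets (the tags and the reference stream are illustrative, not the slide's exact data): replay the same leader set once HP-only and once shared, and subtract.

```python
# Sketch of version 1's bookkeeping: the HP-exclusive copy of a leader
# set counts m1 misses, the shared copy counts m1' HP misses, and
# m1' - m1 is the inter-application conflict-miss count.

def hp_misses(trace, assoc, hp_only):
    """Replay (tag, is_hp) pairs through one LRU set; count HP misses.
    If hp_only, non-HP references never reach the set."""
    stack, misses = [], 0
    for tag, is_hp in trace:
        if hp_only and not is_hp:
            continue
        if tag in stack:
            stack.remove(tag)         # hit: moved to MRU below
        else:
            if is_hp:
                misses += 1
            if len(stack) == assoc:
                stack.pop()           # evict the LRU line
        stack.insert(0, tag)
    return misses

# Illustrative stream: Pi are HP references, Ei are non-HP references.
trace = [("P1", True), ("P2", True), ("E1", False), ("E2", False),
         ("E3", False), ("P1", True), ("P2", True)]

m1 = hp_misses(trace, assoc=4, hp_only=True)         # 2 cold misses
m1_prime = hp_misses(trace, assoc=4, hp_only=False)  # 4 HP misses
inter_app_conflicts = m1_prime - m1                  # 2 conflict misses
```

In this stream, the non-HP lines E1-E3 push P1 and P2 out of the shared copy, so their re-touches miss there but not in the HP-exclusive copy.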
Approach 2: Set sampling, version 2
[Figure: a leader set (E1, E2, G8, P4, E4, F6, G1, P6, MRU to LRU) extended with an observation set (P1, P5, P7, P9, P2, P3, P8, P0) and a counter of non-HP cache lines (here 6); incoming memory references E, F, G, I and A, B, H, D. Pi are HP references; Ei, Gi are non-HP references.]
Outline
• Motivation
• Observation Cache & Observation Sets
• Experimental Setup
• Results
• Conclusion
Experimental Setup
• GEMS: Simics (in-order) + Ruby (memory module)
• 8-core CMP
• Private L1 cache per core (32KB, 4-way set-associative, 64-byte cache lines)
• Shared L2 cache (4MB, 16-way set-associative, 64-byte cache lines; 4096 cache sets)
• 4 SPEC2006 applications in the workload: bzip2, mcf, gobmk, and hmmer [10 billion cycles]
Miss Identification for Applications in the Workload
[Figure: four plots of number of misses vs. time (cycles) for the bmgh workload (bzip2, mcf, gobmk, hmmer), each showing InterApp_Misses and NonInterApp_Misses.]
How well do observation sets perform [10%]?
[Figure: number of misses and error rate (%) vs. time (cycles) for bzip2 and mcf, comparing InterApp_Misses with InterApp_Miss (Observation Sets).]
How well do observation sets perform [10%]?
[Figure: number of misses and error rate (%) vs. time (cycles) for gobmk and hmmer, comparing InterApp_Misses with InterApp_Miss (Observation Sets).]
# of observation sets vs. prediction accuracy
[Figure: error rate (%) of the observation-set prediction for bzip2, mcf, gobmk, hmmer, and their average, at sampling fractions of 10%, 40%, 64%, 67%, and 100%; y-axis from 0 to 10.]
10% of the 4MB cache as observation sets offers >99% prediction accuracy.
1% of the 4MB cache as observation sets offers >97% prediction accuracy.
Can we accurately predict the performance trend by determining the degree of inter-application interference at run time?
Hopefully, I have convinced you that it is possible!
Conclusion
Dynamic profiling is powerful and feasible: no pre-run (static profiling) ever!
Accurate run-time prediction of performance trends provides:
• Performance predictability
• Quality of service
• Better resource allocation decisions
With 10% of the 4MB cache as observation sets, we can achieve >99% prediction accuracy; with 1% of the 4MB cache, >97% prediction accuracy.
Still awake?
Thank you
Next…
Summer internship with Google on power usage prediction algorithms for Google's data centers. You are welcome to visit me in Mountain View, CA!!!
Cache Occupancy per Application
[Figure: cache occupancy (%) vs. time (cycles) for the bmgh workload, broken down by bzip2, mcf, gobmk, hmmer, and OS activities; y-axis from 0 to 100.]