Platform-Independent Cache Optimization by Pinpointing Low-Locality Reuse

Kristof Beyls and Erik D’Hollander

International Conference on Computational Science

June 2004

Overview

1. Introduction

2. Reuse Distance

3. Optimizing Cache Behavior: By Hardware, Compiler, or Programmer?

4. Visualization

5. Case Studies

6. Conclusion

1.a Introduction

Anti-law of Moore (2003: half of execution time is lost due to data cache misses)

[Figure: relative speed versus 1980 (log scale, 1 to 1000) of processor and memory, 1980 to 2000; the widening difference between the two curves is labeled "Speed Gap".]

1.b Observation: Capacity misses dominate

3 kinds of cache misses (3 C’s): Cold, Conflict, Capacity

[Figure: percentage of capacity misses (SPEC2000, Cantin and Hill), 0% to 100%, for cache sizes 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1024K and associativities 1, 2, 4, 8, full.]

2.a Reuse Distance

Definition: The reuse distance of a memory access is the number of unique memory locations accessed since the previous access to the same data.

address:   A  B  C  A  B  B  A
distance:  ∞  ∞  ∞  2  2  0  1
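The definition can be checked against the example trace A B C A B B A with a small sketch. The function name `reuse_distance`, the `RD_INFINITE` marker, and the use of single characters as addresses are illustrative assumptions, not part of the presentation; the quadratic scan is chosen for readability and is not how a profiling tool would measure it.

```c
#include <assert.h>
#include <stddef.h>

/* Marker for a first-time access (written as ∞ on the slide). Assumed name. */
#define RD_INFINITE ((size_t)-1)

/*
 * Reuse distance of the access at index pos in trace:
 * the number of distinct addresses touched strictly between the
 * previous access to trace[pos] and the access at pos itself.
 * Naive O(n^2) scan, for illustration only.
 */
size_t reuse_distance(const char *trace, size_t pos)
{
    assert(trace != NULL);

    /* Walk backwards to find the previous access to the same address. */
    size_t prev = pos;
    while (prev > 0 && trace[prev - 1] != trace[pos])
        prev--;
    if (prev == 0)
        return RD_INFINITE;          /* address was never accessed before */
    prev--;                          /* index of the previous access */

    /* Count distinct addresses in trace[prev+1 .. pos-1]. */
    size_t distinct = 0;
    for (size_t i = prev + 1; i < pos; i++) {
        int seen = 0;
        for (size_t j = prev + 1; j < i; j++) {
            if (trace[j] == trace[i]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

Calling `reuse_distance("ABCABBA", k)` for k = 0..6 reproduces the distance row of the example: the first three accesses are first-time accesses, followed by 2, 2, 0 and 1.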

2.b Reuse Distance - property

Lemma: In a fully associative LRU cache with n lines, an access hits the cache if and only if its reuse distance < n.

Consequence: In every cache with n lines, a cache miss with distance d is:

Cold miss: d = ∞

Capacity miss: n ≤ d < ∞

Conflict miss: d < n
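The lemma and the resulting classification translate directly into two small predicates. This is a sketch only: the names `hits_fully_assoc_lru`, `classify_miss`, `miss_kind` and `RD_INFINITE` are assumptions introduced here, and `classify_miss` is meant to be applied only to accesses already known to have missed in the real cache.

```c
#include <assert.h>
#include <stddef.h>

/* Marker for a first-time access (d = ∞). Assumed name. */
#define RD_INFINITE ((size_t)-1)

typedef enum { MISS_COLD, MISS_CAPACITY, MISS_CONFLICT } miss_kind;

/* Lemma: a fully associative LRU cache with n lines hits iff d < n. */
int hits_fully_assoc_lru(size_t d, size_t n_lines)
{
    assert(n_lines > 0);
    return d != RD_INFINITE && d < n_lines;
}

/* Classify an access that missed in some cache with n lines. */
miss_kind classify_miss(size_t d, size_t n_lines)
{
    assert(n_lines > 0);
    if (d == RD_INFINITE)
        return MISS_COLD;        /* first access to this data */
    if (d >= n_lines)
        return MISS_CAPACITY;    /* would miss even with full associativity */
    return MISS_CONFLICT;        /* fully associative LRU would have hit */
}
```

Because the classification depends only on d and n, one measured reuse distance distribution can be reinterpreted for any cache size.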

2.c Reuse distance histogram Spec95fp

[Figure: histogram of accesses (billions, 0 to 30) versus log2(reuse distance) (0 to 20), split into hits and misses.]

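2.d Classifying cache misses SPEC95fp

[Figure: histogram of accesses (billions, 0 to 3) versus log2(reuse distance) (0 to 20), split into hits and misses; a "Cache size" marker separates the conflict region (shorter distances) from the capacity region (longer distances).]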

2.e Reuse distance vs. cache hit probability

[Figure: hit percentage (0 to 100) versus log2(reuse distance) (0 to 20), for direct mapped and fully associative caches.]

3.a Removing capacity misses

Reuse distance must be smaller than cache size (CS):

1. Hardware: enlarge the cache
2. Compiler: loop tiling, loop fusion
3. Algorithm

[Figure: histogram of accesses (billions, 0 to 3) versus log2(reuse distance) (0 to 20), split into hits and misses, with cache size (CS) markers.]
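As an illustration of how loop tiling shortens reuse distances, the sketch below tiles a matrix transpose. It is not taken from the presentation: the function names and the `TILE` value are assumptions, and a real tile size would be tuned so that the working set of one tile fits in the target cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge length, not a value from the presentation. */
#define TILE 32

/*
 * Untiled transpose. dst is written column-wise, so a cache line of dst
 * is touched again only after a whole row of src has been processed.
 * For large n that reuse distance exceeds the cache size.
 */
void transpose_naive(double *dst, const double *src, size_t n)
{
    assert(dst != NULL && src != NULL && dst != src);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/*
 * Tiled transpose. The same assignments are executed in a different order:
 * one TILE x TILE block at a time, so a line of dst is reused after at most
 * one tile row instead of one full matrix row.
 */
void transpose_tiled(double *dst, const double *src, size_t n)
{
    assert(dst != NULL && src != NULL && dst != src);
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            size_t i_end = (ii + TILE < n) ? ii + TILE : n;
            size_t j_end = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Both functions produce the same result; only the order of the accesses, and therefore the reuse distance at cache-line granularity, differs.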

3.b Compiler optimizations: SGIpro for Itanium (spec95fp)

[Figure: number of misses (0E+0 to 3E+9) versus log2(reuse distance) (0 to 20), original versus after optimization, divided into conflict and capacity regions.]

30% of conflict misses eliminated, 1% of capacity misses eliminated.

4.a Objectives for cache visualization

• Cache behavior is shown in the source code.
• Cache behavior is presented accurately and concisely.
• Independent of specific cache parameters (e.g. size, associativity, …).

Reuse distance makes it possible to meet the above objectives.

4.b Example: MCF

4.c Example: MCF

for( ; arc < stop_arcs; arc += nr_group ) {
    if( arc->ident > BASIC ) {
        red_cost = bea_compute_red_cost( arc );
        if( red_cost<0 && arc->ident == AT_LOWER ||
            red_cost>0 && arc->ident == AT_UPPER ) {
            basket_size++;
            perm[basket_size]->a = arc;
            perm[basket_size]->cost = red_cost;
            perm[basket_size]->abs_cost = ABS(red_cost);
        }
    }
}

68% of capacity misses

68.3% / sl=21%

5.a Optimization: classification

1. Eliminate memory accesses with poor locality (+++)

2. Reduce reuse distance (keep data in cache between use and reuse) (++)

3. Increase spatial locality (++)

4. Hide latency by prefetching (+)

5.b 3 case studies

From Spec2000, with large memory bottleneck:

• Mcf (90%): optimization of bus schedule
• Art (87%): simulation of neural network
• Equake (66%): simulation of earthquake

Percentage of execution time that the processor is stalled waiting for data from memory and cache (Itanium 1, 733 MHz).

5.c Equake

For every time step:

• Sparse matrix-vector multiplication
• Vector rescaling

Optimizations:

1. Long reuse distance between consecutive time steps:
   • Shorten distance by performing multiple time steps on limited part of matrix.
2. Eliminated memory accesses:
   • K[Anext][i][j] (3 accesses) becomes K[Anext*N*9 + 3*i + j] (1 access)
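Why the nested form costs three memory accesses and the flat form one can be seen in a small sketch. The names `load_nested` and `load_flat` are hypothetical, and the sketch assumes that every block is a dense 3x3 matrix stored back to back, which gives the index `b*9 + 3*i + j`; the benchmark's actual layout, and hence the exact index expression quoted above, may differ.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pointer-based layout: K[b] points to an array of three row pointers.
 * Reading K[b][i][j] performs three dependent loads:
 * the block pointer K[b], the row pointer K[b][i], then the element.
 */
double load_nested(double ***K, size_t b, size_t i, size_t j)
{
    assert(K != NULL && i < 3 && j < 3);
    return K[b][i][j];
}

/*
 * Flat layout (assumed here): all 3x3 blocks are stored contiguously.
 * The address is computed arithmetically, so only the element is loaded.
 */
double load_flat(const double *K, size_t b, size_t i, size_t j)
{
    assert(K != NULL && i < 3 && j < 3);
    return K[b * 9 + 3 * i + j];
}
```

Besides removing two loads per element, the flat form drops the pointer arrays from the working set, so fewer distinct cache lines are touched between reuses.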

5.d Art (neural network)

• Poor spatial locality (0% - 20%)

• A neuron is a C structure containing 8 fields. Every loop updates one field of each neuron.

Before:

typedef struct {
    double I;
    …
    double R;
} f1_neuron;

f1_neuron f1_layer[N];

f1_layer[y].W

After:

typedef struct {
    double* I;
    …
    double* R;
} f1_neurons;

f1_neurons f1_layer;

f1_layer.W[y]
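The effect of this layout change on a loop that updates a single field can be sketched as follows. The type and function names, and the six field names other than I and R, are placeholders chosen for illustration; only the count of 8 fields and the two access forms come from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures layout: 8 doubles per neuron (field names assumed). */
typedef struct {
    double I, W, X, V, U, P, Q, R;
} neuron_aos;

/* Structure-of-arrays layout: one array per field (field names assumed). */
typedef struct {
    double *I, *W, *X, *V, *U, *P, *Q, *R;
} neurons_soa;

/*
 * Updates W of every neuron. Consecutive W values are 8 doubles apart,
 * so only one in eight of the doubles brought into the cache is used.
 */
void scale_w_aos(neuron_aos *layer, size_t n, double s)
{
    assert(layer != NULL || n == 0);
    for (size_t y = 0; y < n; y++)
        layer[y].W *= s;
}

/*
 * Same update on the transformed layout. The W values are contiguous,
 * so every double in a fetched cache line is used.
 */
void scale_w_soa(neurons_soa *layer, size_t n, double s)
{
    assert(layer != NULL && (layer->W != NULL || n == 0));
    for (size_t y = 0; y < n; y++)
        layer->W[y] *= s;
}
```

Using one field out of eight corresponds to 12.5% of the fetched data, which is in line with the 0% - 20% spatial locality reported above.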

5.e Mcf

• Reordering of accesses is hard.

• Therefore: prefetching

#define PREFETCH_DISTANCE 8

for( ; arc < stop_arcs; arc += nr_group ) {
    /* Request the arc that will be visited PREFETCH_DISTANCE iterations from now. */
    PREFETCH(arc+nr_group*PREFETCH_DISTANCE);
    if( arc->ident > BASIC ) {
        red_cost = bea_compute_red_cost( arc );
        if( red_cost<0 && arc->ident == AT_LOWER ||
            red_cost>0 && arc->ident == AT_UPPER ) {
            basket_size++;
            perm[basket_size]->a = arc;
            perm[basket_size]->cost = red_cost;
            perm[basket_size]->abs_cost = ABS(red_cost);
        }
    }
}
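The slide does not show how `PREFETCH` is defined. One possible definition, assuming GCC or Clang, maps it to the `__builtin_prefetch` builtin; the sketch below applies it to a simplified strided loop. The `item` type and `sum_strided` function are hypothetical stand-ins for the arc traversal, and the bounds check is an addition that keeps the pointer arithmetic inside the array.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition: GCC/Clang builtin, otherwise a no-op. */
#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)0)
#endif

/* Number of iterations to prefetch ahead; same value as on the slide. */
#define PREFETCH_DISTANCE 8

/* Hypothetical record type standing in for an arc. */
typedef struct {
    long cost;
    int ident;
} item;

/* Sums the cost of every stride-th item, prefetching ahead of the loop. */
long sum_strided(const item *items, size_t count, size_t stride)
{
    assert(stride > 0);
    long total = 0;
    for (size_t i = 0; i < count; i += stride) {
        /* A prefetch is only a hint: it changes timing, never the result. */
        size_t ahead = i + stride * PREFETCH_DISTANCE;
        if (ahead < count)
            PREFETCH(&items[ahead]);
        total += items[i].cost;
    }
    return total;
}
```

Prefetching does not shorten the reuse distance; it overlaps the miss latency with useful work, which is why the classification rates it below the other three techniques.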

5.f Measurements

[Figure: speedup (0 to 12) of mcf, art and equake on AthlonXP, Alpha, Itanium, and on average.]

Processor     L1 size  L1 assoc  L2 size  L2 assoc  L3 size  L3 assoc
AthlonXP      64K      2         256K     16
Itanium       16K      4         96K      6         2M       4
Alpha 21264   64K      2         8M       1

Processor   Compiler
AthlonXP    icc -O3
Itanium     ecc -O3
Alpha       cc -O5

5.g Reuse Distance Histograms

[Figure: Art: number of accesses (billions, 0 to 6) versus log2(reuse distance) (0 to 17), original versus optimized.]

[Figure: Equake: number of accesses (billions, 0 to 10) versus log2(reuse distance) (0 to 21), original versus optimized.]

6. Conclusion

• Reuse distance predicts cache behavior accurately.
• Compiler optimizations are not powerful enough to remove a substantial portion of the capacity misses.
• The programmer often has a global overview of program behavior. However, cache behavior is invisible in source code, hence the need for visualization.
• Mcf, Art, Equake: 3x faster on average, on different CISC/RISC/EPIC platforms, with identical source code optimizations.
• Visualization of reuse distance enables portable and platform-independent cache optimizations.

Questions?