An Experimental Comparison of Empirical and Model-based Optimization

49
An Experimental Comparison of Empirical and Model-based Optimization Keshav Pingali Cornell University Joint work with: Kamen Yotov 2 ,Xiaoming Li 1 , Gang Ren 1 , Michael Cibulskis 1 , Gerald DeJong 1 , Maria Garzaran 1 , David Padua 1 , Paul Stodghill 2 , Peng Wu 3 1 UIUC, 2 Cornell University, 3 IBM T.J.Watson

description

An Experimental Comparison of Empirical and Model-based Optimization. Keshav Pingali Cornell University Joint work with: Kamen Yotov 2 ,Xiaoming Li 1 , Gang Ren 1 , Michael Cibulskis 1 , Gerald DeJong 1 , Maria Garzaran 1 , David Padua 1 , Paul Stodghill 2 , Peng Wu 3 - PowerPoint PPT Presentation

Transcript of An Experimental Comparison of Empirical and Model-based Optimization

Page 1: An Experimental Comparison of  Empirical and Model-based Optimization

An Experimental Comparison of Empirical and Model-based Optimization

Keshav PingaliCornell University

Joint work with:Kamen Yotov2 ,Xiaoming Li1, Gang Ren1, Michael Cibulskis1,Gerald DeJong1, Maria Garzaran1, David Padua1,Paul Stodghill2, Peng Wu3

1UIUC, 2Cornell University, 3IBM T.J.Watson

Page 2: An Experimental Comparison of  Empirical and Model-based Optimization

Context: High-performance libraries Traditional approach

Hand-optimized code: (e.g.) BLAS Problem: tedious to develop

Alternatives: Restructuring compilers

General purpose: generate code from high-level specifications Use architectural models to determine optimization parameters Performance of optimized code is not satisfactory

Library generators Problem-specific: (e.g.) ATLAS for BLAS, FFTW for FFT Use empirical optimization to determine optimization parameters Believed to produce optimal code

Why are library generators beating compilers?

Page 3: An Experimental Comparison of  Empirical and Model-based Optimization

How important is empirical search? Model-based optimization and empirical

optimization are not in conflict Use models to prune search space

Search intelligent search Use empirical observations to refine models

Understand what essential aspects of reality are missing from model and refine model appropriately

Multiple models are fine Learn from experience

Page 4: An Experimental Comparison of  Empirical and Model-based Optimization

Previous work

Compared performance of code generated by a sophisticated compiler like SGI

MIPSpro code generated by ATLAS found ATLAS code is better

Hard to answer why Perhaps ATLAS is effectively doing transformations that

compilers do not know about Phase-ordering problem: perhaps compilers are doing

transformations in wrong order Perhaps parameters to transformations are chosen sub-

optimally by compiler using models

Page 5: An Experimental Comparison of  Empirical and Model-based Optimization

Our Approach

DetectHardware

Parameters

ATLAS SearchEngine

(MMSearch)NR

MulAddLatency

L1SizeATLAS MM

Code Generator(MMCase)

xFetchMulAddLatency

NBMU,NU,KU MiniMMM

Source

Compile,Execute,Measure

MFLOPS

DetectHardware

ParametersModelNR

MulAddLatency

L1I$Size ATLAS MMCode Generator

(MMCase)xFetchMulAddLatency

NBMU,NU,KU MiniMMM

Source

L1Size

Original ATLAS Infrastructure

Model-Based ATLAS Infrastructure

DetectHardware

Parameters

DetectHardware

Parameters

ATLAS MMCode Generator

(MMCase)

ATLAS MMCode Generator

(MMCase)

ATLAS SearchEngine

(MMSearch)

Model

Page 6: An Experimental Comparison of  Empirical and Model-based Optimization

Detecting Machine Parameters Micro-benchmarks

L1Size: L1 Data Cache size Similar to Hennessy-Patterson book

NR: Number of registers MulAdd: Fused Multiply Add (FMA)

“c+=a*b” as opposed to “c+=t; t=a*b” Latency: Latency of FP Multiplication

Page 7: An Experimental Comparison of  Empirical and Model-based Optimization

Code Generation: Compiler View

DetectHardware

Parameters

ATLAS SearchEngine

(MMSearch)NR

MulAddLatency

L1SizeATLAS MM

Code Generator(MMCase)

xFetchMulAddLatency

NBMU,NU,KU MiniMMM

Source

Compile,Execute,Measure

MFLOPS

DetectHardware

ParametersModelNR

MulAddLatency

L1I$Size ATLAS MMCode Generator

(MMCase)xFetchMulAddLatency

NBMU,NU,KU MiniMMM

Source

L1Size

Original ATLAS Infrastructure

Model-Based ATLAS Infrastructure

Page 8: An Experimental Comparison of  Empirical and Model-based Optimization

BLAS

Let us focus on BLAS-3 Code for MMM: for (int i = 0; i < M; i++) for (int j = 0; j < N; j++) for (int k = 0; k < K; k++) C[i][j] += A[i][k]*B[k][j]

Properties Very good reuse: O(N2) data, O(N3) computation Many optimization opportunities

Few “real” dependencies Will run poorly on modern machines

Poor use of cache and registers Poor use of processor pipelines

CBAC

Page 9: An Experimental Comparison of  Empirical and Model-based Optimization

Optimizations

Cache-level blocking (tiling) Atlas blocks only for L1 cache

Register-level blocking Highest level of memory hierarchy Important to hold array values in registers

Software pipelining Unroll and schedule operations

Versioning Dynamically decide which way to compute

Back-end compiler optimizations Scalar Optimizations Instruction Scheduling

Page 10: An Experimental Comparison of  Empirical and Model-based Optimization

Cache-level blocking (tiling) Tiling in ATLAS

Only square tiles (NBxNBxNB)

Working set of tile fits in L1 Tiles are usually copied to

continuous storage Special “clean-up” code

generated for boundaries Mini-MMM

for (int j = 0; j < NB; j++) for (int i = 0; i < NB; i++) for (int k = 0; k < NB; k++) C[i][j] += A[i][k] * B[k][j]

NB: Optimization parameter

B

N

M

A C

NB

NB

K

K

Page 11: An Experimental Comparison of  Empirical and Model-based Optimization

Short excursion into tiling

Page 12: An Experimental Comparison of  Empirical and Model-based Optimization

MMM miss ratio L1 Cache Miss Ratio for Intel Pentium III

– MMM with N = 1…1300– 16KB 32B/Block 4-way 8-byte elements

Page 13: An Experimental Comparison of  Empirical and Model-based Optimization

IJK version (large cache)

DO I = 1, N//row-major storage DO J = 1, N DO K = 1, N C(I,J) = C(I,J) + A(I,K)*B(K,J)

Large cache scenario: Matrices are small enough to fit into cache Only cold misses, no capacity misses Miss ratio:

Data size = 3 N2

Each miss brings in b floating-point numbers Miss ratio = 3 N2 /b*4N3 = 0.75/bN = 0.019 (b = 4,N=10)

C

B

A

K

K

Page 14: An Experimental Comparison of  Empirical and Model-based Optimization

IJK version (small cache)

DO I = 1, N DO J = 1, N DO K = 1, N C(I,J) = C(I,J) + A(I,K)*B(K,J)

Small cache scenario: Matrices are large compared to cache

reuse distance is not O(1) => miss Cold and capacity misses Miss ratio:

C: N2/b misses (good temporal locality) A: N3 /b misses (good spatial locality) B: N3 misses (poor temporal and spatial locality) Miss ratio 0.25 (b+1)/b = 0.3125 (for b = 4)

C

B

A

K

K

Page 15: An Experimental Comparison of  Empirical and Model-based Optimization

MMM experiments L1 Cache Miss Ratio for Intel Pentium III

– MMM with N = 1…1300– 16KB 32B/Block 4-way 8-byte elements

Tile size

Page 16: An Experimental Comparison of  Empirical and Model-based Optimization

Register-level blocking

Micro-MMM A: MUx1 B: 1xNU C: MUxNU MUxNU+MU+NU registers

Unroll loops by MU, NU, and KU Mini-MMM with Micro-MMM inside

for (int j = 0; j < NB; j += NU) for (int i = 0; i < NB; i += MU) load C[i..i+MU-1, j..j+NU-1] into registers for (int k = 0; k < NB; k++) load A[i..i+MU-1,k] into registers load B[k,j..j+NU-1] into registers multiply A’s and B’s and add to C’s store C[i..i+MU-1, j..j+NU-1]

MU, NU, KU: optimization parameters

B

NB

NB

A C

K

MU

NU

K

KU times

Page 17: An Experimental Comparison of  Empirical and Model-based Optimization

Scheduling

FMA Present? Schedule Computation

Using Latency Schedule Memory Operations

Using IFetch, NFetch,FFetch

Latency, xFetch: optimization parameters

M1

M2

M3

M4

MMU*NU

A1

A2

A3

A4

AMU*NU

L1

L2

L3

LMU+NU

La

ten

cy=2

A1

A2

AMU*NU

Computation

MemoryOperationsComputation

MemoryOperations

Computation

MemoryOperations

Computation

MemoryOperations

Computation

MemoryOperations

IFetch Loads

NFetch Loads

NFetch Loads

NFetch Loads

Page 18: An Experimental Comparison of  Empirical and Model-based Optimization

Comments Optimization parameters

NB: constrained by size of L1 cache MU,NU: constrained by NR KU: constrained by size of I-Cache xFetch: constrained by #of OL MulAdd/Latency: related to hardware parameters

Similar parameters would be used by compilers

parameter

MFlopsMFlops

parameter

Sensitive parameter Insensitive parameter

Page 19: An Experimental Comparison of  Empirical and Model-based Optimization

ATLAS Search

DetectHardware

Parameters

ATLAS SearchEngine

(MMSearch)NR

MulAddLatency

L1SizeATLAS MM

Code Generator(MMCase)

xFetchMulAddLatency

NBMU,NU,KU MiniMMM

Source

Compile,Execute,Measure

MFLOPS

DetectHardware

ParametersModelNR

MulAddLatency

L1I$Size ATLAS MMCode Generator

(MMCase)xFetchMulAddLatency

NBMU,NU,KU MiniMMM

Source

L1Size

Original ATLAS Infrastructure

Model-Based ATLAS Infrastructure

Page 20: An Experimental Comparison of  Empirical and Model-based Optimization

High-level picture

Multi-dimensional optimization problem: Independent parameters: NB,MU,NU,KU,… Dependent variable: MFlops Function from parameters to variables is given implicitly; can be evaluated repeatedly

One optimization strategy: orthogonal range search Optimize along one dimension at a time, using reference values for parameters not

yet optimized Not guaranteed to find optimal point, but might come close

Page 21: An Experimental Comparison of  Empirical and Model-based Optimization

Specification of OR Search

Order in which dimensions are optimized Reference values for un-optimized

dimensions at any step Interval in which range search is done for

each dimension

Page 22: An Experimental Comparison of  Empirical and Model-based Optimization

Search strategy

1. Find Best NB

2. Find Best MU & NU

3. Find Best KU

4. Find Best xFetch

5. Find Best Latency (lat)

6. Find non-copy version tile size (NCNB)

Page 23: An Experimental Comparison of  Empirical and Model-based Optimization

Find Best NB

Search in following range 16 <= NB <= 80 NB2 <= L1Size

In this search, use simple estimates for other parameters (eg) KU: Test each candidate for

Full K unrolling (KU = NB) No K unrolling (KU = 1)

Page 24: An Experimental Comparison of  Empirical and Model-based Optimization

Find best MU, NU : try all MU & NU that satisfy

In this step, use best NB from previous step

Find best KU

Find best Latency [1…6] Find best xFetch IFetch: [2,MU+NU], Nfetch:[1,MU+NU-IFetch]

Finding other parameters

1 <= MU,NU <= NB

MU*NU + MU + NU <= NR

Page 25: An Experimental Comparison of  Empirical and Model-based Optimization

Our Models

DetectHardware

Parameters

ATLAS SearchEngine

(MMSearch)NR

MulAddLatency

L1SizeATLAS MM

Code Generator(MMCase)

xFetchMulAddLatency

NBMU,NU,KU MiniMMM

Source

Compile,Execute,Measure

MFLOPS

DetectHardware

ParametersModelNR

MulAddLatency

L1I$Size ATLAS MMCode Generator

(MMCase)xFetchMulAddLatency

NBMU,NU,KU MiniMMM

Source

L1Size

Original ATLAS Infrastructure

Model-Based ATLAS Infrastructure

Page 26: An Experimental Comparison of  Empirical and Model-based Optimization

Modeling for Optimization Parameters Optimization parameters

NB Hierarchy of Models (later)

MU, NU

KU maximize subject to L1 Instruction Cache

Latency, MulAdd hardware parameters

xFetch set to 2

NRLatencyNUMUNUMU *

Page 27: An Experimental Comparison of  Empirical and Model-based Optimization

Largest NB for no capacity/conflict misses

Tiles are copied into contiguous memory Condition for cold misses only:

3*NB2 <= L1Size

A

k

B

j

k

i

NB

NBNB

NB

Page 28: An Experimental Comparison of  Empirical and Model-based Optimization

Largest NB for no capacity misses

MMM: for (int j = 0; i < N; i++)

for (int i = 0; j < N; j++) for (int k = 0; k < N; k++) c[i][j] += a[i][k] * b[k][j]

Cache model: Fully associative Line size 1 Word Optimal Replacement

Bottom line:N2+N+1<C One full matrix One row / column One element

A

M (I)

K

C

B

N (J)

K

Page 29: An Experimental Comparison of  Empirical and Model-based Optimization

Extending the Model

Line Size > 1 Spatial locality Array layout in memory matters

Bottom line: depending on loop order either or

B

C

B

NB

B

NB

1

2

B

CNB

B

NB

1

2

Page 30: An Experimental Comparison of  Empirical and Model-based Optimization

Extending the Model (cont.) LRU (not optimal replacement) MMM sample: for (int j = 0; i < N; i++)

for (int i = 0; j < N; j++) for (int k = 0; k < N; k++) c[i][j] += a[i][k] * b[k][j]

Bottom line:

jijNBNBijijiCBABABA

,,,,22,,11,

jNBjNBNBNBjNBjNB

jNBNBNBNBNB

jNB

jNB

CBABABA

CAAA

CAAA

CAAA

,,,,22,,11,

,1,12,11,1

,2,22,21,2

,1,12,11,1

B

CNB

B

NB

13

2

B

C

B

NB

B

NB

13

2

B

C

B

NBNB

B

NB

12

2

B

CNB

B

NB

B

NB

12

2

Page 31: An Experimental Comparison of  Empirical and Model-based Optimization

Summary: Modeling for Tile Size (NB) Models of increasing complexity

3*NB2 ≤ C Whole work-set fits in L1

NB2 + NB + 1 ≤ C Fully Associative Optimal Replacement Line Size: 1 word

or

Line Size > 1 word

or

LRU Replacement

B

N

M

A C

NB

NB

K

KB

C

B

NB

B

NB

1

2

B

CNB

B

NB

1

2

B

C

B

NB

B

NB

B

NB

12

2

B

CNB

B

NB

13

2A

M(I)

K

C

B

N (J)

KB

A

M(I)

K

C

B

N (J)

KL

Page 32: An Experimental Comparison of  Empirical and Model-based Optimization

Comments

Lot of work in compiler literature on automatic tile size selection

Not much is known about how well these algorithms do in practice Few comparisons to BLAS

Not obvious how one generalizes our models to more complex codes Insight needed: how sensitive is performance to tile size?

Page 33: An Experimental Comparison of  Empirical and Model-based Optimization

Experiments

Architectures: SGI R12000, 270MHz Sun UltraSPARC III, 900MHz Intel Pentium III, 550MHz

Measure Mini-MMM performance Complete MMM performance Sensitivity of performance to parameter variations

Page 34: An Experimental Comparison of  Empirical and Model-based Optimization

Installation Time of ATLAS vs. Model

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

SGI Atlas SGI Model Sun Atlas Sun Model Intel Atlas Intel Model

tim

e (

s)

Detect Machine Parameters Optimize MMM Generate Final Code Build Library

Page 35: An Experimental Comparison of  Empirical and Model-based Optimization

MiniMMM Performance

SGI ATLAS: 457 MFLOPS Model: 453 MFLOPS Difference: 1%

Sun ATLAS: 1287 MFLOPS Model: 1052 MFLOPS Difference: 20%

Intel ATLAS: 394 MFLOPS Model: 384 MFLOPS Difference: 2%

Page 36: An Experimental Comparison of  Empirical and Model-based Optimization

MMM Performance

00.5

11.5

22.5

33.5

44.5

5

0 1000 2000 3000 4000 5000Matrix Size

TL

B M

isse

s (B

illio

ns)

0

100

200

300

400

500

600

0 1000 2000 3000 4000 5000Martix Size

MF

LO

PS

0200

400600800

100012001400

16001800

0 1000 2000 3000 4000 5000Martix Size

MF

LO

PS

0

100

200

300

400

500

600

0 1000 2000 3000 4000 5000Martix Size

MF

LO

PS

SGI Sun

Intel

BLAS

MODELF77ATLAS

Page 37: An Experimental Comparison of  Empirical and Model-based Optimization

Optimization Parameter Values

NB MU/NU/KU F/I/N-FetchLatency

ATLAS SGI: 64 4/4/64 0/5/1 3 Sun: 48 5/3/48 0/3/5 5 Intel: 40 2/1/40 0/3/1 4

Model SGI: 62 4/4/62 1/2/2 6 Sun: 88 4/4/78 1/2/2 4 Intel: 42 2/1/42 1/2/2 3

Page 38: An Experimental Comparison of  Empirical and Model-based Optimization

Sensitivity to NB and Latency: Sun

0

200

400

600

800

1000

1200

1400

1600

20 40 60 80 100 120 140Tile Size

MF

LO

PS

0

200

400

600

800

1000

1200

1400

1600

1 2 3 4 5 6 7 8 9 10 11 12 13 14Latency

MF

LO

PS

Tile Size (NB) Latency

ATLAS MODELBEST ATLASMODEL BEST

Page 39: An Experimental Comparison of  Empirical and Model-based Optimization

0

100

200

300

400

500

600

20 220 420 620 820

Tile Size

MF

LO

PS

Sensitivity to NB: SGI3*NB2 ≤ C2

NB2 + NB + 1 ≤ C2

ATLASMODEL BEST

Page 40: An Experimental Comparison of  Empirical and Model-based Optimization

Sensitivity to NB: Intel

M

A, B

0

50

100

150

200

250

300

350

400

450

20 70 120 170 220 270 320

Tile Size (B: Best, A: ATLAS, M: Model)

MF

LO

PS

Page 41: An Experimental Comparison of  Empirical and Model-based Optimization

Sensitivity to MU,NU: SGI

12

34

56

78

MU12345678

NU

200

300

400

12

34

56

78

MU

1 2 3 4 5 6 7 8MU

1

2

3

4

5

6

7

8

NU

200300400

1

2

3

4

5

6

7

8

NU

Page 42: An Experimental Comparison of  Empirical and Model-based Optimization

Sensitivity to MU,NU: Sun

12

34

56

78

MU12345678

NU250500750

10001250

12

34

56

78

MU

1 2 3 4 5 6 7 8MU

1

2

3

4

5

6

7

8

NU

2505007501000

12501

2

3

4

5

6

7

8

NU

Page 43: An Experimental Comparison of  Empirical and Model-based Optimization

12

34

56

78

MU12345678

NU

0100200300

12

34

56

78

MU

1 2 3 4 5 6 7 8MU

1

2

3

4

5

6

7

8

NU

0100

2003001

2

3

4

5

6

7

8

NU

Sensitivity to MU,NU: Intel

Page 44: An Experimental Comparison of  Empirical and Model-based Optimization

Shape of register tile matters

0

50

100

150

200

250

300

350

400

450

1 2 3 4 5 6 7 8

MU / NU

MF

LO

PS

MU x 1 1 x NU

Page 45: An Experimental Comparison of  Empirical and Model-based Optimization

Sensitivity to KU

BA

B

A

B A

0

200

400

600

800

1000

1200

1400

1600

0 10 20 30 40 50 60

KU (B: Best, A: ATLAS, M: Model)

MF

LO

PS

SGI Sun Intel

Page 46: An Experimental Comparison of  Empirical and Model-based Optimization

Conclusions

Search is not as important as one might think

Compilers can achieve near-ATLAS performance if they implement well known transformations use models to choose parameter values

There is room for improvement in both models and empirical search Both are 20-25% slower than BLAS Higher levels of memory hierarchy cannot be neglected

Page 47: An Experimental Comparison of  Empirical and Model-based Optimization

Future Directions Study hand-written BLAS codes to understand

performance gap Repeat study with FFTW/SPIRAL

Uses search to choose between algorithms Combine models with search

Use models to speed up empirical search Use empirical studies to enhance models

Feed insights back into compilers How do we make it easier for compiler writers to

implement transformations? Use insights to simplify memory system

Page 48: An Experimental Comparison of  Empirical and Model-based Optimization

Information

URL: http://iss.cs.cornell.edu

Email: [email protected]

Page 49: An Experimental Comparison of  Empirical and Model-based Optimization

Sensitivity to Latency: Intel

M

A

0

50

100

150

200

250

300

350

400

450

1 2 3 4 5 6 7 8

Latency (B: Best, A: ATLAS, M: Model)

MF

LO

PS