Mesh Layouts for Block-Based Caches Sung-Eui Yoon Peter Lindstrom Lawrence Livermore National...

66
Mesh Layouts for Block-Based Caches Sung-Eui Yoon Peter Lindstrom Lawrence Livermore National Laboratory

Transcript of Mesh Layouts for Block-Based Caches Sung-Eui Yoon Peter Lindstrom Lawrence Livermore National...

Mesh Layouts for Block-Based Caches

Sung-Eui Yoon

Peter Lindstrom

Lawrence Livermore National Laboratory

2

Goal

● Provide cache-coherent layouts of meshes and graphs

● Derive metrics that measure cache-coherence of layouts● Generality● Simplicity● Efficiency● Accuracy

3

Cache-Coherent Metrics

● Measure the expected number of cache misses of a layout given block-based caches● Should correlate well with the observed

number of cache misses

● Cache-aware metrics● Measure cache-coherence given known

cache parameters (e.g., block size)● Cache-oblivious metrics

● Consider all possible cache parameters

4

Motivation

Accumulated growth rate

during 1993 – 2004(log scale)

Courtesy: Anselmo Lastra, http://www.hcibook.com/e3/online/moores-law/

● Lower growth rate of data access speed

1.5X

20X46X

130X

during 99 - 04

5

Memory Hierarchies and Block-Based Caches

CPU

Fast memory or cache

Slow memory

Blocktransfer

Disk

10-2 secAccess time: 10-7 sec10-8 sec

6

Main Contributions

● Propose novel and practical cache-aware and cache-oblivious metrics● Derive metrics given block-based caches● Propose efficient cache-coherent layout

constructions● Apply to different applications

7

Related Work

● Computation reordering● Data layout optimization

8

Computational Reordering

● Cache-aware [Coleman and McKinley 95, Vitter 01, Sen et al. 02]

● Cache-oblivious [Frigo et al. 99, Arge et al. 04]

Focus on specific problems such as sorting and linear algebra computations

9

Data Layout Optimization

● Graph and matrix layout [Diaz et al. 02]● Minimum linear arrangement (MLA),

bandwidth, and wavefront, etc.● Space-filling curves

● [Sagan 94, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04]

● Rendering and processing sequences● [Deering 95, Hoppe 99, Bogomjakov and

Gotsman 02, Isenburg and Lindstrom 05]● Cache-oblivious mesh layout

● [Yoon et al. 05]

10

Outline

● Computation models● Cache-aware and cache-oblivious

metrics● Results

11

Outline

● Computation models● Cache-aware and cache-oblivious

metrics● Results

12

nc

General Framework of Layout Computation

na

nb ndInput directed graph, G (N, A)

Layout algorithm, φ

1D layout, φ(N)

……..na nd nc nb

Cache-coherent metric

13

nc

Two-Level I/O Model [Aggarwal and Vitter 88]

na

nb ndInput directed

graph

Layout algorithm

Cache

M cache blocks, whose size is B

1D layout with block size = 3

……..na nd nc nb

na nd nc nb

14

Graph Representation

● Directed graph, G = (N, A)● Represent access patterns between nodes

● Nodes, N● Data element ● (e.g., mesh vertex or mesh triangle)

● Directed arcs, A● Connects two nodes if they are accessed

sequentially

nc

na

nb nd

15

Weights of Nodes and Arcs

● Indicate probabilities that each element will be accessed

● Computed in an equilibrium status during infinite random walks● Assume that applications infinitely access

the data according to the input graph● Correspond to eigen-values of the

probability transition matrix

16

Cache-Coherence of a Layout given Block-Based Caches

● Expected number of cache misses of a layout● Probability accessing a node from

another node by traversing an arc● Conditional probability that we will have

a cache miss given the above access pattern

nc

na

nb nd

17

Specialization to Meshes

● Expected number of cache misses of a layout● Probability accessing a node from

another node by traversing an arc● Conditional probability that we will have

a cache miss given the above access pattern

nc

na

nb nd

nc

na

nb nd

An input mesh Implicitly derived graph

= constant

1. Two opposite directed arcs2. Uniform distribution to access adjacent nodes given a node

18

Outline

● Computation models● Cache-aware and cache-oblivious

metrics● Results

19

Four Different Cases

Cache-aware casesingle cache block,

M=1

Cache-oblivious casesingle cache block,

M=1

Cache-aware casemultiple cache blocks,

M>1

Cache-oblivious casemultiple cache blocks,

M>1

20

Cache-Aware: Single Cache Block, M=1

nc

na

nb ndInput directed

graph

1D layout with block size = 3

……..na nd nc nb

Cache, whose block size is B

Straddling arcs

21

Cache-Aware: Multiple Cache Blocks, M>1

nc

na

nb ndInput directed

graph

1D layout with block boundary

……..na nd nc nb

Straddling arcsCache

22

Final Cache-Aware Metric

● Counts the number of straddling arcs of the layout given a block size B

Aji

BB jiSA ),(

|))()((|||

1

)(iB : block index containing the node, i

)(xS : Unit step function, 1 if x > 0 0 otherwise.

where

23

High Accuracy of Cache-Aware Metric

Linear correlation

[-1, 1]

Observed number of cache misses

With 5 cache blocks

With 25 cache blocks

Cache-aware metric 0.97 0.97Tested layouts:

Z-curve, Hilbert curve, H-order, minimum linear arrangement layout, βΩ-layout, geometric CO layout,(bi or uni) row-by-row, (bi or uni) diagnoal layouts

Z-curve on a uniform grid

Tested block size = 4KB

24

Four Different Cases

Cache-aware casesingle cache block,

M=1

Cache-oblivious casesingle cache block,

M=1

Cache-aware casemultiple cache blocks,

M>1

Cache-oblivious casemultiple cache blocks,

M>1

25

Cache-Oblivious: Single Cache Block, M=1

Cache

Does not assume a particular block size:Then, what are good representatives

for block sizes?

26

Two Possible Block Size Progressions

● Arithmetic progression● 1, 2, 3, 4, …

● Geometric progression● 20 , 21 , 22 , 23 , …● Well reflects current caching

architectures● E.g., L1: 32B, L2: 64B, Page: 4KB, etc.

27

Probability that an Arc is a Straddling Arc

……..na nd nc nb

Is an arc straddlinggiven a block size?

Computed as a probability

as a function of arc length, l

Arc length, l, = 2

28

Two Cache-Oblivious Metrics

● Arithmetic cache-oblivious metric,

● Geometric cache-oblivious metric,

Aji

ijA l),(

||1

Aji

ijA l),(

||1 )log(

log||

1

),(

A

Ajiijl

Arc length of arc (i, j)

Geometric mean of arc lengths

)(aCOM

)(gCOM

MLA metric,Arithmetic mean

29

Validation for Cache-Oblivious (CO) Metrics

● Geometric cache-oblivious metric● Practical and useful

Geometric CO layout

Arithmetic CO layout

97% of tested block sizes

The number of cache misseswhen M = 1(log scale)

73% of tested power-of-two block sizes

30

Correlations between Metrics and Observed Number of Cache Misses

Linear correlation

[-1, 1]

Observed number of cache misses

With 1 cache block

With 5 cache blocks

Geometric CO metric 0.98 0.81

Arithmetic CO metric -0.19 -0.32

Tested block size = 4KB

Tested with 10 different layouts on a uniform grid

31

Efficient Layout Computation for Our Metrics

● Cache-aware layouts● Optimized with cache-aware metric given

a block size B● Computed from the graph partitioning

● Geometric cache-oblivious metric● Very efficient● Can be used in different layout methods

32

Layout Computation with Geometric Cache-Oblivious Metric

● Multi-level construction method● Partition an input mesh into k different

sets● Layout partitions based on our metric

1. Partition 2. Lay out

•Generalized layout method for unstructured meshes

33

Outline

● Computation models● Cache-aware and cache-oblivious

metrics● Results

34

Applications

● Isosurface extraction● View-dependent rendering

35

Iso-Surface Extraction

● Uses contour tree [van Kreveld et al. 97]● Runtime is dominated by the traversal of

iso-surface● Layout graph

● Use an input tetrahedral mesh

Spx model(140K vertices)

36

High Correlation with Number of Cache Misses

Linear correlation

[-1, 1]

Observed number

of cache misses

With 1 cache block

With 10K cache blocks

Geometric CO metric 0.99 0.98

Tested block size = 4KB

Tested with 8 different layouts: our geometric CO, our cache-aware, breadth-first (and

depth-first) layouts, spectral [Juvan and Mohar 92], cache-oblivious mesh [Yoon et al. 05], Z-curve [Sagan 94], X-axis

sorted layouts

37

High Correlation with Runtime Performance

Linear correlation

[-1, 1]

First iso-surface

extraction time

Second iso-surface

extraction time

Geometric CO metric 0.94 0.94

Disk I/O time is major bottleneck

Memory access time is major bottleneck

38

Comparison with Other Layouts

0 0.5 1 1.5 2 2.5

Cache-aware layout

Geometric CO layout

Z-curve

COML

Depth-first layout

Spectral layout

Breadth-first layout

X-axis layout

The first iso-surface extraction time (sec)

8% - 77% improvement and

very close to the cache-aware

performance

39

View-Dependent Rendering

● Layout vertices and triangles of progressive meshes● Used in an efficient VDR system [Yoon et

al. 04]● Reduce misses in GPU vertex cache

40

Cache Miss Ratio on Bunny Model

GPU vertex cache miss

ratio

Vertex cache size

Hoppe [Hoppe 99]Theoretical lower bound

[Bar-Yehuda and Gotsman 96]

Universal rendering seq. [Bogomjakov and Gotsman 02]

GeometricCO layout

41

Cache Miss Ratio on Power Plant Model

GPU vertex cache miss

ratio

Vertex cache size

Z-curve

Hoppe’s renderingseq. [Hoppe 99]

Theoretical lower bound

[Bar-Yehuda and Gotsman 96]

COML [Yoon et al. 05]

GeometricCO layout

42

Conclusion

● Novel cache-aware and cache-oblivious metrics to evaluate layouts● Derived metrics based on two-level I/O

model● Improved the performance of applications

without modifying codes

OpenCCL, open source libraryhttp:://gamma.cs.unc.edu/COL/OpenCCL

43

Ongoing and Future Work

● Derive a lower bound on our geometric cache-oblivious metric

● Employ mesh compression to further reduce disk I/O accesses

● Investigate efficient layout method for deforming models

● Apply to non-graphics applications● e.g., shortest path or other graph

computations

44

Cache-Efficient Layouts of Bounding Volume Hierarchies

● Yoon and Manocha, Eurographics 06

Ray tracingCollision detection

45

Acknowledgements

● Ajith Mascarenhas● Martin Isenburg● Dinesh Manocha● Fabio Bernardon, Joao Comba, and

Claudio Silva● For their unstructured tetrahedra

rendering program● Members of data analysis group in

LLNL● Anonymous reviewers

46

UCRL-PRES-225448

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.

47

Additional slides

48

Cache-Coherence of Layouts

● Well known heuristics for cache-coherent layouts● Space-filling curves [Sagan 94]

● How can we compute better layouts?● Requires metrics measuring cache-

coherence of layouts

49

Main Results

● Define cache-coherence of layout as:● Expected number of cache misses during

random walks of a graph given block-based caches

● Then, the exp. number of cache misses= Number of straddling arcs in a cache-

aware cache≈ Geometric mean of arc lengths in a cache-

oblivious case

50

Data Layout Optimization

● Rendering sequences● Triangle strips● [Deering 95, Hoppe 99, Bogomjakov and

Gotsman 02]● Processing sequences

● [Isenburg and Gumhold 03, Isenburg and Lindstrom 05]

Assume that access patternglobally follows the layout order

51

Data Layout Optimization

● Space-filling curves● [Sagan 94, Velho and Gomes 91, Pascucci

and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04]

Assume geometric regularity

52

Data Layout Optimization

● Graph and matrix layout● A survey [Diaz et al. 02]● Minimum linear arrangement (MLA)● Bandwidth● Profile● Wavefront, etc.

Does not necessarily produce good layouts for block-based caches

53

Cache-Aware Metric

● Expected number of cache misses given a block size B

Aji

BB iiSA ),(

|))()((|||

1

54

Correlation between Cache-Aware Metric and Observed Number of Cache Misses

Cache-aware metric

M = 5 M = 25Observed number of cache misses

Cache block size = 4KB

R2 = 0.97 R2 = 0.97

55

Evaluating Existing Layouts

● No known tight bound

● Compare against the best layout we can construct● Employ an efficient

sampling method

Is it close to the optimal

layout?

An existing layout, φ

Use it

Build a new one

56

Implementation

● Modify our open source layout computation codes, OpenCCL● Based on METIS graph partitioning library

[Karypis and Kumar 98]

● Processing speed● 15k triangles / second

57

Comparison with Cache-Oblivious Mesh Layouts (COML) [Yoon et al. 05]

● Two major improvements● Accuracy● Usability

58

Pros and Cons

● Limitations● Not directly applicable to dynamic models

● Pros● Generality● Can have benefits without modifying

underlying codes

59

Specialization to Meshes

● Assume an equally likelihood to access adjacent nodes given a node

nc

na

nb nd

nc

na

nb nd

An input mesh Implicitly derived graph

60

Two Possible Block Size Progressions

● Arithmetic progression● 1, 2, 3, 4, …

● Geometric progression● 20 , 21 , 22 , 23 , …● Well reflects current

caching architectures● E.g., L1: 32B, L2: 64B,

Page: 4KB, etc.

B

Pr(B)

B

Pr(B)Geometricdistribution

Uniform distribution

61

Cache Miss Ratio with Different Mesh Resolutions of Bunny Model

GPU vertex cache miss

ratio(case size: 32)

Mesh resolution

Hoppe’s rendering sequence[Hoppe 99]

COLg

62

Correlation between Cache-Aware Metric and Number of Cache Misses

M = 5 M = 25Observed number of cache misses

Cache block size = 4KB

R2 = 0.97 R2 = 0.97

63

Correlations with Observed Number of Cache Misses

Cache missesOne blk : Mult blks

ArithmeticCO metric

GeometricCO metric

R2 = 0.98 R2 = 0.81 Cache block size = 4KB

,

64

Comparison with Other Layouts0.99 0.98 0.94 0.94

X-axis

BFLSpectral layout

[Juvan and Mohar 92]

DFL

COML[Yoon et al. 05]

Z-curveCOLgAware

Correlation:

Up to 2X speedup9% lower than cache-

aware layout

65

Comparison with Other Layouts

● Compute eight different layouts● Our geometric cache-oblivious layout● Our cache-aware layout● Breadth-first layout● Depth-first layout● Spectral layout [Juvan and Mohar 92]● Cache-oblivious mesh layout [Yoon et al.

05]● Z-curve [Sagan 94]● X-axis sorted layout

66

Our Layout of Bunny Model