CS671 Parallel Programming in the Many-Core Eraeddy.zhengzhang/cs671_fall2013/lectur… · ‣Too...

CS671 Parallel Programming in the Many-Core Era Lecture 4: Introduction to Locality Theory and Practice Zheng Zhang Rutgers University

CS671 Parallel Programming in the Many-Core Era

Lecture 4: Introduction to Locality Theory and Practice

Zheng Zhang

Rutgers University

Review: Memory Wall

‣ The processor-memory performance gap

Memory Hierarchy

‣ Hierarchical memory
  * L1, L2, L3 cache
  * scratch-pad, off-chip memory, disk cache ...
  * automatic placement and replacement
  * separation of concerns: data usage vs. coherence management

‣ Trading space for time
  * the faster the access, the smaller the data capacity

‣ Software solution
  * exploit locality -- temporal and/or spatial
  * transform computation order or data layout
  * compilers, runtime, performance tuning tools

The Story of the Locality Theory

‣ Started as an empirical observation
  “During any interval of execution, a program favors a subset of its pages, and this set of favored pages changes slowly” -- Peter Denning

‣ How to quantify?
  * the performance of a machine
  * the demand of a program
  * the locality of an operation
  * is there a “primary” metric?

‣ Two example quantities
  * reuse time & footprint

Locality Statistics

‣ Miss Ratio

‣ Reuse Distance

‣ Footprint

Cache Miss Ratio

‣ Cache performance of the integer portion of SPEC CPU2000

Reuse Distance

‣ Reuse distance of an access to datum d
  * the number of distinct data elements accessed since the previous access to d
  * infinite if this is the first access to d

‣ Locality signature of an execution
  * the distribution of all finite reuse distances
  * determines the working set size and the miss rate of (fully associative, LRU) caches of all sizes
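
To make the link between the reuse distance distribution and the miss rate concrete, here is a minimal sketch. It assumes a fully associative LRU cache and the convention that an access with reuse distance d hits when d is smaller than the cache size; the function name `miss_ratio_curve` and the example trace are illustrative, not from the lecture.

```python
import math
from collections import Counter


def miss_ratio_curve(distances, max_cache_size):
    """Miss ratio for every cache size from 1 to max_cache_size.

    Assumes a fully associative LRU cache measured in data elements.
    An access hits when its reuse distance is smaller than the cache size;
    first accesses carry distance math.inf and therefore always miss.
    """
    histogram = Counter(distances)  # the "locality signature"
    total = len(distances)
    curve = {}
    for size in range(1, max_cache_size + 1):
        hits = sum(count for d, count in histogram.items() if d < size)
        curve[size] = (total - hits) / total
    return curve


# Reuse distances of the trace "a b c a b c", worked out by hand:
# three first accesses, then three reuses with two other elements in between.
example_distances = [math.inf, math.inf, math.inf, 2, 2, 2]
example_curve = miss_ratio_curve(example_distances, 4)
```

One histogram is enough to read off the miss ratio at every cache size, which is why the distribution is treated as a signature of the execution rather than of a particular cache.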

Reuse Distance Calculation I

‣ Naive counting: O(N) time per access, O(N) space
  -- N is the number of memory accesses
  -- M is the number of distinct data elements

‣Too costly: N up to 120 billion, M 25 million
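
A minimal sketch of naive counting, assuming the whole trace is kept in memory as a Python sequence (hence the O(N) space). The name `naive_reuse_distances` is hypothetical.

```python
import math


def naive_reuse_distances(trace):
    """Reuse distance of every access, computed by scanning backwards.

    For each access, the slice between the previous access to the same
    element and the current position is scanned, so one access can cost
    O(N) time. First accesses get math.inf.
    """
    last_pos = {}   # element -> index of its most recent access
    distances = []
    for i, x in enumerate(trace):
        if x in last_pos:
            between = trace[last_pos[x] + 1 : i]
            distances.append(len(set(between)))  # distinct elements in between
        else:
            distances.append(math.inf)
        last_pos[x] = i
    return distances
```

For example, `naive_reuse_distances("abcabc")` gives three infinite distances followed by three distances of 2.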

Reuse Distance Calculation II

‣ Stack algorithm [Mattson+ IBM 70]
  -- O(M) time per access, O(M) space
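
The stack idea can be sketched as follows. This is an illustration under simple assumptions (a Python list as the LRU stack, most recently used element at index 0), not the code of the cited paper; `stack_reuse_distances` is a hypothetical name.

```python
import math


def stack_reuse_distances(trace):
    """Reuse distances via an LRU stack.

    The stack holds each distinct element once, most recently used first.
    The depth at which an element is found equals the number of distinct
    elements accessed since its previous access. The stack never grows
    beyond M entries, so both the search and the space are O(M).
    """
    stack = []
    distances = []
    for x in trace:
        try:
            depth = stack.index(x)      # linear search, O(M)
        except ValueError:
            distances.append(math.inf)  # first access
        else:
            distances.append(depth)
            stack.pop(depth)
        stack.insert(0, x)              # x becomes the most recently used
    return distances
```

Compared with naive counting, the trace itself no longer has to be stored; only the M distinct elements do.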

Reuse Distance Calculation III

‣ Tree-based algorithm: search tree [Olken LBL 81, Sugumar & Abraham UM 93]
  -- O(log M) time per access, O(M) space
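
The sketch below shows the logarithmic-time idea with a swapped-in data structure: a Fenwick (binary indexed) tree indexed by access time rather than the balanced search tree of the cited work. That substitution keeps the code short but costs O(log N) time per access and O(N) space instead of O(log M) and O(M). All names are hypothetical.

```python
import math


class Fenwick:
    """Binary indexed tree over positions 0..n-1 with point update and prefix sum."""

    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):
        """Sum of entries at positions 0..i inclusive (i may be -1)."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total


def tree_reuse_distances(trace):
    """Reuse distances using a tree of 'latest access' marks.

    Exactly one position is marked per distinct element: the time of its
    most recent access. The marks strictly between an element's previous
    access and now are the distinct elements accessed in between.
    """
    marks = Fenwick(len(trace))
    last_pos = {}
    distances = []
    for t, x in enumerate(trace):
        if x in last_pos:
            p = last_pos[x]
            distances.append(marks.prefix_sum(t - 1) - marks.prefix_sum(p))
            marks.add(p, -1)            # the old access is no longer the latest
        else:
            distances.append(math.inf)
        marks.add(t, 1)
        last_pos[x] = t
    return distances
```

A search tree keyed by last access time with subtree sizes gives the same counts while storing only M nodes, which is where the O(M) space bound comes from.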

Reuse Distance Calculation III

• Stack algorithm [Mattson+ IBM 70]: O(M) time per access, O(M) space

• Search tree [Olken LBL 81, Sugumar & Abraham UM 93]: O(log M) time per access, O(M) space

• Space cost remains a major problem

• Approximate analysis [Ding+ PLDI’03/TOPLAS’09]: O(N log log M) time and O(log M) space

Footprint

‣ Amount of data accessed in an execution period
  • fp(w): average footprint of ALL windows of length w
  • length-n trace, O(n^2) windows
  • 1 billion accesses, half a quintillion windows

‣ Example: “abbb”
  • 3 length-2 windows: “ab”, “bb”, “bb”
  • footprints 2, 1, 1
  • the average fp(2) = (2 + 1 + 1)/3 = 4/3

‣ Example: “xyz xyz”
  • fp(i) = i for 0 <= i <= 3
  • fp(i) = 3 for i > 3

‣ Reuse Time? [Xiang+ ASPLOS’13]
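
The two examples can be checked with a brute-force sketch that enumerates every window of a given length. It assumes the trace is a Python sequence and that “xyz xyz” denotes the six-access trace xyzxyz, with the space being only a visual separator; `average_footprint` is a hypothetical name.

```python
from fractions import Fraction


def average_footprint(trace, w):
    """fp(w): mean number of distinct elements over all length-w windows.

    Enumerates all n - w + 1 windows directly, so it is only practical
    for short traces. Returns an exact Fraction.
    """
    n = len(trace)
    if not 1 <= w <= n:
        raise ValueError("window length must be between 1 and len(trace)")
    windows = [trace[start : start + w] for start in range(n - w + 1)]
    total = sum(len(set(window)) for window in windows)
    return Fraction(total, len(windows))
```

`average_footprint("abbb", 2)` evaluates to 4/3, and for "xyzxyz" the result is w for w up to 3 and 3 for longer windows, matching the slide. Repeating this for every w is the quadratic cost the slide warns about.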

Footprint Measurement

‣ Working set
  * limit value in an infinitely long trace [Denning & Schwartz 1972]

‣ Direct counting
  * single window size [Thiebaut & Stone TOCS’87]
  * seminal paper on footprints in shared cache

‣ Statistical approximation
  * [Denning & Schwartz 1972; Suh et al. ICS’01; Berg & Hagersten ISPASS’04; Chandra et al. HPCA’05; Shen et al. POPL’07]

‣ Precise definition/solution
  * footprint distribution, O(n log m) [Xiang et al. PPoPP’11]
  * footprint function, O(n) [Xiang et al. PACT’11]
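
One way to see why reuse times can be enough to compute the footprint without enumerating windows is a counting argument: a window of length w misses a datum only if it fits entirely before the datum's first access, after its last access, or between two consecutive accesses to it. The sketch below applies that argument for a single w in one pass over the trace. It is an illustration derived from the definition of fp(w); whether it matches the exact formulation in the cited papers is an assumption, and `footprint_from_reuse_times` is a hypothetical name.

```python
from fractions import Fraction


def footprint_from_reuse_times(trace, w):
    """fp(w) from first-access times, reuse times, and last-access times.

    With 1-based access times, a gap of length g (a reuse time, a first
    access time, or n + 1 minus a last access time) admits max(0, g - w)
    windows of length w that do not contain the datum. Subtracting the
    average number of missing data per window from m gives fp(w).
    """
    n = len(trace)
    if not 1 <= w <= n:
        raise ValueError("window length must be between 1 and len(trace)")
    last_pos = {}
    missing = 0  # (window, datum) pairs where the window lacks the datum
    for t, x in enumerate(trace, start=1):
        gap = t - last_pos[x] if x in last_pos else t  # reuse or first-access time
        missing += max(0, gap - w)
        last_pos[x] = t
    for p in last_pos.values():
        missing += max(0, (n + 1 - p) - w)  # windows after the last access
    m = len(last_pos)                        # number of distinct data
    return m - Fraction(missing, n - w + 1)
```

For "abbb" and w = 2 this again yields 4/3. Keeping a histogram of the gaps instead of summing them for one fixed w would let all window lengths be answered from the same single pass.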