Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching
![Page 1: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/1.jpg)
by: Josefin Hallberg, Tuva Palm and Mats Brorsson
Presented by: Lena Salman
![Page 2: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/2.jpg)
Introduction
Pointer-based data structures are usually allocated at effectively random locations in memory
They will usually not achieve good locality
This leads to higher miss rates
Software approach versus hardware approach
![Page 3: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/3.jpg)
Software approach
Two techniques:
Cache-conscious allocation – by far the most efficient
Software prefetch – better suited for automation and for implementation in compilers
Combining cache-conscious allocation with software prefetch does not add significantly to performance
![Page 4: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/4.jpg)
Hardware approach
Calculating and prefetching pointers
Calculating pointer dependencies
Effects of effectively predicting what to evict from the cache
General HW prefetch – more likely to pollute the cache
Problem! All the hardware strategies take advantage of the increased locality of cache-consciously allocated data.
![Page 5: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/5.jpg)
Prefetching and cache-conscious allocation
Should complement each other's weaknesses:
Cache-conscious allocation should reduce the prefetch overhead of fetching blocks with partially unwanted data.
Prefetching should reduce the cache misses and miss latencies between the nodes.
![Page 6: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/6.jpg)
Cache-conscious allocation
Excellent improvement in execution time performance
Can be adapted to specific needs by choosing the cache-conscious block size (cc-block size)
Attempts to co-allocate data in the same cache line, so that nodes referenced one after another lie on the same cache line.
![Page 7: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/7.jpg)
Allocation to improve locality
![Page 8: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/8.jpg)
Cache-conscious allocation
Attempts to allocate the data in the same cache line
Better locality can be achieved
Improved cache performance by a reduction of misses
![Page 9: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/9.jpg)
ccmalloc()
Performs the cache-conscious allocation of memory.
Takes an extra argument: a pointer to a data structure that is likely to be referenced together with the new one.
#ifdef CCMALLOC
child = ccmalloc(sizeof(struct node), parent);
#else
child = malloc(sizeof(struct node));
#endif
![Page 10: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/10.jpg)
ccmalloc()
Takes a pointer to data that is likely to be referenced close (in time) to the newly allocated structure
Invokes the standard malloc():
When allocating a new cc-block
When the data is larger than a cc-block
Otherwise: allocates in an empty slot of the cc-block
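The policy on this slide can be sketched in C roughly as follows. This is a minimal illustration, not the authors' ccmalloc(): the name `ccmalloc_sketch`, the `cc_block` bookkeeping list, the 16-byte slot alignment, the linear block lookup and the use of C11 `aligned_alloc` to start each cc-block on a 256-byte boundary are all assumptions made for the example, and freeing memory is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CC_BLOCK_SIZE 256u  /* cc-block size used in the study */
#define CC_ALIGN      16u   /* assumed slot alignment, not from the slides */

/* Hypothetical bookkeeping record, one per cc-block. */
struct cc_block {
    unsigned char *base;    /* start of the cc-block */
    size_t used;            /* bytes already handed out */
    struct cc_block *next;
};

static struct cc_block *cc_blocks = NULL;

/* Find the cc-block that contains p, or NULL if p is not in any cc-block. */
static struct cc_block *cc_find(const void *p) {
    uintptr_t a = (uintptr_t)p;
    for (struct cc_block *b = cc_blocks; b != NULL; b = b->next) {
        uintptr_t lo = (uintptr_t)b->base;
        if (a >= lo && a < lo + CC_BLOCK_SIZE) return b;
    }
    return NULL;
}

void *ccmalloc_sketch(size_t size, const void *near) {
    /* Data larger than a cc-block goes straight to the standard malloc(). */
    if (size == 0 || size > CC_BLOCK_SIZE) return malloc(size);
    size = (size + CC_ALIGN - 1) / CC_ALIGN * CC_ALIGN;

    /* Otherwise, try an empty slot in the cc-block that holds `near`. */
    struct cc_block *b = (near != NULL) ? cc_find(near) : NULL;
    if (b != NULL && b->used + size <= CC_BLOCK_SIZE) {
        void *p = b->base + b->used;
        b->used += size;
        return p;
    }

    /* No room (or no hint): obtain a new cc-block from the system allocator. */
    b = malloc(sizeof *b);
    if (b == NULL) return NULL;
    b->base = aligned_alloc(CC_BLOCK_SIZE, CC_BLOCK_SIZE);
    if (b->base == NULL) { free(b); return NULL; }
    b->used = size;
    b->next = cc_blocks;
    cc_blocks = b;
    return b->base;
}
```

With 32-byte nodes, a parent and up to seven children allocated with the parent as hint end up in the same 256-byte cc-block; the next child starts a new cc-block.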
![Page 11: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/11.jpg)
cc-blocks
= Cache-conscious blocks
Demands cache lines large enough to contain more than one pointer structure
The bigger the blocks, the lower the miss rate, if allocation is smart.
Can be set dynamically in software, independently of the HW cache line size.
In our study:
cc-block size – 256 B
hardware cache line size – 16 B to 256 B
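To make the sizes concrete, the small C helpers below work out how many hardware cache lines a 256-byte cc-block spans and how many nodes fit in a given number of bytes. The helper names and the 24-byte node size used in the note are assumptions for illustration only.

```c
#include <stddef.h>

#define CC_BLOCK_SIZE 256u  /* cc-block size used in the study */

/* How many hardware cache lines one cc-block spans (line_size must be nonzero). */
size_t lines_per_cc_block(size_t line_size) {
    return (CC_BLOCK_SIZE + line_size - 1) / line_size;
}

/* How many whole nodes of node_size bytes fit in `bytes` bytes. */
size_t nodes_that_fit(size_t bytes, size_t node_size) {
    return node_size ? bytes / node_size : 0;
}
```

For a hypothetical 24-byte node (an int plus two 8-byte pointers, with padding), a 16-byte line holds no complete node, a 64-byte line holds two, and a 256-byte cc-block holds ten. With 16-byte lines the cc-block spans 16 hardware lines; with 256-byte lines it is exactly one line.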
![Page 12: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/12.jpg)
Prefetch
Prefetching reduces the cost of a cache miss
Can be controlled by software and/or hardware
Software prefetch results in extra instructions
Hardware prefetch leads to more complex hardware
![Page 13: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/13.jpg)
Software controlled prefetch
Implemented by including a prefetch instruction in the instruction set
Prefetches should be inserted well ahead of the reference, according to a prefetch algorithm
In this study: the greedy algorithm by Mowry et al.
![Page 14: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/14.jpg)
Software prefetch – Greedy algorithm
When a node is referenced, all children of that node are prefetched.
Without extra calculation, this can only be done for children, not for grandchildren
Easier to control and optimize
The risk of polluting the cache decreases (since only needed lines are prefetched)
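As an illustration of the greedy scheme, here is a minimal C sketch of a treeadd-style recursive sum that prefetches both children as soon as a node is referenced. The `struct node` layout and the function name are assumptions, and `__builtin_prefetch` is the GCC/Clang builtin standing in for whatever prefetch instruction the simulated instruction set provides.

```c
#include <stddef.h>

/* Hypothetical binary tree node, in the spirit of the treeadd benchmark. */
struct node {
    int value;
    struct node *left;
    struct node *right;
};

/* Recursive sum with greedy prefetching of the children. */
long treeadd_greedy(const struct node *n) {
    if (n == NULL) return 0;

    /* Greedy step: the child pointers are available as soon as the node
       itself is loaded, so both children can be requested right away.
       Grandchildren would need the children to be loaded first. */
    if (n->left != NULL)  __builtin_prefetch(n->left, 0, 3);
    if (n->right != NULL) __builtin_prefetch(n->right, 0, 3);

    return n->value + treeadd_greedy(n->left) + treeadd_greedy(n->right);
}
```

The prefetch for the right child is issued before the left subtree is traversed, which gives it time to arrive; the prefetch for the left child is used almost immediately and hides little latency.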
![Page 15: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/15.jpg)
Software greedy prefetch
![Page 16: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/16.jpg)
Hardware Controlled Prefetch
Depending on the algorithm used, prefetching can occur when a miss occurs,
Or when a hint is given by the programmer through an instruction,
Or it can always occur on certain types of data
![Page 17: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/17.jpg)
Hardware prefetch
Techniques used:
Prefetch on miss
Tagged Prefetch
Both attempt to utilize spatial locality
Neither analyzes data access patterns
![Page 18: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/18.jpg)
Prefetch-on-Miss
Prefetches the next sequential line, i+1, when a miss is detected on line i.
Line i-1
Line i: Miss!
Line i+1: will be prefetched
![Page 19: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/19.jpg)
Tagged Prefetch
Each prefetched line is marked with a tag
When a prefetched line i is referenced, line i+1 is prefetched (even though no miss has occurred)
Has been shown to be efficient when memory accesses are fairly sequential
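The difference between prefetch-on-miss and tagged prefetch shows up on a purely sequential sweep. The toy model below is a sketch under stated assumptions: a tiny line-presence table with no evictions, hypothetical names throughout, and a tagged policy that also prefetches line i+1 on a demand miss (the usual formulation, which the slides do not spell out).

```c
#include <stdbool.h>
#include <string.h>

#define NUM_LINES 64  /* hypothetical: small address space, no evictions */

enum policy { PREFETCH_ON_MISS, TAGGED_PREFETCH };

struct toy_cache {
    bool present[NUM_LINES];
    bool tagged[NUM_LINES];  /* prefetched and not yet referenced */
    int misses;
};

static void prefetch_line(struct toy_cache *c, int line) {
    if (line < NUM_LINES && !c->present[line]) {
        c->present[line] = true;
        c->tagged[line] = true;
    }
}

/* Reference one line; `line` must be in [0, NUM_LINES). */
void access_line(struct toy_cache *c, int line, enum policy p) {
    if (!c->present[line]) {
        /* Demand miss on line i: fetch it and prefetch line i+1. */
        c->misses++;
        c->present[line] = true;
        c->tagged[line] = false;
        prefetch_line(c, line + 1);
    } else if (p == TAGGED_PREFETCH && c->tagged[line]) {
        /* First reference to a prefetched line: clear the tag and
           prefetch line i+1, although no miss has occurred. */
        c->tagged[line] = false;
        prefetch_line(c, line + 1);
    }
}

/* Count demand misses for a sequential sweep over lines 0..n-1. */
int sweep_misses(enum policy p, int n) {
    struct toy_cache c;
    memset(&c, 0, sizeof c);
    for (int i = 0; i < n && i < NUM_LINES; i++) access_line(&c, i, p);
    return c.misses;
}
```

On a sweep over 8 lines, prefetch-on-miss misses on every other line (4 misses), while tagged prefetch misses only on the first line because every hit on a prefetched line triggers the next prefetch.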
![Page 20: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/20.jpg)
Prefetch-on-miss for ccmalloc()
HW prefetch can be combined with ccmalloc() by introducing a hint that carries the address of the beginning of a cc-block.
![Page 21: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/21.jpg)
Prefetch-one-cc on miss
Prefetch the next line after detecting a cache miss on a cache-consciously allocated block.
![Page 22: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/22.jpg)
Prefetch-all-cc on miss
Decides dynamically how many lines to prefetch.
Depends on where in the cc-block the missing cache line is located.
Prefetches all the cache lines of the cc-block, starting from the address causing the miss
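How many lines prefetch-all-cc brings in follows from the position of the miss inside the cc-block. The C sketch below works that out; it assumes cc-blocks are aligned to their 256-byte size and that the line size divides the cc-block size, and it assumes that prefetch-one-cc does not cross the cc-block boundary. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CC_BLOCK_SIZE 256u  /* cc-block size used in the study */

/* Lines to prefetch after a miss at miss_addr: every line of the cc-block
   that follows the missing line (the missing line itself is demand-fetched).
   Assumes aligned cc-blocks and a nonzero line_size dividing CC_BLOCK_SIZE. */
size_t lines_to_prefetch_all_cc(uintptr_t miss_addr, size_t line_size) {
    uintptr_t block_end = (miss_addr / CC_BLOCK_SIZE + 1) * CC_BLOCK_SIZE;
    uintptr_t next_line = (miss_addr / line_size + 1) * line_size;
    return (size_t)((block_end - next_line) / line_size);
}

/* Prefetch-one-cc: at most the single next line, assumed here to stay
   inside the same cc-block. */
size_t lines_to_prefetch_one_cc(uintptr_t miss_addr, size_t line_size) {
    return lines_to_prefetch_all_cc(miss_addr, line_size) > 0 ? 1 : 0;
}
```

With 32-byte lines, a miss in the second line of a cc-block leads to 6 prefetched lines, a miss in the last line leads to none, and with 256-byte lines there is never anything left to prefetch.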
![Page 23: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/23.jpg)
Experimental Framework
MIPS-like, out-of-order processor simulator.
Memory latency equal to 50 ns random access time.
Benchmarks:
health – simulates the Colombian health care system
mst – creates a graph and calculates its minimum spanning tree
perimeter – calculates the perimeter of an image
treeadd – calculates a recursive sum of values
![Page 24: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/24.jpg)
More about benchmarks:
health – elements are moved between lists during execution, and there is more computation between data accesses.
mst – originally used a locality optimization procedure, which made the effect of ccmalloc() unnoticeable.
perimeter – data is allocated in an order similar to the access order, which already acts as a locality optimization.
treeadd – performs computation between node accesses in a balanced binary tree.
![Page 25: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/25.jpg)
Results: Execution time
![Page 26: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/26.jpg)
Stalls:
Memory stall – a cycle is lost because the oldest instruction waiting to be retired is a load/store instruction.
FU stall – the oldest instruction waiting to be retired is not a load/store instruction.
Fetch stall – there is no instruction waiting to be retired.
Prefetch is likely to have an effect when memory stalls are dominant!
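The three stall categories amount to a simple decision on the state of the retirement window in a cycle where nothing retires. A minimal C sketch, with hypothetical names and with the two inputs assumed to be available from the simulator:

```c
#include <stdbool.h>

enum stall { STALL_MEMORY, STALL_FU, STALL_FETCH };

/* Classify a cycle in which no instruction retires.
   window_empty: no instruction is waiting to be retired.
   oldest_is_load_store: the oldest waiting instruction is a load or store
   (ignored when the window is empty). */
enum stall classify_stall(bool window_empty, bool oldest_is_load_store) {
    if (window_empty) return STALL_FETCH;
    return oldest_is_load_store ? STALL_MEMORY : STALL_FU;
}
```

Only the cycles classified as memory stalls can be removed by prefetching, which is why the benefit depends on how dominant they are.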
![Page 27: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/27.jpg)
Graphs:
![Page 28: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/28.jpg)
![Page 29: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/29.jpg)
![Page 30: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/30.jpg)
![Page 31: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/31.jpg)
Cache performance - SW
Miss rates are improved by most strategies
Increased spatial locality with ccmalloc() reduces cache misses (less pollution)
Software prefetch shows some decrease in misses, but prefetches a lot of unused data
The combination of the software techniques achieves the lowest miss rates
![Page 32: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/32.jpg)
Cache performance – cache lines
The larger the cache lines, the more effective ccmalloc() is
HW prefetch alone, however, tends to pollute the cache with unwanted data
SW prefetch alone tends to fetch data already present in the cache
![Page 33: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/33.jpg)
Cache performance:
SW prefetch achieves higher precision
HW prefetch strategies alone perform poorly.
HW prefetch is more sensitive to cache line size than the SW prefetch
![Page 34: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/34.jpg)
Cache performance – SW prefetch with ccmalloc()
Results in an increased share of used cache lines among the prefetched lines
This is caused by increased spatial locality
However, it also results in attempts to prefetch lines already in the cache.
![Page 35: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/35.jpg)
Cache performance – HW prefetch with ccmalloc()
HW prefetch strategies show greater improvement with cache-conscious allocation than on their own
Prefetch-on-miss and tagged prefetch both show the same results
Still: a large amount of unused prefetched lines
Unused lines decrease with larger cache lines, due to spatial locality and less need to prefetch
![Page 36: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/36.jpg)
Conclusions:
Cache-conscious allocation with ccmalloc() remains the most effective technique
It efficiently overcomes the drawbacks of large cache lines
It creates the locality necessary for prefetch
The larger the cache line, the less prominent the prefetch strategy
![Page 37: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/37.jpg)
Conclusions 2:
Cache-conscious allocation combined with HW prefetch gives no prominent gain; ccmalloc() alone seems to be enough
However, ccmalloc() can be used to overcome the negative effect of next-line prefetch
HW prefetch is better than SW prefetch
![Page 38: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/38.jpg)
Conclusions 3:
When a compiler can use profiling information to optimize memory allocation in a cache-conscious manner, that is preferable!
However, when profiling is too expensive, applications will likely benefit from general prefetch support.
![Page 39: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/39.jpg)
The end!
You can tell me, I can take it..
What’s up doc???
![Page 40: Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching](https://reader036.fdocuments.in/reader036/viewer/2022081603/56813bd0550346895da4f717/html5/thumbnails/40.jpg)
Lena Salman
28.06.2004