A Memory-hierarchy Conscious and Self-tunable Sorting Library
To appear in 2004 International Symposium on Code Generation and Optimization (CGO’04)
Xiaoming Li, María Jesús Garzarán, and David Padua
University of Illinois at Urbana-Champaign

Motivation
Sorting
– Core operation in many applications, such as databases
– Well understood symbolic computing problem
Library generators such as ATLAS and SPIRAL have used empirical search to adapt to
– Architectural features of the target machine
– Size of the input data
But the performance of sorting also depends on the distribution of the values to be sorted

Motivation
Main difficulties in building a sorting library
1. Theoretical complexity is not sufficient to measure quality
• Cache effects, instructions executed
2. Performance depends on the characteristics of the input
• Amount and distribution of the data to sort
• No single algorithm is optimal for all possible input sets

Contributions
1. Identify the architectural and runtime factors that affect the performance of the sorting algorithms.
2. Use empirical search to identify the best shape and parameter values of a sorting algorithm.
3. Use machine learning and runtime adaptation to select the best sorting algorithm for a specific input set.

Contributions
[Figure: execution time (cycles) versus standard deviation of the inputs; IBM Power 3, sorting 12 M keys (32-bit integers)]

Outline
– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions

Sorting Algorithms
Our sorting library contains
– Quicksort
– CC-Radix
– Multiway Merge
– Insertion Sort (for small partitions)
– Sorting Networks (for small partitions)

Quicksort
Divide-and-conquer in-place sorting algorithm
Our implementation includes Sedgewick’s optimizations:
– Set guardians (sentinel elements) at both ends of the input array.
– Eliminate recursion.
– Correctly select the pivot.
– Use insertion sort for small partitions.
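
To make three of the optimizations listed above concrete, here is a minimal Python sketch, not the library's code. It replaces recursion with an explicit stack, picks the pivot with median-of-three (an assumption; the slide only says the pivot is selected "correctly"), and hands small partitions to insertion sort. The `threshold` value of 16 is a placeholder, and the guardian elements are omitted for brevity.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive) in place."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def quicksort(a, threshold=16):
    """In-place quicksort driven by an explicit stack instead of recursion.

    `threshold` (hypothetical default) is the partition size at or below
    which insertion sort takes over.
    """
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= threshold:
            insertion_sort(a, lo, hi)
            continue
        # Median-of-three: order a[lo], a[mid], a[hi], then use a[mid].
        mid = (lo + hi) // 2
        if a[mid] < a[lo]:
            a[lo], a[mid] = a[mid], a[lo]
        if a[hi] < a[lo]:
            a[lo], a[hi] = a[hi], a[lo]
        if a[hi] < a[mid]:
            a[mid], a[hi] = a[hi], a[mid]
        pivot = a[mid]
        # Hoare-style partition around the pivot value.
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        stack.append((lo, j))
        stack.append((i, hi))
    return a
```

In a tuned library the threshold would be one of the parameters found by empirical search rather than a fixed constant.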

Radix sort
Non-comparison algorithm
[Figure: worked example of one radix-sort pass, showing the vector to sort, the counter array, the accumulated (prefix-sum) array, and the destination vector]
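
The counter, accumulation, and destination-vector steps named in the figure can be sketched as follows. This is an illustrative least-significant-digit radix sort for non-negative integers, not the library's implementation; the function names and the 8-bit digit size are assumptions.

```python
def counting_pass(src, shift, bits):
    """One stable radix pass on the digit at bit offset `shift`."""
    buckets = 1 << bits
    mask = buckets - 1
    # Counter: how many keys carry each digit value.
    counter = [0] * buckets
    for key in src:
        counter[(key >> shift) & mask] += 1
    # Accumulation: exclusive prefix sum gives each digit's first slot.
    accum = [0] * buckets
    for d in range(1, buckets):
        accum[d] = accum[d - 1] + counter[d - 1]
    # Destination vector: scatter keys to their slots, preserving order.
    dest = [0] * len(src)
    for key in src:
        d = (key >> shift) & mask
        dest[accum[d]] = key
        accum[d] += 1
    return dest


def radix_sort(keys, key_bits=32, bits=8):
    """Sort non-negative integers below 2**key_bits, lowest digit first."""
    for shift in range(0, key_bits, bits):
        keys = counting_pass(keys, shift, bits)
    return keys
```

For the input `[3, 1, 4, 1, 3, 2]`, the counters for digit values 1 to 4 are 2, 1, 2, 1 and the accumulated offsets are 0, 2, 3, 5, so the destination vector comes out as `[1, 1, 2, 3, 3, 4]`.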

CC-radix (Cache Conscious Radix Sort)
Tries to exploit data locality in caches
Based on radix sort (Jimenez and Larriba – UPC)

CC-radix(bucket)
  if fits in cache (bucket) then
    radix sort (bucket)
  else
    sub-buckets = Reverse sorting(bucket)
    for each sub-bucket in sub-buckets
      CC-radix(sub-bucket)
    endfor
  endif
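
A runnable reading of the pseudocode above, offered as a sketch under stated assumptions: "fits in cache" is modeled as a key-count limit (`cache_keys`, a hypothetical parameter), and "reverse sorting" is interpreted as partitioning on the most significant remaining digit. Keys are assumed to be non-negative integers below 2**key_bits, with `key_bits` a multiple of `bits`.

```python
def lsd_radix_sort(keys, key_bits, bits):
    """Plain radix sort on the low `key_bits` bits, lowest digit first."""
    mask = (1 << bits) - 1
    for shift in range(0, key_bits, bits):
        buckets = [[] for _ in range(1 << bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return keys


def cc_radix(keys, key_bits=32, bits=8, cache_keys=4096):
    """Partition until a bucket 'fits in cache', then radix sort it."""
    # Assumption: a bucket fits in cache when it has at most cache_keys keys.
    if len(keys) <= cache_keys or key_bits <= bits:
        return lsd_radix_sort(keys, key_bits, bits)
    # Split on the most significant remaining digit.
    shift = key_bits - bits
    mask = (1 << bits) - 1
    sub_buckets = [[] for _ in range(1 << bits)]
    for k in keys:
        sub_buckets[(k >> shift) & mask].append(k)
    out = []
    for sub in sub_buckets:
        # Keys in one sub-bucket share the top digit, so only the
        # lower `shift` bits still need sorting.
        out.extend(cc_radix(sub, shift, bits, cache_keys))
    return out
```

Because sub-buckets are visited in digit order, concatenating their sorted contents yields the fully sorted sequence.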

Multiway Merge Sort
[Figure: p sorted subsets feeding a heap with 2*p − 1 nodes]
This algorithm exploits data locality very efficiently
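
The merge step can be sketched with Python's `heapq` module. Note the differences from the slide: `heapq` keeps a binary heap with one entry per subset rather than a 2*p − 1 node structure, and the subsets are sorted here with the built-in `sorted` purely for illustration. Function names and the default `p=4` are assumptions.

```python
import heapq


def multiway_merge(runs):
    """Merge already-sorted lists by repeatedly taking the smallest head."""
    # Heap entries: (key, run index, position inside that run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        key, i, pos = heap[0]
        out.append(key)
        nxt = pos + 1
        if nxt < len(runs[i]):
            # Replace the root with the next key from the same run.
            heapq.heapreplace(heap, (runs[i][nxt], i, nxt))
        else:
            heapq.heappop(heap)
    return out


def multiway_merge_sort(keys, p=4):
    """Split into at most p subsets, sort each, then merge them."""
    if len(keys) <= 1:
        return list(keys)
    size = -(-len(keys) // p)  # ceiling division
    runs = [sorted(keys[s:s + size]) for s in range(0, len(keys), size)]
    return multiway_merge(runs)
```

The locality benefit comes from reading each subset sequentially while the heap itself stays small enough to remain cached.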

Sorting algorithms for small partitions
Insertion sort
– Exploits locality in the cache line
Sorting networks
– Register blocking

Performance Comparison
[Figure: execution time (cycles, axis from 4000 to 7000) versus standard deviation (100 to 10,000,000), comparing Intel MKL and Quicksort; Pentium III Xeon, 16 M keys (float)]

Outline
– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions

Factors that determine performance
Architectural Factors Considered
– Cache / TLB size
– Number of Registers
– Cache Line Size
Runtime Factors Considered
– Amount of data to sort
– Distribution of the data

Architectural: Cache Size/TLB Size
Tiling: Partition the data into subsets that fit in the cache
– Quicksort
• Using multiple pivots to tile
– CC-radix
• Fit each partition into cache
• The # active partitions < TLB size
– Multiway Merge Sort
• Fit the heap into cache
• Fit sorted subsets into cache

Architectural: Number of Registers
For small partitions, sort in place using the processor registers
Optimizations like unrolling and scheduling can be applied

Before scheduling:
cmp&swap(r0,r1)
cmp&swap(r2,r3)
cmp&swap(r1,r2)
cmp&swap(r0,r3)
cmp&swap(r4,r5)
…..

After scheduling:
cmp&swap(r0,r1)
cmp&swap(r2,r3)
cmp&swap(r4,r5)
cmp&swap(r1,r2)
cmp&swap(r0,r3)
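
As an illustration of a fixed compare-and-swap sequence, the sketch below sorts four values with a standard 5-comparator network. This is a textbook network chosen for the example, not the sequence from the slide, and Python variables only stand in for processor registers.

```python
# Standard 4-input sorting network (a textbook choice, assumed for this
# sketch). The first two comparators touch disjoint values, so a
# scheduler is free to issue them back to back.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4(r0, r1, r2, r3):
    """Sort four values with a fixed, data-independent comparison order."""
    r = [r0, r1, r2, r3]
    for i, j in NETWORK_4:
        # cmp&swap(ri, rj): afterwards r[i] <= r[j].
        if r[i] > r[j]:
            r[i], r[j] = r[j], r[i]
    return tuple(r)
```

Because the comparison order never depends on the data, the loop can be fully unrolled into straight-line code, which is what makes register blocking and instruction scheduling applicable.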

Architectural: Cache Line Size
Fanout = Cache Line Size
Increase cache line utilization when accessing children nodes
[Figure: children nodes packed into one cache line]
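
One way to read "Fanout = Cache Line Size" is that a node has as many children as keys fit in a cache line. The sketch below assumes an implicit array-based heap and hypothetical sizes (64-byte lines, 4-byte keys); the slide specifies neither.

```python
def heap_fanout(cache_line_bytes=64, key_bytes=4):
    """Number of keys per cache line (sizes are assumed, not measured)."""
    return cache_line_bytes // key_bytes


def children(i, fanout):
    """Indices of node i's children in an implicit array-based heap."""
    first = fanout * i + 1
    return range(first, first + fanout)
```

With these defaults the fanout is 16, and the children of a node occupy 16 consecutive array slots. Keeping each group of siblings aligned to a cache-line boundary would also require padding the start of the array, a detail omitted here.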

Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) versus number of keys (millions)]

Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) versus number of keys (millions)]

Runtime: Standard Deviation
[Figure: execution time (cycles) versus standard deviation of the keys; Pentium III Xeon, 16 M keys]

Outline
– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions

Library adaptation
Architectural Factors (handled by empirical search)
– Cache / TLB size
– Number of Registers
– Cache Line Size
Runtime Factors
– Distribution shape of the data (does not matter)
– Amount of data to sort (handled by machine learning and runtime adaptation)
– Standard Deviation (handled by machine learning and runtime adaptation)

The Library
Building the library (installation time)
– Empirical Search
– Learning Procedure
• Use of training data
Running the library (runtime)
– Runtime Procedure
Runtime Adaptation: the Learning Procedure and the Runtime Procedure

Runtime Adaptation: Learning Procedure
Goal function:
f: (N, E) → {Multiway Merge Sort, Quicksort, CC-radix}
N: amount of input data
E: the entropy vector
– Use N to choose between Multiway Merge or Quicksort
– Use the entropy and the Winnow algorithm to learn the best algorithm
• Output: weight vector (w) and threshold (Ө)
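
For readers unfamiliar with Winnow, the following is a sketch of the classic algorithm on boolean features: weights start at 1 and are multiplied or divided by a factor `alpha` only when the prediction is wrong. How the library feeds entropy values into Winnow is not stated on the slide, so the boolean inputs, `alpha = 2`, and the threshold of n/2 are all assumptions made for illustration.

```python
def train_winnow(samples, labels, alpha=2.0, epochs=20):
    """Classic Winnow on boolean feature vectors.

    Returns (weights, theta). `alpha`, `epochs`, and theta = n / 2 are
    illustrative choices, not values taken from the library.
    """
    n = len(samples[0])
    w = [1.0] * n
    theta = n / 2
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            predicted = sum(wi * xi for wi, xi in zip(w, x)) >= theta
            if predicted == y:
                continue
            mistakes += 1
            # Promote active weights on a missed positive,
            # demote them on a false positive.
            factor = alpha if y else 1.0 / alpha
            w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
        if mistakes == 0:
            break
    return w, theta
```

The returned pair plays the role of the weight vector and threshold named in the output bullet above.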

Runtime Adaptation: Runtime Procedure
– Sample the input array
– Compute the entropy vector
– Compute S = ∑_i w_i * entropy_i
– If S ≥ Ө, choose CC-radix; else choose others
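
The runtime steps above can be sketched as follows. The slide does not define the entropy vector; this sketch assumes it holds one Shannon entropy per 8-bit digit of a 32-bit key, and the sample size of 1024 is likewise a placeholder. The weights and threshold would come from the learning procedure.

```python
import math
import random
from collections import Counter


def entropy_vector(sample, key_bits=32, digit_bits=8):
    """Shannon entropy of each digit position (assumed definition)."""
    mask = (1 << digit_bits) - 1
    n = len(sample)
    vec = []
    for shift in range(0, key_bits, digit_bits):
        counts = Counter((k >> shift) & mask for k in sample)
        h = -sum(c / n * math.log2(c / n) for c in counts.values())
        vec.append(h)
    return vec


def choose_algorithm(keys, weights, theta, sample_size=1024, seed=0):
    """Return "CC-radix" if S >= theta, otherwise "others"."""
    # Step 1: sample the input array (whole input if it is small).
    rng = random.Random(seed)
    if len(keys) <= sample_size:
        sample = keys
    else:
        sample = rng.sample(keys, sample_size)
    # Step 2: compute the entropy vector of the sample.
    entropies = entropy_vector(sample)
    # Step 3: S = sum_i w_i * entropy_i, compared against theta.
    s = sum(w * e for w, e in zip(weights, entropies))
    return "CC-radix" if s >= theta else "others"
```

Only the sample is scanned, which keeps the decision cost small relative to the sort itself.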

Outline
– Sorting Algorithms
– Factors that determine performance
– The Library
– Evaluation
– Future Work
– Conclusions

Experimental Setup
Test Platforms:
– SGI R12000: 300 MHz; L1 I/D = 32 KB; L2 = 4 MB
– UltraSparcIII: 750 MHz; L1 I/D = 32 KB, 64 KB; L2 = 8 MB
– PentiumIII Xeon: 550 MHz; L1 I/D = 16 KB; L2 = 512 KB
– IBM Power3: 375 MHz; L1 I/D = 64 KB; L2 = 8 MB

Sun UltraSparcIII: 12 M keys
[Figure: execution time (cycles per key) versus standard deviation of the keys]

IBM Power3: 12 M keys
[Figure: execution time (cycles per key) versus standard deviation of the keys]

Conclusions
Identify the architectural and runtime factors
Use empirical search to find the best parameter values
Our machine learning techniques prove to be quite effective:
– Always select the best algorithm.
– Choosing the wrong algorithm introduces a 37% average performance degradation
– Overhead (average 5%, worst case 7%)

Future Work
1. Search in the space of sorting algorithms using high-level primitives
2. Extend sorting to include more data types
3. Include other comparison strategies
• For example, a less-than operator to sort vectors, graphs, …
4. Parallel algorithms
5. Explore other database operations, such as join.

Empirical Search
Adaptation to the architecture of the machine
– Quicksort and CC-radix
• The best configuration does not change significantly with the characteristics of the input data set.
• Quicksort, CC-radix:
- Use of insertion sort/sorting networks for small partitions
- Threshold to use them
• CC-radix:
- Size of the radix
– Multiway Merge Sort
• The best configuration changes with the amount and the distribution of the input data.
• The best values will be searched during the learning procedure.

Multiway Merge Sort
[Figure: worked example of four sorted runs merged through a heap]

Empirical Search
Example: Multiway Merge
• Search for the heap size that obtains the best performance:
- For different amounts of data and standard deviations
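
The search described above amounts to timing each candidate value on representative inputs and keeping the fastest. The sketch below is a minimal harness under stated assumptions: the tuned parameter is the number of sorted runs merged at once (standing in for the heap size), inputs are drawn from a normal distribution, and all function names are hypothetical.

```python
import heapq
import random
import time


def merge_with_p_runs(keys, p):
    """Stand-in multiway merge sort with p sorted runs (illustrative)."""
    size = max(1, -(-len(keys) // p))
    runs = [sorted(keys[s:s + size]) for s in range(0, len(keys), size)]
    return list(heapq.merge(*runs))


def measure(fn, *args, repeats=3):
    """Best-of-N wall-clock time of fn(*args), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def empirical_search(candidates, cost):
    """Evaluate every candidate and return (best candidate, all costs)."""
    costs = {c: cost(c) for c in candidates}
    return min(costs, key=costs.get), costs


def tune_heap_size(amounts, std_devs, candidates, seed=0):
    """Best candidate for each (amount of data, standard deviation) pair."""
    rng = random.Random(seed)
    table = {}
    for n in amounts:
        for sd in std_devs:
            keys = [int(rng.gauss(0, sd)) for _ in range(n)]
            best, _ = empirical_search(
                candidates, lambda p: measure(merge_with_p_runs, keys, p))
            table[(n, sd)] = best
    return table
```

A call such as `tune_heap_size([10_000, 100_000], [100.0, 10_000.0], [2, 4, 8, 16])` returns a table keyed by (amount, standard deviation). The winning values depend on the machine, which is the point of running the search at installation time.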