Amphisbaena: Modeling Two Orthogonal Ways to Hunt on Heterogeneous Many-cores
An analytical performance model for boosting performance
Jun Ma, Guihai Yan, Yinhe Han and Xiaowei Li
State Key Laboratory of Computer Architecture
Institute of Computing Technology, C.A.S.
Univ. of Chinese Academy of Sciences
Trends in Cloud Computing
The increasing computing demands
• More massive
• More diverse
• High service-level agreement (response time, throughput)
The computing platform to meet these demands
• Multicore to manycore
• Homogeneous to heterogeneous
Two Orthogonal Ways to Boost Performance
• Scale-out speedup: exploit many cores for higher thread-level parallelism
• Scale-up speedup: exploit heterogeneous cores for optimal application-core mapping
Quantifying Scale-out and Scale-up Speedup
The overall performance
Type Issue Width ROB Size
Core-A 4 64
Core-B 6 96
Core-C 8 128
Indicate how to improve overall performance of each application.
How to figure out the application-specific scale-out and scale-up speedup?
Amphisbaena: an Analytical Approach to Model Performance
Amphisbaena, or Phi for short: modeling the overall performance speedup coming from two orthogonal ways
• Beta (scale-up speedup): the ratio of performance on target cores to current cores under the same multithreading configuration.
• Alpha (scale-out speedup): the ratio of performance on the target multithreading configuration to the current configuration on the same type of cores.
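The two ratios above suggest how the overall speedup composes. The following is a sketch, not a formula taken from the slides: it writes Alpha for the scale-out ratio, Beta for the scale-up ratio and Phi for the overall speedup, and it assumes the two effects simply multiply, which is what treating them as orthogonal implies. The function $P(c, n)$ (performance on core type $c$ with $n$ threads) and the subscripts cur/tgt are notation introduced here for illustration.

```latex
\alpha = \frac{P(c,\; n_{\mathrm{tgt}})}{P(c,\; n_{\mathrm{cur}})}, \qquad
\beta  = \frac{P(c_{\mathrm{tgt}},\; n)}{P(c_{\mathrm{cur}},\; n)}, \qquad
\Phi   = \frac{P(c_{\mathrm{tgt}},\; n_{\mathrm{tgt}})}{P(c_{\mathrm{cur}},\; n_{\mathrm{cur}})}
       \approx \alpha \times \beta
```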
Experimental Setup
• cluster-based layout
• distributed, banked LLC
• directory-based MOESI protocol
Scale-out Speedup
– the serial part.
– the parallelizable part.
– the multithreading penalty.
Observation
– modulating constant.
– synchronization waiting cycles per kilo-instruction (SPKI).
– thread number.

– modulating constant.
– misses waiting cycles per kilo-instruction (MPKI).
– thread number squared.
The Details of Multithreading Penalty
Coefficients   Value          Implementations
a0             1.837e-003     constant
a1             0.05312        constant
a2             -2.025e-005    constant
k0             bias           redundant computations
k1             SPKI           bottleneck-identifying instructions
k2             MPKI           built-in performance counters
a0, a1, a2: obtained offline
k0, k1, k2: obtained online
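To make the role of the coefficients concrete, here is a minimal Python sketch of a scale-out speedup with a multithreading penalty. The constants a0 to a2 are the values listed above; everything else is an assumption made for illustration: the pairing of a0 with the bias k0, of a1 with SPKI times the thread number, and of a2 with MPKI times the thread number squared is inferred from the slide annotations, and adding the penalty to an Amdahl-style normalized execution time is a guess at how the three parts combine, not the authors' stated formula.

```python
# Offline-fitted constants as listed on the slide.
A0, A1, A2 = 1.837e-3, 0.05312, -2.025e-5


def multithreading_penalty(n, k0, spki, mpki):
    """ASSUMED form: bias term + SPKI term linear in n + MPKI term quadratic in n."""
    return A0 * k0 + A1 * spki * n + A2 * mpki * n ** 2


def scale_out_speedup(n, serial, parallel, k0, spki, mpki):
    """Speedup of n threads over 1 thread.

    ASSUMPTION: normalized time = serial + parallel / n + penalty(n).
    With all penalty inputs set to zero this reduces to Amdahl's Law.
    """
    time_1 = serial + parallel + multithreading_penalty(1, k0, spki, mpki)
    time_n = serial + parallel / n + multithreading_penalty(n, k0, spki, mpki)
    return time_1 / time_n
```

With k0, SPKI and MPKI all zero the function returns the plain Amdahl speedup, which makes it easy to see how much of a predicted slowdown is attributed to the penalty term.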
Alpha Model Accuracy
benchmarks     12
phases         50
threads        33 (1, 2, 4, 6, …, 64)
total space    633600
samples        600
Our prediction error is under 5% on average, compared with 11.4% for Amdahl’s Law.
Scale-up Speedup
the frontend: issue width
• W [Big, Small]
the backend: ROB size
• R [Big, Small]
How to predict the CPI on various types of cores?
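[Figure: four core types C0, C1, C2, C3, one per combination of issue width W and ROB size R, each either Big (B) or Small (S)]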
Observation
This trend is well approximated by a power law.
This trend fits an exponential function well.
The Details of CPI Model
Coefficients   Value                     Implementations
b0             0.2837                    constant
b1             1.1675                    constant
b2             1.8427                    constant
r              bias                      b0×CPIbase
s              memory intensity          CPImem/CPI
t              computing intensity       CPIbase/CPI
CPImem         penalty with stalls       CPI stack calculation
CPIbase        penalty without stalls    CPI stack calculation
s: memory intensity. t: computing intensity. r: bias.
b0, b1, b2: obtained offline
r, s, t: obtained online
CPImem, CPIbase: obtained online
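The online inputs in the table can be computed directly from a CPI stack. The sketch below only derives r, s and t as the table defines them; it does not reproduce the full CPI predictor, whose functional form (and the use of b1 and b2) is not recoverable from the slide text. The function name `cpi_features` and its argument names are hypothetical.

```python
B0 = 0.2837  # offline-fitted constant b0 from the table


def cpi_features(cpi, cpi_base, cpi_mem):
    """Return (r, s, t) as defined in the table.

    r: bias                = b0 * CPIbase
    s: memory intensity    = CPImem / CPI
    t: computing intensity = CPIbase / CPI
    """
    if cpi <= 0:
        raise ValueError("cpi must be positive")
    r = B0 * cpi_base
    s = cpi_mem / cpi
    t = cpi_base / cpi
    return r, s, t
```

For a phase whose measured CPI is 2.0 with a base component of 1.0 and a memory component of 1.0, this gives equal memory and computing intensity of 0.5.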
Beta Model Accuracy
benchmarks     12
phases         50
core types     6
total space    18000
samples        600
Our prediction error stays below 8% on average, compared with 12.2% for PIE.
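The slides do not say how the total-space figures are counted. One reading that reproduces both the 633600 quoted for the Alpha model and the 18000 quoted for the Beta model is to count ordered (current, target) pairs of distinct configurations for every benchmark phase. This interpretation is an inference from the arithmetic, and the helper name `ordered_pairs` is introduced here.

```python
def ordered_pairs(k):
    """Number of (current, target) pairs drawn from k distinct configurations."""
    return k * (k - 1)


benchmarks, phases = 12, 50
thread_counts = [1] + list(range(2, 65, 2))  # 1, 2, 4, 6, ..., 64
core_types = 6

# Alpha model: pairs of thread counts for every benchmark phase.
alpha_space = benchmarks * phases * ordered_pairs(len(thread_counts))
# Beta model: pairs of core types for every benchmark phase.
beta_space = benchmarks * phases * ordered_pairs(core_types)

print(len(thread_counts), alpha_space, beta_space)  # prints: 33 633600 18000
```

Under this reading, the 600 samples used for each model cover well under 0.1% of the Alpha space and about 3.3% of the Beta space.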
Phi Model Accuracy
benchmarks     12
phases         50
threads        33 (1, 2, 4, 6, …, 64)
core types     6
total space    633600×18000
samples        1080
The prediction error of overall performance is kept below 12% on average.
Orthogonality Validation
Orthogonality: $\Phi_m - \alpha_m \times \beta_m = 0$
benchmarks     12
phases         50
threads        33 (1, 2, 4, 6, …, 64)
core types     6
total space    633600×18000
measured       2268
$\Phi_m$, $\alpha_m$, $\beta_m$: three measured values.
For most applications, the orthogonality error is below 5% on average.
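As an illustration of how such an orthogonality check could be computed from three measured speedups (overall, scale-out and scale-up), here is a small Python sketch. The relative-error definition is an assumption; the slides report the error figure but the exact metric is not recoverable from the text, and the function name is hypothetical.

```python
def orthogonality_error(phi_m, alpha_m, beta_m):
    """Relative gap between the measured overall speedup and the product of the
    measured scale-out and scale-up speedups (ASSUMED error definition)."""
    if phi_m <= 0:
        raise ValueError("phi_m must be positive")
    return abs(phi_m - alpha_m * beta_m) / phi_m
```

A result of zero means the two speedups compose exactly; for example, a measured overall speedup of 5.0 against measured factors of 2.0 and 2.0 yields an error of 0.2.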
Application of Phi Model
Using Phi for runtime management
• Predict online the performance speedup from scale-out and scale-up for any other target configuration.
• Invoke the scheduling algorithm to find the configuration that maximizes performance.
• The operating system enables the specified multithreading and application-core mapping.
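The predict-then-select part of this flow can be sketched for a single application as follows. This is a minimal illustration, assuming the overall speedup is the product of the predicted scale-out and scale-up speedups and that the candidate set is small enough to enumerate; the function name, the predictor callables and the numbers in the usage lines are all hypothetical.

```python
from itertools import product


def best_configuration(thread_options, core_options, predict_alpha, predict_beta):
    """Return the (threads, core_type) pair with the highest predicted overall speedup.

    predict_alpha(threads) and predict_beta(core_type) stand in for the online
    scale-out and scale-up predictors.
    """
    return max(
        product(thread_options, core_options),
        key=lambda cfg: predict_alpha(cfg[0]) * predict_beta(cfg[1]),
    )


# Hypothetical predictions for one application phase.
alpha = {1: 1.0, 2: 1.8, 4: 1.5}
beta = {"Core-A": 1.0, "Core-B": 1.3}
best = best_configuration(alpha, beta, alpha.get, beta.get)
```

Enabling the chosen thread count and core mapping is left to the operating system, as the last step above states.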
Phi Scheduling
Dout
• function: “decide the thread number to spawn for each application.”
• policy: “applications with higher scale-out speedup should spawn more threads.”
Dup
• function: “decide the cores to map for each application.”
• policy: “the application with the largest scale-up speedup is allocated the fastest type of cores.”
Phi
• algorithm: “Phi scheduling uses a heuristic algorithm to maximize performance.”
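The two policies can be illustrated with a short Python sketch. The slides state only the policies in words, so the concrete rules here are assumptions: `dout` splits a fixed thread budget in proportion to each application's scale-out speedup, and `dup` hands out core types in rank order of scale-up speedup, one type per rank. The function names mirror the slide labels; the signatures are hypothetical.

```python
def dout(scale_out, total_threads):
    """Dout-style policy: higher scale-out speedup gets more threads.

    ASSUMPTION: threads are split proportionally, with at least one per application.
    """
    total = sum(scale_out.values())
    return {app: max(1, int(total_threads * s / total)) for app, s in scale_out.items()}


def dup(scale_up, core_types_fast_to_slow):
    """Dup-style policy: the largest scale-up speedup gets the fastest core type.

    ASSUMPTION: one core type per rank; extra applications share the slowest type.
    """
    ranked = sorted(scale_up, key=scale_up.get, reverse=True)
    last = len(core_types_fast_to_slow) - 1
    return {app: core_types_fast_to_slow[min(i, last)] for i, app in enumerate(ranked)}
```

A heuristic in the spirit of Phi scheduling would combine the two outputs per application; how the actual algorithm resolves contention for cores is not described in the slide text.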
Performance Comparison
Baselines   Scale-out              Scale-up
Bias        Dout                   memory-related samples
PIE         Dout                   PIE model
Static      fixed thread number    Dup
Phi         Dout                   Dup
On average, Phi outperforms the other three baselines by 12.2% (Static), 13.3% (Bias) and 12.9% (PIE).
Related Works
Performance prediction and optimization periodically
Decide only the number of threads/active cores
• CPR: Composable Performance Regression for Scalable Multiprocessor
– [Benjamin C. Lee et al., MICRO 2008]
• FDT: Feedback-Driven Threading: Power-Efficient and High-Performance Execution of Multi-threaded Workloads on CMPs
– [M. Aater Suleman et al., ASPLOS 2008]
Decide only the type of heterogeneous cores
• Single-ISA Heterogeneous Multi-core Architectures for Multithreaded Workload Performance
– [Rakesh Kumar et al., ISCA 2004]
• Scheduling Heterogeneous Multi-cores Through Performance Impact Estimation (PIE)
– [Kenzo Van Craeynest et al., ISCA 2012]
Conclusion
Analytical model for performance prediction
• Scale-out speedup
• Scale-up speedup
• Overall performance
Phi scheduling
• Apply for runtime management
• Return optimal performance
Thanks for Your Attention
Q&A