SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads
![Page 1: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/1.jpg)
SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads
Islam Atta, Andreas Moshovos
Pınar Tözün, Anastasia Ailamaki
![Page 2: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/2.jpg)
SLICC
Online Transaction Processing (OLTP)
$100 Billion/Yr, +10% annually
• E.g., banking, online purchases, stock market…
Benchmarking
• Transaction Processing Performance Council (TPC)
• TPC-C: Wholesale retailer
• TPC-E: Brokerage market
OLTP drives innovation for HW and DB vendors
© Islam Atta 2
![Page 3: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/3.jpg)
Transactions Suffer from Instruction Misses
Many concurrent transactions
[Figure: instruction footprint over time compared against the L1-I size]
Instruction Stalls due to L1 Instruction Cache Thrashing
![Page 4: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/4.jpg)
Even on a CMP all Transactions Suffer
[Figure: transactions running on cores and their L1-I caches over time]
All caches thrashed with similar code blocks
![Page 5: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/5.jpg)
Opportunity
Footprint over Multiple Cores → Reduced Instruction Misses
Technology:
• CMP’s aggregate L1 instruction cache capacity is large enough
Application Behavior:
• Instruction overlap within and across transactions
[Figure: multiple threads spread over multiple L1-I caches over time]
![Page 6: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/6.jpg)
SLICC Overview
Dynamic Hardware Solution
• How to divide a transaction
• When to move
• Where to go
Performance
• Reduces instruction misses by 44% (TPC-C), 68% (TPC-E)
• Performance improves by 60% (TPC-C), 79% (TPC-E)
Robust:
• Non-OLTP workload remains unaffected
![Page 7: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/7.jpg)
Talk Roadmap
• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
![Page 8: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/8.jpg)
OLTP Facts
Many concurrent transactions
Few DB operations
• 28-65KB
Few transaction types
• TPC-C: 5, TPC-E: 12
Transactions fit in 128-512KB
Overlap within and across different transactions
[Figure: Payment and New Order transactions built from the DB operations R(), U(), I(), D(), IT(), ITP()]
CMPs’ aggregate L1-I cache is large enough
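The capacity argument on this slide is simple arithmetic, sketched below. The function names are hypothetical; the 16-core, 32KB L1-I configuration is the one listed on the Methodology slide, and the 128-512KB footprint range is the one quoted above.

```python
import math


def aggregate_l1i_kb(num_cores: int, l1i_kb_per_core: int) -> int:
    """Total L1-I capacity across all cores, in KB."""
    return num_cores * l1i_kb_per_core


def cores_needed(footprint_kb: int, l1i_kb_per_core: int) -> int:
    """Minimum number of L1-I caches required to hold a footprint."""
    return math.ceil(footprint_kb / l1i_kb_per_core)


# Configuration from the Methodology slide: 16 cores, 32KB L1-I each.
print(aggregate_l1i_kb(16, 32))   # 512
# Smallest transaction footprint quoted on this slide: 128KB.
print(cores_needed(128, 32))      # 4
```

Under these numbers, even the largest quoted footprint (512KB) matches the aggregate L1-I capacity, while no single 32KB cache can hold the smallest one.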
![Page 9: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/9.jpg)
Instruction Commonality Across Transactions
Lots of code reuse
Even higher across same-type transactions
[Figure: TPC-C and TPC-E; panels: All Threads, Per Transaction Type; legend: Most, Few, Single; more yellow = more reuse]
![Page 10: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/10.jpg)
Requirements
Enable usage of aggregate L1-I capacity
• Large cache size without increased latency
Exploit instruction commonality
• Localizes common transaction instructions
Dynamic
• Independent of footprint size or cache configuration
![Page 11: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/11.jpg)
![Page 12: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/12.jpg)
Example for Concurrent Transactions
[Figure: control flow graph of transactions T1, T2, T3, divided into code segments that can fit into L1-I]
![Page 13: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/13.jpg)
Scheduling Threads
[Figure: threads T1, T2, T3 scheduled over time on cores 0-3 and their L1-I caches, Conventional vs. SLICC]
Conventional: Cache Filled 10 times
SLICC: Cache Filled 4 times
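A toy count of cache fills illustrates why migration helps. This is only a sketch under strong assumptions that are not in the slides: every thread runs the same three hypothetical segments A, B, C, each segment exactly fills one L1-I, and timing and contention are ignored (threads are processed one after another). It does not reproduce the 10-vs-4 counts of the slide's example, whose control flow graph is not recoverable from the transcript.

```python
def conventional_fills(threads):
    """Each thread stays on its own core, whose L1-I holds one segment at a time."""
    fills = 0
    for segments in threads.values():
        cached = None
        for seg in segments:
            if seg != cached:      # the next segment evicts the previous one
                fills += 1
                cached = seg
    return fills


def migrating_fills(threads, num_cores):
    """Threads move to a core already caching the next segment, if one exists."""
    cores = [None] * num_cores     # segment currently held by each core's L1-I
    fills = 0
    for segments in threads.values():
        for seg in segments:
            if seg in cores:
                continue           # migrate to that core: no fill needed
            # Otherwise claim a core with an empty cache, or overwrite core 0
            # (an arbitrary victim choice made for this sketch).
            idx = cores.index(None) if None in cores else 0
            cores[idx] = seg
            fills += 1
    return fills


threads = {"T1": ["A", "B", "C"], "T2": ["A", "B", "C"], "T3": ["A", "B", "C"]}
print(conventional_fills(threads))            # 9
print(migrating_fills(threads, num_cores=4))  # 3
```

In the migrating case each segment is fetched once and then reused by every later thread, which is the effect the slide's figure shows.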
![Page 14: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/14.jpg)
![Page 15: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/15.jpg)
Migration Ingredients
When to migrate?
Step 1: Detect: cache full
Step 2: Detect: new code segment
Where to go?
Step 3: Predict where the next code segment is
![Page 16: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/16.jpg)
Migration Ingredients
When to migrate?
Step 1: Detect: cache full
Step 2: Detect: new segment
Where to go?
Step 3: Where is the next segment?
[Figure: timeline for T1; labels: Idle cores, Loops, Idle, Return back]
![Page 17: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/17.jpg)
Migration Ingredients
When to migrate?
Step 1: Detect: cache full
Step 2: Detect: new segment
Where to go?
Step 3: Where is the next segment?
[Figure: timeline for T2]
![Page 18: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/18.jpg)
Implementation
When to migrate?
Step 1: Detect: cache full (Miss Counter)
Step 2: Detect: new segment (Miss Dilution)
Where to go?
Step 3: Where is the next segment? (Find signature blocks on remote cores)
![Page 19: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/19.jpg)
Boosting Effectiveness
More overlap across transactions of the same type
SLICC: Transaction Type-oblivious
Transaction Type-aware
• SLICC-Pp: Pre-processing to detect similar transactions
• SLICC-SW: Software provides information
![Page 20: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/20.jpg)
![Page 21: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/21.jpg)
Experimental Evaluation
How does SLICC affect INSTRUCTION misses? Our primary goal
How does it affect DATA misses? Expected to increase, by how much?
Performance impact: Are DATA misses and MIGRATION OVERHEADS amortized?
![Page 22: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/22.jpg)
Methodology
Simulation
• Zesto (x86)
• 16 OoO cores, 32KB L1-I, 32KB L1-D, 1MB per core L2
• QEMU extension
• User and Kernel space
Workloads
Shore-MT
![Page 23: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/23.jpg)
Effect on Misses
Baseline: no effort to reduce instruction misses
Reduce I-MPKI by 58%. Increase D-MPKI by 7%.
[Figure: I-MPKI and D-MPKI for Base, SLICC, and SLICC-SW on TPC-C-10, TPC-E, and MapReduce; y-axis: MPKI, 0 to 45]
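For readers unfamiliar with the metric, MPKI is misses per thousand (kilo) instructions. The sketch below shows how an MPKI value and a relative change such as the 58% reduction are computed. The miss and instruction counts are hypothetical, chosen only so the arithmetic lands on 58%; they are not measurements from the talk.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per kilo-instruction."""
    return misses / (instructions / 1000.0)


def percent_change(base: float, new: float) -> float:
    """Relative change from base to new, in percent (negative = reduction)."""
    return (new - base) / base * 100.0


# Hypothetical counts over one million instructions.
base_i_mpki = mpki(40_000, 1_000_000)    # 40 misses per kilo-instruction
slicc_i_mpki = mpki(16_800, 1_000_000)   # 16.8 misses per kilo-instruction
print(round(percent_change(base_i_mpki, slicc_i_mpki)))  # -58
```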
![Page 24: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/24.jpg)
Performance
Next-line: always prefetch the next line
PIF-No Overhead: upper bound for Proactive Instruction Fetch [Ferdman, MICRO’11]
TPC-C: +60% TPC-E: +79%
Storage per core
- PIF: ~40KB
- SLICC: <1KB
[Figure: speedup (1 to 2) of Next-Line, PIF-No Overhead, SLICC, and SLICC-SW on TPC-C-1, TPC-C-10, TPC-E, and MapReduce]
![Page 25: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/25.jpg)
Summary
OLTP’s performance suffers due to instruction stalls.
Technology & Application Opportunities:
• Instruction footprint fits on aggregate L1-I capacity of CMPs.
• Inter- and intra-thread locality.
SLICC:
• Thread migration spreads the instruction footprint over multiple cores.
• Reduces I-MPKI by 58%
• Improves performance over:
- Baseline: +70%
- Next-line: +44%
- PIF: ±2% to +21%
![Page 27: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/27.jpg)
Why do data misses increase?
Example: thread migrates from core A to core B.
• Reads on core B touch data that was fetched on core A.
• Writes on core B invalidate data on core A.
• When returning to core A, cache blocks might have been evicted by other threads.
![Page 28: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/28.jpg)
SLICC Agent per Core
[Figure: block diagram of the per-core SLICC agent, with the components below]
Cache Full Detection
• MC (Miss Counter) ≥ Fill-up_t: enable shifting
Miss Dilution Tracking
• MSV (Miss Shift-Vector): shift in Miss (1) / Hit (0); count of “1”s compared against Dilution_t: enable searching
Locating Missed Blocks on Remote Cores
• Miss Tag-Queue (MTQ) + Remote Cache Segment Search
• Matched entries compared against Matched_t: enable migration, select matching core
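The component names on this slide (MC, MSV, MTQ, Fill-up_t, Dilution_t, Matched_t) suggest a three-stage decision, which the following toy model sketches. It is an interpretation, not the authors' hardware: the use of ≥ for every threshold, the default values, the MSV and MTQ sizes, the reset on migration, and the `remote_tags` lookup (a dictionary standing in for the remote cache search) are all assumptions. Falling back to an idle core when no remote core matches is omitted.

```python
from collections import deque


class SliccAgent:
    """Toy per-core agent; thresholds and queue sizes are illustrative assumptions."""

    def __init__(self, fill_up_t=256, dilution_t=10, matched_t=4,
                 msv_bits=32, mtq_entries=8):
        self.fill_up_t = fill_up_t
        self.dilution_t = dilution_t
        self.matched_t = matched_t
        self.mc = 0                           # MC: misses since arriving on this core
        self.msv = deque(maxlen=msv_bits)     # MSV: recent accesses, miss=1 / hit=0
        self.mtq = deque(maxlen=mtq_entries)  # MTQ: tags of recently missed blocks

    def on_fetch(self, tag, hit, remote_tags):
        """Process one instruction-block fetch.

        remote_tags maps core id -> set of block tags cached on that core.
        Returns the core id to migrate to, or None to stay on this core.
        """
        if not hit:
            self.mc += 1
            self.mtq.append(tag)
        # Step 1: until MC reaches Fill-up_t, treat the cache as still filling.
        if self.mc < self.fill_up_t:
            return None
        # Step 2: cache considered full, so shift hit/miss bits into the MSV.
        self.msv.append(0 if hit else 1)
        if sum(self.msv) < self.dilution_t:
            return None
        # Step 3: pick the remote core caching the most MTQ tags.
        best_core, best_matches = None, 0
        for core, tags in remote_tags.items():
            matches = sum(1 for t in self.mtq if t in tags)
            if matches > best_matches:
                best_core, best_matches = core, matches
        if best_matches >= self.matched_t:
            self._reset()
            return best_core
        return None

    def _reset(self):
        """Clear all state when the thread leaves this core."""
        self.mc = 0
        self.msv.clear()
        self.mtq.clear()


# Tiny thresholds so the behaviour is visible in five fetches, all misses.
agent = SliccAgent(fill_up_t=4, dilution_t=2, matched_t=2, msv_bits=8, mtq_entries=4)
remote = {1: {100, 101, 102}, 2: {7}}
decisions = [agent.on_fetch(t, hit=False, remote_tags=remote) for t in (0, 1, 2, 100, 101)]
print(decisions)  # [None, None, None, None, 1]
```

In the usage example the first three misses only fill the cache, the fourth reaches Fill-up_t and starts the MSV, and the fifth reaches Dilution_t with two MTQ tags found on core 1, so the agent selects core 1.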
![Page 29: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/29.jpg)
Detailed Methodology
Zesto (x86), Qtrace (QEMU extension), Shore-MT
![Page 30: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/30.jpg)
Hardware Cost
![Page 31: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/31.jpg)
Larger I-caches?
[Figure: MPKI breakdown (Conflict, Capacity, Compulsory) and speedup versus cache size (16 to 512 K) for instructions and data on TPC-C-10, TPC-E, and MapReduce]
![Page 32: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/32.jpg)
Different Replacement Policies?
[Figure: L1 instruction MPKI under LRU, LIP, BIP, DIP, SRRIP, BRRIP, and DRRIP on TPC-C, TPC-E, and MapReduce]
![Page 33: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/33.jpg)
Parameter Space (1)
[Figure: I-MPKI, D-MPKI, and speedup on TPC-C and TPC-E versus Fill-up_t (Base, 128, 256, 384, 512) and Matched_t (2, 4, 6, 8, 10)]
![Page 34: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/34.jpg)
Parameter Space (2)
[Figure: I-MPKI, D-MPKI, and speedup on TPC-C and TPC-E versus Dilution_t (2 to 30)]
![Page 35: SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads](https://reader035.fdocuments.in/reader035/viewer/2022062411/5681632e550346895dd3a81b/html5/thumbnails/35.jpg)
Partial Bloom Filter
Cache Signature Accuracy
[Figure: Bloom filter (BF) accuracy (%) on TPC-C and TPC-E for sizes 512, 1K, 2K, 4K, and 8K; y-axis: 96 to 101]
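As background for the accuracy plot, a Bloom filter can summarize which block tags a cache holds without storing the tags themselves. The sketch below is a plain Bloom filter, not the "partial" variant named on the slide, whose details are not in the transcript. The class and function names, the hash construction, and the default sizes are assumptions; a plain filter also cannot remove tags on eviction.

```python
import hashlib


class BloomSignature:
    """Plain Bloom filter over cache-block tags (illustrative only)."""

    def __init__(self, num_bits=2048, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for readability

    def _positions(self, tag):
        # Derive num_hashes bit positions by salting a hash with the index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{tag}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, tag):
        for pos in self._positions(tag):
            self.bits[pos] = 1

    def might_contain(self, tag):
        """False means definitely absent; True may be a false positive."""
        return all(self.bits[pos] for pos in self._positions(tag))


def accuracy(signature, cached_tags, probes):
    """Percentage of probes for which the filter agrees with the real contents."""
    correct = sum(1 for t in probes if signature.might_contain(t) == (t in cached_tags))
    return 100.0 * correct / len(probes)


cached = set(range(0, 512))               # hypothetical tags resident in one L1-I
sig = BloomSignature(num_bits=2048)
for tag in cached:
    sig.add(tag)
print(accuracy(sig, cached, probes=range(0, 4096)))
```

Errors come only from false positives, so a larger filter raises accuracy, which is consistent with the size sweep on the slide's x-axis.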