Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via...
-
Upload
letitia-adams -
Category
Documents
-
view
224 -
download
0
Transcript of Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via...
![Page 1: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/1.jpg)
Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via
Heterogeneous-Reliability MemoryYixin Luo, Sriram Govindan, Bikash Sharma,
Mark Santaniello, Justin Meza, Aman Kansal, Jie Liu, Badriddine Khessib, Kushagra Vaid, Onur Mutlu
![Page 2: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/2.jpg)
Executive Summary• Problem: Reliable memory hardware increases cost•Our Goal: Reduce datacenter cost; meet availability target•Observation: Data-intensive applications’ data exhibit a
diverse spectrum of tolerance to memory errors‐ Across applications and within an application‐ We characterized 3 modern data-intensive applications
•Our Proposal: Heterogeneous-reliability memory (HRM)‐ Store error-tolerant data in less-reliable lower-cost memory‐ Store error-vulnerable data in more-reliable memory
•Major results:‐ Reduce server hardware cost by 4.7 %‐ Achieve single server availability target of 99.90 %
2
![Page 3: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/3.jpg)
Outline•Motivation•Characterizing application memory error tolerance•Key observations‐Observation 1: Memory error tolerance varies
across applications and within an application‐Observation 2: Data can be recovered by software
•Heterogeneous-Reliability Memory (HRM)•Evaluation
3
![Page 4: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/4.jpg)
Outline•Motivation•Characterizing application memory error tolerance•Key observations‐Observation 1: Memory error tolerance varies
across applications and within an application‐Observation 2: Data can be recovered by software
•Heterogeneous-Reliability Memory (HRM)•Evaluation
4
![Page 5: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/5.jpg)
Server Memory Cost is High•Server hardware cost dominates datacenter Total Cost of Ownership (TCO) [Barroso ‘09]
•As server memory capacity grows, memory cost becomes the most important component of server hardware costs [Kozyrakis ‘10]
5
128GB Memory cost~$140(per 16GB)×8= ~$1120 *
2 CPUs cost~$500(per CPU)×2= ~$1000 *
* Numbers in the year of 2014
![Page 6: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/6.jpg)
6
Memory Reliability is Important
System/app crash
System/apphang or
slowdown
Silent data corruption or incorrect app output
![Page 7: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/7.jpg)
1 Mb 4 Mb 16 Mb 64 Mb 256 Mb 1 Gb 4 Gb0
2
4
6
8
10
12[DocMemory '00] Predicted as trend
DRAM chip capacity
Testi
ng c
ost/
Mem
cos
t (%
)
Memory testing cost can be a significant fraction of memory cost as memory capacity grows
Existing Error Mitigation Techniques (I)
•Quality assurance tests increase manufacturing cost
7
![Page 8: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/8.jpg)
Stronger error protection techniques have higher cost
Incr
easi
ng st
reng
thExisting Error Mitigation Techniques (II)
•Error detection and correction increases system cost
8
Technique Detection Correction Added capacity
Added logic
NoECC N/A N/A 0.00% NoParity 1 bit N/A 1.56% Low
SEC-DED 2 bit 1 bit 12.5% LowChipkill 2 chip 1 chip 12.5% High
Parity 1 bit N/A 1.56%SEC-DED 2 bit 1 bit 12.5%SEC-DED 2 bit 1 bit 12.5% LowChipkill 2 chip 1 chip 12.5% High
![Page 9: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/9.jpg)
Goal: Design a new cost-efficient memory system that flexibly matches memory reliability with application memory error tolerance
Shortcomings of Existing Approaches
•Uniformly improve memory reliability‐Observation 1: Memory error tolerance varies across applications and with an application
•Rely only on hardware-level techniques‐Observation 2: Once a memory error is detected, most corrupted data can be recovered by software
9
![Page 10: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/10.jpg)
Outline•Motivation•Characterizing application memory error tolerance•Key observations‐Observation 1: Memory error tolerance varies
across applications and within an application‐Observation 2: Data can be recovered by software
•Heterogeneous-Reliability Memory (HRM)•Evaluation
10
![Page 11: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/11.jpg)
11Memory Error Outcomes
Characterization Goal
Memory Error
System/App Crash
Incorrect Response
Masked by Logic
Masked by Overwrite
Consumed by Application
Correct Result Incorrect Result
x = …010…x = …110…
Store
x = …000…
Load
if (x != 0) …
return x;or *x;
Quantify application memory error tolerancecorrupted
![Page 12: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/12.jpg)
Characterization Methodology•3 modern data-intensive applications
•3 dominant memory regions‐Heap – dynamically allocated data‐ Stack – function parameters and local variables‐Private – private heap managed by user
• Injected a total of 23,718 memory errors using software debuggers (WinDbg and GDB)•Examined correctness for over 4 billion queries
12
Application WebSearch Memcached GraphLabMemory footprint 46 GB 35 GB 4 GB
![Page 13: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/13.jpg)
Outline•Motivation•Characterizing application memory error tolerance•Key observations‐Observation 1: Memory error tolerance varies
across applications and within an application‐Observation 2: Data can be recovered by software
•Heterogeneous-Reliability Memory (HRM)•Evaluation
13
![Page 14: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/14.jpg)
14
WebSearch Memcached GraphLab0
2
4
6
8
10
12
14 System/Application Crash
Prob
abili
ty o
f Cra
sh (%
)
>10× difference
Observation 1: Memory Error Tolerance VariesAcross Applications
Showing results for single-bit soft errorsResults for other memory error types can be found in the paper with similar conclusion
![Page 15: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/15.jpg)
15
WebSearch Memcached GraphLab1E+1
1E+2
1E+3
1E+4
1E+5
1E+6
1E+7
1E+8 Incorrect Responses
# In
corr
ect/
Billi
on Q
uerie
s
>105× difference
Observation 1: Memory Error Tolerance VariesAcross Applications
Showing results for single-bit soft errorsResults for other memory error types can be found in the paper
![Page 16: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/16.jpg)
16
Private Heap Stack0
0.2
0.4
0.6
0.8
1
1.2 System/Application Crash
Prob
abili
ty o
f Cra
sh (%
)
>4× difference
Showing results for WebSearchResults for other workloads can be found in the paper
Observation 1: Memory Error Tolerance Variesand Within an ApplicationAcross Applications
![Page 17: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/17.jpg)
17
Private Heap Stack1E+0
1E+1
1E+2 Incorrect Responses
# In
corr
ect/
Billi
on Q
uerie
s
Showing results for WebSearchResults for other workloads can be found in the paper
15All averaged at a very low rate
Observation 1: Memory Error Tolerance Variesand Within an ApplicationAcross Applications
![Page 18: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/18.jpg)
Outline•Motivation•Characterizing application memory error tolerance•Key observations‐Observation 1: Memory error tolerance varies
across applications and within an application‐Observation 2: Data can be recovered by software
•Heterogeneous-Reliability Memory (HRM)•Evaluation
18
![Page 19: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/19.jpg)
• Implicitly recoverable – application intrinsically has a clean copy of the data on disk• Explicitly recoverable – application can create a copy of the
data at a low cost (if it has very low write frequency)
19
Observation 2: Data Can be Recovered by SoftwareImplicitly and Explicitly
Private Heap Stack Overall
88%
59%
1%
82%63%
28%16%
56%
WebSearch Recoverability
Implicitly recoverable
Explicitly recoverable
![Page 20: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/20.jpg)
Outline•Motivation•Characterizing application memory error tolerance•Key observations‐Observation 1: Memory error tolerance varies
across applications and within an application‐Observation 2: Data can be recovered by software
•Heterogeneous-Reliability Memory (HRM)•Evaluation
20
![Page 21: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/21.jpg)
21
App/Data A App/Data B App/Data C
Mem
ory
erro
r vul
nera
bilit
y
Vulnerable data
Tolerant data
Exploiting Memory Error Tolerance
Heterogeneous-Reliability Memory
Low-cost memoryReliable memory
Vulnerable data
Tolerant data
Vulnerable data
Tolerant data
• ECC protected• Well-tested chips
• NoECC or Parity• Less-tested chips
![Page 22: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/22.jpg)
22
Memory
Page A
Page B
Par+R: Parity Detection + Software Recovery
Implicit Recovery Explicit Recovery
Memory
Disk
Page A
Page A
Memory Error
Disk
Page A
Memory Error
Page B
Page B
Write
Write non-intensive
Write intensive
Intrinsic
copy
Copy Copy
![Page 23: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/23.jpg)
23
Heterogeneous-Reliability Memory
App 1 data A
App 1 data B
App 2 data A
App 2 data B
App 3 data A
App 3 data B
Step 2: Map application data to the HRM system enabled by SW/HW cooperative solutions
Step 1: Characterize and classify application memory error tolerance
Reliable memory
Parity memory + software recovery (Par+R)
Low-cost memory
UnreliableReliable
Vulnerable Tolerant
App 1 data A
App 2 data A
App 2 data B
App 3 data A
App 3 data B
App 1 data B
![Page 24: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/24.jpg)
Outline•Motivation•Characterizing application memory error tolerance•Key observations‐Observation 1: Memory error tolerance varies
across applications and within an application‐Observation 2: Data can be recovered by software
•Heterogeneous-Reliability Memory (HRM)•Evaluation
24
![Page 25: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/25.jpg)
25
Evaluated Systems
ConfigurationMapping
Pros and ConsPrivate(36 GB)
Heap(9 GB)
Stack(60 MB)
Typical Server ECC ECC ECC Reliable but expensiveConsumer PC NoECC NoECC NoECC Low-cost but unreliableHRM Par+R NoECC NoECC Parity only
Baseline systems HRM systems
Less-Tested (L) NoECC NoECC NoECC Least expensive and reliableHRM/L ECC Par+R NoECC Low-cost and reliable HRM
![Page 26: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/26.jpg)
Design Parameters
26
DRAM/server HW cost [Kozyrakis ‘10] 30%NoECC memory cost savings 11.1%Parity memory cost savings 9.7%
Less-tested memory cost savings 18%±12%Crash recovery time 10 mins
Par+R flush threshold 5 minsErrors/server/month [Schroeder ‘09] 2000
Target single server availability 99.90%
![Page 27: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/27.jpg)
Evaluation Metrics•Cost‐Memory cost savings‐Server HW cost savings(both compared with Typical Server)
•Reliability‐Crashes/server/month‐Single server availability‐# incorrect/million queries
27
![Page 28: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/28.jpg)
28
Improving Server HW Cost Savings
Typical Server
Consumer PC
HRM Less-Tested HRM/L0123456789
3.3 2.9
8.1
4.7
Serv
er H
W c
ost s
avin
gs
(%)
Reducing the use of memory error mitigation techniques in part of memory space can save
noticeable amount of server HW cost
![Page 29: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/29.jpg)
29
HRM systems are flexible to adjust and can achieve availability target
9797.5
9898.5
9999.5100
Sing
le s
erve
r ava
ilabi
lity
(%)
Achieving Target AvailabilitySingle server availability
target: 99.90%
![Page 30: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/30.jpg)
30
Achieving Acceptable Correctness
1
10
100
1000
33
9
163
12
# in
corr
ect/
mill
ion
quer
ies
HRM systems can achieve acceptable correctness
![Page 31: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/31.jpg)
31
Evaluation ResultsTypical ServerConsumer PCHRMLess-Tested (L)HRM/L
Bigger area means better tradeoff
Outer is betterInner is worse
![Page 32: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/32.jpg)
Other Results and Findings in the Paper
• Characterization of applications’ reactions to memory errors‐ Finding: Quick-to-crash vs. periodically incorrect behavior
• Characterization of most common types of memory errors including single-bit soft/hard errors, multi-bit hard errors‐ Finding: More severe errors mainly decrease correctness
• Characterization of how errors are masked‐ Finding: Some memory regions are safer than others
• Discussion about heterogeneous reliability design dimensions, techniques, and their benefits and tradeoffs
32
![Page 33: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/33.jpg)
Conclusion•Our Goal: Reduce datacenter cost; meet availability target
• Characterized application-level memory error tolerance of 3 modern data-intensive workloads
• Proposed Heterogeneous-Reliability Memory (HRM)‐ Store error-tolerant data in less-reliable lower-cost memory‐ Store error-vulnerable data in more-reliable memory
• Evaluated example HRM systems‐ Reduce server hardware cost by 4.7 %‐ Achieve single-server availability target 99.90 %
33
![Page 34: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/34.jpg)
Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via
Heterogeneous-Reliability MemoryYixin Luo, Sriram Govindan, Bikash Sharma,
Mark Santaniello, Justin Meza, Aman Kansal, Jie Liu, Badriddine Khessib, Kushagra Vaid, Onur Mutlu
![Page 35: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/35.jpg)
35
Why use a software debugger?•Speed‐Our workloads are relatively long running•WebSearch – 30 minutes•Memcached – 10 minutes•GraphLab – 10 minutes‐Our workloads have large memory footprint•WebSearch – 46 GB•Memcached – 35 GB•GraphLab – 4 GB
![Page 36: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/36.jpg)
36
What are the workload properties?•WebSearch‐ Repeat a real-world trace of 200,000 queries, with 400 qps‐ Correctness: Top 4 most relevant documents• Document id• Relevance and popularity
•Memcached‐ 30 GB of twitter dataset‐ Synthetic client workload, at 5,000 rps‐ 90% read requests and 10% write requests
•GraphLab‐ 11 million twitter users’ following relations, 1.3 GB dataset‐ TunkRank algorithm ‐ Correctness: 100 most influential users and their scores
![Page 37: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/37.jpg)
37
How many errors are injected to each application and each memory region?
•WebSearch – 20,576•Memcached – 983•GraphLab – 2,159• Errors injected to each memory region is proportional to
their sizesApplication Private Heap Stack TotalWebSearch 36 GB 9 GB 60 MB 46 GBMemcached N/A 35 GB 132 KB 35 GB
GraphLab N/A 4 GB 132 KB 4 GB
![Page 38: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/38.jpg)
38
Does HRM require HW changes
CPU
DIMMMem Ctrl 0Mem Ctrl 1*
Mem Ctrl 2*
DIMM
DIMM
Channel 0ECC DIMM ECC
Channel 1*
Channel 2*
* Memory controller/Channel without ECC support
DIMM
DIMM
![Page 39: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/39.jpg)
39
What is the injection/monitoring process?
(Re)Start App
Inject Errors (Soft/Hard)
Run Client Workload
App Crash?
Compare Result with Expected Output
NOYES
Repe
atStart
2
3
4
5
1
![Page 40: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/40.jpg)
40
Comparison with previous works?•Virtualized and flexible ECC [Yoon ‘10]‐ Requires changes to the MMU in the processor‐ Performance overhead ~10% over NoECC
•Our work: HRM‐ Minimal changes to memory controller to enable different ECC on
different channels‐ Low performance overhead‐ Enables the use of less-tested memory
![Page 41: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/41.jpg)
41
Other Results
![Page 42: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/42.jpg)
42
Variation within application
![Page 43: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/43.jpg)
43
Variation within application
![Page 44: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/44.jpg)
44
Other types of memory errors
![Page 45: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/45.jpg)
45
Other types of memory errors
![Page 46: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/46.jpg)
46
Explicit Recovery
![Page 47: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/47.jpg)
47
Quick to crash vs. periodic incorrect
![Page 48: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/48.jpg)
48
Safe ratio: masked by overwrite
![Page 49: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/49.jpg)
49
Potential to tolerate memory errors
![Page 50: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/50.jpg)
50
Design dimension
![Page 51: Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory Yixin Luo, Sriram Govindan, Bikash.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d6e5503460f94a4f5dd/html5/thumbnails/51.jpg)
51
Design dimension