Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University...
-
Upload
matthew-griffin -
Category
Documents
-
view
214 -
download
1
Transcript of Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University...
Samsara: Efficient Deterministic Replay
with Hardware Virtualization Extensions
Peking University
Shiru Ren, Chunqi Li, Le Tan, and Zhen XiaoJuly 27, 2015
Introduction1
Background &Motivation2
Samsara Overview3
R & R the Memory Interleaving with HAV4
2
Conclusion6
Evaluation5
Table of Contents
Introduction
3
Deterministic Replay
Gives computer users the ability to travel backward in time, recreating
past states and events in the computer.
Checkpoint + record all non-deterministic events
CheckpointExecute same instruction stream
Inject NDEs in logged points
Initial StateInstruction stream
Final State
Non-determinism Events (NDEs)(e.g. user / network input, interrupts… )
Final State’
NDEs log
RecordingPhase
ReplayPhase
Application Scenarios
For debugging: cyclic debugging
For security: forensics, intrusion detection, malware analysis
For fault tolerance: hot standby, data recovery
Introduction
4
Deterministic Replay for Multi-processor
Deterministic replay for single processor is relatively mature and well-
developed
Challenge on the multi-processor system: memory interleaving
Introduction1
Background &Motivation2
Samsara Overview3
R & R the Memory Interleaving with HAV4
5
Conclusion6
Evaluation5
Table of Contents
6
Software-only schemes
Modify OS, compiler, runtime libraries or VMM
Virtualization-based approaches—CREW protocol
CREW: Concurrent-Read & Exclusive-Write
Background & Motivation
Issues
Each memory access operation must be checked for logging
Serious performance degradation (more than 10X)
Huge log size (approximately 1 MB/processor/second)
Background & Motivation
7
Hardware-based schemes
Use special hardware support for recording memory-access interleaving
Redesign the cache coherence protocol
Issues
Huge space overhead which limits the duration of the recorded interval
Modeled only using software simulations
Only support Sequential Consistency (SC)
Impractical for use in realistic systems
We believe software-only approaches will remain in the focus for optimizations
as commercial processors with dedicated hardware-based R&R features are not
commonly available yet.
8
Evaluation results show that our system:
Incurs less than 3X overhead when compared to native execution
Reduce 90% log size (even smaller than hardware-based scheme)
Combine the merits of both solutions
An Efficient way to record memory interleaving
Without hardware changes
Utilize current hardware acceleration as much as possible
Hardware-assisted virtualization
Some hardware characteristics are available to boost performance
Efficient full virtualization using help from hardware capabilities
Background & Motivation
Introduction1
Background &Motivation2
Samsara Overview3
R & R the Memory Interleaving with HAV4
9
Conclusion6
Evaluation5
Table of Contents
Samsara Overview
10
Samsara Overview
11
System composition
Controller
DMA recorder
R&R Component
Log recorder daemon
Record and Replay Non-deterministic Events
Synchronous Events: record the contents
Asynchronous Events: record timestamp
Compound Events: DMA, record both
Memory interleaving: most important challenge
Introduction1
Background &Motivation2
Samsara Overview3
R & R the Memory Interleaving with HAV4
12
Conclusion6
Evaluation5
Table of Contents
R&R the Memory Interleaving with HAV Extensions
13
Motivation
point-to-point logging approach
Record dependences between pairs of instructions( log size, record overhead )
Avoid the large number of memory access detections
Chunk-based schemes ( only the total sequence of chunks is recorded )
Chunk-based Strategy
Restrict virtual processors’ execution into a series of chunks
Merely need to record commit order
Chunk execution must satisfy:
Atomicity
Serializability
14
Serializability: COW, conflict detection strategy
Atomicity: some instructions that hard to undo
R&R the Memory Interleaving with HAV Extensions
P0
Chunk Start
LD (A)
COW ST (A)
ST (A)
ST (B)COW
Chunk Complete
Truncation Reason: I/O Instruction
Commit
P1
LD (A)
Squash & Rollback
LD (B)
ST (B)
Re-execution
LD (D)
ST (D)
□C1
□C2
□C3 ConflictDetection
R-set { A }W-set { A , B }
Truncation Reason:Chunk Size Limit
R-set { D }W-set { D }
R-set { A , B }W-set { B }
……
15
Obtain R&W-set Efficiently via HAV Extensions
VM-based approaches: VM exit (hardware page protection)
Our approach: a single EPT traversal
Accessed and Dirty Flags of EPT
Optimization: tree-based design of EPT
W(b)R(b)
R(a)R(c)
W(b)R(b)
W(e)
W(b)R(b)
R(a)R(c)
W(b)R(b)
W(e)
a single EPT traversal
Just the first write to each memory page will trigger an EPT violation
VM exit
Reduce at least 50% extra VM exits
R&R the Memory Interleaving with HAV Extensions
16
R&R the Memory Interleaving with HAV Extensions
Observations
Chunk commit is a time-consuming
process
The time consumed on waiting for
this lock is excessive (40%)
Update write-back operation involves
serious performance degradation
P0
ChunkComplete
Wait forLock
DetectConflict
Broadcast Updates
Subsequent Chunk
Lock
Obtain R&W-set
Write-backUpdates
17
Reduce Lock Granularity with a Decentralized
Three-Phase Commit Protocol
Pages committed concurrently by
different chunks have no intersection
Move this out of the synchronized
block
Chunk committing out-of-order
R&R the Memory Interleaving with HAV ExtensionsP0
ChunkComplete
Wait forLock
DetectConflict
Broadcast Updates
Write-backUpdates
Subsequent Chunk
Lock
Insert into committing list
Update Chunk Info
Check Committing List
Obtain R&W-set
18
Replay Memory Interleaving
Guarantee all chunks will be properly re-built and executed in the original
order
1. Truncate a chunk at the recorded timestamp (hardware performance
counter)
2. Ensure that all preceding chunks have been committed successfully
before the subsequent chunk starts
Allowing processors execute concurrently in replay
R&R the Memory Interleaving with HAV Extensions
Introduction1
Background &Motivation2
Samsara Overview3
R & R the Memory Interleaving with HAV4
19
Conclusion6
Evaluation5
Table of Contents
Evaluation
20
Experimental Setup
4-core Intel Core i7-4790 processor, 12GB memory, 1TB Hard Drive
Host: Ubuntu 12.04 with Linux kernel version 3.11.0 and Qemu-1.2.2
Guest: Ubuntu 14.04 with Linux kernel version 3.13.0
Workload Application Domain Parallelization Granularity
Data UsageWorking Set
Sharing Exchange
blackscholes Financial Analysis coarse low low small
bodytrack Computer Vision medium high medium medium
raytrace Rendering medium high low unbounded
swaptions Financial Analysis coarse low low medium
freqmine Data Mining coarse high medium unbounded
x264 Media Processing coarse high high medium
Workloads
PARSEC 3.0
Evaluation
21
Log Size
Samsara generates log at an average rate of 0.127 MB/s and 0.151 MB/s for
recoding two and four processors, respectively
Reduce 90% log size (even smaller than hardware-based schemes)
For comparison : 0.131 MB/s (single processor)
Evaluation
22
Log Size
The size of the chunk commit log is practically negligible compared with
other non-deterministic events
2.33% with 2 processors and 7.37% with 4 processors on average
Evaluation
23
Recording Overhead Compared to Native Execution
Measure the overhead of the system relative to the base platform it runs on
(can not reflect the actual execution time of the system in real life)
2.6X and 5.0X for recording these workloads on two and four processors
Conclusion
24
We made the first attempt to leverage HAV extensions to achieve an efficient
software-based replay system
Record processors’ execution as a series of chunks
Avoid the large number of memory access detections by performing a
single EPT traversal at the end of each chunk
Propose a decentralized three-phase commit protocol to reduce the
lock granularity of the chunk commit process
Thanks
25