Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University...

25
Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 2015

Transcript of Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University...

Page 1: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Samsara: Efficient Deterministic Replay

with Hardware Virtualization Extensions

Peking University

Shiru Ren, Chunqi Li, Le Tan, and Zhen XiaoJuly 27, 2015

Page 2: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Introduction1

Background &Motivation2

Samsara Overview3

R & R the Memory Interleaving with HAV4

2

Conclusion6

Evaluation5

Table of Contents

Page 3: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Introduction

3

Deterministic Replay

Gives computer users the ability to travel backward in time, recreating

past states and events in the computer.

Checkpoint + record all non-deterministic events

CheckpointExecute same instruction stream

Inject NDEs in logged points

Initial StateInstruction stream

Final State

Non-determinism Events (NDEs)(e.g. user / network input, interrupts… )

Final State’

NDEs log

RecordingPhase

ReplayPhase

Application Scenarios

For debugging: cyclic debugging

For security: forensics, intrusion detection, malware analysis

For fault tolerance: hot standby, data recovery

Page 4: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Introduction

4

Deterministic Replay for Multi-processor

Deterministic replay for single processor is relatively mature and well-

developed

Challenge on the multi-processor system: memory interleaving

Page 5: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Introduction1

Background &Motivation2

Samsara Overview3

R & R the Memory Interleaving with HAV4

5

Conclusion6

Evaluation5

Table of Contents

Page 6: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

6

Software-only schemes

Modify OS, compiler, runtime libraries or VMM

Virtualization-based approaches—CREW protocol

CREW: Concurrent-Read & Exclusive-Write

Background & Motivation

Issues

Each memory access operation must be checked for logging

Serious performance degradation (more than 10X)

Huge log size (approximately 1 MB/processor/second)

Page 7: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Background & Motivation

7

Hardware-based schemes

Use special hardware support for recording memory-access interleaving

Redesign the cache coherence protocol

Issues

Huge space overhead which limits the duration of the recorded interval

Modeled only using software simulations

Only support Sequential Consistency (SC)

Impractical for use in realistic systems

We believe software-only approaches will remain in the focus for optimizations

as commercial processors with dedicated hardware-based R&R features are not

commonly available yet.

Page 8: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

8

Evaluation results show that our system:

Incurs less than 3X overhead when compared to native execution

Reduce 90% log size (even smaller than hardware-based scheme)

Combine the merits of both solutions

An Efficient way to record memory interleaving

Without hardware changes

Utilize current hardware acceleration as much as possible

Hardware-assisted virtualization

Some hardware characteristics are available to boost performance

Efficient full virtualization using help from hardware capabilities

Background & Motivation

Page 9: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Introduction1

Background &Motivation2

Samsara Overview3

R & R the Memory Interleaving with HAV4

9

Conclusion6

Evaluation5

Table of Contents

Page 10: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Samsara Overview

10

Page 11: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Samsara Overview

11

System composition

Controller

DMA recorder

R&R Component

Log recorder daemon

Record and Replay Non-deterministic Events

Synchronous Events: record the contents

Asynchronous Events: record timestamp

Compound Events: DMA, record both

Memory interleaving: most important challenge

Page 12: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Introduction1

Background &Motivation2

Samsara Overview3

R & R the Memory Interleaving with HAV4

12

Conclusion6

Evaluation5

Table of Contents

Page 13: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

R&R the Memory Interleaving with HAV Extensions

13

Motivation

point-to-point logging approach

Record dependences between pairs of instructions( log size, record overhead )

Avoid the large number of memory access detections

Chunk-based schemes ( only the total sequence of chunks is recorded )

Chunk-based Strategy

Restrict virtual processors’ execution into a series of chunks

Merely need to record commit order

Chunk execution must satisfy:

Atomicity

Serializability

Page 14: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

14

Serializability: COW, conflict detection strategy

Atomicity: some instructions that hard to undo

R&R the Memory Interleaving with HAV Extensions

P0

Chunk Start

LD (A)

COW ST (A)

ST (A)

ST (B)COW

Chunk Complete

Truncation Reason: I/O Instruction

Commit

P1

LD (A)

Squash & Rollback

LD (B)

ST (B)

Re-execution

LD (D)

ST (D)

□C1

□C2

□C3 ConflictDetection

R-set { A }W-set { A , B }

Truncation Reason:Chunk Size Limit

R-set { D }W-set { D }

R-set { A , B }W-set { B }

……

Page 15: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

15

Obtain R&W-set Efficiently via HAV Extensions

VM-based approaches: VM exit (hardware page protection)

Our approach: a single EPT traversal

Accessed and Dirty Flags of EPT

Optimization: tree-based design of EPT

W(b)R(b)

R(a)R(c)

W(b)R(b)

W(e)

W(b)R(b)

R(a)R(c)

W(b)R(b)

W(e)

a single EPT traversal

Just the first write to each memory page will trigger an EPT violation

VM exit

Reduce at least 50% extra VM exits

R&R the Memory Interleaving with HAV Extensions

Page 16: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

16

R&R the Memory Interleaving with HAV Extensions

Observations

Chunk commit is a time-consuming

process

The time consumed on waiting for

this lock is excessive (40%)

Update write-back operation involves

serious performance degradation

P0

ChunkComplete

Wait forLock

DetectConflict

Broadcast Updates

Subsequent Chunk

Lock

Obtain R&W-set

Write-backUpdates

Page 17: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

17

Reduce Lock Granularity with a Decentralized

Three-Phase Commit Protocol

Pages committed concurrently by

different chunks have no intersection

Move this out of the synchronized

block

Chunk committing out-of-order

R&R the Memory Interleaving with HAV ExtensionsP0

ChunkComplete

Wait forLock

DetectConflict

Broadcast Updates

Write-backUpdates

Subsequent Chunk

Lock

Insert into committing list

Update Chunk Info

Check Committing List

Obtain R&W-set

Page 18: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

18

Replay Memory Interleaving

Guarantee all chunks will be properly re-built and executed in the original

order

1. Truncate a chunk at the recorded timestamp (hardware performance

counter)

2. Ensure that all preceding chunks have been committed successfully

before the subsequent chunk starts

Allowing processors execute concurrently in replay

R&R the Memory Interleaving with HAV Extensions

Page 19: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Introduction1

Background &Motivation2

Samsara Overview3

R & R the Memory Interleaving with HAV4

19

Conclusion6

Evaluation5

Table of Contents

Page 20: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Evaluation

20

Experimental Setup

4-core Intel Core i7-4790 processor, 12GB memory, 1TB Hard Drive

Host: Ubuntu 12.04 with Linux kernel version 3.11.0 and Qemu-1.2.2

Guest: Ubuntu 14.04 with Linux kernel version 3.13.0

Workload Application Domain Parallelization Granularity

Data UsageWorking Set

Sharing Exchange

blackscholes Financial Analysis coarse low low small

bodytrack Computer Vision medium high medium medium

raytrace Rendering medium high low unbounded

swaptions Financial Analysis coarse low low medium

freqmine Data Mining coarse high medium unbounded

x264 Media Processing coarse high high medium

Workloads

PARSEC 3.0

Page 21: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Evaluation

21

Log Size

Samsara generates log at an average rate of 0.127 MB/s and 0.151 MB/s for

recoding two and four processors, respectively

Reduce 90% log size (even smaller than hardware-based schemes)

For comparison : 0.131 MB/s (single processor)

Page 22: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Evaluation

22

Log Size

The size of the chunk commit log is practically negligible compared with

other non-deterministic events

2.33% with 2 processors and 7.37% with 4 processors on average

Page 23: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Evaluation

23

Recording Overhead Compared to Native Execution

Measure the overhead of the system relative to the base platform it runs on

(can not reflect the actual execution time of the system in real life)

2.6X and 5.0X for recording these workloads on two and four processors

Page 24: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Conclusion

24

We made the first attempt to leverage HAV extensions to achieve an efficient

software-based replay system

Record processors’ execution as a series of chunks

Avoid the large number of memory access detections by performing a

single EPT traversal at the end of each chunk

Propose a decentralized three-phase commit protocol to reduce the

lock granularity of the chunk commit process

Page 25: Samsara: Efficient Deterministic Replay with Hardware Virtualization Extensions Peking University Shiru Ren, Chunqi Li, Le Tan, and Zhen Xiao July 27 ,

Thanks

25