Fault-tolerant Stream Processing using a Distributed, Replicated File System
Transcript of Fault-tolerant Stream Processing using a Distributed, Replicated File System
CMPE.516
TERM PRESENTATION
ERKAN ÇETİNER, 2008700030
Fault-tolerant Stream Processing using a Distributed, Replicated File System
OUTLINE
Brief Information
Introduction
Background
Stream Processing Engines
Distributed, Replicated File System
SGuard Overview
MMM Memory Manager
Peace Scheduler
Evaluation
Related Work
Conclusion
Brief Information
SGuard – a new fault-tolerance technique for distributed stream processing engines (SPEs) running in clusters of commodity servers
SGuard is less disruptive to normal stream processing and leaves more resources available for it
Based on ROLLBACK RECOVERY: it checkpoints the state of stream processing nodes periodically and restarts failed nodes from their most recent checkpoints
Compared to previous proposals, SGuard performs checkpointing asynchronously:
Operators continue processing streams during checkpoints
This reduces the potential disruption due to checkpointing activity
Introduction
Problem Definition
Today’s web and Internet services (e-mail, instant messaging, online games, etc.) must handle millions of users distributed across the world
To manage these large-scale systems, service providers must continuously monitor the health of their infrastructures, the quality of service experienced by users, and any potential malicious activities
SOLUTION -> Stream Processing Engines (SPEs)
Well suited to monitoring, since they process data streams continuously and with low latency
RESULTING PROBLEM -> Requires running these SPEs in clusters of commodity servers
Important challenge = fault tolerance
Introduction
As the size of a cluster grows, one or more servers may fail
These failures may include both software crashes and hardware failures
A server may never recover and, worse, multiple servers may fail simultaneously
A failure may cause a distributed SPE to block or to produce erroneous results
-> Negative impact on applications
To address this problem and create a highly-available SPE, several fault-tolerance techniques have been proposed
These techniques are based on replication -> the state of stream processing operators is replicated on multiple servers
Introduction
Two proposed approaches:
1 - Active replication: all servers actively process the same input streams at the same time
2 - Passive standby: only one primary server performs processing, while the others passively hold replicated state in memory until the primary fails
The replicated state is periodically refreshed through a process called checkpointing, to ensure that backup nodes hold a recent copy of the primary’s state
These techniques yield excellent protection from failures, at relatively high cost:
When all processing is replicated, normal stream processing is slowed down
When replicas only keep snapshots of the primary’s state without performing any processing, the CPU overhead is lower, but the cluster loses at least half of its memory capacity
Introduction
What does SGuard bring? An SPE with fault tolerance at much lower cost, maintaining the efficiency of normal SPE operations
Based on 3 key innovations:
1. Memory Resources:
SPEs process streams in memory -> memory is a critical resource
To save memory resources, SGuard saves checkpoints to disk
Stable storage was previously not considered, since saving to disk was thought to be too slow for streaming applications
SGuard partly overcomes this performance challenge by using a new generation of distributed, replicated file systems (DFSs), such as the Google File System and Hadoop DFS
These new DFSs are optimized for reads and writes of large data volumes and for append-style workloads
They maintain high availability in the face of disk failures
Each SPE node now serves as both a stream processing node and a file system node
First innovation = use of a DFS as stable storage for SPE checkpoints
Introduction
2. Resource Contention Problems:
When many nodes have large states to checkpoint, they contend for disk and network bandwidth resources, slowing down individual checkpoints
To address this resource contention problem, SGuard extends the DFS master node with a new type of scheduler called PEACE
Given a queue of write requests, PEACE selects the replicas for each data chunk and schedules the writes in a manner that significantly reduces the latency of individual writes, while only modestly increasing the completion time for all writes in the queue
It schedules only as many concurrent writes as there are available resources -> avoids resource contention
Introduction
3. Transparent Checkpoints:
To make checkpoints transparent, SGuard introduces a new memory management middleware (MMM)
It enables checkpointing the state of an SPE operator while the operator continues executing
MMM partitions the SPE memory into “application-level pages” and uses copy-on-write to perform asynchronous checkpoints
Pages are written to disk without requiring any expensive translation
MMM checkpoints are more efficient and less disruptive to normal stream processing than previous schemes
Background
Stream Processing Engines
In an SPE:
A query takes the form of a loop-free, directed graph of operators
Each operator processes data arriving on its input streams and produces data on its output streams
Processing is done in memory, without going to disk
Query graphs are called “query diagrams” and can be distributed across multiple servers in LANs or WANs
Each server runs one instance of the SPE, referred to as a processing node
Background
Stream Processing Engines
Goal: to develop a fault-tolerance scheme that does not require reserving half the memory in a cluster for fault tolerance -> save checkpoints to disk
To meet the challenge of high latency when writing to disks, SGuard extends the DFS concept
Distributed, Replicated File System
To manage large data volumes, new types of large-scale file systems (or data stores in general) are emerging
They bring two important properties:
Background
Distributed, Replicated File System
Support for Large Data Files
Designed to operate on large, multi-GB files
Optimized for bulk read operations
Files are split into chunks; different chunks are stored on different nodes
A single master node makes all chunk placement decisions
Fault Tolerance through Replication
These systems assume frequent failures
To protect data, each data chunk is replicated on multiple nodes using synchronous master replication:
The client sends data to the closest replica, which forwards it to the next closest replica in a chain, and so on
Once all replicas have the data, the client is notified that the write has completed
SGuard relies on this automatic replication to protect the distributed SPE against multiple simultaneous failures
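The chain (pipelined) replication described above can be sketched as follows; the server names, chunk contents, and recursive forwarding are illustrative, not taken from any specific DFS implementation:

```python
# Sketch of pipelined (chain) replication as used by DFSs like GFS/HDFS.

def pipeline_write(chunk: bytes, replicas: list) -> bool:
    """Send the chunk to the closest replica; each replica stores a copy
    and forwards the data to the next replica in the chain."""
    stores = {}

    def forward(data, chain):
        if not chain:
            return True                 # end of chain: all replicas have the data
        head, rest = chain[0], chain[1:]
        stores[head] = data             # head stores its copy...
        return forward(data, rest)      # ...and forwards downstream

    ok = forward(chunk, replicas)
    # the client is acknowledged only once every replica holds the data
    assert ok and all(r in stores for r in replicas)
    return ok

pipeline_write(b"chunk-0", ["svr1.2", "svr2.2", "svr2.3"])
```

The writer thus consumes outbound bandwidth for only one copy; the remaining copies travel replica-to-replica.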
Background
SGuard Overview
SGuard works as follows: each SPE node takes periodic checkpoints of its state and writes these checkpoints to stable storage
SGuard uses the DFS as stable storage
To take a checkpoint, a node suspends its processing and makes a copy of its state (we will see later how to avoid this suspension)
Between checkpoints, each node logs the output tuples it produces
When a failure occurs, a backup node recovers by reading the most recent checkpoint from stable storage and reprocessing all input tuples received since then
After a checkpoint, a node sends a message to all its upstream neighbors to inform them of the checkpointed input tuples, so they can trim their output logs
Because nodes buffer output tuples, they can checkpoint their states independently and still ensure consistency in the face of a failure
Within a node, however, all interconnected operators must be suspended and checkpointed at the same time to capture a consistent snapshot of the SPE state
Interconnected groups of operators are called HA units
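The checkpoint-and-replay recovery described above can be sketched in a few lines; the `Node` class, the running-sum operator state, and the log layout are illustrative assumptions, not SGuard's actual interfaces:

```python
# Minimal sketch of upstream output logging and replay-based recovery.

class Node:
    def __init__(self):
        self.state = 0          # operator state: here, a running sum
        self.out_log = []       # output tuples buffered for downstream recovery
        self.ckpt = (0, 0)      # (state, number of input tuples consumed)
        self.consumed = 0

    def process(self, t):
        self.state += t
        self.consumed += 1
        self.out_log.append(t)  # logged so a failed downstream node can replay

    def checkpoint(self):
        self.ckpt = (self.state, self.consumed)
        # at this point the upstream neighbour could trim its output log

    def recover(self, upstream_log):
        # restart from the last checkpoint, replay every tuple since then
        self.state, self.consumed = self.ckpt
        for t in upstream_log[self.consumed:]:
            self.process(t)

upstream = [1, 2, 3, 4, 5]
n = Node()
for t in upstream[:3]:
    n.process(t)
n.checkpoint()              # snapshot taken after three tuples
for t in upstream[3:]:
    n.process(t)
n.recover(upstream)         # simulate failure: replay tuples 4 and 5
assert n.state == sum(upstream)
```

Because the upstream log covers everything since the checkpoint, nodes need not coordinate their checkpoints with one another.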
SGuard Overview
The level of replication of the DFS determines the level of fault tolerance of SGuard: with k+1 replicas, k simultaneous failures can be tolerated
Problem: the impact of checkpoints on the normal stream processing flow
Since operators must be suspended during checkpoints, each checkpoint introduces extra latency into the result stream
This latency is further increased by writing checkpoints to disk rather than keeping them in memory
To address these challenges, SGuard uses:
MMM (Memory Management Middleware)
The PEACE scheduler
MMM Memory Manager
Role of MMM:
To checkpoint operators’ states without first serializing them into a buffer
To enable concurrent checkpoints, where the state of an operator is copied to disk while the operator continues its execution
How MMM works:
It partitions the SPE memory into a collection of “pages” (large fragments of memory allocated on the heap)
Operator states are stored inside these pages
To checkpoint the state of an operator, its pages are copied to disk
To enable an operator to execute during the checkpoint:
When the checkpoint begins, the operator is briefly suspended and all its pages are marked read-only
The operator execution is then resumed, and pages are written to disk in the background
If the operator touches a page that has not yet been written to disk, it is briefly interrupted while the MMM makes a copy of the page
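A minimal sketch of this copy-on-write page mechanism, assuming byte-array pages and a background writer; all names (`PageManager`, `flush_page`) are illustrative, not MMM's actual API:

```python
# Sketch of copy-on-write checkpointing over application-level pages.

class PageManager:
    def __init__(self, n_pages, page_size=4096):
        self.pages = [bytearray(page_size) for _ in range(n_pages)]
        self.read_only = [False] * n_pages
        self.snapshot = {}      # COW copies taken during a checkpoint

    def begin_checkpoint(self):
        # brief suspension: mark every page read-only; nothing is copied yet
        self.read_only = [True] * len(self.pages)
        self.snapshot = {}

    def write(self, page_id, offset, data):
        if self.read_only[page_id]:
            # operator touched a page not yet flushed: copy it first (COW)
            self.snapshot[page_id] = bytes(self.pages[page_id])
            self.read_only[page_id] = False
        self.pages[page_id][offset:offset + len(data)] = data

    def flush_page(self, page_id):
        # background writer persists either the COW copy or the live page
        img = self.snapshot.pop(page_id, None) or bytes(self.pages[page_id])
        self.read_only[page_id] = False
        return img              # in SGuard this image would go to the DFS

pm = PageManager(2)
pm.pages[0][:5] = b"hello"
pm.begin_checkpoint()
pm.write(0, 0, b"HELLO")            # triggers a copy of the pre-checkpoint page
assert pm.flush_page(0)[:5] == b"hello"   # checkpoint sees checkpoint-time state
assert bytes(pm.pages[0][:5]) == b"HELLO" # live state keeps advancing
```

Only pages the operator actually touches before they are flushed pay the copy cost, which is why the checkpoint stays asynchronous and cheap.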
MMM Memory Manager
Page Manager (PM):
Allocates, frees, and checkpoints pages
Data Structures:
Implement data structure abstractions on top of the PM’s page abstraction
Page Layout:
Simplifies the implementation of new data structures on top of the PM
PEACE Scheduler
Problem:
When many nodes try to checkpoint HA units at the same time, they contend for network and disk resources, slowing down individual checkpoints
PEACE’s Solution:
PEACE runs inside the Coordinator at the DFS layer
Given a set of requests to write one or more data chunks, PEACE schedules the writes in a manner that reduces the time to write each set of chunks while keeping the total time for completing all writes small, by:
Scheduling only as many concurrent writes as there are available resources
Scheduling all writes from the same set close together
Selecting the destination for each write in a manner that avoids resource contention
Advantage:
Adding the scheduler at the DFS layer rather than the application layer means the scheduler can better control and allocate resources
PEACE Scheduler
PEACE takes as input a model of the network in the form of a directed graph G = (V, E), where V is the set of vertices and E the set of edges
PEACE Scheduler
All write requests are handled by the following algorithm:
Nodes submit write requests to the Coordinator in the form of triples (w, r, k)
The algorithm iterates over all requests
For each replica of each chunk, it selects the best destination node by solving a min-cost max-flow problem over the graph G, extended with a source node s and a destination node d
s is connected to the writer node with capacity 1
d is connected to all servers except the writer node with infinite capacity
The algorithm finds a minimum-cost, maximum-flow path from s to d
To ensure that different chunks and different replicas are written to different nodes, the algorithm selects the destination for one chunk replica at a time
All replicas of a chunk must be written for the whole write task to complete
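A simplified sketch of this per-replica destination selection: the network is modeled as a unit-capacity directed graph, and each placement consumes one augmenting path from the writer to a virtual sink d. Edge costs and write pipelining are omitted for brevity (PEACE solves a full min-cost max-flow); the node names and topology below are illustrative:

```python
from collections import deque

def find_path(cap, s, d):
    """BFS for an s -> d path with remaining capacity (one augmenting step)."""
    prev = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == d:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v, c in cap.get(u, {}).items():
            if c > 0 and v not in prev:
                prev[v] = u
                q.append(v)
    return None

def place_replica(cap, writer, servers):
    """Pick one destination for one chunk replica, consuming path capacity."""
    # rebuild the virtual sink: every server except the writer links to d
    cap['d'] = {}
    for srv in servers:
        cap.setdefault(srv, {}).pop('d', None)
        if srv != writer:
            cap[srv]['d'] = float('inf')
    path = find_path(cap, writer, 'd')
    if path is None:
        return None          # no remaining capacity: defer to the next timestep
    for u, v in zip(path, path[1:]):
        cap[u][v] -= 1       # consume unit capacity along the chosen route
    return path[-2]          # destination = the last server before the sink

# two racks: svr1.* share switch sw1, svr2.* share sw2, sw1 <-> sw2 link
cap = {
    'svr1.1': {'sw1': 1}, 'svr1.2': {'sw1': 1},
    'svr2.1': {'sw2': 1}, 'svr2.2': {'sw2': 1},
    'sw1': {'svr1.1': 1, 'svr1.2': 1, 'sw2': 1},
    'sw2': {'svr2.1': 1, 'svr2.2': 1, 'sw1': 1},
}
servers = ['svr1.1', 'svr1.2', 'svr2.1', 'svr2.2']
assert place_replica(cap, 'svr1.1', servers) == 'svr1.2'  # nearest free node
assert place_replica(cap, 'svr1.1', servers) is None      # uplink used: deferred
```

Because each placement subtracts capacity from the edges it uses, later placements are automatically steered away from contended links, which is exactly the contention-avoidance behavior described above.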
PEACE Scheduler
Two properties have to be satisfied:
First, to exploit network bandwidth resources efficiently, a node does not send a chunk directly to all replicas. Instead, it sends it to only one replica, which then transmits the data to the others in a pipeline
Second, to reduce correlations between failures, the DFS places replicas at different locations in the network (e.g., one copy on the same rack and another on a different rack)
Example: all edges have capacity 1; svr1.1 and svr2.1 each request to write 1 chunk with replication factor 2
First, the request from svr1.1 is processed: the algorithm assigns the first replica to svr1.2 and the second to svr2.2 (constraint: replicas have to be on different racks)
Then, when the request from svr2.1 is processed, no path can be found for its first replica on the same rack (constraint: the first replica must be on the same rack)
Thus it is processed at time t+1
PEACE Scheduler
File-System Write Protocol
Each node submits write requests to the Coordinator, indicating the number of chunks it needs to write
PEACE schedules the chunks
The Coordinator then uses callbacks to let the client know when and where to write each chunk
To keep track of the progress of writes, each node informs the Coordinator every time a write completes
Once a fraction of all scheduled writes has completed, PEACE declares the end of a timestep and moves on to the next timestep
If the schedule does not allow a client to start writing right away, the Coordinator returns an error message
In SGuard, when this message is received, the PM cancels the checkpoint by marking all pages as read-writable again
The state of the HA unit is checkpointed again when the node can finally start writing
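The timestep bookkeeping described above can be sketched as follows; the `Coordinator` class, its counters, and the 80% threshold (the fraction used in the evaluation) are illustrative assumptions, not the actual HDFS extension:

```python
# Sketch of the Coordinator's timestep logic: a timestep ends once a
# fraction of its scheduled writes have completed.

class Coordinator:
    def __init__(self, fraction=0.8):
        self.fraction = fraction    # completion fraction that ends a timestep
        self.timestep = 0
        self.scheduled = 0
        self.completed = 0

    def schedule(self, n_chunks):
        self.scheduled += n_chunks
        # callbacks telling each client when/where to write would go here

    def write_completed(self):
        self.completed += 1
        if self.completed >= self.fraction * self.scheduled:
            # enough writes done: declare the end of this timestep
            self.timestep += 1
            self.scheduled, self.completed = 0, 0

c = Coordinator()
c.schedule(10)
for _ in range(8):          # 8 of 10 writes (80%) complete
    c.write_completed()
assert c.timestep == 1
```

Waiting for only a fraction of writes keeps one straggling node from stalling every other node's checkpoint.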
EVALUATION
MMM Evaluation
MMM is implemented as a C++ library of about 9K lines of code
The Borealis distributed SPE was modified to use it: 500 lines of code added to Borealis
To evaluate PEACE, HDFS was used: 3K lines of code added to HDFS
EVALUATION
MMM Evaluation: Runtime Operator Overhead
The overhead of using the MMM to hold the state of an operator, compared with using a standard data-structures library such as the STL
The study covers Join and Aggregate, the two most common stateful operators
Executed on a dual 2.66 GHz quad-core machine with 16 GB RAM, running a 32-bit Linux 2.6.22 kernel, with a single 500 GB commodity SATA2 hard disk
JOIN: performs a self-join of a stream using an equality predicate on unique tuple identifiers
The overhead of using MMM for the JOIN operator is within 3% for all 3 window sizes
EVALUATION
MMM Evaluation: Runtime Operator Overhead
AGGREGATE: groups input tuples by a 4-byte group name and computes count, min, and max on a timestamp field
MMM is more efficient for a large number of groups because it forces the allocation of larger memory chunks at a time, amortizing the per-group memory allocation cost
Negligible impact on operator performance
Adopting MMM required changing only 13 lines of JOIN operator code and 120 lines of AGGREGATE operator code
EVALUATION
MMM Evaluation: Cost of Checkpoint Preparation
Before the state of an HA unit can be saved to disk, it must be prepared
This requires either serializing the state, or copying all PageIDs and marking all pages as read-only
JOIN: the overhead is linear in the size of the checkpointed state; avoiding serialization is beneficial
EVALUATION
MMM Evaluation: Checkpoint Overhead
Overall runtime overhead, measured as the end-to-end stream processing latency while the SPE checkpoints the state of one operator
AGGREGATE is used; it shows worst-case performance since it touches pages randomly
The MMM is compared against synchronously serializing state and asynchronously serializing state
Also compared with an off-the-shelf VM (VMware)
The operator is fed 2.0K tuples/sec for 10 min while checkpointing every minute
The interruption caused by MMM is 2.84 times lower than that of its nearest competitor
EVALUATION
MMM Evaluation: Checkpoint Overhead
A common technique for reducing checkpointing overhead is to partition the state of an operator
Here, the Aggregate operator is split into 4 partitions
EVALUATION
MMM Evaluation: Checkpoint Overhead
To hide the synchronous and asynchronous serialization overhead, the operator must be split into 64 partitions
This adds the overhead of managing all the partitions and may not always be possible
MMM is the least disruptive to normal stream processing
EVALUATION
MMM Evaluation: Recovery Performance
Once the Coordinator detects a failure and selects a recovery node, recovery proceeds in 4 steps:
1. The recovery node reads the checkpointed state from the DFS
2. It reconstructs the Page Manager state
3. It reconstructs any additional Data Structure state
4. It replays the tuples logged at upstream HA units
Total recovery time = sum of the 4 steps + failure detection time + recovery node selection time
The overhead is negligible: MMM recovery imposes little extra overhead once pages are read from disk
EVALUATION
PEACE Evaluation
Evaluates the performance of reading checkpoints from the DFS and writing them to the DFS
Cluster of 17 machines in two racks (8 and 9 machines each)
All machines in a rack share a gigabit Ethernet switch and run the HDFS data node program
The two racks are connected by a gigabit Ethernet link
First, measure the performance of an HDFS client writing 1 GB of data to the HDFS cluster using varying degrees of parallelism:
As the number of threads increases, the total throughput at the client also increases, until network capacity is reached with 4 threads
It takes more than one thread to saturate the network because the client computes a checksum for each data chunk that it writes
With 4 threads, a client reads at 95 MB/s, so it can recover a 128 MB HA unit in 2 seconds
When more clients access the same data node, its performance drops significantly
EVALUATION
PEACE Evaluation
Measures the time to write checkpoints to the DFS when multiple nodes checkpoint their states simultaneously; the data to checkpoint is already prepared
The number of concurrently checkpointing nodes is varied from 8 to 16, and the size of each checkpoint from 128 MB to 512 MB
The replication level is set to 3; performance is compared against the original HDFS implementation
PEACE waits until 80% of the writes in a timestep complete before starting the next one
Result: the global completion time increases for larger aggregate volumes of I/O, but individual tasks complete their I/O much faster than in the original HDFS
EVALUATION
PEACE Evaluation
PEACE significantly reduces the write latency of individual writes, with only a small decrease in overall resource utilization
Since it now takes longer for all nodes to checkpoint their states, the maximum checkpoint frequency is reduced
This leads to longer recoveries, as more tuples need to be reprocessed after a failure
This is the correct trade-off, since recovery with passive standby imposes a small delay on streams anyway
Related Work
Semi-transparent approaches:
The C3 application-level checkpointing system
OODBMS storage managers
BerkeleyDB
CONCLUSION
SGuard leverages a new type of DFS to provide efficient fault tolerance at a lower cost than previous proposals
SGuard extends the DFS with PEACE, a new scheduler that reduces the time to write individual checkpoints in the face of high contention
SGuard also improves the transparency of SPE checkpoints through the Memory Management Middleware, which enables efficient asynchronous checkpointing
The performance of SGuard is promising:
With PEACE and the DFS, nodes in a 17-server cluster can each checkpoint 512 MB of state in less than 20 s
MMM efficiently hides this checkpointing activity
References
Y. Kwon, M. Balazinska, and A. Greenberg. Fault-Tolerant Stream Processing Using a Distributed, Replicated File System. PVLDB, 1(1):574–585, 2008.
M. Balazinska, H. Balakrishnan, S. R. Madden, and M. Stonebraker. Fault-Tolerance in the Borealis Distributed Stream Processing System. ACM Transactions on Database Systems, 33(1), 2008.