Fault-tolerant Stream Processing using a Distributed, Replicated File System
Transcript of Fault-tolerant Stream Processing using a Distributed, Replicated File System
CMPE.516
TERM PRESENTATION
ERKAN ÇETİNER, 2008700030
Fault-tolerant Stream Processing using a Distributed, Replicated File System
OUTLINE
Brief Information
Introduction
Background
Stream Processing Engines
Distributed, Replicated File System
SGuard Overview
MMM Memory Manager
Peace Scheduler
Evaluation
Related Work
Conclusion
Brief Information
SGuard – a new fault-tolerance technique for distributed stream processing engines (SPEs) running in clusters of commodity servers
SGuard is less disruptive to normal stream processing and leaves more resources available for it
Based on ROLLBACK RECOVERY: it checkpoints the state of stream processing nodes periodically and restarts failed nodes from their most recent checkpoints
Compared to previous proposals, SGuard performs checkpointing asynchronously:
Operators continue processing streams during checkpoints
This reduces the potential disruption due to checkpointing activity
Introduction
Problem Definition
Today’s web and Internet services (e-mail, instant messaging, online games, etc.) must handle millions of users distributed across the world
To manage these large-scale systems, service providers must continuously monitor the health of their infrastructures, the quality of service experienced by users, and any potential malicious activities
SOLUTION -> Stream Processing Engines (SPEs)
Well suited to monitoring, since they process data streams continuously and with low latency
RESULTING PROBLEM -> Requires running these SPEs in clusters of commodity servers
Important challenge = fault tolerance
Introduction
As the size of a cluster grows, one or more servers may fail
These failures may include both software crashes and hardware failures
A server may never recover and, worse, multiple servers may fail simultaneously
A failure may cause a distributed SPE to block or to produce erroneous results
-> Negative impact on applications
To address this problem and create a highly-available SPE, several fault-tolerance techniques have been proposed
These techniques are based on replication -> the state of stream processing operators is replicated on multiple servers
Introduction
Two proposed approaches:
1 - Active replication: all servers actively process the same input streams at the same time
2 - Passive standby: only one primary server performs processing, while the others passively hold replicated state in memory until the primary fails
The replicated state is periodically refreshed through a process called checkpointing, to ensure that backup nodes hold a recent copy of the primary’s state
These techniques yield excellent protection from failures, at relatively high cost:
When all processing is replicated, normal stream processing is slowed down
When replicas only keep snapshots of the primary’s state without performing any processing, the CPU overhead is lower, but the cluster loses at least half of its memory capacity
Introduction
What does SGuard bring? An SPE with fault tolerance at much lower cost, maintaining the efficiency of normal SPE operations
Based on 3 key innovations:
1. Memory Resources:
SPEs process streams in memory -> memory is a critical resource
To save memory resources, SGuard saves checkpoints to disk
Stable storage was previously not considered, since saving to disk was thought to be too slow for streaming applications
SGuard partly overcomes this performance challenge by using a new generation of distributed, replicated file systems (DFSs), such as the Google File System and Hadoop DFS
These new DFSs are optimized for reads and writes of large data volumes and for append-style workloads
They maintain high availability in the face of disk failures
Each SPE node now serves as both a stream processing node and a file system node
First innovation = use of a DFS as stable storage for SPE checkpoints
Introduction
2. Resource Contention Problems:
When many nodes have large states to checkpoint, they contend for disk and network bandwidth resources, slowing down individual checkpoints
To address this resource contention problem, SGuard extends the DFS master node with a new type of scheduler called PEACE
Given a queue of write requests, PEACE selects the replicas for each data chunk and schedules the writes in a manner that significantly reduces the latency of individual writes, while only modestly increasing the completion time for all writes in the queue
It schedules only as many concurrent writes as there are available resources -> avoids resource contention
Introduction
3. Transparent Checkpoints:
To make checkpoints transparent, SGuard introduces a new memory management middleware (MMM)
It enables checkpointing the state of an SPE operator while the operator continues executing
MMM partitions the SPE memory into “application-level pages” and uses copy-on-write to perform asynchronous checkpoints
Pages are written to disk without requiring any expensive translation
MMM checkpoints are more efficient and less disruptive to normal stream processing than previous schemes
Background
Stream Processing Engines
In an SPE:
A query takes the form of a loop-free, directed graph of operators
Each operator processes data arriving on its input streams and produces data on its output streams
Processing is done in memory, without going to disk
Query graphs are called “query diagrams” and can be distributed across multiple servers in LANs or WANs
Each server runs one instance of the SPE, referred to as a processing node
Background
Stream Processing Engines
Goal: to develop a fault-tolerance scheme that does not require reserving half the memory in a cluster for fault tolerance -> save checkpoints to disk
To meet the challenge of high latency when writing to disks, SGuard extends the DFS concept
Distributed, Replicated File System
To manage large data volumes, new types of large-scale file systems (or data stores in general) are emerging
They bring two important properties:
Background
Distributed, Replicated File System
Support for Large Data Files
Designed to operate on large, multi-GB files
Optimized for bulk read operations
Files are split into chunks; different chunks are stored on different nodes
A single master node makes all chunk placement decisions
Fault Tolerance through Replication
These systems assume frequent failures
To protect data, each data chunk is replicated on multiple nodes using synchronous master replication:
The client sends data to the closest replica, which forwards it to the next closest replica in a chain, and so on
Once all replicas have the data, the client is notified that the write has completed
SGuard relies on this automatic replication to protect the distributed SPE against multiple simultaneous failures
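The chain (pipelined) replication described above can be sketched as follows; the server names, chunk contents, and recursive forwarding are illustrative, not taken from any specific DFS implementation:

```python
# Sketch of pipelined (chain) replication as used by DFSs like GFS/HDFS.

def pipeline_write(chunk: bytes, replicas: list) -> bool:
    """Send the chunk to the closest replica; each replica stores a copy
    and forwards the data to the next replica in the chain."""
    stores = {}

    def forward(data, chain):
        if not chain:
            return True                 # end of chain: all replicas have the data
        head, rest = chain[0], chain[1:]
        stores[head] = data             # head stores its copy...
        return forward(data, rest)      # ...and forwards downstream

    ok = forward(chunk, replicas)
    # the client is acknowledged only once every replica holds the data
    assert ok and all(r in stores for r in replicas)
    return ok

pipeline_write(b"chunk-0", ["svr1.2", "svr2.2", "svr2.3"])
```

The writer thus consumes outbound bandwidth for only one copy; the remaining copies travel replica-to-replica.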
Background
SGuard Overview
SGuard works as follows: each SPE node takes periodic checkpoints of its state and writes these checkpoints to stable storage
SGuard uses the DFS as stable storage
To take a checkpoint, a node suspends its processing and makes a copy of its state (we will see later how to avoid this suspension)
Between checkpoints, each node logs the output tuples it produces
When a failure occurs, a backup node recovers by reading the most recent checkpoint from stable storage and reprocessing all input tuples received since then
After a checkpoint, a node sends a message to all its upstream neighbors to inform them of the checkpointed input tuples, so they can trim their output logs
Because nodes buffer output tuples, they can checkpoint their states independently and still ensure consistency in the face of a failure
Within a node, however, all interconnected operators must be suspended and checkpointed at the same time to capture a consistent snapshot of the SPE state
Interconnected groups of operators are called HA units
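The checkpoint-and-replay recovery described above can be sketched in a few lines; the `Node` class, the running-sum operator state, and the log layout are illustrative assumptions, not SGuard's actual interfaces:

```python
# Minimal sketch of upstream output logging and replay-based recovery.

class Node:
    def __init__(self):
        self.state = 0          # operator state: here, a running sum
        self.out_log = []       # output tuples buffered for downstream recovery
        self.ckpt = (0, 0)      # (state, number of input tuples consumed)
        self.consumed = 0

    def process(self, t):
        self.state += t
        self.consumed += 1
        self.out_log.append(t)  # logged so a failed downstream node can replay

    def checkpoint(self):
        self.ckpt = (self.state, self.consumed)
        # at this point the upstream neighbour could trim its output log

    def recover(self, upstream_log):
        # restart from the last checkpoint, replay every tuple since then
        self.state, self.consumed = self.ckpt
        for t in upstream_log[self.consumed:]:
            self.process(t)

upstream = [1, 2, 3, 4, 5]
n = Node()
for t in upstream[:3]:
    n.process(t)
n.checkpoint()              # snapshot taken after three tuples
for t in upstream[3:]:
    n.process(t)
n.recover(upstream)         # simulate failure: replay tuples 4 and 5
assert n.state == sum(upstream)
```

Because the upstream log covers everything since the checkpoint, nodes need not coordinate their checkpoints with one another.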
SGuard Overview
The level of replication of the DFS determines the level of fault tolerance of SGuard: with k+1 replicas, k simultaneous failures can be tolerated
Problem: the impact of checkpoints on the normal stream processing flow
Since operators must be suspended during checkpoints, each checkpoint introduces extra latency into the result stream
This latency is further increased by writing checkpoints to disk rather than keeping them in memory
To address these challenges, SGuard uses:
MMM (Memory Management Middleware)
The PEACE scheduler
MMM Memory Manager
Role of MMM:
To checkpoint operators’ states without first serializing them into a buffer
To enable concurrent checkpoints, where the state of an operator is copied to disk while the operator continues its execution
How MMM works:
It partitions the SPE memory into a collection of “pages” (large fragments of memory allocated on the heap)
Operator states are stored inside these pages
To checkpoint the state of an operator, its pages are copied to disk
To enable an operator to execute during the checkpoint:
When the checkpoint begins, the operator is briefly suspended and all its pages are marked read-only
The operator execution is then resumed, and pages are written to disk in the background
If the operator touches a page that has not yet been written to disk, it is briefly interrupted while the MMM makes a copy of the page
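A minimal sketch of this copy-on-write page mechanism, assuming byte-array pages and a background writer; all names (`PageManager`, `flush_page`) are illustrative, not MMM's actual API:

```python
# Sketch of copy-on-write checkpointing over application-level pages.

class PageManager:
    def __init__(self, n_pages, page_size=4096):
        self.pages = [bytearray(page_size) for _ in range(n_pages)]
        self.read_only = [False] * n_pages
        self.snapshot = {}      # COW copies taken during a checkpoint

    def begin_checkpoint(self):
        # brief suspension: mark every page read-only; nothing is copied yet
        self.read_only = [True] * len(self.pages)
        self.snapshot = {}

    def write(self, page_id, offset, data):
        if self.read_only[page_id]:
            # operator touched a page not yet flushed: copy it first (COW)
            self.snapshot[page_id] = bytes(self.pages[page_id])
            self.read_only[page_id] = False
        self.pages[page_id][offset:offset + len(data)] = data

    def flush_page(self, page_id):
        # background writer persists either the COW copy or the live page
        img = self.snapshot.pop(page_id, None) or bytes(self.pages[page_id])
        self.read_only[page_id] = False
        return img              # in SGuard this image would go to the DFS

pm = PageManager(2)
pm.pages[0][:5] = b"hello"
pm.begin_checkpoint()
pm.write(0, 0, b"HELLO")            # triggers a copy of the pre-checkpoint page
assert pm.flush_page(0)[:5] == b"hello"   # checkpoint sees checkpoint-time state
assert bytes(pm.pages[0][:5]) == b"HELLO" # live state keeps advancing
```

Only pages the operator actually touches before they are flushed pay the copy cost, which is why the checkpoint stays asynchronous and cheap.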
MMM Memory Manager
Page Manager (PM):
Allocates, frees, and checkpoints pages
Data Structures:
Implement data structure abstractions on top of the PM’s page abstraction
Page Layout:
Simplifies the implementation of new data structures on top of the PM
PEACE Scheduler
Problem:
When many nodes try to checkpoint HA units at the same time, they contend for network and disk resources, slowing down individual checkpoints
PEACE’s Solution:
PEACE runs inside the Coordinator at the DFS layer
Given a set of requests to write one or more data chunks, PEACE schedules the writes in a manner that reduces the time to write each set of chunks while keeping the total time for completing all writes small, by:
Scheduling only as many concurrent writes as there are available resources
Scheduling all writes from the same set close together
Selecting the destination for each write in a manner that avoids resource contention
Advantage:
Adding the scheduler at the DFS layer rather than the application layer means the scheduler can better control and allocate resources
PEACE Scheduler
PEACE takes as input a model of the network in the form of a directed graph G = (V, E), where V is the set of vertices and E the set of edges
PEACE Scheduler
All write requests are handled by the following algorithm:
Nodes submit write requests to the Coordinator in the form of triples (w, r, k)
The algorithm iterates over all requests
For each replica of each chunk, it selects the best destination node by solving a min-cost max-flow problem over the graph G, extended with a source node s and a destination node d
s is connected to the writer node with capacity 1
d is connected to all servers except the writer node with infinite capacity
The algorithm finds a minimum-cost, maximum-flow path from s to d
To ensure that different chunks and different replicas are written to different nodes, the algorithm selects the destination for one chunk replica at a time
All replicas of a chunk must be written for the whole write task to complete
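A simplified sketch of this per-replica destination selection: the network is modeled as a unit-capacity directed graph, and each placement consumes one augmenting path from the writer to a virtual sink d. Edge costs and write pipelining are omitted for brevity (PEACE solves a full min-cost max-flow); the node names and topology below are illustrative:

```python
from collections import deque

def find_path(cap, s, d):
    """BFS for an s -> d path with remaining capacity (one augmenting step)."""
    prev = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == d:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v, c in cap.get(u, {}).items():
            if c > 0 and v not in prev:
                prev[v] = u
                q.append(v)
    return None

def place_replica(cap, writer, servers):
    """Pick one destination for one chunk replica, consuming path capacity."""
    # rebuild the virtual sink: every server except the writer links to d
    cap['d'] = {}
    for srv in servers:
        cap.setdefault(srv, {}).pop('d', None)
        if srv != writer:
            cap[srv]['d'] = float('inf')
    path = find_path(cap, writer, 'd')
    if path is None:
        return None          # no remaining capacity: defer to the next timestep
    for u, v in zip(path, path[1:]):
        cap[u][v] -= 1       # consume unit capacity along the chosen route
    return path[-2]          # destination = the last server before the sink

# two racks: svr1.* share switch sw1, svr2.* share sw2, sw1 <-> sw2 link
cap = {
    'svr1.1': {'sw1': 1}, 'svr1.2': {'sw1': 1},
    'svr2.1': {'sw2': 1}, 'svr2.2': {'sw2': 1},
    'sw1': {'svr1.1': 1, 'svr1.2': 1, 'sw2': 1},
    'sw2': {'svr2.1': 1, 'svr2.2': 1, 'sw1': 1},
}
servers = ['svr1.1', 'svr1.2', 'svr2.1', 'svr2.2']
assert place_replica(cap, 'svr1.1', servers) == 'svr1.2'  # nearest free node
assert place_replica(cap, 'svr1.1', servers) is None      # uplink used: deferred
```

Because each placement subtracts capacity from the edges it uses, later placements are automatically steered away from contended links, which is exactly the contention-avoidance behavior described above.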
PEACE Scheduler
Two properties have to be satisfied:
First, to exploit network bandwidth resources efficiently, a node does not send a chunk directly to all replicas. Instead, it sends it to only one replica, which then transmits the data to the others in a pipeline
Second, to reduce correlations between failures, the DFS places replicas at different locations in the network (e.g., one copy on the same rack and another on a different rack)
Example: all edges have capacity 1; svr1.1 and svr2.1 each request to write 1 chunk with replication factor 2
First, the request from svr1.1 is processed: the algorithm assigns the first replica to svr1.2 and the second to svr2.2 (constraint: replicas have to be on different racks)
Then, when the request from svr2.1 is processed, no path can be found for its first replica on the same rack (constraint: the first replica must be on the same rack)
Thus it is processed at time t+1
PEACE Scheduler
File-System Write Protocol
Each node submits write requests to the Coordinator, indicating the number of chunks it needs to write
PEACE schedules the chunks
The Coordinator then uses callbacks to let the client know when and where to write each chunk
To keep track of the progress of writes, each node informs the Coordinator every time a write completes
Once a fraction of all scheduled writes has completed, PEACE declares the end of a timestep and moves on to the next timestep
If the schedule does not allow a client to start writing right away, the Coordinator returns an error message
In SGuard, when this message is received, the PM cancels the checkpoint by marking all pages as read-writable again
The state of the HA unit is checkpointed again when the node can finally start writing
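The timestep bookkeeping described above can be sketched as follows; the `Coordinator` class, its counters, and the 80% threshold (the fraction used in the evaluation) are illustrative assumptions, not the actual HDFS extension:

```python
# Sketch of the Coordinator's timestep logic: a timestep ends once a
# fraction of its scheduled writes have completed.

class Coordinator:
    def __init__(self, fraction=0.8):
        self.fraction = fraction    # completion fraction that ends a timestep
        self.timestep = 0
        self.scheduled = 0
        self.completed = 0

    def schedule(self, n_chunks):
        self.scheduled += n_chunks
        # callbacks telling each client when/where to write would go here

    def write_completed(self):
        self.completed += 1
        if self.completed >= self.fraction * self.scheduled:
            # enough writes done: declare the end of this timestep
            self.timestep += 1
            self.scheduled, self.completed = 0, 0

c = Coordinator()
c.schedule(10)
for _ in range(8):          # 8 of 10 writes (80%) complete
    c.write_completed()
assert c.timestep == 1
```

Waiting for only a fraction of writes keeps one straggling node from stalling every other node's checkpoint.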
EVALUATION
MMM Evaluation
MMM is implemented as a C++ library of about 9K lines of code
The Borealis distributed SPE was modified to use it: 500 lines of code added to Borealis
To evaluate PEACE, HDFS was used: 3K lines of code added to HDFS
EVALUATION
MMM Evaluation: Runtime Operator Overhead
The overhead of using the MMM to hold the state of an operator, compared with using a standard data-structures library such as the STL
The study covers Join and Aggregate, the two most common stateful operators
Executed on a dual 2.66 GHz quad-core machine with 16 GB RAM, running a 32-bit Linux 2.6.22 kernel, with a single 500 GB commodity SATA2 hard disk
JOIN: performs a self-join of a stream using an equality predicate on unique tuple identifiers
The overhead of using MMM for the JOIN operator is within 3% for all 3 window sizes
EVALUATION
MMM Evaluation: Runtime Operator Overhead
AGGREGATE: groups input tuples by a 4-byte group name and computes count, min, and max on a timestamp field
MMM is more efficient for a large number of groups because it forces the allocation of larger memory chunks at a time, amortizing the per-group memory allocation cost
Negligible impact on operator performance
Adopting MMM required changing only 13 lines of JOIN operator code and 120 lines of AGGREGATE operator code
EVALUATION
MMM Evaluation: Cost of Checkpoint Preparation
Before the state of an HA unit can be saved to disk, it must be prepared
This requires either serializing the state, or copying all PageIDs and marking all pages as read-only
JOIN: the overhead is linear in the size of the checkpointed state; avoiding serialization is beneficial
EVALUATION
MMM Evaluation: Checkpoint Overhead
Overall runtime overhead, measured as the end-to-end stream processing latency while the SPE checkpoints the state of one operator
AGGREGATE is used; it shows worst-case performance since it touches pages randomly
The MMM is compared against synchronously serializing state and asynchronously serializing state
Also compared with an off-the-shelf VM (VMware)
The operator is fed 2.0K tuples/sec for 10 min while checkpointing every minute
The interruption caused by MMM is 2.84 times lower than that of its nearest competitor
EVALUATION
MMM Evaluation: Checkpoint Overhead
A common technique for reducing checkpointing overhead is to partition the state of an operator
Here, the Aggregate operator is split into 4 partitions
EVALUATION
MMM Evaluation: Checkpoint Overhead
To hide the synchronous and asynchronous serialization overhead, the operator must be split into 64 partitions
This adds the overhead of managing all the partitions and may not always be possible
MMM is the least disruptive to normal stream processing
EVALUATION
MMM Evaluation: Recovery Performance
Once the Coordinator detects a failure and selects a recovery node, recovery proceeds in 4 steps:
1. The recovery node reads the checkpointed state from the DFS
2. It reconstructs the Page Manager state
3. It reconstructs any additional Data Structure state
4. It replays the tuples logged at upstream HA units
Total recovery time = sum of the 4 steps + failure detection time + recovery node selection time
The overhead is negligible: MMM recovery imposes little extra overhead once pages are read from disk
EVALUATION
PEACE Evaluation
Evaluates the performance of reading checkpoints from the DFS and writing them to the DFS
Cluster of 17 machines in two racks (8 and 9 machines each)
All machines in a rack share a gigabit Ethernet switch and run the HDFS data node program
The two racks are connected by a gigabit Ethernet link
First, measure the performance of an HDFS client writing 1 GB of data to the HDFS cluster using varying degrees of parallelism:
As the number of threads increases, the total throughput at the client also increases, until network capacity is reached with 4 threads
It takes more than one thread to saturate the network because the client computes a checksum for each data chunk that it writes
With 4 threads, a client reads at 95 MB/s, so it can recover a 128 MB HA unit in 2 seconds
When more clients access the same data node, its performance drops significantly
EVALUATION
PEACE Evaluation
Measures the time to write checkpoints to the DFS when multiple nodes checkpoint their states simultaneously; the data to checkpoint is already prepared
The number of concurrently checkpointing nodes is varied from 8 to 16, and the size of each checkpoint from 128 MB to 512 MB
The replication level is set to 3; performance is compared against the original HDFS implementation
PEACE waits until 80% of the writes in a timestep complete before starting the next one
Result: the global completion time increases for larger aggregate volumes of I/O, but individual tasks complete their I/O much faster than in the original HDFS
EVALUATION
PEACE Evaluation
PEACE significantly reduces the write latency of individual writes, with only a small decrease in overall resource utilization
Since it now takes longer for all nodes to checkpoint their states, the maximum checkpoint frequency is reduced
This leads to longer recoveries, as more tuples need to be reprocessed after a failure
This is the correct trade-off, since recovery with passive standby imposes a small delay on streams anyway
Related Work
Semi-transparent approaches:
The C3 application-level checkpointing system
OODBMS storage managers
BerkeleyDB
CONCLUSION
SGuard leverages a new type of DFS to provide efficient fault tolerance at a lower cost than previous proposals
SGuard extends the DFS with PEACE, a new scheduler that reduces the time to write individual checkpoints in the face of high contention
SGuard also improves the transparency of SPE checkpoints through the Memory Management Middleware, which enables efficient asynchronous checkpointing
The performance of SGuard is promising:
With PEACE and the DFS, nodes in a 17-server cluster can each checkpoint 512 MB of state in less than 20 s
MMM efficiently hides this checkpointing activity
References
Y. Kwon, M. Balazinska, and A. Greenberg. Fault-Tolerant Stream Processing Using a Distributed, Replicated File System. PVLDB, 1(1):574–585, 2008.
M. Balazinska, H. Balakrishnan, S. R. Madden, and M. Stonebraker. Fault-Tolerance in the Borealis Distributed Stream Processing System. ACM Transactions on Database Systems, 33(1), 2008.