Базы данных. ZooKeeper

54
Apache ZooKeeper Курс Базы данных Щитинин Дмитрий Computer Science Center 11 ноября 2013 г. Щитинин Д. А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 1 / 54

Transcript of Базы данных. ZooKeeper

Page 1: Базы данных. ZooKeeper

Apache ZooKeeperКурс «Базы данных»

Щитинин Дмитрий

Computer Science Center

11 ноября 2013 г.

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 1 / 54

Page 2: Базы данных. ZooKeeper

Содержание

1 Introducing

2 Data Model

3 API

4 Usages

5 Architecture

6 Throughput

7 Conclusion

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 2 / 54

Page 3: Базы данных. ZooKeeper

Introducing

Introducing

High-available (decentralized, no SPoF) and high-reliableserivce for

NamingConfiguration managementSynchronizationLeader electionMessage QueueNotification system

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 3 / 54

Page 4: Базы данных. ZooKeeper

Introducing History

Google Chubby

2006, Google published a paper on "Chubby"Google’s primary internal name serviceСommon mechanism for MapReduce, GFS, BigTableUsages:

election a primary from redundant replicasrepository for high availabable files

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 4 / 54

Page 5: Базы данных. ZooKeeper

Introducing History

History

ZooKeeper is the close clone of Chubby:2006-2008 - Developed at Yahoo!1

2008-2010 - Moved to Apache as Hadoop subprojectJan 2011-... - Became top-level project of Apacheprojects system

Commiters: Yahoo!, Cloudera, Google, Facebook,Microsoft, ...2

1http://www.sdtimes.com/content/article.aspx?ArticleID=330112http://zookeeper.apache.org/credits.html

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 5 / 54

Page 6: Базы данных. ZooKeeper

Introducing Data Model

Data Model

Hierarchical tree of nodes (similar to Google Chubby)called "znode"znode can contain either data and subnodesNot high-volume data storage (znode data - max1MB)Access by key (seq of znode’s names "/"delimieted,like FS paths)Watches/Notications of znode changesznodes have version counters and other metadata

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 6 / 54

Page 7: Базы данных. ZooKeeper

Introducing ACID

ACID

Atomicy per node: no partial reads or writesNo rollbacksSequential consistency (clients see a single sequenceof operations)Isolation per nodeDurable (commit log + replication)

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 7 / 54

Page 8: Базы данных. ZooKeeper

Introducing CAP

CAP

Guarantees consistency (writes are linearizable - havetotal order)May not be available for writes (strict quorum basedreplication of them)Partition Tolerance

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 8 / 54

Page 9: Базы данных. ZooKeeper

Introducing Users

Users3

Apache: HBase, HDFS, Hadoop MapReduce, Kafka,SolrKatta (distributed Lucene indexes)Eclipse Communication FrameworkNorbert, LinkedIn, partitioned routing, clustermanagementNeo4j, graph databaseS4, Yahoo, stream processingredis_failover (automatic master/slave failover)Yandex

3https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 9 / 54

Page 10: Базы данных. ZooKeeper

Introducing Users

Use in HBase

Leader ElectionEnsure there is at most 1 active master at any timeConfiguration ManagementStore the bootstrap locationGroup MembershipDiscovers servers and notifies about server death

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 10 / 54

Page 11: Базы данных. ZooKeeper

Introducing Motivation

Motivation

Making up own protocols for coordinate is almostfailedDistributed system architecture is hard problemraces, deadlocks, inconsistency, reliabilityZK provides tools for create correct distributedapplications:

hard problems already solvedno bycile re-inventionsusing open-source well-tested service and clients

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 11 / 54

Page 12: Базы данных. ZooKeeper

Data Model

Data Model

Hierarhical tree of nodes called "znodes"znode contains a data and ACLznode may contain children znodes1MB of data for any znodeAtomic data access (reads and writes)No append operationznodes referenced by path - "/"delimited strings

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 12 / 54

Page 13: Базы данных. ZooKeeper

Data Model

Data Model

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 13 / 54

Page 14: Базы данных. ZooKeeper

Data Model Znode Types

Znode Types

Can be persistent or ephemeral - set at cration time.Persistent znodes

not tied to a client sessiondeleted explicitly by a client (any according to ACL)may have children

Ephemeral znodesdeleted by ZK when creating client session endsalways leaf nodes - may not have subnodes (no evenephemeral)tied to a client session, visible to all clients (according toACL)ideal for wacth distributed resources availability

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 14 / 54

Page 15: Базы данных. ZooKeeper

Data Model Sequential Znodes

Sequential Znodes

sequential number by ZK is a part of znode namesequential flag sets at cration timenumbers are monotonically increasingsequence is maintained by parent znodeused to impose a global ordering on eventswe will learn how to build distributed lock on top ofthem

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 15 / 54

Page 16: Базы данных. ZooKeeper

Data Model Watches

Watches

Allow clients to get notifications when a znode changes

set by operations on ZKtriggered by other operationstriggered once - for multiple notification clientreregisters the watchused for quick reactions of different changes

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 16 / 54

Page 17: Базы данных. ZooKeeper

API Operations

Operations

Operation Descriptioncreate Creates a znode (the parent znode must already exist)delete Deletes a znode (the znode must not have any children)exists Tests whether a znode exists and retrieves its metadatagetACL, setACL Gets/sets the ACL for a znodegetChildren Gets a list of the children of a znodegetData, setData Gets/sets the data associated with a znodesync Synchronizes a client’s view of a znode with ZooKeeper

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 17 / 54

Page 18: Базы данных. ZooKeeper

API Guarantees

GuaranteesSequential Consistency - Updates from a clientwill be applied in the order that they were sentAtomicity - Updates either succeed or fail. Nopartial results.Single System Image - A client will see the sameview of the service regardless of the server that itconnects toReliability - Once an update has been applied, itwill persist from that time forward until a clientoverwrites the updateTimeliness - The clients view of the system isguaranteed to be up-to-date within a certain timeboundЩитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 18 / 54

Page 19: Базы данных. ZooKeeper

API Guarantees

Updates

Update operations (delete, setData) are conditional:

version number of znode under update is specifiedperforms only if version numbers are equalso updates are non-blocking

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 19 / 54

Page 20: Базы данных. ZooKeeper

API Clients

Clients

Core language bindings:Java (org.apache.zookeeper:zookeeper:3.4.5)C (zookeeper_st, zookeeper_mt libs)

Contrib bindings:PerlPythonREST clients

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 20 / 54

Page 21: Базы данных. ZooKeeper

API Clients

Async & sync

Each binding provides async and sync APIs.

both have same funtionalityasync is preferred for event-driven programmingmodelasync API can offer better throughput

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 21 / 54

Page 22: Базы данных. ZooKeeper

API Watch triggers

Watch triggers

exists, getChildren, getData may have watches set onthem

triggered by write operations: create, delete, setDatadon’t triggered by ACL operationswatch event generated includes path of modifiedznode

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 22 / 54

Page 23: Базы данных. ZooKeeper

API Watch triggers

Operations and triggers

create znode create child delete znode delete child set dataexists NodeCreated NodeDeleted NodeData

ChangedgetData NodeDeleted NodeData

ChagnedgetChildren NodeChildren

ChagnedNodeDeleted NodeChildren

Chagned

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 23 / 54

Page 24: Базы данных. ZooKeeper

API Watch triggers

Handling events

NodeCreated & NodeDeletedcontains modified znode pathNodeChildrenChangeneed to call getChildren for retrieve fresh list ofchildrenNodeDataChangedneed to call getData for retrieve fresh znode data

Warning!State of the znodes may have changed between receivingthe watch event and performing the read operation.

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 24 / 54

Page 25: Базы данных. ZooKeeper

API ACL

ACL

Available authentication types:

digestby username and passwordsaslKerberos 4 usedipIP address used

4http://en.wikipedia.org/wiki/Kerberos_(protocol)Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 25 / 54

Page 26: Базы данных. ZooKeeper

Usages Out of Box usages

Out of Box usages

Provided directly by API:Name ServiceConfiguration management

Provided by create, setData, getData methodsBenefits:

new nodes only need to know how to connect toZooKeeper and can then download all otherconfiguration information and determineapplication can subscribe to changes in theconfiguration and apply them quickly

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 26 / 54

Page 27: Базы данных. ZooKeeper

Usages Group Membership

Group Membership

group is represented by a node (persistent)members of the group are ephemeral nodes under thegroup nodeclients call getChildren() on group node with watchsetfailed member nodes will be removed automaticallyby ZKNodeChildrenChanged notification send to clients

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 27 / 54

Page 28: Базы данных. ZooKeeper

Usages Lock

LockObtain the lock:

define lock node - create(’_locknode_’)call create(’_locknode_/lock-’, sequential = true, ephemeral =true), remember created path as ’path-A’call getChildren(’_locknode_’) without setting the watchif ’path-A’ has lowest seq number suffix from all children clientobtains the lock and exits(if ’path-A’ has not lowest seq number), let ’path-B’ has next lowestseq number client calls exists() with the watch set on the ’path-B’if exists() returns false, go to 2 else wait for a notification for’path-B’ before going to step 2.

Release lock:Remove node created at step 1 (i.e. ’path-A’)

Note: node removal will only cause one client to wake up (eachnode is watched by exactly one client)no polling or timeouts

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 28 / 54

Page 29: Базы данных. ZooKeeper

Usages ReadWrite Lock

ReadWrite LockRead lock acquire:

read-A = create("_locknode_/read- sequential=true,ephemeral=true)getChildren("_locknode_ watch = false)are there children with path starting with "write-"and having a lowersequence number than ’read-A’ ?no: acquire the read lock and exityes: let ’write-B’ has next lowest seq number and path starts with’write-’call exists(watch=true) on ’write-B’if exists() returns false, go to 2 else wait for a notification for’write-B’ before going to step 2

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 29 / 54

Page 30: Базы данных. ZooKeeper

Usages ReadWrite Lock

ReadWrite LockWrite lock acquire:

"write-A- create("_locknode_/write- sequential=true,ephemeral=true)getChildren("_locknode_ watch =false)are there children with a lower sequence number than the "write-A"?no: acquire the write lock and exityes: let ’path-B’ has next lowest seq numbercall exists(watch=true) on ’path-B’if exists() returns false, go to step 2 else wait for a notification for’path-B’

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 30 / 54

Page 31: Базы данных. ZooKeeper

Usages Leader Election

Leader ElectionWork path - ’_election_’Leader volunteer behavior:

z = create("_election_/n_ sequential=true, ephemeral=true), z =n_i (i - seq number assigned for created node)C = getChildren("_election_")if i is the lowest seq suffix in C then current volunteer becomes aleaderelse watch for changes on "_election_/n_j where j is the largestsequence number and j < i

When receive a notification of znode deletion:C = getChildren("_election_")if i is the lowest seq suffix in C then current volunteer becomes aleaderelse watch for changes on "_election_/n_j where j is the largestsequence number and j < i

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 31 / 54

Page 32: Базы данных. ZooKeeper

Usages Apache Curator

Apache Curator

"Curator is a set of Java libraries that make using ApacheZooKeeper much easier"5

initially created by Netflix6

moved too Apache ecosystem

5http://curator.apache.org6http://en.wikipedia.org/wiki/Netflix

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 32 / 54

Page 33: Базы данных. ZooKeeper

Usages Apache Curator

Apache CuratorConsists from:

Framework - high-level API that simplifies using ZooKeeperhandles the complexity of managing connections to the ZooKeepercluster and retrying operationsRecipes - Implementations of some of the common ZooKeeper"recipes"Utilities - Various utilities that are useful when using ZooKeeperClient - A replacement for ZooKeeper class that takes care of somelow-level housekeepingErrors - How Curator deals with errors, connection issues,recoverable exceptions, etc.Extensions - Other recipes (ServiceDiscovery)

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 33 / 54

Page 34: Базы данных. ZooKeeper

Architecture Physical View

Physical View

leader is elected at startup and after leader failuresclients connect to exactly one server to submit requestsevery server services clients:

read requestswrite requests are forwarded to leader (responses are sent when a majorityof followers have persisted the change)

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 34 / 54

Page 35: Базы данных. ZooKeeper

Architecture Logical View

Logical View

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 35 / 54

Page 36: Базы данных. ZooKeeper

Architecture Logical View

Request Processor

prepares request for executionwhen write request received RP calculates what thestate of the system will ater its applying andtransforms it into a transactiontransactions are idempotemt (unlike client request)future state must be calculated: there may beoutstanding transactions that have not yet beenapplied to the database

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 36 / 54

Page 37: Базы данных. ZooKeeper

Architecture Logical View

Atomic Broadcast

Atomic broadcast (total order broadcast) - a broadcastmessaging protocol that ensures that messages arereceived reliably and in the same order by all participants.

Special protocol implementation used - Zab

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 37 / 54

Page 38: Базы данных. ZooKeeper

Architecture Logical View

Replicated Database

in-memory database containing the entire data treeeach replica has its own copyfor durability:

log updates to disk (replay log (a write-ahead) ofcommitted operations)force writes to be on the disk before they are applied tothe in-memory databasegenerate periodic snapshots of the in-memory database

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 38 / 54

Page 39: Базы данных. ZooKeeper

Architecture Broadcast algorithms

Broadcast algorithmsA broadcast algorithm transmits a message from one process (primary) toall other processes in a network or in a broadcast domain, including theprimary. Atomic broadcast satisfies:

ValidityIf a correct process broadcasts a message, then all correct processeswill eventually deliver itUniform AgreementIf a process delivers a message, then all correct processes eventuallydeliver that messageUniform IntegrityFor any message m, every process delivers m at most once, and onlyif m was previously broadcast by the sender of mUniform Total OrderIf processes p and q both deliver messages m and m’, then p deliversm before m’ if and only if q delivers m before m’

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 39 / 54

Page 40: Базы данных. ZooKeeper

Architecture Paxos

Paxos

Traditional protocol for solving distributed consensus 7.

7http://en.wikipedia.org/wiki/Consensus_(computer_science)Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 40 / 54

Page 41: Базы данных. ZooKeeper

Architecture Paxos

RolesClientissues a request to the distributed system, and waits for a responseAcceptor (Voter)Act as the fault-tolerant "memory"of the protocol. Acceptors arecollected into groups called Quorums.ProposerAdvocates a client request, attempting to convince the Acceptors toagree on itLearnerExecute the request and send a response to the client)Leader A distinguished Proposer (called the leader) to makeprogress.

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 41 / 54

Page 42: Базы данных. ZooKeeper

Architecture Paxos

Basic Paxos

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 42 / 54

Page 43: Базы данных. ZooKeeper

Architecture Why not Paxos

Why not Paxos

was not originally intended for atomic broadcastingbut consensus protocols can be used for atomicbroadcastingdoes not enable multiple outstanding transactionsdoes not require FIFO channels for communication,so it tolerates message loss and reorderingdoes not support order dependency of proposedvalues.(Problem could be solved by batching multipletransactions into a single proposal and allowing atmost one proposal at a time, but this hasperformance drawbacks)

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 43 / 54

Page 44: Базы данных. ZooKeeper

Architecture ZooKeeper Atomic Broadcast (ZAB)

Crash Recovery

Peers communicate by message passing. They maycrash and recover indefinitely many times.Quorum is subset of peers that contains more thanhalf onesAny two quorums have a non-empty intersectionBidirectional channel for every pair of peers mustsatisfy:

integrityprefix

ZAB uses TCP (therefore FIFO) channels forcommunication

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 44 / 54

Page 45: Базы данных. ZooKeeper

Architecture ZooKeeper Atomic Broadcast (ZAB)

Transactions

State changes that the primary propagate ("broadcasts")to the ensemble, and are represented by a pair (v, z):

v - new statez - transactional identifier - called zxid

zxid z of a transaction is a pair (e, c), where:e - the epoch number of the primary that generatedthe transactionc - is an integer acting as a counter

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 45 / 54

Page 46: Базы данных. ZooKeeper

Architecture ZooKeeper Atomic Broadcast (ZAB)

Broadcast protocol

Peer states:followingleadingelection

Whether a peer is a follower or a leader, it executes threeZab phases:

discoverysynchronizationbroadcast

Previous to Phase 1, a peer is in state election, when itexecutes a leader election algorithm.

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 46 / 54

Page 47: Базы данных. ZooKeeper

Architecture ZooKeeper Atomic Broadcast (ZAB)

Phase 0: Leader election

peers have state electionno specific leader election protocol needs to beemployedif peer p voted for peer p’, then p’ is called theprospective leader for p

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 47 / 54

Page 48: Базы данных. ZooKeeper

Architecture ZooKeeper Atomic Broadcast (ZAB)

Phase 1: Discovery

followers communicate with their prospective leaderthat leader gathers information about the mostrecent transactions that its followers accepteddiscover the most updated sequence of acceptedtransactions among a quorum, and to establish a newepoch so that previous leaders cannot commit newproposals

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 48 / 54

Page 49: Базы данных. ZooKeeper

Architecture ZooKeeper Atomic Broadcast (ZAB)

Phase 2: Synchronization

Recovery part of the protocolsynchronize the replicas in the ensemble using theleader’s updated history from the previous phaseleader communicates with the followers, proposingtransactions from its history. Followers acknowledgethe proposals if their own history is behind theleader’s history.when the leader sees acknowledgements from aquorum, it issues a commit message to them.At that point, the leader is said to be established andnot anymore prospective.

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 49 / 54

Page 50: Базы данных. ZooKeeper

Architecture ZooKeeper Atomic Broadcast (ZAB)

Phase 3: Broadcast

peers stay in this phase until any crashes occurperform broadcast transaction for client writerequests

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 50 / 54

Page 51: Базы данных. ZooKeeper

Architecture ZooKeeper Atomic Broadcast (ZAB)

Implemented protocol

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 51 / 54

Page 52: Базы данных. ZooKeeper

Throughput

Throughput

running on servers with dual 2Ghz Xeon and two SATA 15K RPM drivesone drive was used as a dedicated log device"servers"indicate the size of the ensemble

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 52 / 54

Page 53: Базы данных. ZooKeeper

Conclusion Off Screen

Off Screen

More ZooKeeper usagesZooKeeper cluster tuningError handling while native ZooKeeper client usageDeep study of Paxos and ZAB

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 53 / 54

Page 54: Базы данных. ZooKeeper

Conclusion Questions?

Questions?

Thanks! Questions?

Щитинин Д.А. (CompSciCenter) Apache ZooKeeper 11 ноября 2013 г. 54 / 54