The State of HBase Replication
Jean-Daniel Cryans
May 5th, 2014
©2014 Cloudera, Inc. All rights reserved.
About me
• Software Engineer at Cloudera, Storage team
• Apache HBase committer since 2008, PMC member
Motivation for HBase Replication
• Even though HBase is:
  • distributed;
  • fault-tolerant;
  • highly available; and
  • almost magic.
The Current State
• It’s production-ready.
• It’s used to replicate data between thousands of nodes across continents.
• It’s used for Disaster Recovery, geo-distributed serving, and more.
Agenda
• Four Years of Replication
• Use Cases in Production
• Roadmap
Design
• Clusters are distinct
• Pull vs. push
• Sync vs. async
Clusters are Distinct
• HBase doesn’t span DCs, HDFSs
• .META. operations aren’t replicated
• Regions can be different
• Security has to be configured for each cluster
[Diagram: a master cluster with 20 RS and a slave cluster with 15 RS]
Push instead of Pull
• MySQL Replication uses Pull
[Diagram: Cluster A holds the MySQL master and Cluster B the MySQL slave; the slave gets the binlog and applies it locally]
• HBase Replication uses Push
[Diagram: an RS in Cluster A replicates entries to an RS in Cluster B, which applies them to its cluster]
Async instead of Sync
• Synchronous Replication
[Diagram: a Put reaches an RS in Cluster A (HLog, MemStore) and is also sent as a Put to an RS in Cluster B (HLog, MemStore); acks flow back from both. Steps are numbered 1 to 8.]
• Asynchronous Replication
[Diagram: in Cluster A, a Put is written to the RS's HLog and MemStore and acked (steps 1 to 4). Separately, an HLog tailing thread sends a Put to an RS in Cluster B, which writes to its own HLog and MemStore and acks (steps 1 to 5).]
First Release - 0.90.0
• Simple master-slave (only one)
• Disabled by default
• Uses ZK as a metadata store
Original Implementation
[Diagram: on a region server of the master cluster, a ReplicationSource (with a ZooKeeperWatcher) calls replicateLogEntries() on a region server of the slave cluster, where a ReplicationSink applies the entries as Puts and Deletes through HTable]
First Lesson Learned
• HDFS doesn’t support tailing files being written to. It requires:
  • open()
  • seek() // go where we stopped last time
  • while (not EOF || enoughData)
    • read()
  • close()
  • repeat
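The open/seek/read/close cycle listed above can be sketched as follows. This is only an illustration against a local file, not HDFS and not HBase's actual reader; the function name `tail_once`, the `max_bytes` limit and the file name `hlog1` are assumptions made for the example.

```python
import os
import tempfile


def tail_once(path, offset, max_bytes=64 * 1024):
    """Reopen the file, seek to where we stopped, read what is available.

    Returns (data, new_offset). The file is closed before returning, so the
    next call starts over with open() and seek().
    """
    chunks = []
    read_so_far = 0
    with open(path, "rb") as f:            # open()
        f.seek(offset)                     # seek(): go where we stopped last time
        while read_so_far < max_bytes:     # stop once we have enough data...
            chunk = f.read(min(4096, max_bytes - read_so_far))
            if not chunk:                  # ...or once we reach EOF
                break
            chunks.append(chunk)
            read_so_far += len(chunk)
    # close() happens when the with-block exits
    data = b"".join(chunks)
    return data, offset + len(data)


if __name__ == "__main__":
    # Simulate a writer appending to a log while a reader tails it.
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "hlog1")   # hypothetical file name
        with open(log, "ab") as writer:
            writer.write(b"edit-1\n")
        data, pos = tail_once(log, 0)
        print(data, pos)
        with open(log, "ab") as writer:
            writer.write(b"edit-2\n")
        data, pos = tail_once(log, pos)    # repeat
        print(data, pos)
```

The caller has to remember the offset between passes, which is the bookkeeping cost of not being able to keep one reader open on a file that is still being written.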
Second Lesson Learned
• Single threaded, non-batched ZK is slow
• ZK didn’t have an atomic move operation
  • Doubles # ops needed, race conditions
[Diagram: znode layout while RS2 takes over a queue from RS1]
/hbase
  /replication
    /RS1
      /1
        /hlog1
        /hlog2
        ...
    /RS2
      /1-RS1
        /hlog1
1. create new hlog2
2. delete old hlog2
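A small sketch of why a move built from "create new, then delete old" costs two operations per znode and leaves a window for inconsistency. `FakeZk` is an in-memory stand-in invented for this example, not the ZooKeeper client API, and the `crash_after` parameter only simulates a failure between the two steps.

```python
class FakeZk:
    """Tiny in-memory stand-in for a znode tree (not the ZooKeeper API)."""

    def __init__(self):
        self.nodes = set()
        self.ops = 0  # number of create/delete calls issued

    def create(self, path):
        self.ops += 1
        self.nodes.add(path)

    def delete(self, path):
        self.ops += 1
        self.nodes.discard(path)


def move_queue(zk, src_queue, dst_queue, hlogs, crash_after=None):
    """Move hlog znodes one by one: create under dst, then delete under src.

    Returns True if the move completed, False if the simulated crash hit.
    """
    for i, hlog in enumerate(hlogs):
        zk.create(f"{dst_queue}/{hlog}")   # 1. create new
        if crash_after == i:
            return False                   # failure between the two steps
        zk.delete(f"{src_queue}/{hlog}")   # 2. delete old
    return True


if __name__ == "__main__":
    src = "/hbase/replication/RS1/1"
    dst = "/hbase/replication/RS2/1-RS1"
    zk = FakeZk()
    for name in ("hlog1", "hlog2"):
        zk.create(f"{src}/{name}")
    move_queue(zk, src, dst, ["hlog1", "hlog2"], crash_after=1)
    # hlog2 now exists in both queues until someone cleans up.
    print(sorted(zk.nodes))
```

With an atomic move, or a batched transaction, the intermediate state where the same hlog is listed under both region servers would not be observable.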
Second Release - 0.92.0
• Cyclic replication
• Multi-slave (scope LOCAL or GLOBAL)
• Enable / disable peer
• Special configurations
Cyclic Replication
[Diagram: Cluster1, Cluster2 and Cluster3 replicate in a cycle. "Put Row X" arrives at Cluster1 and is replicated to Cluster2, then to Cluster3. At Cluster3: Row X is from 1, don’t replicate!]
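The loop-breaking rule in the diagram can be sketched as below, assuming each edit carries the id of the cluster where it originated and a cluster never ships an edit to the peer it came from. The `Cluster` class and the synchronous, recursive delivery are simplifications made for this example; real replication ships edits asynchronously.

```python
class Cluster:
    """Toy cluster: stores rows and pushes edits to its peers."""

    def __init__(self, cluster_id):
        self.cluster_id = cluster_id
        self.peers = []
        self.rows = {}
        self.shipped = 0  # number of edits this cluster pushed to peers

    def put(self, row, value, origin=None):
        # A client write originates here; a replicated write keeps its origin.
        if origin is None:
            origin = self.cluster_id
        self.rows[row] = value
        for peer in self.peers:
            if peer.cluster_id == origin:
                continue                   # the row is from that peer: don't replicate
            self.shipped += 1
            peer.put(row, value, origin)


if __name__ == "__main__":
    c1, c2, c3 = Cluster(1), Cluster(2), Cluster(3)
    c1.peers, c2.peers, c3.peers = [c2], [c3], [c1]   # 1 -> 2 -> 3 -> 1
    c1.put("X", "v1")
    print(c1.rows, c2.rows, c3.rows)
    print(c1.shipped, c2.shipped, c3.shipped)
```

Without the origin check, the edit would travel around the ring indefinitely.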
Multi-Slave
[Diagram: "Put Row X" arrives at Cluster1 and is replicated to both Cluster2 and Cluster3]
Enable / Disable Peers
> disable_peer '2'
[Diagram: the RS in Cluster 1 writes to its HLog while a tailing thread ships edits to the RS in Cluster 2. Before shipping, the thread asks "Is the peer enabled?". While the peer is disabled, HLogs keep accumulating on Cluster 1.]
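A rough model of the behavior in the diagram: while the peer is disabled, the tailing loop ships nothing and rolled HLogs pile up in the queue; once the peer is enabled again, the backlog drains. The class and method names are hypothetical, and shipping whole logs at once is a simplification.

```python
from collections import deque


class ReplicationSourceSketch:
    """Toy model of a source that queues HLogs for one peer."""

    def __init__(self):
        self.peer_enabled = True
        self.queue = deque()   # HLogs waiting to be replicated
        self.shipped = []      # HLogs already sent to the peer

    def roll_log(self, hlog_name):
        """The region server rolled its log; a new HLog joins the queue."""
        self.queue.append(hlog_name)

    def run_once(self):
        """One pass of the tailing loop; returns the number of HLogs shipped."""
        if not self.peer_enabled:          # "Is the peer enabled?"
            return 0
        count = 0
        while self.queue:
            self.shipped.append(self.queue.popleft())
            count += 1
        return count


if __name__ == "__main__":
    source = ReplicationSourceSketch()
    source.roll_log("hlog1")
    source.run_once()
    source.peer_enabled = False            # like disable_peer '2'
    for name in ("hlog2", "hlog3", "hlog4"):
        source.roll_log(name)
        source.run_once()
    print(len(source.queue))               # backlog held on the source cluster
    source.peer_enabled = True             # like enable_peer '2'
    source.run_once()
    print(source.shipped)
```

The practical consequence is disk usage: the longer a peer stays disabled, the more HLogs the source cluster has to retain.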
Special Configurations
• KEEP_DELETED_CELLS
  • Must be used on slaves with replication when deleting data.
• MIN_VERSIONS
  • With TTL, makes it easy to configure a slave that contains only the last few days of data.
Third Lesson Learned
• It’s easy to DDOS yourself.
• Replication was using the normal handlers...
• ... and using them to write back!
[Diagram: region server handlers: Handler1: Put, Handler2: Delete, Handler3: Replicate, Handler4: Get, Handler5: Put. The replicated Put goes in the queue.]
Fourth Lesson Learned
• Instinctively, what would something called stop_replication do?
• Good intentions, bad outcomes, HBASE-8861
[Diagram: start/stop_replication, crossed out]
Third Release - 0.96.0 / 0.98.0
• Replication enabled by default!
• Completely refactored for readability/extensibility (Chris Trezzo)
• ReplicationSyncUp tool (HBASE-9047)
• Throttling (HBASE-9501)
• Finer grained replication controls (HBASE-8751)
ReplicationSyncUp Tool
• Works on an offline cluster
• Can finish replicating the queues in ZK
• Useful to finish draining a master cluster
[Diagram: two clusters, each with HBase, HDFS and ZooKeeper, and the ReplicationSyncUp tool]
Finer Grained Replication Controls
> set_peer_tableCFs '2', "table1; table2:cf1,cf2; table3:cfA,cfB"
• Meaning: enable replication to peer #2 for:
  • All of table1
  • cf1 and cf2 from table2
  • cfA and cfB from table3
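To make the table/column-family string concrete, here is a small parser for the format shown in the command above. It is an illustration of the syntax only, not HBase's own parsing code; `parse_table_cfs` is a hypothetical name, and using `None` to mean "all column families" is a choice made for this sketch.

```python
def parse_table_cfs(spec):
    """Parse 'table1; table2:cf1,cf2' into {table: [cfs]}.

    A table listed without column families maps to None, meaning
    "replicate all of this table".
    """
    result = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        table, sep, cfs = entry.partition(":")
        table = table.strip()
        if not sep:
            result[table] = None
        else:
            result[table] = [cf.strip() for cf in cfs.split(",") if cf.strip()]
    return result


if __name__ == "__main__":
    spec = "table1; table2:cf1,cf2; table3:cfA,cfB"
    for table, cfs in parse_table_cfs(spec).items():
        print(table, "->", "all column families" if cfs is None else cfs)
```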
Use Cases in Production
Flurry
• Two data centers, coast to coast
• Three clusters, in master-master pairs
  • 1200 nodes
  • 800 nodes
  • 30 nodes
• Replication traffic: 2Gbps
• Latency between DCs: 85ms
Opower
• Two clusters, same data center
  • Master: tens of nodes
  • Slave: tens of nodes
• Replication traffic: 1GB/day
• Bulk load replication traffic: 180GB/day
• Recent use case
Lily HBase Indexer
• Collaboration between NGData & Cloudera.
  • NGData are the creators of the Lily data management platform.
• Lily HBase Indexer
  • Service which acts as an HBase replication listener.
  • Custom sink writes to SolrCloud.
  • Integrates Cloudera Morphlines library for ETL of rows.
Roadmap
Stop Relying on Permanent Znodes
• Current rule is to never rely on znodes to survive cluster restarts, upgrades, etc.
• State data should be kept in an HBase table.
• Notification done through a new mechanism
• See: https://issues.apache.org/jira/browse/HBASE-10295
Define a Replication Interface
• Replication is somewhat extendable but it lacks stable interfaces.
• The HBase Indexer is such an extension and it required surgery every time a committer sneezed.
• See: https://issues.apache.org/jira/browse/HBASE-10504
Distributed Counters
• Incrementing consists of:
  1. Taking a lock;
  2. Get’ing the current value; and
  3. Put’ing the newly incremented value.
• This breaks in Master-Master because the Puts are overwriting each other.
• See https://issues.apache.org/jira/browse/HBASE-2804
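The overwrite problem can be reproduced with a toy model, assuming a cell is a (timestamp, value) pair and that applying a Put keeps whichever pair has the newer timestamp. The function names and the timestamps are made up for the illustration; locking is omitted because each cluster only locks locally.

```python
def apply_put(cell, put):
    """Keep whichever (timestamp, value) pair is newer, like a versioned cell."""
    return put if put[0] >= cell[0] else cell


def increment(cell, ts, amount=1):
    """Get the current value and build a Put holding the new absolute value."""
    _, value = cell
    return (ts, value + amount)


def simulate_master_master(start=10):
    """Two clusters each increment the same counter once, then replicate."""
    a = b = (0, start)

    put_a = increment(a, ts=1)   # client increments on cluster A
    a = apply_put(a, put_a)
    put_b = increment(b, ts=2)   # another client increments on cluster B
    b = apply_put(b, put_b)

    # Each cluster ships its Put (an absolute value) to the other.
    b = apply_put(b, put_a)
    a = apply_put(a, put_b)
    return a[1], b[1]


if __name__ == "__main__":
    a_value, b_value = simulate_master_master(start=10)
    # Two increments happened, yet neither cluster ends at start + 2.
    print(a_value, b_value)
```

Because the replicated edit carries the resulting value rather than the amount added, one of the two increments is lost on both sides.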
More Tooling
• Replication management console, one shell to rule all the clusters!
• Replication bootstrapping tool.
• Tool that can move queues between region servers.
• Tool that can throttle replication on a live cluster.
Questions?
• Or ping me async:
  • @jdcryans
  • [email protected]
  • jdcryans on #hbase irc.freenode.net