Untangling Cluster Management with Helix
Uploaded by: amy-w-tang
Category: Technology
Helix team @ LinkedIn
Kishore Gopalakrishna
http://www.linkedin.com/in/kgopalak
@kishoreg1980
Outline
• What is Helix
• Use case 1: distributed data store
• Architecture
• Use case 2: consumer group
• Helix at LinkedIn
• Q&A
What is Helix
Cluster management framework for distributed systems using declarative state model
Distributed system examples
Motivation
A system starts out simple…
…but gets complex in the real world
…as you address real requirements
[Diagram: an Application with a client library, System, Call Routing, Replica 1, Replica 2, …, growing to cover Scale, Failover, Bootstrapping, …]
Motivation
These are cluster management problems
Helix solves them once…
…so you can focus on your system
[Diagram: Scale, Failover, Bootstrapping]
Use-Case: Distributed Data Store
• Distributed
• Partitioned
• Replicated
[Diagram, built up over three slides: Node 1, Node 2, and Node 3; first a single partition P.1, then partitions P.1 through P.12 spread across the nodes, then each partition replicated on more than one node.]
Partition Layout
• Highly available
• Master accepts writes
• Balanced distribution
[Diagram: replicas of P.1 through P.12 on Node 1, Node 2, and Node 3, each marked Master or Slave.]
Failover
[Diagram: Master and Slave replicas of P.1 through P.12 on Node 1, Node 2, and Node 3 during failover; P.1, P.2, P.3, and P.4 are highlighted.]
Add Capacity
[Diagram: Node 4 joins Node 1, Node 2, and Node 3; Master and Slave replicas are redistributed so that Node 4 holds a share of the partitions, including P.1, P.5, P.8, P.9, P.10, and P.12.]
Use-case requirements
• Partition constraints
  • 1 master per partition
  • Balance partitions across cluster
  • No single-point-of-failure: replicas on different nodes
• Handle failures: transfer mastership
• Elasticity
  • Distribute workload across added nodes
  • Minimize partition movement
• Meet SLAs
  • Throttle concurrent data movement
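The partition constraints above can be expressed as simple checks over a partition layout. The following is a minimal sketch and not Helix code: the layout shape (partition mapped to a node-to-state dictionary), the function name `check_placement`, and the "at most one master of difference between nodes" balance rule are all assumptions made for illustration.

```python
# Hypothetical layout: partition -> {node: state}. Because replicas are keyed
# by node, two replicas of one partition can never share a node in this model.
def check_placement(layout, nodes):
    """Return a list of violations of the partition constraints."""
    problems = []
    masters_per_node = {node: 0 for node in nodes}
    for partition in sorted(layout):
        replicas = layout[partition]
        masters = [n for n, s in replicas.items() if s == "MASTER"]
        # Constraint: 1 master per partition.
        if len(masters) != 1:
            problems.append(f"{partition}: expected 1 master, found {len(masters)}")
        # Constraint: no single point of failure.
        if len(replicas) < 2:
            problems.append(f"{partition}: only {len(replicas)} replica, single point of failure")
        for node in masters:
            masters_per_node[node] = masters_per_node.get(node, 0) + 1
    # Assumed balance rule: master counts per node differ by at most one.
    counts = masters_per_node.values()
    if counts and max(counts) - min(counts) > 1:
        problems.append(f"masters unbalanced: {masters_per_node}")
    return problems


layout = {
    "P.1": {"node1": "MASTER", "node2": "SLAVE"},
    "P.2": {"node2": "MASTER", "node3": "SLAVE"},
    "P.3": {"node3": "MASTER", "node1": "SLAVE"},
}
problems = check_placement(layout, ["node1", "node2", "node3"])
```

Elasticity and throttling are not modeled here; they concern how a layout changes over time rather than whether a single layout is valid.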
Declarative Problem Statement
State machine
– States: offline (O), slave (S), master (M)
– Transitions: O-S, S-O, S-M, M-S
Constraints
– States: COUNT=1 (master), COUNT=2 (slave)
– Transitions: t1 ≤ 5
Objective
– Partition placement: minimize(max_{n_j ∈ N} M(n_j)), minimize(max_{n_j ∈ N} S(n_j))
[Diagram: state machine over O, S, and M with transitions t1, t2, t3, and t4.]
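One way to read this slide is that the whole problem fits in a small data structure plus an objective function. The sketch below is illustrative only and does not use Helix's API. It assumes that COUNT=1 applies to masters and COUNT=2 to slaves, that the counts are exact targets, and that M(n_j) and S(n_j) are the numbers of master and slave replicas hosted on node n_j. The transition limit (t1 ≤ 5) is not modeled.

```python
# Declarative description of the master/slave state model (assumed shape).
STATE_MODEL = {
    "states": ["OFFLINE", "SLAVE", "MASTER"],
    "transitions": [
        ("OFFLINE", "SLAVE"),   # O-S
        ("SLAVE", "OFFLINE"),   # S-O
        ("SLAVE", "MASTER"),    # S-M
        ("MASTER", "SLAVE"),    # M-S
    ],
    # Assumption: COUNT=1 is the master count, COUNT=2 the slave count.
    "state_counts": {"MASTER": 1, "SLAVE": 2},
}


def count_violations(replicas, model=STATE_MODEL):
    """For one partition's {node: state}, return states whose count is off."""
    violations = {}
    for state, wanted in model["state_counts"].items():
        actual = sum(1 for s in replicas.values() if s == state)
        if actual != wanted:
            violations[state] = (actual, wanted)
    return violations


def objective(layout, state):
    """max over nodes n_j of the number of replicas in `state` on n_j."""
    per_node = {}
    for replicas in layout.values():
        for node, s in replicas.items():
            per_node.setdefault(node, 0)
            if s == state:
                per_node[node] += 1
    return max(per_node.values()) if per_node else 0


layout = {
    "P.1": {"node1": "MASTER", "node2": "SLAVE", "node3": "SLAVE"},
    "P.2": {"node2": "MASTER", "node3": "SLAVE", "node1": "SLAVE"},
    "P.3": {"node3": "MASTER", "node1": "SLAVE", "node2": "SLAVE"},
}
max_masters = objective(layout, "MASTER")
max_slaves = objective(layout, "SLAVE")
```

A placement algorithm would search for the layout that satisfies the counts while keeping both objective values as small as possible; this sketch only evaluates a given layout.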
Generalizing cluster management
• State machine
• Constraints
• Objective
Helix Based System Roles
[Diagram: Node 1, Node 2, and Node 3 hosting replicas of P.1 through P.12, with COMMAND and RESPONSE messages exchanged.]
Controller Execution Flow
Example transitions: P1: O-S, P1: S-M
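The P1 example (offline to slave, then slave to master) suggests that the controller compares current state with target state and emits only legal transitions. The following is a rough sketch under that assumption, not the Helix controller itself. The names `transition_path` and `plan_commands` and the `(partition, node)` key format are hypothetical.

```python
from collections import deque

# Legal transitions of the master/slave model: O-S, S-O, S-M, M-S.
TRANSITIONS = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["OFFLINE", "MASTER"],
    "MASTER": ["SLAVE"],
}


def transition_path(current, target):
    """Shortest list of (from, to) steps, found by BFS over legal transitions."""
    if current == target:
        return []
    queue = deque([(current, [])])
    seen = {current}
    while queue:
        state, path = queue.popleft()
        for nxt in TRANSITIONS[state]:
            if nxt in seen:
                continue
            new_path = path + [(state, nxt)]
            if nxt == target:
                return new_path
            seen.add(nxt)
            queue.append((nxt, new_path))
    raise ValueError(f"no path from {current} to {target}")


def plan_commands(current_states, ideal_states):
    """Both arguments map (partition, node) -> state; unknown replicas are OFFLINE."""
    commands = []
    for partition, node in sorted(ideal_states):
        current = current_states.get((partition, node), "OFFLINE")
        for frm, to in transition_path(current, ideal_states[(partition, node)]):
            commands.append((partition, node, frm, to))
    return commands


commands = plan_commands({}, {("P1", "node1"): "MASTER"})
```

This sketch returns the full plan at once. A controller that exchanges commands and responses with nodes would more plausibly send one step, wait for the response, and then send the next.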
Controller fault tolerance
Participant Plug-in code
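The code shown on this slide is not part of the transcript. As a stand-in, here is a hypothetical sketch of what participant plug-in code could look like: one callback per transition of the state model, invoked when a command arrives. The class name, method names, and log messages are invented for illustration and are not the Helix API.

```python
class PartitionStateModel:
    """Hypothetical per-partition handler run on a participant node."""

    def __init__(self, partition):
        self.partition = partition
        self.state = "OFFLINE"
        self.log = []

    # One callback per legal transition (O-S, S-M, M-S, S-O).
    def on_become_slave_from_offline(self):
        self.log.append(f"{self.partition}: start replicating")

    def on_become_master_from_slave(self):
        self.log.append(f"{self.partition}: accept writes")

    def on_become_slave_from_master(self):
        self.log.append(f"{self.partition}: stop accepting writes")

    def on_become_offline_from_slave(self):
        self.log.append(f"{self.partition}: stop replicating")

    def apply(self, frm, to):
        """Run the callback for a commanded transition, then record the new state."""
        if frm != self.state:
            raise ValueError(f"expected state {self.state}, command says {frm}")
        name = f"on_become_{to.lower()}_from_{frm.lower()}"
        handler = getattr(self, name, None)
        if handler is None:
            raise ValueError(f"illegal transition {frm} -> {to}")
        handler()
        self.state = to


model = PartitionStateModel("P1")
model.apply("OFFLINE", "SLAVE")
model.apply("SLAVE", "MASTER")
```

The participant only reacts to commands for its own partitions, which matches the "local knowledge, follows orders" role described under Benefits.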
Spectator Plug-in code
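The spectator code from this slide is likewise missing from the transcript. A spectator might, for example, keep a routing table up to date so that clients send writes to the current master. The sketch below assumes that role; `RoutingTable`, `on_external_view_change`, and the view format are hypothetical names, not the Helix API.

```python
class RoutingTable:
    """Hypothetical spectator-side view of which node holds which state."""

    def __init__(self):
        self._view = {}

    def on_external_view_change(self, view):
        """Replace the cached view; `view` maps partition -> {node: state}."""
        self._view = {p: dict(replicas) for p, replicas in view.items()}

    def instances(self, partition, state):
        """Nodes currently holding `partition` in `state`, sorted by name."""
        replicas = self._view.get(partition, {})
        return sorted(node for node, s in replicas.items() if s == state)


table = RoutingTable()
table.on_external_view_change({"P1": {"node1": "MASTER", "node2": "SLAVE"}})
write_targets = table.instances("P1", "MASTER")

# After a failover, a new view arrives and writes are routed to the new master.
table.on_external_view_change({"P1": {"node2": "MASTER"}})
write_targets_after_failover = table.instances("P1", "MASTER")
```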
Benefits
Cluster operations “just work”
– Bootstrapping
– Failover
– Add nodes
Global vs Local
– Helix Controller: global knowledge, makes cluster decisions
– Participant: local knowledge, follows orders
Consumer group
Consumer group: Scaling
Consumer group: Fault tolerance
Consumer group: State model
Helix usage at LinkedIn (Pictures)
Espresso – a timeline-consistent, distributed data store
Databus – a change data capture service
Search as a Service – a multi-tenant service for multiple search applications
More planned
Summary
• Generic framework
• Easy to use: declarative model
• Easy to operate
Helix: Future Roadmap
• Features
  • Span multiple data centers
  • Load balancing
• Announcement
  • Open source: https://github.com/linkedin/helix
  • Apache incubation
  • New contributors
Questions?