Untangling Cluster Management with Helix

Helix team @ LinkedIn: Kishore Gopalakrishna, http://www.linkedin.com/in/kgopalak, @kishoreg1980

description

This talk was given by Kishore Gopalakrishna (Staff Software Engineer @ LinkedIn) at the 3rd ACM Symposium on Cloud Computing (SOCC 2012).

Transcript of Untangling Cluster Management with Helix

Page 1: Untangling Cluster Management with Helix


Untangling Cluster Management with Helix


Page 2: Untangling Cluster Management with Helix

Outline

• What is Helix
• Use case 1: distributed data store
• Architecture
• Use case 2: consumer group
• Helix at LinkedIn
• Q&A

Page 3: Untangling Cluster Management with Helix

What is Helix

A cluster management framework for distributed systems that uses a declarative state model.

Page 4: Untangling Cluster Management with Helix

Distributed system examples


Page 5: Untangling Cluster Management with Helix

Motivation

A system starts out simple, but gets complex in the real world as you address real requirements.

[Figure labels: Application, client library, System, Call Routing, Replica 1, Replica 2, Scale, Failover, Bootstrapping]

Page 6: Untangling Cluster Management with Helix

Motivation

These are cluster management problems. Helix solves them once, so you can focus on your system.

[Figure labels: Scale, Failover, Bootstrapping]

Page 7: Untangling Cluster Management with Helix

Outline

• What is Helix
• Use case 1: distributed data store
• Architecture
• Use case 2: consumer group
• Helix at LinkedIn
• Q&A

Page 8: Untangling Cluster Management with Helix

Use-Case: Distributed Data Store

Distributed

[Figure: three nodes (Node 1, Node 2, Node 3) and one partition, P.1]

Page 9: Untangling Cluster Management with Helix

Use-Case: Distributed Data Store

Distributed, partitioned

[Figure: partitions P.1 through P.12 spread across Node 1, Node 2, and Node 3]

Page 10: Untangling Cluster Management with Helix

Use-Case: Distributed Data Store

Distributed, partitioned, replicated

[Figure: partitions P.1 through P.12 and their replicas spread across Node 1, Node 2, and Node 3]

Page 11: Untangling Cluster Management with Helix

Partition Layout

• Highly available
• Master accepts writes
• Balanced distribution

[Figure: replicas of P.1 through P.12 across Node 1, Node 2, and Node 3; legend: Master, Slave]

Page 12: Untangling Cluster Management with Helix

Failover

[Figure: replicas of P.1 through P.12 across Node 1, Node 2, and Node 3 during failover; legend: Master, Slave]
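Failover here means transferring mastership when a node is lost. The following is a minimal sketch of that idea, not Helix's implementation: the function name `fail_over`, the dictionary layout, and the rule "promote the slave on the node that currently holds the fewest masters" are all assumptions made for illustration.

```python
from collections import Counter


def fail_over(assignment, failed_node):
    """Drop failed_node and promote one slave for every partition left without a master.

    assignment maps partition -> {node: "MASTER" | "SLAVE"} (hypothetical layout).
    The input is not modified; a new mapping is returned.
    """
    # Copy the assignment without the failed node's replicas.
    result = {
        partition: {n: s for n, s in replicas.items() if n != failed_node}
        for partition, replicas in assignment.items()
    }
    # Count masters that survived, so promotions can be spread out.
    masters = Counter(
        n for replicas in result.values() for n, s in replicas.items() if s == "MASTER"
    )
    for partition in sorted(result):
        replicas = result[partition]
        if replicas and "MASTER" not in replicas.values():
            # Assumed tie-break: fewest masters first, then node name.
            promoted = min(replicas, key=lambda n: (masters[n], n))
            replicas[promoted] = "MASTER"
            masters[promoted] += 1
    return result


before = {
    "P.1": {"node1": "MASTER", "node2": "SLAVE", "node3": "SLAVE"},
    "P.2": {"node2": "MASTER", "node3": "SLAVE", "node1": "SLAVE"},
}
after = fail_over(before, "node1")
```

In this example node 1 fails, P.1 loses its master, and the slave on node 3 is promoted because node 2 already masters P.2.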

Page 13: Untangling Cluster Management with Helix

Add Capacity

[Figure: replicas of P.1 through P.12 across Node 1, Node 2, Node 3, and the added Node 4; legend: Master, Slave]
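Adding capacity should spread load onto the new node while moving as few partitions as possible. A simplified sketch under stated assumptions: it ignores replicas and states, the name `add_node` and the "take from the fullest node" rule are hypothetical, and partitions are plain integers.

```python
def add_node(placement, new_node):
    """Give new_node its fair share of partitions by taking from the fullest nodes.

    placement maps node -> set of partition ids (hypothetical layout).
    Returns (new_placement, moves), where each move is (partition, donor, new_node).
    """
    result = {node: set(parts) for node, parts in placement.items()}
    result[new_node] = set()
    total = sum(len(parts) for parts in result.values())
    target = total // len(result)  # fair share for the new node, rounded down
    moves = []
    while len(result[new_node]) < target:
        # Pick the currently fullest existing node; break ties by name.
        donor = max(
            (n for n in result if n != new_node),
            key=lambda n: (len(result[n]), n),
        )
        partition = max(result[donor])
        result[donor].remove(partition)
        result[new_node].add(partition)
        moves.append((partition, donor, new_node))
    return result, moves


placement = {
    "node1": {1, 2, 3, 4},
    "node2": {5, 6, 7, 8},
    "node3": {9, 10, 11, 12},
}
rebalanced, moves = add_node(placement, "node4")
```

Only the partitions that land on the new node move; every other partition stays where it was, which keeps data movement to the minimum needed for balance.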

Page 14: Untangling Cluster Management with Helix

Use-case requirements

• Partition constraints
  • 1 master per partition
  • Balance partitions across cluster
  • No single point of failure: replicas on different nodes
• Handle failures: transfer mastership
• Elasticity
  • Distribute workload across added nodes
  • Minimize partition movement
• Meet SLAs
  • Throttle concurrent data movement
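The partition constraints above can be written as checks against a candidate placement. This is a sketch only: `check_placement`, the data layout, and the balance rule (master counts per node differ by at most one) are assumptions for illustration, not anything taken from Helix.

```python
def check_placement(assignment, nodes):
    """Return a list of violated requirements; an empty list means all checks pass.

    assignment maps partition -> list of (node, state) pairs (hypothetical layout).
    nodes is the non-empty list of nodes in the cluster.
    """
    problems = []
    masters = {node: 0 for node in nodes}
    for partition, replicas in assignment.items():
        states = [state for _, state in replicas]
        hosts = [node for node, _ in replicas]
        # Requirement: 1 master per partition.
        if states.count("MASTER") != 1:
            problems.append(f"{partition}: expected exactly 1 master")
        # Requirement: replicas on different nodes.
        if len(set(hosts)) != len(hosts):
            problems.append(f"{partition}: replicas share a node")
        for node, state in replicas:
            if state == "MASTER":
                masters[node] = masters.get(node, 0) + 1
    # Assumed balance rule: master counts differ by at most one across nodes.
    if max(masters.values()) - min(masters.values()) > 1:
        problems.append("masters are not balanced across nodes")
    return problems


nodes = ["node1", "node2", "node3"]
good = {
    "P.1": [("node1", "MASTER"), ("node2", "SLAVE")],
    "P.2": [("node2", "MASTER"), ("node3", "SLAVE")],
    "P.3": [("node3", "MASTER"), ("node1", "SLAVE")],
}
bad = {
    "P.1": [("node1", "MASTER"), ("node1", "SLAVE")],
    "P.2": [("node2", "SLAVE"), ("node3", "SLAVE")],
}
```

Calling `check_placement(good, nodes)` returns an empty list, while `check_placement(bad, nodes)` reports the co-located replicas of P.1 and the missing master of P.2.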

Page 15: Untangling Cluster Management with Helix

Declarative Problem Statement

State machine
– States: offline (O), slave (S), master (M)
– Transitions: O-S, S-O, S-M, M-S

Constraints
– States: M COUNT=1, S COUNT=2
– Transitions: t1 ≤ 5

Objective
– Partition placement: minimize $\max_{n_j \in N} M(n_j)$ and minimize $\max_{n_j \in N} S(n_j)$

[Figure: state diagram over O, S, and M with transitions labeled t1, t2, t3, t4]
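The states, transitions, and COUNT constraints on this slide amount to a small data declaration. The sketch below shows one way to express it; the class `StateModel` and its method names are hypothetical, and it assumes COUNT=1 bounds masters and COUNT=2 bounds slaves per partition, consistent with the "1 master per partition" requirement.

```python
from dataclasses import dataclass


@dataclass
class StateModel:
    """Hypothetical declarative state model: states, legal transitions, per-state bounds."""

    states: tuple
    transitions: frozenset  # allowed (from_state, to_state) pairs
    max_replicas_in_state: dict  # upper bound per partition for constrained states

    def is_legal(self, src, dst):
        """True if a replica may move directly from src to dst."""
        return (src, dst) in self.transitions

    def allows_counts(self, replica_states):
        """True if one partition's replica states respect every COUNT bound."""
        return all(
            replica_states.count(state) <= bound
            for state, bound in self.max_replicas_in_state.items()
        )


MASTER_SLAVE = StateModel(
    states=("OFFLINE", "SLAVE", "MASTER"),
    transitions=frozenset(
        {
            ("OFFLINE", "SLAVE"),  # O-S
            ("SLAVE", "OFFLINE"),  # S-O
            ("SLAVE", "MASTER"),   # S-M
            ("MASTER", "SLAVE"),   # M-S
        }
    ),
    # Assumption: COUNT=1 applies to masters, COUNT=2 to slaves.
    max_replicas_in_state={"MASTER": 1, "SLAVE": 2},
)
```

With this declaration, `MASTER_SLAVE.is_legal("OFFLINE", "MASTER")` is false because a replica must pass through slave first, and `MASTER_SLAVE.allows_counts(["MASTER", "MASTER", "SLAVE"])` is false because it would create two masters.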

Page 16: Untangling Cluster Management with Helix

Generalizing cluster management

[Figure labels: STATE MACHINE, CONSTRAINTS, OBJECTIVE]

Page 17: Untangling Cluster Management with Helix

Outline

• What is Helix
• Use case 1: distributed data store
• Architecture
• Use case 2: consumer group
• Helix at LinkedIn
• Q&A

Page 18: Untangling Cluster Management with Helix

Helix Based System Roles

[Figure: replicas of P.1 through P.12 on Node 1, Node 2, and Node 3; labels: COMMAND, RESPONSE]

Page 19: Untangling Cluster Management with Helix

Controller Execution Flow

[Figure labels: P1:OS, P1:SM]
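The labels P1:OS and P1:SM plausibly read as partition 1 moving from offline to slave and then from slave to master. One way a controller could derive such a sequence is a breadth-first search over the legal transitions; this is a sketch under that assumption, with the hypothetical names `TRANSITIONS` and `plan`, and it is not a description of Helix's actual controller.

```python
from collections import deque

# Legal transitions from the state machine slide: O-S, S-O, S-M, M-S.
TRANSITIONS = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["OFFLINE", "MASTER"],
    "MASTER": ["SLAVE"],
}


def plan(current, target):
    """Return the shortest list of (from, to) transitions, or None if unreachable."""
    queue = deque([(current, [])])
    seen = {current}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for nxt in TRANSITIONS.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(state, nxt)]))
    return None


steps = plan("OFFLINE", "MASTER")
```

Here `steps` holds the two transitions offline to slave and slave to master, matching the order of the slide's labels.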

Page 20: Untangling Cluster Management with Helix

Controller fault tolerance


Page 21: Untangling Cluster Management with Helix

Controller fault tolerance


Page 22: Untangling Cluster Management with Helix

Participant Plug-in code


Page 23: Untangling Cluster Management with Helix

Spectator Plug-in code


Page 24: Untangling Cluster Management with Helix

Benefits

Cluster operations “just work”
– Bootstrapping
– Failover
– Add nodes

Global vs Local
– Helix Controller: global knowledge, makes cluster decisions
– Participant: local knowledge, follows orders

Page 25: Untangling Cluster Management with Helix

Outline

• What is Helix
• Use case 1: distributed data store
• Architecture
• Use case 2: consumer group
• Helix at LinkedIn
• Q&A

Page 26: Untangling Cluster Management with Helix

Consumer group

Page 27: Untangling Cluster Management with Helix

Consumer group: Scaling


Page 28: Untangling Cluster Management with Helix

Consumer group: Fault tolerance


Page 29: Untangling Cluster Management with Helix

Consumer group: state model


Page 30: Untangling Cluster Management with Helix

Outline

• What is Helix
• Use case 1: distributed data store
• Architecture
• Use case 2: consumer group
• Helix at LinkedIn
• Q&A

Page 31: Untangling Cluster Management with Helix

Helix usage at LinkedIn (Pictures)

Espresso – a timeline-consistent, distributed data store

Databus – a change data capture service

Search as a Service – a multi-tenant service for multiple search applications

More planned


Page 32: Untangling Cluster Management with Helix

Summary

• Generic framework
• Easy to use: declarative model
• Easy to operate

Page 33: Untangling Cluster Management with Helix

Helix: Future Roadmap

• Features
  • Span multiple data centers
  • Load balancing
• Announcement
  • Open source: https://github.com/linkedin/helix
  • Apache incubation
  • New contributors

Page 34: Untangling Cluster Management with Helix

Questions?
