Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and...

61
Best Practices For Disaster Recovery and High Availability Stan Lukyanov Customer Solutions, GridGain Systems

Transcript of Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and...

Page 1: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Best Practices For Disaster Recovery and High AvailabilityStan Lukyanov

Customer Solutions, GridGain Systems

Page 2: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Agenda

1

• Disaster recovery essentials

• Disaster recovery options for Apache Ignite and GridGain

• Choosing the solution

Page 3: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

2

What is Disaster Recovery?

Page 4: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

What is Disaster Recovery?

App GridGain

Cluster

3

Page 5: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

What is Disaster Recovery?

App

Primary

DC

Secondary

DC

4

Page 6: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

What is Disaster Recovery?

App

Primary

DC

Secondary

DC

Cluster

Switchover

Data

Replication

5

Page 7: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

How to Compare DR Solutions?

6

• How long the service is down in case of a disaster?

• Is there a data loss in case of a disaster?

• How much does the solution cost?

• Any additional benefits to implementing that?

Page 8: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Disaster Timeline

7

t

Disaster

Page 9: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Disaster Timeline

8

t

Disaster

All OK All OK again

Page 10: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Disaster Timeline

9

t

Disaster

All OK All OK againService is

down

Page 11: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Disaster Timeline

10

t

Disaster

All OK All OK againLast updates

are lost

Service is

down

Page 12: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Disaster Timeline

11

t

Disaster

All OK All OK againLast updates

are lost

Service is

down

Recovery Point Objective

(RPO)

Recovery Time Objective

(RTO)

Page 13: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Recovery Point Objective (RPO)

12

• Maximum time for which data is allowed to be lost

• Defined by the replication lag

• RPO = 0 – updates are replicated immediately, i.e. replication is

synchronous

Page 14: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Recovery Time Objective (RTO)

13

• Maximum allowed service interruption time

• RTO = 0 – second cluster is in standby AND client switches instantly

• In real life it is either:• RTO = failure detection time (seconds) – if the second cluster is in standby

• RTO = cluster startup time (minutes to hours) – if the second cluster is

dormant

Page 15: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Comparison Parameters

14

Solution RPO RTO Cost and Benefits

??? 0 to hours 0* to hours

Minimal cost:

- Second DC

- Cluster switchover logic

- Data replication logic

* “RTO = 0” actually means “RTO = failure detection time”

Page 16: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

15

Disaster Recovery Options

Page 17: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Disaster Recovery Options

App

Primary

DC

Secondary

DC

Cluster

Switchover

Data

Replication

16

Page 18: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

17

Option 1: Backup-Based DR

Page 19: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 1: Backup-Based DR

App

Primary

DC

Secondary

DC

Cluster

Switchover

Disk

Storage

Backup

Restore

18

Page 20: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 1: Backup-Based DR

19

• RPO = backup period (hours)

• RTO = 0 if second DC is in standby, minutes to hours otherwise

• Requires a backup solution and a disk storage

• Requires custom cluster switchover

• Backups are generally useful• E.g. can be restored in a development environment

Page 21: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 1: Backup-Based DR

20

• Backups of a live cluster – no service disruption

• Incremental backups for more frequent backups – better RPO

• Automatic backup management (scheduling)

• Point-in-Time Recovery

• Only works with Native Persistence

• Available in GridGain Ultimate Edition

GridGain Solution: Snapshots

Page 22: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

21

Option 2: Stretched Cluster

Page 23: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

App

Primary

DC

Secondary

DC

Stretched cluster

22

Page 24: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

App

Primary

DC

Secondary

DC

Stretched Cluster

23

Split-Brain

Protection

Data Center

Awareness

Reliable Network

Page 25: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

24

• RPO = 0 – if writes in the cluster are synchronous (they often are)

• RTO = 0 – client switches as if on a node failure

• Doesn’t require cluster switchover nor replication solution

• Requires fast and reliable network between DCs

• Requires data center awareness

• Requires split-brain protection

Page 26: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

25

Split-Brain Explained

App

Primary

DC

Secondary

DC

App

Stretched Cluster

Page 27: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

26

Split-Brain Explained

App

Primary

DC

Secondary

DC

App

Stretched Cluster

Page 28: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

27

Split-Brain Explained

App

Primary

DC

Secondary

DC

App

Segmented Cluster

Segmented Cluster

Page 29: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

28

Split-Brain Explained

App

Primary

DC

Secondary

DC

App

Segmented Cluster

Segmented Cluster

Split-Brain

Protection

Page 30: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

29

Solution 1: TopologyValidator + SegmentationResolver

• TopologyValidator prevents updates in the segmented part

• SegmentationResolver stops the segmented part

• Available in GridGain Enterprise Edition

Solution 2: Zookeeper Discovery

• Zookeeper is responsible for keeping the cluster together

• Available in Apache Ignite and GridGain Community Edition

GridGain Solution For Split-Brain Protection

Page 31: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

30

Data Center Awareness Explained

Primary

DC

Secondary

DC

App

Stretched Cluster

Page 32: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

31

Data Center Awareness Explained

Primary

DC

Secondary

DC

App

Stretched Cluster

A A

B B

Page 33: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

32

Data Center Awareness Explained

Primary

DC

Secondary

DC

App

Stretched Cluster

A A

B B

Page 34: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

33

Data Center Awareness Explained

Primary

DC

Secondary

DC

App

Stretched Cluster

A B

A B

Data Center

Awareness

Page 35: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

34

Data Center Awareness Explained

Primary

DC

Secondary

DC

App

Stretched Cluster

A B

A B

Data Center

Awareness

Page 36: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

35

• RendezvousAffinityFunction.affinityBackupFilter

controls distribution of backups

• Available in Apache Ignite and GridGain Community Edition

GridGain Solution For Data Center Awareness

Page 37: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

36

Step 1: Implement backup filter

GridGain Solution For Data Center Awareness

class DcFilter implements IgniteBiPredicate<ClusterNode, List<ClusterNode>> {

@Override

public boolean apply(ClusterNode candidate, List<ClusterNode> assigned) {

String candidateDc = candidate.attribute(”dc”);

String primaryDc = assigned.get(0).attribute(”dc”);

return !Objects.equals(candidateDc, primaryDc);

}

}

Page 38: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

37

Step 2: Configure the filter in each cache

GridGain Solution For Data Center Awareness

<property name="affinity">

<bean

class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">

<property name="affinityBackupFilter">

<bean class="com.mycompany.DcFilter"/>

</property>

</bean>

</property>

Page 39: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 2: Stretched Cluster

38

Step 3: Assign a value to each node

GridGain Solution For Data Center Awareness

<property name="userAttributes”>

<map>

<entry key=”dc” value=”dc-1"/>

</map>

</property>

Page 40: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

39

Option 3: Streaming

Page 41: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 3: Streaming

App

Primary

DC

Secondary

DC

Cluster

Switchover

Stream

Processor

40

Page 42: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 3: Streaming

41

• RPO = seconds – replication lag• Could be 0 but synchronous replication is usually a bad idea

• RTO = 0

• Requires a streaming platform and/or embedded change-data-capture

functionality

Page 43: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 3: Streaming

42

• No additional software

• Active-Passive or Active-Active

• Allows for complex topologies, up to 32 data centers

• Available in GridGain Enterprise Edition

• Data Center Replication docs:

https://docs.gridgain.com/docs/data-center-replication

GridGain Solution: Data Center Replication

Page 44: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 3: Streaming

43

<bean class="org.apache.ignite.configuration.IgniteConfiguration">

<property name="cacheConfiguration">

<bean class="org.apache.ignite.configuration.CacheConfiguration">

<property name="name" value="crossDrCache"/>

<property name="pluginConfigurations">

<bean class="org.gridgain.grid.configuration.GridGainCacheConfiguration">

<property name="drSenderConfiguration">

<bean class="org.gridgain.grid.cache.dr.CacheDrSenderConfiguration"/>

</property>

</bean>

</property>

</bean>

</property>

</bean>

GridGain Solution: Data Center Replication

Step 1: Configure caches on the Sender side

Page 45: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 3: Streaming

44

<!–- In IgniteConfiguration.pluginConfigurations --><bean class="org.gridgain.grid.configuration.GridGainConfiguration">

<property name="dataCenterId" value="1"/>

<property name="drSenderConfiguration">

<bean class="org.gridgain.grid.configuration.DrSenderConfiguration">

<property name="connectionConfiguration">

<bean class="org.gridgain.grid.dr.DrSenderConnectionConfiguration">

<property name="dataCenterId" value="2"/>

<property name="receiverAddresses" value="172.16.2.100:50001"/>

</bean>

</property>

</bean>

</property>

</bean>

GridGain Solution: Data Center Replication

Step 2: Configure a Sender Hub

Page 46: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 3: Streaming

45

<bean class="org.apache.ignite.configuration.IgniteConfiguration">

<property name="cacheConfiguration">

<bean class="org.apache.ignite.configuration.CacheConfiguration">

<property name="name" value="crossDrCache"/>

<property name="pluginConfigurations">

<bean class="org.gridgain.grid.configuration.GridGainCacheConfiguration">

<property name="drReceiverEnabled" value=”true”/>

</bean>

</property>

</bean>

</property>

</bean>

GridGain Solution: Data Center Replication

Step 3: Configure caches on the Receiver side

Page 47: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 3: Streaming

46

<!–- In IgniteConfiguration.pluginConfigurations --><bean class="org.gridgain.grid.configuration.GridGainConfiguration">

<property name="dataCenterId" value=”2"/>

<property name="drReceiverConfiguration">

<bean class="org.gridgain.grid.configuration.DrReceiverConfiguration">

<property name="localInboundHost" value="172.16.2.100"/>

<property name="localInboundPort" value="50001"/>

</bean>

</property>

</bean>

GridGain Solution: Data Center Replication

Step 4: Configure a Receiver Hub

Page 48: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 3: Streaming

47

• Certified by Confluent

• Requires a Kafka instance deployed separately

• Maximum flexibility

• Available in GridGain Enterprise Edition

• Detailed guide for GridGain DR using Kafka:

https://docs.gridgain.com/docs/certified-kafka-connector-examples-dr

GridGain Solution: Kafka Connector

Page 49: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

48

Option 4: System-Level Replication

Page 50: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 4: System-Level Replication

AppVirtualized

Cluster

Virtualization

Software

Primary

DC disks

Secondary

DC disks

49

Page 51: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 4: System-Level Replication

50

• RPO = 0 to minutes – depends on the vendor

• RTO = minutes – usually requires cluster restart

• Requires virtualized/cloud environment – significant operations effort

and costs

Page 52: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

51

Option 5: Application-Level Replication

Page 53: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 5: Application-Level Replication

App

Primary

DC

Secondary

DC

Intelligent Data

Access Layer

52

Page 54: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Option 5: Application-Level Replication

53

• RPO = 0 – synchronous writes to both DCs

• RTO = 0 – always connected to both DCs

• Requires significant development effort – full DIY

Page 55: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

54

Choosing The Solution

Page 56: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Solutions Comparison

55

Solution RPO RTO Cost and Benefits

Backup-Based Hours 0 to minutes

- Cost: Backup solution and disk storage

- Cost: Custom switchover

- Benefit: Backups are generally useful

Stretched Cluster 0 0- Cost: Huge reliance on network

- Cost: Split-brain protection and data center awareness

GridGain DCR Seconds 0- Cost: Custom switchover

- Benefit: No additional components for replication

GridGain Kafka

ConnectorSeconds 0

- Cost: Kafka

- Cost: Custom switchover

- Benefit: Flexible, allows heterogenous consumers

System-Level 0 to minutes Minutes- Cost: VM/Cloud solution

- Cost: Huge reliance on network

Application-Level 0 0 - Cost: Full DIY

Page 57: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Choosing The Solution For GridGain DR

56

• Default – go with GridGain Data Center Replication• Works out of the box, no additional components

• If RPO requirements allow – go with GridGain Snapshots• Simple and powerful

• If heterogenous receivers are required – go with GridGain Kafka

Connector• Flexible and robust

• If already running on VM or in the Cloud – check their guarantees• May fit out of the box – but if not, don’t worth it just for DR

Page 58: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

57

Q&A

Page 59: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

Apache Ignite Resources

58

• Apache Ignite documentation

– https://apacheignite.readme.io/docs

– Apache Ignite community resources

[email protected] – the mailing list

– https://ignite.apache.org/community/resources.html – other resources and

instructions

– http://apache-ignite-users.70518.x6.nabble.com – forum and archive

– https://stackoverflow.com/questions/tagged/ignite – StackOverflow questions

Page 60: Best Practices For Disaster Recovery and High Availability · - Cost: Split-brain protection and data center awareness GridGain DCR Seconds 0 - Cost: Custom switchover - Benefit:

GridGain Resources

59

Moving Apache® Ignite™ into Production webinars

• Initial Checklist

• Best Practices for Native Persistence and Data Recovery

• Best Practices for Monitoring Distributed In-Memory Computing

• Best Practices for Deploying Apache Ignite in the Cloud

GridGain forums: https://forums.gridgain.com

GridGain documentation: https://docs.gridgain.com/docs