Golden Topology and Best Practices 2018 - IBM
Golden Topology and Best Practices 2018 — Simon Kapadia, Developer Portal Security Lead, APIC, IBM Development
Please note
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
• Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Who am I?
• Currently: Developer Portal Security Lead for API Connect Development
• Previously: Security Lead for IBM MobileFirst Platform Development, EMEA Security Lead for Software Services for WebSphere
• WebSphere Specialist since 1999, DataPower Specialist since 2005, APIC since APIMv3
• 15 years of implementing real-world distributed computing systems for all manner of customers and industries, including “the serious ones” (banks, governments, law enforcement, pharmaceuticals, etc.)
Who are you?
• A technical audience – so this is not a marketing presentation, and not targeted at executives!
• An interested audience – user group membership is optional! This presentation assumes that you will be building an APIC infrastructure and want my thoughts on how to do that.
• An informed audience – I assume that you already know APIC
• Hopefully, a difficult audience – I expect questions, so speak up; call me out on things I don’t explain well or if you disagree with me!
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
Runtime Goals
• What do you actually want to achieve with your topology?
• How much are you willing to pay to achieve it?
• Let’s start with some definitions:
  • HA – High Availability
  • DR – Disaster Recovery
  • CO – Continuous Operation
  • CA – Continuous Availability
HA – High Availability
• Ensuring that the system can continue to process work within one location after routine single component failures
• Usually we assume a single failure
• Usually the goal is very brief disruptions for only some users for unplanned events
CO – Continuous Operations
• Ensuring that the system is never unavailable during planned activities
• For example, if the application is upgraded to a new version, we do it in a way that avoids downtime
CA – Continuous Availability
• High Availability coupled with Continuous Operations
• No tolerance for planned downtime
• As little unplanned downtime as possible
• Very expensive to implement
• Note that while achieving CA almost always requires an aggressive DR plan, they are not the same thing
• Also referred to as “Always On”
DR – Disaster Recovery
• Ensuring that the system can be reconstituted and/or activated at another location and can process work after an unexpected catastrophic failure at one location
• Often multiple single failures (which are normally handled by high availability techniques) are considered catastrophic
• There may or may not be significant downtime as part of a disaster recovery
• This environment may be substantially smaller than the entire production environment, as only a subset of production applications demand DR
• DR Measurements:
  • Recovery Time Objective (RTO) = how quickly service must be restored, with little to no interruption
  • Recovery Point Objective (RPO) = how much data loss is acceptable during recovery
…but but but
• Surely with all this cloud and Kubernetes and Cassandra and other modern words, we don’t have to worry about any of this?
• Right?
• Well, no, you really do!
• You are still running a complex distributed computing system with non-functional requirements
• The key is understanding those requirements and designing an infrastructure which can meet them
• Back to my first questions: What do you actually want to achieve with your topology? How much are you willing to pay to achieve it?
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
So, let’s talk about APIC
• Other presentations have discussed the architecture, the components and how they work together
• I will be presenting deployment options and how the components scale and fail
• Remember the icons on the right
Gateway Instance
Analytics Instance
Portal Instance
Manager Instance
Obligatory IBM APIC 2018.1 Marketing Slide
Create: Automatically create and test APIs to expose data, microservices, enterprise applications and SaaS services.
Secure: Easily apply built-in and extensible policies to secure, control and mediate the delivery of APIs with unmatched scale.
Manage: Rapidly publish, lifecycle govern, socialize, analyze, monitor and monetize APIs with built-in capabilities.
The Scalable Multi-Cloud API Platform
A complete, modern and intuitive API lifecycle platform to create, securely expose and manage APIs across clouds to power digital applications
API Connect V2018.1: Component Scope
• Single Manager per API Connect Cloud, as it is the brain of the API Management system
• Manager can span multiple Availability Zones, giving flexibility in deployment scenarios
• Multiple Portal, Analytics and Gateway Services per Cloud; each is scoped to a single Availability Zone
• API Connect Cloud defined as 1 APIM with N Component Services; most customers have 2+ Cloud environments (Development, Staging, Production, etc.)
What’s an Availability Zone?
− A logical configuration construct
− Can be in one Datacentre or over multiple Datacentres
− Management Service can span multiple availability zones
High Availability in 2018.x
Why is there substantial change with APIC v2018.1?
Major advances in application development, deployment and management have led companies to begin pursuing multi-cloud application strategies.
API Connect 2018.1 ships Kubernetes in OVA installations, giving customers some of the benefits of a cloud-native solution without having to install their own Kubernetes environment. This, however, comes with additional high availability requirements.
API Connect leverages Kubernetes and other underlying technologies (data persistence such as Cassandra) to achieve the scalability and reliability needs of a modern multi-cloud API management platform. For example, if the database within the Portal fails, traffic is directed to the remaining members and the failed node is auto-restarted to support future traffic.
Kubernetes and the underlying APIC component technologies require quorum; without quorum the services will begin behaving abnormally. Quorum requirements are calculated as: node failure tolerance = (N-1)/2, where N is the number of instances or nodes in the cluster (ICP explanation).
***In 2018.x the API Connect team uses High Availability to refer to 3 instance deployments, while in v5 High Availability refers to 2 instances
© 2018 IBM Corporation
High Availability: v5 vs. 2018.x

Gateway
  v5: Cannot dynamically scale; slow upgrade process; gateways reliant on the Manager for gateway configurations; does not require quorum
  2018.x: Dynamically scales; drastically reduced upgrade time; gateways self-manage configurations at cluster level; requires quorum

Analytics
  v5: Bottlenecks the Manager instance; does not promote remote gateway deployments; only 1 Analytics cluster per API Connect Cloud
  2018.x: Separated from the Manager instance; optimized for remote gateway deployments by deploying analytics next to the gateway to reduce latencies; deploy multiple Analytics clusters per API Connect Cloud

Manager
  v5: No true active/active setup; cloud dissociation (split-brain scenarios); impacted by analytics functionality
  2018.x: True active/active cluster configurations; quorum avoids cloud dissociation; better performance and stability with analytics removed

Portal
  v5: Recommended to have 3+ Portal instances; only supports 1 Portal cluster per API Connect Cloud
  2018.x: No changes from v5; deploy multiple Portal clusters per API Connect Cloud
High Availability: Quorum
[Diagram: clusters of increasing size, alternately labelled “Cluster has Quorum” and “Cluster does not have Quorum”]
Odd numbers are better for Quorum!
• Cluster can scale to an even number of nodes under increased load, but it is better to always have an odd number of members
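The (N-1)/2 failure-tolerance rule is easy to check for any cluster size. A minimal sketch (not APIC code, just the arithmetic):

```python
def failure_tolerance(n_nodes: int) -> int:
    """Node failures a quorum-based cluster of n_nodes can survive.

    Quorum needs a strict majority (n_nodes // 2 + 1 members), so the
    cluster tolerates (n_nodes - 1) // 2 failures.
    """
    return (n_nodes - 1) // 2

# Why odd numbers are better: the 4th (or 6th) node adds cost but no
# extra failure tolerance.
print([(n, failure_tolerance(n)) for n in range(1, 8)])
# → [(1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 2), (7, 3)]
```

Note that a 3-node cluster and a 4-node cluster both survive exactly one failure, which is why the slides recommend odd member counts.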
OVA install 2018.1
Management Cluster (shown right)
Kubernetes requires 3 master nodes minimum:
o etcd requires quorum or will lose write abilities
o With no write abilities, it cannot update the API server and may send requests to failed or disconnected nodes
This impacts APIM, Portal, & Analytics OVA installs.
Quorum requirements for Kubernetes master nodes apply to both OVA and container deployments.
Link to k8s docs
[Diagram: three VMs, each running the Kubernetes master components (API Server, etcd, Controller, Scheduler) plus master and worker microservices, fronted by an Ingress]
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
High Availability (HA): Single DC, 3 Instance, HA Deployment
• 3 instances minimum for HA based on quorum
  o (N-1)/2 dictates that this HA setup can handle a single node failure
  o If more than 1 node or 1 instance fails, then the application begins behaving abnormally
• Nodes represent either physical machines or VMs
Data Center 1
Node 1 Node 2 Node 3
**Nodes represent physical machine or VMs
• Scenario depicts either a failure of node 3, a failure of the instances on node 3, or a lost connection from node 3
• Quorum is maintained between the instances running on Nodes 1 & 2
Data Center 1
Node 1 Node 2 Node 3
High Availability (HA): Single DC, 3 Instance, HA Deployment
**Nodes represent physical machine or VMs
Data Center 1
Node 1 Node 2 Node 3
High Availability (HA): Single DC, 3 Instance, HA Deployment
**Nodes represent physical machine or VMs
• Scenario depicts either a failure of nodes 2 & 3, a failure of the instances on nodes 2 & 3, or a lost connection from nodes 2 & 3
• Quorum is lost on Node 1 and thus the API Connect components begin behaving abnormally
• Cluster can scale to an even number of nodes under increased load, but it is better to always have an odd number of members
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
What is a Disaster?
[From a BBC News report on lightning strikes at a Google data centre:]

Some people have permanently lost access to the files on the affected disks as a result. A number of disks damaged following the lightning strikes did, however, later become accessible. Generally, data centres require more lightning protection than most other buildings. Google has said that lightning did not actually strike the data centre itself, but the local power grid, and the BBC understands that customers, through various backup technologies, were able to recover all lost data. While four successive strikes might sound unlikely, lightning does not need to repeatedly strike the same place or the actual building to cause damage. Justin Gale, project manager for the lightning protection service Orion, said lightning could strike power or telecommunications cables connected to a building at a distance and still cause disruptions. "The cabling alone can be struck anything up to a kilometre away, bring [the shock] back to the data centre and fuse everything that's in it," he said.

Unlucky strike
The Google Compute Engine (GCE) service allows Google's clients to store data and run virtual computers in the cloud. It's not known which clients were affected, or what type of data was lost. In an online statement, Google said that data on just 0.000001% of disk space was permanently affected. "Although automatic auxiliary systems restored power quickly, and the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain," it said. The company added it would continue to upgrade hardware and improve its response procedures to make future losses less likely. A spokesman for data centre consultants Future-Tech commented that while data centres were designed to withstand lightning strikes via a network of conductive lightning rods, it was not impossible for strikes to get through. "Everything in the data centre is connected one way or another," said James Wilman, engineering sales director. "If you get four large strikes it wouldn't surprise me that it has affected the facility." Although the chances of data being wiped by lightning strikes are incredibly low, users do have the option of being able to back things up locally as a safety measure.
Some definitions
− Redundancy: The provision of additional or duplicate systems, equipment, etc., that function in case an operating part or system fails, as in a spacecraft.
− Isolated: Separated from other persons or things; alone; solitary
− Independent: Not dependent; not depending or contingent upon something else for existence, operation, etc.
− All of the above are fundamental for effective High Availability and Disaster Recovery
Disaster Recovery Objectives
− Recovery Time Objective
  • How quickly the system will be able to accept traffic after the disaster
  • Shorter times require progressively more expensive techniques
    • e.g., a tape backup and restore is relatively inexpensive
    • e.g., a fully redundant, fully operational data center is very expensive
− One challenge is detection time
  • It takes time to determine you are in a disaster state and trigger disaster procedures
  • While you are deciding if you are down, you are probably missing your SLA
  • Does the RTO include detection time?
Disaster Recovery Objectives
− Recovery Point Objective
  • How much data you are willing to lose when there is a disaster
  • Limiting data loss raises costs
    • e.g., restoring from tape is relatively inexpensive, but you'll lose everything since the last backup
    • e.g., asynchronous replication of data and system state requires significant network bandwidth to prevent falling far behind
    • e.g., synchronous replication to the backup data center guarantees no data loss but requires a VERY fast and reliable network and will significantly harm performance
      • Warning: in turn this results in increased latency, which means capacity must be increased at all layers
Disaster Recovery Objectives
− Most RTO and RPO goals will deeply impact application and infrastructure architecture and can't be done “after the fact”
  • e.g., if data is shared across data centers, your database and application design will have to be careful to avoid conflicting database updates and/or tolerate them
  • e.g., application upgrades have to account for multiple versions of the application running at once, which can affect user interface design, database layout, etc.
− Extreme RTO and RPO goals tend to conflict
  • e.g., using synchronous disk replication of data gives you a zero RPO, but that means the second system can't be operational, which raises RTO
− A zero RTO *and* a zero RPO are mutually exclusive goals
Data Center Utilization Urban Legends
− Legend:
  • Active/Active improves utilization
− Reality:
  • An active/active topology at 40-50% utilization in each DC is equivalent to an active/passive deployment with one DC active at 80-90% utilization and the other passive
  • Running active/active at greater than 50% of total (both data centers) capacity can often result in a complete loss of service when a data center outage occurs
  • Insufficient capacity in the remaining data center to absorb the full load results in:
    • Poor response time (at best)
    • Network and server overload, resulting in a complete crash
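The arithmetic behind that reality check can be sketched with a toy capacity model (illustrative numbers, two equal-sized DCs):

```python
# Toy capacity model for a two-DC active/active deployment.
# Each DC has capacity 1.0; "utilization" is load divided by capacity.

def survivor_utilization(per_dc_utilization: float) -> float:
    """Utilization of the remaining DC after one of two equal DCs fails."""
    total_load = 2 * per_dc_utilization
    return total_load  # the survivor (capacity 1.0) must carry all of it

# Active/active at 45% per DC: the survivor runs at 90% -- degraded but alive.
print(survivor_utilization(0.45))  # → 0.9

# Active/active at 65% per DC: the survivor is asked for 130% -- overload,
# and in practice a likely complete crash.
print(survivor_utilization(0.65))  # → 1.3
```

In other words, the "spare" 50-60% in each active DC is not waste; it is the failover headroom an active/passive pair keeps in one place.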
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
• Services are highly available in Data Center 1
• Data Center 2 is pre-configured and ready to have instances of the services deployed
• Periodic cron jobs are scheduled to back up data from each of the services
  o These backups are synced from DC1 to DC2
• Customer is only paying IBM for DC 1 services
Data Center 1: Active
Node 1
Node 2
Node 3
Data Center 2: Passive
Node 1
Node 2
Node 3
High Availability + Disaster Recovery (DR): 2 DC, Active/Passive, CA Deployment
**Nodes represent physical machine or VMs
Data Center 1: Active
Node 1
Node 2
Node 3
• In the event of DC 1 failure, scripts can be executed in DC 2 that will begin deploying instances of the APIC services
• This gives users an answer to catastrophic events that cause data center failure
• However… DC 2 can only recover the last backup
  • Any data not backed up is lost when failing over to DC 2
• Additionally, the customer would need to wait until the infrastructure and software are ready before operations could resume
Data Center 2: Passive
Node 1
Node 2
Node 3
High Availability + Disaster Recovery (DR): 2 DC, Active/Passive, CA Deployment
**Nodes represent physical machine or VMs
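With this pattern, worst-case data loss is bounded by the backup cadence. A toy sketch of that RPO arithmetic (hypothetical timestamps, not an APIC API):

```python
from datetime import datetime, timedelta

# Periodic backups are synced from DC1 to DC2; on failover, DC2 restores
# the latest synced backup. Anything written after that backup is lost,
# so the worst-case data loss (effective RPO) equals the backup interval.

last_backup = datetime(2018, 6, 1, 12, 0)   # hypothetical timestamps
disaster_at = datetime(2018, 6, 1, 17, 30)

loss_window = disaster_at - last_backup
print(loss_window)  # → 5:30:00 of writes lost on failover

# Check the schedule against an agreed RPO of 4 hours:
rpo = timedelta(hours=4)
print(loss_window <= rpo)  # → False: back up more often, or accept the loss
```

The same check applies to detection time on the RTO side: the clock starts at the disaster, not at the moment someone notices it.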
High Availability + Disaster Recovery (DR): 2 DC, Active/Passive, HA Deployment
IBM Sub-Capacity Licensing
“In the case of a program or system configuration that is designed to support a high availability environment by using various techniques (for example, duplexing, mirroring of files or transactions, maintaining a heartbeat, or active linking with another machine, program, database, or other resource), the program is considered to be doing work in both the warm and hot situation and license entitlements must be acquired.”

Backup | Entitlements Required
Hot    | Yes
Warm   | Yes**
Cold   | No

**Based on the definitions of “Doing Work”, APIC does not qualify for warm passive treatment; entitlements are needed for warm set-ups. See the bottom of page 8 of the IBM Software Licensing Guide.
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
• Can tolerate a node failure, connection drop, or DC failure
• Optionally deploy additional instances in any DC, as long as quorum is maintained in the event of a DC failure
• Cluster can scale to an even number of nodes under increased load, but it is better to always have an odd number of members
• Best option if the operations plan requires data center fault tolerance
Data Center 2: Active
Node 2
Data Center 1: Active
Node 1
Data Center 3: Active
Node 3
Low Latency Network
HA Active/Active 1: Odd Number of DCs (3 DC, Single AZ)
**Nodes represent physical machine or VMs
Data Center 2
Node 1
Node 2
Node 3
• If DC 2 fails:
  o All services maintain functionality
• If DC 1 fails, or its instances are unavailable:
  • Portal would be lost
• APIM is registered to 2 different AZs
• Not the best topology for continuous operation of the Portal
Data Center 1
Node 1
Node 2
Node 3
Low Latency Network
HA Active/Active 2: Multiple Services (2 DC Active/Active, Dual AZ)
**Nodes represent physical machine or VMs
• If DC 2 fails:
  o All services maintain functionality
• If DC 1 fails, or its instances are unavailable:
  • Quorum is lost and thus the cluster begins behaving abnormally
  • Gateways in DC 2 will be able to continue to handle traffic
• Not the best topology for continuous operation
Data Center 1: ActiveMain-Site
Node 1
Node 2
Node 3
Data Center 2: Active
Node 4
Node 5
Low Latency Network
HA Active/Active 3: Main-site (2 DC Active/Active, Single AZ)
**Nodes represent physical machine or VMs
• If DC 2 fails, the connection drops between the data centers, or one of the DCs is lost:
  • Quorum is lost and thus the cluster begins behaving abnormally
  • Gateway can still handle traffic
• This leaves the 3 options above as the best deployment options for data center failure for API Manager & Portal:
  1. 3 DCs
  2. Multiple Services
  3. Main-site
Data Center 1: Active
Node 1
Node 2
Node 3
Data Center 2: Active
Node 4
Node 5
Node 6
Low Latency Network
HA Active/Active: Not Recommended (2 DC, Single AZ)
**Nodes represent physical machine or VMs
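A quick way to see why even splits across two DCs are risky is to simulate losing each DC in turn and check whether the survivors still form a majority. A minimal sketch of that check:

```python
def survives_any_dc_loss(nodes_per_dc):
    """True if quorum survives the loss of ANY single data center.

    nodes_per_dc: list of node counts, one entry per DC.
    """
    total = sum(nodes_per_dc)
    majority = total // 2 + 1
    return all(total - lost >= majority for lost in nodes_per_dc)

print(survives_any_dc_loss([1, 1, 1]))  # → True:  3 DCs, any one can fail
print(survives_any_dc_loss([3, 2]))     # → False: losing the 3-node main site kills quorum
print(survives_any_dc_loss([3, 3]))     # → False: losing either DC kills quorum
```

This is why the 3-DC layout is the only one here that tolerates the loss of an arbitrary data center, and why the main-site pattern only protects against losing the smaller site.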
Networks are inherently unreliable
• Your network people are lying to you if they say otherwise
• Things do go wrong:
  • Routers get misconfigured or have bugs (yes, a router is a computer just like other computers, and they have bugs)
  • People dig holes in the wrong place and cut through your special magical custom-designed dark fiber links
  • Quality of service mechanisms get overwhelmed and misbehave
  • “It’s not a network problem, pings are fine” usually guarantees it really is a network problem!
• How sure are you that your “low latency network” is robust?
A word on Latency…
1Gb vs 10Gb: which is faster?
Both are exactly the same. You cannot change the laws of physics!

Distance             | 0.5 mile | 100 miles | 500 miles | 1000 miles
Estimated Round Trip | 2.1 ms   | 5.4 ms    | 12.5 ms   | 21.35 ms

A 10Gb link handles more traffic than a 1Gb link, but both are equally fast!
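The round-trip estimates above are dominated by distance, not bandwidth. A rough lower-bound calculation (assuming signal propagation at about two-thirds the speed of light in fiber; real networks add router, queuing and path overhead on top, which is why the estimates above are higher):

```python
# Propagation-only round-trip time over fiber, as a physical lower bound.

SPEED_IN_FIBRE_KM_PER_MS = 200.0  # roughly 2/3 of c, a common rule of thumb
KM_PER_MILE = 1.609

def min_round_trip_ms(distance_miles: float) -> float:
    """Theoretical minimum round-trip time for a given one-way distance."""
    return 2 * distance_miles * KM_PER_MILE / SPEED_IN_FIBRE_KM_PER_MS

for miles in (0.5, 100, 500, 1000):
    print(f"{miles:>6} miles: at least {min_round_trip_ms(miles):.2f} ms")
# The floor rises with distance regardless of 1Gb vs 10Gb link speed.
```

Upgrading the link raises throughput, never lowers this floor, so a "<30 ms" latency requirement is ultimately a constraint on how far apart your DCs can be.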
The CAP Theorem
• In a distributed environment, especially one spanning data centers across LANs and WANs, there are three core requirements for a service:
  • Consistency
    • Either the service works or it fails
    • Traditional ACID databases provide consistency and isolation
  • Availability
    • Extremely important in the web business model
    • In a large distributed system, one may have to compromise on consistency for the sake of availability
  • Partition Tolerance
    • Network partitions will happen when not all machines are connected
    • “No set of failures less than total network failure is allowed to cause the system to respond incorrectly” – Gilbert and Lynch
    • Quorum is used to guard against split-brain syndrome
• Brewer’s CAP conjecture states that one can achieve only two, not all three, of the above requirements
Multiple Active DCs and the CAP Theorem
• Active/Active requires you to sacrifice either consistency, availability or partition tolerance
  • All three aren’t possible
• If you choose full availability, then you are going to lose guaranteed consistency
  • So you need to design with this in mind, and build in mechanisms (typically involving queuing technologies) that enable your system to “tend towards” consistency
• Your data is going to be in two places, either partitioned or replicated
  • If the former, what happens when one site is down?
  • If the latter, what happens when users hitting each site see slightly different versions of the current state?
  • These are very complex problems
• Which is why I try to steer customers away from active/active and into an active/passive model with DR from active to passive
  • But they always feel like they are wasting hardware…
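The “tend towards consistency” idea can be illustrated with a toy replication queue (purely illustrative; real systems use message brokers, retries and conflict-resolution policies):

```python
from collections import deque

# Writes are accepted locally first (availability), queued, and applied to
# the second site asynchronously. Until the queue drains, readers at the
# two sites can see different states: eventual, not immediate, consistency.

site_a = {}
site_b = {}
replication_queue = deque()

def write(key, value):
    site_a[key] = value                     # acknowledge immediately
    replication_queue.append((key, value))  # replicate later

def drain():
    while replication_queue:
        key, value = replication_queue.popleft()
        site_b[key] = value

write("plan", "gold")
print(site_a == site_b)  # → False: the sites have diverged
drain()
print(site_a == site_b)  # → True: ...and converged once the queue drained
```

The design question active/active forces on you is what your users see in the window between the first print and the second, and what happens to the queue when the inter-site link is the thing that failed.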
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
Continuous Operation: Distributed Architecture
Requirements for APIC Services Spanning DCs
• Low-latency connection recommended to federate services across DCs
  o Latency of <30 ms round trip
  o True for Analytics, Portal and Gateway Services
• Manager is a singleton that can span AZs, as Cassandra works on eventual consistency
• 2 common patterns in distributed architectures:
  1. Applications/services across multiple clouds and environments
  2. Setting up geographic high availability of applications
DC 2
Node 2
DC 1
Node 1
DC 3
Node 3
Low Latency Network
DC 2
Node 2
DC 1
Node 1
DC 3
Node 3
Low Latency Network
Distributed Architecture Pattern 1: Hybrid Cloud Applications
• Single web & mobile app, with geographically dispersed target services
• Managed from a single API Management layer, optimized for cloud scale
• Remote Gateways deployed next to different components of the application
• Components could also be in the same AZ in different DCs
• Two Analytics options:
  1. Co-locate analytics in the same AZ or cloud to reduce network overhead
  2. Centralize to a single Analytics service
Target Service 1
Icons Represent Clusters
Target Services 2 &3
Target Services 4
Public Cloud On-Prem DCs (US East)
On-Prem DCs (US West)
Load Balancer
• Replicated services across geographic regions
• Co-locate API Gateways with runtimes to reduce application latency
• Client deploys a load balancer to route incoming requests to the AZ best suited to serve each request
Target Service 1
Target Service1
Target Service1
Public Cloud On-Prem DCs (US East)
On-Prem DCs (US West)
Distributed Architecture Pattern 2: Geographic HA of Applications
Icons represent clusters
Example Customer 1
DC 2
Node 2
DC 1
Node 1
DC 3
Node 3
Low Latency Network
DC 2
Node 2
DC 1
Node 1
DC 3
Node 3
Low Latency Network
Customer Requirements:
• Has an internal set of DCs already running internal apps
  o Add an APIM layer to these apps
• Use a 3rd-party cloud to deploy a new set of apps & services for external parties to consume
• No direct access to any components running in the internal network from anyone outside the company
• Separate Portals for internal and external API consumers
• Separate Analytics services for security and network latency reasons
• Single APIM layer for both the internal and externally facing environments
Customer Managed DCs: Internal Traffic
3rd Party Cloud (AWS): 3rd Party Traffic / BP Integrations
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
Am I trying to scare you off?
• I’m pretty sure I am going to be accused by marketing of trying to frighten customers (they haven’t seen these slides yet :))
• This is absolutely not the case.
• The product is designed to work around the issues involved in creating a large-scale distributed environment. We have years of experience in doing this at IBM!
• My point is that doing this requires thought, planning, and treating it as the difficult computing endeavor that it is.
• Focus on your goals – what do you want to achieve?
If you are planning a large deployment…
• We are actively searching for customers who are planning large scale deployments
• We have a Lab Advocacy program which can work with you to bring your feedback to development
• We are constantly striving to improve our products; feedback from customers is essential to that goal!
One thing we don’t talk about enough…
• What about your back-end systems?
• There is limited value in scaling your API gateway to multiple global data centers if they all connect to a single back-end application in one of them!
• Are your back-end systems capable of handling all of the load in the event of a disaster?
• Are all the subsidiary systems APIM relies upon (e.g. LDAP) available in all data centers and capable of handling the load?
• Is your database available and capable of handling everything if things go wrong?
Recovery time and Recovery Point Objectives
• Most RTO and RPO goals will deeply impact application and infrastructure architecture and can't be done “after the fact”
  • e.g. if data is shared across data centers, your database and application design will have to be careful to avoid conflicting database updates and/or tolerate them
  • e.g. application upgrades have to account for multiple versions of the application running at once, which can affect user interface design, database layout, etc.
• Extreme RTO and RPO goals tend to conflict
  • e.g., using synchronous disk replication of data gives you a zero RPO, but that means the second system can't be operational, which raises RTO
• A zero RTO *and* a zero RPO are mutually exclusive goals
Test your Disaster Recovery!
• Have a complete, detailed plan of what to do in a disaster
• Test it! Actually cause a disaster.
• No, OK, I don’t mean blow up a datacentre. That will get you in trouble.
• But you can simulate a disaster. Take a network link down. Pull the plug on a bunch of servers. At random.
• Have someone who doesn’t know the environment at all walk into a datacentre and just start pulling cables! Executives love doing this and it gets you brownie points (as long as everything goes according to plan).
Learn from mistakes
• Mistakes and failures will occur; learn from them
• What separates mediocre organizations from the good and great isn't so much perfection as it is the constant striving to get better – to not repeat mistakes
• After every outage, perform a root cause analysis:
  • Capture diagnostic information
  • Meet as a team, including all key players, to discuss
  • Determine precisely what went wrong
    • Wrong doesn't mean “Bob made an error.”
    • Find the process flaw that led to the problem
  • Determine a corrective action that will prevent this from happening again
    • If you can't, determine what diagnostic information is needed the next time this happens and ensure it is collected
  • Implement that corrective action
    • All too often this last step isn't done
  • Verify that the action corrected the problem
• A senior manager must own this process
Questions?