
Whitepaper:

The Low Down on High Availability in the Cloud

There is no magic recipe for achieving high availability, whether it’s in the Cloud or Hybrid IT or all physical infrastructure. No secret shortcut or simple formula. But with some thought and planning, and by partnering with the right cloud services provider, it’s easier than you might think.

This paper takes a look at the considerations and best practices for designing highly-available IT solutions, as well as the technologies included with or available on CenturyLink Cloud to enable high availability.

CenturyLink Cloud Whitepaper: The Low Down on High Availability in the Cloud

Contents

Introduction
The Ultimate Goal of High Availability on the Cloud
Understanding the Differences: Load Balancing, Disaster Recovery and HA
    Load Balancing for Redundancy
    Disaster Recovery for Business Continuity
    High Availability: The End Goal
Architecting for HA
    Assessing Risk Mitigation
    Hyperscale Anti-Affinity Policies
    Leveraging Multiple Data Centers
    The Database
    Load Balancing Factors for HA
    Autoscaling
        Horizontal Autoscale
        Vertical Autoscale
    Managing Storage for HA
HA for Hybrid IT
    How is HA for Hybrid IT Different?
    Using Managed DNS
    Assessing Failure Risks with Hybrid IT Deployments
    Calculating Availability for Distributed Systems
    Hybrid IT and HA Summary
Putting it All Together on CenturyLink Cloud Platform
    The Platform: Fault Tolerant & Highly Available
        Hardened Firewall Clusters
        Uninterruptible Switch Traffic Flow
        Performance Optimization
        Clustered SAN Storage
        Isolate Compute/Storage Hosts
    Products and Services on CenturyLink Cloud for HA
        Load Balancing
        SafeHaven DRaaS
        Relational DB Service
        Managed MySQL and Managed MS SQL
        Managed DNS
        Object Storage
        Block Storage
    Partner Technologies Enabling HA on CenturyLink Cloud
        Microsoft SQL Server AlwaysOn
        Double-Take
        SoftNAS Cloud
        CloudMaestro ADC
        Vormetric DSM
Conclusion



Introduction

High availability is a priority for any organization today that relies on data and information technology (IT) to be successful. And in this hyper-warp-speed information-centric era, that means pretty much every organization – from the leanest startup to top enterprises around the globe.

Being focused on high availability means not only focusing on unplanned downtime. It is also about a passionate resolution to eliminate planned downtime. Because every minute of downtime is lost business – lost revenue, lost customers, lost opportunities.

Tech industry pundits provide varying estimates regarding the cost to organizations for downtime, and the numbers are staggering. Gartner’s Andrew Lerner reports:

“Based on industry surveys, the number we typically cite is $5,600 per minute, which extrapolates to well over $300K per hour.” (1)
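The extrapolation in the quote is simple arithmetic. As a minimal sketch in Python, using only the per-minute figure cited above (the function name is an illustrative assumption):

```python
def downtime_cost(minutes: float, cost_per_minute: float = 5_600.0) -> float:
    """Estimated cost of an outage lasting `minutes`, at a flat per-minute rate."""
    return minutes * cost_per_minute

# 60 minutes at $5,600 per minute is $336,000, i.e. "well over $300K per hour".
print(f"One hour of downtime: ${downtime_cost(60):,.0f}")
```

A flat per-minute rate is a simplification; real outage costs rarely scale linearly with duration.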

Stephen Elliot from IDC provides equally sobering statistics based on a survey of over 20 Fortune 1000 organizations:

• $1.25 to $2.5 billion per year: the average total cost of unplanned application downtime.

• $100,000 per hour: the average cost of an infrastructure failure.

• $500,000 to $1 million per hour: the average cost of a critical application failure. (2)

This whitepaper takes a deep dive into high availability – what it is, what it isn’t, and how CenturyLink Cloud can help. Our cloud platform simplifies building out highly-available solutions, offering scalability, ease and speed of configuration, and managed infrastructure as a service (IaaS). The platform is deployed on enterprise-class infrastructure that is itself highly-available, and provides self-service management via an integrated, intuitive Control Portal for deep cloud orchestration and automation. We’ll discuss some of the tools and services that speak directly to high availability, and provide insights into how to leverage them to achieve the best results for your organization.

The cost to organizations for downtime? Analysts claim it’s as much as $5,600 per minute, “…which extrapolates to well over $300K per hour.”


The Ultimate Goal of High Availability on the Cloud

It’s common to assume that the cloud “auto-magically” delivers high availability (HA). But that’s simply not the case. Yes, the cloud provides a vehicle to build resilient systems and accommodate inevitable failures, but it’s up to you to implement the cloud products and managed services in a way that gives you the resilience you need. To build highly-available applications, you have to design systems to handle both routine interruptions and unplanned failures of components and infrastructure, from a single instance all the way through the data center.

Defining high availability depends on your perspective and the problems you’re trying to solve. You might be primarily concerned with HA in terms of how you can ensure system-wide redundancy across a footprint of global data centers. Or perhaps you’re more interested in the full application stack and how to build highly-available apps from end to end. Contrast that with a network engineer, whose interest might be in creating redundancy across the cloud network backbone.

But the ultimate goal is the same: ensuring cloud solutions are highly-available and scale easily. You want the ability to scale up, scale out, and scale across. Vertical scaling lets you add or remove server resources like CPU or RAM on demand for a single server or group of servers. Horizontal scaling lets you increase or decrease the capacity in your application environment by adding servers or shutting them down until needed. A data center footprint that’s dispersed geographically gives you breadth. This kind of scalability ensures that applications and environments can respond to activity at the right time in the right way, delivering high availability.

The ultimate goal of high availability is the same, no matter how you approach it: ensuring cloud solutions are highly-available and scale easily.


Understanding the Differences: Load Balancing, Disaster Recovery and High Availability

It’s easy for cloud customers to get confused about the roles and responsibilities of their internal IT team versus their cloud services provider. That confusion is especially evident when it comes to application availability and business continuity planning. How does disaster recovery differ from high availability? Does my cloud provider automatically load balance my application servers? The answers to these questions are critical, but sometimes overlooked until a crisis occurs. So let’s peel away the layers of high availability, disaster recovery and load balancing: how they differ, overlap and interrelate.

Load Balancing for Redundancy

Load balancing is a significant component of high availability, but doesn’t guarantee HA in and of itself. You commonly see this technique employed in web applications where multiple web servers work together to handle inbound traffic. There are at least two reasons why load balancing is employed:

1) The required capacity is too large for a single machine. When running processes that consume a large amount of system resources (e.g., CPU and memory), it often makes sense to employ multiple servers to distribute the work instead of constantly adding capacity to a single server. In plenty of cases, it’s not even possible to allocate enough memory or CPU to a single machine to handle the entire workload. Load balancing across multiple servers makes it possible to host high traffic websites or run complex data processing jobs that demand more resources than a single server can deliver.

2) You require both high availability and flexibility in a solution. Even if you could run an entire application on a single server, it may not be a good idea. Load balancing can increase reliability by providing many servers which are able to do the same job. If one server becomes unavailable, the others can simply pick up the additional work until a new server comes online. Software updates become easier since a server can easily be taken out of the load balancing pool when a patch or reboot is necessary. Load balancing gives system administrators more flexibility in maintaining servers without negatively impacting the application as a whole.

From Wikipedia:

Load balancing is a computer networking method to distribute workload across multiple computers or a computer cluster, network links, central processing units, disk drives, or other resources, to achieve optimal resource utilization, maximize throughput, minimize response time, and avoid overload. Using multiple components with load balancing, instead of a single component, may increase reliability through redundancy. (3)


Load balancing can be accomplished using either a “push” or a “pull” model. In the former, web applications or database clusters sit behind a load balancer and inbound requests are “pushed” to the pool of servers based on an algorithm such as round-robin. In this scenario, servers await traffic sent to them by the load balancer. In the “pull” model, work requests are added to a centralized queue and a collection of servers retrieve those requests from that queue when they’re available. For instance, consider big data processing scenarios where many servers work to analyze data and return results. Each server takes a chunk of work and the overall processing load is distributed across many machines.

To reiterate, although load balancing is a key component of both HA and business continuity, it guarantees neither. High availability is described through service level agreements and achieved through an architecture that focuses on constant availability even in the face of failures at any level of the system. While load balancing introduces redundancy, it’s not a strategy that alone can provide high availability. Servers sitting behind a load balancer may be running, but that doesn’t guarantee they’re available.

CenturyLink Cloud offers multiple load balancing options. All public cloud customers have access to a free, shared load balancer. This load balancer service provides a range of capabilities including SSL offloading for higher performance, session persistence (known as “sticky sessions”), and routing of TCP, HTTP and HTTPS traffic for up to three servers. If you’re looking for more control over the load balancing configuration or have higher bandwidth needs, you can deploy a dedicated load balancer (virtual appliance) into the CenturyLink Cloud. This gives you complete control over the load balancer setup so that you can modify the routing algorithm and enable or disable features that matter to your business. Both shared and dedicated load balancing services are based on the powerful Citrix NetScaler product.
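The difference between the “push” and “pull” models described above can be sketched in a few lines of Python. This is an illustrative simulation only, not how any CenturyLink or Citrix load balancer is implemented; the function names, server names and per-job timings are assumptions.

```python
import heapq
from collections import deque
from itertools import cycle

def push_round_robin(requests, servers):
    """Push model: the balancer hands each request to the next server in turn."""
    assignment = {server: [] for server in servers}
    rotation = cycle(servers)
    for request in requests:
        assignment[next(rotation)].append(request)
    return assignment

def pull_from_queue(jobs, seconds_per_job):
    """Pull model: whichever worker becomes free first takes the next queued job."""
    queue = deque(jobs)
    # Heap of (time the worker becomes free, worker name); all start idle at t=0.
    free_at = [(0.0, name) for name in seconds_per_job]
    heapq.heapify(free_at)
    assignment = {name: [] for name in seconds_per_job}
    while queue:
        now, name = heapq.heappop(free_at)
        assignment[name].append(queue.popleft())
        heapq.heappush(free_at, (now + seconds_per_job[name], name))
    return assignment

jobs = list(range(6))
print(push_round_robin(jobs, ["fast", "slow"]))
print(pull_from_queue(jobs, {"fast": 1.0, "slow": 2.0}))
```

In the push case both servers receive three requests regardless of speed; in the pull case the faster worker ends up with four of the six jobs, because it returns to the queue more often.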


Disaster Recovery for Business Continuity

High availability is likely to be a component of your disaster recovery plan, but more fundamentally, a DR strategy is about how you handle unexpected events. Typically, your cloud provider has to declare a disaster before explicitly initiating DR procedures. A brief network outage or storage failure in a data center is usually not enough to trigger a disaster response, although it can be a major inconvenience for the customer.

There are two terms that you often hear when defining a DR plan – RPO and RTO. A recovery point objective (RPO) describes the maximum window of data that can be lost because of a disaster. For example, an RPO of 12 hours means that when you get back online after a disaster, you may have lost the most recent 12 hours of data collected by your systems. A recovery time objective (RTO) identifies how long the IT systems (and processes) can be offline before being restored. So an RTO of 48 hours means that it may take two days before the systems lost in the disaster are brought back online and become usable again.

While a DR plan provides assurances against losing all of your data in the event of a disaster, it still may not provide the level of business continuity and high availability that you require. If your business cannot tolerate more than a few moments of downtime, even in the event of a disaster, then it’s critical to architect a highly-available solution that can withstand the loss of an entire data center. That means identifying all the DNS, networking, compute and storage considerations for building systems that are highly-available not only within a single data center, but across multiple geographies.

CenturyLink Cloud offers SafeHaven DR-as-a-Service to protect data and VMs in your own on-premises data center and on CenturyLink Cloud. You also might be interested in our Disaster Recovery Reference Architecture Guide, which outlines what’s included in the CenturyLink Cloud, and highlights key considerations that must be evaluated to maintain system availability during a disaster event.
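To make the RPO and RTO definitions above concrete, the following Python sketch checks a single incident against both objectives. The timestamps are invented for illustration, and the function name and result keys are assumptions; the 12-hour RPO and 48-hour RTO are the example values used in the text.

```python
from datetime import datetime, timedelta

def check_objectives(last_backup, disaster, restored, rpo, rto):
    """Compare one incident against a recovery point and recovery time objective."""
    data_loss = disaster - last_backup   # data written after the last backup is lost
    outage = restored - disaster         # time the systems were offline
    return {
        "data_loss": data_loss,
        "outage": outage,
        "rpo_met": data_loss <= rpo,
        "rto_met": outage <= rto,
    }

result = check_objectives(
    last_backup=datetime(2024, 1, 1, 0, 0),
    disaster=datetime(2024, 1, 1, 9, 0),
    restored=datetime(2024, 1, 3, 12, 0),
    rpo=timedelta(hours=12),
    rto=timedelta(hours=48),
)
print(result["rpo_met"], result["rto_met"])
```

In this hypothetical incident, nine hours of data are lost (within the 12-hour RPO), but the 51-hour outage misses the 48-hour RTO.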

From Wikipedia:

Disaster recovery (DR) is the process, policies and procedures that are related to preparing for recovery or continuation of technology infrastructure which are vital to an organization after a natural or human-induced disaster. (4)


High Availability: The End Goal

High availability means withstanding failure from all angles including the network, storage, and even the data center itself. So it’s the end goal of a solid DR plan. And it’s what you are seeking to achieve through techniques like load balancing.

Enterprise cloud services like those from CenturyLink Cloud are built on a highly-available architecture that uses redundancy at all levels to ensure that no single component failure in a data center impacts overall system availability. This includes “passive” redundancy built into data centers to overcome power or internet provider failures, as well as “active” redundancy that leverages sophisticated monitoring to detect issues and initiate failover procedures. All of our customers get platform-level high availability when they use the CenturyLink Cloud “out of the box.” That means that you can rely on us for your enterprise-class workloads knowing that our architecture is well-designed and highly redundant.

However – back to one of the initial statements of this paper – it’s the customer’s responsibility to design a high availability application architecture. Simply deploying an application to our cloud (or any other platform!) doesn’t make it highly-available. A cloud service provider should be able to simplify the complexity behind all of this so you can achieve the availability you need. But it still comes down to using the tools in a considered and intelligent way. The cloud provider offers the foundation. The customer chooses the level of redundancy.

As an example, if you deploy a single Microsoft SQL Server instance in the CenturyLink Cloud, you still don’t have a highly-available database. If that database server goes offline or network access is interrupted, your application’s availability will be impacted. To design a highly-available Microsoft SQL Server solution, you have multiple options. One choice is to create a cluster of database servers that access data from a shared disk, where either all nodes are active at the same time or some nodes sit passively, waiting to be engaged. When a failure in the active node is detected, the alternate node is automatically called into action.

So let’s look a bit deeper at the considerations around architecting a solution for high availability.
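The active/passive behaviour described above, where a standby node takes over once the active node is detected as failed, can be sketched as a heartbeat check. This is a simplified illustration in Python, not how SQL Server clustering works internally; the node names, the heartbeat timestamps and the five-second timeout are all assumptions.

```python
def active_node(nodes, last_heartbeat, now, timeout=5.0):
    """Return the first node, in priority order, whose heartbeat is recent enough.

    `nodes` is ordered from preferred to least preferred. A node that has not
    reported within `timeout` seconds (or has never reported) is treated as failed.
    """
    for node in nodes:
        age = now - last_heartbeat.get(node, float("-inf"))
        if age <= timeout:
            return node
    return None  # no node is healthy: the database is unavailable

cluster = ["sql-a", "sql-b"]
heartbeats = {"sql-a": 100.0, "sql-b": 109.0}
print(active_node(cluster, heartbeats, now=103.0))  # sql-a is still fresh
print(active_node(cluster, heartbeats, now=110.0))  # sql-a timed out, sql-b takes over
```

With a single-node deployment the same function returns `None` as soon as that node stops reporting, which is the situation the text warns about.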

From Wikipedia:

High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period. (5)


Architecting for High Availability

For a highly-available solution, you need to think about redundancy and the ability to configure servers with capacity to scale horizontally and vertically. You also need to consider load balancing to improve uptime, built-in disaster recovery and optimized infrastructure of the platform itself, and multiple connections to the backbone networks. In fact, the reliability of the network component is critical to building an HA solution and is often undervalued. You need to leverage some combination of dedicated private connections, data center cross-connects, MPLS, latency management, firewall policies, and VPNs.

But how do you determine how much time, effort and cost are warranted in designing a highly-available solution? Before you get started, you may find it useful to take an analytical approach to assessing risk mitigation.

"It’s up to you to implement the cloud products and managed services offered by the cloud provider in a way that gives you the resilience you need."
VP of Product, CenturyLink Cloud


Assessing Risk Mitigation

While the likelihood of a widespread outage at any individual data center is quite low, the probability is always greater than zero. There are many reasons that data centers fail, but one of the most common is human error, especially error associated with network infrastructure.

Cloud providers typically offer service level agreements (SLAs) that reflect recent history and future operating performance expectations. For example, CenturyLink Cloud offers a network infrastructure availability SLA of 99.99%.

Risk mitigation assessment involves estimating the costs and consequences of an outage and weighing them against the additional cost associated with server redundancy and a multi-data center deployment. For each deployment location, redundant physical or virtual machines are used to ensure web servers and application servers will fail over to other local machines or instances. You may also want to break out the failure risks associated with hardware, software and network separately for each location, as shown in Table 1 below.

Risk Mitigation Matrix

This risk mitigation assessment provides a simple plan for how to address each layer of infrastructure across multiple cloud data centers. Now we’ll take a closer look at those mitigation strategies.

Table 1 - An example of a risk mitigation matrix for a simple high availability solution in the cloud.
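The weighing described in this section can be made concrete with a little arithmetic. The Python sketch below converts an availability percentage, such as the 99.99% SLA mentioned above, into permitted downtime per year, and compares an expected outage cost against the cost of redundancy. The function names and the figures in the usage lines (120 minutes of expected downtime, $250,000 per year for a second deployment) are illustrative assumptions, not CenturyLink data; the $5,600 per minute rate is the analyst estimate quoted in the Introduction.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be down and still meet the given SLA."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR

def redundancy_is_justified(expected_downtime_minutes: float,
                            cost_per_minute: float,
                            annual_redundancy_cost: float) -> bool:
    """True when the expected annual outage cost exceeds the cost of mitigation."""
    return expected_downtime_minutes * cost_per_minute > annual_redundancy_cost

print(f"99.99% allows about {allowed_downtime_minutes(99.99):.2f} minutes per year")

# Hypothetical inputs: 120 minutes of expected downtime at $5,600 per minute,
# weighed against $250,000 per year for a second data center deployment.
print(redundancy_is_justified(120, 5_600, 250_000))
```

A 99.99% SLA works out to roughly 52.56 minutes of permitted downtime per year; whether that is acceptable depends on what each of those minutes costs your business.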


Hyperscale Anti-Affinity Policies

Affinity rules and anti-affinity rules tell the hypervisor to keep virtual machines together (affinity) or separated (anti-affinity). The rules, which can be applied as either required or preferred, help reduce traffic across networks and keep the virtual workload balanced on available hosts.

In Table 1, Cloud Data Center B shows a failure mitigation step of ‘Use Redundant VMs on Different Physical Machines.’ By taking this step, you prevent a single machine failure from taking down the entire application. The CenturyLink Cloud Control Portal enables you to implement this step by creating a Hyperscale Anti-Affinity Policy. The policy ensures that new VM instances will not be located on the same physical machine, thereby avoiding a single point of failure at the physical machine level. Figure 1 shows the Control Portal interface for creating an anti-affinity policy on the CenturyLink Cloud.

Figure 1 - Creating a Hyperscale Anti-Affinity Policy through the CenturyLink Cloud Control Portal.
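The effect of an anti-affinity rule can be illustrated with a toy placement function. This Python sketch is not the CenturyLink Cloud API or the hypervisor’s scheduler; the function names, VM names and host names are hypothetical, and it models only the “required” form of the rule.

```python
from collections import Counter

def place_with_anti_affinity(vms, hosts):
    """Assign each VM in the group to a different physical host."""
    if len(vms) > len(hosts):
        raise ValueError("anti-affinity needs at least as many hosts as VMs")
    return dict(zip(vms, hosts))

def violations(placement):
    """Hosts that carry more than one VM from the same anti-affinity group."""
    counts = Counter(placement.values())
    return sorted(host for host, count in counts.items() if count > 1)

placement = place_with_anti_affinity(["web-1", "web-2"], ["host-a", "host-b", "host-c"])
print(placement)
print(violations({"web-1": "host-a", "web-2": "host-a"}))  # both VMs share host-a
```

If `host-a` fails in the first placement, only `web-1` is lost; in the violating placement, the same failure takes down both web servers.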


Leveraging Multiple Data Centers

Replicating data across multiple machines within a data center, and then replicating it to a sister cluster in another facility, is always a good practice – the result is a high availability cluster. When DNS services route between load balancers in different data centers and one data center goes down, your public site continues to function because the DNS record simply routes to the secondary as the primary. The locations can be distributed regionally and/or internationally to survive even a complete data center outage. Data centers should also have either direct connectivity or blended transit for carrier neutrality.

Moreover, you want the cloud provider to protect data centers with industry standards related to physical and logical security along with best practices for compliance standards like SOC, ISO, and PCI. Working with a global enterprise IT service provider like CenturyLink, you can rest assured we have experience with a wide range of security controls, regulatory requirements and industry standard compliance models, such as SOC 1/SSAE 16, ISO27001, PCI DSS and more. Benefit from our investment in these IT security frameworks to assess your internal readiness and accelerate compliance obligations. Information provided by CenturyLink around these compliance programs demonstrates how our automation platform provides a solid foundation for your risk mitigation strategy.
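The DNS-based failover described in this section, where traffic routes to the secondary data center when the primary is down, can be sketched as a priority-ordered lookup. This Python example is an illustration only and does not represent any managed DNS product; the hostname, site names and addresses (taken from the IP ranges reserved for documentation) are assumptions.

```python
def resolve(hostname, records, healthy_sites):
    """Return the address of the first healthy site listed for `hostname`."""
    for site, address in records[hostname]:
        if site in healthy_sites:
            return address
    raise LookupError(f"no healthy site for {hostname}")

# Hypothetical record: primary load balancer first, secondary second.
RECORDS = {
    "www.example.com": [
        ("dc-primary", "203.0.113.10"),
        ("dc-secondary", "198.51.100.10"),
    ],
}

print(resolve("www.example.com", RECORDS, {"dc-primary", "dc-secondary"}))
print(resolve("www.example.com", RECORDS, {"dc-secondary"}))  # primary is down
```

In practice, how quickly clients follow such a change also depends on DNS record TTLs and resolver caching, which this sketch leaves out.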

CenturyLink Cloud is available in 13 world-class data centers around the globe, allowing you to easily provision public cloud servers in geographically paired locations.



The Database

It’s a reasonable argument to state that high availability starts with the database – since every mission-critical IT solution revolves around data. The apps, servers, operating systems, networks, and infrastructure exist to provide end users with access to the data. So the goal is to build high availability across the spectrum of components – starting with the data – and then moving outward to the data center infrastructure.

Typically, data is stored in Storage Area Network (SAN) arrays or file-based Network Attached Storage (NAS). With both of these approaches, there’s some level of redundancy built-in, which means that if a single component fails, the data is still accessible to the customer. Also, the data is often replicated across physical disks in the event of a drive failure. However, there are newer scenarios where applications may deploy less redundant – and commensurately less expensive – underlying storage and rely on smarter databases to manage replication and redundancy.

If you require granularity at the database level, there are several database service options you might want to consider. CenturyLink Cloud offers a Relational DB Service which allows you to deploy a MySQL-compatible database instance at the click of a button. Alternatively, you could choose to deploy a Managed MySQL or Managed Microsoft SQL database service, also available from CenturyLink, which can be custom-architected for high availability.

High availability starts with the database – every mission-critical IT solution revolves around data.


Load Balancing for HA

Cloud computing effectively changes the relationship between infrastructure and workloads that run on it. The workload becomes both portable and pipelined across infrastructure that is multi-tier and redundant by design. Figure 2 shows how this looks, with load balancers that route traffic between multiple application server and web server instances. These redundant web and application servers provide both additional capacity and failover.

The steady state design workload for any single VM in this scenario may be as little as 50% or 60% of maximum capacity. Load balancers can be used to front-end both web server clusters and application servers. At each level, load balancers will automatically redirect load to healthy instances in case availability issues arise. The goal is to ensure that there is always sufficient reserve capacity for failover in case another instance becomes unavailable.

Geo-Load Balancing is another approach, enabling users to reach web applications reliably and quickly, regardless of physical location. Administrators can direct traffic based on several scenarios including user geography and available capacity. Geo-Load Balancing is available with the CenturyLink Cloud Managed DNS Service.
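The reserve-capacity reasoning above (running each VM at 50% or 60% so that survivors can absorb a failed instance’s load) can be checked with a short calculation. The Python sketch below assumes identical instances and an evenly redistributed load, which is a simplification; the function names are illustrative.

```python
def utilization_after_failures(instances: int, steady_utilization: float, failed: int) -> float:
    """Per-instance utilization once `failed` instances drop out of the pool."""
    if not 0 <= failed < instances:
        raise ValueError("at least one instance must survive")
    return steady_utilization * instances / (instances - failed)

def max_tolerable_failures(instances: int, steady_utilization: float) -> int:
    """Largest number of failures the pool can absorb without exceeding 100%."""
    failed = 0
    while (failed + 1 < instances
           and utilization_after_failures(instances, steady_utilization, failed + 1) <= 1.0 + 1e-9):
        failed += 1
    return failed

print(max_tolerable_failures(2, 0.5))  # two VMs at 50%: one can fail
print(max_tolerable_failures(3, 0.6))  # three VMs at 60%: one can fail
print(max_tolerable_failures(4, 0.5))  # four VMs at 50%: two can fail
```

Three instances at 60% run at about 90% after one failure, which leaves little headroom for a traffic spike during the failover.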


Figure 2 - Cloud architecture, including load balancers.


Autoscaling

When a solution runs processes that consume a large amount of system resources (e.g., CPU and memory), it often makes sense to employ multiple servers to distribute the work instead of constantly adding capacity to a single server. In addition to gaining failover capacity, using multiple instances or VMs for each tier in an application provides a simple path to supporting larger workloads or handling more requests.

For example, consider a website where people can register for a new, paid service. The system has to perform a fraud check, authenticate a payment method and create a container for the new user. A “new user signup” message is placed in a queue, and a set of servers is tasked with reading data from the queue and processing the request. If the number of signups spikes, these worker nodes can get overwhelmed and the new customers will be stuck waiting for their signup confirmations.

The best practice to avert this outcome is to scale systems dynamically with usage. Planning only for maximum load, by contrast, runs the risk of overprovisioning (and over-spending). Of course, it’s possible to adjust capacity manually, but that’s not optimal. The recommended practice is to set your environment for autoscaling, if that feature is available on your cloud service. CenturyLink Cloud offers both Horizontal and Vertical Autoscaling, both of which are easily managed from our Control Portal.
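For the signup queue example above, a rough sizing calculation shows why a fixed worker pool is either overwhelmed during a spike or overprovisioned the rest of the time. The Python sketch below is a back-of-the-envelope estimate, not a queueing model; the arrival rates, the three-second processing time and the 75% target utilization are all assumed values.

```python
import math

def workers_needed(signups_per_minute: float,
                   seconds_per_signup: float,
                   target_utilization: float = 0.75) -> int:
    """Workers required to keep up with arrivals at the target utilization."""
    busy_workers = signups_per_minute * seconds_per_signup / 60.0
    return max(1, math.ceil(busy_workers / target_utilization))

print(workers_needed(120, 3))  # normal load
print(workers_needed(600, 3))  # a fivefold spike in signups
```

With these assumptions the pool needs 8 workers under normal load and 40 during the spike, so provisioning permanently for the peak would leave most of the servers idle most of the time.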

Horizontal Autoscaling

Horizontal Autoscaling involves adding or removing virtual servers from a defined pool. With Horizontal Autoscaling, your application can respond to changes in demand or load within minutes. It will automatically accommodate peak traffic and unforeseen demand. With this approach, you can define an autoscale policy that specifies when new servers will be added or deleted based on changes in application load. Figure 3 shows how CenturyLink Cloud’s Control Portal enables you to create such a policy. In this case, the admin has set the environment to scale out when CPU or memory usage goes over 80%. The threshold period represents a sliding time window used to buffer changes, which keeps the system from reacting to short-lived usage spikes.



Figure 3 - CenturyLink Cloud’s Horizontal Autoscale interface with an Autoscale policy setting – scale out when CPU or memory usage exceeds 80%.
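A scale-out rule of this kind, an 80% threshold combined with a threshold period, can be sketched as follows in Python. The class is hypothetical and is not the Control Portal’s implementation; in particular, it assumes that every sample in the window must exceed the threshold before scaling out, which is one plausible reading of how a sliding window buffers short-lived spikes.

```python
from collections import deque

class ScaleOutPolicy:
    """Trigger scale-out only when usage stays above the threshold for a full window."""

    def __init__(self, threshold: float = 80.0, window: int = 3):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # sliding window of recent samples

    def observe(self, cpu: float, memory: float) -> bool:
        """Record one sample; return True when a scale-out should be triggered."""
        # "CPU or memory over threshold" is the same as max(cpu, memory) over threshold.
        self.samples.append(max(cpu, memory))
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)

policy = ScaleOutPolicy(threshold=80.0, window=3)
readings = [(85, 40), (50, 50), (90, 30), (82, 60), (81, 95)]
print([policy.observe(cpu, mem) for cpu, mem in readings])
```

The isolated spike in the first reading does not trigger anything; only the last reading, which completes three consecutive samples above 80%, does.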

Vertical Autoscaling Vertical Autoscaling involves adding horsepower to existing servers instead of adding servers as you would with Horizontal Autoscaling. The choice between Horizontal and Vertical Autoscaling will depend on the specific use case. Both can accommodate changing demand scenarios. In some cases, it will not be possible to add new instances of components to cloud-based or hybrid applications. For example, when a database supports stateful web-database applications, it may not be feasible to quickly add new database instances due to the sheer volume of data involved. It’s also common to run databases on hardware that’s optimized for the specific dataset and workload. Observing conventions used to maintain referential integrity further constrains adding new database servers. Examples include synchronization of data between servers or partitioned indexes where each machine handles requests for a limited range of users or items. In these cases, Vertical Autoscaling enables compute instances to be scaled up for greater capacity as singular units without introducing new instances. This matters because relational databases work in multi-server configurations, but each server may have significant resources allocated to it and they may divide and conquer large data set management in creative ways. In the case of databases which are specialized or customized to meet specific needs, adding more CPU/memory/storage to a given server is a perfectly viable way to handle new demand.

CenturyLink Cloud’s Vertical Autoscaling Policy can also be used to reduce CPU resources assigned to an instance once server load has declined. The Vertical Autoscaling Policy will automatically remove unneeded CPU capacity and reboot the server during a time window, set within the policy, that is chosen for minimal impact. This is a powerful way to take advantage of cloud elasticity without rebuilding your existing applications for horizontal scale. Figure 4 shows our Control Portal interface for setting Vertical Autoscale policies.

Schedule-based scaling is another variant of the autoscaling approach to managing load. In a seasonal business, it’s often a good practice to schedule the scaling up and down of resources in advance of actual loads. A scaling schedule can be created that gives you the ability to anticipate scaling events and accommodate them in advance.
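As a rough illustration of schedule-based scaling, the sketch below looks up a planned server count for a given day. The schedule entries, the baseline, and the function name `planned_capacity` are invented for the example and do not reflect any CenturyLink Cloud API.

```python
from datetime import date

# Hypothetical seasonal schedule: (first day, last day, planned server count).
SCHEDULE = [
    (date(2016, 11, 20), date(2016, 12, 31), 12),  # assumed holiday peak
    (date(2016, 7, 1), date(2016, 7, 7), 6),       # assumed summer promotion
]
BASELINE_SERVERS = 3  # assumed capacity outside any scheduled window


def planned_capacity(day, schedule=SCHEDULE, baseline=BASELINE_SERVERS):
    """Return the server count planned for `day`.

    If several windows overlap, the largest planned count wins; outside all
    windows the baseline applies.
    """
    matches = [count for start, end, count in schedule if start <= day <= end]
    return max(matches, default=baseline)
```

Calling `planned_capacity(date(2016, 12, 1))` returns the holiday value of 12, while a day in March falls back to the baseline of 3. In practice the lookup would run ahead of the window so that capacity is in place before the load arrives.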

Figure 4 - Defining a Vertical Autoscale Policy with CenturyLink Cloud.

A scaling event may require careful planning and manual resizing because of the complexity of the target application. Consider the potential for unintended consequences if you were to automate the resizing of your NoSQL database, cache cluster, or mission-critical line-of-business system whenever a heavy load is detected. While manual scaling is sometimes preferred, ideally your cloud platform provides options to streamline manual processes. For instance, with CenturyLink Cloud, user-defined scripts or system “Blueprints” can give you the ability to initiate your scaling manually while still benefiting from automation. The series of steps required to scale a deployment can be scripted so that the sequence of events required is a push-button exercise.
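The idea of a scripted, push-button scaling sequence can be sketched as an ordered list of steps that stops at the first failure. This is a generic illustration and not a Blueprint definition; the step names and the `run_runbook` helper are hypothetical.

```python
def run_runbook(steps):
    """Run (name, action) steps in order and stop at the first failure.

    Returns a tuple: the names of the steps that completed, and the name of
    the step that failed (or None if every step succeeded).
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception:
            # Stop here so an operator can inspect the partially scaled system.
            return completed, name
        completed.append(name)
    return completed, None


log = []
# Hypothetical steps; real actions would call your own tooling.
steps = [
    ("drain traffic", lambda: log.append("drained")),
    ("resize server", lambda: log.append("resized")),
    ("restore traffic", lambda: log.append("restored")),
]
result = run_runbook(steps)
```

The operator still decides when to start the sequence, but the order of operations is fixed in code rather than remembered under pressure.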


Managing Storage for HA The best practice is to make sure your cloud applications have storage explicitly architected for fault-tolerance. Some cloud architectures offer storage directly attached to the virtual machine. However, when the VM goes down or is taken out of service, this “temporary VM storage” goes away and is typically not recoverable. The architect should think through how application state is persisted in the event of VM failure and how VM data output will be captured and forwarded to non-volatile storage so data isn’t lost when an instance goes down. To avoid such availability challenges, it’s recommended that you rely on persistent block storage or object storage and map out a volume plan for efficient provisioning. The storage options from the cloud provider should ideally enable block volumes that can be sized in discrete units to avoid overprovisioning. Moreover, your cloud platform should provide storage which reliably maintains your data even when virtual machines fail. Block Storage should be backed by Storage Area Networks (SANs) or equivalent, “striped” across multiple Redundant Array of Independent Disks (RAID) sets and mirrored within each set, so even if multiple RAID sets fail, you won’t lose your data.

For large-scale cloud applications, Object Storage is far more efficient than hierarchical file systems. It stores diverse digital assets in a highly-available, secure, shared repository, with multiple levels of redundancy built in. Within a given data center, the best practice is to replicate your data across multiple machines. In addition, you should consider replicating data to a sister cluster in another facility. Users can then trust that data added to Object Storage will be readily available even when faced with unlikely node or data center failures. CenturyLink Cloud offers both SAN-based Block Storage and enterprise-grade, highly-available Object Storage, with automatic replication.

CenturyLink Cloud VMs store and manage your files in a highly-scalable, fault-tolerant distributed datastore. Our SAN-Based Block Storage features built-in disaster recovery and offers a minimum of 2,500 IOPS up to 20,000 IOPS, with less than 5ms latency.


High Availability for Hybrid IT Environments Experienced IT managers know that completely migrating existing enterprise solutions to the cloud is rarely feasible, at least not currently. As a result, for the next few years most enterprises will likely take a hybrid approach to IT, one that combines on-premises data centers, public and private clouds along with the networking that securely ties it all together. As more IT departments embrace the mixed cloud/on-premises construct of Hybrid IT, the whole HA picture changes. While running applications in the cloud can potentially enhance availability and reliability, hybrid applications also complicate the availability equation. The resulting hybrid solution will likely have end-to-end availability characteristics that are considerably different from one based solely in the cloud. Figure 5 shows a simple example of a hybrid deployment environment. In this scenario, cloud SLAs must be reconciled with on-premises availability to characterize the behavior of the end-to-end system. In the following section, we’ll examine the issue and offer some industry best practices guidance on ensuring HA in a Hybrid IT environment.

Figure 5 - A simplified example of a Hybrid IT environment, with an on-premises database and application server connected to a cloud-based web server and partner data source through a cloud-based API.

How is HA for Hybrid IT Different? Users of the cloud-based web server in Figure 5 are relying on the availability and responsiveness of an on-premises database and business logic layer, as well as a partner data source. Careful analysis is required to make this integrated hybrid solution maintainable, functional, and highly-available.

CenturyLink Cloud gives businesses greater flexibility with multiple deployment options to create powerful Hybrid IT solutions. We offer Public Cloud, Private Cloud, Bare Metal servers and colocation, all tied together over a global network. Our approach is the perfect mix for solutions that require dynamic or highly changeable distributed workloads.


This is not always easy. Hybrid IT introduces new layers of security, new network segments and multiple providers, as well as “build for failure” architecture patterns that don’t assume reliability of infrastructure. Storage and databases in the cloud add further elements of challenge to availability. In a sense, achieving HA represents a litmus test of whether Hybrid IT is ready to support large enterprises. Managers responsible for Hybrid IT need a design approach for high availability regardless of the location of an IT asset. Hybrid IT introduces several new variations on familiar HA architectural practices. While a pure cloud deployment might depend on loosely-coupled processes that reside exclusively in the cloud, a Hybrid IT deployment will be based on loosely-coupled processes that are distributed across many different platforms, each with its own failure modes. Private cloud, public cloud, on-premises and colocation platforms all fail differently, so an important first step is to assess and manage failure risks for your platforms. A best practice when building distributed systems on a cloud platform is to build each core component as a separate, repeatable unit. These units are typically loosely-coupled, which generally makes scaling easier and failure scenarios more manageable. Figure 6 shows how the scenario depicted in Figure 5 can be set up with this approach to redundancy and failover. The on-premises database is replicated in a cloud data center along with a copy of the application server and web server. The original web server is in a different cloud data center. In this architecture, there is no single point of failure in the application topology. The database and app server fail over to Cloud Instance B. If Cloud Instance A fails, the web server from Cloud Instance B will take over.

Figure 6 - Architecting for HA in the hybrid environment using loosely-coupled units.


Using Managed DNS to Implement HA Architecture Overall availability through redundancy grows when infrastructure dependencies are minimized between instances of mission-critical applications. For instance, if you’re a cloud customer, you have no control over your application’s availability should there be a loss of network connectivity to the data center where it’s running. However, if you’ve architected your application with HA in mind and set up a redundant instance running in a second, remote cloud data center, your application should maintain availability despite the problems in the original data center. But you’ll need to approach this differently for a Hybrid IT solution. A managed Domain Name System (DNS) service enables you to implement a multiple data center approach. The highly-reliable DNS service resolves all requests for a specific web application and distributes the load to multiple load balancers or servers located in one or more locations. This will enable dynamic traffic re-routing when application components become unavailable for any reason. It routes traffic away from failed data centers, services and components.
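The routing behavior described above can be sketched as a health-aware resolver that answers only with endpoints that pass a health check. This is a simplified model under stated assumptions: the endpoint records, the priority field, and the `resolve` function are hypothetical, and the addresses come from the ranges reserved for documentation.

```python
def resolve(endpoints, is_healthy):
    """Return the addresses a health-aware DNS answer might contain.

    Only healthy endpoints are considered, and among those the lowest
    priority number wins. Endpoints sharing that priority are all returned,
    which spreads load across them.
    """
    healthy = [e for e in endpoints if is_healthy(e["address"])]
    if not healthy:
        return []
    best = min(e["priority"] for e in healthy)
    return [e["address"] for e in healthy if e["priority"] == best]


# Hypothetical load balancer addresses in two data centers.
ENDPOINTS = [
    {"address": "203.0.113.10", "priority": 1},   # primary data center
    {"address": "198.51.100.20", "priority": 2},  # secondary data center
]

down = {"203.0.113.10"}  # simulate the primary data center failing its check
answer = resolve(ENDPOINTS, lambda addr: addr not in down)
```

With the primary marked down, the answer contains only the secondary address. Real failover speed also depends on record TTLs and client caching, which this sketch does not model.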

Assessing Failure Risks with Hybrid IT Deployments In a Hybrid IT deployment scenario, applications should be designed with multiple failure scenarios in mind. A best practice is to evaluate the reliability of networks, machines and software for each location involved. One of the best predictors of future operating performance is past performance. It may be possible to estimate the statistical probability of an application failure using past performance or SLAs as numerical inputs. If a cloud data center has an SLA of 99.9% uptime, it can be expected to experience approximately 8.76 hours of downtime per year. (This is calculated as: 0.1% downtime x 24 hours/day x 365 days/year.) Estimating the future failure potential of new hybrid applications is not as simple as evaluating past performance or SLAs alone, however. A recommended practice for designing Hybrid IT systems for HA, as with any complex solution deployment, is to assess the costs and consequences of an outage and weigh them against the additional cost associated with server redundancy and a multi-data center deployment.
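The downtime arithmetic above is easy to reproduce for other SLA levels. The helper below is a minimal sketch; the function name and the 365-day year are assumptions made for the example.

```python
HOURS_PER_YEAR = 24 * 365  # non-leap year, matching the calculation in the text


def expected_downtime_hours(sla_pct):
    """Convert an uptime SLA percentage into allowed downtime hours per year."""
    return (1 - sla_pct / 100) * HOURS_PER_YEAR


# Compare a few common SLA levels side by side.
for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% uptime allows about {expected_downtime_hours(sla):.2f} hours of downtime per year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is a useful reference point when weighing the cost of redundancy against the cost of an outage.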


Doing the assessment involves a risk mitigation matrix, which is depicted in Table 2 for the example Hybrid IT scenario previously discussed. For each deployment location, redundant physical or virtual machines have been used to ensure web servers and application servers will fail over to other local machines or instances.

Risk Mitigation Matrix for Hybrid IT Environment

In Table 2, Cloud Data Center B shows a failure mitigation step of ‘Use Redundant VMs on Different Physical Machines.’ We mentioned this earlier in this paper. By taking this step, you prevent a single machine failure from taking down the entire hybrid application.

Calculating Availability for Distributed Systems Hybrid applications are inherently distributed as they run on multiple platforms. As a result, the availability of each individual platform will factor into the calculation for end-to-end application availability. Like a chain that’s no stronger than its weakest link, the environment with the least favorable availability will dominate the overall reliability outcome. The level of reliability can be quantified. Some applications place two application components in “series,” meaning that if either Component A or B fails, the end-to-end application becomes unavailable. If each component were running on a different platform that had an SLA of 99.5%, when the components are placed in series, the resultant availability would be: $A_R = A_1 \times A_2$, or $A_R = 99.5\% \times 99.5\% \approx 99.0\%$. The resultant system will be considerably less available than either component alone.
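The series calculation generalizes to any number of components by multiplying their availabilities. The sketch below assumes component failures are independent; the function name is hypothetical.

```python
from math import prod


def series_availability(availabilities):
    """Availability of components in series: all must be up at the same time.

    Assumes failures are independent, so the individual availabilities
    (expressed as fractions between 0 and 1) simply multiply.
    """
    return prod(availabilities)


two_in_series = series_availability([0.995, 0.995])
print(f"{two_in_series:.4%}")
```

Every component added in series lowers the result further, which is why long dependency chains across platforms deserve close attention.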

Table 2 - An example of a risk mitigation matrix for the Hybrid IT environment detailed in Figures 5 and 6.


In contrast, two application components can be placed in “parallel,” meaning that if either Component A or B fails, the other component will take over and assume the function of the component which failed with no loss of availability. In the case of parallel components, where each component has the same availability $A_x$, the resultant availability is calculated as: $A_R = 1 - (1 - A_x)^2$. If the two components each have an availability of 99.5% (as in the previous example), the resultant availability would be: $A_R = 1 - (1 - 0.995)^2 = 99.9975\%$. In this case, the resultant system becomes dramatically more reliable when redundant components are placed in parallel. Highly-available Hybrid IT systems should make use of such loosely-coupled, redundant and parallel components.
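The parallel formula extends to any number of redundant components: the system is down only when every component is down at once. The sketch below assumes independent failures and instantaneous failover, which real deployments only approximate; the function name is hypothetical.

```python
from math import prod


def parallel_availability(availabilities):
    """Availability of redundant components in parallel.

    The system is unavailable only if every component is unavailable at the
    same time, so the individual unavailabilities (1 - A) multiply.
    """
    return 1 - prod(1 - a for a in availabilities)


two_in_parallel = parallel_availability([0.995, 0.995])
three_in_parallel = parallel_availability([0.995, 0.995, 0.995])
print(f"{two_in_parallel:.4%}")
```

Correlated failures, such as a shared network path or a common software defect, reduce the benefit in practice, so these figures are best treated as an upper bound.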

Hybrid IT and High Availability Summary Hybrid IT presents new challenges for system managers, developers and architects who need to ensure high availability while taking advantage of the flexibility offered by Hybrid IT. As connections and dependencies stretch from familiar on-premises data centers to potentially multiple cloud instances, risks and dependencies need to be evaluated, understood and sometimes mitigated. The good news is that solutions and practices for HA in Hybrid IT have emerged in parallel. Today, everyone has access to a global interconnected grid of data centers on which to build applications that can achieve true high availability. Stakeholders are no longer constrained to a single region. Addressing the challenges of HA in Hybrid IT is inherently multi-disciplinary. IT managers tasked with HA in a hybrid environment need to consider server location, redundancy, load balancing, storage, and network factors that affect availability and response times. Best practices recommend assessing your specific hybrid architecture in the context of seemingly familiar issues, such as server performance. They all need a second look, as actual system behavior in the cloud can be different from what is expected or actually needed. Then, with the kind of specific guidance offered in this brief, you can calibrate your hybrid environment to deliver the SLAs you require.


Putting it All Together on the CenturyLink Cloud Platform The CenturyLink Cloud Platform is an operating environment that is automated, self-service, and programmable, enabling you to provision and customize your CenturyLink Cloud offering anytime, anywhere. In other words, our platform simplifies the deployment of system high availability in terms of scale, configuration, and managed infrastructure services. It is a complete enterprise cloud platform with high availability as a core component of infrastructure architecture.


The Platform: Fault Tolerant & Highly Available The CenturyLink Cloud Platform is built from the ground up to enable services deployed in any of our data centers to be fault tolerant and highly available. Every aspect of the infrastructure is built with local failure in mind. This overview highlights key pieces of the infrastructure and design around fault tolerance.

Hardened Firewall Clusters Within CenturyLink Cloud Platform there are multiple firewall clusters. Different clusters perform different security tasks. These clusters allow for highly available services that can survive hardware and software failures within the cluster. Firewalls are based on hardware or software clustered solutions.

Uninterruptible Switch Traffic Flow Each traffic type, regardless of destination or origin, always traverses a clustered switch solution. Switch, port, optic, and cable failures do not interrupt the flow of traffic.

Performance Optimization CenturyLink Cloud utilizes industry best hypervisor technology that allows for automatic restart of running cloud servers when host servers experience hardware/software failures. Resource utilization is managed primarily by the hypervisor, ensuring that cloud servers perform at the highest performance levels possible.

Clustered SAN Storage CenturyLink Cloud servers utilize clustered Storage Area Network technologies for cloud server data storage needs. A clustered SAN provides a highly available data platform that can survive hardware and software failures within the cluster. Clustered SANs allow for online storage system updates and expansions without impact to data availability.

Isolate Compute/Storage Hosts Hyperscale servers allow for isolated compute and storage Hypervisor hosts, the perfect solution for NoSQL, Cassandra, and other Big Data technologies.


Products and Services on CenturyLink Cloud for High Availability

Load Balancing CenturyLink Cloud offers multiple load balancing options to our customers. If you’re looking for more control over the load balancing configuration or have higher bandwidth needs, you can deploy a dedicated load balancer (virtual appliance) into your cloud solution.

SafeHaven DRaaS Often business critical applications require clearly defined Recovery Point Objective (RPO) and Recovery Time Objective (RTO) times for Disaster Recovery. CenturyLink SafeHaven DR-as-a-Service enables protection of critical applications from unexpected unavailability.

Relational DB Service Supporting rapid software delivery needs, Relational DB provides instant access to a high-performance, enterprise-hardened MySQL database instance deployed on our Hyperscale cloud platform with 100% flash storage.

Managed MySQL and Managed Microsoft SQL With Managed MySQL or Managed Microsoft SQL Server from CenturyLink Cloud, you can rely on our certified engineers to take your enterprise and mission-critical database workloads to the next level.

Managed DNS Host and manage your DNS zones within the Control Portal. Built-in DNS management provides a reliable and cost-effective way to route end users to infrastructure running on CenturyLink Cloud. Create custom DNS zones, and even set up Geo-Load Balancing.


Object Storage Designed for the enterprise, CenturyLink Cloud’s Object Storage is highly available, with automatic replication. Our cloud servers store and manage your files in a highly-scalable, fault-tolerant distributed datastore.

Block Storage CenturyLink Cloud offers SAN-based Block Storage, featuring built-in disaster recovery and delivering a minimum of 2,500 IOPS and up to 20,000 IOPS, with less than 5ms latency.

Partner Technologies Enabling HA on CenturyLink Cloud

Microsoft SQL Server AlwaysOn Microsoft SQL Server AlwaysOn is designed for high availability and disaster recovery on an enterprise-level. All configurations can be deployed using a CenturyLink Cloud Blueprint which also offers customizable configurations. Our Knowledge Base offers detailed instructions for configuring AlwaysOn for the CenturyLink Cloud.

Double-Take Vision Solutions’ Double-Take provides CenturyLink customers high availability through 1:1 replication into and amongst cloud platforms, disaster recovery through many:1 replication, and migration services to the public cloud. Access the service via a simple Blueprint that’s hypervisor and OS agnostic. To get started with Double-Take, contact Vision Solutions directly by email; evaluation licenses are available.

SoftNAS Cloud SoftNAS Cloud is an enterprise-grade, full-featured cloud NAS filer and cloud storage gateway, which allows you to safely migrate business-critical applications to the CenturyLink Cloud without a physical storage appliance. High availability is included at no extra cost for CenturyLink Cloud users who deploy the two requisite instances.


CloudMaestro ADC CloudMaestro by Lagrange Systems is an Application Delivery Controller (ADC) and management solution built for maximizing performance and high availability. It is available to CenturyLink Cloud users via an easy-to-deploy Blueprint.

Vormetric DSM Replication of data in a data center across multiple machines, and then replicating the data to a sister cluster in another facility is a best practice. The result is a high availability cluster. CenturyLink Cloud customers can implement Vormetric Data Security Manager high availability in various approaches to meet business requirements.


Conclusion Designing highly-available systems is complex. Unfortunately, no cloud provider can offer a checkbox labeled “Make this application highly-available!” in their cloud management portal. Crafting a highly-available system involves a methodical approach that navigates through every single layer of the system and identifies single points of failure that should be made redundant. For components that cannot be made redundant, it’s important to make sure that the application can continue to run even if that component becomes unavailable. Our cloud platform simplifies building out highly-available solutions, offering scalability, ease and speed of configuration, and managed infrastructure as a service (IaaS). The CenturyLink Cloud is deployed on enterprise-class infrastructure that is itself highly-available, and is managed via an integrated, intuitive Control Portal for deep cloud orchestration and automation. Cloud services, cloud apps, database services, managed services, backup, redundancy, and network are all combined together on the CenturyLink Cloud Platform. We provide the foundation that offers customers the power and flexibility to build the cloud solution to meet their unique needs. Moreover, we consult and collaborate with customers to assemble our cloud tools in the most effective way to meet their business objectives. The CenturyLink Cloud professional services team consists of skilled, experienced architects who have designed and built cloud-scale solutions for a wide range of customers. They can collaborate with your team to make sure that you’ve taken advantage of every relevant feature that CenturyLink Cloud has to offer, while helping you make sure that your system landscape is constructed in a way that will ensure continual availability. Get started with a $2,500 free trial of CenturyLink Cloud products and services and see how easy and intuitive it is to build out a highly-available solution that meets the unique needs of your business. 
Contact our services team for a risk-free, no-cost consultation on how we can help you take your business to the next level.



© 2016 CenturyLink, Inc. All Rights Reserved. The CenturyLink mark, logo and certain CenturyLink product names are the property of CenturyLink, Inc. All other marks are the property of their respective owners.
