Application architecture operational aspects


Description

Application architecture operational aspects, also called non-functional requirements or "ilities".

Transcript of Application architecture operational aspects

Page 1: Application architecture operational aspects

1

Application Architecture Operational Aspects

Page 2: Application architecture operational aspects

Disclaimer

This presentation is a journey into the operational aspects of application architecture.

This presentation is quite old, but the concepts remain.

Most of the content was excerpted from external resources for which I have lost the references.

Course given at La Rochelle University, France, in 2009

http://eventtoons.com/home

2

Page 3: Application architecture operational aspects

3

Content

1. Introduction to Operational Aspects

2. Reliability

3. Availability and SLA

4. High Availability

5. Eliminating SPOF

6. Transaction

7. CAP Theorem

8. Scalability

9. Performance

10. Clustering

Page 4: Application architecture operational aspects

4

Introduction

What is the Operational Aspect of an architecture?

Concerns centered around the runtime environment

Achieving service-level requirements

Deployment units, their connections, locations, and nodes

Also called Quality of Service, "ilities", etc.

Page 5: Application architecture operational aspects

5

The operational aspects are part of the development lifecycle

Page 6: Application architecture operational aspects

6

Operational Aspects Concerns

Availability
• Scheduled service hours
• Outage costs
• Speed of service recovery
• Disaster recovery

Process and Data integrity

Standards

Cost

Security
• Access to system / data
• Threats
• Controls

Systems Management
• Event and Log Management
• Configuration Management
• Security Management
• Performance Management
• Scheduling
• Backup and Recovery

Timescales

Skills (User and IT)

Data Currency

Performance
• Response Time
• Throughput
• Capacity

Scalability

There are more disciplines / areas of concern than listed here!

Page 7: Application architecture operational aspects

7

Modeling Operational Aspects

What does the Operational Model contain?

Domain analysis (actors and use cases)

Plans for achieving Functional and non-functional requirements

The systems management strategy and constraints

[Diagram: Requirements, IT Architecture Design, Architecture Overview Diagram, Current IT Environment, Interaction Diagram, Detailed Design]

Page 8: Application architecture operational aspects

8

Operational Model Terminology

RAS

Reliability

Availability

Serviceability

RAS terms center entirely around uptime

Originally used by mainframe vendors

RASP

Reliability

Availability

Scalability

Performance

RASP terms describe both uptime and scalable performance

Page 9: Application architecture operational aspects

9

Operational Model Overview

The operational aspect of architecture

Documents the placement of the solution's components

Outlines the systems management aspects of the solution

Focuses on runtime systems design

Developed in concert with the Component Model

Documented via several views, developed in parallel and iteratively through the phases of an engagement

Page 10: Application architecture operational aspects

10

Content

1. Introduction to Operational Aspects

2. Reliability

3. Availability and SLA

4. High Availability

5. Eliminating SPOF

6. Transaction

7. CAP Theorem

8. Scalability

9. Performance

10. Clustering

Page 11: Application architecture operational aspects

11

Reliability

Defined as the probability of a failure within a given time period (or, how frequently a failure occurs)

Failure rate is often denoted by lambda (λ)

Typically measured as

MTBF: Mean Time Between Failure

FIT Rate: Failures In Time (failures per 1,000,000,000 hours)
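For illustration, the measures are related by lambda = 1 / MTBF and FIT = lambda × 10^9 hours. A minimal sketch of the conversion (the MTBF figure is an assumption chosen for illustration):

```java
public class ReliabilityMetrics {
    public static void main(String[] args) {
        double mtbfHours = 1_000_000;            // assumed MTBF of one million hours
        double lambda = 1.0 / mtbfHours;         // failure rate, failures per hour
        double fit = lambda * 1_000_000_000.0;   // failures per billion hours
        System.out.printf("lambda = %.2e failures/hour, FIT rate = %.0f%n", lambda, fit);
    }
}
```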

Page 12: Application architecture operational aspects

12

Content

1. Introduction to Operational Aspects

2. Reliability

3. Availability and SLA

4. High Availability

5. Eliminating SPOF

6. Transaction

7. CAP Theorem

8. Scalability

9. Performance

10. Clustering

Page 13: Application architecture operational aspects

Availability

13

Availability means the system is open for business

For a retail store, being open for business means customers are browsing and buying

Most retail stores have planned downtime for holidays or inventory, or simply close during off-peak hours like late night or early morning

Page 14: Application architecture operational aspects

14

Dimensions of Availability

Functionality: Does the system do what it is supposed to do?

Performance: Does the system function within the acceptable performance criteria?

Data Accuracy: Is the data provided by the system accurate and complete?

Page 15: Application architecture operational aspects

15

Availability

Availability can be expressed numerically as the percentage of the time that a service is available for use.

Percentage of availability = (total elapsed time – sum of downtime)/total elapsed time

Influencing Factors

MTBF: Mean Time Between Failure

MTTR: Mean Time To Recovery
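As a sketch of both views of availability (the figures are assumptions for illustration; the MTBF/MTTR form is the usual steady-state estimate implied by the influencing factors above):

```java
public class AvailabilityCalc {
    // Availability as the fraction of elapsed time the service was usable
    static double fromDowntime(double totalHours, double downtimeHours) {
        return (totalHours - downtimeHours) / totalHours;
    }

    // Availability estimated from mean time between failures and mean time to recovery
    static double fromMtbfMttr(double mtbfHours, double mttrHours) {
        return mtbfHours / (mtbfHours + mttrHours);
    }

    public static void main(String[] args) {
        System.out.printf("%.4f%%%n", 100 * fromDowntime(8760, 8.76)); // 8.76 h down in a year -> ~99.90%
        System.out.printf("%.4f%%%n", 100 * fromMtbfMttr(1000, 1));    // fail every 1000 h, 1 h to recover -> ~99.90%
    }
}
```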

Page 16: Application architecture operational aspects

16

Availability

Defined as the percentage of time that an application is processing requests

Measured in terms of uptime

typically nines (99.999% is five nines)

Availability (%)    Downtime per year
99                  3.65 days
99.9                8.75 hours
99.99               52 minutes
99.999              5 minutes
99.9999             30 seconds

Page 17: Application architecture operational aspects

17

Service Level Agreements

Define what you mean by available

The system is available when

The home page displays within 2 seconds when you navigate to the URL

You can add items to the shopping cart in 1 second or less

You can purchase items in your shopping cart using a credit card in 15 seconds or less

Your definition should be testable with automated tools or third party vendors
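As an illustration of such an automated check, a minimal probe for the first clause of the SLA above (the URL and the 2-second threshold are assumptions; a real test would also time the full page body, not just the response headers):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class SlaProbe {
    public static void main(String[] args) throws Exception {
        URL home = new URL("https://www.example.com/");    // hypothetical home page URL
        long start = System.nanoTime();
        HttpURLConnection conn = (HttpURLConnection) home.openConnection();
        int status = conn.getResponseCode();               // forces the request to execute
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        conn.disconnect();

        boolean withinSla = status == 200 && elapsedMs <= 2000;  // SLA: home page within 2 seconds
        System.out.println("status=" + status + " elapsed=" + elapsedMs + "ms withinSla=" + withinSla);
    }
}
```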

Page 18: Application architecture operational aspects

18

Availability Requires People

People are the biggest cause of downtime

Organization - ensure skills are available or on call when required

Procedures - Operators need correctly documented, tested and maintained procedures

Page 19: Application architecture operational aspects

19

Reliability vs. Availability

Reliability generally has the greater impact on end-user perception

Frequent failures are irritating

Good reliability, bad availability

Infrequent, but potentially major, downtimes

e.g. electrical power grid

Availability generally has the greater impact on operations

Extended downtime can cripple a business

Bad reliability, good availability

Frequent, but minor, failures

e.g. mobile communications

If the application is fault tolerant, and failover is instantaneous, then

Reliability and Availability can generally be treated as a single objective

The application is reliable as long as there is a server to failover to

The application is available as long as there is at least one server up

Page 20: Application architecture operational aspects

20

Content

1. Introduction to Operational Aspects

2. Reliability

3. Availability and SLA

4. High Availability

5. Eliminating SPOF

6. Transaction

7. CAP Theorem

8. Scalability

9. Performance

10. Clustering

Page 21: Application architecture operational aspects

High Availability

HA refers to application or service availability targets of 99.9 percent or higher.

In contrast, the Service Availability Forum defines HA applications or services as applications or services with an availability objective of ‘‘five nines,’’ i.e., 99.999 percent.

For an application that can be accessed at any time, the former definition (99.9 percent) implies unscheduled downtime of 8.76 hours (525.6 minutes) and availability of 8751.24 hours per year (given 8760 hours in a non-leap year).

This is equivalent to a few unscheduled outages in a year and a restoration of service in minutes.

In order to provide this level of availability, an application or service requires a set of HA technologies, IT processes, and services supporting HA (the focus of this paper), as well as an IT organization that supports HA.

21

Page 22: Application architecture operational aspects

High Availability

The first critical aspect of HA management is the understanding and documenting of customer requirements for availability.

Understanding the business requirements clearly can help minimize overinvestment in areas that do not add needed value

Reaching this understanding can be a joint effort of the availability management, service level management and service financial management, requirements engineering, and architecture teams.

22

Page 23: Application architecture operational aspects

High Availability Four Key Goals (KGI)

Based on experience and industry trends, there are four key goals associated with HA:

1. Maximizing or extending application or service uptime, i.e., mean time between service failures (MTBSF);

2. Eliminating or minimizing the impact of service related incidents by detecting and resolving component incidents before they impact application or service availability

3. Minimizing unplanned or unscheduled downtime of applications or services, i.e., mean time to recover service (MTTRS);

4. Eliminating or minimizing planned or scheduled downtime (i.e., downtime for changes, releases, and maintenance work).

These goals (KGIs, a term introduced in COBIT 3):

Are consistent with ITIL V3 service design documentation.

Are related to continuous operations (CO) and continuous availability (HA + CO).

23

Page 24: Application architecture operational aspects

High Availability Four KGI and IT Processes

24

Page 25: Application architecture operational aspects

High Availability Stages of the service lifecycle

25

Page 26: Application architecture operational aspects

High Availability Service Strategy

Service strategy helps in defining availability requirements and rationalizing expenditures for improving service availability by detailing the relationship between service, IT, functional, and business strategy.

As an element in service strategy, service portfolios can be grouped into service tiers, with each tier having its own set of service-level objectives (SLOs) and service-level requirements (SLRs).

The service targets for each SLO may vary by service tier.

Service tiers can in turn include availability tiers, with key differences in their availability objectives.

Key SLOs associated with service availability can help with gathering and documenting service availability requirements.

26

Page 27: Application architecture operational aspects

High Availability Service Design

Service design involves determining and documenting service requirements and designing services to meet or exceed a set of functional and nonfunctional requirements.

Availability management is an IT process that is part of the service design stage of the service lifecycle.

Service design is directly responsible for using availability architecture patterns, technologies, and standards in both the application design and the technology infrastructure design processes.

The ITIL version 3 service design concept is a critical change from ITIL version 2.

In V2, availability management was part of service delivery.

By moving it to service design, ITIL version 3 makes it clear that waiting until service delivery to plan service levels, availability levels, capacity levels, continuity plans, security plans, and financial plans will not result in an efficient design.

27

Page 28: Application architecture operational aspects

High Availability Service Transition

Service transition moves the service package into operational mode and involves the development of the base configuration information and knowledge management related to the service.

This includes documentation of the service architecture, service related operational procedures, and other service specific documentation.

It also involves the testing, evaluation, and validation of the service in a pre-production environment, including the HA technologies and capabilities.

Change, release, and transition planning must also be performed, including operational readiness and final production deployment.

Processes employed in service transition include asset configuration and knowledge management, change management, transition planning and support, release and deployment management, and service testing, validation, and evaluation processes.

28

Page 29: Application architecture operational aspects

High Availability Service Operation

Service operations in production involve operational activities such as:

Event and Incident management, which is critical for MTBSF, MTBCF, and MTTRS.

Post-deployment configuration audits, including HA configuration audits.

Post-deployment operational audits, such as change audits and maintenance audits.

Advanced change management capabilities and change models for HA services and applications.

Advanced release management capabilities and release models for HA services and applications.

Post-deployment operational work also involves day-to-day maintenance activities, both reactive and proactive, operational, infrastructural, and minor code-related changes, and major releases.

29

Page 30: Application architecture operational aspects

High Availability Service Improvement

Service improvement includes the development and implementation of service, application, and infrastructure availability improvement plans.

Availability improvement plans can be based

On thorough availability architecture analysis (i.e., identifying gaps between current availability capabilities and target availability architecture) or

On the ad hoc development and implementation of service, application, infrastructure, and operational architectural improvements as they relate to and impact service availability

30

Page 31: Application architecture operational aspects

High Availability Process and Tools

31

Page 32: Application architecture operational aspects

High Availability Architecture Patterns

32

Page 33: Application architecture operational aspects

33

High Availability Measuring HA

Question to be answered

What percentage of the time is the application usable?

For HA, it's measured as the number of nines

e.g. 99.999% is five nines

Calculated using simple probability

Page 34: Application architecture operational aspects

34

High Availability Measuring HA Example

Hardware

Cluster of 8 2-CPU servers, 99% Uptime each

Question

What is the availability of each configuration if a total of 8 CPUs are required to service user requests within SLA requirements?

Solution

If a total of 8 CPUs are required, this implies four servers are sufficient to service application requests.

For the application to fail, five of the eight servers must fail simultaneously: 0.01^5 = 1e-10

Predicted application availability is 99.99999999% (annual downtime of ~3 ms)
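A note on the arithmetic: 0.01^5 is the probability that five specific servers are down at once; counting the C(8,5) = 56 ways of choosing which five fail gives an unavailability closer to 5.5e-9 (still roughly eight nines, on the order of 170 ms of downtime per year). A sketch of the full binomial calculation, assuming independent server failures:

```java
public class ClusterAvailability {
    // Probability that at least k of n independent servers are down,
    // each with per-server unavailability q (here q = 1 - 0.99 = 0.01).
    static double probAtLeastDown(int n, int k, double q) {
        double p = 0;
        for (int i = k; i <= n; i++) {
            p += binomial(n, i) * Math.pow(q, i) * Math.pow(1 - q, n - i);
        }
        return p;
    }

    static double binomial(int n, int k) {
        double c = 1;
        for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
        return c;
    }

    public static void main(String[] args) {
        double unavailability = probAtLeastDown(8, 5, 0.01);                 // ~5.5e-9
        System.out.printf("availability = %.8f%%%n", 100 * (1 - unavailability));
        System.out.printf("annual downtime ~ %.0f ms%n", unavailability * 8760 * 3600 * 1000);
    }
}
```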

Page 35: Application architecture operational aspects

35

High Availability Measuring HA – In reality …

Application availability is more complicated than it appears at first glance.

Human error is one of the biggest contributors to application downtime

Servers are rarely truly independent

A server failure may increase load on the remaining servers, triggering a cascade effect

Errors in shared components (network switches, clustering, power systems) can impact multiple servers

Page 36: Application architecture operational aspects

High Availability Measuring HA – Sequential Dependency

Components connected in a chain, each relying on the previous component for availability

The total availability is always lower than the availability of the weakest link

36

Server 1 -> Server 2 -> Server 3

Availability (A) = A(S1) * A(S2) * A(S3)

Page 37: Application architecture operational aspects

High Availability Measuring HA – Sequ. Dep. Example

37

Availability = Database * Network * Web Server * Desktop

Availability = 98% * 98% * 97.5% * 96% = 89.89%

Total Infrastructure Availability = 89.89%

[Diagram: Database Server (98%) -> Network (98%) -> Web Server (97.5%) -> Desktop (96%)]

Page 38: Application architecture operational aspects

High Availability Measuring HA – Redundant Dep. Ex.

38

Database Availability= 1 – ((1 – 0.98) * (1 – 0.98)) = 0.9996

Database Availability = 99.96%

Availability = Database * Network * Web Server * Desktop

Availability = 0.9996 * 0.98 * 0.975 * 0.96 = 0.9169

Total Infrastructure Availability = 91.69%

With the redundant pair, the database's availability (99.96%) is higher than that of a single database server, raising total availability from 89.89% to 91.69%

[Diagram: redundant Database Servers (98% each) -> Network (98%) -> Web Server (97.5%) -> Desktop (96%)]
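A minimal sketch reproducing both of the calculations above, for components in series and for a redundant pair in parallel:

```java
public class CompositeAvailability {
    // Components in series: the request needs every component, so availabilities multiply.
    static double serial(double... availabilities) {
        double a = 1;
        for (double x : availabilities) a *= x;
        return a;
    }

    // Redundant components in parallel: the service is unavailable only if every replica is down.
    static double parallel(double... availabilities) {
        double down = 1;
        for (double x : availabilities) down *= (1 - x);
        return 1 - down;
    }

    public static void main(String[] args) {
        double singleDbChain = serial(0.98, 0.98, 0.975, 0.96);            // ~0.8989
        double redundantDb = parallel(0.98, 0.98);                         // 0.9996
        double redundantDbChain = serial(redundantDb, 0.98, 0.975, 0.96);  // ~0.9169
        System.out.printf("%.4f %.4f %.4f%n", singleDbChain, redundantDb, redundantDbChain);
    }
}
```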

Page 39: Application architecture operational aspects

High Availability Measuring HA – Reality

If an application has three tiers with 99% uptime each, what’s the up-time for the application?

Measure probability as being up

Probability = .99 * .99 * .99

97% uptime = not even two nines!

The application availability is not even as good as the weakest link

Even if a new tier is more reliable than the other tiers, adding a tier will always reduce application availability

39

Page 40: Application architecture operational aspects

High Availability Synthesis

Use redundancy with failover to increase availability by eliminating Single Points Of Failure (SPOFs)

Decouple tiers and, as much as possible, make each tier self-sufficient (or at least able to fail gracefully)

Keep common sense

Not all applications need HA; it's expensive!

There is still room for human error and unavoidable downtime (e.g. certain upgrades)

40

Page 41: Application architecture operational aspects

41

Content

1. Introduction to Operational Aspects

2. Reliability

3. Availability and SLA

4. High Availability

5. Eliminating SPOF

6. Transaction

7. CAP Theorem

8. Scalability

9. Performance

10. Clustering

Page 42: Application architecture operational aspects

42

Eliminating SPOFs Introduction

SPOF = Single Point Of Failure

Whenever a single server can die and take down the application (or part of an application), that server is a SPOF

Eliminating SPOFs increases application availability

When a working system can take over for a failed system, that is called failover

A system that can fail over is not a SPOF

Page 43: Application architecture operational aspects

43

Component Redundancy

Eliminates single point of failure

Active / Active configuration

Example: Web Farm

Active / Passive

Example: Cluster of SQL Servers

Use High Availability Patterns

[Diagram: load balancers distributing requests across redundant servers Y1, Y2, Y3]

Page 44: Application architecture operational aspects

44

Eliminating SPOFs N-Tiers Architecture

[Figure: redundancy at each tier of an n-tier architecture]

Local or global load balancers are used today; they can be either hardware or software based.

Generally software based; load balancing can keep track of the session.

Mainly software solutions, with local and distributed cache management (at the application and data tiers).

Page 45: Application architecture operational aspects

45

Eliminating SPOFs HA Load Balancer

Local HA Load Balancers

Typically a master/slave configuration

Both Load Balancers receive all the traffic

The Load Balancers communicate directly over a dedicated cable

When the slave detects failure of the master, it assumes all responsibility for the current connections

May even be able to failover stateful connections, including HTTPS

Global Load Balancers

Used to direct traffic to a particular data center

Use an authoritative name server, e.g. to resolve www to a particular data center

For disaster recovery, the www resolves to the primary data center unless it is down, in which case it resolves to the backup

For regional load-balancing, the www is resolved to the geographically closest data center

Modern Global Load Balancers do both Global and Local balancing

Page 46: Application architecture operational aspects

46

Eliminating SPOFs HA Load Balancer

Known appliances

BigIP

Alteon

Other software

Continuent (ex EMIC) a/cluster (European Connect)

Apache has multiple load balancing and failover plug-ins

BEA natively provides HTTP load balancing

IIS does as well

Page 47: Application architecture operational aspects

47

Eliminating SPOFs HA Database

HA Databases do not normally require additional programming in the application tier

Often implemented in the JDBC driver level or below

Failover may cause current pending transactions to roll back, but with a real HA database, no previously committed transactions are lost

The most reliable HA Database configuration is master/slave

The slave server is always ready for the master to die

One-way replication may even work across datacenters

Page 48: Application architecture operational aspects

48

Eliminating SPOFs HA Database

Hardware

SAN/NAS BAY with RAIDx

Software

Continuent m/cluster for mySQL (heterogeneous; SQL Server, Sybase and Oracle cluster in Q3 2006)

Shared Nothing Architecture - load balance read / broadcast write

Oracle RAC

RAC is a kind of distributed cache on top of Oracle.

Requires the same OS and the same Oracle version

No load balancing, no failover within transactions

Only supports retry on new connections or retry of reads.

Page 49: Application architecture operational aspects

49

Eliminating SPOFs HA Application Tiers

Application tiers can be stateless

Stateless tiers (e.g. web servers) are HA using simple redundancy

Only problem is that statelessness in one tier usually just passes the buck to the next tier, which is almost always more expensive

Application tiers are almost always stateful

Only two things can be lost: state and in-flight requests

To achieve HA, the application tier must either manage its state resiliently (e.g. in a clustered coherent cache) or back it up to a central store

Idempotent actions can be replayed by the web tier when a server fails

Page 50: Application architecture operational aspects

Eliminating SPOFs HA Application Tiers

An application cache can be implemented in a Java application

using the JEE standard JCache

using open source tools (OSCache, Ehcache)

using commercial tools (like Tangosol Coherence)

An application cache can also be implemented in a .NET application

The ASP.NET application cache is a smart in-memory repository for data

The Caching Application Block
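As an illustration of the idea only (not the API of any of the products named above), a minimal in-memory cache with expiry, of the kind an application tier might keep in front of a central store:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SimpleCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;
        Entry(V value, long expiresAtMillis) { this.value = value; this.expiresAtMillis = expiresAtMillis; }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public SimpleCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public void put(K key, V value) {
        entries.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    // Returns null when the entry is missing or expired; the caller then reloads from the central store.
    public V get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() > e.expiresAtMillis) {
            entries.remove(key);
            return null;
        }
        return e.value;
    }
}
```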

50

Page 51: Application architecture operational aspects

51

Eliminating SPOFs Web Tier to App Tier

Load balancers slow way down if the load balancing is sticky (keeping track of session/server pairs)

The best approach is for the load balancer to round-robin or randomize its load-balancing across all available web servers

This still works, and for a good reason:

Web servers (e.g. Apache, IIS, JES) can handle lots of concurrent connections, serve static content, and route requests to app servers

The web server plug-in for routing to the app server can do the sticky load balancing, guaranteeing that HTTP sessions stick!

All application servers offer clustering.
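A minimal sketch of the non-sticky round-robin step described above (the server names are placeholders):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class RoundRobinBalancer {
    private final List<String> servers;
    private final AtomicLong counter = new AtomicLong();

    public RoundRobinBalancer(List<String> servers) { this.servers = servers; }

    // Pick the next web server without tracking any session/server pairs.
    public String next() {
        int index = (int) (counter.getAndIncrement() % servers.size());
        return servers.get(index);
    }

    public static void main(String[] args) {
        RoundRobinBalancer lb = new RoundRobinBalancer(List.of("web1", "web2", "web3")); // hypothetical hosts
        for (int i = 0; i < 6; i++) System.out.println(lb.next()); // web1 web2 web3 web1 web2 web3
    }
}
```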

Page 52: Application architecture operational aspects

52

Content

1. Introduction to Operational Aspects

2. Reliability

3. Availability and SLA

4. High Availability

5. Eliminating SPOF

6. Transactions

7. CAP Theorem

8. Scalability

9. Performance

10. Clustering

Page 53: Application architecture operational aspects

Transactions Introduction

A transaction is a sequence of operations that change the state of an object or collection of objects in a well defined way.

Transactions are useful because they satisfy constraints about what the state of an object must be before, after or during a transaction.

For example, a particular type of transaction may satisfy a constraint that an attribute of an object must be greater after the transaction than it was before the transaction.

Sometimes, the constraints are unrelated to the objects that the transactions operate on.

For example, a transaction may be required to take place in less than a certain amount of time.

53

Page 54: Application architecture operational aspects

Transactions Must comply to ACID Properties

Atomicity: All-or-nothing process.

Atomicity guarantees that all operations within a transaction happen within a single unit of work

Consistency: System in consistent state.

Consistency guarantees that all transactional resources within a transaction are left in a consistent state either after the transaction succeeds and is committed or after it fails and all resources are rolled back to their previous state

Isolation: Not affected by other.

Isolation ensures that even though multiple transactions may be running in parallel they appear to be running in a serial manner

Durability: Once committed, effects persist.

Durability ensures that once a transaction has been marked as committed all information relating to the transaction has been committed to durable storage
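For illustration, a minimal JDBC sketch of an all-or-nothing unit of work; the in-memory H2 URL, the account table, and the amounts are assumptions made for the example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferFunds {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:bank")) {   // hypothetical datasource
            con.setAutoCommit(false);       // start a local transaction
            try (PreparedStatement debit = con.prepareStatement(
                     "UPDATE account SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                     "UPDATE account SET balance = balance + ? WHERE id = ?")) {
                debit.setInt(1, 100);  debit.setInt(2, 1);  debit.executeUpdate();
                credit.setInt(1, 100); credit.setInt(2, 2); credit.executeUpdate();
                con.commit();               // both updates become durable together
            } catch (SQLException e) {
                con.rollback();             // neither update is applied
                throw e;
            }
        }
    }
}
```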

54

Page 55: Application architecture operational aspects

Transactions XA Protocol

Open Group's X/Open Distributed Transaction Processing (DTP) model

Defines how an application program uses a transaction manager to coordinate a distributed transaction across multiple resource managers

Any resource manager that adheres to the XA specification can participate in a transaction coordinated by an XA-compliant transaction manager, thereby enabling different vendors' transactional products to work together.

All XA-compliant transactions are distributed transactions

XA supports both single-phase and two-phase commit

The transaction manager is responsible for making the final decision either to commit or rollback any distributed transaction.

For the transaction to commit successfully all of the individual resources must commit successfully; if any of them are unsuccessful, the transaction must roll back in all of the resources.

55

Page 56: Application architecture operational aspects

Transactions Tools

JTA (Java Transaction API) is required for XA-style transactions. An XA transaction involves coordination among the various resource managers, which is the responsibility of the transaction manager.

JTA specifies standard Java interfaces between the transaction manager and the application server and the resource managers.
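As a sketch of the application-facing side of JTA in a container-managed environment (the JNDI names of the XA datasources and the SQL are hypothetical; the container's transaction manager performs the actual two-phase coordination):

```java
import java.sql.Connection;
import java.sql.Statement;
import javax.naming.InitialContext;
import javax.sql.DataSource;
import javax.transaction.UserTransaction;

public class XaTransferExample {
    public void transfer() throws Exception {
        InitialContext ctx = new InitialContext();
        UserTransaction utx = (UserTransaction) ctx.lookup("java:comp/UserTransaction");
        DataSource orders = (DataSource) ctx.lookup("jdbc/OrdersXA");      // hypothetical XA datasource
        DataSource billing = (DataSource) ctx.lookup("jdbc/BillingXA");    // hypothetical XA datasource

        utx.begin();                                       // transaction manager opens the global transaction
        try (Connection c1 = orders.getConnection();
             Connection c2 = billing.getConnection();
             Statement s1 = c1.createStatement();
             Statement s2 = c2.createStatement()) {
            s1.executeUpdate("INSERT INTO orders(id) VALUES (42)");
            s2.executeUpdate("INSERT INTO invoices(order_id) VALUES (42)");
            utx.commit();                                  // two-phase commit across both resource managers
        } catch (Exception e) {
            utx.rollback();                                // both resource managers roll back
            throw e;
        }
    }
}
```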

56

Page 57: Application architecture operational aspects

Transactions XA and SOA

Today some technologists are even positioning the enterprise service bus (ESB) as a standard mechanism for integrating systems with heterogeneous data interfaces.

While ESB and Web services can clearly be used to move data between disparate data sources, I would not recommend using a set of Web services to implement a distributed transaction if the transaction requirements could be achieved by using XA, even if the enabling Web services technology supported WS-Transaction (WS-TX).

The one advantage Web services have over XA is that XA is not a remote protocol.

57

Page 58: Application architecture operational aspects

Transactions WS-TX for SOA

WS-Transaction is the name of the OASIS group that is currently working on the transaction management specification. WS-TX is the name of the committee, and they are working on three specs:

WS-Coordination (WS-C) - a basic coordination mechanism on which protocols are layered;

WS-AtomicTransaction (WS-AT) - a classic two-phase commit protocol similar to XA;

WS-BusinessActivity (WS-BA) - a compensation based protocol designed for longer running interactions, such as BPEL scripts.

In practice, WS-AT should be used in conjunction with XA to implement a true distributed transaction.

WS-TX essentially extends transaction coordinators, such as OTS/JTS and Microsoft DTC to handle transactional Web services interoperability requirements.

58

Page 59: Application architecture operational aspects

Transactions Birth of XTP

Extreme Transaction-Processing Platform

Traditional online transaction processing (OLTP) architectures and products are wearing thin when it comes to supporting the growing transactional workloads generated by modern service oriented and event-driven architectures (SOAs and EDAs)

Users are looking for alternatives based on low-cost commodity hardware and modern software.

59

Page 60: Application architecture operational aspects

Transactions XTP Platform

An XTPP will be characterized by the following features:

A cohesive programming model supporting the development paradigms offered by the containers

Event-processing and service containers to enable development and execution of rich applications supporting even the most-complex requirements

Flow management container to enable application development and execution through composition of loosely coupled components (services or event handlers)

A batch container to support batch and high-performance computing (HPC)-style applications

A common distributed transaction manager, leveraged by the application containers for supporting transaction integrity in highly distributed architectures

60

Page 61: Application architecture operational aspects

Transactions XTP Platform

A high-performance computing fabric, a communication and data-sharing infrastructure combining enterprise service bus and distributed caching mechanisms to support fast event propagation, service request dispatching and transactional data sharing between XTP application components and external applications.

Tera-architecture support to manage transparent and dynamic application and system components deployment and execution over large clusters of Linux, Unix or Windows servers but also pervasive computing processors.

Development tool, security, administration and management capabilities

61

Page 62: Application architecture operational aspects

Transactions XTP Platform

62

Page 63: Application architecture operational aspects

Transactions XTP Platform Vendors

IBM WebSphere XD

an add-on product for several WebSphere (and non-IBM alike) products, providing distributed caching, stream processing, a batch framework, virtualization and other XTP features; it has announced support for OSGi in the WebSphere family and is one of the strongest SCA supporters.

Oracle declared on many public occasions that XTP was an area of strategic investment

in March 2007, it acquired Tangosol

Oracle also announced that it will introduce SCA and OSGi support in the next release of Oracle Fusion Middleware.

Oracle bought BEA and Sun

Microsoft's Windows Workflow Foundation

Layers a flow management, event-driven programming model atop the "classic," client/server-oriented .NET environment.

63

Page 64: Application architecture operational aspects

Transactions XTP Platform Vendors

Tibco ActiveMatrix

Hybrid combination of container technology (POJO, Java EE and .NET), policy management and core enterprise service bus technologies.

Red Hat

acquired Mobicents, an open-source, JSLEE 1.0-compliant event-driven application platform

Mobicents runs on the microkernel foundation of the Java EE-based JBoss Application Server.

E2E Technologies provides E2E Bridge

Hybrid combination of ESB, flow management, service and event containers providing a UML-based programming model

64

Page 65: Application architecture operational aspects

Transactions XTP Platform Vendors

GigaSpaces

Extreme Application Platform (XAP) — a platform middleware product combining Java, Spring, JavaSpaces, OSGi and a Java EE subset (JDBC, JMS and JCA) meant to address analytical and transactional applications.

Several grid-based application platform vendors (Appistry, Majitek and Paremus) have announced support for Spring and are extending their platforms for event-driven programming.

Event-driven application platform vendors (Kabira, jNetX, OpenCloud and WareLite) are beginning to move out of the traditional telecommunications and financial service sectors into other verticals (such as retail and defense).

65

Page 66: Application architecture operational aspects

66

Content

1. Introduction to Operational Aspects

2. Reliability

3. Availability and SLA

4. High Availability

5. Eliminating SPOF

6. Transaction

7. CAP Theorem

8. Scalability

9. Performance

10. Clustering

Page 67: Application architecture operational aspects

CAP theorem from Amazon

What goals might you want from a shared-data system?

Strong Consistency: all clients see the same view, even in presence of updates

High Availability: all clients can find some replica of the data, even in the presence of failures

Partition-tolerance: the system properties hold even when the system is partitioned

The theorem states that you can have at most two of the three CAP properties at the same time.

67

Page 68: Application architecture operational aspects

CAP theorem from Amazon

The first property, Consistency

Has to do with ACID systems

Big shops like Amazon and Google, as they handle an incredibly huge number of transactions and data, always need some kind of system partitioning.

For Amazon, the second and third CAP properties (Availability and Partition-tolerance) are fixed, so they need to sacrifice Consistency.

It means they prefer to compensate or reconcile inconsistencies instead of sacrificing high availability, because their primary need is to scale well to allow for a smooth user experience.

68

Page 69: Application architecture operational aspects

CAP theorem Lessons learned from Amazon

Most legacy application servers and relational database systems are built with consistency as their primary target, while big shops really need high availability.

That's why firms like Google or Amazon have developed their own applicative infrastructure.

That's why a two-phase commit protocol is never an appropriate choice when there are big scalability needs.

To scale, what you really need are asynchronous, stateless services, together with a good reconciliation and compensation mechanism in case of errors.

Second, your data model has a dramatic impact on performance; that's why Amazon has implemented a simple put/get API instead of running complex database queries, and why Google's performance is due to the MapReduce algorithm: simplicity rules.

69

Page 70: Application architecture operational aspects

70

Content

1. Introduction to Operational Aspects

2. Reliability

3. Availability and SLA

4. High Availability

5. Eliminating SPOF

6. Transaction

7. CAP Theorem

8. Scalability

9. Performance

10. Clustering

Page 71: Application architecture operational aspects

71

Scalability

Defined in terms of the impact on throughput as additional hardware resources are added

Adding CPUs/RAM to a server: scaling up

Adding servers to a cluster: scaling out

Measured with the Scaling Factor

The ratio of new capacity to old capacity as resources are increased

If doubling CPUs results in 1.9x throughput, then the Scaling Factor is 1.9, and the adjusted SF/CPU is 0.95

The ideal SF/CPU is 1.0, a.k.a. linear scalability
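A worked sketch of the scaling-factor arithmetic above (the throughput figures are assumptions):

```java
public class ScalingFactorExample {
    public static void main(String[] args) {
        double oldThroughput = 1000;   // requests/second with 2 CPUs (assumed)
        double newThroughput = 1900;   // requests/second after doubling to 4 CPUs (assumed)
        double oldCpus = 2, newCpus = 4;

        double scalingFactor = newThroughput / oldThroughput;        // 1.9
        double sfPerCpu = scalingFactor / (newCpus / oldCpus);       // 0.95; the ideal is 1.0 (linear)
        System.out.printf("SF = %.2f, adjusted SF/CPU = %.2f%n", scalingFactor, sfPerCpu);
    }
}
```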

Page 72: Application architecture operational aspects

72

Scaling up vs. Scaling out

IS professionals typically add capacity to computer systems by scaling up.

When response time starts to degrade because of additional workload or higher database capacities, the straightforward answer to the immediate performance problem is adding bigger, faster hardware.

Extrapolating from Moore's Law, which states that hardware performance will double every 18 months, you might conclude that scaling up is an adequate solution to handle growth for the foreseeable future.

However, you'll soon realize that Murphy's Law precludes Moore's Law.

Page 73: Application architecture operational aspects

73

Scaling up vs. Scaling out

Although the current 8-way SMP systems equipped with high-speed Storage Area Network (SAN) storage arrays provide tremendous scalability, they also bring to light several other scalability problems.

First, when a system reaches a certain point, further scaling up becomes prohibitively expensive.

Second, even with Moore's Law in full effect, you can't scale beyond a certain point—at least until vendors release the next generation of hardware.

Even beyond the hardware problems, you'll probably encounter software hurdles when you're trying to scale up.

Software systems such as databases have internal mechanisms that handle locking and other multi-user database issues.

These software structures have limited efficiency, and these limits typically become the real governing impediments to continued upward scalability.

Page 74: Application architecture operational aspects

74

Scaling up vs. Scaling out

Thus, you don't see SMP performance graphs continuing to demonstrate linear upward scalability as you add more processor power.

At some point, the curve always begins to flatten. At the upper reaches of that curve, you'll find that you need very expensive hardware upgrades to get very small performance improvements.

That's where scaling out comes into play.

Page 75: Application architecture operational aspects

75

Scaling up vs. Scaling out

Scaling out can provide an effective answer to the problems of the scale-up scenario, by using the shared-nothing architecture.

Essentially, shared-nothing architecture means that each system operates independently.

Each system in the cluster maintains separate CPU, memory, and disk storage that other systems can't directly access.

To address capacity issues by scaling out, you add more hardware, not bigger hardware.

When you scale out, the absolute size and speed of a single system doesn't limit total capacity.

Shared-nothing architecture also skirts the software bottleneck by providing multiple multi-user concurrency mechanisms.

Because the workload is divided among the servers, total software capacity increases.

Page 76: Application architecture operational aspects

76

Scaling up vs. Scaling out

Although scaling out provides great answers to the inherent limitations in scale-up architecture, this method is no stranger to Murphy's Law, either.

At this point in the technology lifecycle, scaling out requires increased management overhead that is potentially as great as the performance gains it offers.

Even so, scaling out might be a viable solution to database implementations that have reached the limits of SMP scalability.

Page 77: Application architecture operational aspects

77

Scalability for Database Tier

Applications that go to the database for each request likely will have scalability problems

The Database tier is difficult and expensive to scale; it is difficult to scale a database server to more than a single host, and it becomes exponentially more expensive to add CPUs

Database servers scale sub-linearly at best with additional CPUs, and there is a CPU limit

Page 78: Application architecture operational aspects

78

Super-Linear Scalability

It is possible to exceed an SF of 1.0

With two disks, reduced head contention can increase the throughput of sequential I/O by even 100x.

Similar effects can occur with CPU caches and context switches.

Large cluster-aggregated data caches can offer super-linear scale by significantly increasing the hit rate, reducing the average data access cost

Can also be explained as a super-linear slowdown as resources are reduced (i.e. the converse)

Page 79: Application architecture operational aspects

79

Content

1. Introduction to Operational Aspects

2. Reliability

3. Availability and SLA

4. High Availability

5. Eliminating SPOF

6. Transaction

7. CAP Theorem

8. Scalability

9. Performance

10. Clustering

Page 80: Application architecture operational aspects

80

Performance

Defined as how fast operations complete

Typically measured as time (wall clock) elapsed between request and response

Elapsed time also known as latency

Web apps are often measured on the server side as time to last byte (TTLB)
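A minimal sketch of this latency measurement, timing one operation on the caller's side (the operation itself is a placeholder):

```java
public class LatencyTimer {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();            // elapsed-time measurement around one request
        doRequest();                               // placeholder for the real request/response round trip
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("latency = " + elapsedMillis + " ms");
    }

    private static void doRequest() throws InterruptedException {
        Thread.sleep(120);                         // stands in for the work being measured
    }
}
```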

Page 81: Application architecture operational aspects

81

Scalability vs. Performance

Users are affected by poor performance

Poor performance is usually a result of poor scalability

Operating costs and capacity limitations are caused by poor scalability

Designing for scalability often has a negative impact on single-user performance

Building in the ability to scale out has overhead

But single-user performance doesn't often matter!

Once the maximum sustainable request rate is exceeded, performance will degrade

End user apps will degrade in a linear fashion as the request queue backs up

Automated applications will degrade exponentially

Page 82: Application architecture operational aspects

82

Scalable Performance

Scalable performance is NOT focused on making an application faster; rather, it is focused on ensuring that application performance does not degrade beyond defined boundaries as the application gains additional users, on how resources must grow to ensure that, and on how one can be certain that additional resources will solve the problem

Scalable Performance refers to overall response times for an application (SLA) that are within defined tolerances for normal use, remain within those tolerances up to the expected peak user load, and for which a clear understanding exists as to the resources that would be required to support additional load without exceeding those tolerances

Page 83: Application architecture operational aspects

83

Engineering For Performance

Build performance and scalability thinking in the development lifecycle

Define your objectives

Measure against your objectives

"When you measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of science." - Lord Kelvin (William Thomson)

Page 84: Application architecture operational aspects

84

Performance Modeling

A structured and repeatable approach to modeling the performance of your software

Similar to “Threat Modeling” in security

Begins during the early phases of your application design

Continues throughout the application lifecycle

Consists of

A document that captures your performance requirements

A process to incrementally define and capture the information that helps the teams working on your solution to focus on using, capturing, and sharing the correct information.

Page 85: Application architecture operational aspects

85

Performance modeling Process

Critical Scenarios

Have specific performance expectations or requirements.

Significant Scenarios

Do not have specific performance objectives

May impact other critical scenarios.

Look for scenarios which

Run in parallel to a performance critical scenario

Frequently executed

Account for a high percentage of system use

Consume significant system resources

1. Identify Key Scenarios

2. Identify Workloads

3. Identify Performance Objectives

4. Identify Processing Steps

5. Allocate Budget

6. Evaluate

7. Validate

Iterate

Page 86: Application architecture operational aspects

86

Performance modeling Process

Workload is usually derived from marketing data

Total users

Concurrently active users

Data volumes

Transaction volumes and transaction mix

Identify how this workload applies to an individual scenario

Support 100 concurrent users browsing

Support 10 concurrent users placing orders.

1. Identify Key Scenarios

2. Identify Workloads

3. Identify Performance Objectives

4. Identify Processing Steps

5. Allocate Budget

6. Evaluate

7. Validate

Iterate

Page 87: Application architecture operational aspects

87

Performance modeling Process

Performance and scalability goals should be defined as non-functional or operational requirements

Requirements should be based on previously identified workload

Consider the following:

Service level agreements

Response times

Projected growth

Lifetime of your application

1. Identify Key Scenarios

2. Identify Workloads

3. Identify Performance Objectives

4. Identify Processing Steps

5. Allocate Budget

6. Evaluate

7. Validate

Iterate

Page 88: Application architecture operational aspects

88

Define Your Objectives

Performance and scalability goals should be defined as non-functional or operational requirements

Requirements should be based on expected use of the system

Compare to previous versions or similar systems

Metric            Definition                      Measured By                          Impacts
Throughput        How many?                       Requests per second                  Number of servers
Response Time     How fast?                       Client latency                       Customer satisfaction
Resource Util.    How much?                       % of resource                        Hardware / Network
Workload          How many concurrent requests?   Concurrent requests for the system   Scalability, Concurrency

Page 89: Application architecture operational aspects

89

Define Your Objectives

Objectives must be SMART

S – Specific

M – Measurable

A – Achievable

R – Results Oriented

T – Time Specific

"application must run fast"

“Page should load quickly"

"3 second response time on home page with 100

concurrent users and < 70% CPU"

"25 journal updates posted per second with 500 concurrent

users and < 70% CPU"

r

a

"If You cannot

measure it, You

cannot improve

it.“

-Lord Kelvin

Page 90: Application architecture operational aspects

90

Build an objective

Scenario            Response Time               Throughput               Workload               Resource Utilization
Browse Home Page    Client latency 3 seconds    50 requests per second   100 concurrent users   < 60% CPU utilization
Search Catalog      Client latency 5 seconds    10 requests per second   100 concurrent users   < 60% CPU utilization

Page 91: Application architecture operational aspects

91

Performance modeling Process

Identify the steps that must take place to complete a scenario

Use cases, sequence diagrams, flowcharts etc. all provide useful input

Helps you to know where to instrument your code later

Start at a high level; don't go too low

1. Identify Key Scenarios

2. Identify Workloads

3. Identify Performance Objectives

4. Identify Processing Steps

5. Allocate Budget

6. Evaluate

7. Validate

Iterate

Page 92: Application architecture operational aspects

92

Performance modeling Process

Use your performance baseline to measure how much time each processing step is taking

If you are not meeting your target, re-budget the time among the processing steps

1. Identify Key Scenarios

2. Identify Workloads

3. Identify Performance Objectives

4. Identify Processing Steps

5. Allocate Budget

6. Evaluate

7. Validate

Iterate

Page 93: Application architecture operational aspects

93

Performance modeling Process

Run automated test scenarios and evaluate the performance against objectives

As much as possible, tests must be repeatable throughout application lifecycle

1. Identify Key Scenarios

2. Identify Workloads

3. Identify Performance Objectives

4. Identify Processing Steps

5. Allocate Budget

6. Evaluate

7. Validate

Iterate

Page 94: Application architecture operational aspects

94

Performance modeling Process

Check your results against performance objectives

Leave yourself a margin early in the project to avoid early performance optimization

As you progress toward completion allow less margin

1. Identify Key Scenarios

2. Identify Workloads

3. Identify Performance Objectives

4. Identify Processing Steps

5. Allocate Budget

6. Evaluate

7. Validate

Iterate

Page 95: Application architecture operational aspects

95

Content

1. Introduction to Operational Aspects

2. Reliability

3. Availability and SLA

4. High Availability

5. Eliminating SPOF

6. Transaction

7. CAP Theorem

8. Scalability

9. Performance

10. Clustering

Page 96: Application architecture operational aspects

96

Clustering

Clustering enables multiple servers or server processes to work together

Clustering can be used to horizontally scale a tier, i.e. scale by adding servers

Clustering usually costs much less than buying a bigger server (vertical scaling)

Clustering also typically provides failover and other reliability benefits

Page 97: Application architecture operational aspects

97

Clustering Concepts

The less communication required, the better

Always better to be stateless in a tier if it does not cause a bottleneck in the next tier

Server farms: a stateless clustering model

The less coordination required, the better

Independence: Don’t go to the committee

Concurrency Control: Reduces scalability, so use only as necessary

Page 98: Application architecture operational aspects

98

Clustering Benefits

If the application has been built correctly, it supports a predictable scaling model

Clustering allows relatively inexpensive CPU and memory resources to be added to a production application in order to handle more concurrent users and/or more data

Provides redundancy

Simple (n+1) model

Page 99: Application architecture operational aspects

99

Scalability of Clustering

The Potential for Negative Scale

Single server model allows unrestricted caching

Clustering may require the disabling of caching

Two servers often slower than one!

Data Integrity Challenges in a Cluster

How to maintain the data in sync among servers

How to keep in sync with the data tier

How to failover and fail-back servers without impact