A Guide to Agile Performance - Capacitas You and... · 2018. 3. 12. · Performance provide a...

DR. MANZOOR MOHAMMED & THOMAS BARNS

A Guide to Agile Performance: How to Move Fast and Not Break Things


About the Authors

Manzoor has worked on capacity and performance management projects for some of the largest ICT systems in the UK, including BT, Level 3 Communications and Syntegra.

One of his biggest achievements was building capacity models, which helped automate the capacity plan process for a large ISP, reducing cycle-time from several months to several days. Another is building a predictive performance model for a large system integrator at bid stage, which helped the system integrator re-negotiate the SLA response times.

Manzoor’s key skill areas are: creating volumetric forecast models, building predictive capacity and performance models for ICT systems, and capacity planning and performance analysis of ICT systems.

Dr Manzoor Mohammed Director

Table of Contents

Introduction

Key Takeaways

The “Old World”

Businesses Need IT to Deliver Software Faster

But What About Performance?

The Holy Grail of Faster Delivery & Good Performance

“I Can’t Possibly Build All of this into QA Within My Available Time & Budget, Right?”

The Capacitas Solution to Ensure Performance in an Agile / Devops / CI Environment

Summary

References

Thomas is Risk Modelling and Performance Engineering Service Lead at Capacitas, responsible for service definition and ensuring consistent best practice across projects.

Over the past 10 years he has worked on large projects providing capacity and performance expertise to clients and owned the roadmap for developing Capacitas’ technical software solutions.

During this time, he has seen a big shift in how software engineering is undertaken and viewed by the business, and has built on this to introduce more effective and efficient performance risk management processes. This has meant shifting focus away from large scale system testing to a full lifecycle approach, alongside research and development in automated data analysis.

Thomas is currently defining and governing Performance Engineering processes and standards for a multi-million-pound multi-vendor programme of work at a FTSE 100 company.

Thomas Barns Principal Consultant


Introduction

This whitepaper is designed for roles at all levels across enterprise IT involved in major change programmes (including digital transformation, re-platforming, cloud migration, datacentre migration) and following, or planning to adopt, agile and devops delivery models.

We cover the core principles and best practice approaches for ensuring good performance, whilst increasing the velocity of delivery.


The “Old World”

As we all know, the traditional software design methodology for delivery was the waterfall model. The waterfall model is a sequential (non-iterative) design process.

Projects which followed the waterfall methodology had a high risk of slippage, and a risk that the end product wasn’t what the business actually wanted by the time the project was completed. This was because either the requirements did not capture the business needs comprehensively enough, or the business requirements had changed by the time the end product was delivered.

The performance of the software delivered by these projects was typically ensured with long periods of testing near the end of the project lifecycle. These performance tests were often long and complicated, e.g. soak tests of 24-hour duration. They were typically carried out by performance testers who were very much focused on checking that the tests met formal Non-Functional Requirements (NFRs).

Key Takeaways

Performance is not simply about response times and throughput. That is too simplistic a way to measure performance. An all-embracing approach to measuring performance is required. Capacitas’s 7 Pillars of Performance provide a comprehensive way of measuring performance.

In Agile & Continuous Cycles, there is simply not enough time to test every change. A risk-based approach, using techniques such as Risk Modelling, is required.

Shift Left & Continuous Collaboration. There needs to be a shift from performance engineers and analysts testing at the end to one where they are involved throughout the lifecycle and are collaborating with the developers to build a better understanding of the software and also refine the conceptual model of the platform.

Smart Design: smart test designs are needed to expose risks at lower loads and in narrow test windows.

Automation of Testing & Analysis. Automated analysis needs to address not just response times and throughput but all 7 Pillars of Performance.


Businesses Need IT to Deliver Software Faster

In digital markets, there is intense competition within each sector.

For example, in the airline sector, easyJet, Ryanair and BA are in competition with each other to introduce new products to increase revenue and attract new customers. In 2007, easyJet launched Speedy Boarding before Ryanair. A few years later they achieved a similar coup with Allocated Seating. This gave them a competitive advantage over Ryanair in increasing the revenue per passenger and also improving customer experience.

Figure 1: Competitive Advantage (First Adopter, Competitor #1, Competitor #2)

In the retail sector, businesses are facing direct competition from Amazon because of its ability to provide an excellent digital experience. Retailers need their IT departments to deliver more functionality that can match the Amazon digital experience and do so within short timescales. Within Banking and Financial Services there is a rapidly growing ecosystem of Fintech start-ups threatening to undermine the conventional business models. Again, the established players need to innovate quickly or risk losing market share.

Conventional waterfall methodologies and their associated timescales are not fit for purpose in this new era of fast delivery in highly competitive markets.

But What About Performance?

Software in live is normally distributed over a complex IT environment on multiple tiers of infrastructure and different technologies. How, in this world of rapid delivery and complex infrastructure and technologies, can we still deliver performance and minimise the risk to user experience and loss of revenue?

Capacitas has defined 7 Pillars of Performance. If any one of these pillars fails then the overall performance and end user experience is impacted and / or the cost of supporting the service increases substantially.

Figure 2: The Capacitas 7 Pillars of Software Performance (Throughput & Response Time, Capacity, Efficiency, Scalability, Stability, Resilience, Instrumentation)

Throughput & Response Time

This is the most widely understood criterion for performance. It measures the speed and successful throughput of the software. Speed can be measured at different points in the platform, based on user experience or on time taken by the technical platform. Throughput is a measure of the rate at which work is achieved. For example, this could be the number of page views achieved per second or the number of database transactions processed per minute.

Capacity

This is how much capacity you need to support the software. There is a myth that in the world of cloud this is no longer an issue. In fact, the amount of cloud capacity provisioned has a direct bearing on cost incurred. In addition, cloud capacity is not always instantaneously available.

Efficiency

This is a measure of how much capacity is used to deliver a business function, e.g. the number of CPU seconds used to deliver the search functionality of a digital platform. We often work with applications that are able to meet throughput and response requirements but, due to their inefficiency, require a large number of server instances to run. This leads to excessive run cost.

Scalability

This is a measure of whether software can scale linearly with increasing load and can use all the available capacity. If it can’t then it will act as a drag on the speed of delivering software change in the future.

Stability

This is a measure of how stable performance is over long periods of time and prolonged periods of load.

Resilience

This looks at how software behaves when an internal or external interface slows down or becomes unavailable. We would expect the parts of the software which do not call these internal and external interfaces to remain unaffected when these interfaces slow down or become unavailable. Usually, software is better at handling the non-availability of interfaces than at handling interfaces slowing down. A term that is sometimes used by our customers is that their software can only handle “happy day scenarios”.

Instrumentation

Instrumentation is critical. Without it we simply can’t understand the 6 pillars mentioned above. APM tools (such as AppDynamics, Dynatrace, New Relic) provide an invaluable source of data; however, they need to be used in conjunction with other sources of data to get a comprehensive view of the performance. Using the ITIL Framework, our metrics fall within three categories: Business, Service and Component.
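As a rough illustration of how two of these pillars can be quantified, the sketch below computes CPU seconds per transaction (Efficiency) and compares measured throughput against a linear extrapolation from the lowest load level (Scalability). The function names, inputs and sample figures are illustrative assumptions, not part of the Capacitas method or tooling.

```python
def cpu_seconds_per_transaction(cpu_utilisation, cores, interval_seconds, transactions):
    """Efficiency: CPU seconds consumed per transaction over a measurement interval.

    cpu_utilisation is a fraction between 0 and 1, averaged across all cores.
    """
    cpu_seconds_used = cpu_utilisation * cores * interval_seconds
    return cpu_seconds_used / transactions


def scaling_efficiency(load_points):
    """Scalability: ratio of measured throughput to a linear extrapolation.

    load_points is a list of (offered_load, achieved_throughput) pairs,
    ordered by load. The first pair is treated as the baseline, so a value
    of 1.0 means perfectly linear scaling and lower values mean the
    software is falling behind as load increases.
    """
    base_load, base_throughput = load_points[0]
    ratios = []
    for load, throughput in load_points:
        expected = base_throughput * load / base_load
        ratios.append(throughput / expected)
    return ratios


# Hypothetical measurements: 8 cores at 50% utilisation for 60 seconds,
# during which 12,000 search transactions completed.
search_cost = cpu_seconds_per_transaction(0.5, 8, 60, 12_000)

# Hypothetical load test steps: (virtual users, transactions per second).
ratios = scaling_efficiency([(100, 50), (200, 100), (400, 160)])
print(search_cost, ratios)
```

Tracking the first figure per business function from release to release shows whether efficiency is drifting, while a falling ratio in the second suggests a scalability limit worth investigating before peak load exposes it.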


The Holy Grail of Faster Delivery & Good Performance


Although development methodologies and technologies have evolved to deliver software faster, the same innovations for managing performance have not been put in place. The following section details an approach to delivering the holy grail of delivery speed whilst maintaining good performance. How do we ensure that the 7 Pillars of Performance are maintained given that developers must not be slowed down?

Facebook famously had a mantra of moving fast and breaking things. However, they have more recently back-tracked from this as they found that they were spending more time fixing production bugs than delivering new functionality. Our recommendation is that you can use this approach where your software is not business-critical. However, if your software is business-critical, there are four reasons why this will not work.

It does not work if you have large peaks.

Some performance defects only manifest at peak load. You can release software on a normal load day and everything is fine. However, six months later, during a peak day, the defect in that release causes a system outage. At this stage it is very difficult to pinpoint the root cause of the defect. (See Fig.3)

Some defects only manifest over prolonged periods.

Performance defects such as memory and CPU leaks only manifest over long periods of time. On release all may look well; however, after a period of time an incident will occur. Since multiple releases will have taken place in the intervening period, it is very difficult to unravel the code to find the root cause.

It may not be a leak. It could also be a fundamental design flaw that only manifests after multiple releases are applied to weak foundations. (See Fig.4)

Figure 3: easyJet.com demand profile. Y-axis: Demand. Labelled periods: Routes Extension, Normal Day, Jan Promo Peak.

It is far more expensive to fix defects in live than it is early in the software lifecycle.

Research [1] suggests that fixing defects in live is 100x more expensive than fixing defects during the design stage. (See Fig.5) In our customer engagements we find that complex performance defects take a long time to identify and resolve. This is due to a combination of factors, such as the complexity and distribution of the system that supports the software. A publicly documented example is Netflix, which had a performance issue using node.js that took 16+ days to resolve.

Figure 4: A performance defect introduced in release 2 may only manifest after a prolonged period of time. X-axis: Releases 1 to 12.

[1] Integrating Software Assurance into the Software Development Life Cycle (SDLC), Journal of Information Systems Technology and Planning (2010); Maurice Dawson, Darrell N Burrell, Emad Rahim, Stephen Brewster

Netflix example: see http://techblog.netflix.com/2014/11/nodejs-in-flames.html


Figure 5: Relative costs to fix software defects, by stage at which the defect is found (source: IBM Systems Sciences Institute). Design: 1x; Implementation: 6.5x; Testing: 15x; Maintenance: 100x.

Cost & Efficiency.

This is often overlooked. There will be frequent releases into live in an Agile/Continuous delivery approach. On numerous engagements we have observed a small increase in capacity consumption per release resulting in a cumulatively large increase over a longer time frame. Increases of 1-3% in capacity consumption per release are not uncommon in these frequent delivery models. With monthly releases, a 3% increase per release compounds to an increase in capacity consumption of up to 38% by the twelfth release of the year. (See Fig.6)

In the cloud, this has a direct correlation with run cost.
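The compounding arithmetic behind this figure can be checked with a few lines of code. The sketch below treats release 1 as the 1.00 baseline and applies a fixed percentage increase at each subsequent release; the function name and the fixed-rate assumption are ours, for illustration only.

```python
def compounded_consumption(growth_per_release, releases):
    """Relative capacity consumption at each release.

    Release 1 is the baseline (1.00); every later release multiplies
    consumption by (1 + growth_per_release).
    """
    return [(1 + growth_per_release) ** n for n in range(releases)]


# 3% degradation per release over 12 monthly releases.
profile = compounded_consumption(0.03, 12)
print([round(value, 2) for value in profile])
print(f"Increase by release 12: {profile[-1] - 1:.0%}")
```

At 3% per release the twelfth release consumes roughly 1.38 times the capacity of the first, which is where the 38% figure comes from. Repeating the calculation with 0.01 or 0.02 shows how sensitive the annual figure is to small per-release changes.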

Figure 6: A 3% degradation in software efficiency per release will compound to a 38% increase in cloud opex over 12 releases. Y-axis: Annual Cloud Opex (£m); X-axis: Releases 1 to 12. Plotted values: 1.00, 1.03, 1.06, 1.09, 1.13, 1.16, 1.19, 1.23, 1.27, 1.30, 1.34, 1.38.

“I Can’t Possibly Build All of this into QA Within My Available Time & Budget, Right?”

Wrong! You can, by working smarter:

By taking a Risk-Based approach

By Automating execution and analysis

Your approach should reflect five key principles:

1. Identify Performance Risk Comprehensively.

Assuring throughput and response is not enough. Risk across the 7 Pillars of Performance must be addressed.

2. Implement a Lifecycle Risk Management Strategy that Focuses Time & Effort on High-Risk Items.

In an Agile/CI delivery approach it’s complex and time-consuming to test everything. Also, testing is not always the appropriate solution. A more efficient approach is to look for changes that carry both a high risk of introducing a defect and a high risk that the defect would have a significant impact.

For instance, a change to search on an e-commerce website could be high risk in terms of a defect being introduced and, since search is likely to be the most frequent action on the website, could also be viewed as having a high impact on website quality. An example of a low risk change would be minor changes to the user interface.

Based on the risk, appropriate mitigation can then be put in place. This might involve load testing, but could be some other activity, including reviewing the implementation, testing at low loads or tracking in production.
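One way to make this prioritisation concrete is a simple likelihood-times-impact score. The sketch below is a minimal illustration only: the 1 to 5 scales, the score thresholds and the example changes are assumptions chosen for the example, not Capacitas's Risk Modelling technique.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    likelihood: int  # 1 = unlikely to introduce a defect, 5 = very likely
    impact: int      # 1 = negligible effect if defective, 5 = severe

    @property
    def risk_score(self):
        return self.likelihood * self.impact


def mitigation_for(change, high=15, medium=6):
    """Map a risk score to a mitigation activity (thresholds are assumed)."""
    if change.risk_score >= high:
        return "load test"
    if change.risk_score >= medium:
        return "implementation review and low-load test"
    return "track in production"


backlog = [
    Change("rewrite product search query", likelihood=4, impact=5),
    Change("add field to order export", likelihood=3, impact=2),
    Change("restyle footer links", likelihood=1, impact=2),
]

# Highest risk first, so effort goes where it matters most.
for change in sorted(backlog, key=lambda c: c.risk_score, reverse=True):
    print(change.name, change.risk_score, mitigation_for(change))
```

The point of the sketch is the ordering rather than the exact numbers: scarce test time goes to the changes at the top of the list, and low-scoring changes are handled by lighter-weight activities.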

3. Collaboration on Performance.

Fast performance engineering only happens when the relevant roles work closely together. This means performance engineers and analysts collaborating with architects, product owners, developers and testers to drive a focus on performance throughout the lifecycle. Bringing together performance and technical domain experts in this way provides the insight needed to spot problems quickly. It also means that the implementation teams gain the performance insight required to build an efficient system, while also providing the performance team with the domain knowledge to construct smart tests.

4. Smart Test Design.

In Continuous and Agile environments, we operate within narrow test windows on non-representative test environments. Smart test design is required to expose risks effectively, across the 7 pillars, within the constraints of time and environments.

5. Automation of Performance Analysis Throughout the Lifecycle.

The aim of automation is not just speed but also the identification of early warning signs of risk in non-representative load tests. In the old world, performance testers had long test windows in which to conduct large and complex performance tests to re-create problems. In the new world of Agile/CI, the performance engineer does not have the time to run these large, complex performance tests. In a CI cycle, you may only have 25 minutes. It is unlikely that you will recreate incidents in this test window. In order to identify risks, the performance engineer needs to look beyond conventional metrics such as response time and CPU utilisation. This means looking at metrics deep within the system for anomalies in behaviour that could present a risk. This requires performance engineering expertise and also domain expertise.

The Capacitas Solution to Ensure Performance in an Agile / Devops / CI Environment

This is made up of 6 Modules as shown in the diagram below. There are three key points that need to be understood when viewing this diagram:

In an Agile, DevOps or CI context, there may be no clear start and end times of these activities as they are continuous.

Not all of these modules will be carried out for every change. The frequency and number of modules carried out depend on the level of risk of each change.

This is a collaborative approach with the development teams building a collective understanding of the software and how it works in the live ecosystem and in test environments.

Figure 7: The Capacitas solution for continuous performance engineering. Software development lifecycle stages: Requirements, Design, Build, Test, Release. Modules: Risk Modelling, Performance Reviews, Profiling & Unit Tests, Early Load Testing, Integration Load Testing, Production Validation. Other labels: Risk Mitigation; Concept, Collaboration, Code, Builds.


1. Risk Modelling

The performance lead ensures that all features and changes undergo a performance risk assessment during elaboration sessions and sprint planning, refining it with business analysts and developers as required.

This typically includes conceptual modelling of the system to determine risks based on architectural or design decisions.

Following creation of the risk assessment, a risk mitigation plan is put in place to ensure appropriate levels of intervention for each risk across the other Performance Engineering modules.

The performance lead is responsible for ensuring that all risks are mitigated appropriately and that risk levels are kept up to date and owned throughout the lifecycle.

2. Performance Reviews

Performance engineers work with architects, designers and developers to ensure that best practice is built into the application throughout.

This is based on architecture and design documents, and engagement with appropriate personnel where necessary.

Collaboration with developers takes place as high risk changes are developed.

Performance analysts check for performance anti-patterns and work with developers to eliminate issues before code check-in.

3. Profiling and Unit Testing

For high risk items, the analyst works with the developers to create unit test definitions which are included in the risk mitigation plan, defining units to be tested and acceptance criteria.

For any areas of poor performance identified through unit testing, the analyst collaborates with developers to assist with the identification of hotspots through code profiling.

This collaborative process pinpoints inefficiencies in the code during the development lifecycle.
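A unit-level performance test with an explicit acceptance criterion might look like the sketch below. It times a unit several times and compares the best run against a time budget. The helper names, the stand-in unit `build_search_index` and the one-second budget are hypothetical; a real definition would come from the risk mitigation plan.

```python
import time


def best_time_seconds(func, repeats=5):
    """Run func several times and return the fastest wall-clock time.

    Taking the minimum reduces the influence of noise from other
    processes on a shared build agent.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def meets_acceptance(func, budget_seconds, repeats=5):
    """Acceptance criterion: the unit's best run fits within the budget."""
    return best_time_seconds(func, repeats) <= budget_seconds


def build_search_index():
    # Hypothetical unit under test: stands in for a high-risk code path.
    return sorted(range(10_000), reverse=True)


print(meets_acceptance(build_search_index, budget_seconds=1.0))
```

When a unit misses its budget, that is the cue to profile it, for example with Python's built-in `cProfile` module, to locate the hotspot.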

4. Early Load Testing

Early tests are typically carried out in small, unrepresentative environments. In order to get round this limitation, smart tests need to be designed to expose performance risks by targeting key functionality.

At Capacitas, we use our proprietary software accelerator (TNT) to automatically analyse test results. TNT uses a 13 metric model to examine performance across the seven pillars, to automatically detect performance pathologies.

This activity is fully automated and carried out frequently.
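To give a flavour of what automated analysis of a short test can look for, the sketch below fits a least-squares trend line to memory samples from a 25-minute run and flags sustained growth as a possible leak. This is not how TNT works (its 13 metric model is proprietary and not described here); the function names, the 5% per hour threshold and the sample data are assumptions for illustration.

```python
def slope_per_minute(samples):
    """Least-squares slope of value against time.

    samples is a list of (minutes_since_start, value) pairs.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    return numerator / denominator


def looks_like_leak(samples, max_growth_per_hour=0.05):
    """Flag a metric whose trend grows faster than the allowed fraction per hour.

    Growth is expressed relative to the first sample, so 0.05 means
    5% of the starting value per hour.
    """
    baseline = samples[0][1]
    growth_per_hour = slope_per_minute(samples) * 60 / baseline
    return growth_per_hour > max_growth_per_hour


# Hypothetical heap usage in MB, sampled once a minute under constant load.
steady = [(minute, 500.0) for minute in range(25)]
leaking = [(minute, 500.0 + 2.0 * minute) for minute in range(25)]

print(looks_like_leak(steady), looks_like_leak(leaking))
```

A growth of 2 MB per minute would not cause an incident within the test window, but extrapolated over days in production it would, which is why trend-based checks are useful in narrow windows.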


5. Integration Load Testing

Integration load testing takes place as the system comes together, usually at the end of a cycle.

Scaled production-like workload mix tests are run over an integrated test environment.

Workload mixes can be altered to target different what-if scenarios of future user load and behaviour.

At Capacitas, we use our TNT software accelerator to automatically detect performance pathologies across the seven pillars and deliver rapid feedback to development teams.
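Scaling a production workload mix down to a test environment, and then adjusting it for a what-if scenario, is straightforward arithmetic. The sketch below is one possible way to express it; the transaction names, rates and the 50 requests per second target are made-up figures, and the function names are ours.

```python
def scale_workload(production_rates, target_total):
    """Scale a production mix to a target total rate, preserving proportions.

    production_rates maps transaction name to observed requests per second.
    """
    total = sum(production_rates.values())
    return {
        name: rate / total * target_total
        for name, rate in production_rates.items()
    }


def what_if(mix, transaction, multiplier):
    """Return a copy of the mix with one transaction's rate multiplied."""
    adjusted = dict(mix)
    adjusted[transaction] = adjusted[transaction] * multiplier
    return adjusted


# Hypothetical production peak, in requests per second.
production = {"search": 600.0, "browse": 300.0, "checkout": 100.0}

# The integrated test environment is assumed to sustain 50 requests per second.
test_mix = scale_workload(production, target_total=50.0)

# What if a promotion doubles the checkout rate?
promo_mix = what_if(test_mix, "checkout", 2.0)
print(test_mix, promo_mix)
```

Keeping the proportions intact means the scaled test still exercises the same relative pressure on each component, even though absolute load is far lower than in live.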

6. Production Validation

After release in live, data is gathered and analysed.

A production health check is carried out to identify risks not observed in test. At Capacitas, we use our proprietary ‘Operational Analytics’ software accelerator to automate this analysis.

A before/after check on the monitoring data will be used to identify any impacts of the development work in production.

Findings of the production validation are fed back into the SDLC and the performance engineering cycle as continual service improvement actions.
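A before/after check can be as simple as comparing the median of each monitored metric across comparable periods either side of the release. The sketch below is a minimal version of that idea and is not the ‘Operational Analytics’ tool; the metric names, the sample values and the 10% tolerance are assumptions for illustration.

```python
from statistics import median


def relative_change(before, after):
    """Fractional change in the median of a metric after a release."""
    baseline = median(before)
    return (median(after) - baseline) / baseline


def before_after_check(metrics_before, metrics_after, tolerance=0.10):
    """Return the metrics whose median moved by more than the tolerance.

    Both arguments map metric name to a list of samples taken over
    comparable periods (for example, the same weekday and hours).
    """
    flagged = {}
    for name, before in metrics_before.items():
        change = relative_change(before, metrics_after[name])
        if abs(change) > tolerance:
            flagged[name] = change
    return flagged


# Hypothetical monitoring samples from the week before and after a release.
before = {
    "cpu_seconds_per_request": [0.10, 0.11, 0.10],
    "p95_response_ms": [200, 210, 205],
}
after = {
    "cpu_seconds_per_request": [0.13, 0.12, 0.13],
    "p95_response_ms": [204, 206, 208],
}

flagged = before_after_check(before, after)
print(flagged)
```

In this made-up data response time is essentially unchanged while CPU cost per request has risen by about 30%, the kind of efficiency regression that response-time monitoring alone would miss.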


Summary

In summary, Capacitas believes that delivering change faster while maintaining performance requires the following five paradigm changes to the conventional performance engineering approach.

Performance is not simply about response times and throughput. That is too simplistic a way to measure performance. An all-embracing approach to measuring performance is required. Capacitas’s 7 Pillars of Performance provide a comprehensive way of measuring performance.

In Agile & Continuous Cycles, there is simply not enough time to test every change. A risk-based approach, using techniques such as Risk Modelling, is required.

Shift Left & Continuous Collaboration. There needs to be a shift from performance engineers and analysts testing at the end to one where they are involved throughout the lifecycle and are collaborating with the developers to build a better understanding of the software and also refine the conceptual model of the platform.

Smart Design: smart test designs are needed to expose risks at lower loads and in narrow test windows.

Automation of Testing & Analysis. Automated analysis needs to address not just response times and throughput but all 7 Pillars of Performance.


References

Integrating Software Assurance into the Software Development Life Cycle (SDLC). Journal of Information Systems Technology and Planning (2010).

Why ‘Move Fast and Break Things’ Doesn’t Work. Thomas Barns, Capacitas.

Why Traditional Performance Testing Cannot Survive in an Agile and Devops World. Andy Bolton, Capacitas.

Why Testing at the End Doesn’t Work. Prasham Garg, Capacitas.

Automating Performance Test Analysis to Speed Up Software Delivery. Ian Donnell, Capacitas.

Node.js in Flames. Netflix. http://techblog.netflix.com/2014/11/nodejs-in-flames.html


About Capacitas

We are Performance and Capacity Management Experts. Founded in 2002 and based in Central London, we have a team of 40 consultants. We deliver substantial reductions in risk and service cost for business-critical IT systems.

We have delivered amazing results, protecting e-commerce revenue of £6.3 billion p.a. for our clients and saving individual clients more than £23 million p.a. through infrastructure consolidation and application optimisation.

As an independent professional services organisation, we act as a trusted advisor to our clients, tailoring our service offerings to meet their specific business objectives and working hard to become long-term strategic partners. We are thought-leaders within performance and capacity management and invest heavily in R&D in these areas; as well as enabling us to deliver industry-leading managed services, our research and innovation forms the basis of our whitepapers and thought-leadership events.

Next Steps

If you find this whitepaper relevant and interesting, you will like the following:

Webinar: easyJet Case Study - Managing Performance Whilst Delivering Faster and Implementing Rapid Technology Change

Infographic: The Seven Pillars of Performance

Capacitas Blog: http://www.capacitas.co.uk/blog


Bring us Your Capacity and Performance IT Challenges

If you want to see big boosts to performance, with risk managed and costs controlled, then talk to us now to see how our expertise gets you the most from your IT.

www.capacitas.co.uk

+44 (0) 20 7566 4869

[email protected]