Performance testing in scope of migration to cloud by Serghei Radov
Performance Testing in Scope of Migration to Cloud
Serghei Radov
Current position: Senior Performance Engineer at Lohika
Contacts: [email protected] | GitHub: github.com/grinslife | Skype: serghei.radov
AGENDA
● Cloud computing principles
● Challenges
● Performance testing as part of the migration process
● What toolset could be used?
● How to avoid common pitfalls?
● Does the "90th percentile" really work?
● What will be the cost of the performance testing toolset?
Cloud computing principles
● Multi-tenancy
● Statistical multiplexing
● Horizontal scalability
● Data partitioning
● Consistent hashing (see the sketch below)
● Eventual consistency
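Of these, consistent hashing is the least self-explanatory: it lets cache or storage nodes join and leave while remapping only a small share of keys. A minimal Python sketch of the idea (illustrative only, not from the talk; node names and the virtual-node count are made up):

    import bisect
    import hashlib

    def _hash(key: str) -> int:
        # Map a string to a point on the hash ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, nodes, vnodes=100):
            # vnodes: virtual nodes per physical node, smoothing the balance.
            self.ring = sorted(
                (_hash(f"{node}#{i}"), node)
                for node in nodes
                for i in range(vnodes)
            )
            self.points = [point for point, _ in self.ring]

        def node_for(self, key: str) -> str:
            # Walk clockwise to the first virtual node at or after the key.
            idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
            return self.ring[idx][1]

    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))  # the same key always lands on the same node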
Cloud performance challenges
● Over-provisioning
● Under-provisioning
● ELB network traffic issues
● Availability and reliability
Solutions for effective provisioning
● Predictive auto-scaling
● Scale up early, scale down slowly
● Use time as a proxy (see the sketch below)
● Machine learning
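"Use time as a proxy" can be made concrete: predict the coming window's load from the same time slot in previous weeks, provision headroom above the prediction, and shed capacity gradually. A minimal sketch under those assumptions (the per-instance capacity, headroom factor, and scale-down rate are illustrative, not taken from Scryer or the other engines named below):

    import math
    from statistics import mean

    def predict_rpm(history, weekday, hour):
        # history: {(weekday, hour): [RPM samples from the same slot in past weeks]}
        # The same time slot in prior weeks is the proxy for upcoming load.
        return mean(history[(weekday, hour)])

    def desired_instances(predicted_rpm, current,
                          rpm_per_instance=150, headroom=1.5):
        # Scale up early: provision headroom above the prediction.
        target = math.ceil(predicted_rpm * headroom / rpm_per_instance)
        # Scale down slowly: shed at most one instance per evaluation cycle.
        if target < current:
            target = current - 1
        return max(target, 1)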
Netflix’s Predictive Scaling Engine
Predictive auto-scaling engine tools:
● Scryer
● Elastisys
● AppDynamics
● VMTurbo
● Rancher
Multi-cloud or hybrid cloud
● Use multiple Availability Zones
● Keep zones independent
● Deploy to multiple regions
● Employ solid backup and recovery strategies
Some tips
➢ Define acceptance criteria
➢ Select tools for monitoring and testing
➢ Discuss capacity planning responsibilities
➢ Workload characterization
➢ Test tools for testing
➢ Run tests, analyze, scale, re-run (a cycle)
➢ Report to stakeholders
Define the performance test SLA
● Statefulness
● Response time
● Time-outs
Exceptions that can be included in the SLA:
● Failure
● Network issues
● Denial of service
● Scheduled maintenance
New Relic Response times
NRQL (New Relic Query Language) examples:
SELECT uniqueCount(session) FROM PageView SINCE 1 week ago
SELECT uniqueCount(session) FROM PageView SINCE 1 week ago COMPARE WITH 1 week ago
SELECT count(*) FROM PageView SINCE 1 day ago COMPARE WITH 1 day ago TIMESERIES AUTO
SELECT uniqueCount(uuid) FROM MobileSession FACET osVersion SINCE 7 days ago
Gathering response times
Additional response time metrics
All of these response times are presented as part of the app response time:
- Database response times
- Memcached response time
- WebExternal
- Ruby
- GC calls
New Relic also provides an advanced ability to trace response times across systems using NRQL.
Transaction throughput
- DC and cloud resources are not directly comparable due to differences in hardware configurations.
- The transaction count in the cloud should match or exceed the current production level at the DC, so current users can be served without added latency.
- Target peak load: 1.14K RPM
- Lowest point: 430 RPM
Finding peaks
(charts extracted from New Relic instead of DataDog, for presentation purposes only)
Scenario per server (sketched as data below)
- Ramp up slowly from 430 RPM to 700 RPM over 4 hours
- Run the test for 6 hours
- Ramp up to 1.14K RPM
- Run the test for 11 hours
- Ramp down slowly
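The same scenario can be written down as data for whatever load generator drives the test. A sketch with stage durations and RPM targets taken from the slide; the two ramp durations marked "assumed" are not stated in the source:

    # Stepped load profile: (stage, duration in minutes, target RPM).
    PROFILE = [
        ("ramp 430 -> 700 RPM", 4 * 60,  700),
        ("steady at 700 RPM",   6 * 60,  700),
        ("ramp to peak",        30,      1140),  # ramp duration assumed
        ("steady at 1.14K RPM", 11 * 60, 1140),
        ("ramp down",           30,      0),     # ramp duration assumed
    ]

    total_minutes = sum(minutes for _, minutes, _ in PROFILE)
    print(f"total test duration: {total_minutes / 60:.1f} hours")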
Hardware acceptance levels
- App server CPU usage:
  - should not go above 60% during the 150% peak load
  - hard threshold of 80%
- Memory usage (avg 60%, threshold 80%)
- Network throughput (should correspond to DC levels)
- Auto-scaling groups set to false (initial criterion)
All of these metric values depend on production usage, budget, and the target VM provisioning size. (An automated check is sketched below.)
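A hedged sketch of how these acceptance levels could be checked automatically against CPU and memory samples pulled from monitoring (threshold values are the ones above; the function and its inputs are illustrative):

    def check_acceptance(cpu_samples, mem_samples):
        # cpu_samples / mem_samples: utilisation percentages collected
        # during the 150% peak-load window.
        return {
            "cpu <= 60% at peak": max(cpu_samples) <= 60,
            "cpu under 80% cap":  max(cpu_samples) < 80,
            "mem avg <= 60%":     sum(mem_samples) / len(mem_samples) <= 60,
            "mem under 80% cap":  max(mem_samples) < 80,
        }

    # Example: check_acceptance([42, 55, 59], [50, 58, 62])
    # -> all four checks True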
CPU usage per server (DataDog)
Checklist ➢ Select tools for monitoring and testing
Monitoring targets
● Response times
● Resource utilisation at the SUT
● Resource utilisation at the test tool
● Exceptions
● Workload behaviour
Load test tool (flood.io) shows:
- Response times
- Resource usage
- Caught exceptions
- Real-time workload tracking
Checklist ➢ Discuss capacity planning responsibilities
Select the proper EC2 type for an app:
● General Purpose
● Compute Optimized
● Memory Optimized
● GPU
● Storage Optimized
● Dense-storage Instances

Model      | vCPU | Mem (GiB) | Storage  | Dedicated EBS Bandwidth (Mbps)
c4.large   | 2    | 3.75      | EBS-Only | 500
c4.xlarge  | 4    | 7.5       | EBS-Only | 750
c4.2xlarge | 8    | 15        | EBS-Only | 1,000
c4.4xlarge | 16   | 30        | EBS-Only | 2,000
c4.8xlarge | 36   | 60        | EBS-Only | 4,000
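For illustration, the table can be turned into a small lookup that returns the smallest c4 instance satisfying a vCPU and memory requirement (a sketch using only the figures above; prices and other families are omitted):

    # (model, vCPU, memory in GiB) from the c4 table above, smallest first.
    C4_TYPES = [
        ("c4.large",   2,  3.75),
        ("c4.xlarge",  4,  7.5),
        ("c4.2xlarge", 8,  15.0),
        ("c4.4xlarge", 16, 30.0),
        ("c4.8xlarge", 36, 60.0),
    ]

    def smallest_fit(vcpu_needed, mem_needed_gib):
        # Listed smallest-first, so the first match is the tightest fit.
        for model, vcpu, mem in C4_TYPES:
            if vcpu >= vcpu_needed and mem >= mem_needed_gib:
                return model
        return None  # requirement exceeds the c4 family

    print(smallest_fit(6, 10))  # -> c4.2xlarge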
Checklist ➢ Workload characterization
Workload characterization
- Catch traffic patterns
- Resource utilisation
- Distribution of response times
- Distribution of response sizes
- Characterization of user behaviour
- Analyse input data
- Use a performance analysis toolkit
Traffic patterns
“Keep workload as real as possible.”
Resource utilisation
Characterize user behaviour: investigate user actions with the help of
- New Relic Browser (session + funnel functions)
- Universal Analytics with user behaviour paths
- Mixpanel.com (needs code injection)
- Server logs at NGINX (HTTP requests, REST calls)
- Sumo Logic (Apache access logs)
- Server app logs (HP ALM has QC sense)
- DB activity logs (applied solution)
Write analytical tools (sketched below) that will:
- Parse access / ELB logs
- Unite requests into scripts by timestamp and IP
- Reduce the number of unique scripts
- Restore high-level user actions
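A minimal sketch of such a tool, assuming the classic ELB access-log format (space-separated fields, ISO timestamp first, client address in the third field, the request in quotes) and an assumed 30-minute gap as the session boundary:

    from collections import defaultdict
    from datetime import datetime, timedelta

    SESSION_GAP = timedelta(minutes=30)  # assumed session boundary

    def parse_line(line):
        # timestamp elb client:port backend:port ... "GET http://... HTTP/1.1" ...
        fields = line.split(" ")
        ts = datetime.strptime(fields[0][:19], "%Y-%m-%dT%H:%M:%S")
        client_ip = fields[2].rsplit(":", 1)[0]
        method, url = line.split('"')[1].split(" ")[:2]
        return ts, client_ip, method, url

    def group_into_scripts(lines):
        scripts = defaultdict(list)  # ip -> list of sessions ("scripts")
        last_seen = {}
        for line in sorted(lines):   # the ISO timestamp prefix sorts lines by time
            ts, ip, method, url = parse_line(line)
            if ip not in last_seen or ts - last_seen[ip] > SESSION_GAP:
                scripts[ip].append([])  # a long pause starts a new script
            scripts[ip][-1].append((method, url))
            last_seen[ip] = ts
        return scripts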
Workload distribution
Write load test scripts: the hard way
Checklist ➢ Test tools for testing
Open-source load tools (54 found), e.g.: JMeter, Gatling, Locust, Grinder, Tsung
Distributed JMeter testing
Load-tool-as-a-service providers:
● BlazeMeter (JMeter)
● Visual Studio Team Services (JMeter)
● Flood IO (JMeter, Gatling, Ruby DSL)
● RedLine13 (JMeter, Gatling, Ruby DSL)
● OctoPerf (JMeter)
Create a grid (Docker containers)
Flood.io grids (JMeter in Docker on EC2)
Create a flood (upload the JMX & data)
Checklist ➢ Run tests, analyze, scale, re-run (a cycle)
Load test tool (flood.io): general test result
Amazon approval is needed for large tests
Flood.io results split by transactions
Checklist ➢ Report to stakeholders
Reports
● Goals & achievements (e.g. 150% of daily RPM was reached)
● Side effects found (DB connection limit reached due to a quick ramp-up)
● Exceptions caught during testing (e.g. ELB lost connections)
● Run-time notes and fixes made by DevOps (EC2 changes during test iterations)
● Observations (CPU usage was the critical resource during the RPM increase)
● Recommendations (EC2: add more VMs; DB: add more shards)
Pitfalls during performance testing
Pitfall 1: assuming the 90th percentile matches production
Pitfall 2: extrapolating to horizontal scale
Pitfall 3: using a small amount of hard-coded data
Pitfall 4: focusing on a single use case
Pitfall 5: running tests from one location
Does the "90 percentile" really work ?
Does the "90 percentile" really work ?
Does the "90 percentile" really work ?
Does the "90 percentile" really work ?
What will be the cost of a performance testing toolset?

Cloud JMeter provider   | Plan type     | Users | Monthly cost | Nodes/Hours | AWS cost
BlazeMeter              | Pro           | 3K    | $499         | 100         | $167.50
Flood.io (shared nodes) | pay as you go | 15K+  | $499         | 100         | $167.50
SOASTA                  | pay as you go | 10K   | $22,500      | undefined   | $0
Questions and Answers
Thank You!