Performance testing in scope of migration to cloud by Serghei Radov
Performance Testing in Scope of Migration to Cloud
Serghei Radov
Current position: Senior Performance Engineer at Lohika
Contacts: [email protected] | GitHub: github.com/grinslife | Skype: serghei.radov
AGENDA
● Cloud computing principles
● Challenges
● Performance testing as part of the migration process
● What toolset could be used?
● How to avoid common pitfalls?
● Does the "90th percentile" really work?
● What will be the cost of the performance testing toolset?
Cloud computing principles
● Multi-tenancy
● Statistical multiplexing
● Horizontal scalability
● Data partitioning
● Consistent hashing (see the sketch below)
● Eventual consistency
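Of these, consistent hashing is the least self-explanatory: it lets cache or storage nodes join and leave while remapping only a small share of keys. A minimal Python sketch of the idea (illustrative only, not from the talk; node names and the virtual-node count are made up):

    import bisect
    import hashlib

    def _hash(key: str) -> int:
        # Map a string to a point on the hash ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, nodes, vnodes=100):
            # vnodes: virtual nodes per physical node, smoothing the balance.
            self.ring = sorted(
                (_hash(f"{node}#{i}"), node)
                for node in nodes
                for i in range(vnodes)
            )
            self.points = [point for point, _ in self.ring]

        def node_for(self, key: str) -> str:
            # Walk clockwise to the first virtual node at or after the key.
            idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
            return self.ring[idx][1]

    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))  # the same key always lands on the same node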
Cloud performance challenges
● Over-provisioning
● Under-provisioning
● ELB network traffic issues
● Availability and reliability
Solutions for effective provisioning
● Predictive auto-scaling
● Scale up early, scale down slowly
● Use time as a proxy (see the sketch below)
● Machine learning
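"Use time as a proxy" can be made concrete: predict the coming window's load from the same time slot in previous weeks, provision headroom above the prediction, and shed capacity gradually. A minimal sketch under those assumptions (the per-instance capacity, headroom factor, and scale-down rate are illustrative, not taken from Scryer or the other engines named below):

    import math
    from statistics import mean

    def predict_rpm(history, weekday, hour):
        # history: {(weekday, hour): [RPM samples from the same slot in past weeks]}
        # The same time slot in prior weeks is the proxy for upcoming load.
        return mean(history[(weekday, hour)])

    def desired_instances(predicted_rpm, current,
                          rpm_per_instance=150, headroom=1.5):
        # Scale up early: provision headroom above the prediction.
        target = math.ceil(predicted_rpm * headroom / rpm_per_instance)
        # Scale down slowly: shed at most one instance per evaluation cycle.
        if target < current:
            target = current - 1
        return max(target, 1)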
Netflix’s Predictive Scaling Engine
Predictive auto-scaling engine tools:
● Scryer
● Elastisys
● AppDynamics
● VMTurbo
● Rancher
Multi-cloud or hybrid cloud
● Use multiple Availability Zones
● Keep zones independent
● Deploy to multiple regions
● Employ solid backup and recovery strategies
Some tips
➢ Define acceptance criteria
➢ Select tools for monitoring and testing
➢ Discuss capacity planning responsibilities
➢ Workload characterization
➢ Test tools for testing
➢ Run tests, analyze, scale, re-run (a cycle)
➢ Report to stakeholders
Define the performance test SLA
● Statefulness
● Response time
● Time-outs
Exceptions that can be included in the SLA:
● Failure
● Network issues
● Denial of service
● Scheduled maintenance
New Relic Response times
NRQL (New Relic Query Language) examples:
SELECT uniqueCount(session) FROM PageView SINCE 1 week ago
SELECT uniqueCount(session) FROM PageView SINCE 1 week ago COMPARE WITH 1 week ago
SELECT count(*) FROM PageView SINCE 1 day ago COMPARE WITH 1 day ago TIMESERIES AUTO
SELECT uniqueCount(uuid) FROM MobileSession FACET osVersion SINCE 7 days ago
Gathering response times
Additional response time metrics
All of these response times are presented as part of the app response time:
- Database response times
- Memcached response time
- WebExternal
- Ruby
- GC calls
New Relic also provides an advanced ability to trace response times across systems using NRQL.
Transaction throughput
- DC and cloud resources are not directly comparable due to differences in hardware configurations.
- The transaction count in the cloud should match or exceed the current production level at the DC, so current users can be served without added latency.
- Target peak load: 1.14K RPM
- Lowest point: 430 RPM
Finding peaks
(charts extracted from New Relic instead of DataDog, for presentation purposes only)
Scenario per server (sketched as data below)
- Ramp up slowly from 430 RPM to 700 RPM over 4 hours
- Run the test for 6 hours
- Ramp up to 1.14K RPM
- Run the test for 11 hours
- Ramp down slowly
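The same scenario can be written down as data for whatever load generator drives the test. A sketch with stage durations and RPM targets taken from the slide; the two ramp durations marked "assumed" are not stated in the source:

    # Stepped load profile: (stage, duration in minutes, target RPM).
    PROFILE = [
        ("ramp 430 -> 700 RPM", 4 * 60,  700),
        ("steady at 700 RPM",   6 * 60,  700),
        ("ramp to peak",        30,      1140),  # ramp duration assumed
        ("steady at 1.14K RPM", 11 * 60, 1140),
        ("ramp down",           30,      0),     # ramp duration assumed
    ]

    total_minutes = sum(minutes for _, minutes, _ in PROFILE)
    print(f"total test duration: {total_minutes / 60:.1f} hours")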
Hardware acceptance levels
- App server CPU usage:
  - should not go above 60% during the 150% peak load
  - hard threshold of 80%
- Memory usage (avg 60%, threshold 80%)
- Network throughput (should correspond to DC levels)
- Auto-scaling groups set to false (initial criterion)
All of these metric values depend on production usage, budget, and the target VM provisioning size. (An automated check is sketched below.)
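A hedged sketch of how these acceptance levels could be checked automatically against CPU and memory samples pulled from monitoring (threshold values are the ones above; the function and its inputs are illustrative):

    def check_acceptance(cpu_samples, mem_samples):
        # cpu_samples / mem_samples: utilisation percentages collected
        # during the 150% peak-load window.
        return {
            "cpu <= 60% at peak": max(cpu_samples) <= 60,
            "cpu under 80% cap":  max(cpu_samples) < 80,
            "mem avg <= 60%":     sum(mem_samples) / len(mem_samples) <= 60,
            "mem under 80% cap":  max(mem_samples) < 80,
        }

    # Example: check_acceptance([42, 55, 59], [50, 58, 62])
    # -> all four checks True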
CPU usage per server (DataDog)
Checklist ➢ Select tools for monitoring and testing
Monitoring targets
● Response times
● Resource utilisation at the SUT
● Resource utilisation at the test tool
● Exceptions
● Workload behaviour
Load test tool (flood.io) shows:
- Response times
- Resource usage
- Caught exceptions
- Real-time workload tracking
Checklist ➢ Discuss capacity planning responsibilities
Select the proper EC2 type for an app:
● General Purpose
● Compute Optimized
● Memory Optimized
● GPU
● Storage Optimized
● Dense-storage Instances

Model      | vCPU | Mem (GiB) | Storage  | Dedicated EBS Bandwidth (Mbps)
c4.large   | 2    | 3.75      | EBS-Only | 500
c4.xlarge  | 4    | 7.5       | EBS-Only | 750
c4.2xlarge | 8    | 15        | EBS-Only | 1,000
c4.4xlarge | 16   | 30        | EBS-Only | 2,000
c4.8xlarge | 36   | 60        | EBS-Only | 4,000
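For illustration, the table can be turned into a small lookup that returns the smallest c4 instance satisfying a vCPU and memory requirement (a sketch using only the figures above; prices and other families are omitted):

    # (model, vCPU, memory in GiB) from the c4 table above, smallest first.
    C4_TYPES = [
        ("c4.large",   2,  3.75),
        ("c4.xlarge",  4,  7.5),
        ("c4.2xlarge", 8,  15.0),
        ("c4.4xlarge", 16, 30.0),
        ("c4.8xlarge", 36, 60.0),
    ]

    def smallest_fit(vcpu_needed, mem_needed_gib):
        # Listed smallest-first, so the first match is the tightest fit.
        for model, vcpu, mem in C4_TYPES:
            if vcpu >= vcpu_needed and mem >= mem_needed_gib:
                return model
        return None  # requirement exceeds the c4 family

    print(smallest_fit(6, 10))  # -> c4.2xlarge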
Checklist ➢ Workload characterization
Workload characterization
- Catch traffic patterns
- Resource utilisation
- Distribution of response times
- Distribution of response sizes
- Characterization of user behaviour
- Analyse input data
- Use a performance analysis toolkit
Traffic patterns
“Keep workload as real as possible.”
Resource utilisation
Characterize user behaviour: investigate user actions with the help of
- New Relic Browser (session + funnel functions)
- Universal Analytics with user behaviour paths
- Mixpanel.com (needs code injection)
- Server logs at NGINX (HTTP requests, REST calls)
- Sumo Logic (Apache access logs)
- Server app logs (HP ALM has QC sense)
- DB activity logs (applied solution)
Write analytical tools (sketched below) that will:
- Parse access / ELB logs
- Unite requests into scripts by timestamp and IP
- Reduce the number of unique scripts
- Restore high-level user actions
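A minimal sketch of such a tool, assuming the classic ELB access-log format (space-separated fields, ISO timestamp first, client address in the third field, the request in quotes) and an assumed 30-minute gap as the session boundary:

    from collections import defaultdict
    from datetime import datetime, timedelta

    SESSION_GAP = timedelta(minutes=30)  # assumed session boundary

    def parse_line(line):
        # timestamp elb client:port backend:port ... "GET http://... HTTP/1.1" ...
        fields = line.split(" ")
        ts = datetime.strptime(fields[0][:19], "%Y-%m-%dT%H:%M:%S")
        client_ip = fields[2].rsplit(":", 1)[0]
        method, url = line.split('"')[1].split(" ")[:2]
        return ts, client_ip, method, url

    def group_into_scripts(lines):
        scripts = defaultdict(list)  # ip -> list of sessions ("scripts")
        last_seen = {}
        for line in sorted(lines):   # the ISO timestamp prefix sorts lines by time
            ts, ip, method, url = parse_line(line)
            if ip not in last_seen or ts - last_seen[ip] > SESSION_GAP:
                scripts[ip].append([])  # a long pause starts a new script
            scripts[ip][-1].append((method, url))
            last_seen[ip] = ts
        return scripts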
Workload distribution
Write load test scripts: the hard way
Checklist ➢ Test tools for testing
Open-source load tools (54 found), e.g.: JMeter, Gatling, Locust, Grinder, Tsung
Distributed JMeter testing
Load-tool-as-a-service providers:
● BlazeMeter (JMeter)
● Visual Studio Team Services (JMeter)
● Flood IO (JMeter, Gatling, Ruby DSL)
● RedLine13 (JMeter, Gatling, Ruby DSL)
● OctoPerf (JMeter)
Create a grid (Docker containers)
Flood.io grids (JMeter in Docker on EC2)
Create a flood (upload the JMX & data)
Checklist ➢ Run tests, analyze, scale, re-run (a cycle)
Load test tool (flood.io): general test result
Amazon approval is needed for large tests
Flood.io results split by transactions
Checklist ➢ Report to stakeholders
Reports
● Goals & achievements (e.g. 150% of daily RPM was reached)
● Side effects found (DB connection limit reached due to a quick ramp-up)
● Exceptions caught during testing (e.g. ELB lost connections)
● Run-time notes and fixes made by DevOps (EC2 changes during test iterations)
● Observations (CPU usage was the critical resource during the RPM increase)
● Recommendations (EC2: add more VMs; DB: add more shards)
Pitfalls during performance testing
Pitfall 1: assuming the 90th percentile matches production
Pitfall 2: extrapolating to horizontal scale
Pitfall 3: using a small amount of hard-coded data
Pitfall 4: focusing on a single use case
Pitfall 5: running tests from one location
Does the "90 percentile" really work ?
Does the "90 percentile" really work ?
Does the "90 percentile" really work ?
Does the "90 percentile" really work ?
What will be the cost of a performance testing toolset?

Cloud JMeter provider   | Plan type     | Users | Monthly cost | Nodes/Hours | AWS cost
BlazeMeter              | Pro           | 3K    | $499         | 100         | $167.50
Flood.io (shared nodes) | pay as you go | 15K+  | $499         | 100         | $167.50
SOASTA                  | pay as you go | 10K   | $22,500      | undefined   | $0
Questions and Answers
Thank You!